CrashCourse Statistics
1
What can statistics do?
Descriptive statistics: describe the spread or central tendency of the data
Inferential statistics: draw conclusions from the data (especially used to test hypotheses)
However, different questions call for different standards when deciding whether two things are related or not.
CAUTION: We still cannot say that a conclusion must be right; there is always some uncertainty in statistics.
2
It is hard for most people to 'feel' really big or really small numbers (a matter of numeracy). So we should think mathematically and use some tools to make better decisions.
3
measures of central tendency
- mean (average)
- median (middle value)
- mode (most frequent value; fine for non-numeric data!)
normal: the data are spread roughly equally on both sides of the middle, and most of the values are close to the middle value
symmetric: mean and median are the same
if not symmetric, then the data is skewed.
distribution: how often each value occurs in the dataset (frequency)
When there is more than one mode in a dataset, you might have included two groups of data in the same dataset, especially in a large dataset.
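A small sketch of the three measures, using Python's standard `statistics` module. The data values are made up for illustration.

```python
import statistics

# Made-up data for illustration only
scores = [2, 3, 3, 4, 5, 5, 5, 13]
colors = ["red", "blue", "red", "green"]

mean_score = statistics.mean(scores)      # sum of values / number of values
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequent value

# The mode also works for non-numeric data
mode_color = statistics.mode(colors)

# multimode returns every most-frequent value, so it can reveal
# a dataset with more than one mode
modes = statistics.multimode([1, 1, 2, 3, 3])

print(mean_score, median_score, mode_score, mode_color, modes)
```

Here the single large value (13) pulls the mean above the median, which is what happens when the data is skewed.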
4
measures of spread
range: largest-smallest
interquartile range: the spread of the middle 50% of the data (Q3-Q1)
variance
to be more precise, sample variance = sum of (data-mean)^2 / (N-1)
standard deviation: square root of variance
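The measures above can be sketched in Python as follows. The data values are made up, and the quartiles come from `statistics.quantiles`, which uses one of several common quartile conventions, so other tools may give slightly different Q1 and Q3.

```python
import statistics

# Made-up data for illustration only
data = [2, 4, 4, 4, 5, 5, 7, 9]

# range: largest - smallest
data_range = max(data) - min(data)

# interquartile range: Q3 - Q1 (quartile conventions vary between tools)
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# sample variance: sum of (data - mean)^2 / (N - 1)
mean = statistics.mean(data)
sample_variance = sum((x - mean) ** 2 for x in data) / (len(data) - 1)

# standard deviation: square root of variance
sample_sd = sample_variance ** 0.5

print(data_range, iqr, sample_variance, sample_sd)
```

The hand-written variance should agree with `statistics.variance(data)`, which also divides by N-1.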
5
data visualization part 1
data: categorical & quantitative
for categorical data:
- bar chart
- pie chart
- pictograph
for quantitative data:
binning: group quantitative data into categories (bins)
- histograms
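A minimal sketch of binning, assuming integer data and a bin width of 10 (both chosen only for illustration). It counts how many values fall in each bin and prints a text histogram.

```python
from collections import Counter

# Made-up integer data and an assumed bin width
ages = [3, 7, 12, 15, 18, 21, 22, 29, 34, 38]
bin_width = 10


def bin_label(value, width):
    """Return the label of the bin that an integer value falls into."""
    start = (value // width) * width
    return f"{start}-{start + width - 1}"


# Count how many values land in each bin
counts = Counter(bin_label(a, bin_width) for a in ages)

# Print one row of '#' marks per bin
for start in range(0, 40, bin_width):
    label = f"{start}-{start + bin_width - 1}"
    print(f"{label:>5} | {'#' * counts[label]}")
```

Changing `bin_width` changes the shape of the histogram, so it is worth trying a few widths before reading too much into the picture.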
6
data visualization part 2
still for quantitative data:
- dot plot (similar to a histogram, but each bar is replaced with a stack of dots)
- stem-and-leaf plot (uses the raw data: each value is split into a stem and a leaf, and the leaf is what is left of the raw value without the stem)
- box-and-whisker plot (fence, min, Q1, median, Q3, max, fence) (data outside the fences are treated as outliers, usually plotted as separate points)
- cumulative frequency plot (the frequency accumulated up to that point)
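A sketch of the numbers behind a box-and-whisker plot and a cumulative frequency plot. It assumes the common convention that the fences sit 1.5 x IQR beyond Q1 and Q3; the notes above do not state which rule is used, and the data is made up.

```python
import statistics
from collections import Counter
from itertools import accumulate

# Made-up data for illustration only
data = sorted([1, 3, 4, 5, 5, 6, 7, 8, 9, 30])

# Quartiles (conventions differ slightly between tools)
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Assumed fence rule: 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]

# The whiskers reach the smallest and largest values inside the fences
whisker_low, whisker_high = min(inside), max(inside)

# Cumulative frequency: running total of the counts up to each value
counts = Counter(data)
values = sorted(counts)
cumulative = list(accumulate(counts[v] for v in values))

print(outliers, whisker_low, whisker_high)
print(list(zip(values, cumulative)))
```

The last cumulative count always equals the number of data points.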
7
distributions: the shape and spread of data
normal distribution:
- mean: the center of distribution
- standard deviation: how thin (small SD) or squished and wide (large SD) the shape is
positive(right)/negative(left) skew
for distributions that have more than one peak, we might have put the data of more than one unimodal distribution together
uniform distribution: each value has the same frequency
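One way to put a number on the direction of skew is the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second). This is a swapped-in illustration, not a formula from the notes, and the data is made up: positive values indicate right skew, negative values left skew, and zero a symmetric sample.

```python
import statistics


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 using central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Made-up samples for illustration only
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]     # long tail to the right
left_skewed = [-x for x in right_skewed]      # mirrored: tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, sample in [("right", right_skewed),
                     ("left", left_skewed),
                     ("symmetric", symmetric)]:
    print(name,
          "mean:", statistics.mean(sample),
          "median:", statistics.median(sample),
          "skewness:", round(skewness(sample), 3))
```

In the right-skewed sample the mean is pulled above the median by the long tail, and in the mirrored sample it is pulled below.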
8
data relationships
clusters in a scatter plot can reveal relationships
regression line: a line that is as close as possible to all of the points at the same time
regression coefficient: the slope of the regression line; a non-zero slope tells us there is a positive or negative relationship between the two variables, but its value alone does not tell us how close the relationship is, because it depends on the units
correlation: scaling by the standard deviations gives r, which ranges from -1 to 1 and shows both direction and closeness (r=0: no linear relationship)
squared correlation: r^2, how well we can predict one variable if we know the other
correlation doesn't equal causation
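A sketch of the quantities above using the usual least-squares formulas. The variable names and data are made up. It also shows that rescaling one variable changes the slope but not r.

```python
import statistics

# Made-up data: hours studied and test scores
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]


def fit(xs, ys):
    """Return (slope, intercept, r) for a least-squares line through (xs, ys)."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx                    # regression coefficient
    intercept = mean_y - slope * mean_x  # line passes through the means
    r = sxy / (sxx * syy) ** 0.5         # correlation, between -1 and 1
    return slope, intercept, r


slope, intercept, r = fit(hours, scores)
r_squared = r ** 2

# The standard deviations scale the slope to r
r_from_slope = slope * statistics.stdev(hours) / statistics.stdev(scores)

# Changing the units of y (here, multiplying by 10) changes the slope but not r
slope_scaled, _, r_scaled = fit(hours, [10 * y for y in scores])

print(slope, intercept, r, r_squared)
print(slope_scaled, r_scaled)
```

A high r here only says that the two made-up variables move together; it says nothing about whether one causes the other.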