What can statistics do?
Descriptive statistics: describe the spread or central tendency of the data
Inferential statistics: draw conclusions from the data (especially used to test hypotheses)
However, different questions call for different standards when deciding whether two things are related.
CAUTION: We still cannot say that a conclusion must be right; there is always some uncertainty in statistics.
It is hard for most people to 'feel' really big or really small numbers (numeracy), so we should think mathematically and use tools to make better decisions.
measures of central tendency
- mode (fine for non-numeric data!)
normal: data are distributed roughly equally on both sides of the middle, and most of the values are close to the middle value
symmetric: mean and median are the same
if not symmetric, then the data is skewed.
distribution: how often each value occurs in the dataset (frequency)
When a dataset has more than one mode, it might actually combine two different groups of data in the same dataset, especially if the dataset is large.
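A minimal sketch of these measures using Python's standard `statistics` module. The sample values are made up for illustration; the single large value (40) pulls the mean above the median, which is what a right-skewed dataset looks like.

```python
import statistics

# Hypothetical sample: one large value (40) drags the mean upward.
values = [2, 3, 3, 4, 5, 6, 40]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # middle value after sorting
mode = statistics.mode(values)      # most frequent value

# The mode also works for non-numeric data.
colors = ["red", "blue", "red", "green"]
color_mode = statistics.mode(colors)

# multimode returns every most-frequent value, which helps spot a dataset
# that may be mixing two groups.
modes = statistics.multimode([1, 1, 2, 5, 5])

print(mean, median, mode, color_mode, modes)
```

Here the mean (9) is much larger than the median (4), so the data is skewed rather than symmetric.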
measures of spread
interquartile range: the spread of the middle 50% of the data
to be more precise, variance = [sum of (data - mean)^2] / (N - 1)
standard deviation: square root of variance
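The formulas above can be sketched in Python as follows. The data is made up, and the quartile method is an assumption: there are several conventions for computing quartiles, so the interquartile range may differ slightly between tools.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

n = len(data)
mean = sum(data) / n

# Sample variance: squared deviations from the mean, summed, divided by N - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)

# Standard deviation: square root of the variance.
std_dev = variance ** 0.5

# Quartiles cut the sorted data into four parts; the IQR is Q3 - Q1.
# method="inclusive" is one of several conventions (an assumption here).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(variance, std_dev, iqr)
```

The hand-computed values should agree with `statistics.variance(data)` and `statistics.stdev(data)`, which also divide by N - 1.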
data visualization part 1
data: categorical & quantitative
for categorical data:
- bar chart
- pie chart
for quantitative data:
binning: group quantitative data into categories (bins)
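A small sketch of binning, assuming made-up ages and a bin width of 10 (both are illustrative choices, as is the helper name `bin_label`). Each value is mapped to a labelled range, and the counts per range are printed as a crude text histogram.

```python
from collections import Counter

ages = [3, 7, 12, 15, 18, 21, 25, 34, 37, 41]  # hypothetical data
bin_width = 10                                  # assumed bin width


def bin_label(value, width):
    """Return the label of the bin that contains value, e.g. '10-19'."""
    start = (value // width) * width
    return f"{start}-{start + width - 1}"


# Count how many values fall into each bin.
counts = Counter(bin_label(a, bin_width) for a in ages)

# Print the bins in numeric order, one '#' per value.
for label in sorted(counts, key=lambda s: int(s.split("-")[0])):
    print(f"{label:>6} | {'#' * counts[label]}")
```

Changing `bin_width` changes the shape of the picture, so it is worth trying a few widths before reading too much into one plot.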
data visualization part 2
still for quantitative data:
- dotplot (similar to a histogram, but each bar is replaced by a stack of dots)
- stem & leaf plot (uses the raw data: each value is split into a stem and a leaf, and the leaves are listed next to their stem)
- box-and-whiskers plot (fence, min, q1, median, q3, max, fence) (data outside the fences are treated as outliers)
- cumulative frequency plot (the frequency accumulated up to each point)
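The numbers behind a box plot and a cumulative frequency plot can be sketched as below. The data is made up, and placing the fences at 1.5 times the interquartile range beyond q1 and q3 is a common convention assumed here, not something stated in these notes.

```python
import statistics
from collections import Counter
from itertools import accumulate

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]  # hypothetical; 20 sits far from the rest

# Quartiles (the "inclusive" method is one of several conventions).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences under the assumed 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Cumulative frequency: running total of counts over the sorted distinct values.
counts = Counter(data)
values = sorted(counts)
cumulative = list(accumulate(counts[v] for v in values))

print(outliers)
for v, c in zip(values, cumulative):
    print(v, c)
```

The last cumulative count always equals the number of data points, which is a quick sanity check on the plot.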
distributions: the shape and spread of data
- mean: the center of distribution
- standard deviation: how thin or squished the shape is
when a distribution has more than one peak, the data of more than one unimodal distribution may have been put together
uniform distribution: each value has the same frequency
clusters in a scatter plot can reveal relationships
regression line: the line that is as close as possible to all the points at the same time
regression coefficient: the slope of the regression line; a non-zero slope tells us there is a positive or negative relationship between two variables (direction), but its value alone does not tell us how closely the points follow the line (closeness)
correlation: scaling by the standard deviations puts the relationship on a fixed range, r in [-1, 1] (r = 0, no linear relation)
squared correlation: r^2, how well we could predict one variable if we know the other
correlation doesn't equal causation
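A minimal sketch of the regression line, r, and r^2 in plain Python, using the usual least-squares formulas. The data points are made up for illustration.

```python
import math

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of products and squares of the deviations from the means.
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

# Least-squares regression line: y = intercept + slope * x.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Correlation: the same co-variation, scaled by the spread of both variables,
# so that it always falls between -1 and 1.
r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

print(slope, intercept, r, r_squared)
```

The slope depends on the units of x and y, while r does not, which is why r is the one to use when judging how closely the points follow the line.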