CrashCourse Statistics

1

What can statistics do?

Descriptive statistics: describe the spread or central tendency of the data

Inferential statistics: draw conclusions from the data (especially used to test hypotheses)

However, different questions call for different standards when deciding whether two things are related or not.

CAUTION: We still cannot say that a conclusion must be right; there is always some uncertainty in statistics.

2

It is hard for most people to 'feel' really big or really small numbers (a limit of our numeracy). So we should think mathematically and use tools to make better decisions.

3

measures of central tendency

mean (average)
median (middle value)
mode (most frequent value; fine for non-numeric data!)
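
As a small illustration of the three measures, here is a sketch using Python's standard `statistics` module. The sample values are made up for this example and do not come from the course.

```python
from statistics import mean, median, multimode

# Hypothetical sample values, chosen only for illustration.
scores = [2, 3, 3, 5, 7, 10]

print(mean(scores))       # arithmetic average
print(median(scores))     # middle value of the sorted data
print(multimode(scores))  # list of the most frequent value(s)

# The mode also works on non-numeric data.
colors = ["red", "blue", "red", "green"]
print(multimode(colors))
```

`multimode` returns a list, so it still gives a sensible answer when several values tie for most frequent.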

normal: the data are distributed roughly equally on both sides of the middle, and most of the values are close to the middle value

symmetric: the mean and median are (roughly) the same

if not symmetric, then the data are skewed.
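
A quick way to see the difference is to compare the mean and median of a symmetric dataset with those of a skewed one. The two lists below are invented for illustration only.

```python
from statistics import mean, median

# Hypothetical datasets, chosen only for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 50]  # one large value stretches the right tail

# For symmetric data the mean and the median coincide.
print(mean(symmetric), median(symmetric))

# For skewed data the mean is pulled toward the long tail,
# while the median stays near the bulk of the values.
print(mean(right_skewed), median(right_skewed))
```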

distribution: how often each value occurs in the dataset (frequency)

When a dataset has more than one mode, it may contain two different groups of data mixed together, especially if the dataset is large.

4

measures of spread

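range: largest value - smallest value

interquartile range (IQR): the spread of the middle 50% of the data (Q3 - Q1)

variance: roughly, the average squared distance of the data from the mean

to be more precise, sample variance = sum of (data - mean)^2 / (N - 1)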

standard deviation: square root of variance
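
The measures of spread above can be sketched in Python as follows. The data are hypothetical, and `quantiles` uses one of several common conventions for locating quartiles, so other tools may report slightly different Q1 and Q3 values.

```python
from statistics import mean, quantiles, stdev, variance

data = [2, 4, 4, 5, 7, 8]  # hypothetical sample

data_range = max(data) - min(data)

# quantiles(n=4) returns the three quartile cut points (Q1, median, Q3).
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Sample variance written out by hand: squared deviations over (N - 1).
m = mean(data)
var_by_hand = sum((x - m) ** 2 for x in data) / (len(data) - 1)

print(data_range, iqr)
print(var_by_hand, variance(data))      # the two should agree
print(var_by_hand ** 0.5, stdev(data))  # standard deviation
```

Because the variance is in squared units, the standard deviation is usually easier to interpret: it is back in the units of the original data.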

5

data visualization part 1

data: categorical & quantitative

for categorical data:

bar chart
pie chart
pictograph

for quantitative data:

binning: group quantitative data into categories (bins)

histograms
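
Binning and a text-only histogram can be sketched like this. The ages and the bin width of 10 are assumptions made for the example; a different width would change the shape of the histogram.

```python
from collections import Counter

ages = [3, 7, 12, 15, 18, 21, 22, 29, 34, 35, 41]  # hypothetical values
bin_width = 10  # assumed width; the choice changes how the histogram looks

# Map each value to the lower edge of its bin, e.g. 15 -> 10 (the 10-19 bin).
counts = Counter((age // bin_width) * bin_width for age in ages)

# Print one row of '#' marks per bin, in increasing order.
for lower in sorted(counts):
    label = f"{lower}-{lower + bin_width - 1}"
    print(f"{label:>5} | {'#' * counts[lower]}")
```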

6

data visualization part 2

still for quantitative data:

dot plot (similar to a histogram, but each bar is replaced with a stack of dots)
stem-and-leaf plot (uses the raw data: each value is split into a stem and a leaf, and the leaves are listed next to their stem)
box-and-whisker plot (fence, min, Q1, median, Q3, max, fence) (data beyond the fences are treated as outliers and plotted as individual points instead of being included in the whiskers)
cumulative frequency plot (the frequency accumulated up to each point)
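
The numbers behind a box plot can be sketched as below. This assumes the common convention of placing the fences 1.5 × IQR beyond the quartiles; the dataset is hypothetical, and quartile conventions vary between tools.

```python
from statistics import quantiles

data = sorted([1, 3, 4, 5, 5, 6, 7, 8, 9, 30])  # hypothetical; 30 sits far out

q1, med, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: fences sit 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are flagged as outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]

# The whiskers extend to the most extreme values still inside the fences.
print("whiskers:", min(inside), max(inside))
print("box:", q1, med, q3)
print("outliers:", outliers)
```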

7

distributions: the shape and spread of data

normal distribution:

mean: the center of distribution
standard deviation: how thin or squished the shape is

positive (right) / negative (left) skew: the long tail points in that direction

if a distribution has more than one peak, the data may be a mix of more than one unimodal distribution

uniform distribution: each value has the same frequency

8

data relationships

clusters in a scatter plot can reveal relationships

regression line: the line that is as close as possible to all of the points at the same time

regression coefficient: the slope of the regression line. A non-zero slope tells us there is a positive or negative relationship between two variables, but its size depends on the units of the variables, so it gives the direction of the relationship and not how closely the points follow the line

correlation (r): dividing by the standard deviations of both variables scales the relationship to a unit-free value r in [-1, 1], which shows both direction and closeness (r = 0 means no linear relationship)

squared correlation: r^2, between 0 and 1; how well we could predict one variable if we know the other
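
The slope, r, and r^2 can be computed by hand as in the sketch below. The helper `slope_and_r` and the data are hypothetical, written only for this illustration. Changing the units of x changes the slope but not r.

```python
from statistics import mean, stdev


def slope_and_r(x, y):
    """Return (regression slope, correlation r) for paired samples."""
    mx, my = mean(x), mean(y)
    # Sample covariance: summed products of deviations over (N - 1).
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = cov / stdev(x) ** 2        # in units of y per unit of x
    r = cov / (stdev(x) * stdev(y))    # unit-free, between -1 and 1
    return slope, r


# Hypothetical paired observations.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

slope, r = slope_and_r(hours, score)
print(slope, r, r ** 2)

# Re-expressing x in minutes changes the slope but leaves r untouched.
minutes = [h * 60 for h in hours]
slope_min, r_min = slope_and_r(minutes, score)
print(slope_min, r_min)
```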

correlation doesn't equal causation