CrashCourse Statistics


What can statistics do?

Descriptive statistics: describe the spread or central tendency of the data

Inferential statistics: draw conclusions from the data (especially used to test hypotheses)

However, different questions call for different standards when deciding whether two things are related.

CAUTION: We still cannot say that a conclusion must be right; there is always some uncertainty in statistics.


It is hard for most people to 'feel' really big or really small numbers (numeracy). So we should think mathematically and use tools to make better decisions.


measures of central tendency

  • mean (average)
  • median (middle value)
  • mode (most frequent value; fine for non-numeric data!)
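As a minimal sketch, the three measures can be computed with Python's standard `statistics` module. The sample lists `scores` and `colors` are made up for illustration only.

```python
from statistics import mean, median, mode

# Hypothetical sample data, chosen only for illustration.
scores = [2, 3, 3, 5, 7, 10]
colors = ["red", "blue", "red", "green"]

avg = mean(scores)          # sum of the values divided by how many there are
middle = median(scores)     # middle value (average of the two middle values here)
most_common = mode(scores)  # value that occurs most often
top_color = mode(colors)    # mode also works for non-numeric data

print(avg, middle, most_common, top_color)
```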

normal: the data are spread roughly equally on both sides of the middle, and most of the values are close to the middle value

symmetric: mean and median are the same

if not symmetric, then the data is skewed.
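A small sketch of how skew shows up when comparing the mean and the median. Both lists are invented examples; the second one has a single large value that stretches the right tail.

```python
from statistics import mean, median

# Hypothetical data: identical except for the last value.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 50]  # one large value pulls the mean to the right

sym_mean, sym_median = mean(symmetric), median(symmetric)
skew_mean, skew_median = mean(right_skewed), median(right_skewed)

# In the symmetric list the two measures agree; in the skewed list the mean
# moves toward the tail while the median stays put.
print(sym_mean, sym_median)
print(skew_mean, skew_median)
```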

distribution: how often each value occurs in the dataset (frequency)

When a dataset has more than one mode, it may actually contain two different groups of data mixed together, especially if the dataset is large.
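To illustrate, `statistics.multimode` (Python 3.8+) returns every value tied for most frequent. The `heights` list is a made-up example in which two groups have been combined into one dataset.

```python
from statistics import multimode

# Hypothetical heights in cm: two groups merged into one list.
heights = [150, 152, 152, 152, 155, 170, 172, 172, 172, 175]

# Two values tie for most frequent, hinting at two underlying groups.
peaks = multimode(heights)
print(peaks)
```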


measures of spread

range: largest value - smallest value

interquartile range (IQR): the spread of the middle 50% of the data (Q3 - Q1)


to be more precise, sample variance = [sum of (value - mean)^2] / (N - 1), where N is the number of data points

standard deviation: square root of variance
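The measures of spread above can be sketched as follows. The `data` list is invented, and the quartiles come from `statistics.quantiles` with its default method; other tools use slightly different quartile conventions, so the IQR may differ a little elsewhere.

```python
import math
from statistics import quantiles, stdev, variance

# Hypothetical sample data.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)

# Quartiles split the sorted data into four parts; IQR covers the middle 50%.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Sample variance written out by hand: squared deviations over (N - 1).
m = sum(data) / len(data)
manual_variance = sum((x - m) ** 2 for x in data) / (len(data) - 1)
manual_stdev = math.sqrt(manual_variance)

# The library functions use the same (N - 1) definition.
print(data_range, iqr)
print(manual_variance, variance(data))
print(manual_stdev, stdev(data))
```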


data visualization part 1

data: categorical & quantitative

for categorical data:

  • bar chart
  • pie chart
  • pictograph

for quantitative data:

binning: group quantitative data into categories (bins)

  • histograms
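A rough sketch of binning followed by a text histogram. The `ages` list and the bin width of 10 are arbitrary choices made for this example.

```python
from collections import Counter

# Hypothetical ages; the bin width is an arbitrary choice.
ages = [3, 7, 12, 15, 18, 21, 22, 29, 34, 35, 41]
bin_width = 10

# Map each value to the lower edge of its bin, e.g. 15 -> 10 (the bin 10-19).
bin_counts = Counter((age // bin_width) * bin_width for age in ages)

# Draw one row per bin, with one '#' per data point in that bin.
for lower in sorted(bin_counts):
    upper = lower + bin_width - 1
    print(f"{lower:>2}-{upper:<2} {'#' * bin_counts[lower]}")
```

Changing `bin_width` changes the shape of the histogram, so it is worth trying a few widths before reading too much into the picture.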


data visualization part 2

still for quantitative data:

  • dot plot (similar to a histogram, with each bar replaced by a stack of dots, one per data point)
  • stem-and-leaf plot (uses the raw data: each value is split into a stem, the leading digits, and a leaf, the remaining digit, and leaves are listed next to their stem)
  • box-and-whiskers plot (fence, min, Q1, median, Q3, max, fence) (data outside the fences are treated as outliers and plotted as individual points)
  • cumulative frequency plot (the frequency accumulated up to each point)
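The numbers behind a box-and-whiskers plot can be sketched like this. The notes do not say where the fences go; placing them 1.5 × IQR beyond Q1 and Q3 is a common convention and is an assumption here, as is the invented `data` list.

```python
from statistics import quantiles

# Hypothetical data with one unusually large value.
data = [1, 3, 4, 5, 5, 6, 7, 8, 9, 30]

q1, med, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Assumption: fences at 1.5 * IQR beyond the quartiles (a common convention).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are flagged as outliers; the whiskers stop at the
# most extreme points that are still inside the fences.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, med, q3)
print(lower_fence, upper_fence)
print(outliers, whisker_low, whisker_high)
```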


distributions: the shape and spread of data

normal distribution:

  • mean: the center of distribution
  • standard deviation: how thin or squished the shape is

positive(right)/negative(left) skew

a distribution with more than one peak may be a mix of data from more than one unimodal distribution

uniform distribution: each value has the same frequency


data relationships

clusters in a scatter plot can reveal relationships

regression line: the line that is as close as possible to all of the points at the same time

regression coefficient: the slope of the regression line; a non-zero slope tells us there is a positive or negative relationship between two variables, but its size depends on the units, so it gives the direction and not how closely the points follow the line

correlation: dividing by the standard deviations scales the relationship to r, which lies in [-1, 1] and gives both direction and closeness (r = 0 means no linear relationship)

squared correlation: r^2, how well we can predict one variable if we know the other (the proportion of its variance that is explained)
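A minimal sketch of the slope, r, and r^2 for a tiny invented dataset, assuming the usual least-squares regression line. The names `hours` and `scores` are hypothetical and exist only for this example.

```python
import math

# Hypothetical paired data: hours studied and quiz score.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Sums of products and squares of the deviations from the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
sxx = sum((x - mean_x) ** 2 for x in hours)
syy = sum((y - mean_y) ** 2 for y in scores)

# Least-squares regression line: y = intercept + slope * x.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Correlation: the same co-variation, rescaled by both spreads into [-1, 1].
r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

print(slope, intercept)
print(r, r_squared)
```

Multiplying every value in `scores` by 10 would multiply the slope by 10 but leave r unchanged, which is why r is the better measure of closeness.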

correlation doesn't equal causation