CrashCourse Statistics

1

What can statistics do?

Descriptive statistics: describe the spread or central tendency of the data

Inferential statistics: draw conclusions from the data (especially used to test hypotheses)

However, different questions call for different standards when deciding whether two things are related.

CAUTION: We still cannot say that a conclusion must be right; there is always some uncertainty in statistics.

2

It is hard for most people to 'feel' really big or really small numbers (numeracy), so we should think mathematically and use tools to make better decisions.

3

measures of central tendency

  • mean (the average)
  • median (the middle value)
  • mode (the most frequent value; fine for non-numeric data!)
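
As a minimal sketch, the three measures can be computed with Python's standard `statistics` module. The `scores` and `colors` lists are made-up values for illustration only.

```python
from statistics import mean, median, mode

# Hypothetical sample data, made up for illustration.
scores = [2, 3, 3, 5, 7, 8, 14]
colors = ["red", "blue", "red", "green"]

print(mean(scores))    # 6   (sum of the values divided by their count)
print(median(scores))  # 5   (middle value of the sorted data)
print(mode(scores))    # 3   (most frequent value)
print(mode(colors))    # red (mode also works on non-numeric data)
```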

normal: the data are distributed roughly equally on both sides of the middle, and most of the values are close to the middle value

symmetric: mean and median are the same

if not symmetric, then the data is skewed.
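
A small sketch of how skew separates the mean from the median. Both lists are hypothetical; the only point is that one extreme value pulls the mean much more than the median.

```python
from statistics import mean, median

# Hypothetical data, made up for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 50]  # one large value stretches the right tail

# Symmetric data: mean and median agree.
print(mean(symmetric), median(symmetric))

# Skewed data: the mean is pulled toward the long tail, the median is not.
print(mean(right_skewed), median(right_skewed))
```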

distribution: how often each value occurs in the dataset (frequency)

When a dataset has more than one mode, it may actually contain two different groups of data mixed together, especially if the dataset is large.
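
To illustrate, here is a sketch that mixes two hypothetical groups into one dataset and lists every mode with `statistics.multimode` (available in Python 3.8+). The group names and values are assumptions made for the example.

```python
from statistics import multimode

# Two hypothetical groups, each with its own most common value.
group_a = [150, 152, 152, 152, 155]
group_b = [178, 180, 180, 180, 183]

combined = group_a + group_b

# multimode returns all values tied for most frequent,
# in the order they first appear in the data.
modes = multimode(combined)
print(modes)  # [152, 180]
```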

4

measures of spread

range: largest value - smallest value

interquartile range: the spread of the middle 50% of the data

variance

to be more precise, variance = (sum of (value - mean)^2) / (N - 1), where N is the number of data points

standard deviation: square root of variance
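
The measures of spread above can be sketched in Python as follows. The `data` list is hypothetical, and the quartiles come from `statistics.quantiles` with its default method; other quartile conventions give slightly different values.

```python
from math import isclose, sqrt
from statistics import mean, quantiles, stdev, variance

# Hypothetical sample data, made up for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Interquartile range: Q3 - Q1, the spread of the middle 50%.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Variance by the formula in the notes: squared distances from the
# mean, summed and divided by N - 1.
m = mean(data)
manual_variance = sum((x - m) ** 2 for x in data) / (len(data) - 1)

# Standard deviation: square root of the variance.
manual_sd = sqrt(manual_variance)

print(data_range, iqr, manual_variance, manual_sd)

# The library functions use the same N - 1 definition.
print(isclose(manual_variance, variance(data)), isclose(manual_sd, stdev(data)))
```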

5

data visualization part 1

data: categorical & quantitative

for categorical data:

  • bar chart
  • pie chart
  • pictograph

for quantitative data:

binning: group quantitative data into categories (bins)

  • histograms
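
A rough sketch of binning, followed by a text-only histogram. The `ages` values and the bin width of 10 are arbitrary choices made for illustration.

```python
from collections import Counter

# Hypothetical quantitative data, made up for illustration.
ages = [3, 7, 12, 15, 18, 21, 22, 29, 34, 35]
bin_width = 10  # assumed bin size

# Map each value to the lower edge of its bin, e.g. 15 -> 10, 29 -> 20,
# then count how many values fall in each bin.
counts = Counter((age // bin_width) * bin_width for age in ages)

# Print one row per bin, with one '#' per value in that bin.
for lower in sorted(counts):
    upper = lower + bin_width - 1
    print(f"{lower:>2}-{upper:<2} | {'#' * counts[lower]}")
```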

6

data visualization part 2

still for quantitative data:

  • dot plot (similar to a histogram, with each bar replaced by a stack of dots)
  • stem-and-leaf plot (uses the raw data: each value is split into a stem and a leaf, and the leaves are listed next to their stem)
  • box-and-whisker plot (fence, min, Q1, median, Q3, max, fence; data outside the fences are treated as outliers)
  • cumulative frequency plot (the frequency accumulated up to each point)
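
The notes do not say where the fences of a box-and-whisker plot sit. The sketch below assumes the common convention of placing them 1.5 × IQR below Q1 and above Q3. The `data` list is hypothetical.

```python
from statistics import quantiles

# Hypothetical data with one very low and one very high value.
data = [1, 12, 13, 14, 15, 16, 17, 18, 40]

q1, med, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond the fences are flagged as outliers; the whiskers
# extend to the smallest and largest values inside the fences.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]

print("box:", q1, med, q3)
print("whiskers:", min(inside), max(inside))
print("outliers:", outliers)
```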

7

distributions: the shape and spread of data

normal distribution:

  • mean: the center of distribution
  • standard deviation: how thin or squished the shape is

positive (right) / negative (left) skew

a distribution with more than one peak may result from combining data from more than one unimodal distribution

uniform distribution: each value has the same frequency

8

data relationships

clusters in a scatter plot can reveal relationships

regression line: the line that is as close as possible to all the points at the same time

regression coefficient: the slope of the regression line. A non-zero slope tells us there is a positive or negative relationship between two variables, but its value alone does not tell us how closely they are related (correlation gives both direction & closeness)

correlation coefficient (r): scaling by the standard deviations puts the correlation in the range -1 to 1 (r = 0 means no linear relationship)

squared correlation: r^2, how well we can predict one variable if we know the other
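
A minimal sketch of these quantities, computed by hand with least squares. The helper `fit_line` and the `x`/`y` values are hypothetical. Rescaling `y` by 100 shows why the slope's value alone says little about closeness: the slope changes with the units, while r does not.

```python
from math import isclose, sqrt

def fit_line(xs, ys):
    """Return (slope, intercept, r) for a least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    slope = sxy / sxx                  # regression coefficient
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)          # correlation, always between -1 and 1
    return slope, intercept, r

# Hypothetical paired data, made up for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept, r = fit_line(x, y)
r_squared = r ** 2
print(slope, intercept, r, r_squared)

# Same data with y in different units: slope changes, r stays the same.
scaled_slope, _, scaled_r = fit_line(x, [100 * v for v in y])
print(scaled_slope, scaled_r)
```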

correlation doesn't equal causation