# CrashCourse Statistics

## 1

What can statistics do?

Descriptive statistics: describe the spread or central tendency of the data

Inferential statistics: draw conclusions from the data (especially used to test hypotheses)

However, different questions call for different standards when deciding whether two things are related or not.

CAUTION: We still cannot say that a conclusion must be right; there is always some uncertainty in statistics.

## 2

It is hard for most people to 'feel' really big or really small numbers (a matter of numeracy). So we should think mathematically and use some tools to make better decisions.

## 3

**measures of central tendency**

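- mean: the average (sum of the values divided by how many there are)
- median: the middle value of the sorted data
- mode: the most frequent value (fine for non-numeric data!)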
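As a quick illustration of the three measures, here is a minimal sketch using Python's standard `statistics` module. The `pets` and `colors` lists are made-up data, not from the course.

```python
import statistics

# Hypothetical data: number of pets per household
pets = [0, 1, 1, 2, 2, 2, 3, 9]

mean = statistics.mean(pets)      # sum of values / number of values
median = statistics.median(pets)  # middle value of the sorted data
mode = statistics.mode(pets)      # most frequent value

# The mode also works for non-numeric (categorical) data
colors = ["red", "blue", "red", "green"]
favorite = statistics.mode(colors)

print(mean, median, mode, favorite)
```

Note how the single large value (9) pulls the mean above the median.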

normal: the data is distributed roughly equally on both sides of the middle, and most of the values are close to the middle value

symmetric: the mean and median are (roughly) the same

if not symmetric, then the data is skewed.

distribution: how often each value occurs in the dataset (frequency)

When a dataset has more than one mode, it may contain two different groups of data mixed together, especially if the dataset is large.

## 4

**measures of spread**

range: largest value - smallest value

interquartile range (IQR): the spread of the middle 50% of the data (Q3 - Q1)
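A small sketch of both measures, assuming made-up data. Quartile conventions differ between textbooks and tools; this uses the default ("exclusive") method of `statistics.quantiles`, so other software may give slightly different quartiles for the same data.

```python
import statistics

# Hypothetical data
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Range: distance between the largest and smallest value
data_range = max(data) - min(data)

# n=4 splits the data into quarters and returns the three cut points
q1, q2, q3 = statistics.quantiles(data, n=4)

# IQR: the width of the middle 50% of the data
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```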

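variance: roughly the average squared distance of the data from the mean

to be more precise, for a sample of N values, variance = (sum of (value - mean)^2) / (N - 1)

standard deviation: the square root of the variance (so it is in the same units as the data)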
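The formula can be checked step by step. This is a minimal sketch on made-up data that computes the sample variance by hand (dividing by N - 1) and compares it with the standard library's `statistics.variance` and `statistics.stdev`.

```python
import math
import statistics

# Hypothetical data
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)

# Squared distance of each value from the mean
squared_deviations = [(x - mean) ** 2 for x in data]

# Sample variance: divide by N - 1, not N
sample_variance = sum(squared_deviations) / (len(data) - 1)

# Standard deviation: square root of the variance
sample_sd = math.sqrt(sample_variance)

print(sample_variance, statistics.variance(data))
print(sample_sd, statistics.stdev(data))
```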

## 5

**data visualization part 1**

data: categorical & quantitative

for categorical data:

- bar chart
- pie chart
- pictograph

for quantitative data:

binning: group quantitative data into ranges (bins) so that it can be counted like categories

- histograms
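To make binning concrete, here is a rough sketch that counts made-up heights into bins and prints a text histogram. The bin width of 10 is an arbitrary choice for illustration; a different width can change how the histogram looks.

```python
from collections import Counter

# Hypothetical data: heights in cm
heights_cm = [152, 158, 161, 163, 167, 168, 171, 174, 179, 185]
bin_width = 10  # assumed bin width

# Map each value to the lower edge of its bin, e.g. 163 -> 160
counts = Counter((h // bin_width) * bin_width for h in heights_cm)

# One row per bin, one '#' per value in that bin
for lower in sorted(counts):
    print(f"{lower}-{lower + bin_width - 1}: {'#' * counts[lower]}")
```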

## 6

**data visualization part 2**

still for quantitative data:

- dot plot (similar to a histogram, but each bar is replaced with a stack of dots)
- stem-and-leaf plot (uses the raw data: each value is split into a stem and a leaf, and the leaves are listed next to their stem)
- box-and-whisker plot (fence, min, Q1, median, Q3, max, fence) (data outside the fences are treated as outliers and usually drawn as individual points)
- cumulative frequency plot (the total frequency accumulated up to each point)
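The pieces of a box-and-whisker plot can be computed directly. This sketch assumes the common convention of placing the fences 1.5 × IQR beyond Q1 and Q3, and uses made-up data; quartile values depend on the method used (here the default of `statistics.quantiles`).

```python
import statistics

# Hypothetical data with one unusually large value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 40]

q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fences: assumed 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]

# Whiskers extend to the most extreme values still inside the fences
whisker_low, whisker_high = min(inside), max(inside)

print(q1, median, q3, lower_fence, upper_fence, outliers)
```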

## 7

**distributions: the shape and spread of data**

normal distribution:

- mean: the center of distribution
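- standard deviation: how wide the shape is (small: tall and thin; large: flat and spread out)

positive (right) skew: the long tail is on the right; negative (left) skew: the long tail is on the left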
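One rough way to spot skew is to compare the mean with the median, since extreme values in the tail pull the mean toward them. This is only a heuristic sketch on made-up data, not a formal skewness measure.

```python
import statistics

# Hypothetical incomes (in thousands), with one very large value
incomes = [20, 22, 25, 27, 30, 32, 35, 200]

mean = statistics.mean(incomes)
median = statistics.median(incomes)

# Heuristic: the mean is dragged toward the long tail
if mean > median:
    skew_hint = "right"
elif mean < median:
    skew_hint = "left"
else:
    skew_hint = "none"

print(mean, median, skew_hint)
```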

for a distribution with more than one peak (multimodal), the data may be a mix of more than one unimodal distribution

uniform distribution: each value has the same frequency

## 8

**data relationships**

clusters in a scatter plot can reveal relationships

regression line: the line that is as close as possible to all the points at the same time

regression coefficient: the slope of the regression line. A non-zero slope indicates a positive or negative relationship between the two variables, but its value depends on the units, so it does not tell us how closely the points follow the line

correlation: dividing by the standard deviations scales the relationship to r, between -1 and 1 (r = 0: no linear relationship). The sign gives the direction and the magnitude gives the closeness

squared correlation: r^2, how well we can predict one variable if we know the other (the proportion of variance explained)
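A minimal sketch of how the slope, r, and r^2 relate, computed by hand for a least-squares line on made-up data. The variable names (`hours`, `scores`) are hypothetical.

```python
import math

# Hypothetical data: hours studied vs. test score
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Sums of squared / cross deviations from the means
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
sxx = sum((x - mean_x) ** 2 for x in hours)
syy = sum((y - mean_y) ** 2 for y in scores)

# Regression coefficient (slope) and intercept: depend on the units
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Correlation: unit-free, always between -1 and 1
r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

print(slope, intercept, r, r_squared)
```

Rescaling `scores` (say, to a 0-1 scale) would change the slope but leave r and r^2 unchanged, which is why the correlation is the better measure of closeness.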

correlation doesn't equal causation