2020-6-7 #Introduction to Data Science
1. Defining Data Science
“The application of data-centric, computational, and inferential thinking to understand the world and solve problems.” – Joseph Gonzalez, University of California, Berkeley
The impact of Data Science: customer service/navigation/recommendations/voice-to-text/image processing/fraud detection/robotics, and more.
Data Scientists: make sure that value is extracted from data smoothly. Optimize data processing. Define metrics. Establish collection methods. Work with enterprise systems.
Data Engineers: make sure data flows smoothly between the source (where the data is collected) and the destination (where the data is extracted and processed). Optimize data flow.
2. Data Science Life Cycle

3. Data Design: the process of data collection (as little bias as possible)
Probability sampling: SRS (simple random sampling) - sampling at random without replacement. Cluster sampling - dividing data into clusters and using SRS to select whole clusters. (Pros: easy sample collection/useful for surveys. Cons: produces more variation/requires larger samples.) Stratified sampling - dividing data into strata and taking an SRS within each stratum.
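A minimal sketch of the three sampling schemes above, using pandas and numpy; the DataFrame, the "city"/"score" columns, and the sample sizes are all invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
population = pd.DataFrame({
    "city": rng.choice(["A", "B", "C", "D"], size=1000),
    "score": rng.normal(70, 10, size=1000),
})

# Simple random sample (SRS): sample at random without replacement.
srs = population.sample(n=100, replace=False, random_state=42)

# Cluster sample: treat each city as a cluster, pick clusters by SRS,
# then keep every record in the chosen clusters.
chosen_clusters = rng.choice(population["city"].unique(), size=2, replace=False)
cluster_sample = population[population["city"].isin(chosen_clusters)]

# Stratified sample: take an SRS within each stratum (city).
stratified = population.groupby("city", group_keys=False).apply(
    lambda stratum: stratum.sample(n=25, random_state=42)
)

print(len(srs), len(cluster_sample), len(stratified))
```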
4. Computational tools

5. Tabular data
Tabular data is arranged in rows and columns, and data files are stored in specific formats (e.g., CSV files, comma-separated values: one record per line, with each field separated by a comma). Typical tasks: reading tabular data, gathering insights, answering specific questions.
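A minimal sketch of reading a CSV file with pandas; the file name "survey.csv" and its columns are hypothetical.

```python
import pandas as pd

# Each line of the CSV is one record; fields are comma-separated.
df = pd.read_csv("survey.csv")

print(df.shape)    # number of rows and columns
print(df.dtypes)   # inferred type of each column
print(df.head())   # first few records, a quick sanity check
```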
6. Exploratory data analysis(EDA)
An approach to analyzing data sets to summarize their main characteristics. When conducting EDA: examine data types, examine key properties, and avoid making assumptions.
Three major statistical data types: 1. Nominal data (data with no inherent order). 2. Ordinal data (data with ordered categories). 3. Numerical data (data that consists of amounts or quantities).
The type of data you have determines what you can do with that data.
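A small sketch of how these three data types might be represented in pandas; the column names and values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "blood_type": ["A", "O", "B"],         # nominal: no inherent order
    "shirt_size": ["S", "L", "M"],         # ordinal: ordered categories
    "height_cm":  [162.0, 180.5, 171.2],   # numerical: amounts/quantities
})

# Nominal data: unordered categorical.
df["blood_type"] = pd.Categorical(df["blood_type"])

# Ordinal data: categorical with an explicit order, which allows sorting/comparisons.
df["shirt_size"] = pd.Categorical(df["shirt_size"],
                                  categories=["S", "M", "L"], ordered=True)

print(df.dtypes)
print(df.sort_values("shirt_size"))
```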
Properties of data:
Granularity: What does each record in your data represent? How fine or coarse is the data? This determines what kind of analysis you can perform; finer-grained data is generally safer.
Scope: What does the data describe? Does the data cover the topic you're interested in?
Temporality: When was the data collected? Pay attention to the various formats in which times and dates can be recorded.
Faithfulness: How accurately does the data describe the real world? Should you trust this data?
7. Data cleaning
Detecting and fixing corrupt or inaccurate records in a record set. Common issues: missing values/formatting/structure/complex values/unit conversion/interpretation of magnitudes/misspellings/duplicated rows/inconsistent formats/unspecified units, and so on. Questions to ask before cleaning: Are there missing values in the dataset? / Are there missing values that were filled in? / Which parts of the data were entered by humans?
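A minimal sketch of a few of these cleaning checks with pandas; the DataFrame and its columns are made up, and which fixes are appropriate always depends on the actual dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [23, np.nan, 35, 35, -1],          # -1 may be a filled-in missing value
    "city":   ["NYC", "nyc", "LA", "LA", "SF"],  # inconsistent formatting
    "income": ["52,000", "61000", "48,500", "48,500", "70000"],  # mixed formats
})

print(df.isna().sum())        # are there missing values?
print((df["age"] < 0).sum())  # suspicious sentinel values that were filled in?

df["city"] = df["city"].str.upper()                              # normalize inconsistent formats
df["income"] = df["income"].str.replace(",", "").astype(float)   # unify number formats
df = df.drop_duplicates()                                        # remove duplicated rows
```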
8. Data visualization
A good data visualization can convey trends and anomalies in the data more efficiently than a written description, and it is a great way to communicate your predictions and conclusions to other people. Use computational tools to create data visualizations.
Visualize qualitative data, also known as categorical data; its subtypes are nominal data (no inherent order) and ordinal data (falls into ordered categories).
Visualize quantitative (numerical) data. Two common charts: histograms and scatter plots.
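A minimal sketch of those two chart types with matplotlib; the height/weight data are randomly generated for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
heights = rng.normal(170, 8, size=500)                    # one quantitative variable
weights = 0.9 * heights - 90 + rng.normal(0, 5, size=500) # a related quantitative variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.hist(heights, bins=20)          # histogram: distribution of one variable
ax1.set_xlabel("height (cm)")
ax1.set_ylabel("count")

ax2.scatter(heights, weights, s=10) # scatter plot: relationship between two variables
ax2.set_xlabel("height (cm)")
ax2.set_ylabel("weight (kg)")

plt.tight_layout()
plt.show()
```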
9. Statistical inference
The process of using data analysis to deduce properties of an underlying probability distribution. Examples: election forecasting/test score predictions and more. Methods: hypothesis tests/confidence intervals.
9.1 Design a hypothesis test:
Null hypothesis: states that there is no association between the variables, and usually attributes trends observed in the sample to random chance.
Alternative hypothesis: attributes trends observed in the data to associations between variables.
The goal of a hypothesis test is to decide between the null hypothesis and the alternative hypothesis.
Test statistic: a value computed from the data that helps decide between the two hypotheses. Which values favor which hypothesis depends on how the statistic is defined; here, smaller values indicate that the alternative hypothesis is better supported, while larger values indicate that the null hypothesis is better supported.
9.2 Conduct a specific type of hypothesis test: the permutation test
Permutation test: randomly permute, or rearrange, the data (for example, shuffle the group labels) to simulate what the data would look like under the null hypothesis, and recompute the test statistic for each permutation.
P-value: the probability, assuming the null hypothesis is correct, of obtaining results at least as extreme as the observed results of the hypothesis test. A smaller p-value means stronger evidence in favor of the alternative hypothesis.
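A minimal sketch of a permutation test with numpy, using the difference in group means as the test statistic (in this sketch, larger absolute differences favor the alternative); the two groups and their sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(10.0, 2.0, size=50)
group_b = rng.normal(10.8, 2.0, size=50)

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])

n_permutations = 10_000
simulated = np.empty(n_permutations)
for i in range(n_permutations):
    shuffled = rng.permutation(pooled)  # randomly rearrange the data under the null
    simulated[i] = shuffled[50:].mean() - shuffled[:50].mean()

# p-value: fraction of simulated statistics at least as extreme as the observed one.
p_value = np.mean(np.abs(simulated) >= np.abs(observed))
print(observed, p_value)
```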
9.3 Bootstrap a confidence interval. (A confidence interval is an interval estimate of a population parameter constructed from sample statistics; it expresses how much confidence can be placed in the estimate of that parameter.)
Oftentimes a data scientist needs to estimate an unknown parameter of a population, but cannot take more samples due to time and cost. In this situation, use a process called the bootstrap.
Bootstrap: simulates new random samples from the population by resampling from your original sample. Resampling consists of sampling at random with replacement from your original sample, many times. Each time you resample, you compute an estimate of the unknown parameter. Collect all the estimates computed along the way and use them to create a confidence interval; the unknown parameter lies within that interval at the chosen confidence level.
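A minimal sketch of the bootstrap with numpy, estimating a population mean; the original sample is randomly generated, and the 95% level is just an example choice.

```python
import numpy as np

rng = np.random.default_rng(1)
original_sample = rng.normal(50, 12, size=200)   # the only sample we have

n_resamples = 10_000
estimates = np.empty(n_resamples)
for i in range(n_resamples):
    # Resample with replacement from the original sample, same size as the original.
    resample = rng.choice(original_sample, size=len(original_sample), replace=True)
    estimates[i] = resample.mean()               # estimate of the unknown parameter

# An approximate 95% confidence interval for the population mean.
lower, upper = np.percentile(estimates, [2.5, 97.5])
print(lower, upper)
```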
10. Classification
Classification is an important machine learning technique: the process of making categorical predictions using data. You have some data already in the correct categories, and you want to learn from this data.
Key terms used in classification: Observation (a situation where you want to make a prediction)/Attributes (known aspects that describe the observation)/Class (the category the observation belongs to, which is not known in advance). Classification predicts the class using the attributes.
Training data: contains observations that have already been classified. Analyzing the training data builds a classifier, an algorithm for classifying future observations.
There are many different methods to approach classification.
K-nearest neighbors algorithm (k-NN): in pattern recognition, the k-nearest neighbors algorithm is a non-parametric method used for classification and regression. k-NN is one of the simplest classification methods in data mining. Its core idea: if the majority of a sample's k nearest neighbors in feature space belong to a certain category, then that sample also belongs to that category and shares the characteristics of the samples in it. When making its decision, k-NN depends only on a small number of neighboring samples.
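A minimal sketch of k-NN classification with scikit-learn; the synthetic two-class data and the choice of k = 5 are just for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two attributes per observation; two classes centered at different points.
class0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
class1 = rng.normal(loc=[2.5, 2.5], scale=1.0, size=(100, 2))
X = np.vstack([class0, class1])
y = np.array([0] * 100 + [1] * 100)

# Training data: observations whose classes are already known.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The classifier predicts the class of a new observation by majority vote
# among its k nearest neighbors in feature space.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))   # accuracy on held-out observations
print(knn.predict([[1.0, 1.0]]))   # predicted class for one new observation
```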