
Think Stats Notes 3 (Chapter 4)

会游泳的咸鱼 2012-12-14 00:05:31

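Chapter 4, Continuous Distributions. The distributions covered so far are empirical distributions, because they are built from observed values, and their CDFs are step functions. The alternative is a continuous distribution, whose CDF is a continuous function. Many real-world phenomena can be approximated by a particular continuous distribution.
1）The Exponential Distribution
In probability theory and statistics, the exponential distribution is a continuous probability distribution. It can model the time intervals between independent random events, such as the gaps between passengers arriving at an airport or between new articles appearing on Chinese Wikipedia. Its cumulative distribution function is:
CDF(x) = 1 − e^(−λx), for x ≥ 0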

In general, the mean of an exponential distribution is 1/λ and the median is ln(2)/λ. One way to check whether a dataset is exponential is to plot the complementary CDF, 1 − CDF(x), on a log-y scale. For data from an exponential distribution, the result is a straight line with slope −λ.
Note: in Python, the function expovariate in the random module generates random values from an exponential distribution with a given value of λ.
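The checks above can be tried on simulated data. The following is a minimal sketch, not code from the book: the rate lam = 2.0, the sample size, the seed, and the helper fit_slope are all assumptions chosen for illustration. It compares the sample mean and median with 1/λ and ln(2)/λ, then fits a line to log(CCDF) against x.

```python
import math
import random


def fit_slope(xs, ys):
    """Least-squares slope of ys against xs (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


random.seed(17)   # fixed seed so the run is repeatable
lam = 2.0         # assumed rate parameter, for illustration only
n = 100_000

sample = sorted(random.expovariate(lam) for _ in range(n))

sample_mean = sum(sample) / n
sample_median = sample[n // 2]
print("mean  ", sample_mean, "vs 1/lam     =", 1 / lam)
print("median", sample_median, "vs ln(2)/lam =", math.log(2) / lam)

# Empirical CCDF at the i-th sorted value is roughly 1 - (i + 1) / n.
# Drop the top 1% of points: the tail is noisy and the last point has CCDF 0.
keep = n - 1000
xs = sample[:keep]
log_ccdf = [math.log(1 - (i + 1) / n) for i in range(keep)]

slope = fit_slope(xs, log_ccdf)
print("slope of log(CCDF) vs x:", slope, "vs -lam =", -lam)
```

If the data were not exponential, the fitted points would bend away from a line, which is easier to see in a plot than in a single slope value.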
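2）The Pareto Distribution
The Pareto distribution is named after the Italian economist Vilfredo Pareto, who first used it to describe the distribution of wealth. Its cumulative distribution function is:
CDF(x) = 1 − (x / x_m)^(−α), for x ≥ x_m
where x_m is the minimum possible value and α determines the shape.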

If you plot the CCDF (complementary cumulative distribution function) of a sample from a Pareto distribution on a linear scale, you expect to see a function like:
y ≈ (x / x_m)^(−α)
Taking the log of both sides gives log y ≈ −α (log x − log x_m), so if you plot log y versus log x, it should look like a straight line with slope −α.
Note: in Python, the random module provides paretovariate, which generates random values from a Pareto distribution.
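As a rough illustration of the log-log test, the sketch below draws a sample with random.paretovariate and fits a line to log(CCDF) against log(x). The shape alpha = 1.7, the sample size, the seed, and the helper fit_slope are assumptions made for this example. Note that paretovariate takes only the shape parameter and returns values with minimum 1, so x_m = 1 here; multiplying the values by a constant would give a different x_m.

```python
import math
import random


def fit_slope(xs, ys):
    """Least-squares slope of ys against xs (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


random.seed(17)   # fixed seed so the run is repeatable
alpha = 1.7       # assumed shape parameter, for illustration only
n = 100_000

sample = sorted(random.paretovariate(alpha) for _ in range(n))

# With x_m = 1, solving CDF(x) = 1/2 gives a median of 2 ** (1 / alpha).
sample_median = sample[n // 2]
print("median", sample_median, "vs", 2 ** (1 / alpha))

# Empirical CCDF at the i-th sorted value is roughly 1 - (i + 1) / n.
# Drop the top 1% of points, where the estimate is noisy (and finally 0).
keep = n - 1000
log_x = [math.log(x) for x in sample[:keep]]
log_ccdf = [math.log(1 - (i + 1) / n) for i in range(keep)]

slope = fit_slope(log_x, log_ccdf)
print("slope of log(CCDF) vs log(x):", slope, "vs -alpha =", -alpha)
```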
3）Zipf’s Law
Zipf’s law is an observation about how often different words are used. The most common words have very high frequencies, but there are many unusual words, like “hapax legomenon,” that appear only a few times. Zipf’s law predicts that in a body of text, called a “corpus,” the distribution of word frequencies is roughly Pareto.
4）The Weibull Distribution
The Weibull distribution is a generalization of the exponential distribution that comes up in failure analysis. Its CDF is:
CDF(x) = 1 − e^(−(x/λ)^k)

Note: in Python, use random.weibullvariate to generate a sample from a Weibull distribution.
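A similar straight-line test can be sketched for the Weibull distribution, assuming the usual parameterization CDF(x) = 1 − e^(−(x/λ)^k). Taking logs twice gives log(−log(CCDF)) = k log x − k log λ, so that quantity plotted against log x should be a line with slope k. In the sketch below, the values lam = 2.0 and k = 1.5, the seed, and the helper fit_line are assumptions for illustration. random.weibullvariate(alpha, beta) takes the scale first and the shape second.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares (slope, intercept) of ys against xs (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


random.seed(17)   # fixed seed so the run is repeatable
lam = 2.0         # assumed scale parameter
k = 1.5           # assumed shape parameter; k = 1 would give the exponential case
n = 100_000

sample = sorted(random.weibullvariate(lam, k) for _ in range(n))

# Use only the middle 98% of the sorted sample: both tails are noisy
# after the double-log transform.
lo, hi = 1000, n - 1000
log_x = [math.log(sample[i]) for i in range(lo, hi)]
y = [math.log(-math.log(1 - (i + 1) / (n + 1))) for i in range(lo, hi)]

slope, intercept = fit_line(log_x, y)
k_hat = slope
lam_hat = math.exp(-intercept / slope)   # from intercept = -k * log(lam)
print("estimated k:", k_hat, "estimated lam:", lam_hat)
```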
5）The Normal Distribution
The normal distribution, also called Gaussian, is the most commonly used because it describes so many phenomena, at least approximately. It turns out that there is a good reason for its ubiquity, which we will get to in “Central Limit Theorem.”
There is no closed-form expression for the normal CDF. The most common alternative is to write it in terms of the error function, which is a special function written erf(x):
CDF(x) = (1/2) [1 + erf((x − μ) / (σ√2))]
erf(x) = (2/√π) × (integral from 0 to x of e^(−t²) dt)
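In Python the error function is available as math.erf, so the normal CDF can be sketched in a few lines using the standard identity CDF(x) = (1/2) [1 + erf((x − μ) / (σ√2))]. The function name normal_cdf and its default arguments are choices made for this example.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF written in terms of the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


print(normal_cdf(0.0))                       # 0.5, by symmetry
print(normal_cdf(1.96))                      # about 0.975
print(normal_cdf(1.0) - normal_cdf(-1.0))    # about 0.683, mass within one sigma
print(normal_cdf(110, mu=100, sigma=15))     # same function with other parameters
```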

6）Normal Probability Plot
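For the exponential, Pareto, and Weibull distributions, a simple transformation of the CDF tests whether the distribution fits a dataset. The normal distribution has no such transformation, but there is an alternative called a normal probability plot. It is based on rankits: if you generate n values from a normal distribution and sort them, the kth rankit is the mean of the distribution of the kth sorted value. Plotting the sorted data against the rankits gives roughly a straight line when the data are approximately normal.
7）The Lognormal Distribution
If the logarithms of a set of values follow a normal distribution, the values are said to follow a lognormal distribution.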
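Rankits can be estimated by brute force: generate many sorted samples of size n and average the value in each position. The sketch below does this with random.gauss. The function name rankits, the number of iterations, and the seed are assumptions for this example, and the result is a simulation estimate rather than an exact value.

```python
import random


def rankits(n, iters=5000, mu=0.0, sigma=1.0, seed=17):
    """Estimate the n rankits by averaging sorted normal samples."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(iters):
        sample = sorted(rng.gauss(mu, sigma) for _ in range(n))
        for k, value in enumerate(sample):
            totals[k] += value
    return [total / iters for total in totals]


r = rankits(6)
print(r)   # symmetric around 0; the extremes come out near -1.27 and 1.27
```

To draw a normal probability plot, one would sort the data and plot it against rankits(len(data)); points close to a straight line suggest the normal model is reasonable.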
It turns out that the distribution of weights for adults is approximately lognormal.
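The definition suggests a direct check: take the log of each value and see whether the logs look normal. A minimal sketch follows, using random.lognormvariate to generate data. The parameters mu and sigma, the sample size, and the seed are made-up values for illustration, not estimates from any real weight data.

```python
import math
import random
import statistics

random.seed(17)   # fixed seed so the run is repeatable
mu = 4.3          # assumed mean of the log values
sigma = 0.25      # assumed standard deviation of the log values
n = 50_000

values = [random.lognormvariate(mu, sigma) for _ in range(n)]
logs = [math.log(v) for v in values]

mean_log = statistics.mean(logs)
std_log = statistics.stdev(logs)
print("mean of logs:", mean_log, "vs mu =", mu)
print("std of logs :", std_log, "vs sigma =", sigma)

# The median of a lognormal distribution is exp(mu).
print("median:", statistics.median(values), "vs exp(mu) =", math.exp(mu))
```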

8）Why Model?
# 1 Like all models, continuous distributions are abstractions, which means they leave out details that are considered irrelevant.
# 2 Continuous models are also a form of data compression. When a model fits a dataset well, a small set of parameters can summarize a large amount of data.
# 3 It is sometimes surprising when data from a natural phenomenon fit a continuous distribution, but these observations can lead to insight into physical systems. Sometimes we can explain why an observed distribution has a particular form. For example, Pareto distributions are often the result of generative processes with positive feedback (so-called preferential attachment processes; see http://wikipedia.org/wiki/Preferential_attachment).
