什么是数据科学?《What is data science》 by Mike Loukides翻译和精读02
Where data comes from
数据从哪里来
Data is everywhere: your government, your web server, your business partners, even your body. While we aren’t drowning in a sea of data, we’re finding that almost everything can (or has) been instrumented. At O’Reilly, we frequently combine publishing industry data from Nielsen BookScan with our own sales data, publicly available Amazon data, and even job data to see what’s happening in the publishing industry. Sites like Infochimps and Factual provide access to many large datasets, including climate data, MySpace activity streams, and game logs from sporting events. Factual enlists users to update and improve its datasets, which cover topics as diverse as endocrinologists to hiking trails.
数据无处不在:你的政府、你的网络服务器、你的商业伙伴,甚至你的身体里都有数据。虽然我们还没有被数据的海洋淹没,但我们发现,几乎每一样东西都可以(或已经)被装上了测量记录的仪器。在O’Reilly,我们经常把来自尼尔森BookScan的出版业数据与我们自己的销售数据、公开可得的亚马逊数据、甚至招聘数据结合起来,以观察出版业正在发生什么。像Infochimps和Factual这样的网站提供对许多大型数据集的访问接口,包括气候数据、MySpace活动流,以及体育赛事的比赛日志。Factual发动用户来更新和改进它的数据集,这些数据集涵盖了从内分泌科医生到徒步路线等五花八门的主题。

IBM第一批商用磁盘驱动器之一:容量5MB,装在一个大约有豪华电冰箱那么大的柜子里。相比之下,一张32GB的microSD卡尺寸约为5/8 x 3/8英寸,重约0.5克。
图片:Mike Loukides,摄于IBM Almaden Research展出的磁盘驱动器。
Much of the data we currently work with is the direct consequence of Web 2.0, and of Moore’s Law applied to data. The web has people spending more time online, and leaving a trail of data wherever they go. Mobile applications leave an even richer data trail, since many of them are annotated with geolocation, or involve video or audio, all of which can be mined. Point-of-sale devices and frequent-shopper’s cards make it possible to capture all of your retail transactions, not just the ones you make online. All of this data would be useless if we couldn’t store it, and that’s where Moore’s Law comes in. Since the early ‘80s, processor speed has increased from 10 MHz to 3.6 GHz -- an increase of 360 (not counting increases in word length and number of cores). But we’ve seen much bigger increases in storage capacity, on every level. RAM has moved from $1,000/MB to roughly $25/GB -- a price reduction of about 40000, to say nothing of the reduction in size and increase in speed. Hitachi made the first gigabyte disk drives in 1982, weighing in at roughly 250 pounds; now terabyte drives are consumer equipment, and a 32 GB microSD card weighs about half a gram. Whether you look at bits per gram, bits per dollar, or raw capacity, storage has more than kept pace with the increase of CPU speed.
当下我们处理的很多数据,是Web 2.0以及摩尔定律作用于数据的直接结果。网络让人们花更多时间在线上,并且所到之处都留下一条数据踪迹。移动应用留下的数据踪迹更为丰富,因为其中很多都带有地理位置标注,或涉及视频、音频,而所有这些都可以被挖掘。销售点终端设备和常客购物卡使得捕捉你的所有零售交易成为可能,而不只是你在线上完成的那些。如果我们无法存储这些数据,那么它们都将毫无用处,而这正是摩尔定律发挥作用的地方。从20世纪80年代早期开始,处理器速度从10 MHz提高到了3.6 GHz——提高了360倍(这还没算上字长和内核数量的增加)。但我们在存储容量上看到了大得多的增长,而且是在各个层面上。RAM从每MB 1000美元降到了大约每GB 25美元——价格下降了约4万倍,更不必说体积的缩小和速度的提升。日立公司在1982年制造了第一批GB级磁盘驱动器,每台重约250磅;如今TB级硬盘已是消费级设备,一张32GB的microSD卡只重约0.5克。无论你看的是每克比特数、每美元比特数,还是原始容量,存储的增长都远不止是跟上了CPU速度提升的步伐。
注:摩尔定律是英特尔创始人之一戈登·摩尔的经验之谈,其核心内容为:集成电路上可以容纳的晶体管数目在大约每经过18个月到24个月便会增加一倍。换言之,处理器的性能大约每两年翻一倍,同时价格下降为之前的一半。这里是用处理速度的爆炸性增长来类比数据的爆炸性增长。
字长是CPU一次能并行处理的二进制位数。各个层面——指的是计算机的寄存器、内存、外存的容量都在不断高速增长中。
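下面用一段很短的Python算术来复核正文里的两个倍数(一个示意性的小计算;其中RAM价格按1 GB = 1024 MB换算,结果约为4万倍,与正文一致):

```python
# 示意性计算:复核正文中摩尔定律一段的两个倍数
cpu_old_hz = 10e6        # 20世纪80年代初:10 MHz
cpu_new_hz = 3.6e9       # 本文写作时:3.6 GHz
print(cpu_new_hz / cpu_old_hz)           # 360.0,即"提高了360倍"

ram_old_per_mb = 1000.0                  # 早期:约1000美元/MB
ram_new_per_gb = 25.0                    # 本文写作时:约25美元/GB
ram_old_per_gb = ram_old_per_mb * 1024   # 换算为每GB的价格
print(ram_old_per_gb / ram_new_per_gb)   # 40960.0,即"约4万倍"的降价
```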
The importance of Moore’s law as applied to data isn’t just geek pyrotechnics. Data expands to fill the space you have to store it. The more storage is available, the more data you will find to put into it. The data exhaust you leave behind whenever you surf the web, friend someone on Facebook, or make a purchase in your local supermarket, is all carefully collected and analyzed. Increased storage capacity demands increased sophistication in the analysis and use of that data. That’s the foundation of data science.
摩尔定律应用到数据上的重要性,不只是极客的炫技表演。数据会膨胀到填满你用来存储它的空间:可用的存储越多,你就会找到越多的数据放进去。无论你是在网上冲浪、在Facebook上加好友,还是在本地超市购物,你留下的"数据废气"都会被仔细地收集和分析。不断增长的存储容量,要求对这些数据的分析和使用也越来越精细。这就是数据科学的基石。
注:exhaust除了作动词,还可以作名词。The data exhaust you leave behind whenever you surf the web, friend someone on Facebook, or make a purchase in your local supermarket, is all carefully collected and analyzed. 这个长句是被动语态,主语是The data exhaust,动词是is,collected and analyzed表示被收集和分析。而exhaust要么是及物动词,要么是名词,这里只能是名词。data exhaust:你以为你留下的数据像尾气一样,排放之后就无影无踪,但并非如此——你的踪迹和数据都会被用心保存。
So, how do we make that data useful? The first step of any data analysis project is “data conditioning,” or getting data into a state where it’s usable. We are seeing more data in formats that are easier to consume: Atom data feeds, web services, microformats, and other newer technologies provide data in formats that are directly machine-consumable. But old-style screen scraping hasn’t died, and isn’t going to die. Many sources of “wild data” are extremely messy. They aren’t well-behaved XML files with all the metadata nicely in place. The foreclosure data used in “Data Mashups in R” was posted on a public website by the Philadelphia county sheriff’s office. This data was presented as an HTML file that was probably generated automatically from a spreadsheet. If you’ve ever seen the HTML that’s generated by Excel, you know that’s going to be fun to process.
因此,我们该如何让数据变得有用?任何数据分析项目的第一步都是"数据调节"(data conditioning),即让数据进入可用的状态。我们看到越来越多的数据以更易于使用的格式出现:Atom数据源、网络服务、微格式(microformats)以及其它较新的技术,都能以机器可直接处理的格式提供数据。但老式的屏幕抓取(screen scraping)还没有消亡,也不会消亡。很多"野生数据"的来源极其混乱,它们并不是元数据整齐就位、格式规范的XML文件。《Data Mashups in R》中使用的丧失抵押品赎回权(foreclosure)数据,是费城县警长办公室发布在一个公开网站上的。这些数据以HTML文件的形式呈现,而该文件很可能是由电子表格自动生成的。如果你见过Excel生成的HTML,你就知道处理起来会多么"有趣"了。
注:Excel就是电子表格(spreadsheet)软件的一种。最后一句是反话:如果数据不是严格规范的格式,处理起来会很痛苦。
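下面是一个示意性的屏幕抓取草稿,演示用Beautiful Soup从仿Excel导出风格的混乱HTML里抠出表格数据;这段HTML样例和表格结构都是虚构的,仅用于说明思路:

```python
# 示意性草稿:用 Beautiful Soup 解析由电子表格自动导出的混乱 HTML
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td><font face="Calibri"><b>地址</b></font></td><td>拍卖价</td></tr>
  <tr><td></td><td></td></tr>
  <tr><td><font>123 Main St</font></td><td>$50,000</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # Excel 导出的 HTML 常夹杂空行和嵌套的 <font>、<b> 等标签,
    # get_text(strip=True) 可以剥掉大部分噪音
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells and any(cells):          # 跳过整行为空的"装饰行"
        rows.append(cells)

print(rows)   # [['地址', '拍卖价'], ['123 Main St', '$50,000']]
```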
Data conditioning can involve cleaning up messy HTML with tools like Beautiful Soup, natural language processing to parse plain text in English and other languages, or even getting humans to do the dirty work. You’re likely to be dealing with an array of data sources, all in different forms. It would be nice if there was a standard set of tools to do the job, but there isn’t. To do data conditioning, you have to be ready for whatever comes, and be willing to use anything from ancient Unix utilities such as awk to XML parsers and machine learning libraries. Scripting languages, such as Perl and Python, are essential.
数据调节可能包括:用Beautiful Soup这样的工具清理混乱的HTML,用自然语言处理来解析英语及其它语言的纯文本,甚至找人来干这些脏活(dirty work)。你很可能要面对一系列格式各异的数据来源。要是有一套标准工具能干这件事就好了,但并没有。要做数据调节,你必须对任何可能出现的情况有所准备,并且愿意使用一切工具——从awk这样古老的Unix程序,到XML解析器和机器学习库。脚本语言,比如Perl和Python,是必不可少的。
注:李飞飞曾把重复、简单、含金量很低的数据标注工作称为dirty work。
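既然没有标准工具集,数据调节往往就是把不同来源、不同格式的数据手工归并成统一结构。下面是一个示意性草稿,假设有CSV和JSON两个来源,其中的字段名均为虚构:

```python
# 示意性草稿:把格式各异的数据来源归并成统一模式
import csv, json, io

csv_src = io.StringIO("title,price\nLearning Python,39.99\n")
json_src = '[{"name": "Programming Perl", "cost": "44.50"}]'

records = []
for row in csv.DictReader(csv_src):
    records.append({"title": row["title"], "price": float(row["price"])})
for item in json.loads(json_src):
    # 另一个来源用了不同的字段名,需要手工映射到统一模式
    records.append({"title": item["name"], "price": float(item["cost"])})

print(records)
```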
Once you’ve parsed the data, you can start thinking about the quality of your data. Data is frequently missing or incongruous. If data is missing, do you simply ignore the missing points? That isn’t always possible. If data is incongruous, do you decide that something is wrong with badly behaved data (after all, equipment fails), or that the incongruous data is telling its own story, which may be more interesting? It’s reported that the discovery of ozone layer depletion was delayed because automated data collection tools discarded readings that were too low.[1] In data science, what you have is frequently all you’re going to get. It’s usually impossible to get “better” data, and you have no alternative but to work with the data at hand.
一旦解析完数据,你就可以开始考虑数据的质量了。数据常常缺失或不一致。如果数据缺失,你是简单地忽略那些缺失点吗?这并不总是可行的。如果数据不一致,你是判定这些表现异常的数据本身出了问题(毕竟,设备会出故障),还是认为这些不一致的数据在讲述它自己的故事——而这也许更有意思?据报道,臭氧层损耗的发现之所以被推迟,就是因为自动数据收集工具丢弃了那些数值过低的读数[1]。在数据科学中,你手上有的,往往就是你能得到的全部。通常不可能获得"更好的"数据,你别无选择,只能处理手头的数据。
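臭氧层的教训是:异常值可能正是数据在讲自己的故事。下面这个示意性草稿演示"先标记、不要直接丢弃"的处理思路,其中的读数和阈值都是虚构的:

```python
# 示意性草稿:面对缺失和异常的读数,先标记,而不是静默丢弃
readings = [310, 305, None, 298, 120, 302]   # None 表示缺失,120 是可疑低值

LOW_THRESHOLD = 200   # 假设的"低得可疑"阈值

flagged = []
for i, r in enumerate(readings):
    if r is None:
        flagged.append((i, r, "missing"))        # 缺失:单独记录在案
    elif r < LOW_THRESHOLD:
        flagged.append((i, r, "anomalous-low"))  # 异常低:保留并标记,留待人工判断

print(flagged)   # [(2, None, 'missing'), (4, 120, 'anomalous-low')]
```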
If the problem involves human language, understanding the data adds another dimension to the problem. Roger Magoulas, who runs the data analysis group at O’Reilly, was recently searching a database for Apple job listings requiring geolocation skills. While that sounds like a simple task, the trick was disambiguating “Apple” from many job postings in the growing Apple industry. To do it well you need to understand the grammatical structure of a job posting; you need to be able to parse the English. And that problem is showing up more and more frequently. Try using Google Trends to figure out what’s happening with the Cassandra database or the Python language, and you’ll get a sense of the problem. Google has indexed many, many websites about large snakes. Disambiguation is never an easy task, but tools like the Natural Language Toolkit library can make it simpler.
如果这个难题涉及人类语言,那么理解数据就给难题增加了另一个维度。在O’Reilly管理数据分析小组的Roger Magoulas,最近在一个数据库中检索要求地理定位技能的Apple招聘信息。虽然这听起来是个简单任务,难点却在于:要在不断增长的Apple产业的大量招聘信息中,消除"Apple"一词的歧义。要想做好这件事,你需要理解一条招聘信息的语法结构;你需要有能力解析英语。而且这类难题出现得越来越频繁。试着用谷歌趋势(Google Trends)去弄清楚Cassandra数据库或Python语言正在发生什么,你就会体会到问题所在——谷歌已经为很多很多关于大蛇的网站建立了索引。消除歧义从来不是一件容易的事,但像Natural Language Toolkit(NLTK)这样的库可以让它变得简单一些。
注:ambiguous意为模糊的、有歧义的;disambiguate意为消除歧义(二义性)。Python有蟒蛇的意思;Python是编程语言,但在被用来命名编程语言之前,本意是蛇。
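下面是一个用NLTK做消歧第一步的示意性草稿:先分词、再做词性标注,借助专有名词(NNP)与普通名词(NN)的区别以及上下文词汇来初步判断;例句是虚构的,模型资源名以所装NLTK版本为准:

```python
# 示意性草稿:用 NLTK 观察 "Apple/apple" 在句中的词性,作为消歧的第一步
import nltk
nltk.download("punkt", quiet=True)                       # 分词模型
nltk.download("averaged_perceptron_tagger", quiet=True)  # 词性标注模型

sentences = [
    "Apple is hiring engineers with geolocation skills.",
    "She ate an apple while reading job postings.",
]
for s in sentences:
    tagged = nltk.pos_tag(nltk.word_tokenize(s))
    print(tagged)

# 专有名词(NNP)与普通名词(NN)的区分,加上邻近词(hiring、engineers 等),
# 是判断一条信息是否真的出自 Apple 公司的起点
```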
When natural language processing fails, you can replace artificial intelligence with human intelligence. That’s where services like Amazon’s Mechanical Turk come in. If you can split your task up into a large number of subtasks that are easily described, you can use Mechanical Turk’s marketplace for cheap labor. For example, if you’re looking at job listings, and want to know which originated with Apple, you can have real people do the classification for roughly $0.01 each. If you have already reduced the set to 10,000 postings with the word “Apple,” paying humans $0.01 to classify them only costs $100.
当自然语言处理失效时,你可以用人类智能来替代人工智能。这正是像亚马逊Mechanical Turk这类服务的用武之地。如果你能把任务拆分成大量易于描述的子任务,你就可以利用Mechanical Turk的廉价劳动力市场。举例来说,如果你在查看招聘信息,想知道哪些真正出自苹果公司,你可以让真人来做分类,每条大约只需0.01美元。如果你已经把集合缩小到一万条含有"Apple"一词的信息,那么按每条0.01美元雇人分类,总共只需花费100美元。
注:replace A with B是用B替代A;replace A for B是用A替代B。关于marketplace for cheap labor:微软投资的OpenAI,其ChatGPT也曾传出使用欠发达国家的低成本劳工来处理大量数据标注的传闻。只能说,用廉价劳动力完成繁杂、技术含量不高的数据标注工作,是行业初期的必然。
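下面用几行Python复算正文的成本估算,并顺带演示"把任务拆成小批次"的思路(批次大小为假设值):

```python
# 示意性草稿:把分类任务拆成便于人工处理的小批次,并复算正文的成本估算
postings = [f"posting-{i}" for i in range(10_000)]   # 1 万条含 "Apple" 的信息
cost_per_label = 0.01                                # 每条人工分类约 0.01 美元

batch_size = 50   # 假设每个小任务包含 50 条
batches = [postings[i:i + batch_size] for i in range(0, len(postings), batch_size)]

print(len(batches))                    # 200 个小任务
print(len(postings) * cost_per_label)  # 100.0,与正文的 100 美元一致
```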
[1] The NASA article denies this, but also says that in 1984, they decided that the low values (which went back to the 70s) were “real.” Whether humans or software decided to ignore anomalous data, it appears that data was ignored.
脚注[1]:NASA的文章否认了这一点(指自动工具丢弃数值过低的臭氧读数),但也承认在1984年,他们才判定那些低值(可以追溯到20世纪70年代)是"真实的"。无论是人还是软件决定忽略异常数据,看起来数据确实是被忽略了。