December 25

The Firefox 3.1 based List Hunter cluster for Live Search

NAME

List Hunter Cluster - our own deep-crawling cluster based on Mozilla Firefox 3.1

DESCRIPTION

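This document describes our Firefox 3.1 based List Hunter cluster, which is currently part of our company's Live Search engine.

Background

In our Live Search project, we need to recognize and extract web page content at a deep level. For classification based on text content, we currently use Yahoo! US's maximum-entropy-based DCP system. For classification by page structure (that is, is this page a list page or a detail page?), and for extracting the main link list and the main content region, we have long lacked a good solution. My colleagues tried a purely structural approach (such as the Haiway algorithm) and got an accuracy of only 60%, while machine learning approaches such as SVM are sensitive to the type of page: when the target pages differ substantially from the training set, accuracy drops quickly.

So I tried incorporating the visual information available when a page is rendered into the Haiway algorithm and the block-merging algorithm. Precision and recall then reached 90% and 80% respectively. The visual information here mainly consists of the size, the shape, and the position within the whole page of each page region; further information includes fonts, colors, and so on. That is how the List Hunter extension was born. How to turn a Firefox extension into a large-scale production cluster then became the important question.

The following blog post gives more background details and describes the List Hunter extension itself: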

http://blog.agentzh.org/#post-97

The extension depends only on Firefox and works right after installation:

http://agentzh.org/misc/listhunter.xpi

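Cluster architecture

The cluster consists of four major parts: the pure Firefox cluster, the Apache + mod_proxy + mod_disk_cache cluster, the curl prefetcher cluster, and the OpenResty cluster. A dozen or so production machines take part in the cluster, either "full time" or "part time". Each part is described below.

The pure Firefox cluster

The pure Firefox cluster currently consists of eight 4-core redhat5 production machines. Each machine runs 3 Firefox 3.1 process instances. Because these 8 machines also serve the product image display interface of Taobao VIP search (about 6 million page views per day), we did not dare run more Firefox processes on them.

Note that Firefox by default runs in a "process reuse" mode: launching the firefox-bin executable several times still yields a single Firefox process. This mode cannot make full use of the multi-core CPUs of the production machines, because at any given moment a firefox process (even with JS running flat out in several windows at once) can only run on one core, as it is not multi-threaded at the OS level. To run Firefox as multiple processes, you need to:

  1. Pass the -no-remote command line option when invoking firefox-bin, or set the environment variable MOZ_NO_REMOTE=1
  2. Run each firefox-bin process with a different profile (using the -P command line option).

The familiar Firefox main window is not started at all; instead, the List Hunter extension's UI runs on its own in chrome mode, for example: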

    firefox -chrome chrome://listhunter/content/crawler.xul -P crawler2 -no-remote

An extension run in chrome mode is very similar to a XUL application run under XULRunner.
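
As a rough sketch of how the two requirements combine, the following JavaScript builds one launch command per crawler process, each with its own profile and with -no-remote. The function name buildLaunchCommands and the profile naming scheme (crawler1, crawler2, ...) are assumptions for illustration; only the crawler2 profile appears in the command shown earlier.

```javascript
// Build one launch command (as an argv array) per crawler process.
// Each process gets its own profile plus -no-remote, so that it does not
// fold into an already running Firefox process.
// ASSUMPTION: profiles are named crawler1, crawler2, ...
function buildLaunchCommands(processCount) {
    const chromeUrl = "chrome://listhunter/content/crawler.xul";
    const commands = [];
    for (let i = 1; i <= processCount; i++) {
        commands.push(["firefox", "-chrome", chromeUrl, "-P", "crawler" + i, "-no-remote"]);
    }
    return commands;
}

// Three processes per machine, as in the cluster described here.
for (const argv of buildLaunchCommands(3)) {
    console.log(argv.join(" "));
}
```

A supervisor script could hand each of these argument lists to whatever process-spawning facility it uses.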

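Since Firefox 3.1 has not been officially released yet, I check out the latest version directly from the official Mercurial repository and compile it myself on our redhat production machines. We have made almost no changes to the official C++ source so far, which makes it easy to stay in sync with upstream. We currently use the following Firefox build options: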

    # My .mozconfig
    mk_add_options MOZ_MAKE_FLAGS="-j2"
    mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/ff-opt
    ac_add_options --enable-crypto --enable-feeds --disable-profilesharing \
        --enable-rdf --enable-zipwriter --disable-tests --disable-gnomeui --disable-cookies \
        --disable-canvas --disable-gnomeui --disable-inspector-apis --disable-mailnews \
        --disable-mathml --disable-official-branding --enable-plaintext-editor-only \
        --disable-postscript --disable-printing --disable-profilelocking --disable-safe-browsing \
        --disable-startup-notification --disable-svg --disable-svg-foreignobject \
        --disable-updater --disable-javaxpcom --disable-plugins --disable-crashreporter \
        --disable-tests --disable-debug --enable-application=browser --build=i686-linux \
        --disable-jsd --disable-ldap --enable-strip --disable-accessibility --disable-ogg \
        --disable-dbus --disable-freetype2 --disable-optimize

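We disabled everything that could be disabled. The three features feeds, rdf, and crypto cannot be disabled, otherwise the build fails with complaints about missing .h header files. --disable-ogg does not actually take effect either, although material found online suggests that it once did.

In fact, we do currently apply one C++ patch to the official source, which redirects the Errors shown in the Error Console to stderr. This makes it easier to catch and diagnose anomalies through the Firefox processes' log files in a cluster environment. The current patch looks like this: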

http://agentzh.org/misc/191src.patch.txt

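It is worth mentioning that the Firefox processes are "headless": they run on top of the Xvfb X server and render purely in memory, without needing any display hardware. The Firefox processes are supervised by a process monitoring script of our own, written in Perl. The script comes from our Proc::Harness module: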

http://svn.openfoundry.org/xulapp/trunk/demo/ListHunter/misc/Proc-Harness/

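Like lighttpd's FastCGI server, Proc::Harness maintains a specified number of processes (via the Proc::Simple module on CPAN). When a child process dies, it is restarted immediately; a child whose stderr/stdout output has not changed for a given time limit is also killed and restarted. The Proc::Harness script itself runs under daemontools.

These Firefox processes are fully controlled by the List Hunter extension installed in them. They are highly autonomous robots. Internally they run a processing loop: they fetch URL tasks in batches from OpenResty's web service interface, load and analyze them one by one in Firefox's browser component, and finally submit the results in batches through OpenResty.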
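
The shape of such a processing loop can be sketched as follows. This is a minimal, synchronous illustration and not the extension's actual code: fetchTasks, analyzePage, and submitResults are hypothetical callbacks standing in for the OpenResty web service calls and the DOM analysis, and the sketch stops when a batch comes back empty, whereas a real crawler would keep polling.

```javascript
// A synchronous sketch of a batch-oriented crawler loop.
// ASSUMPTION: fetchTasks, analyzePage and submitResults are hypothetical
// stand-ins; the real extension talks to OpenResty and a browser component.
function runCrawlerLoop(fetchTasks, analyzePage, submitResults, batchSize) {
    let processed = 0;
    for (;;) {
        const urls = fetchTasks(batchSize);   // fetch a batch of URL tasks
        if (urls.length === 0) break;         // the sketch ends when no work is left
        const results = [];
        for (const url of urls) {             // handle the tasks one by one
            try {
                results.push({ url: url, ok: true, data: analyzePage(url) });
            } catch (err) {
                // one failing page should not abort the whole batch
                results.push({ url: url, ok: false, error: String(err) });
            }
        }
        submitResults(results);               // submit the results as one batch
        processed += urls.length;
    }
    return processed;
}

// Usage with in-memory stand-ins for the web service:
const pending = ["http://example.com/a", "http://example.com/b", "http://example.com/c"];
const submitted = [];
const total = runCrawlerLoop(
    (n) => pending.splice(0, n),
    (url) => ({ pageType: url.endsWith("a") ? "list" : "detail" }),
    (batch) => submitted.push(batch),
    2
);
console.log(total, submitted.length);
```

Batching both the fetch and the submit keeps the number of web service round trips per page low, which matters when each page takes only about a second to process.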

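The curl prefetching crawler cluster and the Apache mod_proxy cluster

This cluster is currently deployed on six dual-core redhat4 production machines. Each machine has two cluster components installed: a prefetcher and Apache mod_proxy. The prefetcher uses curl (more precisely, the WWW::Curl module) to prefetch each page's HTML and CSS through mod_proxy, so that the responses get cached in mod_proxy by mod_disk_cache. When the pure Firefox cluster later fetches these URLs through mod_proxy, mod_proxy can return the cached responses to Firefox directly.

The prefetchers and the Firefox processes work concurrently, but for any given URL task, a Firefox process handles it only after a prefetcher has prefetched it. The two therefore form a two-stage pipeline. This scheduling is done by the OpenResty cluster.

The prefetcher is currently implemented as a Perl module named WWW::Prefetcher: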

http://svn.openfoundry.org/xulapp/trunk/demo/ListHunter/misc/WWW-Prefetcher/

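Although mod_cache provides many options, its caching behavior still follows the RFC's caching requirements fairly closely. I therefore modified mod_cache extensively so that it unconditionally caches every requested page, regardless of whether the URL has a query string and regardless of what the response headers ask for. Our patch against the latest httpd 2.2.11 is as follows: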

http://agentzh.org/misc/httpd-2.2.11.patch.txt

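In particular, mod_disk_cache does not point at a directory on disk but at a tmpfs partition in RAM. Our six machines have very old IDE hard disks, so when the disk was used directly as cache storage, the load on each machine was above 20 under high concurrency, which was intolerable. After switching to tmpfs combined with the htcacheclean tool, the machine load dropped below 0.1.

The OpenResty cluster

Because OpenResty is general-purpose, we simply reused the production cluster that also serves yahoo.cn and Koubei (3 FastCGI front-end machines and 1 PL/Proxy machine), so I did not deploy any new machines. The OpenResty interface that serves the Firefox cluster exposes a number of PostgreSQL functions through the View API; these handle task scheduling and result aggregation for the whole List Hunter cluster. In the current implementation, we simulate a circular task queue with Pg sequences, and use counters to keep the two pipeline stages relatively in sync.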
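
To make the idea of a circular queue driven by ever-increasing counters more concrete, here is a toy in-memory model in JavaScript. It is only an illustration under stated assumptions: the class TwoStageRing and its method names are invented, handing a URL to the prefetcher is treated as if the prefetch had already completed, and the real scheduling is done by the PostgreSQL functions linked below, which may work quite differently.

```javascript
// Toy in-memory model of a circular task queue shared by two pipeline stages.
// ASSUMPTION: this mimics the idea only (monotonic counters mapped onto a
// fixed ring of tasks); it is not a port of the actual SQL.
class TwoStageRing {
    constructor(urls) {
        this.urls = urls;       // fixed ring of URL tasks
        this.prefetchSeq = 0;   // behaves like a sequence: it only increases
        this.crawlSeq = 0;
    }

    // Stage 1: hand the next URL to a prefetcher.
    nextForPrefetch() {
        // never run more than one full turn of the ring ahead of the crawler
        if (this.prefetchSeq - this.crawlSeq >= this.urls.length) return null;
        return this.urls[this.prefetchSeq++ % this.urls.length];
    }

    // Stage 2: hand the next URL to a crawler, but only if stage 1 is ahead.
    nextForCrawl() {
        if (this.crawlSeq >= this.prefetchSeq) return null;
        return this.urls[this.crawlSeq++ % this.urls.length];
    }
}

const demo = new TwoStageRing(["http://example.com/1", "http://example.com/2"]);
console.log(demo.nextForCrawl());     // nothing has been prefetched yet
console.log(demo.nextForPrefetch());
console.log(demo.nextForCrawl());
```

Comparing the two counters is what keeps the stages in relative sync: the crawler can never overtake the prefetcher, and the prefetcher cannot lap the crawler.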

The definitions of the relevant Pg functions, sequences, and indexes are here:

http://svn.openfoundry.org/xulapp/trunk/demo/ListHunter/misc/init-db.sql

The definitions of the relevant OpenResty objects are here:

http://svn.openfoundry.org/xulapp/trunk/demo/ListHunter/misc/init-resty.pl

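Cluster performance

The cluster currently produces a steady 100,000+ pages per hour, or more than 2.4 million pages per day. The load on the Firefox machines is around 3, and the load on the proxies is below 0.1.

JS benchmarks show that Firefox 3.1 takes 200 ~ 300 ms on average to load a page, the network latency between data centers is 10 ~ 20 ms (the pages are already cached by mod_cache, so there is no cost for going out to the Internet), and the List Hunter extension's DOM analysis code takes 200 ~ 300 ms. Counting the remaining OpenResty overhead as well, one Firefox process handles roughly one page per second.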
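
These figures can be cross-checked with a little arithmetic. The sketch below only restates numbers given in this document (8 machines, 3 Firefox processes each, about one second per page, 100,000 pages per hour); the function names are made up for illustration.

```javascript
// Pages per hour for a cluster of identical crawler processes.
function pagesPerHour(machines, processesPerMachine, secondsPerPage) {
    const processes = machines * processesPerMachine;
    return processes * (3600 / secondsPerPage);
}

// Per-page time budget implied by a target hourly output.
function secondsPerPageFor(machines, processesPerMachine, targetPagesPerHour) {
    return (machines * processesPerMachine * 3600) / targetPagesPerHour;
}

console.log(pagesPerHour(8, 3, 1.0));           // 24 processes at 1 page/sec each
console.log(secondsPerPageFor(8, 3, 100000));   // budget for 100,000 pages/hour
```

With 24 processes, exactly one second per page would give 86,400 pages per hour, so an output above 100,000 pages per hour implies an average of about 0.86 seconds per page. That is consistent with "roughly one page per second": the measured load, network, and analysis times add up to between 410 and 620 ms before the OpenResty overhead is counted.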

The memory footprint of one Firefox process on Linux is as follows:

    VIRT 276m, RES 86m, SHR 34m

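Known bottlenecks and defects

When the URL task table in OpenResty exceeds 1 ~ 2 million rows, the scheduling queries tend to exceed PL/Proxy's 10-second limit. We therefore import and export tasks in a "streaming" fashion: a cron job periodically imports new tasks into the database and at the same time promptly moves completed tasks out.

Apache's mod_proxy is not stable enough under high concurrency, and Apache's own architecture makes proxy pipelining impossible. So when the cluster grows further, we plan to switch to Squid. Of course, Squid will very likely also need modifications to meet our requirement of forcibly caching pages for a specified period of time.

Meanwhile, because the Apache mod_cache backend is not distributed, the scheduling of proxy servers is done inside the Firefox processes and the curl prefetcher processes. This makes the front-end code rather complex and also brings the problem of periodically synchronizing the proxy server list. In the future we may therefore add a memcached cache backend to Apache mod_cache or Squid, so that the multiple proxy front-end servers become "transparent" to the rest of the cluster.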
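
Since each proxy keeps its own cache, the prefetcher and the Firefox process presumably have to send a given URL to the same proxy for the cache to be hit. One plausible way to get that agreement without any coordination is to derive the proxy from a hash of the URL, as sketched below. This is an assumption for illustration: the text does not say how the real clients choose a proxy, and the hash function and host names here are made up.

```javascript
// Pick a proxy for a URL so that independent clients agree on the choice.
// ASSUMPTION: the hashing scheme and the proxy host names are invented for
// illustration; they are not taken from the actual cluster code.
function pickProxy(url, proxies) {
    let hash = 0;
    for (let i = 0; i < url.length; i++) {
        // simple polynomial hash, kept as an unsigned 32-bit value
        hash = (hash * 31 + url.charCodeAt(i)) >>> 0;
    }
    return proxies[hash % proxies.length];
}

const proxies = ["proxy1.example:8080", "proxy2.example:8080", "proxy3.example:8080"];
const url = "http://example.com/list?page=2";

// Prefetcher and crawler compute the same answer independently.
console.log(pickProxy(url, proxies) === pickProxy(url, proxies));
```

The drawback matches the one described above: every client needs the same, up-to-date proxy list, which is exactly the synchronization problem a shared memcached backend would remove.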

TODO

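  1. Switch to Squid + memcached as the caching forward proxy
  2. Run the List Hunter extension under XULRunner instead of via firefox -chrome. (This requires adding XULRunner support to my XUL::App framework.)

Similarities and differences with similar products

Yahoo! US built a crawler cluster called HLFS by heavily modifying the C++ source code of Firefox 2. It is used to crawl the content of AJAX sites and to obtain DOM trees annotated with visual information. They turned the Firefox processes into HTTP proxies that serve external applications.

The Firefox processes in our List Hunter cluster, by contrast, are highly autonomous crawlers that continuously fetch batches of tasks from OpenResty on their own. External applications keep the cluster running by importing tasks into OpenResty in batches. Because the List Hunter cluster barely modifies the Firefox source code, we can easily stay in sync with the latest official version and benefit from upstream optimizations as soon as they land.

The List Hunter cluster is also general-purpose: it can serve as a "cluster container" for all kinds of Firefox extensions. In other words, it is a complete framework for "clustering" Firefox extensions.

Since Firefox extension development itself has already been greatly simplified by the XUL::App framework I published on CPAN, the cost of responding to new requirements is very low.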

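Pros and cons of using Firefox

Pros

Firefox is a world-class browser. By using one of the most complex and feature-rich Internet clients as a crawler, we get the same rich functionality that end users enjoy; neither AJAX nor visual information is a problem.

Firefox has a flexible extension mechanism based on XUL and chrome JS and is very easy to extend. In fact, Firefox's main UI is itself one big extension. Gecko is also built on XPCOM components, so it is easy to develop XPCOM components in C/C++/Java and then glue them together with JavaScript. JavaScript has thus become a glue language like Perl.

Extension JavaScript running on top of Gecko has the highest privileges: it can access files on disk, read system environment variables, and use the native XMLHttpRequest object to issue cross-domain AJAX requests.

Firefox's performance tends to change dramatically with each new release. The Gecko engine in Firefox 3.1 renders several times faster than the one in 3.0 (according to benchmark results on the List Hunter regression test suite, the former averages 60 ms while the latter takes 200+ ms). (The TraceMonkey JIT support in Firefox 3.1, however, brought no measurable performance improvement to the JS in List Hunter.)

A Firefox extension written in pure JS installs and runs out of the box on Win32, Linux, and Mac, which makes it easy to discuss behavioral details with editors and product managers and to give demos. If the computation gets too heavy, the compute-intensive parts of the extension can be rewritten in C++.

Cons

Firefox is highly coupled software, in sharp contrast to WebKit, the core of Google Chrome and Safari. This means that it is hard for us to trim Firefox down deeply: we cannot easily leave out some of the larger functional components, and it is hard to pull one large component out for standalone use (SpiderMonkey being one of the few exceptions, of course).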

AUTHOR

章亦春 (agentzh) <agentzh@yahoo.cn>

LICENSE

Copyright (c) 2007-2008, Yahoo! China EEEE Works, Alibaba Inc. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  • Neither the name of the Yahoo! China EEEE Works, Alibaba Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

November 29

Q4 is crazy!

Yeah, Q4 is really crazy! I've been hacking on several company projects in parallel over the last few weeks. Fortunately they're all very interesting stuff.

We've just kicked OpenResty 0.5.2 out of the door and I'm preparing for the 0.5.3 release right now. My teammate xunxin++ has quickly implemented the YLogin handler for OpenResty, via which users can use their Yahoo! ID to log in to their own applications on OpenResty. Our Yahoo! registration team helpfully worked out a sane design to allow us to reuse the Yahoo! Login system, which effectively turned the Yahoo! ID into something like a passport, at least from the perspective of OpenResty users :) Big moment! Lots of company products using Yahoo! IDs could be rewritten in 100% JavaScript! Actually our team is already rewriting the Search DIY product using all the goodies offered by OpenResty.

Meanwhile, some guys from Sina.com are doing their personal projects in OpenResty. They said they really appreciated the great opportunities provided by the OpenResty architecture, since various kinds of clients (e.g. web sites, cellphones, desktop apps, etc.) can share the same set of APIs via OpenResty's web services. They also sent useful feedback and suggestions regarding OpenResty's design and implementation.

I've also been working on an intelligent crawler cluster based on Firefox, Apache mod_proxy/mod_cache, and OpenResty. The crawler itself is a plain Firefox extension named List Hunter:

    http://agentzh.org/misc/listhunter.xpi

It's an enhanced version of the Haiway List Recognization Engine used by my SearchAll extension and is also built with my XUL::App framework. You can install it in your Firefox and play with it if you like ;) What this extension does is very simple: it recognizes "list regions" and "text regions" in an arbitrary web page and then decides automatically whether it's a "list page" or a "text page". The latter functionality may sound a bit weird: why is it useful to categorize web pages that way? Anyway, our PM (Product Manager) has crazy ideas about that categorization in our Live Search project and knows better than us ;)

Turning such a Firefox extension into tens or even hundreds of Firefox crawlers running on a bunch of production machines requires a lot of work. I devised a prefetching system which prefetches HTML pages and the CSS files included in them, and caches the headers and contents for a fixed amount of time in such a way that Firefox crawlers can later load pages and CSS files directly from the same cache in our local network, thus significantly reducing the page loading time in Gecko. The cache is a heavily patched version of Apache2's mod_cache with mod_disk_cache as the backend storage. The prefetchers and crawlers interact with the Internet and the cache via HTTP proxies based on Apache2's mod_proxy. Pipelining the prefetching and crawling processes requires OpenResty with PgQ enabled. Well, I'm still working on this cluster and my goal is 2 pages/sec for every single Firefox process. Firefox 3.1's amazing performance boost (more than 30% faster according to my own benchmark) makes me very confident in abusing Gecko to build efficient crawlers that take advantage of the rich rendering information.

Another Firefox crawler project haunting my head is a similar one that automatically recognizes and extracts user comments from arbitrary web pages (if any comments appear, of course). Such tasks would be hard if my code had to run without the geometric information of every DOM node provided by the browser rendering engine (in the form of the offsetWidth, offsetHeight, offsetTop, and offsetLeft attributes of DOM elements). Some other colleagues in Alibaba's Search Tech Center are wrapping their heads around Cobra, a pure Java HTML renderer. But I doubt that it would run more correctly or more efficiently than Gecko. Oh well, I'm not a Java guy anyway...
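
To give a feel for what "geometric information" buys you, here is a small sketch that turns an element's box into a few visual features (relative size, shape, and position). It's only an illustration: the function name regionFeatures and the feature names are hypothetical and not taken from List Hunter, and the box is assumed to be already expressed in page coordinates (in a real page, offsetLeft and offsetTop are relative to the element's offsetParent and would have to be accumulated first).

```javascript
// Derive simple visual features of a page region from its geometry.
// `box` mirrors the offsetLeft/offsetTop/offsetWidth/offsetHeight values of a
// DOM element. ASSUMPTION: the values are already in page coordinates, and
// the feature names are invented for illustration.
function regionFeatures(box, pageWidth, pageHeight) {
    const area = box.offsetWidth * box.offsetHeight;
    return {
        areaRatio: area / (pageWidth * pageHeight),                    // size relative to the page
        aspectRatio: box.offsetWidth / box.offsetHeight,               // shape: wide versus tall
        centerX: (box.offsetLeft + box.offsetWidth / 2) / pageWidth,   // horizontal position, 0..1
        centerY: (box.offsetTop + box.offsetHeight / 2) / pageHeight   // vertical position, 0..1
    };
}

const features = regionFeatures(
    { offsetLeft: 100, offsetTop: 200, offsetWidth: 400, offsetHeight: 200 },
    1000,
    2000
);
console.log(features);
```

A narrow, tall region hugging the left edge looks very different in these numbers from a wide block in the middle of the page, which is the kind of signal a pure DOM-structure approach cannot see.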

Finally, just a short note: I had a wonderful time with clkao and Jesse Vincent at Beijing Perl Workshop 2008. I learned quite a lot about the Prophet internals during the hackathon after the conference, and Jesse quickly hacked out a stub OpenResty model API for Prophet. Then we went to the Great Wall the next day. I was amazed to find Jesse hacking crazily on the Great Wall and enjoying the sunshine alone... Wow.

Enough blogging...back to hacking ;)

June 12

Optimizing Haskell code: from String to ByteString

Haskell's built-in strings are notoriously slow. The String type in Haskell is just [Char]. A few years ago I learned from the Pugs blog that the bytestring (or fps) library provides a much faster alternative. (Thanks Audrey!)

However, it took me a while to figure out how to use it in my code. Eventually I found that all I needed was in the Data.ByteString.Char8 module rather than Data.ByteString. (Thanks Hoogle!) According to the documentation, it's recommended to import the module this way:

   import qualified Data.ByteString.Char8 as B

to prevent name clashing with Prelude.

Converting String to B.ByteString is straightforward:

    B.pack "Hello, world"

where "Hello, world" is of type String.

Or in the other direction:

    B.unpack s

where s is of type B.ByteString.

Concatenating several bytestrings together can be done by the B.concat function:

    B.concat [B.pack "hello", B.pack ", ", B.pack "world"]

or, for convenience, just use B.append to join two bytestrings:

    (B.pack "hello, ") `B.append` (B.pack "world")

Personally I like to define a ~~ operator for a bytestring version of ++ this way:

    (~~) :: B.ByteString -> B.ByteString -> B.ByteString
    (~~) = B.append

and then I can simply write:

    B.pack "hello, " ~~ B.pack "world"

Bytestring versions for most of the functions in Prelude are also provided. For instance, printing out a bytestring to stdout can be done directly by

    B.putStrLn bs   -- bs is of type B.ByteString

rather than the cumbersome and also slow

    putStrLn $ B.unpack bs

As bytestring's documentation points out, converting back and forth between bytestrings and Haskell's built-in strings can become the bottleneck of a program, especially when the source comes with lots of string literals like the "Hello, world" shown above. Wouldn't it be nice if GHC interpreted string literals as bytestrings automatically, without going through B.pack? Fortunately, with bytestring 0.9.0.4 (or later) and GHC 6.8.1 (or later), it is possible to do that via the GHC option -XOverloadedStrings. So now we can write literals without mucking around with B.pack:

    B.concat ["hello", ", ", "world"]

or

    "hello, " ~~ "world"

Perfect! :D

Note that, as of this writing, the bytestring library in Ubuntu 8.04's package repository is not new enough to support this. So Ubuntu users have to install the latest version from HackageDB like this:

    $ wget http://hackage.haskell.org/packages/archive/bytestring/0.9.1.0/bytestring-0.9.1.0.tar.gz
    $ tar -xzf bytestring-0.9.1.0.tar.gz
    $ cd bytestring-0.9.1.0/
    $ runghc Setup.lhs configure -p
    $ runghc Setup.lhs build
    $ sudo runghc Setup.lhs install

After I switched to B.ByteString in the code emitters of the minisql compiler mentioned in the previous blog post, the execution time dropped dramatically from 7.0 sec to 2.3 sec in my stress tests generated by the Perl module Parse::RandGen::Regexp. This is really an amazing improvement :) Furthermore, my UTF-8 regression tests kept passing as well.

In the next journal I'll present another optimization trick that further reduced the running time from 2.3 sec to 1.0 sec. (Well, it has nothing to do with -O2 BTW, and I turned on -O2 from the very beginning already ;))


January 18

Re: Intercepting access to a method/property

On Jan 18, 2008 7:21 PM, AllSeeingI wrote:
> Is it possible (through an extension, XPCOM, other way) to call a
> particular JS function when a particular method or property is
> accessed by a user script (= script on a HTML page)?
>

Object.watch is the way to go for properties ;) Not sure about methods though.

> The reason I'm asking is that I'm trying to create an extension that
> intercepts JavaScript redirections:
>
> location.href = ...

Heh, I'm afraid it's more browser-specific. So it might be OT here. But I'd like to share some of my experiences (mostly from NSA++) in this mail.

I think the following code should work in Firefox 2 (i.e. the js 1.7 engine):

    top.watch("location", function () { throw "Permission denied." });
    top.location.watch("href", function () { throw "Permission denied." });

But unfortunately it won't work in Firefox 3 (i.e. the js 1.8 engine). AFAIK, Firefox has been trying much harder than IE to protect frame-busting sites.

> location.replace(...);
>

Well, I was trying very hard to defeat this one but with no luck. A good enough workaround for (static) sites is to (locally) disable JS for that particular frame loading the frame-busting page, as in:

    myBrowser.docShell.allowJavascript = false;

Basically, if you load the web page in a separate chrome window, frame-busting code will always fail. But if you're trying to load it in Firefox's own browser tab, you're not really "chrome" there.

Another trick that works is to use the onbeforeunload handler, as in:

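    window.onbeforeunload = function (e) {
        // The message was broken across lines in the mail; a JS string literal
        // cannot span lines, so it is joined with + here.
        e.returnValue = "This action might be caused by a frame-busting site.\n" +
            "Please click 'Cancel' if you're not meant to quit me.";
        return false;
    };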

But this will pop up a confirmation dialog to the end user. There's no known way to bypass it without hacks ;)

There may exist much better solutions that I don't know.

Hope these help.

Cheers,
-agentzh


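April 18

Why is a byte 8 bits?

I remember that Baoquan, the "super genius" of our class, asked a rather unusual question back when we were learning C++ as freshmen: "Why is a byte 8 bits?"

Last night I posted this question to the #perl6 channel on irc.freenode.net, and Larry Wall (TimToady), jerry gay ([particle]), and moritz joined the discussion. Below is the chat log (agentzh is me):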

<agentzh> a friend of mine once asked me why a byte is of 8 bits.
<moritz> agentzh: what did you answer?
<moritz> agentzh: "computer scientist love powers of two"?
<agentzh> moritz: i told him because ASCII code has 7 bits and the people want to feel safer and add one more
<TimToady> lol
<moritz> *g* nice explanation ;-)
<agentzh> thanks :D
<TimToady> and then the Europeans all added one more, and did we feel safer?
<TimToady> I don't think so...
* agentzh wants to hear TimToady's explanation.
<TimToady> I think the ASCII explanation is basically correct, from a cultural point of view. When people started programming PDP-11s and doing a lot of string processing, they decided it was convenient that it came close to a power of two, and stuck with it.
<TimToady> and it was also fairly obvious about then that the next generation would be 32-bit processors, and then you get 4 chars into it.
<TimToady> but I think the powers-of-two argument was kind of a post-facto rationalization of the ASCII culture
<TimToady> basically, Pascal and C thought in bytes, so everything else followed along.
* TimToady remembers various contortions of trying to rationalize the type system of C on some weird old architectures that were not amenable to bytes...
<TimToady> and the term "byte" itself had not yet settled on 8 bits
* moritz thinks of "mix", Donald E. Knuth's assembler, that doesn't rely on a fixed byte size
<TimToady> yes. 36-bit computers tended to use 6 bit characters
<[particle]> octet is the correct term, but byte has become a synonym
<TimToady> byte is now the correct term. octet will die eventually
<TimToady> and go back to being 8 singers.
<TimToady> except for in standards documents, where it will likely remain a shibboleth

The original chat log is at:

http://colabti.de/irclogger/irclogger_log/perl6?date=2007-04-17,Tue&sel=451#l672

It includes the context around the excerpt above. Larry was really on a roll last night, with one joke after another. A true master...

章亦春

March 26

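Fixing RealPlayer having no sound on Ubuntu

I remember my apprentice reporting a month ago that RealPlayer on Ubuntu played video without any sound; I did not expect to run into the same problem myself. Fortunately, after a lot of Googling, I finally found the following fix:

* First install the ALSA OSS compatibility layer:

    $ sudo apt-get install alsa-oss

* Then edit the launcher script (/usr/lib/realplay-10.0.8/realplay) and change line 73 from

    $REALPLAYBIN "$@"

to

    aoss $REALPLAYBIN "$@"

On my own feisty fawn, the installed version is RealPlayer 10.0.7, and the line to change in the realplay file is line 70 rather than line 73. Now .rmvb files finally play with sound! Great! No more running WinXP in VirtualBox just to watch movies.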

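July 7

What are tuits?

I often see programmers (and plenty of non-programmers too) using the word tuits in their emails, IRC messages, and documentation, yet it is nowhere to be found in ordinary dictionaries, and even online dictionaries rarely have it. Typical uses of tuits look like this: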

    A> Will you work on that project?
    B> Well, as soon as i have the tuits.

Or, for another example:

    A> Oh, i'm exhausted. i don't think i have the tuits to finish the job today!
    B> alas...

But what do tuits mean? What are tuits?

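From these examples we can more or less guess that tuits means something like "time", "inspiration", or "motivation". American programmers on the libwww-perl mailing list confirm our guess:

Mail 1

Mail 2

Interestingly, the second link points to an explanation of tuits by Larry Wall, the father of the Perl language.

From these mails it is easy to see that the word tuits originates from the phrase round tuit, which in turn originates from the following sentence: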

   I'll do that when I get around to it.

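Here the collocation "get around to" means "to find the time to do or consider something". Evidently, "to it" fused into "tuit". Isn't that going a bit too far?

The definition of round tuit on en.wikipedia.org further confirms the above: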

A round tuit is an imaginary object whose name is derived from the phrase ``when I get around to it''.

As we can see, quite a few English words are well worth savoring.