Human & Machine (agentzh's blog)


November 18

The "headers more" module: scripting input and output filters in your Nginx config file

I've been working madly on the "headers more" module:

   http://github.com/agentzh/headers-more-nginx-module

Everything I wanted is working now. It also has a nice wiki page (which includes a brief explanation of the underlying implementation):

   http://wiki.nginx.org/NginxHttpHeadersMoreModule

Our buzzword is that it can rewrite the "Server" output header dynamically! See this:

   location /foo {
        more_set_headers   "Server: $arg_server";
   }


Then GET /foo?server=Foo will get a response with the "Server: Foo" header set ;)

Input headers can be trivially rewritten as well, including the "Host" header:

    more_set_input_headers   "Host: some-other-host";

Well, the full practical power of this module is beyond my current imagination. If you have some crazy uses, please drop me a line ;)

Happy Nginx hacking!
November 15

The "chunkin" module: Experimental chunked input support for Nginx

Pushed by those cutting-edge users on the Nginx mailing list, I've quickly worked out the "chunkin" module which adds HTTP 1.1 chunked input support for Nginx without the need of patching the core:

    http://github.com/agentzh/chunkin-nginx-module

This module registers an access-phase handler that will eagerly read and decode incoming request bodies when a "Transfer-Encoding: chunked" header triggers a 411 error page in Nginx (hey, that's what you have to pay for avoiding patching the core ;)). For requests that are not in the "chunked" transfer encoding, this module is a "no-op".
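To make "decoding a chunked request body" concrete, here is a minimal Python sketch of the wire format the module has to parse. It is only an illustration of HTTP/1.1 chunked transfer encoding, not the module's Ragel-generated C parser; the function name `decode_chunked` is made up for this example, and trailers and chunk extensions are simply ignored.

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked message body into the raw payload."""
    payload = bytearray()
    pos = 0
    while True:
        # Each chunk starts with "<hex size>[;extensions]\r\n".
        line_end = body.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("truncated chunk-size line")
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field, 16)
        pos = line_end + 2
        if size == 0:
            # Last chunk: optional trailers may follow; they are ignored here.
            return bytes(payload)
        chunk = body[pos:pos + size]
        if len(chunk) != size or body[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("truncated or malformed chunk data")
        payload += chunk
        pos += size + 2


wire = b"5\r\nhello\r\n7\r\n, world\r\n0\r\n\r\n"
print(decode_chunked(wire))  # b'hello, world'
```

Unlike this sketch, a real server-side decoder has to work incrementally on whatever bytes have arrived so far, which is where most of the buffer-handling risk comes from.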

To enable the magic, just turn on the "chunkin" config option like this:

    chunkin on;
    location /foo { ... }
    ....

No other modification is required in your nginx.conf file. (The "chunkin" directive is not allowed in the location block BTW.)

This module is still considered highly experimental and there must be some serious bugs lurking somewhere. But you're encouraged to play and test it in your non-production environment and report any quirks to me :)

Efforts have been made to reduce data copying and dynamic memory allocation, thus unfortunately raising the risk of potential buffer handling bugs caused by premature optimizations :P

This module is not supposed to be merged into the Nginx core because I've used Ragel to generate the chunked encoding parser for joy :)

The following Nginx versions have been successfully tested by this module's (very limited) test suite:

   0.8.0 ~ 0.8.24
   0.7.21 ~ 0.7.63

The test suite definitely needs more test cases and the code is hacky in various places. If you're willing to contribute, feel free to ask me for a commit bit in a private email :)

Update: I've also added a wiki page for it: http://wiki.nginx.org/NginxHttpChunkinModule
October 15

Hacking on the Nginx echo module

Over the past few weeks, I've been reading a lot of the C source code of Nginx and its modules, and it's really enjoyable. I've got lots of good ideas for implementing the next generation of the OpenResty server based on the Nginx architecture. Well, it's currently my full-time $work anyway.

For the sake of testing other modules, experimenting with the Nginx internals, and for fun, I've started my first Nginx module named "echo":

   http://github.com/agentzh/echo-nginx-module

It's already quite usable, and it also has a declarative test suite based on Perl's Test::Base. At the moment, LWP is used for simplicity and it's rather weak in testing streaming behavior of Nginx (I'm using "curl" to test these aspects manually for now). I'm considering coding up my own Perl HTTP client library based on IO::Select and IO::Socket (there might be already one around?).

Along the way, I'm intentionally commenting my C source heavily in this "echo" module in the hope that newcomers will find it a "live tutorial" or something like that. I'll write more about the details here in subsequent posts. After all, it can do a lot more than just "echo" stuff directly, such as sleeping and flushing the output buffer. And it will be capable of outputting subrequests' responses as well.

Happy hacking Nginx C modules and stay tuned!

Update: I've also added a wiki page for it: http://wiki.nginx.org/NginxHttpEchoModule
September 20

Slides for my perl testing & VDOM.pm talks in Beijing Perl Workshop 2009

I really enjoyed the talks at BJPW2009. Here are the slides for my two talks at this conference:

Hopefully you'll find them interesting ;)
September 14

A plan for nginx-openresty

Now that I've joined Taobao.com's SDS department and will be able to work on OpenResty full time, I've just worked out a (somewhat) detailed plan for the next generation of the OpenResty server. Well, sorry, this draft is in Chinese since my $manager reads Chinese better:

    http://www.pgsqldb.org/mwiki/index.php/Nginx_openresty_plan  (still being actively updated)

After talking with my friend and colleague chaoslawful++ for possible designs of a high performance implementation of the OpenResty server, we finally decided to rewrite OpenResty.pm in pure C and in the form of an nginx module.

Here are some highlights of the Chinese project plan given above:
  • Nginx-openresty will remain fully opensource.
  • We want to take full advantage of Nginx's event model and asynchronous I/O, and we don't want backend requests to block either.
  • We want to integrate Coco Lua into nginx's event model leveraging coco's C level coroutines, thus leading to transparent asynchronous I/O on the Lua land. For example, consider the following Lua code

              res = http.get('http://www.taobao.com')

    will automatically yield the current "lua thread", register a socket fd with nginx's underlying epoll/kqueue/select/etc model, and return control to nginx to do other things. Once data arrives at the socket fd, nginx will get informed and accumulate the response data. When the requested data is completely ready, nginx will resume the pending lua session and the Lua function "http.get" will successfully return.
  • We want Lua to become the first-class embedded language on the service level and we'd abandon the restyscript language.
  • The first two backends we want to implement for nginx-openresty are mysql and Oracle (and Hive afterwards) because these are heavily used in our department's production.
  • We'll only allow the HTTP POST method to emulate modification methods like PUT and DELETE. GET will no longer be allowed here, to reduce the risk of XSS attacks.
  • Password login method will require the client to request a random number from the OpenResty server and use it as a salt to encrypt her password using multi-pass MD5, as in

         passwd = md5(passwd)
         for 1..2 do
             passwd = md5(passwd + salt)
         done

  • For the REST interface, we will introduce a Type API to allow users to define new parameter types for Views and Actions with Lua snippets.
  • For RDBMS backends, Views and Actions can use the "map_to" attribute to map automatically to underlying DB functions and stored procedures.
  • Views and Actions' definitions (aka "queries") will be specified in the backend's own query language (like PL/SQL, PL/PgSQL, etc.). OpenResty will only recognize special interpolated parameters in the form of $(param_name type=xxx checker=xxx default=xxx ...).
  • Views and Actions will support "cached" and "expire" attributes to allow caching of result data sets, and an "async" attribute to allow time-consuming backend queries to be submitted to remote queues like memcacheq and eventually run by async daemon workers. The original View invoker will immediately get a job ID for his request (unless the task queue is full), poll OpenResty's Job API for the status of his task, and get the final results when the task is marked "done".
  • View/Action parameters can specify types, Lua-specified checkers, and default values. For example:

         {"name":"my_view","query":
              "select * from animals where age > $(age type=integer checker='return age >= 0 and age <= 100')"}

  • Filter, Template, and Trigger APIs will also be introduced :)
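
As a rough illustration of the interpolated parameter syntax in the list above, here is a small Python sketch that pulls the $(name key=value ...) placeholders out of a query string. It is only a guess at how such a parser could look, not the planned implementation (which is supposed to be C and Lua); the function name `parse_placeholders` and the returned dict layout are assumptions made for this example.

```python
import re
import shlex

# Matches $( ... ) where the body may contain single-quoted strings
# but no nested parentheses (a simplification for this sketch).
PLACEHOLDER = re.compile(r"\$\(((?:[^()']|'[^']*')*)\)")


def parse_placeholders(query):
    """Return one dict per $(...) placeholder found in the query text."""
    params = []
    for match in PLACEHOLDER.finditer(query):
        # shlex handles the quoting: checker='return ...' stays one token.
        tokens = shlex.split(match.group(1))
        spec = {"name": tokens[0]}
        for token in tokens[1:]:
            key, _, value = token.partition("=")
            spec[key] = value
        params.append(spec)
    return params


query = ("select * from animals where age > "
         "$(age type=integer checker='return age >= 0 and age <= 100')")
print(parse_placeholders(query))
```

For the my_view query above this yields a single parameter named age with type integer and the Lua checker body as a plain string.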
I'll create a git repository on GitHub for the nginx-openresty's development. Participation will always be appreciated :) I'll keep you posted.

Update: Special thanks go to kindy++ for his detailed review of the nginx-openresty-plan document and helpful suggestions :)
September 4

Slides for my VDOM + WebKit talk

I gave a presentation on VDOM + WebKit to the Taobao.com Search Frontend Team this morning. The slides are based on my talk in April's Beijing Perl Workshop, but with notable updates to reflect recent changes in the last few months:

   http://agentzh.org/misc/slides/taobao-fe/vdomwebkit.xul   (Firefox required to open this link)

Be patient while it downloads the big images, or download the whole tarball to your local machine, unpack it, open the vdomwebkit.xul file in it, and browse the slides locally:

   http://agentzh.org/misc/slides/taobao-fe.tar.gz

Recent major developments regarding our browser-based web scraping clusters are:

  • We've switched from the Visual DOM Firefox extension completely to the VDOM Browser based on QtWebKit for hunter development.
  • We've switched from OpenResty + Pg to our queue-size-aware version of memcacheq to coordinate the whole cluster.
  • We've extensively used our new "VDOM spectroscopy algorithm" to establish correspondence between text nodes in similar page regions.
  • We've renamed the offsetX, offsetY, offsetWidth, and offsetHeight attributes in the VDOM data format to single-letter names x, y, w, and h, respectively.
  • We've tweaked QtWebKit to emit geometric information about text nodes as well as text runs in the VDOM data output.
I'll use almost the same slides for my 3rd (lightning) talk at the annual Beijing Perl Workshop conference a few weeks from now :)
September 3

Our queue-size-aware version of memcacheq

Xunxin++ and I have been working on a fork of memcacheq (originally just within the company), adding support for a queue length constraint.

In our scenario, a pipelined webpage information extraction cluster based on Apple's WebKit core, it's important to limit the queue's length and to make the queue "inform" the queue item producers in some way when the queue is full.
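
The idea can be sketched in a few lines of Python. This is an in-process toy, not our memcacheq patch: the class name `BoundedQueue` and the choice of signalling "full" through a False return value are assumptions made for illustration only.

```python
from collections import deque


class BoundedQueue:
    """A FIFO queue that refuses new items once it holds max_items."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._items = deque()

    def put(self, item):
        """Append item; return False (instead of blocking) when the queue is full."""
        if len(self._items) >= self.max_items:
            return False
        self._items.append(item)
        return True

    def get(self):
        """Pop the oldest item, or return None when the queue is empty."""
        return self._items.popleft() if self._items else None


queue = BoundedQueue(max_items=2)
print(queue.put("page-1"), queue.put("page-2"), queue.put("page-3"))  # True True False
```

A producer that gets False back knows the downstream stage is lagging and can slow down or retry later, which is the kind of back-pressure a pipelined cluster needs.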

We're not sure if it's worth merging back to the mainstream version because this new addition adds some cost (though the cost is low).

Here goes the project page on GitHub, with more explanation of the details (I won't repeat them here ;)):

  http://github.com/agentzh/memcacheq/tree/master

The newly added code is also licensed under the same license as the mainstream memcacheq.

Enjoy :)
September 2

I'll talk in the upcoming Beijing Perl Workshop 2009 event

I've submitted 3 talk proposals to this year's upcoming Beijing Perl Workshop conference, scheduled for Sep 19. I'll publish the slides for my talks here later for your preview :)

The 3 talks are
  1. Wonders of Perl automated testing
  2. Inventing mini languages in Perl
  3. Web Scraping based on Perl + WebKit + VDOM
If you feel like attending the conference, please register at the conference site below:

   http://conference.perlchina.org/bjpw2009/

Don't forget to specify a T-shirt size in your profile setting there so that we can prepare a T-shirt for you in this event (well, it's for free!).

See you there ;)
May 11

OpenResty.pm has been moved to GitHub

As some of you may have already noticed, I've moved the source repository of OpenResty.pm from the good old OpenFoundry to GitHub:

    http://github.com/agentzh/openresty/tree/master

Feel free to branch it or ask me for a commit bit if you don't have one ;)

I'll remove the stuff in the old "openapi" repository on OpenFoundry and leave a note there to avoid potential confusion.

Mailing list for OpenResty

After releasing several new releases of OpenResty.pm to CPAN, I created a mailing list for OpenResty users/developers on Google Groups:

    http://groups.google.com/group/openresty?hl=en

This is for both OpenResty.pm and mod_openresty. You're very welcome to join us there ;) There's also a #openresty on freenode but it's been very quiet :P

April 28

Text::SmartLinks: The Perl 6 love for Perl 5

I'm so glad to find this blog post while browsing the Iron Man planet:

   http://szabgab.com/blog/2009/04/1240827553.html

Three years ago, I wrote the smartlinks.pl script to integrate the Pugs test suite with the Perl 6 Synopses documentation. Gábor Szabó now has done an excellent job in refactoring and packaging the tool into a general-purpose CPAN module. It had been my TODO until I was caught by accumulated schoolwork :P

Enjoy his (well, also our) Text::SmartLinks module!

   http://search.cpan.org/perldoc?Text::SmartLinks

April 23

SSH::Batch: Treating clusters as maths sets and intervals

System administration is also part of my $work. Playing with a (big) bunch of  machines without a handy tool is painful. So I refactored some of our old scripts and released SSH::Batch, a collection of useful parallel ssh scripts, to CPAN:

    http://search.cpan.org/dist/SSH-Batch/

SSH::Batch allows you to name your clusters using variables and interval/set syntax in your ~/.fornodesrc config file. For instance:

    $ cat ~/.fornodesrc
    A=foo[01-03].com bar.org
    B=bar.org baz[a-b,d,e-g].cn foo02.com
    C={A} * {B}
    D={A} - {B}

where cluster C is the intersection set of cluster A and B while D is those machines in A but not in B.

You can then query a cluster's host list with SSH::Batch's fornodes script:

   $ fornodes '{C}'
   bar.org
   foo02.com

   $ fornodes '{D}'
   foo01.com
   foo03.com
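
To show what the interval and set syntax boils down to, here is a small Python sketch that expands the two cluster definitions above and reproduces the fornodes results with ordinary set operations. It is not SSH::Batch's actual Perl implementation; the helper names are made up, and it only handles numeric ranges and single-letter ranges like the ones in this example.

```python
import re


def expand_range(spec):
    """Expand one range item such as '01-03', 'a-b' or 'd' into a list of strings."""
    if "-" not in spec:
        return [spec]
    start, end = spec.split("-", 1)
    if start.isdigit() and end.isdigit():
        width = len(start)  # keep zero padding, e.g. 01, 02, 03
        return [str(n).zfill(width) for n in range(int(start), int(end) + 1)]
    return [chr(c) for c in range(ord(start), ord(end) + 1)]


def expand_host(pattern):
    """Expand a 'foo[01-03].com' style pattern into a set of host names."""
    match = re.search(r"\[([^\]]+)\]", pattern)
    if match is None:
        return {pattern}
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    hosts = set()
    for item in match.group(1).split(","):
        for part in expand_range(item):
            hosts |= expand_host(prefix + part + suffix)
    return hosts


def cluster(spec):
    """Expand a whitespace-separated list of host patterns into one set."""
    hosts = set()
    for pattern in spec.split():
        hosts |= expand_host(pattern)
    return hosts


A = cluster("foo[01-03].com bar.org")
B = cluster("bar.org baz[a-b,d,e-g].cn foo02.com")
print(sorted(A & B))  # like {A} * {B}: ['bar.org', 'foo02.com']
print(sorted(A - B))  # like {A} - {B}: ['foo01.com', 'foo03.com']
```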

Furthermore, to run a command on a cluster at the concurrency level of 6:

   atnodes 'ls -lh' '{A} + {B}' my.more.com -c 6

Or upload a local file to the remote cluster:

  tonodes ~/my.tar.gz '{A} / {B}' :/tmp/

There's also a key2nodes script to push SSH public keys to remote machines ;)

A colleague in Alibaba B2B is already using it. And one of my teammates is going to use it to operate on those thousands of machines in our instance of the YST (Yahoo! Search Technology) cluster and I'm ready to receive more feedback from him ;)

Have fun :)
April 10

My VDOM.pm & WebKit Cluster Talk at the April Meeting of Beijing Perl Workshop

Last night I gave a talk to our PerlChina folks at the April meeting in the Flow Bar. Here are the slides that I used:

The XUL format is the best among the three ;)

Just as the topic of the talk suggests, we're migrating from Firefox clusters to WebKit ones. I'll post more details here in the near future.

Enjoy!
February 26

mod_libmemcached_cache is now opensourced :)

I've opensourced my mod_libmemcached_cache project on GitHub.com with permission from my company:

    http://github.com/agentzh/mod-libmemcached-cache/tree/master

It's a memcached storage provider for Apache2's mod_cache. In contrast to the mod_memcached_cache module on Google Code, we use the popular libmemcached library rather than apr-util's. Feel free to branch it and I'm very willing to merge back any useful changes and I'd love to send out commit bit as well :)

Mind you, it's licensed under GPLv2. That's my company's decision, not mine ;)

February 13

The slides for my talk on Firefox cluster & vision-based web page extraction

I gave a talk at the Beijing Perl Mongers' Feb Meeting last night. It was about my Firefox cluster and vision-based web page extraction technology. I had not expected to see so many people there. Wow. The talk was well received and people asked lots of interesting questions :)

The slides can be freely downloaded from my site (open the ffcluster.xul file in the tarball via Firefox):

    http://agentzh.org/misc/slides/BJPW200902.tar.gz

or browse directly online by Firefox:

    http://agentzh.org/misc/slides/BJPW200902/ffcluster.xul

Because it contains many big pictures, I recommend downloading it to your local machine first and viewing it offline :)

I'll also give this presentation again to those Ruby/Python/Java/C++ guys at Beijing OpenParty's Fox meeting:

    http://www.beijing-open-party.org/index.php/2009/02/beijing-open-party-2009-02-fox-event-begin.html

Just as a side note: recently I've been intrigued by Apache C hacking. mod_libmemcached_cache is my first Apache module, and I'd love to see more in the near future, such as mod_openresty ;)

Have fun!





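December 25

The Firefox 3.1 based List Hunter cluster of our Live Search project

NAME

List Hunter Cluster - our own deep-crawling cluster based on Mozilla Firefox 3.1

DESCRIPTION

This document describes our List Hunter cluster, which is based on Firefox 3.1. It is currently part of our company's Live Search engine.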

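Background

In our Live Search project we need to recognize and extract web page content at a deep level. For classification based on text content, we currently use Yahoo! US's maximum-entropy-based DCP system. But for classifying pages by structure (is this a list page or a detail page?) and for extracting the main link list and the main content region, we have never had a good solution. My colleagues tried purely structural methods (such as the Haiway algorithm) and got only 60% accuracy, while machine-learning methods such as SVM are rather sensitive to page type: when the target pages differ much from the training set, accuracy drops quickly.

So I tried combining the visual information available when a page is rendered with the Haiway algorithm and a block-merging algorithm. Precision and recall then reached 90% and 80%, respectively. The visual information here mainly consists of a page region's size, shape, and position within the whole page; further information includes fonts, colors, and so on. This is how the List Hunter extension was born. The important question then became how to turn a Firefox extension into a large-scale cluster for production use.

In the following blog post I gave more background details and described the List Hunter extension itself: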

http://blog.agentzh.org/#post-97

The extension depends only on Firefox and works right after installation:

http://agentzh.org/misc/listhunter.xpi

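Cluster architecture

The cluster consists of four major parts: the pure Firefox cluster, the Apache + mod_proxy + mod_disk_cache cluster, the curl prefetcher cluster, and the OpenResty cluster. A dozen or so production machines take part in this cluster "full-time" or "part-time". Each part is described below.

The pure Firefox cluster

The pure Firefox cluster currently consists of 8 quad-core redhat5 production machines. Each machine runs 3 Firefox 3.1 process instances. Because these 8 machines also serve the product image display interface of Taobao VIP search (about 6 million page views per day), we did not dare run more Firefox processes on them.

Note that Firefox by default runs in a "process reuse" mode: launching the firefox-bin executable several times still yields a single Firefox process. This mode cannot make full use of the multi-core CPUs of the production machines, because at any given moment a firefox process (even with JS in several windows running flat out at the same time) can only run on one core, since it does not use multiple OS threads. To make Firefox run as multiple processes, you need to:

  1. Pass the -no-remote command-line option when invoking firefox-bin, or set the environment variable MOZ_NO_REMOTE=1.
  2. Run the different firefox-bin processes with different profiles (using the -P command-line option).

The main Firefox window that we normally see is not started; instead, the List Hunter extension's UI is run on its own in chrome mode, for example:

    firefox -chrome chrome://listhunter/content/crawler.xul -P crawler2 -no-remote

An extension run in chrome mode is very similar to a XUL application run via XULRunner.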
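
As a small illustration, the following Python sketch builds one such command line per Firefox process, each with its own profile. Only the options shown in the command above are used; the profile naming scheme crawler1..crawlerN is an assumption extrapolated from "crawler2", and in our setup the processes are actually started by a Perl supervisor rather than by Python.

```python
def firefox_command(profile, chrome_url="chrome://listhunter/content/crawler.xul"):
    """Build the argument list for one headless List Hunter Firefox process."""
    return ["firefox", "-chrome", chrome_url, "-P", profile, "-no-remote"]


def cluster_commands(process_count):
    """One command per process; profile names crawler1..crawlerN are assumed."""
    return [firefox_command(f"crawler{i}") for i in range(1, process_count + 1)]


for command in cluster_commands(3):
    print(" ".join(command))
```

Each argument list could be handed to something like subprocess.Popen; the separate profile plus -no-remote is what keeps the instances from collapsing into a single process.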

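Since Firefox 3.1 has not been officially released yet, I checked out the latest revision from the official Mercurial source repository and compiled it myself on our redhat production machines. We have made almost no changes to the official C++ source so far, so that it is easy to stay in sync with the official version. We currently use the following Firefox build options: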

# My .mozconfig
mk_add_options MOZ_MAKE_FLAGS="-j2"
mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/ff-opt
ac_add_options --enable-crypto --enable-feeds --disable-profilesharing
ac_add_options --enable-rdf --enable-zipwriter --disable-tests --disable-gnomeui --disable-cookies
ac_add_options --disable-canvas --disable-gnomeui --disable-inspector-apis --disable-mailnews
ac_add_options --disable-mathml --disable-official-branding --enable-plaintext-editor-only
ac_add_options --disable-postscript --disable-printing --disable-profilelocking --disable-safe-browsing
ac_add_options --disable-startup-notification --disable-svg --disable-svg-foreignobject
ac_add_options --disable-updater --disable-javaxpcom --disable-plugins --disable-crashreporter
ac_add_options --disable-tests --disable-debug --enable-application=browser --build=i686-linux
ac_add_options --disable-jsd --disable-ldap --enable-strip --disable-accessibility --disable-ogg
ac_add_options --disable-dbus --disable-freetype2 --disable-optimize

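We disabled everything that could be disabled here. The three options feeds, rdf, and crypto cannot be disabled, otherwise the source fails to compile with complaints about missing .h header files. --disable-ogg does not actually have any effect either, although material on the web suggests that it used to work.

In fact, we do currently apply one C++ patch to the official source, which redirects the Errors in the Error Console to stderr, making it easier for us to catch and diagnose exceptions in the cluster environment through the Firefox processes' log files. The current patch looks like this: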

http://agentzh.org/misc/191src.patch.txt

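It is worth mentioning that the Firefox processes themselves are "headless": they run on top of the Xvfb X server, which performs rendering in memory only and does not need any display hardware. The Firefox processes are supervised by a process-monitoring script of our own written in Perl. The script comes from our Proc::Harness module: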

http://svn.openfoundry.org/xulapp/trunk/demo/ListHunter/misc/Proc-Harness/

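Like lighttpd's FastCGI server, Proc::Harness maintains a specified number of processes (via the Proc::Simple module on CPAN). When a child process dies it is restarted immediately, and when a child's stderr/stdout output has not changed for a certain time limit, it is killed and restarted as well. The Proc::Harness script itself runs under daemontools.

These Firefox processes are under the full control of the List Hunter extension installed in them. They are highly autonomous robots. Each has an internal processing loop that fetches URL tasks batch by batch from OpenResty's web service interface, loads and analyzes them one by one in Firefox's browser component, and finally submits the analysis results batch by batch through OpenResty.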
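
The shape of that processing loop can be sketched in Python as follows. The real loop lives in the extension's chrome JavaScript and talks to OpenResty over HTTP; here the three callables `fetch_tasks`, `analyze`, and `submit_results` are hypothetical stand-ins, and the loop stops on an empty batch so the example terminates (a real crawler would keep polling).

```python
def run_crawler(fetch_tasks, analyze, submit_results, batch_size=10):
    """Process tasks in batches until fetch_tasks returns an empty batch."""
    processed = 0
    while True:
        urls = fetch_tasks(batch_size)            # one request for a batch of URL tasks
        if not urls:
            return processed
        results = [analyze(url) for url in urls]  # pages are handled one by one
        submit_results(results)                   # one request for the whole batch
        processed += len(urls)


# In-memory fakes standing in for the OpenResty web service.
pending = [f"http://example.com/page{i}" for i in range(25)]
submitted = []


def fetch_tasks(count):
    batch = pending[:count]
    del pending[:count]
    return batch


total = run_crawler(fetch_tasks, lambda url: {"url": url, "type": "list"}, submitted.extend)
print(total, len(submitted))  # 25 25
```

Batching both the task fetch and the result submission keeps the number of web service round trips per page small, which matters when each page only takes about a second to process.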

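The curl prefetching crawler cluster and the Apache mod_proxy cluster

This cluster is currently deployed on 6 dual-core redhat4 production machines. Each machine has two cluster components installed: the prefetcher and Apache mod_proxy. The prefetcher uses curl (more precisely, the WWW::Curl module) to prefetch the HTML and CSS of web pages through mod_proxy, so that the responses to these requests get cached in mod_proxy via mod_disk_cache. When the pure Firefox cluster later fetches these URLs through mod_proxy, mod_proxy can return the cached results directly to Firefox.

The prefetchers and the Firefox processes work concurrently, but for any given URL task, a Firefox process handles it only after the prefetcher has prefetched it. In effect this forms a two-stage pipeline. The scheduling is done by the OpenResty cluster.

The prefetcher is currently implemented as a Perl module named WWW::Prefetcher: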

http://svn.openfoundry.org/xulapp/trunk/demo/ListHunter/misc/WWW-Prefetcher/

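Although mod_cache offers many options, its caching behavior still follows the RFC's caching requirements fairly closely. So I modified mod_cache a lot to make it unconditionally cache every requested page, regardless of whether the URL has a query string and regardless of what the response headers ask for. Our patch against the latest httpd 2.2.11 is here: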

http://agentzh.org/misc/httpd-2.2.11.patch.txt

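Notably, mod_disk_cache points not at a disk directory but at a tmpfs partition set up in RAM. Because these 6 machines all have very old IDE hard disks, using the disk directly as cache storage pushed the load of every machine above 20 under high concurrency, which was simply unbearable. After switching to tmpfs combined with the htcacheclean tool, the machine load dropped below 0.1.

The OpenResty cluster

Thanks to OpenResty's generality, we simply reused the production cluster that also serves yahoo.cn and Koubei.com (3 FastCGI frontend machines and 1 PL/Proxy machine), so I did not deploy any new machines. The OpenResty interface serving the Firefox cluster exposes a number of PostgreSQL functions through the View API to handle task scheduling and result collection for the whole List Hunter cluster. In the current implementation we emulate a kind of circular task queue with Pg sequences and use counters to keep the two pipeline stages roughly in step.

The definitions of the relevant Pg functions, sequences, and indexes are here: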

http://svn.openfoundry.org/xulapp/trunk/demo/ListHunter/misc/init-db.sql

The definitions of the relevant OpenResty objects are here:

http://svn.openfoundry.org/xulapp/trunk/demo/ListHunter/misc/init-resty.pl

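Cluster performance

The cluster's output is currently stable at over 100,000 web pages per hour, or over 2.4 million per day. The load on the Firefox machines is around 3, and the load on the proxies is below 0.1.

JS benchmarks show that Firefox 3.1 takes 200 ~ 300 ms on average to load a page, the network latency between data centers is 10 ~ 20 ms (the pages are already cached by mod_cache, so there is no network cost to the outside Internet), and the DOM analysis code of the List Hunter extension takes 200 ~ 300 ms. Adding in the remaining OpenResty overhead, one Firefox process handles a page in roughly 1 sec.

The memory footprint of one Firefox process on Linux is as follows: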
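
As a quick sanity check of these figures, the following Python snippet redoes the arithmetic using only numbers quoted in this document (8 Firefox machines with 3 processes each, 100,000 pages per hour, and the three latency ranges). The variable names are mine, and the OpenResty overhead is left out because no figure is given for it.

```python
PAGES_PER_HOUR = 100_000      # "over 100,000 web pages per hour"
MACHINES = 8                  # machines in the pure Firefox cluster
PROCESSES_PER_MACHINE = 3     # Firefox processes per machine

processes = MACHINES * PROCESSES_PER_MACHINE
pages_per_day = PAGES_PER_HOUR * 24
per_process_rate = PAGES_PER_HOUR / 3600 / processes  # pages per second per process

# Measured latency components per page, in milliseconds (low, high).
page_load = (200, 300)
network = (10, 20)
dom_analysis = (200, 300)
low = page_load[0] + network[0] + dom_analysis[0]
high = page_load[1] + network[1] + dom_analysis[1]

print(processes, pages_per_day, round(per_process_rate, 2), low, high)
# 24 2400000 1.16 410 620
```

So the measured components add up to 410 ~ 620 ms per page, and the observed cluster output corresponds to a bit more than one page per second per process, in line with the rough 1 sec estimate once the OpenResty round trips are added.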

    VIRT 276m, RES 86m, SHR 34m

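Known bottlenecks and shortcomings

When the URL task table in OpenResty grows beyond 1 ~ 2 million rows, the scheduling queries tend to exceed PL/Proxy's 10-second limit. We therefore currently import and export tasks in a "streaming" fashion: a cronjob periodically imports tasks into the database and at the same time promptly moves finished tasks out.

Apache's mod_proxy is not stable enough under high concurrency, and because of Apache's own architecture, proxy pipelining cannot be implemented. So we plan to switch to Squid when the cluster grows further. Of course, Squid will very likely need modifications too in order to meet our requirement of forcibly caching content for a specified period of time.

Also, because the Apache mod_cache backend is not distributed, scheduling across proxy servers is done inside the Firefox processes and the curl prefetcher processes, which makes the frontend code rather complex and brings the problem of periodically synchronizing the proxy server list. In the future we could therefore consider adding a memcached cache backend to Apache mod_cache or Squid, so that the multiple proxy frontend servers become "transparent" to the other components of the cluster.

TODO

  1. Switch to Squid + memcached as the caching forward proxy.
  2. Run the List Hunter extension via XULRunner rather than firefox -chrome. (This requires adding XULRunner support to my XUL::App framework.)

Similarities to and differences from comparable products

Yahoo! US heavily modified the C++ source of Firefox 2 to develop a crawler cluster called HLFS, used to crawl the content of AJAX sites and to obtain DOM trees with visual information. They turned the Firefox processes into HTTP proxies that serve external applications.

The Firefox processes in our List Hunter cluster, by contrast, are highly autonomous crawlers that keep fetching batches of tasks from OpenResty on their own. External applications drive the cluster by importing tasks into OpenResty in batches. Because the List Hunter cluster hardly modifies the Firefox source code, we can easily stay in sync with the latest official version and enjoy the many benefits of upstream optimizations right away.

The List Hunter cluster itself is also general-purpose: it can serve as a "cluster container" for all kinds of Firefox extensions. In other words, it is a complete framework for "clusterizing" Firefox extensions.

Since Firefox extension development has already been greatly simplified by the XUL::App framework that I released to CPAN, the cost of responding to new requirements is very low.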

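Pros and cons of using Firefox

Pros

Firefox is a world-class browser. As one of the most complex and feature-rich Internet clients, using it as a crawler gives us the same rich functionality that end users get; neither AJAX nor visual information is a problem.

Firefox has a flexible extension mechanism based on XUL and chrome JS and is extremely easy to extend. In fact, the main Firefox UI is itself one big extension. Gecko is also built on XPCOM components, so it is easy to develop XPCOM components in languages such as C/C++/Java and then glue them together with JavaScript. JavaScript thus becomes a glue language like Perl.

Extension JavaScript running on top of Gecko has the highest privileges: it can access files on disk, read system environment variables, and issue cross-domain AJAX requests with the native XMLHttpRequest object.

Firefox's performance tends to change dramatically with each new release. The Gecko engine in Firefox 3.1 renders several times faster than the one in 3.0 (according to the benchmark results of the List Hunter regression test suite, an average of 60 ms for the former versus 200+ ms for the latter). (TraceMonkey's JIT support in Firefox 3.1, however, brought no measurable performance gain to the JS in List Hunter.)

Firefox extensions written in pure JS work out of the box on Win32/Linux/Mac, which makes it easy to discuss behavioral details with editors and product managers and to give demos. If the computation gets too heavy, the compute-intensive parts of the extension can be rewritten in C++.

Cons

Firefox is highly coupled software, in sharp contrast to WebKit, the core of the Google Chrome and Safari browsers. This means that it is hard for us to trim Firefox deeply: we cannot easily drop some of the larger functional components, and it is also hard to pull one big component out and use it on its own (SpiderMonkey, of course, is one of the few exceptions).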

AUTHOR

章亦春 (agentzh) <agentzh@yahoo.cn>

LICENSE

Copyright (c) 2007-2008, Yahoo! China EEEE Works, Alibaba Inc. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  • Neither the name of the Yahoo! China EEEE Works, Alibaba Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

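December 13

Drifting in Beijing

Drifting in Beijing sometimes feels really good.

I love the feeling of walking alone at dusk along the beautiful Dongzhimenwai Street... the tall birch trees, the broad street, and the quiet little embassy buildings on both sides... Especially after rain, the whole world feels so fresh here. At such moments I cannot help savoring the interesting questions I pondered in my middle school days, revisiting the wonderful visions of an artificial intelligence world that once floated in my mind, or looking back on the ups and downs of my student years... "You can think about anything, and you can think about nothing."

A walk in Tuanjiehu Park near my place has a different flavor. Amid pink peach blossoms and green willows lies a clear little lake, and from afar you can hear old people singing melodiously by the water. The contented, easygoing side of Beijingers is on full display here. As for me, I like to sit alone on a bench by the lake on weekends, lazily basking in the sun while quietly, quietly thinking about the fascinating mathematics and engineering problems I have run into at work :)

Every afternoon I also sneak over to the Capital University of Economics and Business across from the office for a stroll. It is a very small campus, but it counts as a quiet spot amid the bustle of high-rise Wanda. Unlike the reverence and solemnity toward science that wells up in me when walking at Tsinghua, here I simply choose to gaze at the little birds hopping up and down in the treetops, or to sit on a bench beneath the tall poplars and watch this school's students of all skin colors hurry by.

Life, perhaps, ought to be a leisurely stroll...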



December 1

OpenResty now uses the BSD license

We've migrated OpenResty to the BSD license since the 0.5.3 CPAN release, because my $boss laser++ wants to maximize code reuse and collaboration :)

Just as a side note: I've created an #openresty IRC channel on irc.freenode.net. See you there ;)

November 30

Project Roadmap for OpenResty

Today I wrote down OpenResty's milestone list into its documentation because many people had asked me for that.

  • 0.5.x - Action API and an enhanced version of the Model API.
  • 0.6.x - Migrate the View handler to the same style and implementation of the Action handler, i.e., using explicit parameter list and taking advantage of the Haskell version of the restyscript compiler. Compiling view definition to native PostgreSQL functions is also supposed to work in this series.
  • 0.7.x - Attachment API, which supports binary file uploading and downloading.
  • 0.8.x - Mail API, which introduces builtin Models for email sentbox and inbox based on third-party POP3/SMTP servers. It will also allow actions to be triggered and/or confirmed by emails.
  • 0.9.x - Prophet/Git integration.
Well, the actual release numbers may vary as we go. But I'd love to see all these happen sooner or later anyway ;)

Please don't hesitate to tell us what you think :)

November 29

Q4 is crazy!

Yeah, Q4 is really crazy! I've been hacking on several company projects in parallel over the last few weeks. Fortunately they're all very interesting.

We've just kicked OpenResty 0.5.2 out of the door and I'm preparing for the 0.5.3 release right now. My teammate xunxin++ has quickly implemented the YLogin handler for OpenResty, via which users can use their Yahoo! ID to log in to their own applications on OpenResty. Our Yahoo! registration team helpfully worked out a sane design to allow us to reuse the Yahoo! Login system, which effectively turned the Yahoo! ID into something like a passport, at least from the perspective of OpenResty users :) Big moment! Lots of company products using Yahoo! IDs could be rewritten in 100% JavaScript! Actually our team is already rewriting the Search DIY product using all the goodies offered by OpenResty.

Meanwhile, some guys from Sina.com are doing their personal projects in OpenResty. They said they really appreciated the great opportunities provided by the OpenResty architecture, since various kinds of clients (e.g. web sites, cellphones, desktop apps, etc.) could share the same set of APIs via OpenResty's web services. They also sent useful feedback and suggestions regarding OpenResty's design and implementation.

I've also been working on an intelligent crawler cluster based on Firefox, Apache mod_proxy/mod_cache, and OpenResty. The crawler itself is a plain Firefox extension named List Hunter:

    http://agentzh.org/misc/listhunter.xpi

It's an enhanced version of the Haiway List Recognization Engine used by my SearchAll extension and also built by my XUL::App framework. You can install it to your Firefox and play with it if you like ;) What this extension does is very simple: recognizing "list regions" and "text regions" in an arbitrary web page and further deciding automatically whether it's a "list page" or a "text page". The latter functionality may sound a bit weird: why is it useful to categorize web pages that way? Anyway, our PM (Product Manager) has crazy ideas about that categorization in our Live Search project and knows better than us ;)

Turning such a Firefox extension into tens or even hundreds of Firefox crawlers running on a bunch of production machines requires a lot of work. I devised a prefetching system which prefetches HTML pages and the CSS files included in them, and caches the headers and contents for a fixed amount of time in such a way that Firefox crawlers can later load pages and CSS directly from the same cache in our local network, thus significantly reducing the page loading time in Gecko. The cache is a heavily patched version of Apache2's mod_cache with mod_disk_cache as the backend storage. Prefetchers and crawlers interact with the Internet and the cache via HTTP proxies based on Apache2's mod_proxy. Pipelining the prefetching and crawling processes requires OpenResty with PgQ enabled. Well, I'm still working on this cluster and my goal is 2 pages/sec for every single Firefox process. Firefox 3.1's amazing performance boost (more than 30% faster according to my own benchmark) makes me very confident in abusing Gecko to build efficient crawlers that take advantage of the rich rendering information.

Another Firefox crawler project haunting my head is a similar one that automatically recognizes and extracts user comments from arbitrary web pages (if any comments appear, of course). Such tasks would be hard if my code had to run without the geometric information for every DOM node provided by the browser rendering engine (in the form of the offsetWidth, offsetHeight, offsetTop, and offsetLeft attributes of DOM elements). Some other colleagues in Alibaba's Search Tech Center are wrapping their heads around Cobra, a pure Java HTML renderer. But I doubt that it would run more correctly or more efficiently than Gecko. Oh well, I'm not a Java guy anyway...

Finally, just a short note: I had a wonderful time with clkao and Jesse Vincent at Beijing Perl Workshop 2008. I learned quite a lot about the Prophet internals during the hackathon after the conference, and Jesse quickly hacked out a stub OpenResty model API for Prophet. Then we went to the Great Wall the next day. I was amazed to find Jesse hacking crazily on the Great Wall and enjoying the sunshine alone...Wow.

Enough blogging...back to hacking ;)