agentzh: Human & Machine
November 18

The "headers more" module: scripting input and output filters in your Nginx config file

I've been working madly on the "headers more" module:

http://github.com/agentzh/headers-more-nginx-module

I now have everything I wanted working. It also has a nice wiki page, which includes a brief explanation of the underlying implementation:

http://wiki.nginx.org/NginxHttpHeadersMoreModule

Our selling point is that it can rewrite the "Server" output header dynamically! See this:

    location /foo {
        more_set_headers "Server: $arg_server";
    }

Then GET /foo?server=Foo will get a response with the "Server: Foo" header set ;)

Input headers can be trivially rewritten as well, including the "Host" header:

    more_set_input_headers "Host: some-other-host";

Well, the full practical power of this module is beyond my current imagination. If you have some crazy uses, please drop me a line ;)

Happy Nginx hacking!

November 15

The "chunkin" module: Experimental chunked input support for Nginx

Pushed by those cutting-edge users on the Nginx mailing list, I've quickly worked out the "chunkin" module, which adds HTTP 1.1 chunked input support for Nginx without the need to patch the core:

http://github.com/agentzh/chunkin-nginx-module

This module registers an access-phase handler that eagerly reads and decodes incoming request bodies when a "Transfer-Encoding: chunked" header triggers a 411 error page in Nginx (hey, that's the price you pay for not patching the core ;)). For requests that do not use the "chunked" transfer encoding, this module is a no-op.

To enable the magic, just turn on the "chunkin" config option like this:

    chunkin on;
    location /foo {
        ...
    }
    ....

No other modification is required in your nginx.conf file. (The "chunkin" directive is not allowed in the location block, BTW.)

This module is still considered highly experimental and there must be some serious bugs lurking somewhere.
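For readers unfamiliar with the wire format, here is a minimal Python sketch of what decoding a chunked request body involves. It is not the module's implementation (that is written in C with a Ragel-generated parser); the function name decode_chunked is made up for illustration, and the sketch assumes the whole body is already in memory and simply ignores chunk extensions and trailers.

```python
def decode_chunked(raw: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body held entirely in memory.

    Illustrative sketch only: chunk extensions are skipped, trailers
    after the last chunk are ignored, and size parsing is as lenient
    as Python's int().
    """
    body = bytearray()
    pos = 0
    while True:
        # Each chunk starts with a hex size line terminated by CRLF.
        # bytes.index raises ValueError if the CRLF is missing.
        line_end = raw.index(b"\r\n", pos)
        size_field = raw[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk; anything after it is treated as trailers
        chunk = raw[pos:pos + size]
        if len(chunk) != size:
            raise ValueError("truncated chunk data")
        if raw[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("chunk data not followed by CRLF")
        body += chunk
        pos += size + 2
    return bytes(body)


if __name__ == "__main__":
    wire = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
    print(decode_chunked(wire))
```

A real server-side decoder has to work incrementally on partial network reads, which is where the buffer handling gets delicate; the sketch sidesteps that by requiring the complete body up front.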
You're encouraged, though, to play with the module, test it in your non-production environment, and report any quirks to me :) Efforts have been made to reduce data copying and dynamic memory allocation, which unfortunately raises the risk of buffer handling bugs caused by premature optimization :P

This module is not supposed to be merged into the Nginx core because I've used Ragel to generate the chunked encoding parser for joy :)

The following Nginx versions have been successfully tested with this module's (very limited) test suite:

    0.8.0 ~ 0.8.24
    0.7.21 ~ 0.7.63

The test suite definitely needs more test cases and the code is hacky in various places. If you're willing to contribute, feel free to ask me for a commit bit in a private email :)

Update: I've also added a wiki page for it:

http://wiki.nginx.org/NginxHttpChunkinModule

October 15

Hacking on the Nginx echo module

Over the recent weeks, I've been reading a lot of the C source code of Nginx and its modules, and it's really enjoyable. I've got lots of good ideas for implementing the next generation of the OpenResty server based on the Nginx architecture. Well, it's currently my full-time $work anyway.

For the sake of testing other modules, experimenting with the Nginx internals, and for fun, I've started my first Nginx module, named "echo":

http://github.com/agentzh/echo-nginx-module

It's already quite usable, and it also has a declarative test suite based on Perl's Test::Base. At the moment, LWP is used for simplicity, and it's rather weak at testing the streaming behavior of Nginx (I'm using "curl" to test these aspects manually for now). I'm considering coding up my own Perl HTTP client library based on IO::Select and IO::Socket (there might already be one around?).

Along the way, I'm intentionally commenting my C source heavily in this "echo" module in the hope that newcomers will find it a "live tutorial" or something like that. I'll write more about the details here in subsequent posts.
After all, it can do a lot more than just "echo" stuff directly, such as sleeping and flushing the output buffer. And it will be capable of outputting subrequests' responses as well.

Happy hacking Nginx C modules and stay tuned!

Update: I've also added a wiki page for it:

http://wiki.nginx.org/NginxHttpEchoModule

September 20

Slides for my perl testing & VDOM.pm talks in Beijing Perl Workshop 2009

I really enjoyed the talks at BJPW2009. Here are the slides for my two talks at this conference:
September 14

A plan for nginx-openresty

Now that I've joined Taobao.com's SDS department and am able to work on OpenResty full time, I've just worked out a (somewhat) detailed plan for the next generation of the OpenResty server. Well, sorry, this draft is in Chinese since my $manager reads Chinese better:

http://www.pgsqldb.org/mwiki/index.php/Nginx_openresty_plan (still being actively updated)

After discussing possible designs for a high-performance implementation of the OpenResty server with my friend and colleague chaoslawful++, we finally decided to rewrite OpenResty.pm in pure C, in the form of an nginx module. Here are some highlights of the Chinese project plan given above:
Update: Special thanks go to kindy++ for his detailed review of the nginx-openresty-plan document and helpful suggestions :)

September 4

Slides for my VDOM + WebKit talk

I gave a presentation on VDOM + WebKit to the Taobao.com Search Frontend Team this morning. The slides are based on my talk at April's Beijing Perl Workshop, with notable updates to reflect changes in the last few months:

http://agentzh.org/misc/slides/taobao-fe/vdomwebkit.xul (Firefox required to open this link)

Be patient while the big images download, or download the whole tarball to your local machine, unpack it, open the vdomwebkit.xul file in it, and browse the slides locally:

http://agentzh.org/misc/slides/taobao-fe.tar.gz

Recent major developments regarding our browser-based web scraping clusters are:
September 3

Our queue-size-aware version of memcacheq

Xunxin++ and I have been working on a fork of memcacheq (originally just within the company), adding support for a queue length constraint. In our scenario, a pipelined webpage information extraction cluster based on Apple's WebKit core, it's important to limit the queue's length and to make the queue "inform" the queue item producers in some way when the queue is full. We're not sure if it's worth merging back into the mainstream version because this addition adds some cost (though the cost is low).

Here is the project page on GitHub, with more explanation of the details (I won't repeat them here ;)):

http://github.com/agentzh/memcacheq/tree/master

The newly added code is licensed under the same license as the mainstream memcacheq. Enjoy :)

September 2

I'll talk in the upcoming Beijing Perl Workshop 2009 event

I've submitted 3 talk proposals to this year's Beijing Perl Workshop conference, scheduled for Sep 19. I'll publish the slides for my talks here later for your preview :) The 3 talks are
http://conference.perlchina.org/bjpw2009/

Don't forget to specify a T-shirt size in your profile settings there so that we can prepare a T-shirt for you at this event (well, it's free!). See you there ;)

May 11

OpenResty.pm has been moved to GitHub

As some of you may have already noticed, I've moved the source repository of OpenResty.pm from the good old OpenFoundry to GitHub:

http://github.com/agentzh/openresty/tree/master

Feel free to branch it or ask me for a commit bit if you don't have one ;)

I'll destroy the stuff in the old "openapi" repository on OpenFoundry and leave a note there to avoid potential confusion.

Mailing list for OpenResty

After publishing several new releases of OpenResty.pm to CPAN, I created a mailing list for OpenResty users/developers on Google Groups:

http://groups.google.com/group/openresty?hl=en

This is for both OpenResty.pm and mod_openresty. You're very welcome to join us there ;)

There's also an #openresty channel on freenode but it's been very quiet :P

April 28

Text::SmartLinks: The Perl 6 love for Perl 5

I was so glad to find this blog post while browsing the Iron Man planet:

http://szabgab.com/blog/2009/04/1240827553.html

Three years ago, I wrote the smartlinks.pl script to integrate the Pugs test suite with the Perl 6 Synopses documentation. Gábor Szabó has now done an excellent job of refactoring and packaging the tool into a general-purpose CPAN module. It had been on my TODO list until I got caught up in accumulated schoolwork :P

Enjoy his (well, also our) Text::SmartLinks module!

http://search.cpan.org/perldoc?Text::SmartLinks

April 23

SSH::Batch: Treating clusters as maths sets and intervals

System administration is also part of my $work. Playing with a (big) bunch of machines without a handy tool is painful.
So I refactored some of our old scripts and released SSH::Batch, a collection of useful parallel ssh scripts, to CPAN:

http://search.cpan.org/dist/SSH-Batch/

SSH::Batch allows you to name your clusters using variables and interval/set syntax in your ~/.fornodesrc config file. For instance:

    $ cat ~/.fornodesrc
    A=foo[01-03].com bar.org
    B=bar.org baz[a-b,d,e-g].cn foo02.com
    C={A} * {B}
    D={A} - {B}

where cluster C is the intersection of clusters A and B, while D contains those machines that are in A but not in B.

You can then query the host list using SSH::Batch's fornodes script:

    $ fornodes '{C}'
    bar.org foo02.com
    $ fornodes '{D}'
    foo01.com foo03.com

Furthermore, to run a command on a cluster at a concurrency level of 6:

    atnodes 'ls -lh' '{A} + {B}' my.more.com -c 6

Or upload a local file to the remote cluster:

    tonodes ~/my.tar.gz '{A} / {B}' :/tmp/

There's also a key2nodes script to push SSH public keys to remote machines ;)

A colleague at Alibaba B2B is already using it. One of my teammates is going to use it to operate on the thousands of machines in our instance of the YST (Yahoo! Search Technology) cluster, and I'm ready to receive more feedback from him ;)

Have fun :)

April 10

My VDOM.pm & WebKit Cluster Talk at the April Meeting of Beijing Perl Workshop

Last night I gave a talk to our PerlChina folks at the April meeting in the Flow Bar. Here are the slides that I used:
Just as the topic of the talk suggests, we're migrating from Firefox clusters to WebKit ones. I'll post more details here in the near future. Enjoy!

February 26

mod_libmemcached_cache is now opensourced :)

I've open-sourced my mod_libmemcached_cache project on GitHub with permission from my company:

http://github.com/agentzh/mod-libmemcached-cache/tree/master

It's a memcached storage provider for Apache2's mod_cache. In contrast to the mod_memcached_cache module on Google Code, we use the popular libmemcached library rather than apr-util's. Feel free to branch it; I'm very willing to merge back any useful changes, and I'd love to hand out commit bits as well :)

Mind you, it's licensed under GPLv2. That's my company's decision, not mine ;)

February 13

The slides for my talk on Firefox cluster & vision-based web page extraction

I gave a talk at the Beijing Perl Mongers' Feb Meeting last night. It was about my Firefox cluster and vision-based web page extraction technology. I had not expected to see so many people there. Wow. The talk was well received and people asked lots of interesting questions :)

The slides can be freely downloaded from my site (open the ffcluster.xul file in the tarball with Firefox):

http://agentzh.org/misc/slides/BJPW200902.tar.gz

or browsed directly online with Firefox:

http://agentzh.org/misc/slides/BJPW200902/ffcluster.xul

Because the slides contain many big pictures, I recommend downloading them to your local machine first and viewing them offline :)

I'll also give this presentation again to those Ruby/Python/Java/C++ guys at Beijing OpenParty's Fox meeting:

http://www.beijing-open-party.org/index.php/2009/02/beijing-open-party-2009-02-fox-event-begin.html

Just as a side note: recently I've been intrigued by Apache C hacking. mod_libmemcached_cache is my first Apache module. And I'd love to see more in the near future, such as mod_openresty ;)

Have fun!
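December 25

Life Search's List Hunter cluster based on Firefox 3.1

NAME

List Hunter Cluster - our own deep-crawling crawler cluster based on Mozilla Firefox 3.1

DESCRIPTION

This document introduces our List Hunter cluster, which is based on Firefox 3.1. It is currently part of our company's life search engine.

Background

In our life search project, we need to recognize and extract content from web pages at a deep level. For classification based on text content, we currently use Yahoo! US's maximum-entropy-based DCP system. For classification based on page structure (is this page a list page or a detail page?), and for extracting the main link list and the main content region, we had long lacked a good solution. My colleagues tried purely structural methods (such as the Haiwei algorithm) and reached an accuracy of only 60%, while machine learning methods such as SVM are sensitive to the type of page: when the target pages differ substantially from the training set, accuracy drops quickly.

So I tried combining the visual information available when a page is rendered with the Haiwei algorithm and the block-merging algorithm. Precision and recall then reached 90% and 80% respectively. The visual information here mainly consists of the size and shape of a page region and its position within the whole page; further information includes fonts, colors, and so on. That is how the List Hunter extension was born. How to turn a Firefox extension into a large-scale cluster for production use then became an important problem.

In the following blog post I gave more background details and described the List Hunter extension itself:

http://blog.agentzh.org/#post-97

The extension depends only on Firefox and works as soon as it is installed:

http://agentzh.org/misc/listhunter.xpi

Cluster architecture

The cluster consists of four major parts: the pure Firefox cluster, the Apache + mod_proxy + mod_disk_cache cluster, the curl prefetcher cluster, and the OpenResty cluster. A dozen or so production machines take part in this cluster "full-time" or "part-time". Each part is introduced in turn below: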
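Cluster performance

The cluster's output is currently stable at over 100 thousand pages per hour, that is, over 2.4 million pages per day. The load on the Firefox machines is around 3, and the load on the proxies is below 0.1.

JS benchmarks show that the average latency for Firefox 3.1 to load a page is 200 ~ 300 ms, the network latency between data centers is 10 ~ 20 ms (pages are already cached by mod_cache, so there is no overhead for reaching the external network), and the DOM analysis code of the List Hunter extension takes 200 ~ 300 ms. Once the other OpenResty overhead is counted in, one Firefox process handles about one page per second.

The memory usage of a Firefox process on Linux is as follows:

    VIRT 276m, RES 86m, SHR 34m

Known bottlenecks and defects

When the URL task table in OpenResty grows beyond 1 ~ 2 million rows, the scheduling queries tend to exceed PL/Proxy's 10-second limit. We therefore currently import and export tasks in a "streaming" fashion: a cronjob periodically imports tasks into the database and, at the same time, promptly moves completed tasks out.

Apache's mod_proxy is not stable enough under high concurrency, and because of Apache's own architecture it cannot do proxy pipelining. We therefore plan to switch to Squid when the cluster grows further. Of course, Squid will most likely need modifications as well to meet our requirement of forcing pages to stay cached for a specified period of time.

Also, because the Apache mod_cache backend is not distributed, proxy server scheduling is done inside the Firefox processes and the curl prefetch processes. This makes the front-end code fairly complex and brings with it the problem of periodically synchronizing the proxy server list. In the future we could consider adding memcached cache backend support to Apache mod_cache or Squid, so that the multiple front-end proxy servers become "transparent" to the other components in the cluster.

TODO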
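Similarities to and differences from similar products

Yahoo! US developed a crawler cluster called HLFS by heavily modifying the C++ source code of Firefox 2. It is used to crawl the content of AJAX sites and to obtain DOM trees annotated with visual information. They turned the Firefox process into an HTTP proxy that serves external applications.

The Firefox processes in our List Hunter cluster, by contrast, are highly autonomous crawlers: they keep fetching tasks in batches from OpenResty and completing them. External applications import tasks in batches into OpenResty to keep the cluster running. Because the List Hunter cluster barely modifies Firefox's source code, we can easily stay in sync with the latest official release and benefit right away from the many upstream optimizations.

The List Hunter cluster itself is also general-purpose: it can serve as a "cluster container" for all kinds of Firefox extensions. In other words, it is a complete framework for "clusterizing" Firefox extensions.

Since Firefox extension development itself has been greatly simplified by the XUL::App framework that I released to CPAN, the cost of responding to new requirements is very low.

Pros and cons of using Firefox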
AUTHOR

Yichun Zhang (章亦春, agentzh)

LICENSE

Copyright (c) 2007-2008, Yahoo! China EEEE Works, Alibaba Inc. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
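THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

December 13

Drifting in Beijing

The feeling of drifting in Beijing is sometimes really good.

I love strolling alone in the evening along the lovely Dongzhimenwai Street... the tall birch trees, the broad road, and the quiet embassy buildings lining both sides... Especially after rain, the whole world feels so fresh here. At moments like these I can't help savoring again the interesting questions I pondered in my middle school days, revisiting the wonderful visions of an artificial intelligence world that once floated in my mind, or looking back on the sweet and bitter moments of my student years... "You can think about anything, and you can think about nothing."

Walking in Tuanjiehu Park near where I live has a different flavor. Among the pink peach blossoms and green willows lies a clear little lake, and from far away you can hear the old folks singing melodiously by the water. The contented, easygoing side of Beijingers is on full display here. As for me, I like to sit alone on a bench by the lake at weekends, lazily basking in the sun while quietly, quietly thinking about some of the fascinating mathematical and engineering problems I've run into at work :)

Every afternoon I also sneak over to the Capital University of Economics and Business across from the company for a walk. It's a very small campus, yet it counts as a pocket of quiet amid the high-rises of bustling Wanda. Unlike the reverence and solemnity toward science that I feel when walking around Tsinghua, here I just choose to gaze at the little birds hopping up and down in the treetops, or to sit on a bench under the tall poplars and watch the school's students, of every skin color, hurrying by.

Life, perhaps, should be a leisurely stroll...

December 1

OpenResty now uses the BSD license

We've migrated OpenResty to the BSD license as of the 0.5.3 CPAN release, because my $boss laser++ wants to maximize code reuse and collaboration :)

Just as a side note: I've created an #openresty IRC channel on irc.freenode.net. See you there ;)

November 30

Project Roadmap for OpenResty

Today I wrote OpenResty's milestone list into its documentation because many people had asked me for it.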
Please don't hesitate to tell us what you think :)

November 29

Q4 is crazy!

Yeah, Q4 is really crazy! I've been hacking on several company projects in parallel over the last few weeks. Fortunately they're all very interesting stuff.