![]() |
|
Spaces home Human & MachinePhotosProfileFriendsMore ![]() | ![]() |
|
Human & MachineAugust 26 在 JavaScript 中使用 CPS 完成网站遍历海维发到邮件列表的第一封邮件是非常值得仔细一读的,在那里我们可以看到 Lisp 程序员是如何思考的,特别地,如何通过显式调度栈的方式来串行化异步执行流。 昨晚我经过认真的思考之后,惊喜地发现,我们可以把海维的 continuation 用法进一步抽象到一种著名而且又很时髦的编程模式:Continuation Passing Style: http://en.wikipedia.org/wiki/Continuation-passing_style 在海维的抽取案例中,我们遇到的主要问题是,timeout handler 的串行链条很长,而且更有趣的是,这条链接本身是动态构造出来的。如果保证这条链条不因中间某个环节出错而中断,如果很优雅地把这条执行链构造出来,乃是这里问题的核心。我们首先需要遍历省份列表,然后是市级行政区列表,县级行政区列表,最后是中学列表。如果用 EBNF 记法来表示的话,是下面这个样子: top: province* province: city* city: district* district: school* 这里 * 表示 0 次或多次重复。当我们出发时,位于 top 符号的位置上,然后我们经过必要的初始化操作,抓出所有的省份列表。对于每一个省份,我们进入到 province 符号,点击那个省的链接,过一段延时后再抓出 city 列表,再依次地处理 city,依此类推。这里使用正统的面向过程的编程方法的一大困难是,执行流本身不是顺序的,在每一步点击链接后,需要用 setTimeout 断开。而在 CPS 中,我们处理第 i 个 province 时,我们不妨告诉处于该 province 的例程,在它完成自己的工作后应该继续去处理余下的 i + 1, i + 2, ... 个省。例如: function process_province_list (c, province_list) { // 终止条件,遇到空省列表直接调用当前的 continuation 对象 c 返回之 if (province_list.length == 0) return c(); var province = province_list.pop(); // 弹出当前的省份 var cc = function () { // 生成子 continuation 对象 process_province_list(c, province_list); // "递归"地调用自身处理余下的省份列表 }; click_and_record(province); // 点击省份链接,并将之记录下来 setTimeout( function () { var city_list = get_cities(); // 一段延时之后抽取city列表 process_city_list(cc, city_list); // 调用下一级 city 列表的处理函数 }, delay); } 由于 city 在前面的 EBNF 定义与 province 完全相同,所以 process_city_list 和上面的 process_province_list 的形式几乎完全一样。同样的道理,process_district_list 也是"同构异形体"而已。于是想到把它们合并为一个通用的 process_list 函数: function process_list (c, list, level) { // 终止条件,遇到空列表直接调用当前的 continuation 对象 c 返回之 if (province_list.length == 0) return c(); var obj = list.pop(); // 弹出当前的项目 var cc = function () { // 生成子 continuation 对象 process_list(c, list); // "递归"地调用自身处理余下的列表 }; click_and_record(obj, level); // 点击省份链接,并将之记录下来 setTimeout( function () { var city_list = get_list(obj, level - 1); // 一段延时之后抽取下一级列表 process_list(cc, sublist, level - 1); // 调用下一级列表的处理函数 }, delay); } 这样我们只需要定义不同层次 (level) 上的 click_and_record 和 get_list 函数就可以了 :) 这里比较有趣的是 process_list 函数在"根",也就是 top 上的调用方法: var province_list = get_province_list(); var c = function { return true; } process_list(c, province_list, 0); 这里我们创建了最根部的 continuation 对象 c,我们看到,它就是一个直接返回的空函数而已,呵呵。这 3 行代码开始了下面的 continuation 和 setTimeout 交织延伸到 c 的一条很长很长的链条,或许我们可以把这条链形象地看成从两边往中间生长的 ;) 我已经基于 XUL::App 框架将这做成了一个 Firefox 插件: http://svn.openfoundry.org/xulapp/trunk/demo/ExtractX/extractx.xpi 安装后,在地址栏中输入 chrome://extractx/content/extractx.xul 然后登录一下(第一次使用时),再点一下 Extract! 按钮就出东西了。同时还实现了按指定的省份抓取,以及取消抓取,自定义抓取延时的 delay 等功能 :) 该插件完整的源代码位于 http://svn.openfoundry.org/xulapp/trunk/demo/ExtractX/ 整个东东不过是几十行 Perl 外加一百多行 JavaScript 而已,呵呵。 前面介绍的process_list 函数在插件里叫做 processList,比上面的示范代码要稍稍复杂一些,主要考虑了剪枝优化等操作,但主干是完全一样的: http://svn.openfoundry.org/xulapp/trunk/demo/ExtractX/js/extract.js 值得一提的是,CPS 是非常强大的编程技术,它一般可以带给程序员两样好处: 1. 使得程序能够自动回溯 2. 使用程序各个部分 composable,即像积木一样可以从小块搭成大块乃至更大) 基于 CPS 可以实现一个比较完整的 Perl 5 正则引擎。我就是通过阅读 putter 大拿用 Perl 书写的 Perl 5 正则引擎,才理解了 CPS 的用法。有兴趣的同学可以看一看 putter 的实现: http://svn.pugscode.org/pugs/misc/pX/Common/regexp_and_parser_spike/regexp_engine_demo.pl 在前面的讨论中,我们已经注意到了网站抓取路径可以用 EBNF 语言,或者说正则语言来描述。 更一般地,我们可以定义 EBNF 中的符号 a 表示一项抓取任务,它由两部分组成:一是对应的 DOM 节点的 selector,即抽取模式,二是执行的动作,比如触发一个到 DOM 节点的点击事件,或者执行某些网页状态的测试操作,动作的返回值决定了当前抽取任务 a 是否成功。然后我们再定义连接操作: a b 表示抽取任务 a 与 b 必须同时完成,否则 a b 这个整体任务则是失败。类似地,a | b 表示先执行任务 a,若成功则整个任务成功,若 a 失败,则执行任务 b,若 b 成功,则整个任务成功,否则整个任务失败。最后再定义一下闭包:a * 表示任务 a 在抽取时得到的是一个 DOM 列表,而对于该列表中的每一个节点,都作为一个独立的抽取任务执行之。这些任务必须全部成功才算成功。a* 可以拥有一些变种,比如非贪婪的 a*?,还有显式指定重复次数的 a{n}, a{m, n} 等等。 在后面的工作中,我准备实现这样一个从 EBNF 描述自动生成基于 CPS 的 JavaScript 遍历网站代码的编译器。这样自动化一个网站的抓取和测试工作,也就差不多等同于书写一个 EBNF 描述甚至于一个小正则表达式了 ;) Stay tuned! August 14 GHC 6.8.3 binary for older linuxI ran into the following error while trying to run a binary generated by GHC 6.8.x on our production machines with a not-so-recent linux installed (kernel 2.6.9).
$ ./restyscript restyscript: timer_create: Invalid argument Evan building a pure static-linking executable did not solve the problem. Then I downloaded the binary GHC 6.8.3 from haskell.org on that machine, but the ghc crashes as well: $ ghc Floating point exception My teammate chaos++ found that the last time precision argument passed to the timer_create function was just too big. Manually editing the binary executable (the restyscript file in the previous example) solved this issue completely, but, yeah, it's terribly hacky. So I decided to build a GHC 6.8.3 from source on that old system using GHC 6.4.2. Fortunately, the binary GHC 6.4.2 from haskell.org does work there. The newly-built GHC solves all the problem. No floating-point exception nor invalid argument for timer_create. I've put my binary distribution on my site here: http://agentzh.org/misc/ghc-6.8.3-i386-old-linux.tar.bz2 Hopefully it'll be useful for someone else ;) Not sure if it's worth putting to the official download page as well :) Ideally GHC should inspect the kernel version and timer_create support at *runtime*, rather compile-time. It'll make our lives much easier ;) August 12 找出数据表中的“id空洞”数据库中的 id 有时会产生大量空洞,有时是因为分批导入数据时漏了边界上的一些记录,有时则是因为数据有频繁的删除操作,浪费了大量在 id 对应的 sequence 空间。这时就需要找出 id 序列中的空洞,即记录 id 不连续的地方。
一般地,找出"空洞"区间的左边界列表可以使用类似下面的 SQL 语句: select id from store where (id + 1) not in (select id from store); 基本原理是,枚举所有使用过的 id,然后测试 id + 1 是否被使用过。由此找到的 id 便是左边界。值得一提的是,这样的下边界本身对应的 id 是不包含在"空洞"中的。 类似地,洞的右边界可以使用下面的 SQL: select id from store where (id - 1) not in (select id from store); 这当然算不上特别高效的查询,不过足够简单,在上千万行的表中也足够快 :) 几周前我借助于上面第一个查询,成功地找出了为 PM 批量导数据时不慎引入的 6 个"空洞",并逐块填上了那漏导的 1000 多行记录 :) August 05 Filter::QuasiQuote 0.01 is now on CPANAfter reading Audrey's blog post mentioning GHC's upcoming quasiquoting feature (as well as that quasiquoting paper), I quickly hacked up a (simple) quasiquoting mechanism for Perl, hence the Filter::QuasiQuote module already on CPAN:
http://search.cpan.org/perldoc?Filter::QuasiQuote I'm looking forward to using sensible filters in my production code (e.g. OpenResty) and eliminating ugly Perl code for with embedded DSL. For example, instead of writing my $sql = "alter table " . quote_identifer($table) . " drop column " . quote($column) . ";"; I can simply write use OpenResty::QuasiQuote::SQL; my $sql = [:sql| alter table $table drop column $column; |]; Also, a JSON-like DSL can be used to describe valid Perl data structures and to generate the Perl code doing validation. Filter::QuasiQuote supports subclassing, so the OpenResty::QuasiQuote::SQL module mentioned above could be derived from it. Also, multiple concrete filter classes could be composed in a single Perl source file. Just put a series of use statements together: use MyQuote1; use MyQuote2; and it should work. Because it's required that filters always return Perl source aligned in a single line, line numbers won't get corrupted. Of course, lots of nice consequences of the Haskell quasiquotations will be lost in my implementation, such as type safety. But the Perl version is much more flexible and powerful (by some definition) ;) It's still in alpha and could be buggy. Feel free to report bugs or send wishlist to the CPAN RT site or directly to me ;) Enjoy! July 18 OpenResty running directly in PostgreSQL via PL/PerlA few hours' hack gives rise to the following funny stuff:
test=# select openresty_init(); openresty_init ---------------- t (1 row) test=# select openresty_request( test=# 'GET', '/=/view/PrevNextPost/current/79?user=agentzh.Public', test=# '',''); openresty_request ------------------------------------------------------------------------------------------------------------ [{"title":"The slides for my OpenResty D2 talk","id":"78"},{"title":"Client-side web site DIY","id":"80"}] (1 row) test=# select openresty_request('GET','/=/view/PrevNextPost/current/79?user=agentzh','',''); openresty_request --------------------------------------------------------- {"success":0,"error":"agentzh.Admin is not anonymous."} (1 row) The PL/Perl backend is not fully working yet but it's already functional as we see above ;) The definition for the (thin) SQL wrapper functions openresty_init and openresty_request can be found in the following file: http://svn.openfoundry.org/openapi/trunk/misc/plperl.sql And the PL/Perl backend for OpenResty is http://svn.openfoundry.org/openapi/trunk/lib/OpenResty/Backend/PLPerl.pm The code is much cleaner than I ever expected ;) July 17 Building (Ubuntu) debian packages for OpenResty and its dependenciesI've been building debian packages for OpenResty and its dependencies that are not already in the official Ubuntu debian repository. Here's what I got yesterday: http://agentzh.org/misc/debian/ For now only the i386 versions are provided :P I'll set up a personal debian repository for these deb files later. During the process I found the following articles very helpful: "Building Debian packages of Perl modules" http://www.debian-administration.org/articles/78 "Debian New Maintainer's Guide" http://www.debian.org/doc/maint-guide/ But still, I spent a lot of time to figure out the following details: For CPAN modules based on Module::Install, http://svn.openfoundry.org/openapi/trunk/debian/control If the CPAN module wants to install configure files to places like /etc/..., then it's required to manually create the file "conffiles" under http://svn.openfoundry.org/openapi/trunk/debian/conffiles And then modify the http://svn.openfoundry.org/openapi/trunk/debian/rules Another hack I have to do against rules' "install" rule is that I have to remove the perllocal.pod file manually from $(TMP), since inclusion of this file in the .deb file will cause conflicts like this: dpkg: error processing libjson-xs-perl_2.22-1_i386.deb (--install): For modules which require doing some post-installation processing, we should create the file June 27 The first yahoo.cn feature that is powered by OpenRestySee the company blog post here for details: http://ysearchblog.cn/2008/06/post_120.html especially my comment at the bottom of the web page. Have fun! June 24 基于 Perl 的 Firefox 插件开发框架 XUL::App 发布我很高兴地宣布 XUL::App 已经发布到了 CPAN: http://search.cpan.org/perldoc?XUL::App XUL::App 是 Yahoo! 4E team 开发的基于 Perl 的 Firefox 插件的开发框架,最初是为了简化搜索引擎平行比较工具 SearchAll 插件的开发工作而设计的。随后,我们又利用 XUL::App 框架开发了一款能从 Google Reader 中导出 RSS 数据项的小插件 ExportReader。有趣的是,ExportReader 总共只手写了约 10 行 Perl 代码和 20 行 JavaScript 代码。最近,我又花了几个小时的时间,把原来我们 FE 开发的"雅虎收藏+"插件移植到了 XUL::App 框架中,从而使得调试和打包工作都只是简单的一条命令。(引用晓栓同学的话说,就是"今天下午看Agent 调程序,那是相当的酷:)") 我在 4E team 内部作的 XUL::App 演讲所使用的幻灯片介绍了 XUL::App 的几个重要特性: http://agentzh.org/misc/slides/xulapp.pdf 熟悉 Perl 酷炫的 Jifty 框架的朋友会发现 XUL::App 就是 XUL 星球上的 Jifty :) XUL::App 是一个我们领导的开源项目,目前仍处于比较早期的阶段。我们目前需要一个完整的测试集,以及更多的文档(特别是更多的教程)。如果你有兴趣参与开发和自动化测试工作的话,请发送邮件给我们(agentzh@yahoo.cn),我们很乐意递送 Subversion 提交权限 :) -agentzh June 20 UML::Class::Simple 0.10 releasedI've just uploaded UML::Class::Simple 0.10 to CPAN with the highlight of the XMI format support. It will appear on the CPAN mirror near you in the next few hours. Thanks Maxim Zenin for contributing this feature :) A Japanese user was requesting this in his blog as well. If you're a XMI fanboy, feel free to try it out. June 12 Optimizing Haskell code: from String to ByteString Haskell's built-in strings are notoriously slow. The String type in Haskell is [Char] per se. I was told that there was a much faster alternative provided by the bytestring (or fps) library by the Pugs blog a few years ago. (Thanks Audrey!) However, it took me a while to figure out how to use it in my code. Eventually I found that All I needed were in the Data.ByteString.Char8 module rather than Data.ByteString. (Thanks Hoogle!) According to the document, it's recommended to import the module this way import qualified Data.ByteString.Char8 as B to prevent name clashing with Prelude. Converting String to B.ByteString is straightforward: B.pack "Hello, world" where "Hello, world" is of type String. Or in the other direction: B.unpack s where s is of type B.ByteString. Concatenating several bytestrings together can be done by the B.concat function: B.concat [B.pack "hello", B.pack ", ", B.pack "world"] or just use B.append for joining two bytestrings for handy: (B.pack "hello, ") `B.append` (B.pack "world") Personally I like to define a ~~ operator for a bytestring version of ++ this way: (~~) :: B.ByteString -> B.ByteString -> B.ByteString (~~) = B.append and then I can simply write: B.pack "hello, " ~~ B.pack "world" Bytestring versions for most of the functions in Prelude are also provided. For instance, printing out a bytestring to stdout can be done directly by B.putStrLn bs -- bs is of type B.ByteString rather than the cumbosome and also slow putStrLn $ B.unpack bs As bytestring's documentation points out, converting back and forth between bytestrings and Haskell's built-in strings could become the bottleneck of the program, especially when the source comes with lots of string literals like "Hello, world" shown above. Wouldn't it be nice if string literals get automatically interpreted by the GHC compiler to bytestrings without going through a B.pack? Fortunately, with bytestring 0.9.0.4 (or better) and GHC 6.8.1 (or better), it is possible to do that via the GHC option -XOverloadedStrings. So now we can write literals without mudding around with B.pack: B.concat ["hello", ", ", "world"] or "hello, " ~~ "world" Perfect! :D Note that, as of this writing, the bytestring library in Ubuntu 8.04's debian repository is not new enough to support this. So ubuntu users have to install the latest version from HackageDB like this: $ wget http://hackage.haskell.org/packages/archive/bytestring/0.9.1.0/bytestring-0.9.1.0.tar.gz $ tar -xzf bytestring-0.9.1.0.tar.gz $ cd bytestring-0.9.1.0/ $ runghc Setup.lhs configure -p $ runghc Setup.lhs build $ sudo runghc Setup.lhs install By switching to B.ByteString in my code emitters for the minisql compiler mentioned in the previous blog post, the execution time dramatically reduced from 7.0 sec to 2.3 sec in my stress tests generated by the Perl module Parse::RandGen::Regexp. This is really an amazing improvement :) Furthermore, my UTF-8 regression tests kept passing as well. In the next journal I'll present another optimization trick that further reduced the running time from 2.3 sec to 1.0 sec. (Well, it has nothing to do with -O2 BTW, and I turned on -O2 from the very beginning already ;))
|
||||||||||||||||||||
|
|