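December 25

Life Search: the List Hunter cluster based on Firefox 3.1

NAME

List Hunter Cluster - our own deep-crawling crawler cluster based on Mozilla Firefox 3.1

DESCRIPTION

This document describes our List Hunter cluster based on Firefox 3.1. It is currently part of our company's Life Search engine.

Background

Our Life Search project needs deep recognition and extraction of web pages. For classification based on text content we currently use the maximum-entropy-based DCP system from Yahoo! US. For classification of page structure (is this page a list page or a detail page?), and for extracting the main link list and the main content region, we have long lacked a good solution. My colleagues tried purely structural methods (such as the 海维 (Haiwei) algorithm), but the accuracy was only 60%. Machine-learning methods such as SVM are sensitive to the type of page: when the target pages differ a lot from the training set, accuracy drops quickly.

So I tried combining the visual information available when a page is rendered with the Haiwei algorithm and the block-merging algorithm. Precision and recall then reached 90% and 80% respectively. The visual information here mainly consists of the size and shape of a page region and its position within the whole page. Further information includes fonts, colors, and so on. This is how the List Hunter plugin was born. How to turn a Firefox plugin into a large-scale cluster for production use then became an important question.

In the following blog post I give more details on the background and on the List Hunter plugin itself:

    http://blog.agentzh.org/#post-97

The plugin depends only on Firefox and works as soon as it is installed:

    http://agentzh.org/misc/listhunter.xpi

Architecture of the cluster

The cluster consists of four major parts: a pure Firefox cluster, an Apache + mod_proxy + mod_disk_cache cluster, a curl prefetcher cluster, and an OpenResty cluster. A dozen or so production machines take part in the cluster, either "full-time" or "part-time". Each part is introduced below: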
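Performance of the cluster

The cluster's output is currently stable at more than 100,000 pages per hour, or more than 2.4 million pages per day. The load on the Firefox machines is around 3, and the load on the proxies is below 0.1.

JS benchmarks show that Firefox 3.1 takes 200 ~ 300 ms on average to load a page. Network latency between data centers is 10 ~ 20 ms (the pages are already cached by mod_cache, so there is no network cost to the outside Internet), and the DOM analysis code of the List Hunter plugin takes 200 ~ 300 ms. Counting the remaining OpenResty overhead as well, one Firefox process handles about one page per second.

The memory footprint of one Firefox process on Linux is as follows:

    VIRT 276m, RES 86m, SHR 34m

Known bottlenecks and limitations

When the URL task table in OpenResty grows beyond 1 ~ 2 million rows, the scheduling query easily exceeds PL/Proxy's 10-second limit. We therefore currently import and export tasks in a "streaming" fashion: a cronjob periodically imports tasks into the database and, at the same time, promptly moves completed tasks out.

Apache's mod_proxy is not stable enough under high concurrency, and because of Apache's own architecture it cannot do proxy pipelining. We therefore plan to switch to Squid when the cluster grows further. Of course, Squid will most likely also need to be modified to meet our requirement of forcing pages to stay cached for a specified period of time.

Also, because the Apache mod_cache backend is not distributed, the scheduling of proxy servers is done inside the Firefox processes and the curl prefetcher processes. This makes the front-end code rather complicated and also brings the problem of periodically synchronizing the list of proxy servers. In the future we could therefore consider adding support for a memcached cache backend to Apache mod_cache or Squid. The multiple servers at the proxy front end would then be "transparent" to the other components of the cluster.

TODO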
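Similarities to and differences from similar products

By heavily modifying the C++ source code of Firefox 2, Yahoo! US developed a crawler cluster called HLFS, which is used to crawl the content of AJAX sites and to obtain DOM trees annotated with visual information. They turned the Firefox process into an HTTP proxy that serves external applications.

The Firefox processes in our List Hunter cluster, by contrast, are highly autonomous crawlers: they keep fetching tasks in batches from OpenResty on their own and completing them. External applications keep the cluster running by importing tasks into OpenResty in batches. Because the List Hunter cluster barely modifies the Firefox source code, we can easily stay in sync with the latest official releases and so benefit from the many upstream optimizations as soon as they appear.

The List Hunter cluster itself is also general-purpose: it can serve as a "cluster container" for all kinds of Firefox plugins. In other words, it is a complete framework for "clusterizing" Firefox plugins.

Since Firefox plugin development itself has already been greatly simplified by the XUL::App framework that I released to CPAN, the cost of responding to new requirements is very low.

Pros and cons of using Firefox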
AUTHOR

章亦春 (agentzh)

LICENSE

Copyright (c) 2007-2008, Yahoo! China EEEE Works, Alibaba Inc.

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

November 29

Q4 is crazy!

Yeah, Q4 is really crazy! I've been hacking on several company projects in parallel over the last few weeks. Fortunately they're all very interesting stuff.

June 12

Optimizing Haskell code: from String to ByteString

Haskell's built-in strings are notoriously slow. The String type in Haskell is simply a synonym for [Char]. A few years ago I learned from the Pugs blog that the bytestring (or fps) library provides a much faster alternative. (Thanks Audrey!)

However, it took me a while to figure out how to use it in my code. Eventually I found that everything I needed was in the Data.ByteString.Char8 module rather than Data.ByteString. (Thanks Hoogle!)

According to the documentation, it's recommended to import the module this way

    import qualified Data.ByteString.Char8 as B

to prevent name clashes with Prelude.

Converting a String to a B.ByteString is straightforward:

    B.pack "Hello, world"

where "Hello, world" is of type String. Or in the other direction:

    B.unpack s

where s is of type B.ByteString.
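As a small illustration of the two conversions just described, here is a minimal, self-contained sketch. The helper name roundTrip is made up for this example and is not part of the bytestring library. It also assumes plain ASCII input, because the Char8 module stores each Char as a single byte.

```haskell
module Main where

import qualified Data.ByteString.Char8 as B

-- Convert a String to a strict ByteString and straight back again.
-- (roundTrip is a hypothetical helper for this sketch, not a library function.)
roundTrip :: String -> String
roundTrip = B.unpack . B.pack

main :: IO ()
main = do
  let s  = "Hello, world"   -- an ordinary String, i.e. [Char]
      bs = B.pack s         -- the same text as a B.ByteString
  print (B.length bs)       -- number of bytes held by the bytestring
  print (roundTrip s == s)  -- checks that the ASCII text survives the round trip
```

This can be run with runghc. Characters above code point 255 would be truncated by B.pack, so a sketch like this is only meant for 8-bit text.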
Concatenating several bytestrings together can be done with the B.concat function:

    B.concat [B.pack "hello", B.pack ", ", B.pack "world"]

or just use B.append as a handy way to join two bytestrings:

    (B.pack "hello, ") `B.append` (B.pack "world")

Personally I like to define a ~~ operator as a bytestring version of ++ this way:

    (~~) :: B.ByteString -> B.ByteString -> B.ByteString
    (~~) = B.append

and then I can simply write:

    B.pack "hello, " ~~ B.pack "world"

Bytestring versions of most of the functions in Prelude are also provided. For instance, printing a bytestring to stdout can be done directly with

    B.putStrLn bs  -- bs is of type B.ByteString

rather than the cumbersome and also slow

    putStrLn $ B.unpack bs

As bytestring's documentation points out, converting back and forth between bytestrings and Haskell's built-in strings could become the bottleneck of the program, especially when the source comes with lots of string literals like the "Hello, world" shown above. Wouldn't it be nice if string literals were automatically interpreted by the GHC compiler as bytestrings without going through a B.pack?

Fortunately, with bytestring 0.9.0.4 (or later) and GHC 6.8.1 (or later), it is possible to do that via the GHC option -XOverloadedStrings. So now we can write literals without messing around with B.pack:

    B.concat ["hello", ", ", "world"]

or

    "hello, " ~~ "world"

Perfect! :D

Note that, as of this writing, the bytestring library in Ubuntu 8.04's package repository is not new enough to support this.
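For reference, the snippets in this post can be combined into one small program. This is only a sketch: it enables the extension with a LANGUAGE pragma instead of the -XOverloadedStrings command-line flag, and selectStar is a hypothetical emitter-style helper invented for illustration, not code from the minisql compiler.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Char8 as B

-- The post's bytestring analogue of (++).
(~~) :: B.ByteString -> B.ByteString -> B.ByteString
(~~) = B.append

-- Hypothetical helper in the style of a code emitter: both string
-- literals below are typed as B.ByteString, so no B.pack is needed.
selectStar :: B.ByteString -> B.ByteString
selectStar table = "select * from " ~~ table ~~ ";"

main :: IO ()
main = do
  -- The list elements are ByteString literals thanks to OverloadedStrings.
  B.putStrLn (B.concat ["hello", ", ", "world"])
  -- The argument literal is a ByteString as well.
  B.putStrLn (selectStar "posts")
```

The type signatures on (~~) and selectStar are what let GHC resolve each overloaded literal to B.ByteString. Without such a signature nearby, a bare literal may need an explicit type annotation.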
Ubuntu 8.04 users have to install a newer version from HackageDB like this:

    $ wget http://hackage.haskell.org/packages/archive/bytestring/0.9.1.0/bytestring-0.9.1.0.tar.gz
    $ tar -xzf bytestring-0.9.1.0.tar.gz
    $ cd bytestring-0.9.1.0/
    $ runghc Setup.lhs configure -p
    $ runghc Setup.lhs build
    $ sudo runghc Setup.lhs install

By switching to B.ByteString in my code emitters for the minisql compiler mentioned in the previous blog post, the execution time dropped dramatically from 7.0 sec to 2.3 sec in my stress tests generated by the Perl module Parse::RandGen::Regexp. This is really an amazing improvement :) Furthermore, my UTF-8 regression tests kept passing as well.

In the next journal I'll present another optimization trick that further reduced the running time from 2.3 sec to 1.0 sec. (Well, it has nothing to do with -O2 BTW, and I turned on -O2 from the very beginning already ;))

January 18

Re: Intercepting access to a method/property

On Jan 18, 2008 7:21 PM, AllSeeingI wrote:
> Is it possible (through an extension, XPCOM, other way) to call a
> particular JS function when a particular method or property is
> accessed by a user script (= script on a HTML page)?
>

Object.watch is the way to go for properties ;) Not sure about methods though.

> The reason I'm asking is that I'm trying to create an extension that
> intercepts JavaScript redirections:
>
> location.href = ...

Heh, I'm afraid it's more browser-specific. So it might be OT here. But I'd like to share some of my experiences (mostly from NSA++) in this mail.

I think the following code should work in Firefox 2 (i.e. the js 1.7 engine):

    top.watch("location", function () { throw "Permission denied." });
    top.location.watch("href", function () { throw "Permission denied." });

But unfortunately it won't work in Firefox 3 (i.e. the js 1.8 engine). AFAIK, Firefox has been trying much harder than IE to protect frame-busting sites.

> location.replace(...);
>

Well, I was trying very hard to defeat this one but with no luck.
A good enough workaround for (static) sites is to (locally) disable JS for that particular frame loading the frame-busting page, as in:

    myBrowser.docShell.allowJavascript = false;

Basically, if you load the web page in a separate chrome window, frame-busting code will always fail. But if you're trying to load it in Firefox's own browser tab, you're not really "chrome" there.

Another trick that works is to use the onbeforeunload handler, as in:

    window.onbeforeunload = function (e) {
        e.returnValue = "This action might be caused by a frame-busting site.\nPlease click 'Cancel' if you're not meant to quit me.";
        return false;
    };

But this will pop up a confirmation dialog to the end user. There's no known way to bypass it without hacks ;)

There may exist much better solutions that I don't know of. Hope these help.

Cheers,
-agentzh

April 18

Why is a byte 8 bits?

I remember that Baoquan (宝权), the "super genius" of our class, once asked a very unusual question while we were learning C++ as freshmen: "Why is a byte 8 bits?"

Last night I posted this question to the #perl6 channel on irc.freenode.net, and Larry Wall (TimToady), jerry gay ([particle]), and moritz joined the discussion. Here is the chat log (agentzh is me, hehe):

<agentzh> a friend of mine once asked me why a byte is of 8 bits.
<moritz> agentzh: what did you answer?
<moritz> agentzh: "computer scientist love powers of two"?
<agentzh> moritz: i told him because ASCII code has 7 bits and the people want to feel safer and add one more
<TimToady> lol
<moritz> *g* nice explanation ;-)
<agentzh> thanks :D
<TimToady> and then the Europeans all added one more, and did we feel safer?
<TimToady> I don't think so...
* agentzh wants to hear TimToady's explanation.
<TimToady> I think the ASCII explanation is basically correct, from a cultural point of view. When people started programming PDP-11s and doing a lot of string processing, they decided it was convenient that it came close to a power of two, and stuck with it.
<TimToady> and it was also fairly obvious about then that the next generation would be 32-bit processors, and then you get 4 chars into it.
<TimToady> but I think the powers-of-two argument was kind of a post-facto rationalization of the ASCII culture
<TimToady> basically, Pascal and C thought in bytes, so everything else followed along.
* TimToady remembers various contortions of trying to rationalize the type system of C on some weird old architectures that were not amenable to bytes...
<TimToady> and the term "byte" itself had not yet settled on 8 bits
* moritz thinks of "mix", Donald E. Knuth's assembler, that doesn't rely on a fixed byte size
<TimToady> yes. 36-bit computers tended to use 6 bit characters
<[particle]> octet is the correct term, but byte has become a synonym
<TimToady> byte is now the correct term. octet will die eventually
<TimToady> and go back to being 8 singers.
<TimToady> except for in standards documents, where it will likely remain a shibboleth

The original chat log is at:

    http://colabti.de/irclogger/irclogger_log/perl6?date=2007-04-17,Tue&sel=451#l672

It includes the context around the excerpt above, hehe. Larry was really on a roll last night, one joke after another. A true master...

章亦春

March 26

Fixing RealPlayer having no sound on Ubuntu

I remember my apprentice reporting a month ago that RealPlayer on Ubuntu showed video but played no sound. I didn't expect to run into the same problem myself. Fortunately, after a lot of Googling, I finally found the following solution:

* First install the ALSA OSS driver:

    $ sudo apt-get install alsa-oss

* Then edit the launch script (/usr/lib/realplay-10.0.8/realplay) and change line 73 from

    $REALPLAYBIN "$@"

  to

    aoss $REALPLAYBIN "$@"

On my own feisty fawn, the installed version is RealPlayer 10.0.7, and the line to change in the realplay file is line 70 rather than line 73, hehe. Now .rmvb files finally play with sound! Great~~~ No more running WinXP in VirtualBox just to watch movies, hehe.

July 7

What are tuits?

On the Internet I often see programmers (and of course many non-programmers too) using the word tuits all over their emails, IRC messages, and documents, yet it is nowhere to be found in ordinary dictionaries, and even online dictionaries rarely list it. Typical usage of tuits looks like this:

A> Will you work on that project?
B> Well, as soon as i have the tuits.

Another example:

A> Oh, i'm exhausted. i don't think i have the tuits to finish the job today!
B> alas...
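
But what do tuits mean? What are tuits?

From these examples we can more or less guess that tuits means something like "time", "inspiration", or "motivation". American programmers on the libwww-perl mailing list can confirm our guess:

Interestingly, the latter link points to an explanation of tuits by Larry Wall, the father of the Perl language.

From these emails it is easy to see that the word tuits comes from the phrase round tuit, and round tuit in turn comes from the following sentence:

    I'll do that when I get around to it.

Here the idiom "get around to" means "to find the time to do or consider something". Evidently, "to it" fused together and became tuit, hehe. Isn't that going a bit too far?

The definition of round tuit on en.wikipedia.org further confirms the explanation above:

    A round tuit is an imaginary object whose name is derived from the phrase ``when I get around to it''.

As we can see, quite a few English words are well worth savoring. Hehe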