
Blog


September 20

Slides for my Perl testing & VDOM.pm talks at Beijing Perl Workshop 2009

I really enjoyed the talks at BJPW2009. Here are the slides for my two talks at this conference:

Hopefully you'll find them interesting ;)
September 14

A plan for nginx-openresty

Now that I've joined Taobao.com's SDS department and can work on OpenResty full time, I've worked out a (somewhat) detailed plan for the next generation of the OpenResty server. Sorry, this draft is in Chinese since my $manager reads Chinese better:

    http://www.pgsqldb.org/mwiki/index.php/Nginx_openresty_plan  (still being actively updated)

After talking with my friend and colleague chaoslawful++ about possible designs for a high-performance implementation of the OpenResty server, we finally decided to rewrite OpenResty.pm in pure C, in the form of an nginx module.

Here are some highlights of the Chinese project plan given above:
  • Nginx-openresty will remain fully opensource.
  • We want to take full advantage of nginx's event model and asynchronous I/O, and we don't want backend requests to block either.
  • We want to integrate Coco Lua into nginx's event model, leveraging Coco's C-level coroutines, to get transparent asynchronous I/O in Lua land. For example, the following Lua code

              res = http.get('http://www.taobao.com')

    will automatically yield the current "Lua thread", register a socket fd with nginx's underlying epoll/kqueue/select/etc. mechanism, and return control to nginx to do other things. Once data arrives at the socket fd, nginx will be notified and accumulate the response data. When the requested data is completely ready, nginx will resume the pending Lua session and the Lua function "http.get" will return successfully.
  • We want Lua to become the first-class embedded language on the service level, and we'll abandon the restyscript language.
  • The first two backends we want to implement for nginx-openresty are MySQL and Oracle (and Hive afterwards) because these are heavily used in our department's production systems.
  • Only the HTTP POST method will be allowed to emulate modification methods like PUT and DELETE. GET will no longer be allowed for this, to reduce the risk of XSS attacks.
  • The password login method will require the client to request a random number from the OpenResty server and use it as a salt to hash her password with multi-pass MD5, as in
        passwd = md5(passwd)
        for 1..2 do
            passwd = md5(passwd + salt)
        done
  • For the REST interface, we will introduce a Type API that lets users define new parameter types for Views and Actions with Lua snippets.
  • For RDBMS backends, Views and Actions can use the "map_to" attribute to map automatically to underlying DB functions and stored procedures.
  • The definitions of Views and Actions (aka "queries") will be specified in the backend's own query language (like PL/SQL, PL/PgSQL, etc.). OpenResty will only recognize special interpolated parameters of the form $(param_name type=xxx checker=xxx default=xxx ...).
  • Views and Actions will support "cached" and "expire" attributes to allow caching of result data sets, and an "async" attribute to allow time-consuming backend queries to be submitted to remote queues like memcacheq and eventually run by async daemon workers. The original View invoker will immediately get a job ID for the request (unless the task queue is full), can poll OpenResty's Job API for the status of the task, and gets the final results once the task is marked "done".
  • View/Action parameters can specify types, Lua-specified checkers, and default values. For example:

         {"name":"my_view","query":
              "select * from animals where age > $(age type=integer checker='return age >= 0 and age <= 100')"}

  • Filter, Template, and Trigger APIs will also be introduced :)
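
The yield/resume flow described above for http.get can be illustrated with a small sketch. It is written in Python rather than Lua, with generators standing in for Coco coroutines and the standard selectors module standing in for nginx's event loop. The names http_get, handler, run, and demo are hypothetical and not part of nginx-openresty; this only shows the control flow, not the planned implementation.

```python
import selectors
import socket


def http_get(sock):
    """Hypothetical stand-in for http.get: suspends until sock is readable."""
    chunks = []
    while True:
        yield sock                  # hand control back to the event loop
        data = sock.recv(4096)      # the loop resumed us, so data (or EOF) is ready
        if not data:                # EOF: the whole response has arrived
            return b"".join(chunks)
        chunks.append(data)


def handler(sock, results):
    # Reads like blocking code, but never blocks the loop.
    res = yield from http_get(sock)
    results.append(res)


def run(tasks):
    """Tiny event loop: resume each task when the socket it waits on is readable."""
    sel = selectors.DefaultSelector()

    def step(task):
        try:
            sock = next(task)       # run the task until it needs I/O
        except StopIteration:
            return                  # task finished
        sel.register(sock, selectors.EVENT_READ, task)

    for task in tasks:
        step(task)
    while sel.get_map():            # some task is still waiting on a socket
        for key, _ in sel.select():
            sel.unregister(key.fileobj)
            step(key.data)
    sel.close()


def demo():
    ours, peer = socket.socketpair()   # local socket pair instead of a real HTTP server
    peer.sendall(b"hello ")
    peer.sendall(b"world")
    peer.close()
    results = []
    run([handler(ours, results)])
    ours.close()
    return results
```

Calling demo() returns [b"hello world"]: the handler is suspended while it waits for data and is resumed by the loop until the peer closes the connection.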
I'll create a git repository on GitHub for nginx-openresty's development. Participation will always be appreciated :) I'll keep you posted.
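
The multi-pass MD5 login scheme from the highlights can be sketched in Python as follows. The plan does not say how digests and salts are encoded, so hex digests, UTF-8 text, and plain string concatenation are assumptions here, and md5_hex, multipass_md5, and new_salt are hypothetical helper names.

```python
import hashlib
import secrets


def md5_hex(text: str) -> str:
    """MD5 of a UTF-8 string as a hex digest (the encoding is an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def multipass_md5(password: str, salt: str, passes: int = 2) -> str:
    """passwd = md5(passwd), then passwd = md5(passwd + salt) for each pass."""
    digest = md5_hex(password)
    for _ in range(passes):
        digest = md5_hex(digest + salt)
    return digest


def new_salt() -> str:
    """Random per-login salt that the server would hand to the client."""
    return secrets.token_hex(8)
```

One consequence of this shape is that a server storing only md5(passwd) could still verify a login by running the salted passes itself; whether the plan intends that is not stated.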

Update: Special thanks go to kindy++ for his detailed review of the nginx-openresty-plan document and helpful suggestions :)
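
As a rough illustration of the $(param_name type=xxx checker=xxx default=xxx ...) placeholders in the plan, here is a Python sketch that extracts them from a query and rewrites the query with bind markers. The "%s" marker, the integer conversion, and the names PLACEHOLDER, parse_spec, parse_placeholders, and to_bind_query are assumptions for illustration; the Lua checker is kept as a string and not executed.

```python
import re
import shlex

# Matches $( ... ), allowing quoted attribute values inside the parentheses.
PLACEHOLDER = re.compile(r"""\$\(((?:'[^']*'|"[^"]*"|[^()'"])*)\)""")


def parse_spec(body):
    """Turn "age type=integer checker='...'" into a dict of attributes."""
    name, *attrs = shlex.split(body)   # an empty $() would raise ValueError here
    spec = {"name": name}
    for attr in attrs:
        key, _, value = attr.partition("=")
        spec[key] = value
    return spec


def parse_placeholders(query):
    """List the parameter specs found in a View/Action query."""
    return [parse_spec(m.group(1)) for m in PLACEHOLDER.finditer(query)]


def to_bind_query(query, args):
    """Replace each placeholder with a bind marker and collect its value."""
    values = []

    def substitute(match):
        spec = parse_spec(match.group(1))
        raw = args.get(spec["name"], spec.get("default"))
        if raw is None:
            raise ValueError("missing parameter: " + spec["name"])
        values.append(int(raw) if spec.get("type") == "integer" else raw)
        return "%s"

    return PLACEHOLDER.sub(substitute, query), values
```

Passing values separately from the query text, instead of splicing them into the SQL string, leaves quoting to the database driver.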
September 4

Slides for my VDOM + WebKit talk

I gave a presentation on VDOM + WebKit to the Taobao.com Search Frontend Team this morning. The slides are based on my talk at April's Beijing Perl Workshop, with notable updates to reflect changes over the last few months:

   http://agentzh.org/misc/slides/taobao-fe/vdomwebkit.xul   (Firefox required to open this link)

Be patient while the big images download, or download the whole tarball, unpack it, and open the vdomwebkit.xul file in it to browse the slides locally:

   http://agentzh.org/misc/slides/taobao-fe.tar.gz

Recent major developments in our browser-based web scraping clusters are:

  • We've switched from the Visual DOM Firefox extension completely to the VDOM Browser based on QtWebKit for hunter development.
  • We've switched from OpenResty + Pg to our queue-size-aware version of memcacheq to coordinate the whole cluster.
  • We've extensively used our new "VDOM spectroscopy algorithm" to establish correspondence between text nodes in similar page regions.
  • We've renamed the offsetX, offsetY, offsetWidth, and offsetHeight attributes in the VDOM data format to single-letter names x, y, w, and h, respectively.
  • We've tweaked QtWebKit to emit geometric information about text nodes as well as text runs in the VDOM data output.
I'll use almost the same slides for my 3rd (lightning) talk at the annual Beijing Perl Workshop conference a few weeks from now :)
September 3

Our queue-size-aware version of memcacheq

Xunxin++ and I have been working on a fork of memcacheq (originally just within the company), adding support for a queue length constraint.

In our scenario, a pipelined webpage information extraction cluster based on Apple's WebKit core, it's important to limit the queue's length and to have the queue "inform" the producers of queue items in some way when the queue is full.
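
The idea of a length-limited queue that tells producers when it is full can be sketched with a small in-memory Python class. This is only an illustration of the behavior, not the fork's code or wire protocol; BoundedQueue and its boolean return value are assumptions.

```python
from collections import deque


class BoundedQueue:
    """In-memory queue that refuses new items once max_len is reached."""

    def __init__(self, max_len):
        self.max_len = max_len
        self._items = deque()

    def put(self, item) -> bool:
        """Return False (instead of growing) so the producer knows to back off."""
        if len(self._items) >= self.max_len:
            return False
        self._items.append(item)
        return True

    def get(self):
        """Return the oldest item, or None if the queue is empty."""
        return self._items.popleft() if self._items else None
```

A producer that sees False from put() can pause or slow down until consumers have drained some items.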

We're not sure if it's worth merging back to the mainstream version because this new addition adds some cost (though the cost is low).

Here goes the project page on GitHub, with more explanation of the details (I won't repeat them here ;)):

  http://github.com/agentzh/memcacheq/tree/master

The newly added code is also licensed under the same license as the mainstream memcacheq.

Enjoy :)
September 2

I'll speak at the upcoming Beijing Perl Workshop 2009 event

I've submitted 3 talk proposals to this year's Beijing Perl Workshop conference, scheduled for Sep 19. I'll publish the slides for my talks here later for your preview :)

The 3 talks are
  1. Wonders of Perl automated testing
  2. Inventing mini languages in Perl
  3. Web Scraping based on Perl + WebKit + VDOM
If you feel like attending the conference, please register on the conference site below:

   http://conference.perlchina.org/bjpw2009/

Don't forget to specify a T-shirt size in your profile settings there so that we can prepare a T-shirt for you at this event (well, it's free!).

See you there ;)