ngx_pagespeed: how does it work?
April 29th, 2013
ngx_pagespeed, tech
I've been very busy the past few days getting ngx_pagespeed out in beta, and then even busier with the flood of interest. I wrote a post for the Google Developers blog, TechCrunch wrote it up, and it hit #1 on HN.
So, how does it work? First, let's look at what happens normally when you just have the browser and the server, without PageSpeed. Imagine we have a page like this:
/index.html
  -> navbar.js
  -> site.css
  -> cat.jpg
We'll have:
- The browser requests index.html.
- The server reads /var/www/index.html from disk and sends it out.
- The browser parses the html, learns about navbar.js, site.css, and cat.jpg, and sends requests for them.
- The server reads each of them from disk and sends them out.
Let's add PageSpeed to the picture:
- The browser requests index.html.
- The server reads /var/www/index.html from disk.
- The response passes through PageSpeed on the way out, giving an opportunity for optimization.
- PageSpeed sees references in index.html to navbar.js, site.css, and cat.jpg, but doesn't immediately know their contents. To find out, it requests them from the server.
- The fetches take too long for it to be ok to block the response on them, so PageSpeed lets them continue in the background and sends out index.html without optimizing the resources.
- The browser parses the html, learns about navbar.js, site.css, and cat.jpg, and sends requests for them.
- The server reads each of them from disk and sends them out.
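The first-request behavior above (start the fetches, don't block on them, send the page as-is) can be sketched in a few lines of Python. This is only an illustration of the idea, not PageSpeed's code: `fetch_resource`, `rewrite_html`, the `cache` dict, and the timing values are all made up for the example, and the slow fetch is simulated with a sleep.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical stand-ins: a shared cache of fetched resources and a pool
# that keeps fetches running after the response has gone out.
cache = {}
cache_lock = threading.Lock()
pool = ThreadPoolExecutor(max_workers=4)

def fetch_resource(url, delay=0.2):
    """Pretend fetch: slow, and stores its result in the cache when done."""
    time.sleep(delay)
    with cache_lock:
        cache[url] = "contents of " + url

def rewrite_html(resource_urls, deadline=0.01):
    """Return the resources whose contents are known by the deadline.

    Anything not cached yet gets a background fetch so that a later
    request can use it; this request does not wait for it to finish.
    """
    pending = []
    for url in resource_urls:
        with cache_lock:
            known = url in cache
        if not known:
            pending.append(pool.submit(fetch_resource, url))
    # Wait at most `deadline` seconds in total, then move on regardless.
    wait(pending, timeout=deadline)
    with cache_lock:
        return [url for url in resource_urls if url in cache]
```

Calling `rewrite_html` once returns nothing optimizable because the fetches are still running; calling it again after they finish returns all three resources. A real implementation would also avoid starting a second fetch for a resource that is already in flight, which this sketch skips for brevity.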
On a later request for the same page, after those background fetches have finished, things go differently:
- The browser requests index.html.
- The server reads /var/www/index.html from disk.
- The response passes through PageSpeed on the way out, giving an opportunity for optimization.
- PageSpeed sees references in index.html to navbar.js, site.css, and cat.jpg, and this time knows what they contain because the fetches from before had time to complete.
- PageSpeed sees that navbar.js is only a few lines. At that size it's probably not worth forcing the browser to make another round trip just to retrieve it, so PageSpeed inlines it. The css and image are large enough that inlining doesn't make sense, partly because inlining keeps caching from working, so for those it just wants to send optimized versions.
- PageSpeed sends out index.html with some substitutions. Aside from inlining navbar.js, it replaces site.css with A.site.css.pagespeed.cf.KM5K8SbHQL.css and cat.jpg with 256x192xcat.jpg.pagespeed.ic.AOSDvKNItv.jpg. These longer urls contain a hash of the contents, which means it's safe to serve them with a very long cache lifetime because when the content changes they'll get a different hash.
- The browser parses the html, learns about A.site.css.pagespeed.cf.KM5K8SbHQL.css and 256x192xcat.jpg.pagespeed.ic.AOSDvKNItv.jpg, and requests them from the server.
- PageSpeed handles those requests and sends the optimized contents out.
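To see why putting a content hash in the url makes long cache lifetimes safe, here is a minimal sketch. It mimics the general shape of the urls above, but the helper name `hashed_url`, the choice of SHA-256, and the 10-character truncation are assumptions made for illustration; this is not necessarily how PageSpeed computes its hashes, and the filter-specific prefixes like `A.` and `256x192x` are left out.

```python
import base64
import hashlib

def hashed_url(name, filter_id, content, hash_len=10):
    """Build a url like site.css.pagespeed.cf.<hash>.css from the content.

    The naming pattern follows the examples in this post; the hash
    function and truncation length are illustrative assumptions.
    """
    digest = hashlib.sha256(content).digest()
    tag = base64.urlsafe_b64encode(digest).decode("ascii")[:hash_len]
    extension = name.rsplit(".", 1)[-1]
    return f"{name}.pagespeed.{filter_id}.{tag}.{extension}"

old = hashed_url("site.css", "cf", b"body { color: red }")
same = hashed_url("site.css", "cf", b"body { color: red }")
new = hashed_url("site.css", "cf", b"body { color: blue }")
```

Here `old` and `same` are identical, so a browser can keep using its cached copy, while `new` is a different url, so an edited stylesheet never collides with a stale cache entry.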
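To make all of this work, PageSpeed has two jobs:
- PageSpeed needs to intercept outgoing html and rewrite it.
- PageSpeed needs to respond to requests for rewritten resources like A.site.css.pagespeed.cf.KM5K8SbHQL.css.
In Nginx these map onto a body filter for html and a content handler for .pagespeed. rewritten resources. What does this look like?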
- Nginx receives a request for `http://example.com/index.html`:
    GET /index.html HTTP/1.1
    Host: example.com
- Nginx calls PageSpeed's content handler, which looks at the url and determines whether this request is for an optimized .pagespeed. resource. In this case it isn't, so the content handler declines this request.
- Nginx continues trying other content handlers until it finds one that can handle the request. This may be a proxy_pass, fastcgi_pass, a try_files, static file, or anything else that the webmaster might have configured Nginx to use.
- Whatever content handler Nginx selects will start streaming a response as a linked list of buffers ("buffer chain"):
    ngx_chain_t in:
      ngx_buf_t* buf:
        u_char* start
        u_char* end
      ngx_chain_t* next
- Nginx passes that chain of buffers through all registered body filters, which includes PageSpeed's. If this were not html being sent, PageSpeed's body filter would immediately pass the buffers on to the next registered body filter.
- The body filter will see one buffer chain at a time, but it might not be the whole file's worth. For static files on disk it usually will be, but perhaps not if we're proxying from an upstream that quickly dumps some layout html but takes much longer to generate personalized content.
- We pass this to PageSpeed via a ProxyFetch. While PageSpeed is running in another thread, ProxyFetch handles all the thread-safety complexity here. Nginx doesn't usually have any threads, so we need to be pretty careful.
- We need to give PageSpeed time to optimize this html, and it's running in a different thread, so we're not going to have output ready for Nginx immediately. Nginx uses an event loop so we can't just wait around here, or else Nginx won't be able to handle other requests until this one finishes. Instead we create a pipe and tell Nginx to watch the output end. Once PageSpeed has some data ready it will be able to write a byte to the pipe and notify Nginx.
- PageSpeed parses this html, identifies the resources in it, and tells the fetcher to retrieve them. This means a "loopback fetch" where PageSpeed requests the resources from Nginx over http.
- There's a Schedule thread that keeps this optimization under a very tight deadline. If Nginx takes too long to respond with resources, or anything else makes us take too long, we send the html out with whatever optimizations we have completed so far. Imagine that in this case only site.css is fetched and optimized by the time we hit our rewrite deadline.
- PageSpeed writes a byte to the pipe Nginx is watching, which makes Nginx invoke our code on its main thread (the only thread it knows about). We copy the output bytes from PageSpeed to an Nginx buffer chain and then Nginx sends them out to the user's browser:
    index.html
      -> navbar.js
      -> A.site.css.pagespeed.cf.KM5K8SbHQL.css
      -> cat.jpg
- All html will go through the same path: as it comes into Nginx it will go to the body filter and then to PageSpeed via the ProxyFetch, then after optimizing or hitting the deadline PageSpeed wakes up Nginx with the pipe, and Nginx sends out the rewritten html.
- When the user's browser sees navbar.js and cat.jpg it will request them from Nginx, and while first the content handler and then the body filter will see each request, they won't do anything.
- The request for A.site.css.pagespeed.cf.KM5K8SbHQL.css, however, will be answered by the content handler. It will pass the request to PageSpeed via ResourceFetch, and PageSpeed will pull the rewritten resource out of cache. (In the unlikely event that the resource is not in cache, there is enough information in the requested filename that PageSpeed can fully reconstruct the optimized resource.) We go through the same output flow, writing a byte to a pipe to notify Nginx that we have data to send out.
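The pipe trick used in the steps above is a general way for a worker thread to wake a single-threaded event loop. Here is a minimal sketch of it, with Python's standard `selectors` module standing in for Nginx's event loop. The `worker` and `serve` functions, the `results` queue, and the uppercase "optimization" are all invented for the example, and it assumes a POSIX system where pipes can be polled.

```python
import os
import queue
import selectors
import threading

def worker(html, results, write_fd):
    """Runs off the main thread, like PageSpeed in the post."""
    results.put(html.upper())   # stand-in for the real optimization
    os.write(write_fd, b"x")    # one byte meaning "output is ready"

def serve(html):
    read_fd, write_fd = os.pipe()
    results = queue.Queue()
    selector = selectors.DefaultSelector()
    # Ask the event loop to watch the read end of the pipe.
    selector.register(read_fd, selectors.EVENT_READ)
    thread = threading.Thread(target=worker, args=(html, results, write_fd))
    thread.start()

    output = None
    while output is None:
        # A real server would also be handling other connections here,
        # which is the whole point of not blocking on the worker.
        for key, _ in selector.select(timeout=1.0):
            if key.fd == read_fd:
                os.read(read_fd, 1)              # drain the wake-up byte
                output = results.get_nowait()    # result was queued first

    thread.join()
    selector.unregister(read_fd)
    selector.close()
    os.close(read_fd)
    os.close(write_fd)
    return output
```

The worker queues its result before writing the byte, so by the time the loop wakes up the data is guaranteed to be there. The byte itself carries no information; it exists only to make the file descriptor readable.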