Cache Reverse Proxy - Varnish
Introduction
Varnish is an HTTP accelerator; its official page is https://www.varnish-cache.org/.
Varnish sits in front of the web application server as a caching reverse proxy (it can also act as a load balancer). It can simply cache all static resources in memory, and it can also be powerfully configured with VCL (Varnish Configuration Language, a DSL for Varnish configuration) to cache dynamic content. In addition, Varnish implements the ESI (Edge Side Includes) standard, which makes it possible to cache the static parts of a page.
A web server fronted by Varnish can easily handle more than 10,000 requests/s on a single node; as a comparison, an Apache web server handles about 1,000 requests/s. This makes it extremely suitable, and strongly recommended, for "content-heavy dynamic websites" with highly concurrent traffic.
Varnish Software has made a vivid video demonstrating what Varnish is and how it makes the web fly.
Installation
There are two ways of installing it: via a package manager, or by compiling from source.
Install by package manager
For Linux distributions, follow the guide on the Varnish official download page; for Mac, simply run brew install varnish.
Compiling from source
Varnish relies on recent versions of GNU M4, autoconf, automake and libtool, so download them from a GNU mirror and compile/install them one by one:
For each of ["M4", "autoconf", "automake", "libtool"], do (shown here for M4):
curl -O http://mirrors.kernel.org/gnu/m4/m4-latest.tar.gz
tar xzf m4-latest.tar.gz
cd m4-1.4.16/
./configure --prefix=/usr/local
make && sudo make install
Then download the source from the Varnish official download page and install Varnish:
cd varnish-3.0.3
./autogen.sh
./configure
make
sudo make install
As for me, I miraculously forgot the existence of Homebrew and spent an hour compiling from source...
Make it work!
I have a Rails 3 website running at http://localhost:3000, so I can simply run:
sudo varnishd -a :80 -b http://localhost:3000 -s file,/tmp,500M -T localhost:6082
Argument explanation (we can always run varnishd --help):
- -a Binding address
- -b Backend server address
- -s Storage backend specification
- -T Telnet address (management interface), e.g. -T localhost:6082
- -F Run in foreground, with the runtime log shown in the terminal
This will cache all GET/HEAD requests for all resources, so we can simply set up Varnish in front of a static file server to gain a huge performance improvement.
However, since our website is dynamic, we need to deal with two typical scenarios:
- Some resources should be cached, but they may be updated at some point, and then the cache needs to be rebuilt. Simply running Varnish as above will NOT achieve this!
- Most of my website's functionality requires the user to be logged in. If I simply run Varnish as above, there is no security at all: all the sensitive data would be cached by Varnish, which is not acceptable!
I've investigated how to configure VCL to achieve No. 1; No. 2 can be done with ESI, which I will cover later.
VCL basics
When Varnish is installed, it generates a default.vcl under /etc/sysconfig/varnish or /etc/default/varnish on Linux distros, and under /usr/local/etc/varnish/ on Mac; all the content inside is commented out, ready for you to modify.
VCL has a number of subroutines, each invoked at a specific stage of the HTTP transaction; the process is shown below (shamelessly stolen from the MGM Tech blog):
There are several important built-in objects which can be accessed in functions:
req
The request object. When Varnish has received the request the req object is created and populated. Most of the work you do in vcl_recv you do on or with the req object.
beresp
The backend response object. It contains the headers of the object coming from the backend. Most of the work you do in vcl_fetch you do on the beresp object.
obj
The cached object. Mostly a read only object that resides in memory. obj.ttl is writable, the rest is read only.
I have a resource exposed by Rails at http://localhost/doc. I expect it to be cached by Varnish, and the cache to be refreshed when someone POSTs to update it. To achieve this, I need to cache the resource for all GET/HEAD requests; however, when an update request (a POST) comes in, Varnish should purge the object it has cached. This is done in three steps:
- Set default backend in VCL:
backend default {
  .host = "127.0.0.1";
  .port = "3000";
}
- Tell Varnish to ban the cached object when the request is an HTTP POST AND the server says "no cache":
sub vcl_fetch {
  if (req.request == "POST" && beresp.http.Cache-Control == "no-cache") {
    ban("req.url == " + req.url);
  }
  return (deliver);
}
- Restart Varnish and tell it to use this VCL configuration:
sudo varnishd -a :80 -s file,/tmp,500M -T localhost:6082 -F -f /usr/local/etc/varnish/default.vcl
ban is a new action added in Varnish 3.0; it replaces the former purge and purge_url actions. purge still exists, but can now only be used without arguments.
Finally I update the "edit" action of the resource controller:
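The original post embedded the controller code here, but the embed is gone; below is a hedged sketch of what it must have done. In a Rails 3 action the essential line is response.headers["Cache-Control"] = "no-cache" after a successful save; the method and status codes here are my own illustration, not the author's code.

```ruby
# Sketch only: the original embed is lost. The key point is that a
# successful update must answer with a Cache-Control: no-cache header,
# which is what the ban() rule in vcl_fetch keys on.
# In a Rails 3 controller action that is one line:
#   response.headers["Cache-Control"] = "no-cache"
#
# Stand-alone illustration of the (status, headers) pair the backend
# must return for the cache invalidation to fire:
def update_response(saved)
  status  = saved ? 200 : 422
  headers = saved ? { "Cache-Control" => "no-cache" } : {}
  [status, headers]
end
```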
Now if I update the resource:
HTTP/1.1 POST http://localhost/doc
Varnish will first pass the request through to the backend; after the resource is updated inside Rails, the controller returns a "Cache-Control: no-cache" header, and my VCL then purges (bans) the requested URL, so that the next time a GET request comes in, Varnish reloads the resource from the backend and the cache is rebuilt!
Below are some VCL examples I collected online:
Honor the Cache-Control header!
# Without this block, Varnish would guess whether the response is cacheable, which can result in unexpected caching
if(obj.http.Pragma ~ "no-cache" ||
obj.http.Cache-Control ~ "no-cache" ||
obj.http.Cache-Control ~ "private") {
return(pass);
}
Force refresh
Always look up the backend when a client fires a "force refresh" request, e.g. Cmd-Shift-R on Mac or Ctrl+F5 in IE (this snippet assumes an ACL named editors has been defined):
if (req.http.Cache-Control ~ "no-cache" && client.ip ~ editors) {
set req.hash_always_miss = true;
}
Remove cookie headers for images
sub vcl_fetch {
if (req.url ~ "\.(png|gif|jpg)$") {
unset beresp.http.set-cookie;
set beresp.ttl = 1h;
}
}
Pass sensitive data to the backend
For basic HTTP authentication:
if (req.http.Authorization) {
# Not cacheable by default #
return(pass);
}
For a Java EE web application:
sub vcl_recv {
if (req.http.cookie ~ "JSESSIONID") {
std.log("found jsessionid in request, passing to backend server"); # import std;
return (pass);
}
}
A tip for debugging VCL: add the std module in the VCL file (import std;) so that we can print some useful logs from VCL:
std.syslog(888, "Purge cache for: " + req.url);
Cache invalidation
The VCL below sets up an access control list named "purgers" and exposes an HTTP PURGE interface:
acl purgers { "127.0.0.1"; }
sub vcl_recv {
if (req.request == "PURGE") {
if (!client.ip ~ purgers) {
error 405 "Method not allowed";
}
return (lookup);
}
}
sub vcl_fetch {
std.syslog(888, "vcl_fetch!!!!!!!!!!!!!!!");
if (req.request == "POST" && beresp.http.Cache-Control == "no-cache") {
std.syslog(888, "Purge cache for: " + req.url);
ban("req.url == " + req.url);
}
return (deliver);
}
sub vcl_hit {
if (req.request == "PURGE") {
purge;
error 200 "Purged";
}
}
sub vcl_miss {
if (req.request == "PURGE") {
purge;
error 200 "Purged";
}
}
sub vcl_pass {
if (req.request == "PURGE") {
error 502 "PURGE on a passed object";
}
}
So when an HTTP PURGE request (curl -X PURGE http://localhost/doc) is sent from one of the "purgers", Varnish purges the cache.
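For completeness, the same PURGE can be issued from Ruby (say, from a model callback) instead of curl. This is a hedged sketch, not from the original post: Net::HTTP has no built-in PURGE verb, so a small request class is defined the same way the stdlib defines Net::HTTP::Get; the purge_cache helper name is my own.

```ruby
require "net/http"

# Net::HTTP ships no PURGE verb, so define one the way the stdlib
# defines its own request classes (Get, Post, ...).
class Net::HTTP::Purge < Net::HTTPRequest
  METHOD = "PURGE"
  REQUEST_HAS_BODY = false
  RESPONSE_HAS_BODY = true
end

# Hypothetical helper: send PURGE for a URL; the caller's IP must be
# in the "purgers" ACL or Varnish answers 405.
def purge_cache(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port) do |http|
    http.request(Net::HTTP::Purge.new(uri.request_uri))
  end
end

# purge_cache("http://localhost/doc")
```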
References
https://www.varnish-cache.org/trac/wiki/Introduction
https://www.varnish-cache.org/docs/3.0/tutorial/vcl.html
https://www.varnish-software.com/static/book/VCL_functions.html
http://blog.mgm-tp.com/2012/01/varnish-web-cache/
https://www.varnish-cache.org/trac/wiki/VCLExampleEnableForceRefresh
VAC 2.0.3 with high performance cache invalidation API (aka the Super Fast Purger)