VelocityConf 2015

Service Workers, The Practical Bits
★★★★☆

Patrick Meenan, Google

Slides: http://www.slideshare.net/patrickmeenan/service-workers-for-performance

Basics about service workers

Tutorial about service workers. Pat says he isn’t an expert. The real experts, who wrote the specs and/or are working on implementations, are Alex Russell and Jake Archibald.

Basic concept: service workers are a layer in between the browser’s resource fetcher and the networking stack (which includes the net cache). In other words, service workers intercept all requests issued by a page, whether they are served from the browser’s disk cache or go out to the internet.

The information captured includes the original URL, all request headers, and the response when available. It even includes cross-origin requests (response inspection depends on the cross-origin policy).

Service workers have a programmable cache (the Cache API) and can also use IndexedDB.

Service workers are only available on pages served over HTTPS

Service workers are not active on first view

Service workers support async APIs only. They run in the background and have no notion of “page”. From a service worker’s point of view, there are only requests coming through. No page load, no after page load. That logic has to be built.

How to use service workers?

How do you register a service worker?

navigator.serviceWorker.register(serviceWorkerFile, { scope: scope })

What do you do with it? Attach handlers:

self.addEventListener('install', function(event) {
    // runs on install...
});
self.addEventListener('activate', function(event) {
    // runs once the service worker is installed
});
self.addEventListener('fetch', function(event) {
    // runs on every request
});

From @patmeenan: “In the grand scheme of developers shooting themselves in the foot, this is a big cannon.” For instance:

// Would intercept every request and make it return 'Hello world'. That
// includes the main HTML document!
self.addEventListener('fetch', function(event) {
    event.respondWith(new Response('Hello world'));
});

Regardless of everything else, the browser will check for an updated service worker every 24hrs. That means a bad worker could leave your site busted for up to 24hrs, so you need good test coverage.

// Precache a list of files in a named cache
caches.open(cacheName).then(function(cache) {
    return cache.addAll(['file1', '/path/to/file/2' /* etc. */]);
});

Service workers' cache is fully programmable. But that also means you have to expire or reset entries manually. Double-edged sword?

Service workers' main use case is offline. Through a navigator.onLine check you can customize what the browser returns when the network is not available.

Important to note: all Fetch API requests/responses are streams. So you need to clone them (request.clone(), response.clone()) if you want to do something with them without burning your one-time chance to read them.

Use cases

Set custom timeouts
Service workers let you set custom timeouts on ads, social widgets, etc. to prevent SPOFs (single points of failure)

SPOFomatic is a Chrome extension to detect single points of failure
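
A minimal sketch of the custom timeout idea, assuming a plain Promise.race is good enough. `withTimeout` and the `ads.example.com` hostname are made up for illustration and are not from the talk. Note that this only stops waiting; it does not abort the underlying request.

```javascript
// Hypothetical helper: settles with the wrapped promise, or rejects once
// `ms` milliseconds have elapsed, whichever happens first.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise(function (resolve, reject) {
    timer = setTimeout(function () {
      reject(new Error('timed out after ' + ms + 'ms'));
    }, ms);
  });
  // Clear the timer either way so it does not linger after the race settles.
  return Promise.race([promise, timeout]).finally(function () {
    clearTimeout(timer);
  });
}

// Assumed wiring inside a service worker (hostname is a placeholder):
// self.addEventListener('fetch', function (event) {
//   if (new URL(event.request.url).hostname === 'ads.example.com') {
//     event.respondWith(
//       withTimeout(fetch(event.request), 2000).catch(function () {
//         return new Response('', { status: 408 });
//       })
//     );
//   }
// });
```

The 2000ms budget and the empty 408 response are arbitrary choices; the point is that a slow third party no longer blocks the page.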

Catch DNS/CDN/Proxy failures/errors
Not only do service workers let you catch these errors by inspecting the response code for different types of assets on a page, but you can also do all kinds of cool things like:

Race CDNs
With service workers you can have one request spawn N requests, which means you could A/B test two CDNs. Might not be the best idea money-wise. Service workers can grab the first CDN to respond so that your CDNs are constantly “racing” one another.
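
A rough sketch of racing CDNs, assuming "first to respond" means "first to respond successfully". `raceCdns` is a hypothetical helper and `fetchFn` is injected so the sketch can run outside a service worker; inside one you would pass the global `fetch`.

```javascript
// Hypothetical helper: requests the same path from every CDN origin and
// resolves with the first successful response. Rejects only if all fail.
function raceCdns(path, cdnOrigins, fetchFn) {
  return new Promise(function (resolve, reject) {
    if (cdnOrigins.length === 0) {
      reject(new Error('no CDN origins given'));
      return;
    }
    let failures = 0;
    cdnOrigins.forEach(function (origin) {
      Promise.resolve()
        .then(function () { return fetchFn(origin + path); })
        .then(function (response) {
          if (!response.ok) throw new Error('bad status from ' + origin);
          resolve(response); // first good response wins, later ones are ignored
        })
        .catch(function () {
          failures += 1;
          if (failures === cdnOrigins.length) {
            reject(new Error('all CDNs failed for ' + path));
          }
        });
    });
  });
}
```

Every request still goes out in full, which is the "might not be the best idea money-wise" part.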

Stale-While-Revalidate implementation
With service workers you can go beyond the normal expire headers and prevent your users from reaching out to the server. For instance, instead of waiting for a full request, serve a stale version of the asset and in parallel, check if a fresher version is available. If it is, update the cache! If not, you’ve just saved your user the latency of a round-trip. w00t.
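
A minimal sketch of stale-while-revalidate. `staleWhileRevalidate` is a hypothetical helper; `cache` and `fetchFn` are injected (an assumption of this sketch) so it can be exercised outside a worker. `cache` is expected to behave like a Cache API object with `match` and `put`.

```javascript
// Serve the cached copy right away if there is one, and refresh the cache
// from the network in the background. With no cached copy, wait for the network.
function staleWhileRevalidate(request, cache, fetchFn) {
  return cache.match(request).then(function (cached) {
    const refreshed = fetchFn(request).then(function (response) {
      // Clone before caching: a response body can only be read once.
      return cache.put(request, response.clone()).then(function () {
        return response;
      });
    });
    if (cached) {
      refreshed.catch(function () {}); // a failed refresh keeps the stale copy
      return cached;
    }
    return refreshed;
  });
}

// Assumed wiring inside a service worker ('v1' is a placeholder cache name):
// event.respondWith(caches.open('v1').then(function (cache) {
//   return staleWhileRevalidate(event.request, cache, fetch);
// }));
```

In a real worker you would probably also hand the background refresh to `event.waitUntil` so the worker is not shut down before the cache is updated.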

Prefetching resources
That’s an obvious use-case. Since service workers are running in a completely different process you can start prefetching resources without incurring delay on the main page.

Re-writing the browser scheduler
That’s gnarly and super dangerous but can be done with service workers. You think that Chrome sucks when it downloads resources that are at the end of the body early? You can choose to delay that and prioritize your fancy hero image!

Custom compression
Service workers can also be used to implement compression algorithms that don’t ship with the browser natively. Got a new idea for a fancy new compression algorithm? You can do it. And that also applies to image compression! (But remember, service workers are written in JS; that’s probably not the best place to decompress/recompress an image!)

Delta compression
Say a page issues a request for main.js?v=3. A service worker can intercept that and, instead of letting the browser reach out to the internet, decide to see what’s in its cache and ask the server for the diff between the cached version and the fresh version. That can save massive amounts of time.

Progressive JPEGs
Have your service worker ask the server for just the bytes up until a certain point, leave the connection open, and serve the rest later!

Generate images instead of downloading them
Large images like webpagetest.org’s images are just charts. Instead of generating/serving this image from the server, a service worker can transparently ask the server for the data to generate that image (say a JSON blob) and generate a PNG client-side using a canvas.

Metrics
Service workers are awesome for debugging/metrics. You can not only report the usual perf data, but also error cases. This gives visibility into failure modes that you wouldn’t see otherwise.

Last note: service workers can do importScripts, so they’re composable. At this point, there’s still opportunity to write the jQuery that lets you compose service workers easily!

Best Practices for MySQL High Availability
★★☆☆☆

Colin Charles, MariaDB

What is High availability?

Different ways to handle replication: SAN (what is it?), simple replication, master-to-master, Tungsten, NDB Cluster, Galera Cluster

Redundancy:

Durability:

Recovery can be long if innodb_log_file_size is too large

Redundancy through disk replication

How can you achieve redundancy through MySQL replication?

Binary log, relay log, master_info_log, relay_log_info_log

Use Percona Toolkit to monitor replication delay

Galera cluster is a plugin to handle replication in MySQL.

NDBCluster: super high availability, amazing update performance, but very very expensive to run/upgrade.

Deep-dive about MHA (set of Perl scripts)

Metrics, metrics everywhere (but where the heck do you start?)
★★☆☆☆

Tammy Everts & Cliff Crocker, SOASTA

@tameverts, @cliffcroker

47% of consumers expect pages to load in 2 seconds or less.

There is no one single metric that exists which can unite everyone.

RUM, synthetic? Both!

Interesting idea: plot the conversion rate of a page or a flow against its performance.

TODO: look into Boomerang.js

Navigation timing API.

The usual talk about how it’s great to have that data available and look at it. Yelp already reports this so it wasn’t that new.

Interesting ideas related to navigation timings:

Resource timings. TODO: look into the Timing-Allow-Origin header to be able to get a breakdown of Resource Timings

See beyond-page-timings

Degradations are worse on browse-type pages than on say…checkout.

Concept of conversion impact scores (how much does performance impact conversion depending on pages/flows?)

Example SLA: response time measured using resource timing from Chrome browsers in the US should not exceed a median of 100ms or a 95th percentile of 500ms for a population of more than 500 users in a 24hr period.
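
A small sketch of how that example SLA could be checked. The thresholds come from the SLA above; `percentile`, `checkSla` and the input shape (one response time in ms per user, already filtered to US Chrome over 24hrs) are assumptions of this sketch.

```javascript
// Nearest-rank percentile on a sorted copy of the samples (p between 0 and 100).
// For an even number of samples, the median is the lower of the two middle values.
function percentile(samples, p) {
  const sorted = samples.slice().sort(function (a, b) { return a - b; });
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// The SLA only applies to populations of more than 500 users.
function checkSla(samples) {
  if (samples.length <= 500) {
    return { evaluated: false };
  }
  const median = percentile(samples, 50);
  const p95 = percentile(samples, 95);
  return {
    evaluated: true,
    median: median,
    p95: p95,
    met: median <= 100 && p95 <= 500
  };
}
```

Tracking the median and the 95th percentile together is what keeps a slow tail from hiding behind a healthy-looking middle.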

Linux Performance Tools
★★★★★

Brendan Gregg, Netflix

Link to Slides

Objectives:

Anti-methodologies

Actual Methodologies

Problem Statement Method

Workload Characterization Method

The USE method

For every resource, check:

Off-CPU Analysis

Try to track when your process gets off CPU and why. Very useful to detect problems where your program is competing with another process, i/o issues, etc.

CPU profile method

Take a CPU profile and understand all software in this profile that runs for more than one percent of the time.

RTFM Method

Read ("The F*cking") Man pages, books, web search, co-workers, slides, support services, source code, experimentation, social forums, etc.

Command line tools

Observability tools

Benchmarking tools

Benchmarks are tricky because it is orders of magnitude easier to run them than to refute them. To avoid that: run the benchmark for hours. In the meantime, run the observability tools to confirm it’s hitting the right things.

unixbench, lmbench, sysbench, perf bench

fio --name=seqwrite --rw=write --bs=…..

pchar (traceroute with bandwidth per hop)

iperf

Tuning tools

Tuning tools are acting on your system and modifying it. Be careful. Don’t fall into the drunk-man anti-pattern. Don’t fiddle with params until the problem goes away. Instead:

Different tools:

Static tools

To do static performance tuning, check the static state and configuration of the system:

Tools to do that:

Profiling tools

Built-in perf is chill.

    perf record -F 99 -ag -- sleep 30
    perf report -n --stdio

(then use Brendan’s FlameGraph tool to turn the output into a cool flame graph)

Tracing

eBPF is getting integrated into Linux 4.1. Exciting!

Better Off Bad
★★★★☆

Laura Bell

This talk was about security.

Our apps are full of precious things (data), we don’t want those precious things to get stolen.

We are all liars, cheats and thieves deep inside. The difference between us and real bad guys is we don’t mean harm.

Security is hard because breaking a box is against our nature. We’d rather build stuff than destroy. We also feel cheated if an attack is not sophisticated or elegant. Spoiler alert: most of them aren’t!!

Three steps to start at your company:

Challenge: mentally plan what would be the worst thing to steal/break in your organization. How would you steal it?

Lessons learned for large-scale apps running in a hybrid cloud environment
★★★☆☆

Dana Quinn, Intuit

Why the cloud? Move quickly, speed up innovation.

How do you get to the cloud?

Don’t make your cloud feel like fog!

Track your costs!

Building a faster, highly available data tier for active workloads
★☆☆☆☆

Dave McCrory, Basho

Trend: “Scale or fail!” The amount of data is expected to double every two years.

Challenge:

Speaker didn’t expose any solutions. Boo!

Maintaining performance and reliability at the edge
★★☆☆☆

Rob Peters, Verizon

Conway’s law: software architecture will mirror your organization.

What organization do we need to solve problem X given current software?
VS
Given current organization, what software can we efficiently produce to solve problem X?

Second part of the talk was about culture and tooling to ensure good performance. Ended up being a bit generic like:

Engineering for the long game
★★★★★

Astrid Atkinson, Google

Loosely coupled systems are very unpredictable.

At Google, the Borg ecosystem rules the world of services. Astrid has been at Google for 10 years and saw that system grow and evolve.

Rules of the long game:

How do systems grow?

Rules to enable growth:

Rules to engineer for maintainability:

Warning: if you share everything and consolidate too much, you’re sacrificing flexibility, so don’t do it too early. When do you consolidate? Pick your moment, be conservative – avoid building new systems wherever possible.

Make boring infrastructure choices. Those usually hold up the longest.

Shared systems require coordination. It’s not enough just to build it, you have to move existing workloads.

Systems should protect themselves (Google’s DDOS attacks come from inside!). You don’t want to have a system where every user interacting with it has to file a ticket either, so AUTOMATE and build defense mechanisms early.

Second systems: cake or death?

Generally speaking: don’t let the weeds get higher than the garden.
Invest in your tools, keep an eye out for complexity and time sink and take care of your people.

There’s no victory. It’s an ongoing game.

Not your Parents' Microsoft
★☆☆☆☆

Jessica DeVita, Microsoft

Marketing talk about Azure and how Microsoft is doing “cool” things now. No content at all. Entertaining I guess, they had some good memes in there.

Prevent rather than fix
★☆☆☆☆

Jason Ding, Salesforce

When you have customers of different sizes…scope/plan/test/optimize. Talk is about how you should test and plan before taking on large customers…yup! Got it. Thanks.

A practical type of Empathy
★★★☆☆

Indi Young

A broader range of ideas pops up when you solicit input from other people.

How do you do this? Empathy.
Empathy is a hot buzzword. It just means: be more sensitive.

There are several kinds of empathy:

Instead of checking, thinking and making (usual cycle when you chew on an engineering problem): try listening, then walking in that person’s shoes, then let that simmer for a bit. You’ll be surprised at the number of new ideas!

You have to listen to develop empathy.

Interesting idea: try to do your 1-1s on the phone to prevent fear of talking, assumptions and immediate reactions. It’ll prevent the conversation from steering too much.

Talk by Twitter

Stream processing and anomaly detection @Twitter
★☆☆☆☆

Arun Kejariwal and Sailesh Mittal, Twitter

Real-time analytics: streaming or interactive? You have two different architectures possible as a result.

Real-time means different things for different people (Wall Street real-time is insane)

Storm is a streaming platform for analyzing real-time data as it arrives. We already use Storm at Yelp. Womp womp, wrong talk to attend :/

Visualizing performance data in engaging ways
★★★☆☆

Mark Zeman, SpeedCurve

We implicitly recognize and seek out patterns, textures and colors. Try to make your visualizations more interesting!

Sick demos at SpeedCurve Lab.

Crafting Performance alerting tools
★★★★☆

Allison McKnight, Etsy

Story of how Etsy added performance monitoring tools.

Etsy is PHP on the backend. They monitor the backend with phptime and aggregate with Logster.

At first performance monitoring was done through dashboards. Dashboards are good, but they have problems:

Solution: monitoring!!

First iteration: email top 5 slowest pages.
Cons: didn’t catch regressions on fast pages, the ones that people actually care about optimizing

Second iteration: perf regression report
Cons: didn’t catch small/slow-creep regressions, difficult to tune, alert fatigue (which regressions are meaningful?)

Third Iteration: change alerting mechanism! Instead of email, use Nagios!
Cons: alerts were hard to read

Fourth Iteration: change format with Nagios Heralds, to add context to Nagios alerts. Also, added relevant timing graphs to the email (past 1hr/24hrs)
Cons: lots of false positives on downstream services affecting top level alerts (if payment processor is slow, yes, payment flows are gonna be slow but there’s nothing you can do about it!)

Fifth Iteration: use Nagios service dependencies to cut down on non-actionable alerts
Cons: why do I have to use my mouse or my email for things?

Sixth Iteration: Improving sleuthing tools, Graphing integrated with IRC bot (!!)

Next up: alerting on perf improvements to celebrate more.

Maintaining the biggest machine in the world with mobile apps
★★☆☆☆

Lukasz Pater, CERN

What is CERN? The world’s largest research center for particle physics.
Also, the place where the web was born.

The Large Hadron Collider, or LHC, is millions of high-tech components installed in a 27km-long circular tunnel.

Infor EAM is used throughout the whole organization to support asset management. With 1000+ users working in the field, simple and mobile interfaces are vital.

Mobile strategy?

On the backend: Infor EAM, then a middle server that does all the mobile friendly stuff like compression, checksumming, caching, logging, web sockets, etc.

Mobile clients then make use of HTML5 features like localStorage and appcache to be able to function in the tunnel when technicians do maintenance.

Then talk steered towards how JS is not Java, and how jQuery is great. Wat. It went downhill from there.

Conclusion: HTML5 is great, let’s hope JS gets more standardized. Wow. Really.

Performance for the next billion
★★★★☆

Bruce Lawson, Opera

How will the next billion Internet users come online? Asia, Africa.

Number of people in Asia: 4 billion.

Indonesia is the 3rd top user of Facebook, yet they are still stuck with 2G connections!

India is also growing super fast, from 400 million internet users today to 900 million in 2018.

We have more and more low end devices coming online. Those markets aren’t dominated by iPhones and high end Android devices. Websites have to be performant on lower end devices because that’s what Asia and Africa use and will continue to use.

Apps don’t scale in those markets. Websites are the only solution to propagate updates faster.
Solution: installable web apps through web manifest specification. Apple has been doing it in a non-standard way for a long time. Now you can do it on Android and Opera browsers.

In rich nations, 1-2% of income is spent on internet connectivity. In developing nations, up to 10% of income is spent on connectivity. We cannot waste people’s money with large images. Instead, serve the right images for each device! Use responsive images (<picture>, srcset.)

Opera Mini is a proxy browser used heavily in India/Africa to solve this exact problem: get users decent access to the web. How does it work? Opera renders websites on the server, then proxies rendered websites to low-end devices.

“Doesn’t matter how smart your phone is if your network is dumb”

Great. You’re a software company. Now what?
★☆☆☆☆

Patrick Lightbody, New Relic

Talk about Chipotle’s burrito button on the Apple Watch and how this changes everything.

“Every company is now a software company.”

History of monitoring:

“Fast isn’t enough, we need to delight users.”

Twenty Thousand Leagues Inside the optical fiber
★★★☆☆

Ariya Hidayat, Shape Security

This talk was about the history of optical fiber.

At first: talk! Sound waves. As people are further and further away, they naturally tend to switch to visual communication (light waves!) like semaphores, smoke signals, lighting signals.

Bell invented the photophone but instead, telephone took off and radio took over the world.

It’s not until the invention of the laser that light-based communication became popular again. Very popular. Now optical fiber is what everything runs on. The amount of data we can send through one optical fiber is mind-boggling.

How? Since laser light is monochromatic, modulations (amplitude, frequency) are possible, but also multiplexing (by having multiple colors sent through the same optical fiber).

“If I have seen further, it is by standing on the shoulders of giants.” – Isaac Newton.

Beyond the Hype Cycle
★★☆☆☆

Shane Evans, HP

Talk about the enterprise and how HP is helping them move the needle. QA and testing budgets have gone up from 18% to 30%. Yet a prime factor that hinders agile in enterprise is lack of testing.

Unicorns, horses and mules (respectively early, mid-range and late adopters). Mules are companies engineers hate (late adopters), yet people rely on them more than any other company.

How to move forward in enterprise?

  1. Build smarter automation
  2. Address the legacy
  3. Think about process and scalability

Overcoming the challenges of image delivery
★★★☆☆

Mohammed Aboui-magd, Akamai

Users demand better images. Users love images. Web pages are getting heavier and heavier.

Number of images served by Akamai per day: 0.75 trillion

If you want to do it right you need a version of an image for each:

That’s A LOT of different images to pre-generate, store, manage and serve dynamically. How do you manage that complexity?

Akamai does just that. (talk ended on that, no solutions given…kinda sucks.)

Reflections on mountain moving, from the first year of USDS
★★★☆☆

Mikey Dickerson, US Digital Service administrator

Talk about bureaucracy. Bureaucracy is a word everyone hates but it’s really just what happens when a large group of people have to make a decision that matters. Saying “my office has no bureaucracy” is like saying “my town has no climate”. At best, the climate doesn’t affect you, but it’s still there.

Rules to deal with bureaucracy:

Closing wish: apply your human capital to healthcare, education, or energy. Those are the things that matter.

Measurement is not enough
★☆☆☆☆

Buddy Brewer, SOASTA

Performance means very different things to different people.

You have to find relationships between the data to establish relationships between people.

Recruiting for Diversity in Tech
★★★☆☆

Laine Campbell, Pythian

Diversity is a goal unto itself

There are two types of diversity:

Meritocracy is bullshit. It doesn’t work because of implicit biases. That’s why a ridiculously small number of women lead open source projects. People nominate people they’re comfortable with. They don’t make the choice of diversity unless they’re forced or at least incentivized to.

Instead, create goals, enforce and track them!

Inbound recruiting is bad for diversity. Do outbound recruiting (meetups, online groups, LinkedIn, …)

Have a code of conduct and enforce it!

Just try it: eliminate names and pictures from applications. Anonymize online handles and technical test results. It’ll help you build balanced and diverse teams.

Reaching everywhere
★★★☆☆

Tim Kadlec, Akamai

This talk was about how Radio Free Europe’s responsive site was built with performance in mind.

Accessing Radio Free Europe is punishable by death in certain countries like Iran.
Our internet is not their internet.

2099KB is the average page size now. This is not acceptable when pages are accessed over very slow networks.

“The future is already here–it’s just not very evenly distributed”

We have to be performant by default. How? Set a performance budget.

20% rule: most people don’t notice that it’s faster unless it’s at least 20% faster.

Goal set for Radio Free Europe: visually complete in less than 4000ms.

Tips to feel performance as you develop:

Other tools for performance:

Mobile image processing
★★★★☆

Today, 1.8 billion photos are taken per day. Images also add up to 62.4% of the average webpage’s weight!

We usually don’t think about it, but browsers have to perform a lot of work to display an image:

To decode an image, we have to reverse the process of encoding an image. How are JPEGs encoded?

You can view this on chrome://tracing or in Firefox at about:memory. Just capture your page loading and search for DecodeAndSampling.

Decoding images takes a LOT more time to do when images are too big for the requested display size. If the size fits: 5ms. If it’s twice the size: 30ms. If the image is six times bigger: 200-300ms!

Mind your fancy hero images. Please please resize your images!

Badly sized images are impacting:

Talk about chroma subsampling. What the heck is this? Interesting stuff (4:4:4 vs 4:2:2 vs 4:2:0). Basically 4 pixels can be condensed by just taking 2 samples of color (or one) within them.
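
To make the 4:4:4 vs 4:2:2 vs 4:2:0 difference concrete, here is a small sketch that counts how many samples each scheme stores. `sampleCount` is a made-up helper; it assumes one luma plane at full resolution plus two chroma planes (Cb and Cr), which is the usual JPEG setup but was not spelled out in the talk.

```javascript
// Number of samples stored for a width x height image under common
// subsampling schemes. Luma is always kept at full resolution.
function sampleCount(width, height, scheme) {
  // How much each chroma plane is reduced horizontally and vertically.
  const factors = {
    '4:4:4': { h: 1, v: 1 },
    '4:2:2': { h: 2, v: 1 },
    '4:2:0': { h: 2, v: 2 }
  }[scheme];
  if (!factors) throw new Error('unsupported scheme: ' + scheme);
  const luma = width * height;
  const chromaPlane = Math.ceil(width / factors.h) * Math.ceil(height / factors.v);
  return luma + 2 * chromaPlane; // two chroma planes: Cb and Cr
}

// For a 4x2 block of pixels:
// sampleCount(4, 2, '4:4:4') -> 24
// sampleCount(4, 2, '4:2:2') -> 16
// sampleCount(4, 2, '4:2:0') -> 12
```

So 4:2:0 stores half the samples of 4:4:4 before any other compression step kicks in.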

IE led the charge on offloading chroma upsampling (the step that reverses subsampling) to the GPU. With this, decoding images takes less time. Still, resize images!

Takeaways:

Building the new fast MSN
★★★☆☆

Amiya Gupta, Microsoft

Link to slides

Several types of optimizations.

Basic:

Intermediate:

Advanced. Let’s talk about those.

How Microsoft quantifies perceived performance: not speed index, not page phase time (rate of change of pixels displayed on the screen over time)

Microsoft’s approach to visual metrics:

Basically, not one number but several, to help narrow down where the regression might be. Also, heavy use of user timing markers to identify problems. They are displayed in the timeline view in IE’s F12 tools.

Lessons learned:

What is a forced layout operation? How does it happen?

  1. Invalidate a section of the display tree through DOM or style updates
  2. Read property under that tree when invalidated: e.g. clientWidth, getBoundingClientRect, etc
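
The two steps above can be sketched as follows. Both function names are made up, `elements` is assumed to be an array of DOM elements, and batching writes before reads is one common way to avoid the problem, not necessarily the fix presented in the talk.

```javascript
// Anti-pattern: every iteration writes a style (invalidating layout) and then
// reads clientWidth, so the browser has to redo layout on each pass.
function resizeInterleaved(elements) {
  const widths = [];
  elements.forEach(function (el) {
    el.style.width = '50%';      // step 1: invalidate
    widths.push(el.clientWidth); // step 2: read forces a layout
  });
  return widths;
}

// Batched variant: do all the writes first, then all the reads. Only the
// first read has to force a layout; the following reads reuse it.
function resizeBatched(elements) {
  elements.forEach(function (el) {
    el.style.width = '50%';
  });
  return elements.map(function (el) {
    return el.clientWidth;
  });
}
```

With N elements the interleaved version can trigger up to N layouts, while the batched one triggers a single forced layout.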

How to fix?

Caution! HTTP2 is coming!! With HTTP2 all of the following are anti-patterns:

TCP and the lower bound of web performance
★★★★★

John Rauser, Pinterest

Talk about how the web works. Very, very good talk. Nicely delivered.

Link to Slides

Stuart Cheshire said:

Let’s take the example of Seattle/New York latency. If those two cities were linked with a single, continuous piece of optical fiber, the theoretical latency would be 37ms.

Now, what’s the actual Seattle/NYC latency? 90ms. Only a factor of about 2.4! The point is: we’ve already done a pretty good job at reducing latency. Latency is going to be there no matter what we do and what protocol we use.
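
A back-of-the-envelope sketch of where a number like 37ms comes from. The inputs are assumptions, not figures from the talk: a great-circle distance of roughly 3,870 km between the two cities, a refractive index of about 1.47 for fiber, and the latency being a round trip.

```javascript
const SPEED_OF_LIGHT_KM_S = 299792.458; // in vacuum

// Round-trip time over a straight fiber of the given length, in milliseconds.
function fiberRoundTripMs(distanceKm, refractiveIndex) {
  const speedInFiberKmS = SPEED_OF_LIGHT_KM_S / refractiveIndex;
  return (2 * distanceKm / speedInFiberKmS) * 1000;
}

// Assumed inputs for Seattle/New York (see the lead-in).
const theoreticalRttMs = fiberRoundTripMs(3870, 1.47);
console.log(theoreticalRttMs.toFixed(1) + 'ms round trip');
```

With these inputs the result lands in the high 30s of milliseconds, the same ballpark as the talk’s theoretical figure.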

Now, story about TCP/IP, the protocol which rules the web today.

Early days of the telephone: people were connected to each other physically, with operators operating switching. One conversation was going on in each circuit.

With a digital form of communication we switched to packet switched networks: messages are broken down into packets, and packets of different messages can be interleaved into the same circuit. This allows for a lot of multiplexing and a better utilization rate of our networking infrastructure.

But there’s no free lunch: for packet switched networks, you need congestion control!

RFC 793 is the original TCP specification: reliability is implemented via ACKs, flow control is implemented through the concept of the TCP window.

In October 1986 a series of congestion collapses hit the ARPANET. John Nagle describes the problem really well in RFC 896. Later on, Van Jacobson’s research leads to the concept of TCP slow start.

Delayed ACK had been proposed even earlier, in RFC 813 (1982), the reasoning being that immediate ACKs are a ton of extra overhead in most cases.

RFC 1122 codified both TCP slow start and delayed ACK in 1989.

With those two things in place, network latency strictly limits the throughput of a new TCP connection!
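
A simplified model of why latency caps the throughput of a new connection. It assumes the congestion window doubles every round trip and ignores the handshake, packet loss, receive-window limits and delayed ACKs. The function names, the 1460-byte segment size and the initial windows used below are assumptions for illustration.

```javascript
// Round trips needed to deliver `bytes` on a fresh connection when the
// congestion window starts at `initialSegments` and doubles each round trip.
function slowStartRoundTrips(bytes, initialSegments, mss) {
  let remaining = Math.ceil(bytes / mss); // segments still to send
  let cwnd = initialSegments;
  let roundTrips = 0;
  while (remaining > 0) {
    remaining -= cwnd;
    cwnd *= 2;
    roundTrips += 1;
  }
  return roundTrips;
}

// Lower bound on transfer time: round trips multiplied by the latency.
function minTransferMs(bytes, rttMs, initialSegments, mss) {
  return slowStartRoundTrips(bytes, initialSegments, mss) * rttMs;
}

// 100KB over a 90ms round trip, 1460-byte segments:
// initial window of 10 segments -> 4 round trips -> 360ms
// initial window of 1 segment   -> 7 round trips -> 630ms
```

No amount of bandwidth changes those numbers; only fewer bytes, a shorter round trip or a larger initial window does.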

Takeaways: