Pipeline Architectures - Michael Heap

Pipeline Architectures

I'm making a big assumption that most people here are web developers who are used to a request/response model, and that pipelines are quite a new thing to you.

@mheap I'm Michael, and I like hot air balloon rides. I'll also be wearing this jazzy jacket all day, so I'll be nice and easy to find if you have any questions this evening. I'm @mheap on Twitter if you want to tweet nice things about this talk.

Pipelines

Today, we’re here to talk about pipelines. How many of you have worked with a pipeline before?

Pipelines aren’t a new thing. They exist in almost everything we do - take an input, produce an output.

Tea/Coffee

How many of you made a cup of tea or coffee this morning before you headed down here? You worked with a pipeline.

Waterfall

What about at work? I'm sure we all work in an agile way here, but what about that good old waterfall methodology? That's a pipeline too.

Manufacturing

And of course, manufacturing. Manufacturing is pretty much the canonical example of a pipeline. It takes things in, does something, and spits things out the other side.

In fact, manufacturing is a perfect example of how pipelines can make you more efficient. Imagine that you're making a car, and you have one machine that does it all.

Engine: 20 mins Frame: 40 mins Windows: 10 mins

The engine takes 20 minutes to make, the frame takes 40 minutes, and the windows take 10 minutes. If you make them sequentially, that's one car every 70 minutes.

1x Engine: 20 mins 1x Frame: 40 mins 1x Windows: 10 mins

What happens if we parallelise them? Now we're down to one car every 40 minutes - everything is limited by the slowest stage, the frame.

2x Engine: 20 mins 4x Frame: 40 mins 1x Windows: 10 mins

Now make those stages independent and scale the machines up. Get two engine machines and four frame machines. Now, after 40 minutes, we have enough parts for four cars. That's 10 minutes per car - 7x as many cars as if we built them sequentially.

Code Pipelines

The same thing holds when processing data in software. If you get 10 items of data per second, but can only process one per second, you need 10 things processing data to keep up. By having separate components to perform different tasks, you can parallelise and scale them independently.

"This is a Unix system. I know this!" Let's start by going back to basics and think about the command line.

cat attendees.txt

First, you have an input, in this case, it’s a list of attendees here at PHPNW. We just output that data to the screen using cat.

cat attendees.txt | grep Michael

We pipe the output into a utility called grep, which is used for string matching. Here, we're looking for people called Michael.

cat attendees.txt | grep Michael > michaels.txt

And finally, we store the output in a text file called michaels.txt. This is a pipeline: we have an input, we do something, and we have an output.

cat attendees.txt | grep Michael | sort > michaels.txt

We can easily add another component to our pipeline; for example, we want to sort all the Michaels by name. Imagine how difficult this would be in a traditional application when the business requirements change.

Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.

This is the core of the Unix philosophy, which is a great guideline for anyone writing code.

I’ve been thinking, when else do we use pipelines in development? When do we use them without really thinking about it?

read("file.txt").map(u => u.name).filter(n => n === "michael")

If you do any functional development, you're working with a pipeline: you read your input, do some transforms, and have some output.
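The same shape works in PHP too. Here's a minimal sketch, assuming an attendees.json file and a "name" field, both made up for illustration:

// A minimal PHP sketch of the same read/map/filter pipeline.
// The file name and the "name" field are hypothetical.
$users = json_decode(file_get_contents('attendees.json'), true);
$names = array_map(function ($u) { return $u['name']; }, $users);
$michaels = array_filter($names, function ($n) { return $n === 'michael'; });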

producer.go → channel → render()

You can use it in a web context.

$ php foo.php 'bar' → Domain → echo cli_output($output);

You can use it in a command line context. Proper decoupling of your inputs, domain and outputs is a hugely important part of application development.

zmqpp → your application → zmqpp

What if we took this even more literally? What if all your application did was processing? What if input and output were handled for you? Let's get right back down to the Unix philosophy, where everything speaks plain text. We could have a zmqpp process that reads from ZeroMQ and writes to stdout; your application reads that, does its work, and writes to its own stdout, which in turn is piped back into zmqpp.
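A component in that world can be tiny. As a minimal sketch - the uppercasing transform is just a stand-in for real work:

// Transport-agnostic component: read lines from STDIN, transform them,
// write to STDOUT. The surrounding processes own the transport.
while (($line = fgets(STDIN)) !== false) {
    fwrite(STDOUT, strtoupper($line));
}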

zmqpp → your application → redis-pipe

You want to write to Redis instead? No problem!

nsq → your application → nsq

Oh, you need an NSQ pipeline? We can do that too. Your application knows nothing about the transport. (I don't recommend this, as there's a whole host of questions around multiple inputs/outputs, but it's an interesting idea.)

Human Pipelines

Then of course, there are human pipelines.

Conway's law: "Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations."

Turns out this works well for lots of teams that have to collaborate too. You agree contracts and develop against them. Your development process becomes a pipeline, and you can scale components (teams) up and down as required.

Sometimes pipelines look like this. In fact, all of my favourite pipelines look like this (with the magically colour-changing cup and everything).

And sometimes, pipelines can look something like this. This is the architecture diagram of DataSift, a realtime data processing company. It also happens to be where I work.

Data → ??? → Profit!

The general gist of how it works is that data comes in, we do stuff to it, and then it leaves the platform to our customers.

That stuff in the middle might be filtering. We might also do links resolution on the data. We might attach metadata about the user. We might just do the filtering, or just the links augmentation, or just add the user metadata. We might do two of the three, or all three. Our platform doesn't care.

Data → JSON → JSON → JSON → JSON → Profit!

Your application can be built as a pipeline, so long as your internal APIs have solid boundaries. In this case, the input and output data is always JSON and always has the correct format. And because every component takes the same input and produces the same output, we can chain them together in any order. If we need to add user information first, that's easy: just change the order of the pipeline.

Data → JSON → ? → JSON → Profit!

What happens if there are no links to resolve? The component reads data in, checks for what it needs, and if it finds it, performs its job. Otherwise it re-serialises the data and passes it on to the next component in the pipeline.
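A rough sketch of that pass-through behaviour - the "links" field and the resolve_links() helper are hypothetical:

// Act only if the message contains links; otherwise forward it untouched.
$message = json_decode(fgets(STDIN), true);

if (!empty($message['links'])) {
    $message['links'] = resolve_links($message['links']); // hypothetical helper
}

fwrite(STDOUT, json_encode($message) . "\n");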

Data → PHP → PHP → C++ → Profit!

If you have a serialisation format that all languages can talk, you can use any programming language you like.

Data → PHP → Golang → C++ → Profit!

In fact, our links resolution service used to be PHP; now it's written in Go.

Data → PHP → Golang → Michael → Profit!

If we really wanted to, we could replace our filtering engine with a human. It might not be as fast, or as accurate, but the point is that we could do it. (This has actually happened in debugging situations.)

Building Pipelines

So, you know what pipelines are and you know why they're good. How do you go about building one? There are actually two different kinds of pipeline.

Data-Driven

The most common ones are data-driven. These are also the easiest to work with. Something upstream of your component says “hey, I have a job for you to do! Here's all the information that you need”.

Demand-Driven

Sometimes, though, we need to flip that on its head and go to a demand-driven model, where things ask for jobs to complete whenever they're free. Both models have pros and cons, so let's start with data-driven pipelines.

Data-Driven

Data-driven pipelines usually follow one of two models.

push/pull

The first is push/pull. Imagine you have lots of work to do and a set of five people to help you. You give each person a task and wait for them to finish it before giving them another. You can also only remember one task, so if you have a task but no-one to do it, no-one can give you another task. This is known as a blocking model - the process blocks until someone is available to handle the request - and it is primarily used for data pipelines where you need all of the data to be received.
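As a sketch using the php-zmq extension - the endpoint and payload are made up:

// Producer: PUSH distributes tasks round-robin across connected workers,
// and send() blocks once its buffers fill up - the blocking described above.
$context = new ZMQContext();
$push = $context->getSocket(ZMQ::SOCKET_PUSH);
$push->bind('tcp://*:5557'); // hypothetical endpoint
$push->send('some work');

// Worker: PULL blocks on recv() until a task arrives.
$context = new ZMQContext();
$pull = $context->getSocket(ZMQ::SOCKET_PULL);
$pull->connect('tcp://localhost:5557');
$task = $pull->recv();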

Back Pressure

Now, this applies back pressure. If no-one can take your work, what should the upstream do with it? In real life, we use buffers to cope with this: email inboxes, ticketing systems. The same thing is true in programming - we need a buffer. This might be in memory, but generally you'll want something that can persist to disk. We use something called Ogre, which reads and writes to ZMQ, but sadly it's not open source yet (soon!). For my personal projects, I tend to use something like Kafka, from LinkedIn.

pub/sub

The other option is pub/sub, or publish/subscribe. Imagine that each time some work comes in, you just tell the room and everyone runs off to do it. The next time we get some work in, you tell the room and forget about it. If there are people to hear it and perform the task, great. Otherwise, no-one hears it and it gets lost. This is generally used for signal messaging, where multiple people need to be informed of something.

Sequence Numbers

As there is a potential for messages to get lost, each message needs a number. If you get a message that is more than one above the last one you received, you can request all messages since your last number. Sequence numbers are a good verification technique for pub/sub models. However, if you don't get another message, you don't know that you're missing one, so you also need to poll for state periodically.
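The gap check itself is small. In this sketch, replay_from() stands in for whatever "send me everything since N" call your system offers:

// Detect a gap in sequence numbers and ask for a replay.
$lastSeen = 41;                 // last sequence number we processed
$incoming = 45;                 // number on the message just received

if ($incoming > $lastSeen + 1) {
    replay_from($lastSeen + 1); // hypothetical: re-request messages 42-44
}
$lastSeen = $incoming;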

Demand-Driven

The other model is demand-driven. In this model, there's only one paradigm to follow: when a worker is ready, it asks for data. There's no need for you to check who's busy and who can take on the work. They just come and take work when they're ready, complete it, then ask for more.
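A demand-driven worker can be a simple blocking loop. Here's a sketch using the phpredis extension; the queue name and the process_job() handler are made up:

// Worker: block until a job is available, process it, come back for more.
$redis = new Redis();
$redis->connect('127.0.0.1');

while (true) {
    list($queue, $job) = $redis->blPop(['jobs'], 0); // 0 = block forever
    process_job(json_decode($job, true));            // hypothetical handler
}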

Unique Jobs

If you're working with this model, a useful technique is to keep the set of work to be performed unique. If you have something that triggers every 10 seconds but only gets processed every 30, you don't want to add three jobs to the queue for every one that completes. Instead, use a set and keep the jobs unique.
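With phpredis, a set gives you that uniqueness for free - the key and job names here are illustrative:

// sAdd only stores unique members, so queueing a job twice is a no-op.
$redis = new Redis();
$redis->connect('127.0.0.1');

$redis->sAdd('pending-jobs', 'report:2014-10-04');
$redis->sAdd('pending-jobs', 'report:2014-10-04'); // ignored, already present

// A worker removes and returns one job from the set.
$job = $redis->sPop('pending-jobs');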

Transport

One of the biggest decisions when working with a pipeline is deciding on your transport - the way that things communicate with each other. Depending on your domain, you might need total reliability, and you'd choose something like RabbitMQ. You might be better suited to HTTP if it's mainly web requests, or you might opt for ZeroMQ for speed. It doesn't matter which you choose: so long as you have a way to read from the transport and write to the transport, your application shouldn't care. Once a message has been received, your application doesn't mind where it came from; it could even be a hard-coded input in your app. Your application just cares that the data is there.
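One way to keep the application indifferent is to hide the transport behind a small interface. This sketch is illustrative - it isn't an existing library:

// Hypothetical transport abstraction: the application only ever sees this.
interface Transport
{
    public function read();
    public function write($message);
}

// Swap a ZeroMQ implementation for HTTP or RabbitMQ without touching
// the application code; do_work() is a hypothetical domain function.
function run(Transport $in, Transport $out)
{
    $out->write(do_work($in->read()));
}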

Serialization

Once you have chosen a transport, you need to find a way to represent your data. You might struggle to put a Ruby object directly onto a ZeroMQ socket. You might choose JSON, you might choose XML, you might choose nano message. Again, it doesn't matter, so long as all your components talk the same language. This seems easy, right? Make sure everything talks the same language and it will “just work”.

Error Handling

But what happens when it doesn't? What happens when there's an error? Pipelines tend to have a one-way data flow. As you can't go back to the user and say something went wrong, what do we do? Generally, one of two things. Either the data is placed back in the queue so we can try to process it again, or we store it somewhere else along with a log message of what went wrong so that someone can inspect it by hand later. Personally, I'm a fan of doing a bit of both: try processing it a few times, keep track of a unique ID, and if it fails a certain number of times, put it in an error store.
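A sketch of that retry-then-park approach - the attempt counter travels with the message, and the limit of five, process() and the other helpers are all made up:

// Try a few times; after too many failures, park the message with a reason.
$message['attempts'] = isset($message['attempts']) ? $message['attempts'] + 1 : 1;

try {
    process($message);                                        // hypothetical
} catch (Exception $e) {
    if ($message['attempts'] >= 5) {
        error_store($message['id'], $message, $e->getMessage()); // hypothetical
    } else {
        requeue($message); // hypothetical: back on the queue for another try
    }
}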

Testing Pipelines

Pipelines are actually ridiculously easy to test. One of the core concepts of building a pipeline is that components adhere to a contract: data in, data out. This means that when testing pipeline components, we can pretend to be the data sources and consumers and just test the output. We have no interest in how a component does what it does, just that it does it correctly. BDD tools are excellent for this.
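In practice that can be as plain as a PHPUnit test that plays both ends of the pipe: feed a known message in, assert on what comes out. The LinkResolver component and its contract here are invented for illustration:

use PHPUnit\Framework\TestCase;

// Pretend to be the upstream and downstream: known input, asserted output.
class LinkResolverTest extends TestCase
{
    public function testResolvesLinks()
    {
        $input  = ['links' => ['http://t.co/abc123']];   // invented input
        $output = (new LinkResolver())->process($input); // hypothetical component

        $this->assertArrayHasKey('resolved_links', $output); // invented contract
    }
}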

Pros/Cons

There are tradeoffs to make when you write things as a pipeline. There are some awesome pros:

* Well-defined responsibilities
* Easy to scale individual components
* Generally leads to a microservices architecture

But also some fairly substantial cons:

* Added operational complexity
* Latency (a lot of this is serialisation)
* Hitting the network harder

It's down to you to make the decision. We merged two Go projects to get enough throughput (100k/sec, 10-20KB).

Examples

* Invoice generation: Fetch data, Generate PDF, Email user

* Report generation: Fetch data every hour into store (buffer), Once per day, process it

* Image processing: Resizing, Cropping, Make it greyscale, Rotation

If you have any examples of an application that you think could benefit from being a pipeline but you’re not too sure where to start, come and find me for a chat. I’d love to hear about it.

https://www.flickr.com/photos/neotsn/2246759100/ https://www.flickr.com/photos/theenmoy/15255488518 https://www.flickr.com/photos/stuckincustoms/5109788502 https://www.flickr.com/photos/danielfoster/4725849931 https://www.flickr.com/photos/mzmo/8666989650 https://www.flickr.com/photos/usacehq/5529362681 https://www.flickr.com/photos/alpha_auer/5068482201/ https://www.flickr.com/photos/109144586@N08/14863759614/ https://www.flickr.com/photos/94418464@N08/9007117038 https://www.flickr.com/photos/swamibu/2868288357/ https://www.flickr.com/photos/sepehrehsani/5766453552/ https://www.flickr.com/photos/scubasteveo/296747958/ https://www.flickr.com/photos/56218409@N03/15371262455/ https://www.flickr.com/photos/jvk/4417106 https://www.flickr.com/photos/13523064@N03/14448815581 https://www.flickr.com/photos/jamesshade/473206000 https://www.flickr.com/photos/albertogp123/5843577306/ https://www.flickr.com/photos/dpstyles/4835354126/

I just want to say a big thanks to all of these people on Flickr for putting their photos up there as creative commons.

Questions?

@mheap

joind.in/15428

So, this is PHPNW, I’m @mheap, and any feedback on joind.in would be hugely appreciated. Any questions?
