Designing, Scoping, and Configuring

Scalable Drupal Infrastructure

Presented 2009-05-30 by David Strauss

Understanding Load Distribution

Predicting peak traffic

Traffic over the day can be highly irregular. To plan for peak loads, design as if all traffic were as heavy as the peak hour of load in a typical month, and then plan for some growth.
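The rule above reduces to simple arithmetic. A minimal sketch (the traffic figure and growth factor are hypothetical, not from the slides):

```python
# Plan capacity as if every hour were as busy as the peak hour of a
# typical month, then add headroom for growth.

def capacity_target(peak_hour_hits: int, growth_factor: float = 1.5) -> float:
    """Requests/second to provision for, given the busiest hour's hit count."""
    peak_rps = peak_hour_hits / 3600   # average rate during the peak hour
    return peak_rps * growth_factor    # plan for some growth

# Example: 1.8M hits in the busiest hour of the month.
print(capacity_target(1_800_000))  # 500 req/s peak * 1.5 growth = 750.0
```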

Analyzing hit distribution

[Pie chart: all traffic (100%) broken down by source and content type. Recoverable labels include human versus crawler traffic, static versus dynamic content, special and normal dynamic pages, anonymous versus authenticated (70%/20%) hits, and a small "paywall bypass" segment, with slices ranging from roughly 3% to 50%.]

Throughput vs. Delivery Methods

[Chart: relative throughput by delivery method, from roughly 10 req/s to 1,000 req/s per box (more dots = more throughput):

‣ Green (Static): Content Delivery Network², Reverse Proxy Cache (on the order of 1,000 req/s)
‣ Yellow (Dynamic, Cacheable): Drupal + Page Cache + memcached¹, Drupal + Page Cache¹
‣ Red (Dynamic): Drupal (on the order of 10 req/s)

¹ Delivered by Apache without Drupal.
² Some actually can do this.]

Objective

Deliver hits using the fastest, most scalable method available.

Layering: Less Traffic at Each Step

[Diagram: traffic reaches your datacenter via DNS round robin, with a CDN serving static assets directly. Inside the datacenter, requests flow Load Balancer → Reverse Proxy Cache → Application Server → Database, with less traffic surviving to each successive layer.]

Offload from the master database

[Diagram: application servers send work to search, a memory cache, and slave databases instead of the master database.]

Your master database is the single greatest limitation on scalability.

Tools to use

‣ Apache Solr for search. (Acquia offers hosting of this now.)
‣ Squid or Varnish for reverse proxy caching.
‣ Any third-party service for CDN.

Do the math

‣ All non-CDN traffic travels through your load balancers and reverse proxy caches. Even traffic passed through to application servers must run through the initial layers.
‣ What hit rate is each layer getting? How many servers share the load?

[Diagram: Traffic → Load Balancer → Reverse Proxy Cache → Application Server.]
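To make "do the math" concrete, here is a sketch of the layer arithmetic; the entering traffic rate and cache hit rate are hypothetical inputs:

```python
def traffic_per_layer(total_rps: float, proxy_hit_rate: float) -> dict:
    """How much traffic each layer must absorb.

    All non-CDN traffic crosses the load balancer and the reverse proxy
    cache; only cache misses fall through to the application servers.
    """
    return {
        "load_balancer": total_rps,
        "reverse_proxy": total_rps,
        "app_servers": total_rps * (1 - proxy_hit_rate),
    }

# Hypothetical: 800 req/s entering, 75% reverse proxy hit rate.
layers = traffic_per_layer(800, 0.75)
print(layers)  # the application servers see only 200 req/s
```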

Get a management/monitoring box

[Diagram: a dedicated Management box alongside the load balancer, reverse proxy cache, application server, and database.]

(Maybe two or three, and have them specialized or redundant.)

Planning + Scoping

Infrastructure goals

‣ Redundancy
‣ Scalability
‣ Performance
‣ Manageability

Redundancy

‣ When one server fails, the website should be able to recover without taking too long.
‣ This requires N+1, putting a floor on system requirements.
‣ How long can your site be down?
‣ Automatic versus manual failover

Performance

‣ Find the “sweet spot” for hardware. This is the best price/performance point.
‣ Avoid overspending on any type of component.
‣ Yet, avoid creating bottlenecks.
‣ Swapping memory to disk is very dangerous.

Relative importance

[Table: relative importance, rated on a scale of ● to ●●●●●, of processors/cores, memory, and disk speed for reverse proxy caches, web servers, database servers, and monitoring boxes. The per-cell ratings were scrambled in extraction; the surviving values range from ●● to ●●●●●.]

Reverse proxy caches

‣ Squid makes poor use of multiple cores. Focus on getting the highest per-core performance. The best per-core performance is often on dual-core processors with high clock rates and lots of cache.
‣ Varnish is much more multithreaded.
‣ 4–8 GB memory, total
‣ Expect 1,000 requests per second, per Squid.
‣ 64-bit operating system if more than 2 GB RAM
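Using the 1,000 req/s-per-Squid figure above, sizing the proxy tier is a one-liner; a sketch with a hypothetical traffic figure, adding one spare box for N+1 redundancy:

```python
import math

def squids_needed(cacheable_rps: float,
                  per_squid_rps: float = 1000,
                  spare: int = 1) -> int:
    """Squid boxes needed for anonymous/crawler traffic, plus N+1 spare(s)."""
    return math.ceil(cacheable_rps / per_squid_rps) + spare

# Hypothetical: 2,500 req/s of cacheable traffic.
print(squids_needed(2500))  # ceil(2.5) = 3 Squids, plus one spare = 4
```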

Web servers

‣ Apache 2.2 + mod_php + memcached
‣ Many processors + many cores is best
‣ 25 Apache threads per core
‣ 50 MB memory per thread, system-wide
‣ 1 GB memory for the system
‣ 1 GB memory for memcached
‣ Configure MaxClients in Apache to the maximum system-wide thread count
‣ Expect 1 request per thread, per second
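The web-server rules of thumb above compose into a sizing sketch (the core count is a hypothetical input):

```python
def web_server_plan(cores: int) -> dict:
    """Apply the per-core rules of thumb from the slide."""
    threads = 25 * cores              # 25 Apache threads per core
    memory_mb = (
        threads * 50                  # 50 MB per thread, system-wide
        + 1024                        # 1 GB for the system
        + 1024                        # 1 GB for memcached
    )
    return {
        "max_clients": threads,       # Apache MaxClients = total thread count
        "memory_mb": memory_mb,
        "expected_rps": threads,      # ~1 request per thread, per second
    }

print(web_server_plan(8))  # 200 threads, ~12 GB memory, ~200 req/s
```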

Database servers

‣ MySQL 5.0 cannot use more than eight cores effectively but gets good gains from at least quad-core processors.
‣ Plan on each Apache thread needing one connection, and add another 50.
‣ Each MySQL connection needs around 6 MB.
‣ MySQL with InnoDB needs a buffer pool large enough to cache all indexes. Start by giving the pool most remaining database server memory and work from there.
‣ 64-bit operating system if more than 2 GB RAM
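The connection and buffer pool rules above can be sketched the same way; the thread count and box memory are hypothetical inputs, and the 1 GB reserved for the OS is an assumption, not from the slides:

```python
def mysql_plan(apache_threads_total: int, server_memory_mb: int) -> dict:
    """Connection and memory sizing per the slide's rules of thumb."""
    max_connections = apache_threads_total + 50   # one per Apache thread, plus 50
    connection_mb = max_connections * 6           # ~6 MB per connection
    # Start the InnoDB buffer pool at most of the remaining memory
    # (here leaving ~1 GB for the OS), then tune from there.
    buffer_pool_mb = server_memory_mb - connection_mb - 1024
    return {"max_connections": max_connections,
            "innodb_buffer_pool_mb": buffer_pool_mb}

# Hypothetical: 400 Apache threads cluster-wide, 16 GB database box.
print(mysql_plan(400, 16384))  # 450 connections, ~12.4 GB buffer pool
```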

Monitoring server

‣ Very low hardware requirements
‣ Choose hardware that is inexpensive but essentially similar to the rest of the cluster to reduce management overhead.
‣ Reliability and fast failover are typically low priorities for monitoring services.

Assembling the numbers

‣ Start with an architecture providing redundancy: two servers, each running the whole stack.
‣ Increase the number of proxy caches based on anonymous and search-engine traffic.
‣ Increase the number of web servers based on authenticated traffic.
‣ Databases are harder to predict, but large sites should run them on at least two separate boxes with replication.

Pressflow

Make Drupal sites scale by upgrading core with a compatible, powerful replacement.

Common large-site issues

‣ Drupal core requires patching to effectively support the advanced scalability techniques discussed here.
‣ Patches often conflict and have to be reapplied with each Drupal upgrade.
‣ The original patches are often unmaintained.
‣ Sites stagnate, running old, insecure versions of Drupal core, because updating is too difficult.

What is Pressflow?

‣ Pressflow is a derivative of Drupal core that integrates the most popular performance and scalability enhancements.
‣ Pressflow is completely compatible with existing Drupal 5 and 6 modules, both standard and custom.
‣ Pressflow installs as a drop-in replacement for standard Drupal.
‣ Pressflow is free as long as the matching version of Drupal is also supported by the community.

What are the enhancements?

‣ Reverse proxy support
‣ Database replication support
‣ Lower database and session management load
‣ More efficient queries
‣ Testing and optimization by Four Kitchens with standard high-performance software and hardware configurations
‣ Industry-leading scalability support by Four Kitchens and Tag1 Consulting

Four Kitchens + Tag1

‣ Provide the development, support, scalability, and performance services behind Pressflow
‣ Comprise most members of the Drupal.org infrastructure team
‣ Have the most experience scaling Drupal sites of all sizes and types

Ready to scale?

‣ Learn more about Pressflow:
  ‣ Pick up pamphlets in the lobby.
  ‣ Request Pressflow releases at fourkitchens.com.
‣ Get the help you need to make it happen:
  ‣ Talk to me (David) or Todd here at DrupalCamp.
  ‣ Email [email protected].

Managing the Cluster

The problem Soware and Configuration

Application Server

Application Server

Application Server

Application Server

Application Server

Objectives: Fast, atomic deployment and rollback Minimize single points of failure and contention Restart services Integrate with version control systems

Manual updates and deployment

[Diagram: one human per application server, each updating a server by hand.]

Why not: slow deployment, non-atomic/difficult rollbacks.

Shared storage

[Diagram: all application servers mount the same NFS share.]

Why not: single point of contention and failure.

rsync

[Diagram: application servers synchronized with rsync.]

Why not: non-atomic, does not manage services.

Capistrano

[Diagram: application servers deployed with Capistrano.]

Capistrano provides near-atomic deployment, service restarts, automated rollback, test automation, and version control integration (tagged releases).
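Capistrano's near-atomic deployments work by unpacking each release into its own directory and flipping a `current` symlink. A minimal sketch of that mechanism (the directory layout and release names are illustrative, not Capistrano's actual code):

```python
import os
import tempfile

def deploy(app_root: str, release_id: str) -> str:
    """Create a release directory and atomically repoint 'current' at it."""
    release_dir = os.path.join(app_root, "releases", release_id)
    os.makedirs(release_dir, exist_ok=True)   # normally: check out code here
    tmp_link = os.path.join(app_root, "current.tmp")
    current = os.path.join(app_root, "current")
    os.symlink(release_dir, tmp_link)
    os.replace(tmp_link, current)             # atomic rename over the old link
    return os.path.realpath(current)

root = tempfile.mkdtemp()
deploy(root, "20090530-1")
deploy(root, "20090530-2")  # rollback is just repointing at an older release
print(os.path.basename(os.path.realpath(os.path.join(root, "current"))))
```

Because the switch is a single rename, no request ever sees a half-updated code tree, and rolling back is as fast as rolling forward.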

Multistage deployment

Deployments can be staged:

cap staging deploy
cap production deploy

[Diagram: code moves from Development to Integration, then to Staging, then to the production application servers, each stage deployed with Capistrano.]

But your application isn’t the only thing to manage.

Beneath the application

[Diagram: cluster-level configuration spans the reverse proxy caches, application servers, and database.]

Cluster management applies to package management, updates, and software configuration. cfengine and bcfg2 are popular cluster-level system configuration tools.

System configuration management

‣ Deploys and updates packages, cluster-wide or selectively
‣ Manages arbitrary text configuration files
‣ Analyzes inconsistent configurations (and converges them)
‣ Manages device classes (app servers, database servers, etc.)
‣ Allows confident configuration testing on a staging server

All on the management box

[Diagram: the management box hosts the development, integration, and staging environments, plus deployment tools and monitoring.]

Monitoring

Types of monitoring

‣ Failure: analyzing downtime, viewing failover, troubleshooting, notification.
‣ Capacity/Load: analyzing trends, predicting load, checking the results of configuration and software changes.

Everyone needs both.

What to use

‣ Failure/Uptime: Nagios, Hyperic.
‣ Capacity/Load: Cacti, Munin.

Nagios

‣ Highly recommended.
‣ Used by Four Kitchens and Tag1 Consulting for client work, Drupal.org, Wikipedia, etc.
‣ Easy to install on CentOS 5 using EPEL packages.
‣ Easy to install NRPE agents to monitor diverse services.
‣ Can notify administrators on failure.
‣ We use this on Drupal.org.

Hyperic

‣ I haven’t used this much, but it’s fairly popular.
‣ More difficult to set up than Nagios.

Cacti

‣ Highly annoying to set up.
‣ One instance generally collects all statistics. (No “agents” on the systems being monitored.)
‣ Provides flexible graphs that can be customized on demand.
‣ Optimized database for perpetual statistics collection.
‣ We use this on Drupal.org and for client sites.

Munin

‣ Fairly easy to set up.
‣ One instance generally collects all statistics. (No “agents” on the systems being monitored.)
‣ Provides static graphs that cannot be customized.

Cluster Problems

Cache/session coherency

‣ Systems that run properly on single boxes may lose coherency when run on a networked cluster.
‣ Some caches, like APC’s object cache, have no ability to handle network-level coherency. (APC’s opcode cache is safe to use on clusters.)
‣ memcached, if misconfigured, can hash values inconsistently across the cluster, resulting in different servers using different memcached instances for the same keys.
‣ Session coherency can be helped with load-balancer affinity.
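The memcached pitfall above is easy to demonstrate: when two web servers hash keys against different server lists, the same key can land on different memcached instances. A toy sketch of naive modulo hashing (illustrative only, not memcached's actual hashing; the addresses are hypothetical):

```python
import hashlib

def pick_server(key: str, servers: list) -> str:
    """Naive modulo hashing: the choice depends on the whole server list."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

web1 = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]
web2 = ["10.0.0.1:11211", "10.0.0.2:11211"]   # misconfigured: one box missing

for key in ["session:42", "cache:front_page", "cache:menu"]:
    a, b = pick_server(key, web1), pick_server(key, web2)
    print(key, "consistent" if a == b else f"split: {a} vs {b}")
```

Consistent hashing (supported by modern memcached clients) limits how many keys move when the server list changes, but every web server must still be configured with an identical list.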

Cache regeneration races

‣ Downside to network cache coherency: synchronized expiration.
‣ Hard to solve: when an item expires, all servers regenerate it at once.

[Diagram: a timeline from the old cached item, through expiration, where every server regenerates the item simultaneously, to the new cached item.]
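One common mitigation (my addition, not from the slides) is to let a single server win a short-lived lock and rebuild the item while everyone else keeps serving the stale copy. A sketch using memcached-style add() semantics, simulated here with an in-process dict:

```python
import time

store = {}  # stands in for memcached: key -> (value, expiry_timestamp)

def add(key: str, value, ttl: float) -> bool:
    """memcached-style add(): succeeds only if the key is absent or expired."""
    now = time.time()
    if key in store and store[key][1] > now:
        return False
    store[key] = (value, now + ttl)
    return True

def get_or_rebuild(key: str, stale_value, rebuild):
    # Only the server that wins the lock regenerates; others serve stale data.
    if add("lock:" + key, 1, ttl=30):
        fresh = rebuild()
        store[key] = (fresh, time.time() + 300)
        return fresh
    return stale_value

print(get_or_rebuild("front_page", "<old html>", lambda: "<new html>"))  # winner rebuilds: <new html>
print(get_or_rebuild("front_page", "<old html>", lambda: "<new html>"))  # loser serves stale: <old html>
```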

Broken replication

‣ MySQL slave servers get out of sync and fall further behind.
‣ No means of automated recovery.
‣ Only solvable with good monitoring and recovery procedures.
‣ Removal from use can be automated, but requires cluster management tools.

Server failure

‣ Load balancers can remove broken or overloaded reverse proxy caches.
‣ Reverse proxy caches like Varnish can automatically use only functional application servers.
‣ Cluster management tools like heartbeat2 can manage service IPs on MySQL servers to automate failover.
‣ Conclusion: each layer intelligently monitors and uses the servers beneath it.

All content in this presentation, except where noted otherwise, is Creative Commons Attribution-ShareAlike 3.0 licensed and copyright 2009 Four Kitchen Studios, LLC.