Apr 24, 2012 - Creator of innodbstatus.com. OXEVA provides fully managed hosting services since 2005. Mostly pure player
Finding an unusual cause of max_user_connections Olivier Doucet OXEVA co-founder and CEO www.oxeva.fr www.olivierdoucet.info/blog/
[email protected]
1
• Find this talk online http://bit.ly/VdP0PF
• Contacts
[email protected] @ezameku
2
What I’m going to talk about
Despite the session title, the problem was server-side, and not a suboptimized query.
3
Who am I ? OXEVA CEO Contributor to the PHP QA project Involved everyday in MySQL real world customer cases Creator of innodbstatus.com OXEVA provides fully managed hosting services since 2005. Mostly pure players (online press, browser based games)
Biggest website hosted: 12M visitors / day 4
What is my job (and what I’m going to talk about)
Remember me ?
5
« Production is heavily impacted » • Need to restore production quickly! With a paper clip if necessary. – You don’t have time to look at thousands of variables, graphs – You have to find it on your own (well, in our case, that’s our job)
6
« Production is heavily impacted » • You cannot make the situation worse – Restart MySQL ? – Changing random variables to see what happens
7
Great ! This is what this talk is about … • What you should focus on • You are a doctor and you have a sick patient. You need to diagnose iteratively. This is all based on experience …
8
First step : recognize you have a problem • What to monitor 24/7 ? – Hundreds of probes • Hardware : Power usage ; SMART … • OS-level : CPU usage, memory usage … • MySQL level : grab all data from ‘SHOW STATUS’ and store counters (in RRD or equivalent). « The more, the merrier »
9
MySQL SHOW STATUS • How to get data from dozens of server ? – pmysql (from Domas Mituzas) « it can run SQL query on 1000 servers in under 0.1s and scales well with large sets of servers » ./pmysql --servers-file=serverList.txt \ -–query=‘SHOW FULL STATUS’
• Focus on some specific values : – – – – –
aborted_connects Aborted_clients connections Queries / Questions threads_running 10
Environment (1/3) • Hardware – Dell Blade – 4x Intel L7555 (that’s 64 cores with HT on) – 128GB of memory – XFS over NFSv3 – Network : 10G (ixgbe) – No local hard drive
11
Environment (2/3) • Server usage – Main database of a massively-multiplayer online game – Developed in PHP (5.2)
• Software – Linux with Kernel 3.2.0 – Percona Server 5.5.17 (x86_64 of course) – Data fits entirely into buffer pool with ~ 20% free space (see ‘innodb_buffer_pool_pages_free’) 12
Environment (3/3) – A few config highlights : • • • • •
innodb_file_per_table = 1 max_user_connections = 250 max_connections = 510 wait_timeout = 600 Query_cache_size = 0
13
What we see (1/3) • On the PHP side : « mysql_connect() returns a max_user_connection error » • When we show processlist : process list is empty
14
What we see (2/3) - System CPU usage (from /proc/stat)
- There is a CPU spike - but it is far from 100% usage
15
What we see (3/3) - Software Queries / second + threads_running
aborted_connects
Huge spike at 28M
No abort at start
16
First Analysis • This is not some false positive – We do have aborted connections (aborted_connects spike) – CPU usage also shows a spike – There is a gap in queries / second – And problem is happening every 1 or 2 hours (not at specific interval)
=> Problem confirmed, but we need to gather more data to find the cause. Where to look first ? 17
Checking concurrent connections • When queries / sec is dropping, we can see a spike in threads_running Let’s check if any sleeping connections took all available slots
while sleep 0.1 ; do mysqladmin pr | fgrep Sleep | wc -l ; done 15 21 23 20 Nothing weird … 30 and no sleeping connections 35 with Time > 2sec 20 [...]
18
Gathering data (1/3) • Our current metrics have a one minute precision, but we need very precise data, only when the problem is happening => « observation modifies the reality observed ». Use pt-summary and pt-stalk from Percona Toolkit
19
Gathering data (2/3) • pt-summary – Prints an overview of the server • Hardware • Software • Mounted FS • Connections from remote IP •… Useful if you do not remember exactly the environment
20
Gathering data (3/3) • Pt-stalk – Gathers tons of data when a specific event or threshold is triggered. Many parameters, check documentation ! http://www.percona.com/doc/percona-toolkit/2.1/pt-stalk.html wget percona.com/get/pt-stalk chmod u+x pt-stalk ./pt-stalk --dest=/tmp/pt
21
Reproduce error • Try to reproduce the customer code behaviour (with same user/password) : while sleep 0.5 ; do date; mysql -u myuser --password=areumad -e "SELECT 1 ;" > /dev/null ; done
[...] Tue Apr 24 19:58:42 CEST 2012 ERROR 1226 (42000): User 'myuser' has exceeded the 'max_user_connections' resource (current value: 250) Tue Apr 24 19:58:43 CEST 2012 ERROR 1226 (42000): User 'myuser' has exceeded the 'max_user_connections' resource (current value: 250) Tue Apr 24 19:58:43 CEST 2012 [...]
22
Have a break and think
• Raising max_user_connections never solved a stall problem • Error is max_user_connections but processlist is always empty !
23
Time to restore production • No definitive solution ? At least try a temporary fix.
24
Time to restore production • Follow « OHERIC » method – Observation – Hypothesis – Experiment – Results – Interpretation – Conclusion
25
Time to restore production • Follow « OHERIC » method – Observation : max_user_connections limit reached – Hypothesis : Maybe we connect too much / too quickly to the server ? – Experiment : Raising max_user_connections very high – Results : (next slide)
26
Time to restore production • Raise max_user_connections to 600. Leave max_connections to 510 [...] Tue Apr 24 20:04:12 CEST 2012 ERROR 1226 (42000): User 'myuser' has exceeded the 'max_user_connections' resource (current value: 600) Tue Apr 24 20:04:13 CEST 2012 ERROR 1226 (42000): User 'myuser' has exceeded the 'max_user_connections' resource (current value: 600) Tue Apr 24 20:04:13 CEST 2012 [...]
27
Time to restore production • Results : • Monitor aborted_connects live with pt-mext ./pt-mext -r -- "mysqladmin ext -c 8 -i 1" |fgrep Aborted Aborted_clients 214 0 0 0 0 0 Aborted_connects 3571112 0 0 0 0 0
0
0
=> This is working ! We have a temporary fix ! 28
Time to restore production • Interpretation – We connect too much to the server at specific spikes. May be a cache invalidation or something similar.
• Conclusion – We need to monitor network trafic
29
Monitoring network trafic with MySQL • Quite an issue : – A lot of data gathered ! Could impact production – Are you really able to fully read / understand a tcpdump ?! Better read the Matrix – If you use MYSQL_COMPRESS flag, Wireshark will not be able to completely understand MySQL trafic
30
Monitoring network trafic with MySQL • Capture tcpdump –s 0 –c 1000 -n -i eth0 –w file.pcap port 3306
• Analyze it with Wireshark. => See the exact error code (1203)
31
Error codes • 1226 ER_USER_LIMIT_REACHED – User raised when you reach MAX_QUERIES_PER_HOUR
User '%s' has exceeded the '%s' resource (current value: %ld)
• 1203 ER_TOO_MANY_USER_CONNECTIONS User %s has already more than 'max_user_connections' active connections 32
Monitoring network trafic with MySQL • Follow « OHERIC » method • […] => This is definitely NOT a network isssue
33
Finding a bug report • If the problem is not on our side, could it be on MySQL side ? • Check MySQL bugs interface ’max connections’ returned 221 active bugs…
• Check Percona Server bugs interface « max_user_connections issue » bugs.launchpad.net/percona-server/+bug/893348
34
Bug report LP 893348 • Exactly our problem • WITH a working patch \o/ • Report this bug back in MySQL bugs.mysql.com/bug.php?id=65104 • And wait 4 months for Oracle to commit a fix
35
Everything OK ? • No more max_user_connections ! • … But now, we experienced severe stalls from 5 to 20 seconds
36
Investigating kernel issue • On a previous slide, we saw an increase in « system CPU usage », i.e. kernel • How to debug a kernel on a production server? => Use oprofile
37
Debugging a kernel issue (1/5) • Oprofile – A kernel driver – A daemon (on user side) to collect sample data + post profiling tools
• Pros – Sampling (compared to tools like strace) – Can sample kernel too (compared to … strace) – gather data in one place, and allow you to zip samples to perform the analysis you want on an other server. 38
Debugging a kernel issue (2/5) opcontrol --setup --vmlinux=/root/vmlinux -event=CPU_CLK_UNHALTED:4000000 That’s the sampling you want
Need two samples : • One when everything is fine • One when problem happens opcontrol sleep 5 opcontrol opcontrol opcontrol
--start --vmlinux=/root/vmlinux --stop --dump --save=$(date "+%H%M%S") 39
Debugging a kernel issue (3/5) samples % 462229 99.0696 559 0.1198 185 0.0397 139 0.0298 114 0.0244
symbol name poll_idle clear_page_c find_busiest_group clflush_cache_range mutex_spin_on_owner
samples % 1000515 78.9433 217103 17.1300 6822 0.5383 5304 0.4185 4378 0.3454
symbol name poll_idle compaction_alloc compact_zone spin_lock flush_tlb_others_ipi
We found our guy …
40
Debugging a kernel issue (4/5) • Look for function compaction_alloc in Kernel source files => mm/compaction.c line 410 : 406 407 408 409 410 411 412
/* * This is a migrate-callback that "allocates" freepages by taking pages * from the isolated freelists in the block we are migrating to. */ static struct page *compaction_alloc(struct page *migratepage, unsigned long data, int **result)
41
Debugging a kernel issue (5/5) • Check Google / The guy next to you => This is related to ‘huge pages’ functionality • Check if they are enabled $ sysctl vm.nr_hugepages vm.nr_hugepages = 0
• Check Transparent huge pages (THP) cat /sys/kernel/mm/transparent_hugepage/enabled [always] madvise never • Disable it : echo "never" > /sys/kernel/mm/transparent_hugepage/enabled
42
Explanation • Server has 128 GB of memory • Memory can get very fragmented, and once in a while, Linux kernel tried to migrate THP to defrag memory. • But this process can be too agressive … See lkml.org/lkml/2011/7/26/103
43
Thoughts … (1/2) • Gather more data than just queries / second and threads • Keep available some useful tools / command lines (Wiki, …) • 99% of the times, you (or the developer) did something wrong. But sometimes this is just a MySQL bug. It just ‘happens’.
44
Thoughts … (2/2) • Open source software does not mean « bug free » • IF your project is critical, do not spare money on highly qualified people. They will save you in these very cases, where upgrading hardware could not solve anything – Use fully managed services – Or contract MySQL support 24/7 45
• But I’m sure you have questions 46
Tools used in this presentation • Pmysql bazaar.launchpad.net/~mysqlatfacebook/mysqlatfac ebook/tools/files/18/pmysql • Percona Toolkit percona.com/software/percona-toolkit – Especially pt-mext ; pt-summary ; pt-stalk
• Oprofile oprofile.sourceforge.net
47