Shortly before the end of August, one of our cloud providers told us they needed us to move from one set of servers to another. Apart from the hassle of moving "servers", I didn't think this was going to be a big deal. I scheduled some time, grumped at the work, and got going.
Right at the boundary between 8/30 and 8/31, I threw the switch. This is what happened:
As should be apparent, things got slower. A lot slower. That graph helpfully shows that anything hitting the database went from 80-ish milliseconds to 300-ish milliseconds, which had a disastrous effect on end-user performance.
I began chatting with our vendor. They thought that maybe the environment they had moved us to was somewhat oversubscribed, but didn't think they could do much to improve it. The best they could do was offer us a chance to move yet again. They gave us some scratch machines in the new environment and I started testing.
As this smelled like a disk performance issue, I started with iozone. Its throughput test was a pretty clear smoking gun -- several days and six test runs produced this graph:
Notice how much slower the read speeds are. That graph shows the average of six results. What it does not show is that the performance varied wildly throughout the day. In the intermediate environment, we saw read speeds ranging from 10 MB/s down to as low as 1 MB/s. And 1 MB/s results in some really awful database performance: at that rate, even a modest 100 MB of cold table and index reads takes over a minute and a half.
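For anyone who wants to run a similar check, an iozone throughput run looks roughly like the following; the thread count, file size, and record size here are illustrative rather than the exact values I used:

    iozone -I -t 4 -s 1g -r 64k -i 0 -i 1 -F t1 t2 t3 t4

Here -I requests direct I/O so the page cache doesn't mask the disk, -t 4 runs four parallel workers, -s and -r set the per-file size and record size, -i 0 -i 1 select the write and read passes (the write pass has to run first so the read pass has files to read), and -F names one scratch file per worker. Given how much the numbers swung during the day, it's worth averaging several runs spread across different hours rather than trusting any single result.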
So we decided to move yet again. Much work was done, and early on the morning of 9/29 I changed DNS. I was met with this result:
Database times were back down. And the world is happy again.
(The observant might notice that the overall times shown in graph3 look a lot better than those in graph1. This is because we did a bunch of software optimization as a workaround for the performance issues.)