managing millions of data services at heroku – live blogging at qcon

Managing Millions of Data Services @ Heroku
Speaker: Gabe Enslein

See the list of all blog posts from the conference

AWS S3 failure

  • February 28, 2017 – AWS S3 outage – pager duty failed to give message
  • Down for about 6 hours
  • Heroku recovered before everyone went to bed (10pm Eastern)
  • Avoid failure by having failover strategies
  • Would have taken 35 years to recover if had to do all tasks manually
  • No Heroku customers lost any data

Concepts

  • Layers of abstraction simplify evelopment
  • Everything rus on hardware at some level down
  • Abstractions can hide real problem
  • Can be harder to reproduce problems
  • Can model many tasks as state machines – both deterministic and non-deterministic moels

“just” implies it is easy. Be skeptical. How easy to repeat? How often is “just”

Automate yourself out of a job – recurring and one off work

If haven’t gotten a heartbeat in a while, don’t know health.

States
Not all states used by all systems

  • installing
  • available
  • uncertain
  • unavailable
  • retiring
  • retired
  • archived
  • terminated
  • restart
  • upgrading

Check on

  • Backups
  • Replication
  • Security
  • Performance

Manual fixes can cause more problems than started with. Immutuable infrastrucure enforces the “just”. Script the exceptions; don’t manually tinker. “Break Glass” in case of emergency procedures still help. Modeling emergency remedies help so computer can fix when detects instead of waking someone up.

Infrastructure is code, not a second class citizen. Test it for functionality, performance and regression.

Then March 15, 2017, there was a Linux denial of service and admin escalation vulnerability. Needed to see none of the images were affected. Can fix image so customers get when start up.

Key Takeaways

  • Automate yourself out of regular operations
  • Have emergency automation in place – scripts, jobs, etc
  • Make routine failover strategies
  • Treat infrastructure as full units
  • Abstractions have their limits

 

Rod Johnson in the cloud – the server side java symposium

Rod Johnson led off with a Dilbert cartoon about the cloud/platform – the pointy haired boss says “it’s like you are a technologist and philosopher all in one” in response to “blah blah blah cloud”.

Benefits of cloud

  • No reason to operate own data center because does not scale. Also noted the environmental impacts due to overprovisioning on small/medium data centers. Large companies will adopt onsite/internal clouds.
  • Compared computing to a utility and noted factories used to operate their own electricity. Really good analogy!

Perceived weakness for Java in the cloud

  • Perceived to be slow and startups look to Ruby and the like. This is a problem because these are the companies that are tomorrow’s leaders
  • Universal hosting not available like PHP

Strengths of Java in the cloud

  • Write once run anywhere
  • Java already in the enterprise
  • In actuality scales better than Ruby

Other Points

  • We think if the internet as free when we use it. However, there is a cost and the energy use is very high.
  • Rod Johnson considers flexibility to be a more important driver than cost. That way pay when need and not guessing up front if a startup will succeed. Compared to the Zipcar model. It costs more per mile but cheaper if don’t drive a lot. Another great analogy.
  • Simplicity, because large organizations have a lot of bureaucracy and provides a way to subvert that bureaucracy [some of that bureaucracy has a reason, accounting and security teams are not going to be happy with subverting things. However, an internal cloud provides benefits without subverting.]
  • Need architectural trade offs for economies of scale
  • If cloud is about business agility, need to be able apps fast which is why Rails shines. Notes the importance of being perceived as being able to start fast. [sounds like this is more valuable than maintenance]
  • It is more important to have one good option than a lot of medium options. Thinks Java community needs a cultural change to pick a stack. Showed how having two IDES of eclipse and intelij means good products and not a lot of poor choices. Interesting NetBeans was left off the list. rails helps because gives you less choices and saves time by encouraging consistency
  • The line between development and operations is blurring. No common tool chain organizational chaos. More self service in cloud because developers deploy

Spring’s thoughts

  • IaaS (infrastructure as a service) – vmware vcloud
  • PaaS (software as a service) – must abstract away OS, provide self service and be a productive programming model
  • Frameworks are key to portability – gave example that Spring isolates you from the app server [in other words if you deploy your container in your application, you don’t have to worry about the external container as much]. I liked the Ruby example better because Rails is. Framework that doesn’t replace the container.
  • Challenges: a lot more and complex data, adapt to RDBMS + NOSQL. Recognize what datastores need to be transactional. plugged Spring data project. Similarly for asynchronous data patterns
  • Visual builders are usually horrible because build uneditable code. Wavemaker is different because Spring owns it. I think there as another reason he tried to state but i didn’t catch it. Maybe it builds on roo?

Java’s phases

  • -2005 : make java work
  • 2005-2010: RIA, REST, higher productivity
  • 2010-?: beyond RDBMS, mobile/tablet, cloud, even higher productivity required

the end was all blatant advertising of spring’s tools.

Live from TSSJS – Testing in the Cloud with Andrew

This afternoon I’m live blogging from TheServerSide Symposium, attending “Breaking all the Rules:  The Myth of Testing and Deployment in the Cloud” presented by fellow CodeRanch Andrew Monkhouse.

1.  CodeRanch History
Andrew opens his talk with a discussion of CodeRanch performance, traffic, and memory usage.  He mentions there have been serious issues in the past that haven’t temporarily brought the performance of the website down to a crawl.  CodeRanch uses a fork of JForum (2008) and staffs around 20 developers among 150 moderators, all volunteer participants, to build a custom community-based forum.

CodeRanch developers use a number of test strategies including smoke tests, load tests, and stress test.

2. Redeploying in the Cloud
The focus of Andrew’s presentation will be to demonstrate how CodeRanch would function in the cloud.  He has allocated a “small EC2 instance” (1.7GB memory, 1 compute unit, moderate I/O performance) within Amazon’s cloud that is useful for quick set up, but extremely small for practical use and reliable testing.

3. Benefits of the Cloud
The cloud provides a relatively clean environment in which it is ‘cheap’ to add extra servers, at least compared to running a local test server which may be allocated for additional processes.  Also, tests can run by anyone, anywhere, which is great for a distributed developer team, since Amazon allows you to assign rights to additional users.

You can easily create, set up, and desotry Amazon Machine Images (AMI), so you do not need to get it right the first time.  Lots of backup storage options, depending on your needs.

4.  Dangers of the Cloud
Testing in the Cloud can be difficult.  First off, it is hard to get absolute numbers because of the nature of the underlying change in hardware happening within the cloud.  Second, calibrating results is difficult.  Also, developers have to remember to shut down servers after they are down with them else other developers will run out of resources.  Finally, network traffic does cost money so testing can be expensive, although hardware is cheaper in cloud.

5.  Test Setup
The performance tests were set up into 3 servers, in part because it turned out to be cost effective to set it up this way based on the individual server options.

  • Application Server running CodeRanch
  • Database Server
  • JMeter Test Server

JMeter was used at the testing suite software given its large variety of features, options, and plugins.

6.  Test Analysis
Andrew discussed some items to keep in mind when testing the server and understanding the results:

  • It is quite possible for the Test Server to have greater load than the Application server, so be sure to monitor all your servers!
  • Try different graphs, they may tell you different things
  • JMeter plugins are your friend, they will help you identify problems with your application server and/or test set up.
  • Don’t forget about ramp up times

7.  Where do we go from here?
Automated weekly performance tests using the latest code base.  Add more stress tests.  Andrew comments there are no plans to move the CodeRanch to the cloud at this time given that we require full-time administrators since we don’t have a dedicated, paid team of administrators.

Conclusion
Andrew presents a compelling argument why you should be testing your applications within the cloud.  It provides a sanitary environment in which to perform a large variety of tests across multiuple servers.  I’ve heard arguments of moving application servers to the cloud, but this is the first time I’ve heard it suggested to migrate all automated testing to the cloud.