Making machine metadata visible
R.I. Pienaar | July 29, 2010 | 9:04 am | DevOps, Uncategorized, mcollective, puppet | Comments closed

I’m quite the fan of data, metadata and querying these to interact with my infrastructure rather than interacting by hostnames and wanted to show how far I am down this route.

This is more an iterative ongoing process than a fully baked idea at this point since the concept of hostnames is so heavily embedded in our Sysadmin culture. Today I can’t yet fully break away from it due to tools like nagios etc still relying heavily on the hostname as the index but these are things that will improve in time.

The background is that in the old days we attempted to capture a lot of metadata in hostnames, domain names and so forth. This was kind of OK since we had static networks with relatively small numbers of hosts. Today we do ever more complex work on our servers and we have more and more servers. The advent of cloud computing has also brought with it a whole new pain of unpredictable hostnames, rapidly changing infrastructures and a much bigger emphasis on role based computing.

My metadata about my machines comes from 3 main sources:

  • My Puppet manifests – classes and modules that get put on a machine
  • Facter facts with the ability to add many per machine easily
  • MCollective stores the metadata in a MongoDB and lets me query the network in real time

Puppet manifests based on query

When setting up machines I keep some data like database master hostnames in extlookup but in many cases I am now moving to a search based approach to finding resources. Here’s a sample manifest that will find the master database for a customer’s development machines:

$masterdb = search_nodes("{'facts.customer': '${customer}', 'facts.environment': '${environment}', classes: 'mysql::master'}")

This is a MongoDB query against my infrastructure database. For a given node it finds the name of a node that has the class mysql::master on it; by convention there should be only one per customer in my case. When using it in a template I can get back full objects with all the metadata for a node. Hopefully with Puppet 2.6 I can get full hashes into Puppet too!
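To make the query above concrete, here is a minimal Ruby sketch of the same match run against an in-memory list of node documents rather than MongoDB. The document layout (a "facts" hash plus a "classes" array), the sample hosts and the find_nodes helper are all assumptions for illustration; this is not the search_nodes implementation.

```ruby
# Hypothetical node documents, shaped the way the query above implies:
# a "facts" hash and a "classes" array per node.
NODES = [
  { "fqdn"    => "db1.acme.net",
    "facts"   => { "customer" => "acme", "environment" => "development" },
    "classes" => ["mysql::master", "iptables"] },
  { "fqdn"    => "web1.acme.net",
    "facts"   => { "customer" => "acme", "environment" => "development" },
    "classes" => ["apache"] },
  { "fqdn"    => "db1.other.net",
    "facts"   => { "customer" => "other", "environment" => "production" },
    "classes" => ["mysql::master"] }
]

# Return the names of nodes whose facts all match and that carry the class.
def find_nodes(nodes, facts, klass)
  matching = nodes.select do |node|
    facts.all? { |name, value| node["facts"][name] == value } &&
      node["classes"].include?(klass)
  end
  matching.map { |node| node["fqdn"] }
end

wanted = { "customer" => "acme", "environment" => "development" }
puts find_nodes(NODES, wanted, "mysql::master").inspect
```

By the one-master-per-customer convention the result should hold a single name, so a caller would probably want to treat zero or several matches as an error.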

Making Metadata Visible

With machines doing a lot of work, filling a lot of roles etc and with more and more machines you need to be able to tell immediately what machine you are on.

I do this in several places, first my MOTD can look something like this:

   Welcome to Synchronize Your Dogmas 
            hosted at Hetzner, Germany
 
        Puppet Modules:
                - apache
                - iptables
                - mcollective member
                - xen dom0 skeleton
                - mw1.xxx.net virtual machine

I build this up using snippets from my concat module; each important module like apache can just put something like this in:

motd::register{"Apache Web Server": }

Because this is managed by my snippet library, removing the include line from the manifests automatically updates the MOTD.
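The register-a-snippet idea can be sketched in a few lines of Ruby. This is only an illustration of the pattern, assuming a hypothetical Motd class; the real thing is built from Puppet concat fragments, not from code like this.

```ruby
# Hypothetical stand-in for the concat-based MOTD: modules register a line,
# and the final text is rebuilt from whatever is currently registered.
class Motd
  def initialize(banner)
    @banner = banner
    @modules = []
  end

  # Roughly what one motd::register call contributes: a single named entry.
  def register(name)
    @modules << name unless @modules.include?(name)
    self
  end

  # A module that is no longer included simply has no entry any more.
  def unregister(name)
    @modules.delete(name)
    self
  end

  # Assemble the banner plus one sorted line per registered module.
  def render
    lines = [@banner, "", "Puppet Modules:"]
    lines.concat(@modules.sort.map { |name| "        - #{name}" })
    lines.join("\n")
  end
end

motd = Motd.new("Welcome to Synchronize Your Dogmas")
motd.register("apache").register("iptables")
puts motd.render
```

The point of the design is that no module needs to know about any other: the output is always derived from the current set of registrations.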

With a big block of welcome done, I now also need to be able to show in my prompts what a machine does, who it’s for and, importantly, what environment it is in.

Above is a shot of 2 prompts in different environments; you can see the customer name, environment and major modules. As with the MOTD, I have a prompt::register define that modules use to register into the prompt.

SSH Based on Metadata

With all this meta data in place, mcollective rolled out and everything integrated it’s very easy to now find and access machines based on this.

MCollective does real time resource discovery, so keeping with the mysql example above from puppet:

$ mc-ssh -W "environment=development customer=acme mysql::master"
Running: ssh db1.acme.net
Last login: Thu Jul 29 00:22:58 2010 from xxxx

$

Here I am ssh’ing to a server based on a query. If it found more than one machine matching the query, a menu would be presented offering me a choice.
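The filter string passed to -W mixes facts (key=value pairs) with class names. A rough Ruby sketch of how such a string could be split is below; parse_filter is a hypothetical helper written for this illustration and is not MCollective's own parser.

```ruby
# Split a combined filter into fact filters (key=value) and class names.
# Anything containing "=" is treated as a fact; everything else as a class.
def parse_filter(filter)
  facts = {}
  classes = []

  filter.split.each do |token|
    if (match = token.match(/\A([^=]+)=(.+)\z/))
      facts[match[1]] = match[2]
    else
      classes << token
    end
  end

  { facts: facts, classes: classes }
end

puts parse_filter("environment=development customer=acme mysql::master").inspect
```

A node would then qualify only if every fact matches and every listed class is present, which is the same shape as the database query used from Puppet earlier.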

Monitoring Based on Metadata

Finally, setting up monitoring and keeping it in sync with reality can be a big challenge, especially in dynamic cloud based environments. Again I deal with this through discovery based on metadata:

$ check-mc-nrpe -W "environment=development customer=acme mysql::master"  check_load
check_load: OK: 1 WARNING: 0 CRITICAL: 0 UNKNOWN: 0|total=1 ok=1 warn=0 crit=0 unknown=0 checktime=0.612054
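The summary line above is an aggregate over every node that answered. A minimal Ruby sketch of that kind of aggregation follows, using the standard Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The summarize and overall_exit_code helpers, and the rule that the worst state wins, are assumptions made for this example rather than the behaviour of check-mc-nrpe itself.

```ruby
# Standard Nagios plugin exit codes mapped to their names.
STATUS_NAMES = { 0 => "OK", 1 => "WARNING", 2 => "CRITICAL", 3 => "UNKNOWN" }

# Count how many nodes returned each state and format a one-line summary.
def summarize(check, exit_codes)
  counts = Hash.new(0)
  exit_codes.each { |code| counts[STATUS_NAMES.fetch(code, "UNKNOWN")] += 1 }
  text = STATUS_NAMES.values.map { |name| "#{name}: #{counts[name]}" }.join(" ")
  "#{check}: #{text}"
end

# Assumed policy: CRITICAL beats WARNING beats UNKNOWN; no replies is UNKNOWN.
def overall_exit_code(exit_codes)
  return 3 if exit_codes.empty?
  return 2 if exit_codes.include?(2)
  return 1 if exit_codes.include?(1)
  return 3 if exit_codes.any? { |code| code != 0 }
  0
end

puts summarize("check_load", [0])
```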

Summary

This is really the tip of the iceberg. There is a lot more that I already do – like scheduling puppet runs on groups of machines based on metadata – but also a lot more to do; this really is early days down this route. I am very keen to get views from others who are struggling with shortcomings in hostname based approaches and how they deal with it.

MCollective Components, Terminology and Flow
R.I. Pienaar | July 28, 2010 | 8:16 am | Uncategorized, mcollective | Comments closed

I often see some confusion about terminology in use in MCollective, what the major components are, where software needs to be installed etc.

I attempted to address this in a presentation and screen cast covering:

  • What middleware is and how we use it.
  • The major components and correct terminology.
  • Anatomy of a request life cycle.
  • And an actual look inside the messages we send and receive.

You can grab the presentation from Slideshare or view a video of it on blip.tv. Below find an embedded version of the slideshare deck including audio. I suggest you view it full screen as there’s some code in it.

Ten of the best HTML5 Spotify mashups
Kenneth | July 25, 2010 | 5:47 pm | Projects | Comments closed

Number 1:  moretrackstrackslikethis.com

A useful mashup of lastfm’s “similar tracks” service and Spotify gives a great way to find new artists and music matching your tastes.  What’s particularly good is that you can match single tracks, so even if you only like one track by a certain artist, you can find good matches.

Written in 100% Pure CSS3, HTML5 and JavaScript.

Number 2: admission that this is just shameless link-bait

There are some other great mashups out there. Check Spotify’s list.

Monitoring ActiveMQ
R.I. Pienaar | July 25, 2010 | 4:13 pm | Code, DevOps, activemq, cacti, monitoring, nagios, ruby | Comments closed

I have a number of ActiveMQ servers, 7 in total: 3 in a network of brokers, the rest standalone. For MCollective I use topics extensively so I don’t really need to monitor them much other than for availability. I do, though, also do a lot of queued work where lots of machines put data in a queue and others process the data.

In the Queue scenario you absolutely need to monitor queue sizes, memory usage and such. You also need to graph things like rates of messages, consumer counts and memory use. I am busy writing a number of Nagios and Cacti plugins to help with this; you can find them on GitHub.

To use these you need to have the ActiveMQ Statistics Plugin enabled.

First we need to monitor queue sizes:

$ check_activemq_queue.rb --host localhost --user nagios --password passw0rd --queue exim.stats --queue-warn 1000 --queue-crit 2000
OK: ActiveMQ exim.stats has 1 messages

This will connect to localhost monitoring a queue exim.stats warning you when it’s got 1000 messages and critical at 2000.
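The threshold logic behind a check like this is small. Here is a hedged Ruby sketch, assuming the thresholds are inclusive (a size equal to the limit already triggers) and using the standard Nagios exit codes; check_queue_size is a hypothetical helper, not code taken from the plugin.

```ruby
# Compare a queue size against warning and critical thresholds and return
# a Nagios style [exit_code, message] pair.
def check_queue_size(queue, size, warn, crit)
  message = "ActiveMQ #{queue} has #{size} messages"

  if size >= crit
    [2, "CRITICAL: #{message}"]
  elsif size >= warn
    [1, "WARNING: #{message}"]
  else
    [0, "OK: #{message}"]
  end
end

code, text = check_queue_size("exim.stats", 1, 1000, 2000)
puts text
```

In a real plugin the script would finish by printing the message and calling exit with the code, since Nagios reads the state from the process exit status.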

I need to add to this the ability to monitor memory usage; this will come over the next few days.

I also have a plugin for Cacti. It can output stats for the broker as a whole and also for a specific queue. First the whole broker:

$ activemq-cacti-plugin.rb --host localhost --user nagios --password passw0rd --report broker
stomp+ssl:stomp+ssl storePercentUsage:81 size:5597 ssl:ssl vm:vm://web3 dataDirectory:/var/log/activemq/activemq-data dispatchCount:169533 brokerName:web3 openwire:tcp://web3:6166 storeUsage:869933776 memoryUsage:1564 tempUsage:0 averageEnqueueTime:1623.90502285799 enqueueCount:174080 minEnqueueTime:0.0 producerCount:0 memoryPercentUsage:0 tempLimit:104857600 messagesCached:0 consumerCount:2 memoryLimit:20971520 storeLimit:1073741824 inflightCount:9 dequeueCount:169525 brokerId:ID:web3-44651-1280002111036-0:0 tempPercentUsage:0 stomp:stomp://web3:6163 maxEnqueueTime:328585.0 expiredCount:0

Now a specific queue:

$ activemq-cacti-plugin.rb --host localhost --user nagios --password passw0rd --report exim.stats
size:0 dispatchCount:168951 memoryUsage:0 averageEnqueueTime:1629.42897052992 enqueueCount:168951 minEnqueueTime:0.0 consumerCount:1 producerCount:0 memoryPercentUsage:0 destinationName:queue://exim.stats messagesCached:0 memoryLimit:20971520 inflightCount:0 dequeueCount:168951 expiredCount:0 maxEnqueueTime:328585.0
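The output is the space separated name:value format that Cacti data input scripts use. If you want to consume it elsewhere, note that some values contain colons themselves (for example queue://exim.stats), so each pair should be split on the first colon only. A small Ruby sketch, with parse_stats being a hypothetical helper:

```ruby
# Turn "name:value name:value ..." into a hash of strings.
# Splitting with a limit of 2 keeps values such as "queue://exim.stats" whole.
def parse_stats(line)
  line.split.each_with_object({}) do |pair, stats|
    name, value = pair.split(":", 2)
    stats[name] = value
  end
end

sample = "size:0 consumerCount:1 destinationName:queue://exim.stats memoryLimit:20971520"
stats = parse_stats(sample)
puts stats["destinationName"]
puts stats["memoryLimit"].to_i
```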

Grab the code on GitHub and follow there, I expect a few updates in the next few weeks.

DevOps talk at London Java Community
Paul's Wibblings | July 17, 2010 | 6:08 pm | Uncategorized | Comments closed
Rumors of my disappearance into a void are somewhat overrated.

I was invited to talk at the London Java Community on DevOps, and I decided that for that community it'd be good to have some examples with some code in. Fortunately my previous employers have open-sourced some of the frameworks used, including their version of feature switches and a handy log configuration servlet.

It was good to catch up with old colleagues from ThoughtWorks, the London DevOps community and also meet new people.

The slides and talk are my own work, and do not represent the opinions of my employer...


rpms built on EL6beta2 might have an issue with CentOS older than 6
Karanbir Singh | July 16, 2010 | 4:09 pm | Linux | Comments closed

I had to build a fresh set of rpms for the areca_cli tools, and decided it was also a good time to rebase some of my local personal build tools to rhel6beta2 since it's there and I need to do some tests with it anyway.

Firstly, there are a couple of cool things with rhel6 beta2 - rpmdevtools is included by default, which helps get things started and helps a bit in managing spec files etc.

What isn't so cool is that the newer rpm in el6beta creates packages which then cause issues with some of the older CentOS releases. E.g. this areca_cli package built on el6beta2/x86_64 for target i686 caused this to happen with yum:

Downloading Packages:
areca_cli-1.83_091103-1.e 100% |=========================| 493 kB    00:00     
Running rpm_check_debug
ERROR with rpm_check_debug vs depsolve:
rpmlib(FileDigests) is needed by areca_cli
rpmlib(PayloadIsXz) is needed by areca_cli
Complete!
(1, [u'Please report this error in https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise%20Linux%205&component=yum'])

Which causes the package to not install. This is on a CentOS-5.2 machine. And I need to maintain that at CentOS-5.2 due to various issues ( no, machine is not available on the internet, and only hosts a single app in production ). Trying the install on a CentOS-5.3 machine I get the exact same issue. Upgrading rpm to 4.4.2.3-18.el5 ( the version in 5.4 and 5.5 ) makes no difference.

On the other hand, builds run on CentOS-5 have no problems installing on EL6Beta. So for the time being it looks like all buildhosts and all build stuff will need to stay on EL5. Especially since that allows you to target CentOS-3 and CentOS-4 as well.

- KB

Note: yeah, I see that url pointing at bugzilla isn't ideal. I'll look into plumbing in a bugs.centos.org reference instead.

DevOps talk at London QCon 2010
R.I. Pienaar | July 16, 2010 | 4:08 pm | DevOps, Front Page | Comments closed

I was invited to London QCon this year to give a talk. I chose to talk about how I’ve helped to build a startup heavily favoring the scenario where developers do support, rollouts and maintenance of their code directly in production.

My talk goes into the approaches I took while thinking about networks, boxes, operating systems, team structure, monitoring and so forth to attain these goals in a way that does not compromise the traditional goals that sysadmins have as a team and profession.

You can watch the talk – 50 minutes roughly – at the InfoQ site.

I should add I was feeling a bit rough on the day and coming down with a cold, but mostly I think I remained more or less conscious during the talk :)

Bootstrapping Puppet on EC2 with MCollective
R.I. Pienaar | July 13, 2010 | 11:10 pm | Code, DevOps, mcollective, puppet | Comments closed

The problem of getting EC2 images to do what you want is quite significant; mostly I find the whole thing a bit flaky and with too many moving parts.

  • When and what AMI to start
  • Once started, how do you configure it from base to functional? Especially in a way that doesn’t become a vendor lock-in.
  • How do you manage the massive sprawl of instances, inventory them and track your assets
  • Monitoring and general life cycle management
  • When and how do you shut them down, and what cleanup is needed. Being billed by the hour means this has to be a consideration

These are significant problems and just the tip of the iceberg. All of the traditional aspects of infrastructure management – like Asset Management, Monitoring, Procurement – are totally useless in the face of the cloud.

A lot of work is being done in this space by tools like Pool Party, Fog, Opscode and many other players like the countless companies launching control panels, clouds overlaying other clouds and so forth. As a keen believer in Open Source I don’t find many of these options appealing.

I want to focus on the 2nd step above here today and show how I pulled together a number of my Open Source projects to automate that. I built a generic provisioner that hopefully is expandable and usable in your own environments. The provisioner deals with all the interactions between Puppet on nodes, the Puppet Master, the Puppet CA and the administrators.

<rant> Sadly the activity in the Puppet space is a bit lacking in the area of making it really easy to get going on a cloud. There are suggestions on the level of monitoring syslog files from a cronjob and signing certificates based on that. Really. It’s a pretty sad state of affairs when that’s the state of the art.

Compare the ease of using Chef’s Knife with a lot of the suggestions currently out there for using Puppet in EC2 like these: 1, 2, 3 and 4.

Not trying to have a general Puppet Bashing session here but I think it’s quite defining of the 2 user bases that Cloud readiness is such an after thought so far in Puppet and its community. </rant>

My basic needs are that instances all start in the same state, I just want 1 base AMI that I massage into the desired final state. Most of this work has to be done by Puppet so it’s repeatable. Driving this process will be done by MCollective.

I bootstrap the EC2 instances using my EC2 Bootstrap Helper and I use that to install MCollective with just a provision agent. It configures it and hooks it into my collective.

From there I have the following steps that need to be done:

  • Pick a nearby Puppet Master, perhaps using EC2 Region or country as guides
  • Set up the host – perhaps using /etc/hosts – to talk to the right master
  • Revoke and clean any old certs for this hostname on all masters
  • Instruct the node to create a new CSR and send it to its master
  • Sign the certificate
  • Run my initial bootstrap Puppet environment, this sets up some hard to do things like facts my full build needs
  • Run the final Puppet run in my normal production environment.
  • Notify me using XMPP, Twitter, Google Calendar, Email, Boxcar and whatever else I want of the new node
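The first step in the list, picking a nearby master, can be sketched as a preference ordered fact comparison. This Ruby example is an illustration only: the fact names ("region", "country"), the master records and the pick_master helper are assumptions, and falling back to the first known master is just one possible policy.

```ruby
# Choose a master by comparing facts in order of preference.
# Falls back to the first known master when nothing matches.
def pick_master(node_facts, masters, preference = %w[region country])
  preference.each do |fact|
    wanted = node_facts[fact]
    next if wanted.nil?

    match = masters.find { |master| master[:facts][fact] == wanted }
    return match[:name] if match
  end

  masters.empty? ? nil : masters.first[:name]
end

# Hypothetical masters as they might be discovered on the network.
MASTERS = [
  { name: "puppet1.example.net", facts: { "region" => "eu-west-1", "country" => "ie" } },
  { name: "puppet2.example.net", facts: { "region" => "us-east-1", "country" => "us" } }
]

puts pick_master({ "region" => "us-east-1", "country" => "us" }, MASTERS)
```

Because the list of masters is passed in rather than configured, a newly discovered master takes part in the choice without any change to the selection code.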

This is a lot of work to be done on every node. And more importantly it’s a task that involves many other nodes like puppet masters, notifiers and so forth. It has to adapt dynamically to your environment and not need reconfiguring when you get new Puppet Masters. It has to deal with new data centers, regions and countries without needing any configuration or even a restart. It has to happen automatically without any user interaction so that your auto scaling infrastructure can take care of booting new instances even while you sleep.

The provisioning system I wrote does just this. It follows the above logic for any new node and is configurable for which facts to use to pick a master and how to notify you of new systems. It adapts automatically to your ever changing environments thanks to discovery of resources. The actions to perform on the node are easily pluggable by just creating an agent that complies with the published DDL like the sample agent.

You can see it in action in the video below. I am using Amazon’s console to start the instance, you’d absolutely want to automate that for your needs. You can also see it direct on blip.tv here. For best effect – and to be able to read the text – please fullscreen.

In case the text is unreadable in the video, a log file similar to the one in the video can be seen here and an example config here.

Past this point my Puppet runs are managed by my MCollective Puppet Scheduler.

While this is all done using EC2 nothing prevents you from applying these same techniques to your own data center or non cloud environment.

Hopefully this shows that you can wrap all the logic needed to do very complex interactions with systems that are perhaps not known for their good reusable API’s in simple to understand wrappers with MCollective, exposing those systems to the network at large with APIs that can be used to reach your goals.

The various bits of open source I used here are:

EC2 Bootstrap Helper
R.I. Pienaar | July 12, 2010 | 3:08 pm | Code, Uncategorized, amazon, cloud, ruby | Comments closed

I’ve been working a bit on streamlining the builds I do on EC2 and wanted a better way to provision my machines. I use CentOS and things are pretty rough to non existent for nicely built EC2 images. I’ve used the Rightscale ones till now and while they’re nice they are also full of lots of code copyrighted by Rightscale.

What I really wanted was something as full featured as Ubuntu’s CloudInit, but I also didn’t feel much like touching any Python. I hacked up something that more or less does what I need. You can get it on GitHub. It’s written and tested on CentOS 5.5.

The idea is that you’ll have a single multi purpose AMI that you can easily bootstrap onto your puppet/mcollective infrastructure using this system. See below for some details.

I prepare my base CentOS AMI with the following mods:

  • Install Facter and Puppet – but not enabled
  • Install the EC2 utilities
  • Setup the usual getsshkeys script
  • Install the ec2-boot-init RPM
  • Add a custom fact that reads /etc/facts.txt – see later why. Get one here.

With this in place you need to create some ruby scripts that you will use to bootstrap your machines. Examples of this would be installing mcollective and configuring it to find your current activemq, or setting up puppet and doing your initial run.

We host these scripts on any webserver – ideally S3 – so that when a machine boots it can grab the logic you want to execute on it. This way you can bug fix your bootstrapping without having to make new AMIs as well as add new bootstrap methods in future to existing AMIs.

Here’s a simple example that just runs a shell command:

newaction("shell") do |cmd, ud, md, config|
    if cmd.include?(:command)
        system(cmd[:command])
    end
end

You want to host this on any webserver in a file called shell.rb. Now create a file list.txt in the same location that just has this:

shell.rb

You can list as many scripts as you want. Now when you boot your instance pass it data like this:

--- 
:facts: 
  role: webserver
:actions: 
- :url: http://your.net/path/to/actions/list.txt
  :type: :getactions
- :type: :shell
  :command: date > /tmp/test

The above will fetch the list of actions – our shell.rb – from http://your.net/path/to/actions/list.txt and then run using the shell action the command date > /tmp/test. The actions are run in order so you probably always want getactions to happen first.
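To illustrate the in-order dispatch described above, here is a self-contained Ruby sketch that parses the same user data and calls a registered handler for each action. It is an assumption-laden stand-in, not the helper's real code: the ACTIONS registry, run_actions, and the two-argument handlers are invented for the example, and the handlers only record what they would do instead of fetching or executing anything.

```ruby
require "yaml"

ACTIONS = {}   # hypothetical registry: action type => handler block
RAN = []       # records what each handler would have done (no side effects)

# Register a handler under a name, loosely mirroring newaction in the post.
def newaction(name, &block)
  ACTIONS[name.to_sym] = block
end

newaction("getactions") { |cmd, _data| RAN << "fetch #{cmd[:url]}" }
newaction("shell")      { |cmd, _data| RAN << "run #{cmd[:command]}" }

# Parse the user data and run each action in the order it was listed.
def run_actions(yaml_text)
  # The user data uses Ruby symbols as keys, so Symbol must be permitted.
  data = YAML.safe_load(yaml_text, permitted_classes: [Symbol])

  Array(data[:actions]).each do |cmd|
    handler = ACTIONS[cmd[:type]]
    if handler
      handler.call(cmd, data)
    else
      RAN << "skipped unknown action #{cmd[:type]}"
    end
  end

  data[:facts]
end

USER_DATA = <<~YAML
  ---
  :facts:
    role: webserver
  :actions:
  - :url: http://your.net/path/to/actions/list.txt
    :type: :getactions
  - :type: :shell
    :command: date > /tmp/test
YAML

facts = run_actions(USER_DATA)
puts RAN
puts facts.inspect
```

Unknown action types are skipped rather than treated as fatal here, which is a design choice of the sketch; a boot script might equally prefer to log loudly and stop.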

Other actions that this script will take:

  • Cache all the user and meta data in /var/spool/ec2boot
  • Create /etc/facts.txt with all your facts that you passed in as well as a flat version of the entire instance meta data.
  • Create a MOTD that shows some key data like AMI ID, Zone, Public and Private hostnames
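Flattening nested instance metadata into simple facts could look something like the Ruby sketch below. The key=value line format, the underscore used to join nested keys and both helper names are assumptions for illustration; the actual layout of /etc/facts.txt may differ.

```ruby
# Flatten a nested hash into a single level, joining nested keys with "_".
def flatten_metadata(data, prefix = nil)
  data.each_with_object({}) do |(key, value), flat|
    name = [prefix, key].compact.join("_")
    if value.is_a?(Hash)
      flat.merge!(flatten_metadata(value, name))
    else
      flat[name] = value.to_s
    end
  end
end

# Render facts as sorted key=value lines (assumed file format).
def facts_lines(facts)
  facts.sort.map { |name, value| "#{name}=#{value}" }
end

# Made-up sample values for illustration.
metadata = {
  "ami-id"    => "ami-12345678",
  "placement" => { "availability-zone" => "us-east-1a" }
}
user_facts = { "role" => "webserver" }

puts facts_lines(user_facts.merge(flatten_metadata(metadata)))
```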

The boot library provides a few helpers that help you write scripts for this environment specifically around fetching files and logging:

    # version is assumed to be set earlier in the script
    ["rubygems-1.3.1-1.el5.noarch.rpm",
     "rubygem-stomp-1.1.6-1.el5.noarch.rpm",
     "mcollective-common-#{version}.el5.noarch.rpm",
     "mcollective-#{version}.el5.noarch.rpm",
     "server.cfg.templ"].each do |pkg|
        # log to console and syslog, then download into /mnt
        EC2Boot::Util.log("Fetching pkg #{pkg}")
        EC2Boot::Util.get_url("http://foo.s3.amazonaws.com/#{pkg}", "/mnt/#{pkg}")
    end

This code fetches a bunch of files from an S3 bucket and saves them into /mnt. Each one gets logged to console and syslog. Using this GET helper has the advantage that it has sane retrying etc built in for you already.
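For readers curious what retrying around a fetch can look like, here is a generic Ruby sketch. It is not the implementation inside EC2Boot::Util.get_url; with_retries is a hypothetical helper and the attempt count and delay are arbitrary.

```ruby
# Run the block, retrying on any StandardError up to `attempts` times.
# The current attempt number is passed to the block; the last error is
# re-raised once the attempts are used up.
def with_retries(attempts, delay = 0)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue StandardError
    raise if tries >= attempts
    sleep delay
    retry
  end
end

# Simulated flaky download that only succeeds on the third attempt.
result = with_retries(3) do |attempt|
  raise "temporary failure" if attempt < 3
  "fetched on attempt #{attempt}"
end
puts result
```

A real download would go inside the block, typically with a delay of a second or more so that a service that is still starting up gets time to respond.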

It’s fairly early days for this code but it works and I am using it. I’ll probably be adding a few more features soon; let me know in comments if you need anything specific or even if you find it useful.

Git Pre Receive Hook For Integrity
Gareth Rushgrove | July 10, 2010 | 11:00 pm | Uncategorized | Comments closed

I’m getting married rather soon so time has been somewhat short (in a good way) for just hacking on stuff, but I’ve finally found a little bit of time to play with something I’ve been mulling over for a while. Namely a continuous deployment workflow using the integrity continuous integration server.

I’m hoping to have an incredibly simple but fully operational example available at some point – mainly to act as a good discussion point. For now here’s my current pre-receive hook.