Surviving a Ceph cluster outage: the hard way

Giant Suspended Octopus Monster Sculpture (Huang Yong Ping)

Intro

GRNET has a wide service portfolio, offered to the academic and research community in Greece and Europe. A prominent place in this portfolio is taken by Okeanos, the GRNET cloud service. The service provides IaaS to the community for the last 4+ years and uses Ceph, among other solutions, for storing files and VM disks.

Ceph was selected for this task on the basis of deploying an open-source, highly redundant, feature-full Software Defined Storage solution, which we could scale at will. However, nice things prove themselves when things go south, and things went south for our Ceph cluster on a Friday evening some weeks ago. Here comes the story of our first multi-day Ceph outage (and hopefully the last one).

The incident

Everyhing started at Friday 08.57PM, when the Ceph cluster object storage daemons (OSDs) started flapping and marking one another down. The problem was most prominent on the OSDs of a specific node (6 out of 162) and our investigation pinpointed that the OSDs of this node were causing all the mess, being reported by other OSDs as down but also reporting other innocent OSDs as down too.

The problem built up until peaking in the next morning when we also lost all the cluster monitors (MONs). The cluster services were down and we experienced a significant outage in VM services. We had to resort to special handling and tooling to bring the cluster in an operating state again but finally all client data were intact and in place. The following paragraphs provide a summarized history of what happened that night and the days that followed trying to resolve the issues.

Phase 1: Flapping OSDs

The Ceph cluster is built on the basis of distributing the load of the service in multiple nodes where the OSDs, the basic building block of any Ceph cluster, are housed. Each OSD is a system daemon, handling the task of storing objects, as requested by the Ceph cluster rules and directives. The OSD daemons are listening for information exchanged between the cluster nodes and OSDs and can proactively inform the MONs, the cluster control plane, about problems they detect. The cluster data is placed in OSDs according to a placement policy, which, in turn, is based on placement groups (PGs); depending on the replication setting set for the cluster, one PG may have two or more OSDs as members, meaning that when the cluster tries to place an object in one PG, the PG will eventually ensure that the object has been written to all member OSDs. In this way redundancy is assured, and moreover the OSDs participating in a PG can check each other, reporting faults to the MONs. In a standard scenario, if an OSD goes down, other OSDs that share membership with the faulty OSD in PGs will report the OSD down and the cluster MONs will update the cluster distribution directives (CRUSHMAP) in order to recover the cluster and the redundancy.

So, theory is nice but does not make your hands dirty. The fact is that late one Friday evening, 6 out of 162 OSDs, all hosted on the same node, started flapping causing substantial recovery IO in the cluster and wrongly marking down other OSDs. Flapping is not a rare problem when you have to manage a large Ceph cluster and most of the times you can trace it back to underlying storage or network connectivity problems. It will either stop when the cause of the problem is eliminated, or when the problematic OSDs will be marked down, leading to a proper recovery of the cluster.

But in this case, things did not proceed as planned. Until now, we haven’t yet found what was the exact, real cause of the problem. We searched for the usual culprits, such as some indication of increased disk or network latency, or an observed overflow or bottleneck in our monitored metrics. Our vendor tools never reported anything, neither in the systems layer nor in the network gear layer. In past events, when we had some similar problems either with faulty hardware or with some network glitch, the Ceph cluster managed to cope with them in a graceful way, but this time it did not.

Was that a ghost? Cosmic radiation? Whatever the case, the Ceph OSDs kept marking one another down and, as the marked down OSDs were not really down, they were requesting the MONs to be reinserted in the cluster. It seems as if the OSDs of the cluster were trying to mark the 6 OSDs of the “problematic” node as down, while at the same time these 6 OSDs were trying to mark the majority of the other OSDs down. Because of this hub and spoke pattern of the problem, we suspect a network-related reason and our current efforts lean towards improving our network-related monitoring metrics and their reporting granularity. There is also a possibility that we have hit a bug on Ceph code due to overload on one or more subsystems, but this has not been proven so far.

Phase 2: IO freeze

Approximately 5 hours after continuous flapping, the 6 guilty OSDs started crashing. There were two kinds of asserts that the OSDs crashed on: suiciding time-out asserts and LevelDB corruption asserts. The former are not unknown when operations hung in the cluster for a long time, but the latter was completely new to us. During the next 50 minutes after the crash, we lost all client IO. A look at cluster logs would show minimal client IO accompanied by appr. 100 PGs peering+down, which in simple words means that, after an OSD failure, the remaining OSDs in the PG were waiting for the failed OSD member to provide the latest and relevant version of the data.

This may come as a surprise to some, as Ceph is supposed to be redundant. It is important to take a break here and explain the specifics of Ceph redundancy. Ceph is known for its “no single point of failure” mantra, but this is a “feature” configured by the administrator at many levels. Basically it comes down to replication size and min_size plus CRUSHMAP configuration and failure domains. We operate our cluster under default CRUSHMAP with the host as a failure domain and with size=2/min_size=1. With replication 2, you usually have two problems:

if you find an inconsistent object you have some hard time to tell which copy is the correct one
if you have flapping OSDs, i.e. osd.0 goes down, osd.1 is up and acting and accept writes. At the next moment, osd.1 goes down and osd.0 comes up. osd.0 becomes primary, but the PG is ‘down’ because osd.1 had the last data. In that case, you need osd.1 to come back so that the PG will work again.

And the latter was exactly the situation we got into. Ceph is pedantic and will try to prevent split-brains in the PG level by halting IO. As an administrator you have two options here: either mark the data on the down OSD as lost and let IO continue or try to bring up the crashed OSDs.

Phase 3: Monitors malfunctioning

The following Saturday morning and after ~6 hours since the 6 OSDs crash, we lost communication with the MONs. We found that the processes were running but they didn’t listen to their ports. It took us some minutes to understand that the monitor processes’ file descriptors were exhausted. What happened probably was that during and after Phase 1 and 2, Monitors FDs started increasing until they were exhuasted. Was that a FD leak? Not sure. We immediately increased the Monitors’ soft FD limits and brought up the monitors quorum in order to start working towards recovering the crashed OSDs. Ceph monitor logs also witnessed the problem as found later:

2016-09-10 07:39:46.425986 7f0b894fe700 0 accepter.accepter no incoming connection? sd = -1 errno 24 (24) Too many open files

You should keep in mind that Ceph monitors log excessively even in minimal debug mode and you should have some log monitoring and alerting mechanism for catching such things on time.

Phase 4: Recovery

Having already realized that marking the crashed OSDs was the wrong choice for our clients’ data, we immediately opted for rescueing them and started working towards bringing the down PGs back to life. What made our job even more complicated was the fact that upon attempts to restart the crashed OSDs, they immediately crashed with absolutely different asserts than the initial ones. For the record, the asserts on which our OSDs crashed for the first time were of two types:

Assert Type 1: Unexpected errors

2016-09-10 02:40:47.153302 7f699671a700 0 filestore(/rados/rd0-19-01) error (1) Operation not permitted not handled on operation 0x23a7e900 (1731767078.0.0, or op 0, counting from 0)
2016-09-10 02:40:47.153338 7f699671a700 0 filestore(/rados/rd0-19-01) unexpected error code
2016-09-10 02:40:47.153339 7f699671a700 0 filestore(/rados/rd0-19-01) transaction dump:
2016-09-10 02:40:47.155778 7f699671a700 -1 os/FileStore.cc: In function ‘unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)’ thread 7f699671a700 time 2016-09-10 02:40:47.153544 os/FileStore.cc: 2761: FAILED assert(0 == “unexpected error”)

accompanied by OSD omap logs indicating LevelDB corruption (under /rados/rd0-19-00/current/omap/LOG) like the following:

2016/09/12-13:10:10.192841 7ff412ca8700 Compaction error: Corruption: corrupted compressed block contents

Assert Type 2: Suicide timeouts

2016-09-10 02:04:19.221253 7ff02e97f700 1 heartbeat_map is_healthy ‘OSD::osd_op_tp thread 0x7feffba94700′ had timed out after 15
2016-09-10 02:04:19.221267 7ff02e97f700 1 heartbeat_map is_healthy ‘OSD::osd_op_tp thread 0x7feffba94700′ had suicide timed out after 150
2016-09-10 02:04:19.224881 7ff02e97f700 -1 common/HeartbeatMap.cc: In function ‘bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)’ thread 7ff02e97f700 time 2016-09-10 02:04:19.221277 common/HeartbeatMap.cc: 79: FAILED assert(0 == “hit suicide timeout”)

On consequent restarts, we saw asserts of the following types:

Assert Type 3

2016-09-10 21:18:40.787404 7f56671e4880 0 osd.115 281185 load_pgs
2016-09-10 21:18:42.544318 7f56671e4880 -1 osd/PGLog.cc: In function ‘static void PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, const pg_info_t&, std::map<eversion_t, hobject_t>&, PGLog::IndexedLog&, pg_missing_t&, std::ostringstream&, std::set<std::basic_string >*)’ thread 7f56671e4880 time 2016-09-10 21:18:42.542169 osd/PGLog.cc: 908: FAILED assert(last_e.version.version < e.version.version)

Assert Type 4

2016-09-10 22:00:23.082908 7fd27fd4f880 0 osd.118 284869 load_pgs
2016-09-10 22:00:32.251331 7fd27fd4f880 -1 osd/PG.cc: In function ‘static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*, ceph::bufferlist*)’ thread 7fd27fd4f880 time 2016-09-10 22:00:32.246505 osd/PG.cc: 2889: FAILED assert(values.size() == 2)

At this point, we understood that the problem was a really complex one and that there was no way to find a cure-all, fix-all solution. While we had not been able to properly identify the source of the problem yet, we were quite sure that the multitude of errors that we were getting from the failed OSDs were due to different ways of how each one of them handled the crash that they originally had.

So we decided that we would split our efforts on two tasks: a) we would try to get as many of the missing data out of the failed OSDs via Ceph tools and try to inject them to the cluster so that their PG pairs could start working again and b) try to circumvent the errors of the OSDs and get them running.

All in all, our recovery mission managed the following:

4.1 Injecting PGs with unfound objects

We exported unfound objects living inside some crashed OSDs and reimported them to a temp OSD. We first created a temp OSD as usual by reweighting it to zero and then we made extensive use of ceph-objectstore-tool. First, we found the OSD with the most fresh PG data for every unfound object and then we exported and reimported:

:~# ceph-objectstore-tool –op export –pgid 6.467 –data-path /rados/rd0-19-03 –journal-path /dev/sde –file /rados/31423export/6.467.export
:~# ceph-objectstore-tool –op import –data-path /rados/162_temp_fs –journal-path /dev/sdi1 –file /rados/31423export/6.467.export

When finished, we let the temp OSD start and the cluster discovered the unfound objects the next seconds. Our goal was to use the same method for every down+peering PG too, but ceph-objectstore-tool had no luck on OSDs with corrupted LevelDBs. At the end of this procedure, we managed to recover approximately 20% of the missing PGs.

4.2 GDB to the rescue

In order to find why OSDs were not booting, we used GDB to step through the crashed OSD boot process. We installed the dbg Ceph packages, downloaded the ceph git repo, checked out the relevant tag, cd-ed into the ceph/src directory and started one of the crashed OSDs via:

gdb –args /usr/bin/ceph-osd -d –cluster ceph –id 118

The process did hang as expected, but we managed to find out that the crash was happening as the OSD was trying to enumerate the PGs in the OSD filesystem. Our understanding of the internals of the OSD services is not that deep, but, as OSDs use filesystem mounts to store their data, it is apparent that they store the objects of each PG in a subfolder of the following structure:

/rados/rd0-00-00/current # ls -l|head
total 8544
drwxr-xr-x 2 root root 4096 Jul 5 10:52 2.1ac_head
drwxr-xr-x 2 root root 4096 Jul 5 10:52 2.1b0_head
drwxr-xr-x 2 root root 4096 Jul 5 10:52 2.242_head
drwxr-xr-x 2 root root 4096 Jul 5 10:52 2.24a_head
drwxr-xr-x 2 root root 4096 Jul 5 10:52 2.27b_head
drwxr-xr-x 2 root root 4096 Jul 5 10:52 2.2db_head
drwxr-xr-x 2 root root 4096 Sep 27 13:06 2.2de_head
drwxr-xr-x 2 root root 4096 Sep 27 13:06 2.2de_TEMP
drwxr-xr-x 2 root root 4096 Sep 27 13:06 2.2fe_head

Each one of these subfolders holds the objects for the relevant PG (2.1ac, 2.1b0, etc). We are running v0.94.7 of Ceph, and the OSD was crashing at osd/PG.cc:2888, where we were falling on an assert, as the daemon tried to read the OMAP values from the PG. In our case, the values.size() was not 2, which means that we were not getting any data or metadata from the PG:

…
map<string,bufferlist> values;
int r = store->omap_get_values(coll, pgmeta_oid, keys, &values);
if (r == 0) {
assert(values.size() == 2);
…

As we watched the program values, we were able to get the PG IDs where the assert was crashing. We looked into the relevant folders and we saw that they were empty. It seemed as if the folders were created at some time during the flaps (the timestamps matched) but did not receive any data. For some reason, which at that time we could not explain, the empty PG folders led to the OSD crashing again and again.

Given the fact that the data for these crash-causing PGs, were safely stored in other places — these PGs were not part of the problematic group — we took the decision to delete these empty folders. All the OSDs, apart from one, managed to start properly after this change, and we were left with only two PGs down, being hosted on the only non-running OSD.

Things started to look a lot brighter at this point.

4.3 LevelDB corruption, journals and calculated risks

We were left with one OSD down, which did not start, even after removing the empty PG folders. The OSD had lost data for two of our PGs (0.26a and 0.49c). Using GDB we figured that we still had problems with the assert. There were empty PG subfolders in the OSD filesystem. But we had clearly deleted all of them in the previous step! What was happening?

Well, the problem was that the OSD journal of this specific OSD had a very large backlog and we suspected that in this backlog there were many PG related actions (assignment of OSDs in PGs). Every time we were trying to delete the empty folders and start the OSD, the journal would replay and recreate the PG folder, as all this late night flapping had created a massive backlog of PG actions for this specific OSD. With no tool to report on the contents of the journal and no way to flush the journal without the OSD running we were stuck.

At this point we had to figure out why we were experiencing this specific failure. We knew that we had empty PG folders in other, running OSDs, and these OSDs did not have any problem starting. We assumed that as the OSDs in this node crashed, the LevelDB that stored the OSD metadata crashed too and that the OSD was trying to recover from that crash. This recovery was hindered by the empty PG folders.

In order to test our assumption, we tried to test the health of the OSD LevelDB, under the omap directory. It was crashed, exactly as stated by the log lines that we had collected. We found out that there is a tool to repair crashed LevelDB, using python and we run this. We had no success as the LevelDB was not actually repaired by this procedure; it was zeroed out. So when we started the OSD again, the journal kicked in, recreated the empty folders, the OSD repair function tried to get the metadata to repair the LevelDB and it crashed.

We had two options at that point.

either mark the objects of the two PGs as lost and inform the affected users or
take a risk and remove the “journal issue” from the equation.

The second option would mean that we would create a new, empty journal and start the OSD with this new empty journal, losing forever whatever transactions we had in the original journal. We started checking timestamps to find out whether the OSD had really any data written in it during the flaps, beyond the PG assignment actions. After a lot of searching, we were pretty sure that no data was written on it.

So we moved with the plan, fingers crossed and after deleting all the empty directories, repairing the LevelDB, and setting up the OSD with the new journal, the OSD started and the two remaining PGs were recovered.

Epilogue

Full cluster recovery was completed on Tuesday 11:46AM and client IO came back to its normal levels immediately after all PGs were brought back to life. It took us some more hour to clean-up and cross check with our developers that Archipelago, our custom storage layer backed by RADOS, was not reporting any more problems. During the outage we also observed:

thread leaks on remaining OSDs and relevant crashes
multiple disk failures

At the time of the outage, we were on Ceph Hammer 0.94.7. No corruption was found or reported by our clients.