Disk-to-Disk Backup Architecture at eBay


We get a lot of questions about BAR (Backup, Archive, and Restore). Yes, we here at eBay went our own way a number of years ago, and it was one of the more successful things we've done; I have not even had to think about backups in quite some time. I'll explore a bit of what we've done, then wade through some of the architectural definitions and issues.

eBay implemented a GRID system of deep local storage platforms and directly connected them to our Teradata nodes with GigE; we refer to these as "NearLine" nodes. We also wrote two substantial pieces of software, one to manage the metadata about what to back up and when, and a second to coordinate multiple ARC jobs.

Tape was never an option. As a company, the culture was "always online", so the solution for database backups was to use replication. There simply was not a culture that could support tape, as we found out when we discovered that no one had added backups for the additional clusters after not one but two platform upgrades. We essentially went without backups for six months.

The system geometry we use is two Teradata nodes for each NearLine node. Each NearLine node has nearly 30TB of usable storage, connected by a single 1GigE link. In our current implementation, we can sustain data transfer rates of 100MB/sec per link, so if we need to go 2x faster, we can add an additional link per node and get to 200MB/sec. Essentially this makes for peak transfer rates in our system of ~7.8 GB/sec, or about 25TB an hour.
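
For a rough sanity check of those numbers, here is a back-of-the-envelope sketch in Python. The per-link figures come from the paragraph above; the NearLine node count is my inference from the quoted ~7.8 GB/sec aggregate, not a figure stated in the post.

    # Back-of-the-envelope throughput model for the NearLine backup fabric.
    LINK_RATE_MB_S = 100      # sustained rate per 1GigE link (from the post)
    LINKS_PER_NODE = 1        # adding a second link doubles per-node throughput
    NEARLINE_NODES = 78       # assumption, inferred from the ~7.8 GB/sec figure

    aggregate_mb_s = LINK_RATE_MB_S * LINKS_PER_NODE * NEARLINE_NODES
    aggregate_gb_s = aggregate_mb_s / 1000
    tb_per_hour = aggregate_gb_s * 3600 / 1000

    print(f"~{aggregate_gb_s:.1f} GB/sec, ~{tb_per_hour:.0f} TB/hour peak")
    # -> ~7.8 GB/sec, ~28 TB/hour raw; the post rounds this to about 25TB an
    #    hour, which leaves headroom for real-world overhead.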

Setting up the hardware environment is actually the easy part. The hard part is managing what needs to be backed up on what schedule, allocating the work units to parallel and balanced work streams, and managing and documenting the state of that infrastructure at any given time.

Our backup system starts with metadata that tracks every table on the database, with knowledge about its synchronization state on both of our primary systems, how to select incremental rows on each specific table, the last time that table was backed up, and where the table is to be found in the file system image. This has some broad implications about what information is tracked, which is out of scope for this discussion.
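
As a minimal sketch of the kind of per-table record that metadata implies, the field names below are mine for illustration, not eBay's actual schema:

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class TableBackupMetadata:
        """Per-table record, roughly mirroring what the post says is tracked."""
        database_name: str
        table_name: str
        size_bytes: int                       # drives work-stream balancing
        sync_state_primary: str               # synchronization state on each
        sync_state_secondary: str             #   of the two primary systems
        incremental_predicate: Optional[str]  # how to select incremental rows
        last_backup_at: Optional[datetime]    # last time this table was backed up
        backup_path: Optional[str]            # location in the file system image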

These tables are used to generate a "backup instance", which comprises the set of work to be done for that backup. The software we wrote takes into account the size of the tables, the number of nodes on the system, and the number of concurrent ARC jobs that can be executed in parallel, and then generates a balanced set of ARC scripts to be executed in phases on all the nodes.

The system finds the optimal point to split the work into large tables and small tables: the large tables are taken with Cluster Archives, while the small tables go through a knapsack algorithm, based on table size, that evenly distributes their work across all the nodes.
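
Here is a minimal sketch of that balancing step, using a greedy longest-processing-time heuristic as a stand-in for the actual knapsack logic, which the post does not spell out:

    import heapq

    def balance_small_tables(tables, num_nodes):
        """Spread (table_name, size_bytes) pairs across nodes, largest first.

        A stand-in for the knapsack-style balancing described in the post:
        always hand the next table to the node with the least work assigned.
        """
        # Min-heap of (assigned_bytes, node_index, assigned_table_names)
        heap = [(0, node, []) for node in range(num_nodes)]
        heapq.heapify(heap)
        for name, size in sorted(tables, key=lambda t: t[1], reverse=True):
            assigned, node, bucket = heapq.heappop(heap)
            bucket.append(name)
            heapq.heappush(heap, (assigned + size, node, bucket))
        return {node: (assigned, bucket) for assigned, node, bucket in heap}

Each node's bucket would then feed one of the generated ARC scripts for that phase.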

It then compiles the work sets into Arcmain scripts, creates a new environment across all the nodes for the sets to execute in, distributes the compiled scripts, and executes the body of scripting in parallel. If there is an error, each instance is designed to be restartable from the last known state, which is very friendly to our operations group.
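
A rough sketch of what "restartable from the last known state" could look like, assuming a simple per-instance checkpoint file; the actual state lives in eBay's instance metadata, which the post does not detail, and run_arc_script here is a caller-supplied placeholder:

    import json
    from pathlib import Path

    def run_instance(instance_id, scripts, run_arc_script, state_dir="/tmp/backup_state"):
        """Execute a backup instance's ARC scripts, skipping any already completed.

        run_arc_script is a caller-supplied function (hypothetical here) that
        submits one ARC job and raises on failure. Completed script names are
        checkpointed to disk so a rerun resumes from the last known state.
        """
        state_path = Path(state_dir)
        state_path.mkdir(parents=True, exist_ok=True)
        state_file = state_path / f"{instance_id}.json"
        done = set(json.loads(state_file.read_text())) if state_file.exists() else set()

        for script in scripts:
            if script in done:
                continue                    # finished in a previous run; skip it
            run_arc_script(script)          # raises on error, leaving state intact
            done.add(script)
            state_file.write_text(json.dumps(sorted(done)))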

We originally intended to use the ability of our Dual Active sync infrastructure to generate incremental data sets, but it turns out it is usually cheaper and faster to simply dump the entire contents of a large table…

That much more robust Dual Active sync infrastructure was the starting point for the concept of isolating work groups into "instances", since several sets of work may overlap concurrently. It also already contained the environment setup, compiling infrastructure, restartability, compression, validation, catalog metadata, and data/system state tracking, all supersets of our BAR requirements.

The benefits to eBay of building this infrastructure were substantial. Our operations groups no longer had to fix tape problems off the 8/5/5 shift schedule. Our backup success rate went from 50% per backup to 100% per quarter. Our restore failure rate went from 20% to 0%, because we were no longer using tape.

But most importantly, our backup schedule went from taking all day Saturday, with no users on the system, to several hours on weekday evenings, with our users blissfully unaware we are even doing backups. We even dispensed with our offsite backup contracts, because our sync and backup systems are integrated and those images sit 1,000 miles apart.

2 Comments
akd
Thanks! Great article. This looks great for ongoing/incremental backups. How about baseline backups? Is there ever a need for them? Can you please write in more detail about the metadata tracking methods?
This system can do incrementals, but it was not intended to do them; the system currently takes full snapshots of all tables. If you can support the throughput, it makes things much simpler. So, to answer your question: all the backups are baselines...

The synchronization software we built for Dual Active fully supports finding Delta Row Sets. This was made possible by a convention in which row accounting columns keep the insert and last update timestamps of records. That process breaks moving data between systems into Subset, Extract, Movement, Load, Validation, and Apply phases. Each phase has a compiler which generates scripts for execution from the metadata.

The Subset phase generates SQL which will find the delta rows for every table and, when run, moves them to a database for extract. The data is extracted with ARC, moved to the target system with the SocPARC parallel movement utilities, and loaded on the fly into a temporary database. The data is validated with checksums, then run through an apply phase to insert the data into the matching target tables, and then the data is validated again (if we can generate SQL for delta extraction, we can do the same for validation). The system does NOT support deletes, which is not an issue for us since deletes are not allowed in our DW.
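
As a minimal sketch of what the generated Subset SQL could look like, assuming row-accounting columns named insert_ts and update_ts and a staging database called DELTA_STAGE (the post names the convention but not the actual column or database names):

    def build_subset_sql(database, table, last_sync_ts,
                         staging_db="DELTA_STAGE",
                         insert_col="insert_ts", update_col="update_ts"):
        """Generate SQL that copies delta rows into a staging database for extract.

        Column and staging-database names are illustrative assumptions; the
        convention described above is simply "rows newer than the last sync".
        """
        return f"""
        INSERT INTO {staging_db}.{table}
        SELECT *
        FROM {database}.{table}
        WHERE {insert_col} > TIMESTAMP '{last_sync_ts}'
           OR {update_col} > TIMESTAMP '{last_sync_ts}';
        """

The same metadata could drive a matching validation query, for example comparing row counts or column checksums over the same predicate on source and target.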

The metadata system is a basic version of a data dictionary which tracks tables, columns, stats, indexes, movement history, and state, with columns appropriate to what needs to be tracked.

Teradata is building its own version of this movement infrastructure; have your team ask Teradata about the "Data Mover".