200,000,000 Keys in Redis 2.0.0-rc3

I’ve been testing Redis 2.0.0-rc3 in the hopes of upgrading our clusters very soon. I really want to take advantage of hashes and various tweaks and enhancements that are in the 2.0 tree. I was also curious about the per-key memory overhead and wanted to get a sense of how many keys we’d be able to store in our ten machine cluster. I assumed (well, hoped) that we’d be able to handle 1 billion keys, so I decided to put it to the test.

I installed redis-2.0.0-rc3 (reported as the 1.3.16 development version) on two hosts: host1 (master) and host2 (slave).

Then I ran two instances of a simple Perl script on host1:

#!/usr/bin/perl -w
$|++;   # unbuffer STDOUT so any output shows up immediately

use strict;
use Redis;

# Connect to the local master, which listens on the non-default port 63790.
my $r = Redis->new(server => 'localhost:63790') or die "$!";

# Write 100,000,000 keys named "$pid:$num" with random integer values.
# $$ is the process id, so multiple copies of this script won't collide.
for my $key (1..100_000_000) {
    my $val = int(rand($key));
    $r->set("$$:$key", $val) or die "$!";
}

exit;

__END__

Basically that creates 100,000,000 keys with randomly chosen integer values. The keys are “$pid:$num” where $pid is the process id (so I could run multiple copies). In Perl the variable $$ is the process id. Before running the script, I created a “foo” key with the value “bar” to check that replication was working. Once everything looked good, I fired up two copies of the script and watched.

I didn’t time the execution, but I’m pretty sure it took a bit longer than 1 hour and definitely less than 2 hours. The final memory usage on both hosts was right about 24GB.

Here’s the output of INFO from both:

Master:

redis_version:1.3.16
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:10164
uptime_in_seconds:10701
uptime_in_days:0
connected_clients:1
connected_slaves:1
blocked_clients:0
used_memory:26063394000
used_memory_human:24.27G
changes_since_last_save:79080423
bgsave_in_progress:0
last_save_time:1279930909
bgrewriteaof_in_progress:0
total_connections_received:19
total_commands_processed:216343823
expired_keys:0
hash_max_zipmap_entries:64
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:master
db0:keys=200000001,expires=0

Slave:

redis_version:1.3.16
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:5983
uptime_in_seconds:7928
uptime_in_days:0
connected_clients:2
connected_slaves:0
blocked_clients:0
used_memory:26063393872
used_memory_human:24.27G
changes_since_last_save:78688774
bgsave_in_progress:0
last_save_time:1279930921
bgrewriteaof_in_progress:0
total_connections_received:11
total_commands_processed:214343823
expired_keys:0
hash_max_zipmap_entries:64
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:slave
master_host:host1
master_port:63790
master_link_status:up
master_last_io_seconds_ago:512
db0:keys=200000001,expires=0

This tells me that on a 32GB box, it’s not unreasonable to host 200,000,000 keys (if their values are sufficiently small). Since I was hoping for 100,000,000 with likely larger values, I think this looks very promising. With a 10-machine cluster, that easily gives us 1,000,000,000 keys.
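If you want to turn those INFO numbers into a per-key figure, a quick back-of-the-envelope calculation (nothing fancier than dividing used_memory by the key count) looks like this:

#!/usr/bin/perl -w
use strict;

# Rough average cost of one small key/value pair, using the numbers from
# the master's INFO output above. This lumps all of Redis's baseline
# overhead in with the keys, so treat it as a ballpark figure only.
my $used_memory = 26_063_394_000;   # used_memory (bytes)
my $keys        = 200_000_001;      # db0:keys

printf "~%d bytes per key\n", $used_memory / $keys;

That works out to roughly 130 bytes per key, which is consistent with 200,000,000 small key/value pairs fitting in about 24GB.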

In case you’re wondering, the redis.conf on both machines looked like this.

daemonize yes
pidfile /var/run/redis-0.pid
port 63790
timeout 300
save 900 10000
save 300 1000
dbfilename dump-0.rdb
dir /u/redis/data/
loglevel notice
logfile /u/redis/log/redis-0.log
databases 64
glueoutputbuf yes

The resulting dump file (dump-0.rdb) was 1.8GB in size.
I’m looking forward to the official 2.0.0 release. 🙂


Coding Outside My Comfort Zone: Front-End Hacking with jQuery and flot

To folks who’ve read my tech ramblings over the years, it’s probably no surprise that I generally avoid doing front-end development (HTML, CSS, JavaScript) like the plague. In fact, that’s probably one of the reasons I finally migrated my blog from a self-hacked and highly-tweaked MovableType install to WordPress. I spend the majority of my time dealing with back-end stuff: MySQL, Sphinx, Redis, and the occasional custom data store (for a feature we’re launching soon). I try to build and maintain fast, stable, and reliable services upon which people who actually have front-end talent can build usable and useful stuff.

But every now and then I have an itch to scratch.

For the last year and a half, I’ve wanted to “fix” a piece of our internal monitoring system at craigslist. We have a home-grown system for gathering metrics across all our systems every minute as well as storing, alerting, and reporting on that data. One piece of that is a plotting tool that has a web interface which lets you choose a metric (like CpuUser or LoadAverage), time frame, and hosts. When you click the magic button, it sends those selections to a server that pulls the data, feeds it to gnuplot, and then you get to see the chart. It’s basic but useful.

However, I wanted a tool that gave me more control, took advantage of the fact that I have a lot of CPU power and RAM right here on my computer, and made prettier charts. I wanted easier selection of hosts and metrics (with auto-complete as you type instead of really big drop-down lists), plotting of multiple metrics per chart, and a bunch of other stuff. So I went back to a few bookmarks I’d collected over the last year or two and set about building it.

I ended up using JSON::XS and building a mod_perl handler to serve as a JSON endpoint that could serve up lists of hosts and metrics (for the auto-completion) from MySQL as well as the time series data for the plots. That was the easy part. For the front-end I used jQuery (the wildly popular JavaScript toolkit) and flot (a simple but flexible and powerful charting library built on jQuery). It took a lot of prototyping and messing around to get the JavaScript bits right. That is due largely to my lack of knowledge and experience. It’s frustrating when nothing happens and I’m left wondering what good debugging tools might exist. But instead of actually bothering to install and learn something like Firebug, I just charge ahead and try to reason out WTF is going on. Eventually I get somewhere.
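For the curious, a minimal sketch of that kind of mod_perl handler looks something like the following. The package, database, table, and column names here are made up for illustration, not the real schema; the real handler also serves the time series data for the charts, but the shape is the same: pull rows from MySQL, encode_json them, and print.

package My::MetricsJSON;

use strict;
use warnings;

use Apache2::RequestRec ();              # $r->args, $r->content_type
use Apache2::RequestIO ();               # $r->print
use Apache2::Const -compile => qw(OK);
use DBI;
use JSON::XS;

sub handler {
    my $r = shift;

    # Very naive query-string parsing (no URL-decoding); fine for a sketch.
    my %args   = map { split /=/, $_, 2 } split /&/, ($r->args || '');
    my $prefix = $args{term} || '';

    my $dbh = DBI->connect(
        'dbi:mysql:database=metrics;host=localhost',
        'monitor', 'secret', { RaiseError => 1 },
    );

    # Host names matching what the user has typed so far, for auto-complete.
    my $hosts = $dbh->selectcol_arrayref(
        'SELECT name FROM hosts WHERE name LIKE ? ORDER BY name LIMIT 25',
        undef, "$prefix%",
    );

    $r->content_type('application/json');
    $r->print(encode_json({ hosts => $hosts }));

    return Apache2::Const::OK;
}

1;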

As with most things, the initial learning curve was steep. But I eventually started to feel a little comfortable and productive. I had a few patterns for how to get things done and, most importantly, I understood how they worked. So I was able to piece together a first version with all the minimal functionality I thought would be good to have. Yesterday I made that first version available internally. There’s already a wishlist for future features, but I’m happy with what I have as a starting point.

It was fun to step out of the normal stuff I do. Getting out of my comfort zone to work on front-end stuff gave me a renewed appreciation for all this browser technology (it works on Firefox, Chrome, and Safari), the libraries I built upon, and all the collective work that must have produced them. Aside from scratching my own itch, I now feel a little less resistant to hack on JavaScript. So that means I’m just a bit more likely to dive into something similar in the future.

It feels like I stretched parts of my brain that don’t normally get much of a workout. I like that.


11 Responses to 200,000,000 Keys in Redis 2.0.0-rc3

  1. Sean Porter says:

    Thanks!

    Great post, wish you had recorded the run time 🙂

    Have you looked into using virtual memory?

    I’m curious to see the performance impact when values are pulled from disk back into memory. Having rarely-used key values stored on disk seems more cost-effective, especially when reaching the ONE BILLION DO… keys (with more realistic/larger values).

    Would you be up for another go?

    Thanks again!

  2. Sean,

    VM is really only appropriate for situations where you have fairly large values behind your keys. I suspect that I could make it “work” but I’d get maybe 500,000,000 on a single node at best (since the keys still need to be in RAM).

    I suppose it’d be fun to try on one of our Fusion-io equipped machines! 🙂

  3. Sandeep says:

    How large a value are you realistically looking to store?
    It will be very interesting to see how it works when you store, say, 10K of XML per key.

    You see, what you are doing is fitting the value in one “machine word”; only with larger text will realistic performance become apparent.

  4. Pingback: 1,250,000,000 Key/Value Pairs in Redis 2.0.0-rc3 on a 32GB Machine « Jeremy Zawodny's blog

  5. Pingback: Top Posts — WordPress.com

  6. Pedro Lopes says:

    A little off topic – what is ‘$|++’ (2nd line of your script)? Hard to find an explanation on Google.

    • jenya says:

      If you set $| to true ($| = 1 or simply $|++), it makes filehandle output unbuffered (flushes the buffer immediately).

  7. Ryan Detzel says:

    Hey, can I ask what you’re using this for? We’re researching doing something similar, and we probably need to store 300-500M keys, so we’re trying to decide if we should just use one nice machine or a few decent machines in a cluster setup, using some type of internal sharding to know where to fetch the data from.

  8. Salman says:

    What about read/write performance? Was it linear with that many keys in Redis?

  9. Pingback: 用Redis存储大量数据 : NoSQLfan

  10. Nick says:

    You can use sharding with keys (PHP example):
    $servers = array("192.168.0.1", "192.168.0.2");
    $host = crc32(md5($key)) % count($servers);
    Then connect and store/read the key from that host.
    However, in this setup you will not be able to do set operations between the servers
    (e.g. intersect/union sets that are located on different servers).

    I tried client-side intersects on 1M+ sets on 2 different servers, and the wait time is huge (3-4 seconds for long sets),
    since all the data needs to be sent to the client.