I recently had an odd request: given an Apache log, could I write a tool to query a database and find out what country a given IP is in?
The first rule was that the client did not wish to use the GeoIP database provided by MaxMind, despite the fact that MaxMind’s API already has this functionality… for free.
The second: it had to generate two files, one with the IPs for the country we’re looking for, and one with the unwanted IPs.
The third: it had to be fairly fast, written in PHP, and it had to have both a Web interface and a pretty command line. (Note the “spinnies” around the percentage completed.)
Of course.
Enter WebHosting.Info and their IP-To-Country database. OK, fine. It’s not MaxMind’s software, but it’s CSV. No problem.
Looking at the file, we have five fields: starting IP, ending IP, country code (2 characters), country code (3 characters), and country name.
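For reference, a line of the CSV looks something like this (the values below are illustrative, not copied from the real file):

```
"3639541760","3639558143","US","USA","UNITED STATES"
```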
Let’s import this into an SQL database:
mysql> CREATE DATABASE ip2country;
mysql> USE ip2country;
Now, let’s create our table.
mysql> CREATE TABLE `ip2c` (
    `ip_start` int(10) unsigned NOT NULL default '0',
    `ip_end` int(10) unsigned NOT NULL default '0',
    `country_code2` char(2) NOT NULL default '',
    `country_code3` char(3) NOT NULL default '',
    `country_name` varchar(50) NOT NULL default '',
    PRIMARY KEY (`ip_start`,`ip_end`)
);
Now that we have our table set up, let’s cheat and use the local file support inherent in MySQL (unless it’s been disabled):
mysql> LOAD DATA LOCAL INFILE 'ip-to-country.csv'
    INTO TABLE `ip2c`
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\r\n';
Still with me? Let’s test it.
mysql> SELECT `country_code2`, `country_name` FROM `ip2c`
    WHERE (`country_code2` != "" AND `country_name` != "")
    GROUP BY `country_code2`
    ORDER BY `country_name` ASC LIMIT 5;
+---------------+----------------+
| country_code2 | country_name   |
+---------------+----------------+
| AF            | AFGHANISTAN    |
| AL            | ALBANIA        |
| DZ            | ALGERIA        |
| AS            | AMERICAN SAMOA |
| AD            | ANDORRA        |
+---------------+----------------+
5 rows in set (0.12 sec)
Good. Everything seems to be in there.
Now, then. Here’s the tricky part: this database stores its IP addresses as long integers rather than in dotted-quad notation.
That’s not too bad. We just need to test our IPs in the same long format. You can either do the conversion in PHP with ip2long(), or, as I did, have MySQL do the grunt work with INET_ATON().
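That “long format” is just the four octets packed into one 32-bit integer. The client’s tool was PHP, but the arithmetic is the same in any language; here it is sketched in Python, producing the same value as MySQL’s INET_ATON():

```python
def ip_to_long(ip):
    """Pack a dotted-quad IPv4 address into the 32-bit integer
    that MySQL's INET_ATON() (or PHP's ip2long()) would return."""
    a, b, c, d = (int(octet) for octet in ip.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

print(ip_to_long('216.239.51.104'))  # -> 3639554920
```

One wrinkle worth knowing in the PHP case: on 32-bit builds, ip2long() returns a signed integer, so high addresses come back negative; sprintf('%u', ip2long($ip)) was the usual workaround.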
The only ‘gotcha’ with this approach is that each record covers a range of IPs between a beginning and an end, so you need to test whether your IP falls between (or is equal to) them. Here’s sample MySQL code:
mysql> SELECT `country_name` FROM `ip2c`
    WHERE `ip_start` <= INET_ATON('216.239.51.104')
    AND `ip_end` >= INET_ATON('216.239.51.104');
...
UNITED STATES
The whole reasoning for this is that IP-To-Country attempts to discern when IPs are tunneled or routed to other countries. It’s a bit of a nuisance… it’s essentially asking, “OK, is this phone number between 555-0000 and 555-9999?”
It’s also a bit obnoxious that every IP address costs a MySQL query. I’ve thought about loading the ranges we’re searching for into RAM, but herein lies the problem:
mysql> SELECT NULL FROM `ip2c` WHERE `country_code3` = "USA";
...
14557 rows in set (0.05 sec)
Yep. That’s right. There are nearly fifteen THOUSAND ranges for the US alone. Even in RAM, you’d have to load the full list and test your numeric IP against the start and end of every range. That’s still fairly easy, but it can get messy in PHP.
The problem with this approach is that an IP that matches no range gets tested against all 14,557 entries, so over n addresses you’re doing on the order of n × 14,557 comparisons. You can always break out of the loop early when you find a match… but at what cost?
The SQL version’s loop, by contrast, does a constant amount of work per address — O(n) over the whole log — at the cost of one SQL query per IP.
You’d say, “Well, gee, RAM is faster, right?” Wrong.
Upon delivering the product, I was asked if there was a way to reduce the load (I admit, it sucks to hammer an SQL server, even one dedicated to this purpose). I said that, given the constraints I had to work with, a faster version was unlikely (I suggested moving the project to C, or at least a more numerically friendly scripting language, but was denied).
So, I noted once more that if this client wished to test for the United States, that alone would mean nearly 15,000 range comparisons for every IP lookup. If a match turned up early, it would be faster, but it would never reach the speed of the SQL server.
Still, they persisted… so I wrote a companion tool that loaded the whole table they wanted to search into an array (from MySQL, but with only ONE database call), creating a multilevel array in RAM of each beginning and ending IP, all numerically indexed (does this sound familiar?).
Then, much to my chagrin, I had it loop through this array for each IP, testing every range: if one matched, set the Boolean to true and break; otherwise, finish the loop and return false. After all, there were thousands of ranges it might fall in, and I had to explicitly test each one.
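In outline, that in-RAM loop looked like this (sketched in Python rather than the original PHP, with made-up ranges standing in for the real table):

```python
def in_country(ip_long, ranges):
    """Linear scan over (start, end) pairs: return True on the
    first range containing ip_long, False if none matches."""
    for start, end in ranges:
        if start <= ip_long <= end:
            return True  # the "break" case described above
    return False

# Hypothetical ranges standing in for the ~14,557 US rows.
us_ranges = [(3639541760, 3639549951), (3639549952, 3639558143)]
print(in_country(3639554920, us_ranges))  # -> True
```

Note the worst case: an address that belongs to no range is compared against every entry before the function can answer False.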
This version used the SQL server for just a single query: “Give me the set of IP ranges I need to search.” This made the client quite happy.
Then, I started both versions of the product in “Debug” mode.
They watched as the MySQL version happily chugged away, taking mere milliseconds per IP. The RAM version, however, had to loop through its loaded ranges, trying to fit each number between a start and an end. At best, it took roughly one full second per IP… still not too bad, but I gave it a fairly likely test case of nearly 250,000 IPs.
The MySQL version took 9 seconds. We all opted to give up on the in-RAM version after two and a half hours; it was only near 60,000 IPs, roughly 25% done. Given that the data was uniform for the remainder of the test, it would have taken about ten hours.