Cloudinary Blog

Identifying Countries by IP Address in Columnar Databases Through SQL

Yehonatan Benyamin
By Yehonatan Benyamin
Identifying Countries by IP Address in Columnar Databases Through SQL

Cloudinary reaps a myriad of open web traffic, from ad networks to e-commerce sites. Our Data Science team is dedicated to analyzing the data for use internally and externally.

A glance at any General Data Protection Regulation (GDPR) article would reveal that—unlike Android device IDs (AID), through which users can reset their web address—keeping user identifiers, such as Internal Protocol (IP) and Media Access Control (MAC) addresses, as well as International Mobile Equipment Identity (IMEI), violates privacy. As a solution, you can discard all privacy identifications or make them visible to users for reset.

Even though IP addresses are less user friendly, the data we can extract from them, e.g., popular locations (country, state) or the number of distinct sessions (without the help of cookies), is crucial to our business success. Until recently, to calculate such dimensions, a geo provider or content delivery network (CDN) required the full IP address for processing. For tuning reasons, geo providers offer to have companies cache the key-value results internally or grant companies download privilege of the main logic-dataset.

Recently, while examining Cloudinary access logs for data analysis, we discovered that we received approximately 12.5 million different IP addresses worldwide from a sample of over 118 million requests. Over 99% of those addresses are IPv4s because IPv6 is not exactly ubiquitous yet. This post describes how to pinpoint key details, such as the country of origin, from a columnar database through standard SQL.

The Mapping Database

Most providers offer the following logical dataset to map IP to country with three columns: IPNumFrom, IPNumTo, and CountryCode. About 200K rows would take up 2.5 MB of disk space. IPNum is arrived at through a simple calculation done from the IP. For example, 1.2.3.4 equals 16,909,060 = 1*256*256*256+2*256*256+3*256+4, hence 16,909,060 modulus 256 = 4.

Study 1: If you perform the following query, which converts through JOIN ON BETWEEN on less than 1-5K records in a regular Relational database management system (RDBMS), it might work just fine:

Copy to clipboard
SELECT ip2country.country_code,COUNT(1) requests
FROM (SELECT CAST(SPLIT_PART(ip,'.',1) AS INT)*256*256*256
+CAST(SPLIT_PART(ip,'.',2) AS INT)*256*256
+CAST(SPLIT_PART(ip,'.',3) AS INT)*256 AS ipnum
FROM log
LIMIT 100) log
JOIN ip2country ON log.ipnum BETWEEN ip2country.ipnum_from AND ip2country.ipnum_to
GROUP BY 1 ORDER BY 2 DESC LIMIT 100

Study 2: If you perform the following query, which converts through the SELECT SUBSELECT BETWEEN level, it might work more slowly on a few records but faster on more of them, i.e., up to a large volume of approximately 10-25K rows:

Copy to clipboard
SELECT (SELECT ip2country.country_code
FROM ip2country
WHERE log.ipnum BETWEEN ip2country.ipnum_from AND ip2country.ipnum_to) AS country_code
,COUNT(1) requests
FROM (SELECT CAST(SPLIT_PART(ip,'.',1 AS INT)*256*256*256
+CAST(SPLIT_PART(ip,'.',2) AS INT)*256*256
+CAST(SPLIT_PART(ip,'.',3) AS INT)*256 AS ipnum
FROM log
LIMIT 10000) log
GROUP BY 1 ORDER BY 2 DESC LIMIT 100
  • The 1-5K and 10-25K numbers are rough estimates only. Depending on the database performance, the sizes might vary.
  • In both cases, the real problem occurs when you increase the number of records to millions, after which the query performance will dramatically decline according to how well the database optimizer sees the JOIN or the BETWEEN clause as CARTESIAN JOIN to FULL SCAN on each record’s dataset.
  • SQL databases usually perform JOIN tasks more effectively with an exact EQUAL (=) rather than a BETWEEN condition. At times, solutions like spatial indexing can better imitate the BETWEEN logic.
  • Particularly in columnar databases, those types of solutions might fail altogether since we usually process over 100 million records with more than a million distinct values.

Transformation of Range-Value Datasets to Key-Value Datasets

So, here’s the ultimate question: How do you transform a range-value dataset to a key-value dataset? One option is to take the data and cache the results. For the above example, you would need—in theory—a 12.5 million key-value cache versus a 200K-range cache, which might require a load of bandwidth to join as the dataset dynamically grows.

Instead, we did the following:

  1. Reverse-engineer the range-value dataset into a key-value dataset.
  2. Keep the key-value dataset to the lowest octets⁽¹⁾ as possible, up to a maximum of three octets. ⁽¹⁾Octet is the four IP components that run between 0-255.

In other words:

  • In case 100.0.0.0-100.255.255.255 matches the country U.S.A., specify only 100.0.0.0-US in the first octet dataset.
  • Since you don’t want to receive four-octet IPs any more, discard the four-level octet altogether.
  • If feasible, keep the three-octet datasets apart for higher performance.
  • Update the datasets monthly or quarterly.

Conversion Results

Before doing the conversion, let’s have a look at the results.

In converting the 200K-range dataset to three octet datasets, one for each octet, you can see that, since each octet presents a different size of data (50, 50K, and 2.5MM rows), keeping them apart might be a smart move. Moreover, if a preceding dataset has already given you the result, you can skip the step of joining datasets.

Study 3: As expected, the following query performs much better on a columnar database:

Copy to clipboard
SELECT COALESCE(ipnum12country.country_code, ipnum22country.country_code, ipnum32country.country_code) AS country_code
,COUNT(1) requests
FROM (SELECT CAST(SPLIT_PART(ip,'.',1) AS INT) *256*256*256 AS ipnum1
,CAST(SPLIT_PART(ip,'.',2) AS INT) *256*256 AS ipnum2
,CAST(SPLIT_PART(ip,'.',3) AS INT) *256 AS ipnum3
FROM log
LIMIT 1000000) log
LEFT JOIN ipnum12country ON ipnum12country.ipnum1 = log.ipnum1
LEFT JOIN ipnum22country ON ipnum12country.country_code IS NULL AND ipnum22country.ipnum2 = log.ipnum2
LEFT JOIN ipnum32country ON ipnum12country.country_code IS NULL AND ipnum22country.country_code IS NULL AND ipnum32country.ipnum3 = log.ipnum3
GROUP BY 1 ORDER BY 2 DESC LIMIT 100

Before, the JOIN task took up over 80% of the total execution time. Now it takes less than 5%, with the remaining tasks being the normal ones: the actual data scan, sort, and aggregation.

You might find the data below of interest.

Note
The data and results cited in this article were collected from Cloudinary’s open web traffic and the IP histogram only.

Size and Usage by Octet

Octet Total Size of the Dataset Total No. of Keys in the Dataset No. of Used Keys Against the Dataset No. of Distinctly Matching IPs No. of Processed Requests
1 1 KB 52 8 11,505 44,111
2 180 KB 42,801 25,941 11,231,825 64,582,450
3 10 MB 2,411,552 282,110 1,528,017 54,303,033

Total Percentage of Usage

Octet % of the Distinct IPs % of Requests
1 0.09% 0.04%
2 87.94% 54.30%
3 11.96% 45.66%

Estimates of Key Results

Octet No. of Requests Per Dataset Record No. of Distinct IPs Per Record
1 5,513.88 1,438.13
2 2,489.59 432.98
3 192.49 5.42

Here are the conclusions:

  • The first two octets uncover 88% of the countries by IP address. Using only a three-octet dataset would increase the size from 10 MB to at least 60 MB—a much less efficient approach.
  • The last three-octet dataset represents only 12% of the IP addresses, where half (45%) of the traffic are derived from there. That means that about three octet keys are repeated more often, hence generating more traffic.
  • A split of the dataset results shows that 55% of the requests processed did not have to explore the major 10-MB, three-octet dataset.

In response to a request from our business partners, we did a similar study on the conversion of the U.S.A. states:

Octet Total Size of the Dataset No. of Used Keys Against the Dataset No. of Used Keys Against the Dataset No. of Distinctly Matching IPs No. of Processed Requests
1 1 KB 9 3 285 1,056
2 50 KB 10,910 2,404 523,086 3,582,907
3 10 MB 2,729,087 661,224 3,926,536 56,661,971

Note

  • Unlike before, from a performance perspective, the conversion for over 95% of the traffic came from the three-octet dataset. That's a less efficient process compared to the country conversion.
  • We realize that the states in the U.S.A cannot be properly converted with only the first three octets of the IP.
  • The range-value dataset for the U.S.A. is usually derived from the IP2City datasets.

Conversion Steps

To reverse-engineer the range-value logic to key-value logic, follow these steps:

  1. Convert the IPNum range to separate ranges per octet and extract the exact octet:

    Copy to clipboard
    From IPNumFrom, IPNumTo to Octet1From, Octet2From, Octet3From, Octet4From, and Octet1To, Octet2To, Octet3To, Octet4To.
  2. Create Octet12Country: 2.1. UNNEST the values between Octet1From and Octet1To. 2.2. GROUP BY Octet1 HAVING COUNT(DISTINCT country_code) = 1

  3. Create Octet22Country: 3.1. UNNEST the values between Octet1From and Octet1To. 3.2. IGNORE the values in Octet12Country. 3.3. Override the Octet2 ranges in the gaps: 3.3.1. If Octet1 is greater than Octet1From, set Octet2From to 0. 3.3.2. If Octet1 is smaller than Octet1To, set Octet2To to 255. 3.4. UNNEST the values between Octet2From and Octet2To into Octet2. 3.5. GROUP BY Octet1,Octet2 HAVING COUNT(DISTINCT country_code) = 1

  4. Create Octet32Country: 4.1. UNNEST the values between Octet1From and Octet1To. 4.2. IGNORE the values in Octet12Country. 4.3. Override the Octet2 ranges in the gaps: 4.3.1. If Octet1 is greater than Octet1From, set Octet2From to 0. 4.3.2. If Octet1 is smaller thanOctet1To, set Octet2To to 255. 4.4. UNNEST the values between Octet2From and Octet2To into Octet2. 4.5. IGNORE the values in Octet22Country. 4.6. Override the Octet3 ranges in the gaps: 4.6.1. If Octet2 is bigger than Octet2From, set Octet3From to 0. 4.6.2. If Octet2 is smaller than Octet2To, set Octet3To to 255. 4.7. UNNEST the values between Octet3From and Octet3To into Octet3. 4.8. GROUP BY Octet1,Octet2,Octet3 HAVING COUNT(DISTINCT country_code) = 1

  5. Convert Octet12Country, Octet22Country, and Octet32Country into IPNum12Country, IPNum22Country, and IPNum32Country.

  6. Update the results monthly or quarterly.

Note
This logic can be performed in multiple ways. The original dataset for the approach described above weighs only 20 MB, which makes it easier to debug floating octet values.

More Thoughts

We conducted this research to find out how well we could enrich data on the fly. It might often seem easier to adopt out-of-the-box solutions, but every company handles its data differently, from IP to user agent and URL. Even though it takes time and effort to cleanse, format, or tune the usage algorithm, each and every attribute promises to offer metadata-enrichment opportunities that are just waiting out there. Have fun!

Recent Blog Posts

Get Your Media Moving Faster with Cloudinary’s Media Optimizer

So, your boss comes to you in a panic: he's just heard about Google's Core Web Vitals initiative and needs you to optimize the company website right now! "No problem," you say, hiding your fear that it's not something that can be done overnight. Just taking the first metric, Largest Contentful Paint (LCP), how can you possibly identify all the large elements - most likely images or video posters - of the many hundreds of pages that make up your site? There are already thousands of high-resolution (read massive) media files stored away, which marketing could use any time. How are you going to make sure they're all compressed to a size small enough to be delivered within the threshold? Not to mention all the new images and videos that will be created over time...

Read more
How to Tap Into the Value of User-Generated Content (UGC)

User-generated content (UGC) took off with, first of all, the advent of the internet and, subsequently, social networks. Everyday consumers were given keys to the kingdom, so to speak, so that they, too, could compose and post content, simultaneously engaging with others online. Twitter, Facebook, Instagram, Snapchat, TikTok—the networks through which we can create and publish content have grown exponentially, and brands are becoming aware of the benefits of tapping into the gold mines offered by those networks.

Read more
Digital-First Asset Management Explained

As the world changes, so does technology. I don’t need to name more than a handful of antiquated technologies before you nod in agreement: floppy disks, Walkmans, phone booths, VHS tapes, each of which have been phased out or rendered useless by new solutions that meet the same need but much more effectively.

Read more
How to Build Workflows With Cloudinary’s MediaFlows

Many of you who work with the Cloudinary platform have a media-associated workflow for moderation of images, dispatch of notifications with certain data or headers, implementation of activities through add-ons, etc. For most of those cases, Cloudinary would suggest that you take advantage of our webhook notifications and build the workflow with an infrastructure like AWS Lambda. This post describes how to do that with Cloudinary’s MediaFlows a beta product that helps tackle management and operational tasks related to visual media.

Read more
Cloudinary’s Apps for DAM Soon Available for Customizing DAM

Organizations often go to great lengths to identify and select software solutions that deliver out-of-the-box features for most use cases. However, every company has unique needs, typically requiring customization to maximize the ROI of the software purchases. Customization, though, is a word avoided in polite circles because, in response, most people would lament that it “adds complexity, hence not worth the effort” and that it “requires ongoing support.”

Read more