I’ve been developing a quick and dirty data migration routine to get terabytes of data stored in AWS S3 as Parquet files into our ClickHouse Cloud cluster. I’m really happy that I took some time to read up on the clickhouse local command, which is included in any installation of ClickHouse.
Not only was this tool instrumental in getting the data migrated, but it also allowed me to very easily craft a way to quickly compare numbers between the source and destination to make sure everything I expected to be migrated was actually migrated.
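For context, the migration step itself boiled down to reading the Parquet files out of S3 and inserting them into the Cloud cluster. A rough sketch of what that can look like is below; the hostname, credentials, bucket path, and table name are placeholders (the same ones used later in this post), and the real migration had more to it than a single statement:
clickhouse local --query "
-- read one month of Parquet files out of S3 and insert them
-- into the Cloud table via the remoteSecure table function
INSERT INTO TABLE FUNCTION remoteSecure('clickhouse-instance-hostname:9440', 'db.tablename', 'clickhouse-user', 'clickhouse-password')
SELECT *
FROM s3('https://s3-endpoint/2023/12/**/*.parquet', 'Parquet');"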
The Mess I Made
It’s important to know that engineers with 25+ years of experience do dumb things sometimes. It’s also important to know that having decades of experience does not make you immune to fatigue, and really nobody should be working from 7AM until midnight. It’s not heroic. It’s unhealthy and problematic on a bunch of different levels. With that in mind, here’s the dopey stuff I did after working too many hours, too late into the night:
Initially, after I migrated a subset of the data, I did a more manual, hacky check for consistency just using command line tools, manually querying each source in different terminal windows and eyeballing the output. I wanted a nicer way to view that data, so I created a one-liner using the awk, paste, and column commands. That looked like this:
paste count_clicks.txt count_clicks_parquet.txt | column -t | awk '
NR==1{
printf(
"%12s %18s %18s %12s %12s \n",
"date",
"clickhouse_count",
"warehouse_count",
"diff",
"pct-diff"
)
}
{
printf(
"%12s %18d %18d %12d %12.4f%%\n",
$1,
$2,
$4,
$4-$2,
(100-($2/$4)*100)
)
}'
A quick overview of what’s happening there:
- NR==1 means the first record (NR==2 would mean the second record, etc). If the record number is one, awk will output what’s in the first set of curly braces. The printf function takes a format spec as its first argument. My format spec lays out 5 columns of either 12 or 18 characters. All of those columns will hold strings, hence the ‘s’ in %12s. Then I have a bunch of hard-coded strings, which become the column headers. If you forget to put NR==1 in there, they’ll print on every row. Ask me how I know!
- The second set of curly braces specs out what will be printed for the rest of the rows. In this case, I have column widths that match up with those of the column headers, and then I have the columns from the output of the earlier parts of the command pipeline:
- $1 is the date column
- $2 is the count from the ClickHouse data source (the first file given to paste)
- $3 is just the date repeated from the second file, so it gets skipped
- $4 is the count from the S3 (warehouse) data source
- $4-$2 shows the difference between the two data sources, and
- The last column shows the percent difference between the two sources
The output looks something like this:
date clickhouse_count warehouse_count diff pct-diff
2023-12-01 1471 2445 74 0.017%
2023-12-02 1665 1700 35 0.038%
2023-12-03 4496 4537 41 0.045%
2023-12-04 1650 1705 55 0.047%
2023-12-05 1154 1237 83 0.069%
2023-12-06 2777 2865 88 0.074%
2023-12-07 9244 9293 49 0.041%
The data here is made up to give an idea of what the output looks like.
So, I had this issue where my data audit showed a mismatch. I did a little work, very late at night, and went to bed thinking I had straightened it all out. When I woke up the next day, my well-rested brain and eyes caught a problem: I had copied the output from my queries of the two sources into two separate files, mis-labeled the data, and then compared that against data in a completely different window, and… well, it was a mess, and I hadn’t fixed anything.
I was up too late working for sure, but I also had a messy process. I should’ve and could’ve done better.
Fresh Eyes, Fresh (and better) Ideas
Revisiting my work from the night before was painful. As soon as I looked at my process at a high level (by scrolling through my terminal window history) I almost immediately said “this is insane. ClickHouse Local should be able to query both sources. I shouldn’t need to copy/paste and introduce levels of indirection that leave room for errors like this.”
I was right. Using ClickHouse Local, you can query a ClickHouse Cloud instance using the remoteSecure function, and query the S3 data using the s3 function, which also lets me pass in the file format as an argument. So there’s support for my data sources and formats.
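Each of those functions can also be tried in isolation first, which makes a handy sanity check. For example, with the same placeholder endpoint, credentials, and table name as in the full query below:
# total rows across the Parquet files in S3
clickhouse local --query "
SELECT count() FROM s3('https://s3-endpoint/2023/12/**/*.parquet', 'Parquet');"

# total rows in the ClickHouse Cloud table
clickhouse local --query "
SELECT count()
FROM remoteSecure('clickhouse-instance-hostname:9440', 'db.tablename', 'clickhouse-user', 'clickhouse-password');"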
On top of that, ClickHouse supports Common Table Expressions (CTEs), so I can craft a query where I name the output from two separate sub-select statements (one against each data source), and then write a third SELECT that references the two named result sets as if they were tables.
Below, s3_count and ch_count are named result sets. The last SELECT queries those two named result sets.
clickhouse local --query "
WITH s3_count AS (
SELECT
toDate(time) AS day,
count() AS num_events
FROM s3('https://s3-endpoint/2023/12/**/*.parquet', 'Parquet')
GROUP BY day
ORDER BY day
),
ch_count AS (
SELECT
toDate(time) AS day,
count() AS num_events
FROM remoteSecure('clickhouse-instance-hostname:9440', 'db.tablename', 'clickhouse-user', 'clickhouse-password')
    WHERE toYYYYMM(time) = '202312'
GROUP BY day ORDER BY day
)
SELECT
s3.day AS date,
s3.num_events AS s3_count,
ch.num_events AS ch_count,
ch_count - s3_count AS diff
FROM s3_count AS s3
LEFT JOIN ch_count AS ch
ON s3.day = ch.day;"
The output has a couple of quirks: for some reason (possibly related to querying multiple sources, or maybe to the CTEs), the output columns are not ordered according to my query, and there is no header line to tell me which column is which. It looks like this:
2023-12-01 40 2778 2738
2023-12-02 43 4413 4370
2023-12-03 26 7024 6998
2023-12-04 54 3079 3025
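The missing header, at least, is easy to work around if you care: clickhouse local defaults to plain TabSeparated output, but you can append a FORMAT clause that includes column names. A tiny illustration with dummy values (I didn’t bother for this one-off check):
# TSVWithNames prints a header row; PrettyCompact would draw a small table instead
clickhouse local --query "
SELECT toDate('2023-12-01') AS date, 42 AS example_count
FORMAT TSVWithNames;"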
Conclusion
So, this is not something I’d paste as-is into a slide deck and present to an executive team. However, clickhouse local in this case did give me a (relatively) quick way to verify the consistency (and quantify the inconsistency) between my data sources. Once the data was moved, I wasted probably an hour with awk and friends, but was able to recover the next morning and throw together the clickhouse local solution in maybe just another hour of reading docs, debugging the query, and running test queries. Hope this helps.