I’ve been developing a quick and dirty data migration routine to get terabytes of data stored in AWS S3 as Parquet files into our ClickHouse Cloud cluster. I’m really happy that I took some time to read up on the clickhouse local command, which is included in any installation of ClickHouse.
Not only was this tool instrumental in getting the data migrated, but it also allowed me to very easily craft a way to quickly compare numbers between the source and destination to make sure everything I expected to be migrated was actually migrated.
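For context, the migration step itself boiled down to reading the Parquet files out of S3 and inserting them into the Cloud cluster. A rough sketch of what that can look like is below; the hostname, credentials, bucket path, and table name are placeholders (the same ones used later in this post), and the real migration had more to it than a single statement:
clickhouse local --query "
-- read one month of Parquet files out of S3 and insert them
-- into the Cloud table via the remoteSecure table function
INSERT INTO TABLE FUNCTION remoteSecure('clickhouse-instance-hostname:9440', 'db.tablename', 'clickhouse-user', 'clickhouse-password')
SELECT *
FROM s3('https://s3-endpoint/2023/12/**/*.parquet', 'Parquet');"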
The Mess I Made
It’s important to know that engineers with 25+ years of experience do dumb things sometimes. It’s also important to know that having decades of experience does not make you immune to fatigue, and really nobody should be working from 7AM until midnight. It’s not heroic. It’s unhealthy and problematic on a bunch of different levels. With that in mind, here’s the dopey stuff I did after working too many hours, too late into the night:
Initially, after I migrated a subset of the data, I did a more manual, hacky check for consistency just using command line tools, manually querying each source in different terminal windows and eyeballing the output. I wanted a nicer way to view that data, so I created a one-liner using the awk, paste, and column commands. That looked like this:
paste count_clicks.txt count_clicks_parquet.txt | column -t | awk '
NR==1{
printf(
"%12s %18s %18s %12s %12s \n",
"date",
"clickhouse_count",
"warehouse_count",
"diff",
"pct-diff"
)
}
{
printf(
"%12s %18d %18d %12d %12.4f%%\n",
$1,
$2,
$4,
$4-$2,
(100-($2/$4)*100)
)
}'
A quick overview of what’s happening there:
- NR==1 means the first record (NR==2 would mean the second record, etc). If the record number is one, awk will output what’s in the first set of curly braces. The printf function takes a format spec as its first argument. My format spec lays out 5 columns of either 12 or 18 characters. All of those columns will hold strings, hence the ‘s’ in %12s. Then I have a bunch of hard-coded strings, which become the column headers. If you forget to put NR==1 in there, they’ll print on every row. Ask me how I know!
- The second set of curly braces specs out what will be printed for the rest of the rows. In this case, I have column widths that match up with those of the column headers, and then I have the columns from the output of the earlier parts of the command pipeline:
- $1 is the date column
- $2 is the count from the ClickHouse data source (the first file given to paste)
- $3 is just the date repeated from the second file, so it gets skipped
- $4 is the count from the S3 (warehouse) data source
- $4-$2 shows the difference between the two data sources, and
- The last column shows the percent difference between the two sources
The output looks something like this:
date clickhouse_count warehouse_count diff pct-diff
2023-12-01 1471 2445 74 0.017%
2023-12-02 1665 1700 35 0.038%
2023-12-03 4496 4537 41 0.045%
2023-12-04 1650 1705 55 0.047%
2023-12-05 1154 1237 83 0.069%
2023-12-06 2777 2865 88 0.074%
2023-12-07 9244 9293 49 0.041%
The data here is made up to give an idea of what the output looks like.
So, I had this issue where my data audit showed a mismatch. I did a little work, very late at night, and went to bed thinking I had straightened it all out. When I woke up the next day, my well-rested brain and eyes caught a problem: I had copied the output from my queries of the two sources into two separate files, mis-labeled the data, and then compared that against data in a completely different window, and… well, it was a mess, and I hadn’t fixed anything.
I was up too late working for sure, but I also had a messy process. I should’ve and could’ve done better.
Fresh Eyes, Fresh (and better) Ideas
Revisiting my work from the night before was painful. As soon as I looked at my process at a high level (by scrolling through my terminal window history) I almost immediately said “this is insane. ClickHouse Local should be able to query both sources. I shouldn’t need to copy/paste and introduce levels of indirection that leave room for errors like this.”
I was right. Using ClickHouse Local, you can query a ClickHouse Cloud instance using the remoteSecure function, and query the S3 data using the s3 function, which also lets me pass in the file format as an argument. So there’s support for my data sources and formats.
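Each of those functions can also be tried in isolation first, which makes a handy sanity check. For example, with the same placeholder endpoint, credentials, and table name as in the full query below:
# total rows across the Parquet files in S3
clickhouse local --query "
SELECT count() FROM s3('https://s3-endpoint/2023/12/**/*.parquet', 'Parquet');"

# total rows in the ClickHouse Cloud table
clickhouse local --query "
SELECT count()
FROM remoteSecure('clickhouse-instance-hostname:9440', 'db.tablename', 'clickhouse-user', 'clickhouse-password');"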
On top of that, ClickHouse supports Common Table Expressions (CTEs), so I can craft a query where I name the output from two separate sub-select statements (one against each data source), and then write a third SELECT that references the two named result sets as if they were tables.
Below, s3_count and ch_count are named result sets. The last SELECT queries those two named result sets.
clickhouse local --query "
WITH s3_count AS (
SELECT
toDate(time) AS day,
count() AS num_events
FROM s3('https://s3-endpoint/2023/12/**/*.parquet', 'Parquet')
GROUP BY day
ORDER BY day
),
ch_count AS (
SELECT
toDate(time) AS day,
count() AS num_events
FROM remoteSecure('clickhouse-instance-hostname:9440', 'db.tablename', 'clickhouse-user', 'clickhouse-password')
    WHERE toYYYYMM(time) = '202312'
GROUP BY day ORDER BY day
)
SELECT
s3.day AS date,
s3.num_events AS s3_count,
ch.num_events AS ch_count,
ch_count - s3_count AS diff
FROM s3_count AS s3
LEFT JOIN ch_count AS ch
ON s3.day = ch.day;"
The output has a couple of quirks: for some reason (possibly related to querying multiple sources, or maybe to the CTEs), the output columns are not ordered according to my query, and there is no header line to tell me which column is which. It looks like this:
2023-12-01 40 2778 2738
2023-12-02 43 4413 4370
2023-12-03 26 7024 6998
2023-12-04 54 3079 3025
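The missing header, at least, is easy to work around if you care: clickhouse local defaults to plain TabSeparated output, but you can append a FORMAT clause that includes column names. A tiny illustration with dummy values (I didn’t bother for this one-off check):
# TSVWithNames prints a header row; PrettyCompact would draw a small table instead
clickhouse local --query "
SELECT toDate('2023-12-01') AS date, 42 AS example_count
FORMAT TSVWithNames;"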
Conclusion
So, this is not something I’d paste as-is into a slide deck and present to an executive team. However, clickhouse local in this case did give me a (relatively) quick way to verify the consistency (and quantify the inconsistency) between my data sources. Once the data was moved, I wasted probably an hour with awk and friends, but was able to recover the next morning and throw together the clickhouse local solution in maybe just another hour of reading docs, debugging the query, and running test queries. Hope this helps.