Auditing Your Data Migration To ClickHouse Using ClickHouse Local

Posted on January 21, 2024 by jonesy

I’ve been developing a quick and dirty data migration routine to get terabytes of data stored in AWS S3 as parquet files into our ClickHouse Cloud cluster. I’m really happy that I took some time to read up on the clickhouse local command, which is included in any installation of ClickHouse.

Not only was this tool instrumental in getting the data migrated, but it also allowed me to very easily craft a way to quickly compare numbers between the source and destination to make sure everything I expected to be migrated was actually migrated.
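If you’ve never tried it: anywhere the clickhouse binary is installed, clickhouse local is there too, and it can run full SQL against local files without a server. A trivial sketch (events.parquet is a hypothetical local file, not one from this migration):

clickhouse local --query "
SELECT count()
FROM file('events.parquet', 'Parquet')"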

The Mess I Made

It’s important to know that engineers with 25+ years of experience do dumb things sometimes. It’s also important to know that having decades of experience does not make you immune to fatigue, and really nobody should be working from 7AM until midnight. It’s not heroic. It’s unhealthy and problematic on a bunch of different levels. With that in mind, here’s the dopey stuff I did after working too many hours, too late into the night:

Initially, after I migrated a subset of the data, I did a hacky consistency check using command line tools alone: querying each source in a different terminal window and eyeballing the output. I wanted a nicer way to view that data, so I created a one-liner using the awk, paste, and column commands. That looked like this:

paste count_clicks.txt count_clicks_parquet.txt | column -t | awk '
# First record only: print the header row.
NR==1{
  printf(
    "%12s %18s %18s %12s %12s \n",
    "date",
    "clickhouse_count",
    "warehouse_count",
    "diff",
    "pct-diff"
  )
}
# Every record: date, both counts, the difference, and the percent difference.
{
  printf(
    "%12s %18d %18d %12d %12.4f%%\n",
    $1,
    $2,
    $4,
    $4-$2,
    (100-($2/$4)*100)
  )
}'

A quick overview of what’s happening there:

  • NR==1 means the first record (NR==2 would mean the second record, etc). If the record number is one, awk will output what’s in the first set of curly braces. The printf function takes a format spec in the first argument. My format spec lays out 5 columns of either 12 or 18 characters. All of those columns will hold strings, hence the ‘s’ in %12s. Then I have a bunch of hard-coded strings, which become the column headers. If you forget to put NR==1 in there, they’ll print on every row. Ask me how I know!
  • The second set of curly braces specs out what will be printed for the rest of the rows. In this case, I have column widths that match up with those of the column headers, and then I have the columns from the output of the earlier parts of the command pipeline (see the sketch of that intermediate output just after this list):
    • $1 is the date column
    • $2 is the count from the ClickHouse destination (the first file, count_clicks.txt)
    • $4 is the count from the S3 warehouse source (the second file, count_clicks_parquet.txt)
    • $4-$2 shows the difference between the two data sources, and
    • The last column shows the percent difference between the two sources
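For reference, here’s roughly what the awk program sees on stdin after paste and column have merged the two files (same made-up numbers as the sample output below):

2023-12-01  1471  2023-12-01  1545
2023-12-02  1665  2023-12-02  1700
2023-12-03  4496  2023-12-03  4537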

The output looks something like this:

        date   clickhouse_count    warehouse_count         diff     pct-diff
  2023-12-01               1471               1545           74       4.7896%
  2023-12-02               1665               1700           35       2.0588%
  2023-12-03               4496               4537           41       0.9036%
  2023-12-04               1650               1705           55       3.2258%
  2023-12-05               1154               1237           83       6.7098%
  2023-12-06               2777               2865           88       3.0716%
  2023-12-07               9244               9293           49       0.5273%

The data here is made up to give an idea of what the output looks like.

So, I had this issue where my data audit showed a mismatch. I did a little work, very late at night, and went to bed thinking I had straightened it all out. When I woke up the next day, my well-rested brain and eyes caught a problem: I had copied the output from my queries of the two sources into two separate files, mislabeled the data, and then compared it against data in a completely different window, and… well, it was a mess, and I hadn’t actually fixed anything.

I was up too late working for sure, but I also had a messy process. I should’ve and could’ve done better.

Fresh Eyes, Fresh (and better) Ideas

Revisiting my work from the night before was painful. As soon as I looked at my process at a high level (by scrolling through my terminal window history), I said to myself: “this is insane. ClickHouse Local should be able to query both sources. I shouldn’t need to copy/paste and introduce levels of indirection that leave room for errors like this.”

I was right. Using ClickHouse Local, you can query a ClickHouse Cloud instance using the remoteSecure function, and query the S3 data using the s3 function, which also lets you pass in the file format as an argument. So there’s support for both of my data sources and formats.
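Before combining anything, each source is easy to sanity-check on its own. A quick sketch, reusing the same placeholder S3 endpoint, hostname, and credentials that appear in the full query below:

# Total rows in the S3 parquet files:
clickhouse local --query "
SELECT count()
FROM s3('https://s3-endpoint/2023/12/**/*.parquet', 'Parquet')"

# Total rows in the ClickHouse Cloud table:
clickhouse local --query "
SELECT count()
FROM remoteSecure('clickhouse-instance-hostname:9440', 'db.tablename', 'clickhouse-user', 'clickhouse-password')"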

On top of that, ClickHouse supports Common Table Expressions (CTEs), so I can craft a query where I name the output of two separate sub-selects (one per data source), and then write a third SELECT that references the two named result sets as if they were tables.

Below, s3_count and ch_count are named result sets. The last SELECT queries those two named result sets.

clickhouse local --query "
WITH s3_count AS (
  -- Daily event counts from the parquet files in S3 (the source).
  SELECT
    toDate(time) AS day,
    count() AS num_events
  FROM s3('https://s3-endpoint/2023/12/**/*.parquet', 'Parquet')
  GROUP BY day
  ORDER BY day
),
ch_count AS (
  -- Daily event counts from the ClickHouse Cloud cluster (the destination).
  SELECT
    toDate(time) AS day,
    count() AS num_events
  FROM remoteSecure('clickhouse-instance-hostname:9440', 'db.tablename', 'clickhouse-user', 'clickhouse-password')
  WHERE toYYYYMM(time) = 202312
  GROUP BY day
  ORDER BY day
)
SELECT
  s3.day AS date,
  s3.num_events AS s3_count,
  ch.num_events AS ch_count,
  ch_count - s3_count AS diff
FROM s3_count AS s3
LEFT JOIN ch_count AS ch
ON s3.day = ch.day;"

The output has a couple of quirks: for some reason (possibly related to querying multiple sources, or to the CTEs; I didn’t dig in), the output columns are not ordered according to my query, and there is no header line to tell me which column is which. It looks like this:

2023-12-01	40	2778	2738
2023-12-02	43	4413	4370
2023-12-03	26	7024	6998
2023-12-04	54	3079	3025
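The missing header, at least, has a boring explanation: by default, clickhouse local writes results as tab-separated values, a format with no header row. Appending a FORMAT clause to the query changes that; here’s a minimal, self-contained sketch using the standard PrettyCompact output format (the same clause can be tacked onto the end of the audit query above):

clickhouse local --query "SELECT 1 AS one, 2 AS two FORMAT PrettyCompact"

┌─one─┬─two─┐
│   1 │   2 │
└─────┴─────┘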

Conclusion

So, this is not something I’d paste as-is into a slide deck and present to an executive team. However, clickhouse local in this case did give me a (relatively) quick way to verify the consistency (and quantify the inconsistency) between my data sources. Once the data was moved, I wasted probably an hour with awk and friends, but was able to recover the next morning and throw together the clickhouse local solution in maybe just another hour of reading docs, debugging the query, and running test queries. Hope this helps.
