I’ve been hacking on a side project lately, trying to open source some of the bones of a FinOps visibility tool. You can find the FinOpsPod episode I recorded on the topic recently here. Well, now that that’s out, I’ve been properly motivated to ship, and while the AWS side is done enough for now, I spent the weekend wrangling the Azure side of things. This is what I learned in the last 72 hours.

Azure Blob Storage download progress with TQDM

I searched the internet high and low for how to handle this. AWS makes it fairly easy with the Callback argument you can pass when downloading an object from S3. I guess Azure’s version is more recent and it goes like this:

# (a method on my storage client class; it assumes `import os` and
# `from tqdm import tqdm` at module level)
def download_object(self, blob_name, destination_path):
    blob_client = self.container_client.get_blob_client(blob_name)
    filesize = blob_client.get_blob_properties().size
    # just in case the destination path doesn't exist
    os.makedirs(os.path.dirname(destination_path), exist_ok=True)
    with open(destination_path, "wb") as file, tqdm(
        total=filesize,
        unit="B",
        unit_scale=True,
        desc='whatever helpful file description you wish',
        colour="green",
    ) as t:
        bytes_read = 0
        # the progress_hook is called with 2 args - the bytes downloaded so far and the
        # total bytes.  t.update wants the bytes read in this iteration, so we have to
        # make that happen by keeping track of what the total was as of the previous
        # iteration.
        def update_progress(bytes_amount, *args):
            nonlocal bytes_read
            t.update(bytes_amount - bytes_read)
            bytes_read = bytes_amount
 
        blob_data = blob_client.download_blob(progress_hook=update_progress)
        file.write(blob_data.readall())

In the API docs for the download_blob function you can find the progress_hook kwarg. It isn’t called as often as its AWS counterpart, so the progress bar isn’t nearly as fine-grained, but it’s better than nothing in my opinion. The whole thing in general requires more wrangling than the AWS version, but I learned quite a lot in the process.
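For comparison, the S3 side I mentioned above looks roughly like this. This is a sketch rather than the project’s actual code, and the bucket, key, and paths are placeholders:

import boto3
from tqdm import tqdm

s3 = boto3.client("s3")
# head_object gives us the size up front so tqdm knows the total
size = s3.head_object(Bucket="my-bucket", Key="exports/data.csv.gz")["ContentLength"]
with tqdm(total=size, unit="B", unit_scale=True, colour="green") as t:
    # boto3's Callback receives the bytes transferred in each chunk,
    # which is exactly what t.update wants - no bookkeeping required
    s3.download_file("my-bucket", "exports/data.csv.gz", "tmp/data.csv.gz", Callback=t.update)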

DuckDB, the ultimate dataeng Swiss Army knife?

One helpful thing that AWS does in their billing data export is to include a metadata file with each night’s export that tells us facts about the export in general, things like:

  • the timestamp that the export was generated
  • where you can find the associated billing data files
  • a unique ID for that particular version of the export and, most helpfully,
  • a list of columns and their datatypes in that export (sketched just below).
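I’m not showing the AWS ingestion code in this post, but reading that metadata file boils down to something like this. The field names below are hypothetical stand-ins, not the real keys from AWS’s manifest format:

import json

def read_export_manifest(path):
    # NOTE: these keys are illustrative placeholders, not the actual AWS schema
    with open(path) as f:
        manifest = json.load(f)
    return {
        "generated_at": manifest["exportTimestamp"],
        "data_files": manifest["dataFiles"],
        "export_version_id": manifest["exportVersionId"],
        "columns": manifest["columns"],  # e.g. a list of {"name": ..., "type": ...}
    }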

For this side project I’m using Clickhouse as the backend warehouse. It’s really fun to run huge queries on a huge dataset and have them come back in what feels like 100ms, so I’m a rather big fan of Clickhouse at this point, though I’m only just getting to know it. There are fussy things too, like its CSV importing, which is … not super friendly. Here’s an example:

Azure’s billing exports generate with a Date field that tells you the date of the charge/line item. For some reason, even though my employer is a French company and our bill is in euros, all of the date fields in this bill come across in the US date format - MM/DD/YYYY. After exhaustive searching, I did find a clue in the Clickhouse documentation that it could parse US-style DateTime strings, but I cannot find that piece of documentation again, AND the trick was only available after you’d already gotten the data into the warehouse (presumably as a String). I want the thing stored as a date to begin with, so I started to wonder if I could grab DuckDB and use it to parse this stupid Date column for me correctly.
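For what it’s worth, I believe the function I half-remember is Clickhouse’s parseDateTimeBestEffortUS. If the Date column had already landed in the warehouse as a String, the conversion would look roughly like the query below (the table name is invented for the example):

# hypothetical post-hoc conversion in Clickhouse; azure_billing_raw is a made-up table name
fix_dates_sql = """
    SELECT toDate(parseDateTimeBestEffortUS(Date)) AS charge_date
    FROM azure_billing_raw
"""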

The answer is yes. DuckDB is also a pretty cool piece of gear, so I’m playing with both of these open-source columnar things at the moment. One thing the DuckDB folks have gone out of their way to do is make it super easy to ingest data, and to let you specify all the little weird things that can go wrong in their extremely generous default CSV importer, things like “hey, the dateformat should look like this - {strptime string}”. Super cool, and it works like a charm, so now I have this CSV in memory as a DuckDB table. What else can I do with it?
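Before answering that: in isolation, the ingest and a quick sanity check look roughly like this (the path is a placeholder):

import duckdb

con = duckdb.connect()
# dateformat takes a strptime-style pattern, so the US-style MM/DD/YYYY
# strings come in as a proper DATE column instead of VARCHAR
rel = con.sql("""
    SELECT * FROM read_csv('tmp/some-export.csv',
        header = true,
        dateformat = '%m/%d/%Y')
""")
print(rel.columns)
print(rel.types)  # the Date column should show up as DATE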

Well, why spit it back out as CSV when I could spit it back out as Parquet? Clickhouse will have a much easier time reading a Parquet file, since it comes along with all the column names and datatypes, so that’s what I’m doing. So, I have this function that downloads all the data_files for a given billing export, and for the sake of brevity I’ll put it here in its current, non-optimized form:

# (assumes `import os`, `import duckdb`, and `from typing import List` at module level)
def download_datafiles(self, data_files: List[str]):
    local_files = []
    # this downloads each of the CSV files and puts them in a local
    # tmp directory
    for data_file in data_files:
        destination_path = f"tmp/{data_file}"
        print(f"Downloading to {destination_path}")
        self.storage_client.download_object(data_file, destination_path)
        local_files.append(destination_path)
    dirname = os.path.dirname(local_files[0])
    con = duckdb.connect()
    # Here we convert the csv to Parquet, because DuckDB is excellent with
    # parsing CSV and Clickhouse is a bit fussy in this regard.  The Azure
    # files come over with dates in the format MM/DD/YYYY, which DuckDB
    # can be made to deal with, but Clickhouse cannot.
 
    # moreover, Duck can grab any number of CSV files in the target directory and
    # merge them all together for me.  This allows me to generate a single Parquet
    # file from all the CSV files in the directory.  Given that Azure doesn't even
    # gzip these files, this turns 500MB of CSV into 20MB of Parquet.  Not bad.
    con.sql(
        f"""CREATE TABLE azure_tmp AS SELECT * FROM read_csv('{dirname}/*.csv', 
            header = true, 
            dateformat = '%m/%d/%Y'
        )"""
    )
    # We use this DuckDB table to generate an accurate list of columns and types
    # from the CSV.  Thanks DuckDB!
    columns = [
        {"name": result[0], "type": result[1]}
        for result in con.sql("DESCRIBE azure_tmp").fetchall()
    ]
    con.sql(
        "COPY (SELECT * FROM azure_tmp) TO 'tmp/azure-tmp.parquet' (FORMAT 'parquet')"
    )
    con.close()
    # yes, we do two things in this function right now, it's ok.  We'll refactor
    # and probably reuse the "parse the columns and datatypes out of this parquet
    # table" trick all over the place.
    return ["tmp/azure-tmp.parquet"], columns