Ignored By Dinosaurs 🦕

data

I've been hacking on a side project lately to try and open-source some of the bones of a FinOps visibility tool. You can find the FinOpsPod episode I recorded on the topic here. Now that that's out, I've been properly motivated to ship, and while AWS is done enough for now, I've been wrangling the Azure side of things over the weekend. This is what I learned in the last 72 hours.

Azure Blob Storage download progress with TQDM

I searched the internet high and low for how to handle this. AWS makes it fairly easy with the Callback argument you can pass when downloading an object from S3. Azure's version is more recent, and it goes like this:

import os

from tqdm import tqdm

def download_object(self, blob_name, destination_path):
    blob_client = self.container_client.get_blob_client(blob_name)
    filesize = blob_client.get_blob_properties().size
    # create the destination directory, just in case it doesn't exist yet
    os.makedirs(os.path.dirname(destination_path) or ".", exist_ok=True)
    with open(destination_path, "wb") as file, tqdm(
        total=filesize,
        unit="B",
        unit_scale=True,
        desc="whatever helpful file description you wish",
        colour="green",
    ) as t:
        bytes_read = 0

        # the progress_hook is called with 2 args - the bytes downloaded so
        # far and the total bytes of the object.  t.update wants the bytes
        # read in *this* iteration, so we keep track of what the total was as
        # of the previous call and pass the difference.
        def update_progress(bytes_amount, *args):
            nonlocal bytes_read
            t.update(bytes_amount - bytes_read)
            bytes_read = bytes_amount

        blob_data = blob_client.download_blob(progress_hook=update_progress)
        file.write(blob_data.readall())

In the API docs for the download_blob function you can find the progress_hook kwarg. It isn't called as often as its AWS counterpart, so the progress bar isn't nearly as fine-grained, but it's better than nothing in my opinion. The whole thing generally requires more wrangling than the AWS version, but I learned quite a lot in the process.

DuckDB, the ultimate data-eng Swiss Army knife?

One helpful thing that AWS does in their billing data export is to include a metadata file with each night's export that tells us facts about the export in general, things like:

  • the timestamp at which the export was generated
  • where you can find the associated billing data files
  • a unique ID for that particular version of the export, and most helpfully
  • a list of the columns and their datatypes in that export.
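That last item is the really useful one. As a rough sketch of why, here's what pulling the schema out of such a manifest looks like — note that the field names here ("columns", "name", "type") are from memory of AWS's CUR manifest shape, so treat them as an assumption and check against a real manifest file:

```python
import json

# A trimmed-down stand-in for the metadata JSON that accompanies each export.
# The field names mirror what I recall of AWS's CUR manifest shape - verify
# against a real manifest before relying on them.
manifest_json = """
{
  "assemblyId": "some-unique-export-id",
  "columns": [
    {"name": "lineItemId", "type": "String"},
    {"name": "usageStartDate", "type": "DateTime"},
    {"name": "unblendedCost", "type": "Decimal"}
  ]
}
"""

manifest = json.loads(manifest_json)
# name -> datatype mapping, ready to feed into a CREATE TABLE statement
schema = {col["name"]: col["type"] for col in manifest["columns"]}
print(schema)
```

With that dict in hand, generating warehouse DDL for each night's export becomes a string-formatting exercise instead of guesswork.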

For this side project I'm using Clickhouse as the backend warehouse. It's really fun to run huge queries on a huge dataset and have them come back in what feels like 100ms, so I'm a rather big fan of Clickhouse at this point, though I'm only just getting to know it. There are fussy things, too – things like its CSV importing, which is ... not super friendly. Here's an example:

Azure's billing exports generate with a Date field that tells you the date of the charge/line item. For some reason, even though my employer is a French company and our bill is in euros, all of the date fields in this bill come across in the US date format – MM/DD/YYYY. After exhaustive searching, I did find a clue in the Clickhouse documentation that it can parse US-style DateTime strings, but I cannot find that piece of documentation again AND it was only usable after you'd gotten the data into the warehouse (presumably as a String). I want the thing stored as a date to begin with, so I started to wonder if I could grab DuckDB and use it to parse this stupid Date column for me correctly.

The answer is yes. DuckDB is also a pretty cool piece of gear, so I'm playing with both of these open-source columnar things at the moment. One thing the Duck folks have gone out of their way to do is make it super easy to ingest data, and to specify all the little weird things that can go wrong in their extremely generous default CSV importer – things like “hey, the dateformat should look like this – {strptime string}”. Super cool and works like a charm, so now I have this CSV in memory as a DuckDB table. What else can I do with it?

Well, why spit it back out as CSV – how about spitting it back out as Parquet? Clickhouse will have a much easier time reading a Parquet file since it comes along with all the column names and datatypes, so that's what I'm doing. So, I have this function that downloads all the data_files for a given billing export; for the sake of brevity I'll put it here in its current, non-optimized form:

import os
from typing import List

import duckdb

def download_datafiles(self, data_files: List[str]):
    local_files = []
    # download each of the CSV files into a local tmp directory
    for data_file in data_files:
        destination_path = f"tmp/{data_file}"
        print(f"Downloading to {destination_path}")
        self.storage_client.download_object(data_file, destination_path)
        local_files.append(destination_path)
    dirname = os.path.dirname(local_files[0])
    con = duckdb.connect()
    # Here we convert the CSV to Parquet, because DuckDB is excellent at
    # parsing CSV and Clickhouse is a bit fussy in this regard.  The Azure
    # files come over with dates in the format MM/DD/YYYY, which DuckDB
    # can be made to deal with, but Clickhouse cannot.

    # Moreover, Duck can grab any number of CSV files in the target directory
    # and merge them all together for me.  This allows me to generate a single
    # Parquet file from all the CSV files in the directory.  Given that Azure
    # doesn't even gzip these files, this turns 500MB of CSV into 20MB of
    # Parquet.  Not bad.
    con.sql(
        f"""CREATE TABLE azure_tmp AS SELECT * FROM read_csv('{dirname}/*.csv',
            header = true,
            dateformat = '%m/%d/%Y'
        )"""
    )
    con.sql(
        "COPY (SELECT * FROM azure_tmp) TO 'tmp/azure-tmp.parquet' (FORMAT 'parquet')"
    )
    con.close()
    # yes, we do two things in this function right now, it's ok.  We'll
    # refactor and probably use the "parse the columns and datatypes out of
    # this Parquet table" bit all over the place.
    return "tmp/azure-tmp.parquet"

#data #data-engineering #azure #finops

“Run your data team like it's a product team” was a common refrain at the DBT conference for the last two years. What does that mean? I am still figuring that out, but in the last few weeks I've had an aha moment about what a “data product” is, exactly.

Your standard software development process requires two main components to deliver the Things – a software developer and her wits. She takes those wits, expresses them through a text editor, and thereby makes something from truly nothing.

Data products differ in one key way – they require raw materials in the form of data. The process of building a data product therefore requires at least one additional step that standard software product development does not: refining that data into something consumable by the system that is delivering the product.

There can potentially be an additional step even before this one, which is to get the data in the first place. My current employer built an Observability suite and stack to be able to deliver metrics to our customers about their projects that they run/host here. This process took multiple quarters because the entire metrics creation and delivery pipeline had to be built from scratch. Once the data existed, it was then a process of refining the materials and building the product.

The good news is that many data products can be consumed in a standard way through some kind of BI or reporting or data visualization tool (we use Metabase). It has taken me a while to understand that the method of delivery of the product is the more standardized part, whereas the gathering and refinement of the raw materials/data is where the action is.

#analytics #business #data

Not sure if I mentioned this in the last post, but what I'm doing right now is essentially building a data warehouse for the company. It's from scratch – there was nothing here beforehand – so I get to/have had to choose everything from the tech stack to the processes to my favorite part of late: naming things.

The name you give to a piece of software, or a command line flag, or a column in a database is an act of asynchronous communication with another human. You are asking them to care about the thing you've built. If they choose to work with your tool, you are asking them to understand the choices that you made in its design.

The most selfless thing you can do is think about how much effort it takes for the user who did not design the thing to understand how to use it. Name things according to what they are and what they do; make it intuitive. This is what design is. It's not the parts that most people will never see – it's the only parts most people will ever see.

#generaldevelopment #data

My brother-in-law is a recruiter. He historically recruits salesfolks for companies “in the BI space”. I tried to help him out many years ago when I was still on the road playing music, but had absolutely no background to do anything other than plow through the spreadsheet of contacts he had and try to get a response. Like most witless recruiters, I had no idea what BI was.

Years later, after starting a new job at ABM, I still had no idea what BI was. It was something we needed, or something we did – I wasn't really sure. We had a big old database, some stuff was in there, reports came out. Somebody read them. No idea. It didn't seem very intelligent, but apparently it helped with our business, yet most of the time people seemed pissed off at it and at the person who ran it.

So here's a quickie definition of “Business Intelligence” from me, 5 years ago.

Companies take in a lot of data. Data can be anything. It can be logs from your webserver. It can be daily dumps from Google Analytics about the traffic on your site. It can be daily dumps from Exact Target or Mailchimp about what emails went out yesterday, on what lists, and which ones were opened, which ones bounced. What videos were played on the sites yesterday? On what kind of browser? Basically it's anything and everything that a business can get their hands on.

Ok, you've got your hands on it, now what? Let's figure out how to figure out what is going on with our business on a macro scale so that the C suite can make decisions and we can all keep our jobs.

This is basically what BI is. Take in data. Munge data. Get answers out of data so that you can run your business.

Obviously today (2016), this is big business. “Big data”, you've heard of that? Very closely related to BI, since the amount of data that we are able to take in these days is so vast that there's no way we could get meaningful answers out of all of it using technology from even just 10 years ago.

Wrangling all this data is a wide open field, and that's where I want to be right now.

#business #data