Ignored By Dinosaurs 🦕

analytics

Hey there, just wanted to leave a signpost for you. My use cases lately have been something like -

Computing a BIG table, with a lot of math in it, over a LOT of rows of data, and then joining in other data to enrich the primary set. Specifically, this is container usage data, which I'm attempting to blend with our AWS bill to arrive at something like “cost per container” per time period.

I don't want to have to rebuild this table every day because most of the data is static once it shows up in the warehouse. An incremental strategy would be perfect, BUT some of this data arrives late, which means that if I do the standard DBT routine of

WHERE timestamp > (SELECT MAX(timestamp) FROM {{ this }})

then I will have gaps. Indeed, I have gaps. I haven't rolled out any reporting on this table, or made any announcements because I felt a disturbance in the force, confirmed by some light analysis this morning.

I've recently discovered a new DBT hammer in the incremental_strategy parameter for incrementally built tables, and specifically the insert_overwrite option. From the DBT docs:

The insert_overwrite strategy generates a merge statement that replaces entire partitions in the destination table.

In short, I can just always recompute yesterday and today, or the last 7 days, or whatever full partitions' worth of data I want. Yes, I'm recomputing more than I strictly need to, but it assures me that there will be no gaps in the results.

This operation seems pretty foolproof so far, check it out.
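A minimal sketch of what such a model can look like on an adapter that supports insert_overwrite (the table, columns, and 7-day lookback here are hypothetical, and the exact partition_by shape and date function are warehouse-specific):

```sql
{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'usage_date', 'data_type': 'date'}
  )
}}

select
  usage_date,
  container_id,
  sum(cpu_seconds) as cpu_seconds
from {{ ref('container_usage') }}
{% if is_incremental() %}
  -- Recompute the last 7 full partitions every run. Late-arriving rows
  -- land inside a partition that gets rebuilt anyway, so no gaps.
  where usage_date >= current_date - 7
{% endif %}
group by 1, 2
```

The trade-off is exactly as described above: a little redundant compute in exchange for never having a hole in the table.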

#analytics #DBT

Suppose a hypothetical organization that sells a product whose feature set and COGS closely follow those of a typical CSP like Amazon Web Services. That organization allows its customers to change products at will but must manually invoice a significant percentage of those sellables, so it needs a robust system to track changes to those sellables and ensure that they are properly charged at each turn of the billing cycle.

I'm picturing a reporting format that covers 2 different types of metrics -

  • Accruing (non-temporal)
  • Static (temporal)

Accruing metrics are easy; they're things like outgoing bandwidth. These are capped monthly and overages should be trued up, so 2 measurements could be helpful on these -

  • Month to date sum (this will end up being your billable, since the bounds of the billing cycle are likely set at the calendar month)
  • Rolling 30 day average (typical month's usage, helps you notice a customer who is tracking above what you sold them)

Static metrics require a bit more understanding. These are things like CPU cores in a given VM that you're selling. The tin says “8 CPUs” and gives you a monthly rate for those 8 CPUs but you're allowed to upsize that 8 core machine any time you want. Those 8 cores might become 16 for a week, then back to 8. That means you're charging for neither the 8 core machine nor the 16 core machine, but a blend of both.

This is what I mean by “temporal”: you have to generate a time component, slice your 8 or 16 cores across that time component, prorate the usage by it, and ultimately arrive at a piece of usage that accrues just like the other.

Given the example of 8 cores to 16 cores and using a 30 day month (720 hours) we get something like this:

Firstly, you're actually charging for CPU-hours. If an 8 core machine is $720/month ($1/hour) and a 16 core machine is $1440/month ($2/hour), then your rate is $0.125 per CPU-hour. This makes it very simple to track (and bill!!) the changes to the sellables that your customers are using.
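As a sanity check on that math, here's a quick sketch of the blend (note that $720 over a 720-hour month, spread across 8 cores, works out to $0.125 per CPU-hour; the 23-day/7-day split is a made-up example):

```javascript
// Hypothetical rate derived above: $720/month for 8 cores over a
// 720-hour month works out to $0.125 per CPU-hour.
const RATE_PER_CPU_HOUR = 720 / 720 / 8; // 0.125

// A machine runs at 8 cores for 23 days, then 16 cores for 7 days.
const usage = [
  { cores: 8, hours: 23 * 24 },  // 552 hours
  { cores: 16, hours: 7 * 24 },  // 168 hours
];

// Prorate each segment into CPU-hours, then price the total.
const cpuHours = usage.reduce((sum, u) => sum + u.cores * u.hours, 0);
const bill = cpuHours * RATE_PER_CPU_HOUR;

console.log(cpuHours); // 7104
console.log(bill);     // 888
```

$888 lands between the $720 flat 8-core month and the $1440 16-core month, which is exactly the blend of both machines described above.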

The metrics you might want to watch on these types of sellables/COGS are almost the opposite of the accruing type:

  • Month to date average
  • Rolling 30 days average

Comparing the month to date average vs. the rolling 30 day average would tell you whether a customer is tracking above or below recent history. It would be trivial to compare the two and throw an alert if the month to date (or a shorter term rolling) average is trending significantly above the 30 day average.
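That alert logic in miniature (the function name and the 20% threshold are arbitrary assumptions, just spitballing along with the rest of this):

```javascript
// Alert when the month-to-date average runs significantly above the
// rolling 30-day average. The 20% threshold is an arbitrary choice.
function isTrendingHigh(monthToDateAvg, rolling30DayAvg, threshold = 1.2) {
  return monthToDateAvg > rolling30DayAvg * threshold;
}

console.log(isTrendingHigh(12, 8));  // true  (50% above recent history)
console.log(isTrendingHigh(8.5, 8)); // false (within normal variance)
```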

Note: I'm on vacation and just want to remember some stuff for when I get back so don't dock me on this, I'm just spitballing

#business #analytics

I've been kicking around this thought for a year or so now – to the outsider a career in data looks like a technical path. The data practitioner learns SQL, the language for querying data stored in a database somewhere, and if you know enough SQL you can answer any question whose artifacts are stored in that database.

The reality is that SQL is the very last mile. SQL is code, and so it looks to the non-practitioner like the act of creation, like code written in any imperative language creates motion and process and a webapp or piece of automation that didn't exist before. SQL does not create. SQL encapsulates that which already exists as a business process.

SQL is a contract. SQL puts business conditions and processes into code. If the business processes are ill-defined, then the SQL that has to be written to handle all the various cases will sprawl. (Most business processes are ill-defined as it turns out, made up in a time of need by a human, and probably one who doesn't spend their day thinking about data modeling.) If the business process is well-defined, but the SQL author's understanding of it is wrong or incomplete, then you'll end up with a poorly written contract that spits out wrong or incomplete answers.

That's what makes Data the hard part, because to write that contract down always requires the author to have spent time reverse-engineering the business process. I view this as an inherent good for the business as a whole – it forces the business to reckon with itself and to better define how it operates. The road to get there is tough though and in my experience it's often the data analyst who is actually pulling the cart.

#analytics #business #databases

“Run your data team like it's a product team” was a common refrain at the DBT conference for the last two years. What does that mean? I am still figuring that out, but I have had an aha in the last few weeks about what a “data product” is, exactly.

Your standard software development process requires two main components to deliver the Things – a software developer and her wits. She takes those wits and expresses them into a text editor, and thereby makes something from truly nothing.

Data products differ in 1 key way – they require raw materials in the form of data. The process of building a data product therefore requires at least 1 additional step that standard software product development does not – refining that data into something consumable by the system that is delivering the product.

There can potentially be an additional step even before this one, which is to get the data in the first place. My current employer built an Observability suite and stack to be able to deliver metrics to our customers about their projects that they run/host here. This process took multiple quarters because the entire metrics creation and delivery pipeline had to be built from scratch. Once the data existed, it was then a process of refining the materials and building the product.

The good news is that many data products can be consumed in a standard way through some kind of BI or reporting or data visualization tool; we use Metabase. It has taken me a while to understand that the method of delivery of the product is the more standardized part, whereas the gathering and refinement of the raw materials/data is where the action is.

#analytics #business #data

So tracking pixels. They sound awful. They sort of are, but we all use them. One just fired off on you a minute ago when you loaded this page. That's how Google Analytics works its magic. But how do they work? The GA tracking code is Javascript and doesn't say anything about an image pixel.

Step inside...


Dat JS

So that javascript does a few things; primarily it creates another javascript tag that pulls down the real “payload”, which is a hackerish term for “a bigger ball of code”. I haven't analyzed that code yet, but one of the things it does is build a profile of the browser you're on and the page that you're looking at. Once it does that, it pings GA's tracking servers with that profile, which counts as a “pageview”. That's the ga('send', 'pageview') bit. But how does that work?

A tracking pixel!

Placement in the DOM, you need not...

So a pretty interesting thing about tracking pixels, and anything in the browser really, is that it doesn't actually need to be put on the page to exist in memory somewhere. In fact, if that pixel were placed in the page, even at only 1x1 in size, it could bump something out of the way enough to trigger a repaint of the webpage, which might alert you to the pixel's existence, which is something that advertisers and their ilk studiously avoid.

So basically, that ga('send', 'pageview') ends up generating a request to a server somewhere. That request looks like this

https://www.google-analytics.com/collect?v=1&_v=j29&a=806595983&t=pageview&_s=1&dl=http%3A%2F%2Fwww.ignoredbydinosaurs.com%2F2014%2F09%2Fdeconstructing-the-google-analytics-tag&ul=en-us&de=UTF-8&dt=Deconstructing%20the%20Google%20Analytics%20tag%20%7C%20Ignored%20by%20Dinosaurs&sd=24-bit&sr=1440x900&vp=1334x479&je=1&fl=15.0%20r0&_u=MACAAAQAI~&jid=&cid=1626931523.1412365384&tid=UA-8646459-1&z=962163205

In the network tab of your devTools in your favorite browser you can break down all those query string params into something a little more interesting.

v:1
_v:j29
a:806595983
t:pageview
_s:1
dl:http://www.ignoredbydinosaurs.com/2014/09/deconstructing-the-google-analytics-tag
ul:en-us
de:UTF-8
dt:Deconstructing the Google Analytics tag | Ignored by Dinosaurs
sd:24-bit
sr:1440x900
vp:1334x479
je:1
fl:15.0 r0
_u:MACAAAQAI~
jid:
cid:1626931523.1412365384
tid:UA-8646459-1
z:962163205
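You can reproduce that breakdown yourself with the standard URL API (the URL below is a shortened version of the real one above, just a handful of the params):

```javascript
// A trimmed-down version of the collect request shown above.
const hit = 'https://www.google-analytics.com/collect?v=1&t=pageview' +
  '&dl=http%3A%2F%2Fwww.ignoredbydinosaurs.com%2F&sr=1440x900&tid=UA-8646459-1';

// URL.searchParams splits and percent-decodes every param for you.
const params = new URL(hit).searchParams;
for (const [key, value] of params) {
  console.log(`${key}:${value}`);
}

console.log(params.get('tid')); // UA-8646459-1
```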

Some of that stuff is understandable, some of it is not. But the point is that that request actually triggers a response of a 1x1 pixel.

# Response headers

access-control-allow-origin:*
age:66666
alternate-protocol:443:quic,p=0.01
cache-control:private, no-cache, no-cache=Set-Cookie, proxy-revalidate
content-length:35
content-type:image/gif
date:Fri, 03 Oct 2014 18:23:52 GMT
expires:Mon, 07 Aug 1995 23:30:00 GMT
last-modified:Sun, 17 May 1998 03:00:00 GMT
pragma:no-cache
server:Golfe2
status:200 OK
version:HTTP/1.1
x-content-type-options:nosniff

If this were the first time I'd visited the internet, there would almost certainly be a set-cookie header in there as well, but since they set that cookie on me a LONG time ago, it doesn't get sent.

The kinda creepy thing is that since Google Analytics is on a large number of sites, and their origin servers are on the same domain (cookies), they can follow you around the internet from site to site to site in a way that nobody else can (save perhaps the other giant analytics providers, which probably have nowhere near the reach, unless you count Facebook).


Wow, cool. So what?

So that image pixel is not really the point. It gets returned in that response, but doesn't get put on the page. It plants a cookie on you, big deal.

But what happens at Google is that the request that was made in the first place gets logged. It gets broken down by its query string params, and that's how they build the tool. That's how you know what size browser people are on, what part of the world they're from, what they looked at, and what they clicked on (if you're tracking events).

The really interesting part to me, and the part I haven't figured out yet, is how they store ALL that data on the backend to build the reports out of. Think about it — they're basically logging every request made to every website that is running their code. That's a really big number, even for Google. And then they're able to pull your report suite out of all that data, and sort it out by whatever you wanna know. Seems pretty cool, and also well beyond the capability of normal relational DBs.

A post in the future, I imagine....

#javascript #analytics

If you're a web developer, I'm sure you've placed this snippet of code into more than a few projects.

(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');

ga('create', 'UA-XXXX-Y', 'auto');
ga('send', 'pageview');

Let's unpack it a little bit -

(function(i, s, o, g, r, a, m) {
 i['GoogleAnalyticsObject'] = r;
 i[r] = i[r] || function() {
 (i[r].q = i[r].q || []).push(arguments)
 }, i[r].l = 1 * new Date();
 a = s.createElement(o),
 m = s.getElementsByTagName(o)[0];
 a.async = 1;
 a.src = g;
 m.parentNode.insertBefore(a, m)
})(window, document, 'script', '//www.google-analytics.com/analytics.js', 'ga');

Now let's make those local variables a little more clear -

(function() {
	var a, m;
	window['GoogleAnalyticsObject'] = 'ga';
	window['ga'] = window['ga'] || function() {
	(window['ga'].q = window['ga'].q || []).push(arguments)
	}, window['ga'].l = 1 * new Date();
	a = document.createElement('script'),
	m = document.getElementsByTagName('script')[0];
	a.async = 1;
	a.src = '//www.google-analytics.com/analytics.js';
	m.parentNode.insertBefore(a, m)
})()

So if you go to your javascript console and type “GoogleAnalyticsObject”, you'll get back the string “ga”. window.ga is a function, but since functions in javascript are also objects, it has a property called q, which is just an array. This is reminiscent of the old ga.js syntax which went something like this —

var _gaq = _gaq || [];
 _gaq.push(['_setAccount', 'UA-XXXX-Y']);
 _gaq.push(['_trackPageview']);

 (function() {
 var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
 ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
 var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
 })();

_gaq is/was just a plain old Javascript array, which gives it the push() method for free. This new ga.q property serves the exact same purpose, an array to push things into and wait for something to come along later and pop them off. That something that comes along later is whatever is contained in that async script that this snippet also builds.

This is super clever because it doesn't have to wait for anything, it can go ahead and do all its business the instant the page loads and even if the main tracking script doesn't come down for some reason, nothing breaks.
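The queue pattern in miniature, with nothing GA-specific about it (the stub and drain function names here are made up for illustration):

```javascript
// 1. The tiny loader snippet: a stub that just queues every call.
const tracker = function () {
  (tracker.q = tracker.q || []).push(arguments);
};

// Calls made before the real library arrives are simply queued.
tracker('create', 'UA-XXXX-Y');
tracker('send', 'pageview');

// 2. Later, the async "payload" script arrives and drains the queue.
const processed = [];
function drainQueue(stub) {
  for (const args of stub.q || []) {
    processed.push(Array.from(args));
  }
}
drainQueue(tracker);

console.log(processed);
// [ [ 'create', 'UA-XXXX-Y' ], [ 'send', 'pageview' ] ]
```

If the payload never shows up, the stub keeps quietly accepting calls and nothing on the page breaks, which is the whole point.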

Back to analytics.js...

Whatever you hand as arguments to ga() gets fed into ga.q right here —

window['ga'] = window['ga'] || function() {
	(window['ga'].q = window['ga'].q || []).push(arguments)
}

If you pop open the console on the front page of this blog, and type in ga.q, you'll get this —

> ga.q
[
Arguments[3]
0: "create"
1: "UA-8646459-1"
2: "ignoredbydinosaurs.com"
callee: function (){
length: 3
__proto__: Object
, 
Arguments[2]
0: "send"
1: "pageview"
callee: function (){
length: 2
__proto__: Object
]

Those are stashed in the queue because as soon as that first bit of code is parsed out, there are two quick calls to ga(), and that's exactly what they have as their arguments. It's so simple, it's almost stupid to explain, but the script is so heavily optimized it's not at all obvious on first glance what's going on here.

Moving on, there's another property of the ga function/object – ga.l. ga.l gets initialized to a javascript timestamp (in milliseconds). new Date() returns a javascript Date object, but multiplying it by the integer 1 coerces it into a number, which is the number of milliseconds since the epoch. Another way of writing this would be +new Date() – another, albeit less clear, way of performing the same cast to a timestamp. ga.l's purpose is to provide a time for the initial “create” and “pageview” calls to ga().
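You can verify the casting trick right in the console:

```javascript
const d = new Date();

// Multiplying (or unary +) coerces the Date through its valueOf(),
// which returns milliseconds since the epoch.
console.log(typeof (1 * d));        // "number"
console.log(1 * d === +d);          // true
console.log(1 * d === d.getTime()); // true
```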

Lastly, an asynchronous javascript tag is written to make the call to fetch the analytics.js script from Google's servers so the real magic can start.

Another interesting bit is that the a and m parameters are not assigned anything at the IIFE call at the end. This leaves them undefined in the script until they are assigned the script tags toward the end of this snippet. Another way of writing the exact same thing would be to have only (i,s,o,g,r) as parameters to the function, and then declare var a, m; somewhere in the snippet. I'm not sure offhand if this is a memory or performance optimization or just a handy way to save a couple bytes over the network, but someday I'll figure it out.
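The trick leans on the fact that parameters with no matching argument simply start out undefined, which makes them free var declarations (the function name here is made up for illustration):

```javascript
// Parameters with no matching argument start out undefined,
// exactly as if they'd been declared with `var` inside the body.
function demo(x, a, m) {
  const before = a; // undefined: no second argument was passed
  a = 'assigned later';
  return { before, after: a, m };
}

console.log(demo(1));
// { before: undefined, after: 'assigned later', m: undefined }
```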


Thanks for sticking with me – this is one of the most common little snippets that I've probably placed in my web development career, and I'd never totally dug in to understand what exactly it does beyond writing an async script tag. The pattern of declaring a “plain old javascript array” and then pushing your “stuff” into it as a queue for working with later is an extremely common pattern in 3rd party javascript, since you want everything to be as performant as possible, and you want to make sure you don't break anything if for some reason the rest of your payload script doesn't actually load.

#javascript #analytics