Ignored By Dinosaurs 🦕

databases

I've been kicking around this thought for a year or so now – to the outsider, a career in data looks like a technical path. The data practitioner learns SQL, uses it to query data stored in a database somewhere, and if you know enough SQL you can answer any question whose artifacts live in that database.

The reality is that SQL is the very last mile. SQL is code, and so to the non-practitioner it looks like an act of creation, the way code written in an imperative language creates motion and process and a webapp or piece of automation that didn't exist before. SQL does not create. SQL encapsulates that which already exists as a business process.

SQL is a contract. SQL puts business conditions and processes into code. If the business processes are ill-defined, then the SQL that has to be written to handle all the various cases will sprawl. (Most business processes are ill-defined as it turns out, made up in a time of need by a human, and probably one who doesn't spend their day thinking about data modeling.) If the business process is well-defined, but the SQL author's understanding of it is wrong or incomplete, then you'll end up with a poorly written contract that spits out wrong or incomplete answers.
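To make that sprawl concrete, here's a toy sketch in Python with SQLite. The table, the coupon, and the revenue rules are all completely made up – the point is just how each ad-hoc business decision becomes another `WHEN` branch in the contract:

```python
import sqlite3

# Hypothetical "orders" table whose revenue rules were invented ad hoc
# over the years, so the SQL contract sprawls with special cases.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, channel TEXT, amount REAL, coupon TEXT);
    INSERT INTO orders (channel, amount, coupon) VALUES
        ('web',     100.0, NULL),
        ('partner', 100.0, NULL),
        ('web',     100.0, 'LEGACY2019');
""")

# Every WHEN branch encodes a rule somebody made up in a time of need.
query = """
    SELECT id,
           CASE
               WHEN channel = 'partner'    THEN amount * 0.85  -- partner rev share
               WHEN coupon  = 'LEGACY2019' THEN amount - 15.0  -- grandfathered promo
               ELSE amount
           END AS recognized_revenue
    FROM orders
    ORDER BY id
"""
print([row for row in conn.execute(query)])  # [(1, 100.0), (2, 85.0), (3, 85.0)]
```

Two rules in, the query is already order-dependent (a partner order with the legacy coupon hits the first branch only), which is exactly the kind of thing the contract's author has to reverse-engineer from the business.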

That's what makes Data the hard part, because to write that contract down always requires the author to have spent time reverse-engineering the business process. I view this as an inherent good for the business as a whole – it forces the business to reckon with itself and to better define how it operates. The road to get there is tough though and in my experience it's often the data analyst who is actually pulling the cart.

#analytics #business #databases

So I'm back in the writing code game after 4 years off. I started a new position about 2 months ago as “director of data engineering” at Platform.sh. It only took ten years longer than originally planned but I have finally unlocked the “some kind of engineering mgmt at an actual tech company” badge.

I'll get into what I'm doing maybe later, but right now I want to leave a breadcrumb for my past self regarding looping over tasks in Prefect.

I'm just kind of tinkering with pulling info from APIs at this point, so the workflow looks basically like this for a couple different vendors:

  • Pull a list of accounts
  • For each of those accounts, pull a list of services
  • For each of those services, pull service metrics for a range of days

This ends up fanning out to potentially thousands of API calls, so I'm doing a lot of for ... in: looping in Python. Given that I'm using Prefect though, I'm bumbling up the ladder of how to do these things in a more idiomatic way. Prefect has a native concept of mapping that reduces the amount of work you need to do in each of these steps and makes each iteration of each step more atomic.

By example – my first, naive approach looked something like this:

  • Pull a list of accounts() returns –> list of accounts, then
  • Pull a list of services for each of those accounts(list of accounts) loops over and then returns –> list of services
  • Pull a list of metrics(list of services) loops over enormous list of services and then returns –> ginormous list of metrics that is the thing we're actually interested in.

Obviously a failure in any one of the loops in the process means the whole thing derails, so I was doing a lot of work trying to handle that. Prefect has a smarter way in its native mapping API. Basically, instead of making the downstream tasks loop over a list of things, you write the downstream task assuming it will handle 1 of the things in the list. Then instead of calling do_this_over_and_over_in_a_loop(list) you call something like do_one_iteration.map(list). Prefect then handles spawning a “child task” for each of the items in the list. This makes it possible to run the loops in parallel, and even to create entire parallel pipelines out of the original “pull the list of accounts” job. It also handles failures in any of the iterations atomically, and doesn't kill the entire loop.

The whole flow then looks like:

  • Pull a list of accounts() returns –> list of accounts
  • Pull a list of services for 1 account.map(list of accounts) returns –> list of services for 1 account, then you call map again on the task downstream of that
  • Pull a list of metrics for 1 service.map(list of services) returns –> the stuff you're really interested in.

The .map() bit unrolls each of the lists and creates a child task for each element of the list, then calls the task on that one element.
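If you want to feel the shape of this without Prefect installed, here's roughly the same fan-out in plain Python with concurrent.futures. The fetch functions are fakes standing in for the real API calls, and pool.map is playing the role of task.map():

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real vendor API calls.
def pull_accounts():
    return ["acct-1", "acct-2"]

def pull_services(account):   # written to handle exactly 1 account
    return [f"{account}/svc-{n}" for n in (1, 2)]

def pull_metrics(service):    # written to handle exactly 1 service
    return {"service": service, "uptime": 99.9}

with ThreadPoolExecutor() as pool:
    accounts = pull_accounts()
    # "map" spawns one child call per element, like task.map(list)
    service_lists = pool.map(pull_services, accounts)
    services = [svc for lst in service_lists for svc in lst]  # flatten
    metrics = list(pool.map(pull_metrics, services))

print(len(metrics))  # 4
```

One caveat on the analogy: an exception in any child here blows up when you iterate the results, so Prefect's per-iteration failure handling is doing real work that this sketch doesn't.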

I'm struggling right now with getting this all running on Platform.sh with Dask, but I'm pretty excited with the potential given my relative rustiness and that I'm working basically alone so far.

More later, <3 love you!!

#business #databases

I recently rebuilt this site (again, I think this is v9 now) into Rails and Postgres. I was moving out of Drupal, so there was a significant downsizing of the database, and some pretty painless migration of the data. Once again, I used the Sequel gem to write the migration script to pull out the posts and related data that gets scattered hither and yon in Drupal's database. That was the easy part.

I did all of this over a period of about two weeks and it was joyous, painless, and fun. I used the Textacular gem to power that full text search up there (if you're on a #{$medium-and-up} device) and generally got to know a teeny little bit more about Postgres – the grown up's open source database. This whole thing is just a step on the path of the continuous quest toward “the idea”, and some of Postgres' features that let you denormalize some easy data will definitely come in handy when I understand the fundamentals of the engine a little better. I just ran into one of them.

Postgres sequences

When you want an auto-incrementing id as a primary key on your database table in MySQL, it's very straightforward. It's essentially a property of the column itself, i.e.

+-------+---------+------+-----+---------+----------------+
| Field | Type    | Null | Key | Default | Extra          |
+-------+---------+------+-----+---------+----------------+
| id    | int(11) | NO   | PRI | NULL    | auto_increment |
+-------+---------+------+-----+---------+----------------+

In Postgres, which is a much more robust database, you have an entire “sequence” object available in the database. You can program these sequences as you would an object, so if for some reason you wanted to skip even numbers, you can do that. The first use case for this that comes to mind is sharding your dataset when scaling out, but it'll likely be a while before I find that out in practice.

Anyway, when I migrated the blog data into the system it brought the numeric primary ids with it, but just now when I went to author my first blog post on the new system it bombed out when I went to save.

ERROR -- : PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "posts_pkey"
DETAIL: Key (id)=(1) already exists.

If I hadn't read about sequences already, this would've confused the shit out of me. So the fix is to holler at your sequence object to start incrementing where you need it to, rather than 1, which is where it's currently set.

ibd_db=# select id from posts order by id desc limit 5;
  id
  -----
  251
  250
  244
  243
  239
(5 rows)

ibd_db=# alter sequence posts_id_seq restart with 252;

// Save post, no problems

ibd_db=# select id from posts order by id desc limit 5;
  id
  -----
  252
  251
  250
  244
  243
(5 rows)

So ALTER SEQUENCE sequence_name RESTART WITH next_id_in_line;.

#databases #postgres

I was on the phone with Mom yesterday, and we got to talking about technology – a thing that actually happens fairly frequently. Since I'm her only kid, she's genuinely interested in everything that I do, and it's been helpful to have someone who's mostly non-technical to bounce explanations off of when I'm getting my head around a new piece of gear.

The piece of gear that I was explaining the other day was something called Mongo DB. Mongo's parent company is called 10gen, and they landed on the startup scene about 5 years ago or so with their flagship product, Mongo DB. Mongo is currently the pre-eminent player in the “NoSQL” database market. The NoSQL flavor of databases has come en vogue in the last few years in certain technology sectors, primarily ones that are evolving so quickly that having to slow down to put forethought into your data store and how it's going to be structured might literally be the difference between your whole company succeeding or not. That whole sentence will make more sense in a minute, but first it'd be helpful to understand how a traditional, “relational” database works.

The Relational model

The relational model of storing data has been around for more than 40 years. Wikipedia, of course, has a great article – http://en.wikipedia.org/wiki/Relational_database – giving a technical overview, and I wrote my own article 4 years ago while trying to get my own head around the concept – http://www.ignoredbydinosaurs.com/2009/04/chapter2-databases. Wow, that's hilarious reading that article actually, since my “Railroad Earth setlist database” example is well underway – http://www.hobocompanion.org/. So yeah, read that second one for the idiot's guide.

The classic example I gave to my mom was that of a common blog. You have a table for all your posts. Each post in that table has a row, and each row has something like a numeric ID and the text of that post. When you want to read a blog post, the URL in that address bar of yours says something to the effect of “give me post #1”. Whatever blog hosting software you're using then turns around to the database behind it and says “select everything from the posts table where the ID of the post is 1”. The database hands back the post, the software renders some HTML out of it and sends it back to your browser. This is the simplest example of you interacting with a database.

The relational model typically comes into play when you visit a blog that has comments. Now you don't just have a Posts table, but you also have a Comments table. That Comments table will most likely have the same ID column, and a column called “body” or something like that for storing the text of the comment. However, this table will also have a column called something like “postid”, and what gets stuffed in that column is the (you guessed it) ID of the blog post that this comment “relates” to. So now when your reader comes by, the blog software turns around to the database and asks for two things this time – “select everything from the posts table where the ID of the post is 1”, and then “select everything from the comments table where the postid is 1”. This second “query”, if you're lucky enough to write something that people respond to, will return a list of comments, an array if you will, that your blog software will then convert to HTML and append to your blog post in the form of the comments section. Pretty simple, or is it?
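Here are those two queries in action as a tiny runnable sketch, in Python with SQLite. The table and column names are just for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts    (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, postid INTEGER, body TEXT);
    INSERT INTO posts    (id, body)     VALUES (1, 'Hello, world');
    INSERT INTO comments (postid, body) VALUES (1, 'First!'), (1, 'Nice post');
""")

# "select everything from the posts table where the ID of the post is 1"
post = conn.execute("SELECT * FROM posts WHERE id = 1").fetchone()

# "select everything from the comments table where the postid is 1"
comments = conn.execute("SELECT * FROM comments WHERE postid = 1").fetchall()

print(post)           # (1, 'Hello, world')
print(len(comments))  # 2
```

The “postid” column on comments is the whole relational trick: it's the only thing tying a comment back to its post.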

Issues with the relational model

For the purposes of this simplistic example, this hopefully isn't that hard to get your head around. “But”, you might be wondering, “does that really make sense to store blog posts over here and comments over there? They're clearly related. When are you ever going to be examining a post's comments without wanting that post also?”

Very good. And this is a prime example of when a “non-relational” or “NoSQL” database might make a lot of sense.

The non-relational model

Let's stick with the same example, the blog post and comments, but let's think about how to “model” this in a non-relational way. By the way, if you're still with me, you have a deeper technical understanding than 99% of everybody around you.

The non-relational data model would look more like a sheet of paper. In fact, the concept of one entity and all the data that pertains to that one entity is known in Mongo as a “document”, so truly this is a decent way to think about it.

The way that your blogging software interacts with this blog post is theoretically a lot simpler in a non-relational database. Request comes in, blogging software turns around to Mongo and says “Please give me back this specific post and everything related to it”, in this case a listing of comments. BUT, that's not all. Since we're not forced to be too uptight about having to define how the data is structured beforehand, what if we want to tag that post with an arbitrary number of categories? No problem, stick them on the same document and when blog software says “gimme everything on Post #1”, the tags, the comments, and any other randomly associated data come back with it. Your software doesn't need to know ahead of time, you don't need to know ahead of time. It all just kind of works, provided you know what you're doing on the software level.
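In code, a “document” is basically a nested dictionary. Here's a toy sketch in Python, with field names invented for illustration – the post, its comments, and its tags all travel together:

```python
# One "document": the post plus everything that pertains to it, no joins needed.
post = {
    "_id": 1,
    "body": "Hello, world",
    "comments": [
        {"author": "keith", "body": "First!"},
        {"author": "anon",  "body": "Nice post"},
    ],
    # Tags bolted on later -- no schema change required.
    "tags": ["databases", "theory"],
}

# "Gimme everything on Post #1" is a single lookup.
print(len(post["comments"]))  # 2
print(post["tags"])           # ['databases', 'theory']
```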

This amount of flexibility is what makes Mongo a very popular “data store” with quickly moving technology companies that might be chasing an idea that changes from week to week or day to day. You don't have to rearchitect your entire system when you discover, down the road, this particular use case where you need to stick some “extra data” on this particular “thing”.

Issues with the non-relational model

It seems obvious that for this example a non-relational database makes a lot of sense, but like any useful tool, it has its limits.

For this example, let's use a grocery list. You can model the grocery list in much the same way – a piece of paper, you write the items you want on there. But let's say, for the purposes of this example, you figure out after a couple of months that you want to keep track of how many loaves of bread you actually bought last year. Well, with the non-relational model you might literally have to go through every list and count each loaf individually (*simplistic example alert*), whereas had you modeled this in a relational way, you could get that count back almost instantly. Yes, it'd be a bit more work on the front end – essentially you'd have a Lists table and an Items table. But, modeled correctly, you'd also have a table in the middle called a “join table” that allows you to associate any number of items with any number of lists. After a while, were you truly the hacker, you could probably write some code that'd predict what you need to put on that list without even having to put it on there yourself, based strictly off of patterns that are easily discernible in a relational database.
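Here's that grocery-list schema as a toy sketch in Python with SQLite – a Lists table, an Items table, and the join table in the middle (all names made up). The “how many loaves of bread” question becomes one query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lists      (id INTEGER PRIMARY KEY, week TEXT);
    CREATE TABLE items      (id INTEGER PRIMARY KEY, name TEXT);
    -- the "join table" in the middle
    CREATE TABLE list_items (list_id INTEGER, item_id INTEGER, quantity INTEGER);
    INSERT INTO lists (week) VALUES ('week 1'), ('week 2');
    INSERT INTO items (name) VALUES ('bread'), ('milk');
    INSERT INTO list_items VALUES (1, 1, 2), (2, 1, 1), (2, 2, 1);
""")

# "How many loaves of bread did I buy?" -- no flipping through every list.
(total,) = conn.execute("""
    SELECT SUM(li.quantity)
    FROM items i
    JOIN list_items li ON li.item_id = i.id
    WHERE i.name = 'bread'
""").fetchone()

print(total)  # 3
```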

That's why the Hobo Companion has a relational database behind it. It's super easy to count how many and which shows each user was at (2000+ shows tagged between about 30 users so far). It's super easy to count the number of times they've played Mighty River. It's super easy to count the number of times that I played Mighty River (somewhere around 260 times). It's super easy to figure out where in the setlist they typically put Dandelion Wine (Either first, or at the end of the first set). These calculations are far from impossible in a non-relational database, but they are also pretty far from trivial.

In situations where you absolutely want to be really uptight about your data and how it's structured – think bank accounts as the classic example – a relational database is absolutely required, not just because it's theoretically better suited for the job, but it's also a “mature technology” that's older than I am.

The wonderful point here is that we actually have decent options depending on what it is we are actually trying to do. Startup hackers love Mongo precisely because it lets you move fast. You can rock out an application in a weekend without having to spend most of one day setting up the equipment in the studio first.

Thanks for sticking with me.

#databases #theory

I've been a Media Temple customer for a long while now, almost 4 years. The file system on my grid server is a timeline of my learning programming and systems administration. I've mostly worked on big-boy systems over the last 2 years or so, so my comfort level with working through SSH is pretty high now (much higher than working through GUI tools that want to hold your hand, actually). That's one thing that I've always loved about MT – the fact that they give you a pretty decent command line experience for administering your server.

One thing that's kind of always gotten on my nerves, however, is the DB admin experience – lately because it seems like their phpMyAdmin installation has completely gone to shit. Seriously, it's unusable. I couldn't even restore/import a < 1 MB brand-new-site database the other day because it kept timing out. Of course, this would be a 3 second maneuver on the command line, mysql -u user -p target_database < dump.sql, but they have the command line locked down on the grid server.

I've recently moved a rather involved client project back onto their production server which is hosted on said grid server. I'm still banging on the thing, so I'd like to delete some rows straight out of the DB without having to spend 30 minutes trying to get PHPMyAdmin to do what I need it to. So the other day I went poking around the hosting admin panel.

I'd seen this years ago, and fortunately it's still there, probably just for this use case. In the “global settings” under the “databases” section of your admin panel at the very bottom of the screen is an option to allow external IP addresses access to the DB server. It even has a handy option to “use current IP” to prefill the field for you. Do this.

Next, invoke mysql locally, but pass it the --host option, like this —

mysql -u user -p --host=external-db.s123456.gridserver.com (the bare -p will prompt you for the password).

You're welcome.

PS – I tried importing the aforementioned < 1 MB db via this method to no avail. Don't know if their db servers were on the blink at the time, but it definitely wasn't bandwidth.

PPS – it has occurred to me since this morning that mysql doesn't run on localhost, that's why you can't login as you normally would. Thus, if you were logged into your webserver, you could pass the same --host argument and specify the internal-db host to carry on about your business.

Never said I was that good at this stuff.

#databases #devops #mysql

This is part of an IRC transcript between a buddy and me, wherein I try to explain a little bit about how the internet works and why knowing at least a little bit of database theory will go a long way in demystifying learning how to build stuff. (It's a technical term.) He's trying to learn a bit about Drupal and about how you build sites with it, so I'm going to tag him in on a project for a friend of mine who runs a yoga studio. Hers was the first site I ever built, in Wordpress. I've been thinking about porting it to Drupal and having some more fun with it, as she'd kinda like a database of her registered students among a few other niceties that would be fairly easy to pull off in Drupal.


grubb: And it basically works fine, but I wanted to have a bit more fun with it and integrate the calendar into the site

[4:21pm] grubb: but it's pretty much a perfect easy Drupal site.

[4:21pm] grubb: When I say “there'd be a couple of different content types

[4:21pm] grubb: “

[4:21pm] Keith__: ok

[4:21pm] grubb: does that mean anything to you?

[4:24pm] grubb: So I'll just go ahead and explain it

[4:24pm] Keith__: ok, sorry multitasking

[4:24pm] grubb: Pretty much everything you interact with on the internet is an interface to a database somewhere.

[4:25pm] grubb: no sweat.

[4:25pm] grubb: I assume you're familiar with super-basic database theory

[4:26pm] grubb: basically a database is a bunch of data, and that data is ordered in a very structured way so that it's easy to tell a computer how to go get the specific data you're looking for

[4:26pm] grubb: Example – http://ignoredbydinosaurs.com/

[4:27pm] grubb: if you scroll to the bottom, the blog – that's one content type on my site, which means more or less that each one of those posts is located in one row of the “blog” table in the database.

[4:27pm] grubb: I'm the only user on that site, but if there were more we'd each occupy one row in the “user” table in the database.

[4:28pm] grubb: My portfolio is another content type – each portfolio entry is one row in the “portfolio” table in the database.

[4:28pm] grubb: (This is a simplified explanation, but basically accurate)

[4:28pm] grubb: If you go to Facebook, each user is a row in the “user” table in the FB database.

[4:29pm] grubb: Each post on everybody's wall is a row in the “wall_posts” table in their database.

[4:29pm] grubb: The “wall_posts” table in the database would have a column for the “post” and for the “user” who posted it.

[4:30pm] Keith__: ok

[4:30pm] grubb: So if I post something on my Facebook wall, that particular row would have “blah, blah blah” (whatever the post is) and my user_id

[4:31pm] grubb: So by linking that table with the post on it with the table that contains my profile info (including my user_id), you can put all of that info onto my post on someone else's wall.

[4:31pm] Keith__: ok

[4:31pm] Keith__: based on user id

[4:31pm] grubb: and that post would have all the meaningful shit that you want to see on FB

[4:31pm] grubb: right

[4:31pm] Keith__: ok

[4:31pm] grubb: so the user_id is important

[4:32pm] Keith__: what browser do you work in?

[4:32pm] grubb: it appears not just in the user table, but also in the posts table so you can link the two together

[4:32pm] grubb: I use Chrome

[4:32pm] grubb: is this making sense?

[4:32pm] Keith__: somewhat

[4:32pm] grubb: the point is obviously that whenever you go to a web page, you're looking at HTML

[4:32pm] grubb: (scuse me, it's not that obvious)

[4:33pm] Keith__: ok

[4:33pm] grubb: But there's not a mountain of programmers writing each page of HTML that anyone could possibly look at

[4:33pm] grubb: on the entire internet

[4:33pm] Keith__: right

[4:34pm] grubb: When you go to your Facebook homepage, a computer on the other end of that is putting that particular page of HTML together “dynamically”

[4:34pm] grubb: it says “okay, database, give me the 50 most recent posts from all of Keith Hick's friends. put the most recent one at the top”

[4:35pm] grubb: and that is fundamentally what a database does

[4:35pm] Keith__: ok

[4:35pm] grubb: and so it hands that information back to the computer which then puts that specific page together for you

[4:35pm] Keith__: right

[4:35pm] grubb: If you go to the front page of my website it does the same thing

[4:36pm] Keith__: complex filing system deciphered?

[4:36pm] grubb: you mean this/path/or/whatever?

[4:36pm] grubb: to get that particular page?

[4:36pm] Keith__: huh

[4:36pm] Keith__: yeah

[4:36pm] grubb: what do you mean?

[4:36pm] Keith__: never mind

[4:37pm] grubb: OK

[4:37pm] Keith__: let's plan on rebuilding this site

[4:37pm] grubb: OK

grubb: so basically, start looking at most webpages as a thing that has been lovingly crafted for you by a computer at that specific point in time, because that's what a lot of them are

#databases