Like many beginning Scala programmers, I was exposed to the Cake Pattern early on and told that this is how you do dependency injection in Scala. Coming from the Ruby world I thought it looked like an awfully heavy-weight method, but of course I didn’t know any other way yet. Right away I was placed on a project in which the Cake pattern was apparently very much in use, a CMS built on Play.

I was tasked with adding a sitemap feature, such that when the path /sitemap.xml was requested, a sitemap of the site would be rendered. This seemed straightforward enough. I would just need to pull some data about the site’s currently published pages from the database and massage it into some pretty straightforward XML. This being Play, I started with a controller, and right away knew I’d need to pull in whatever code pulls pages from the database, which was pretty easy to find. I soon found I would also want to pull in a trait for looking at the contents of the HTTP request. Again, no big deal.

trait SitemapController extends Controller
    with SiteRequestExtractorComponent
    with PageRepositoryComponent {

  // Play controller methods return an Action
  def sitemap = Action {
    // the magic happens...
    Ok(sitemapXML)
  }
}

Simple enough, until I tried to compile:

[error] /Users/chuckhoffman/dev/cms/app/controllers/SitemapController.scala:48: illegal inheritance;
[error]  self-type controllers.SitemapController.type does not conform to models.page.CmsPageModule's selftype models.page.CmsPageModule with models.auth.UserRepositoryComponent with models.auth.GroupRepositoryComponent with models.approval.VersionApprovalComponent with models.email.EmailServiceComponent
[error]   with CmsPageModule

Hm. Looks like somebody used that Cake pattern thingy to inject dependencies into CmsPageModule having to do with users, user “groups,” and approval of new content. That probably has to do with who can do what kind of updating of pages, so even though that isn’t relevant to what I’m after since I only want to read page data, not update it, it still seems reasonable. I’ll just find the right traits that satisfy those three things – even though I’m not really using them here – and add withs for them and all should be good.

One little snag, I guess… it turns out that those traits were “abstract”, which meant grepping through the code to find the correct “implementations,” which turned out to be UserRepositoryComponentPostgres, GroupRepositoryComponentPostgres, and MongoVersionApprovalComponent. (This is a common sort of thing to do, since one often wants to mock out the database for tests.) Took a while to track them down, but eventually I did. So surely I should be able to just add those three withs to the SitemapController, add the imports of them to the top of the file, and now we’re off and running, yeah?

[error] /Users/chuckhoffman/dev/cms/app/controllers/SitemapController.scala:48: illegal inheritance;
[error]  self-type controllers.SitemapController.type does not conform to models.page.CmsPageModule's selftype models.page.CmsPageModule with models.auth.UserRepositoryComponent with models.auth.GroupRepositoryComponent with models.approval.VersionApprovalComponent with models.email.EmailServiceComponent
[error]   with CmsPageModule
[error]        ^
[error] /Users/chuckhoffman/dev/cms/app/controllers/SitemapController.scala:51: illegal inheritance;
[error]  self-type controllers.SitemapController.type does not conform to models.approval.MongoVersionApprovalComponent's selftype models.approval.MongoVersionApprovalComponent with models.page.PageModule with models.treasury.TreasuryModule with models.auth.UserRepositoryComponent with models.auth.GroupRepositoryComponent with models.email.EmailServiceComponent with com.banno.utils.TimeProviderComponent
[error]   with MongoVersionApprovalComponent
[error]        ^

Oh. Looks like there’s now some kind of dependency here being enforced between pages and something having to do with email; also, versions, in addition to depending on pages, users, groups, and that same email thing again, also depend on… treasuries? Huh?

Plainly there’s a design problem here because I’m now being forced to mix in traits having to do with treasuries (these are bank websites) into a controller that makes a sitemap. At this point, however, I don’t know Scala well enough to pull off the refactoring this needs with all these self-types in the way. So off I go to find more traits to mix in to satisfy those self-types. Then those traits turn out to have self-types forcing the mixin of even more traits, and so on.

After a day and a half of work, I finally had a working SitemapController.scala file containing about ten lines of actual “pulling web pages data from the database” and “building some XML,” and a couple dozen lines of mostly irrelevant withs and imports just so the bastard would compile.

It’s Time We Had A Talk About What A “Dependency” Is

Consider this: given two modules A and B (in the general sense of “bunch of code that travels together,” so Scala traits and objects, class instances, Ruby modules, and so forth, all apply), each having, let’s say, a dozen functions: if one of the functions in A calls one of the functions in B, does that make B a dependency of A?

I’ll save you the suspense. No, it does not. Or at least, not that fact alone. In fact, laying aside the concern that a dozen functions might be too many for one module anyway, it’s clear that the dependency is between those two functions, not the whole modules they happen to be in. Which suggests that that one function in that module is responsible for some functionality that may not be all that relevant to what the other eleven are for. In other words, you have a case of poor cohesion.
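To make that concrete, here’s a minimal sketch (hypothetical names, not from the CMS in question): rather than forcing everything that touches A to also mix in all of B just so one function can call one function, you can pass in the single function A actually needs.

object B {
  // the one function A actually uses, among a dozen others
  def publishedPages(siteId: Long): Seq[String] =
    Seq("home", "about") // stub

  // ...eleven other functions that A couldn't care less about...
}

object A {
  // the dependency is on one function, not on all of B,
  // so take it as a parameter instead of demanding a mixin
  def sitemap(siteId: Long, publishedPages: Long => Seq[String]): String =
    publishedPages(siteId).map(p => s"<url>/$p</url>").mkString("\n")
}

// wired together at the call site:
A.sitemap(42L, B.publishedPages)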

To the extent that we promote the Cake pattern to new Scala programmers before they have a handle on what good design in Scala looks like, I believe we’re putting the cart before the horse. The cake pattern, or more generally, cake pattern-inspired self-typing, takes your bad design and enlists the compiler to help cram it down others’ throats. Couple this with the fact that a lot of new Scala programmers think that: (1) because I’m writing Scala, I’m doing functional programming; (2) functional programming is the wave of the future and OO is on its way out, therefore (3) The past couple decades of software design thinking, coming as it does from the old OO world, has no relevance to me; and we get situations like my humble little sitemap feature.

Cake-patterned code, especially badly cake-patterned code (which has been the majority of cake-patterned code I’ve seen, which isn’t surprising given the pattern’s complexity – literally nobody I’ve talked to seems to quite completely “get” it, myself included), is needlessly difficult to refactor. That’s partly because of the high number of different modules and “components” involved, and because you have to very carefully pick apart all the self-types (especially when those have even more withs in them). But it’s also because you frequently find yourself wanting to move some function A that needs to keep calling some function B, and B turns out to be very difficult to find, let alone move – it might be in some trait extended by the module A is in, or in some trait extended by one of those, or some trait extended by one of those, and so on, to the point where B could be almost anywhere in your project or any library it uses; and likewise, anywhere in there could easily be a completely different function with the same name. All this just so that you can get the compiler to beat the next developer who has to maintain this code over the head with errors if he doesn’t extend certain traits in certain places, despite the fact that the compiler is already perfectly good at knowing whether you’re trying to call a function that isn’t in scope.

To make matters worse, most folks’ introduction to functional programming these days still consists of pretty basic Lisp or Haskell use, throwing all your program’s functions into one global namespace with no real modularization. It’s no surprise then if they see either the cake pattern or trait inheritance in general as simply a way of cramming more stuff into one namespace. Old Rails hands will hear echoes of concerns or, more generally, the Ruby antipattern of “refactoring” by shoving a bunch of seemingly-sorta-related stuff out of the way into a module (it makes your files shorter on average, but doesn’t necessarily improve your design any).

Cohesion and coupling, separation of concerns, connascence, even things like DCI, these things still matter in Scala and in any of today’s rising functional or mostly-functional programming languages – or for that matter, any programming language that gives you the ability to stick related things together, which is pretty much all the useful ones. (I posit that DCI may be especially relevant to the Scala world as it seems like it would play nicely with anemic models based on case classes.)

I hate to keep harping on my Ruby past, but I heartily recommend Sandi Metz’s book Practical Object-Oriented Design in Ruby. Scala is really just kind of like a verbose, statically-typed Ruby plus pattern matching, when you think about it. Both combine OO and functional concepts, both have “single-inheritance plus mixins” multiple-inheritance; heck, even implicit conversions are just a way better way of doing what refinements are trying to do.

Ultimately though, the cake pattern has the same problem as used to be pointed to about those other “patterns” when they were all the rage: people learned the patterns early on, and started using them everywhere because they thought that was how you’re supposed to program now. They ended up with overly convoluted designs because they were wedging patterns in where they weren’t necessary or didn’t make sense, rather than first understanding the reasons the patterns existed, reaching for the patterns only when they found themselves facing the design puzzles the patterns are intended for.

"Soul patriot and truth lover"

Wow, I have really been neglecting the blogging. It’s the same old deal, the more you have going on that might be good to blog about, the harder it is to make the time to actually do that writing. Then you feel bad because you’ve fallen so far behind. So you declare blog bankruptcy and just make a quick summary update, like this one. After all, outside of my work hours at the office, I am lucky these days if I can string together two minutes worth of coherently related thoughts. I’m no Steve Yegge, obviously.

Back in February I changed jobs. I had some history with this company, but in its earlier life when it was doing very different things. Since then I had kept up on their doings in the media and while they seemed to be doing really well and would be a great place to work again and much more stable than when they had to cut me loose, I was convinced I no longer had qualifications for the technologies they were working with now. But as 2013 wound down I was beginning to work on a lot of positive changes in my life, and eventually I was encouraged to apply. Any outsider who knew the history involved (or noticed that my LinkedIn profile sports an old recommendation from the company’s CEO) would probably be a lot less surprised that I landed the gig than I was.

Probably one of the first notable things to happen there was that barely two weeks after I started, we were acquired by a legit big corporation. The results have been markedly different from the usual acquisition horror-stories you hear about. The CEO and CTO both stuck around to head up the new division; very few others quit afterwards; the parent company worked hard to integrate us without making us assimilate too hard or too fast, and really acted as if they understood that the way we work and the kind of people we attract was one of the most valuable assets they were getting on the deal. We get to keep innovating just as we were, but with the cred and resources that come along with their name. On our end of the bargain, we have to figure out how to make our products and platform scale from supporting thousands of users to potentially millions. We also are expected to try to scale up the company culture we came in with; I see management’s project at this point as nothing smaller than figuring out how to scale up agility, something many have said can’t be done. It’s all tremendously exciting.

As for how this has affected my then-nascent tech blog, well obviously you can see that it has meant several months without a post. I’ve had to change my personal life and routine a lot, and working blogging back into it is something I’ve reached only now. On the other hand, I have been learning tons, and I have even made honest-to-goodness open source contributions in the context of my job to meet needs we have. Along with that, new personal developments have occupied my time and energy, including my second child, born this June. So I’m learning to scale myself too.

For the future of this page? Assuming I can get back into the swing of things, you can probably expect a lower proportion of Ruby content and a whole lot more Scala. Though truth be told, I’ve still ended up doing a fair bit of Ruby too. We have a very polyglot mentality; I’ve recently discovered a potential good excuse to code some Clojure. Also I’ll probably end up writing a bit about Docker and Mesos and Kafka and a whole lot of other new hotness like that.

Funny anecdote related to Kafka: My co-workers seem to be evenly split over whether it’s pronounced with a short or long a sound in the first syllable. Of course, anyone familiar with the author for whom it is named (one of my favorites) knows which pronunciation is correct. Then Samza started to find its way into our toolchain as well, which is of course named for the main character in one of Franz Kafka’s most well-known works. Now, even more recently, related to Kafka, we’ve started working with Camus, obviously another author name from the same era, and everyone at work keeps pronouncing it “kam-uss”. I’m a bit older than all of them, and one day I had to ask, don’t you guys actually know who Kafka and Camus are? You didn’t read The Metamorphosis and The Stranger in high school like I did? No, they said, they didn’t. In mock exasperation, I replied, “What are they teaching kids now? See, this is what’s wrong with America these days: we’ve taken existentialism out of our schools.”

For all the talk on the interbutts about TDD and related topics, it sure seems like as a working programmer I run into a startling number of projects – a great majority, really – that either have no tests, or have old useless tests that were abandoned long ago; and a startling number of developers who still don’t write any tests at all, let alone practice a TDD style of work. It’s as if as an industry we’re all putting up a big front about how important testing and TDD are to us, but then when the fingers hit the keyboard, it’s all lies. That’s probably not really the case; rather, test-infected developers are a small but vocal minority – developers that test tend to also be the kinds of developers that blog, make podcasts, present at conferences, write books, and so on, but these happen to be only a sadly small percentage of all the developers out there cranking out code. But this minority has been talking about testing for what, a decade now at least? So why hasn’t the portion of developers seriously testing grown faster?

Once you’ve got going with TDD or even just a little automated testing, and have come to rely on it, one of the most frustrating things is to find yourself having to collaborate with others who have not, and have no interest in it. You really don’t want to leave an only partly-tested system; meanwhile these other developers on your project will make changes that break your tests with impunity. The path of least resistance is to fall back in line with the rest of your team and go back to what one of my professors back at UNI referred to as the “compile-crap” cycle – a loop of add or change some code, try to compile it, say “crap” when it fails, repeat – except for interpreted languages, substitute for the compile step running the application and trying to “use” it, so maybe call it the “run-crap” cycle. This friction may well be one of the biggest factors slowing the adoption of TDD; but the fewer developers are testing, the more it will happen, so it’s also an effect. It’s a vicious feedback loop.

Then there’s maintenance, and/or working with “legacy” code, without tests, or with bad tests. Many a project is written with no tests ever – just banging out code in a run-crap loop.

Others start out with tests, but somewhere during the development process something changes and the team reverts back to run-crap. Why do they do this? It may be that members of the development team have been swapped out for some, shall we say, ahem, “cheaper” ones; this might happen when the product is launched and comes to be seen as in “maintenance” phase, but it also happens earlier on. Or it may be that the developers reverted to comfortable old habits in the face of schedule pressure from management – after all TDD can be slower in the short-term, especially when you’re new at it, and it’s easy to lose focus on careful discipline in favor of short-term speed (or at least the appearance thereof) when the management is breathing down your neck or freaking out at you.

In any case, the eventual result is either no tests, or tests that are no help because most of them are failing because they express requirements that have since changed – which might be even worse than no tests at all; it can look like the best way to deal with it is to just nuke the whole suite.

But then what? Touching on how TDD informs design, it’s well established that code written without TDD is likely to contain design that is much harder to write tests for, with lots more coupling and dependency snarls. As requests for bug fixes and new features come in for such a system, how do you work on it in a test-driven manner? Stopping the world long enough to retrofit a complete suite of difficult-to-write tests isn’t feasible and chances are there’s no documentation you can consult when you hit all those ambiguities in what some code should be doing, so you’re likely not to even know what exactly to test for – the definition of “legacy code” as being that for which requirements have been lost. Practicing TDD on greenfield projects is relatively obvious; but the vast majority of development time is spent in maintenance, and legacy/maintenance is “advanced” TDD. I’m probably not telling you anything you don’t already know. Michael Feathers’ book Working Effectively With Legacy Code is the authoritative source on the subject, but if it’s not feasible to halt work long enough to Test All The Things, then is it feasible to halt work long enough to read a book, especially if you’re a painfully slow ADHD-stricken reader like myself? Yet again, it’s much easier to go back to the good old irritating-but-familiar run-crap loop.

It’s clear that as an industry we only stand to benefit by spreading the good word of TDD far and wide. The more it’s being done, the better. But the factors I’ve just outlined present very real obstacles to its adoption. It’s a long-term project of raising awareness and educating the developer public. Meanwhile, what can you as an individual developer do? For starters, if you really want to do TDD but are stuck in a job where everyone’s oblivious to the concept, it’s probably not worth your time trying to force that kind of sea change on your own. You’re swimming against a torrent. My advice? Find a company that’s as serious about it as you are, and go work there instead.

I myself don’t even consider my work to be test-driven. I’m a believer in TDD, and I make the best, sincerest attempts at it I can relative to the time and energy constraints within which I am working. I certainly don’t consider myself an enlightened TDD guru. I even come out and say just that right in the introduction to my résumé. What’s that, you’re supposed to talk yourself up in a résumé and make yourself sound like the answer to all a company’s prayers so that you get the job? I don’t believe in that. I’m hoping to score a gig working with test-driven developers but I don’t want to be expected to be perfect at it from day one if such a company hires me; I want such a job because I know I have a lot to learn and am looking for advantageous situations in which to learn. It pains me that such honesty should seem radical, but in my experience, the pains that come from getting oneself into the wrong situations are worse.

Developers can also tend to be a prickly lot with a healthy distrust of dogma. And sometimes the practices of what I might call “strong” or “pure” TDD can feel like a dogma, especially when delivered in a kind of hellfire-and-brimstone way a la your average Bob Martin conference talk. I don’t care for the idea that you cannot be considered a professional developer if you don’t practice TDD (and by whose standard/definition of TDD anyway?).

As I have begun to view it, TDD isn’t something you just start doing and are able to do all of it flawlessly from the get-go. Among the many concepts and tools you’ll need in order to be able to completely test-drive all parts of a system, there are things like the delicate art of mocking, how to fake a logged-in user, how to make a unit test truly isolated, how to mock a collaborator without making the test useless, what different kinds of tests there are, and a lot of subjective experience-based intuition about what tools and techniques are best suited for what kinds of tests and situations. It can all feel really daunting.

Especially in the context of web applications, and then especially when you’re working with a framework such as Rails, there’s a big learning curve, one that I think would be better viewed as a long process of continual improvement. There will be difficulties along the way, but in the meantime you still have to get work done and people are still paying you. To say you can’t call yourself a professional until you’ve already mastered every aspect of TDD feels, frankly, insultingly elitist. You have to crawl before you can walk before you can run before you can fly. Doing some testing still beats the pants off not doing any. I don’t think agile development processes were ever meant to be dogmatic. The processes should be flexible, adaptable, pragmatic – just like the code you hope to write when you use TDD to guide the design.

The problem so far is that too seldom is TDD presented in this way. Instead it’s usually framed as, you’re either TDD or you’re not. (And by the way what constitutes TDD is a constantly moving target.) That way of looking at TDD isn’t going to help you or anybody else adopt it. All it does is feed into your impostor syndrome.

I think it’s worth reminding oneself that guys like Corey Haines took years to get that good at a totally test-driven style. I mean, just watch that video. He’s test-driving every little piece of a Rails application totally outside-in, that is, starting with the “outermost” layers – what the user sees, the GUI, the views – and working inward towards the hard candy database center. There are so many points where he shows techniques for isolating the piece he’s working on, hacks to circumvent the coupling inherent in Rails’s architecture in order to get Rails to let him keep working at an upper level of the application instead of bombing out with an error about some lower-level piece not existing yet. These are techniques I just don’t think I would be able to absorb by rote, techniques he seems to have arrived at on his own through leaps of intuition and experience that I don’t see myself being able to duplicate. It’s quite beautiful, but even though I know he wants to sell these videos, I concluded that this wasn’t going to work for me. We all gotta find our own way, I guess.

That kind of outside-in TDD approach is very much in-vogue right now, though. And another thing that’s very in-vogue at the moment, and a very useful guiding concept, is the Rails Testing Pyramid. The tl;dr of it is that your unit tests are the most important, and should be the type of test you have the most of; and as you look up the Rails stack each kind of test is slower and more integrated and rests on the foundation of those below it.

The mosaic of types of tests you might use in a Rails application is larger than that article presents, and I think several of them can be grouped together in the “service tests” category, but you can see approximately where they would live in the pyramid relative to each other – in order starting from the bottom: unit tests, model tests, controller tests, request tests, helper tests, view tests, client-side/javascript tests (which might be a whole other pyramid actually), and finally acceptance tests/features. As you go up the pyramid in this way, you also find that the tools and techniques become more advanced in skill, or at least are usually assumed to be and presented as such. Testing literature usually begins with unit tests, and Rails-oriented testing literature usually begins with what the Rails community has traditionally called “unit tests”: tests at the model layer, which might be integrated with related models and tied to the database, or might be totally isolated from both, depending on how well you’ve gotten the hang of the higher-level skills of mocking and isolating from the database.

But here’s what I realized a while ago: when you put the outside-in approach together with the Rails Testing Pyramid, the implication is that you are building a pyramid top-first.

Does that even make sense? I mean, I realize we’re talking about software here, not big blocks of stone. It’s a metaphor, but I think there’s useful insight to be gotten from metaphors. The Agile and XP literature says so too.

You’ve got a pyramid of your own to build: your repertoire of testing skills. And if building a pyramid top-first seems counterintuitive, building all of it at once certainly should.

All your favorite TDD gurus had to have started somewhere – probably with a few simple unit or model tests just like most of us probably did. If you get too attached to an ideal of TDD enlightenment, it can be discouraging. Better to keep TDD in mind as a guiding principle, an ideal, then just start testing. As you progress, keep a sharp eye on ways to get more test-driven – places where more testing, new kinds of tests, new techniques and tools, can help you be more confident in your code with more ease. Tackle learning those as you feel yourself become ready for them.

I recently had this idea for a presentation that would bring together concepts from Testivus with a sprinkling of Buddhist philosophy. The saying “if you meet the Buddha on the road, kill him” seemed apt, but I wouldn’t want to be misinterpreted as advocating anyone’s murder.

I think it can be pretty easy to sell developers on some kind of automated testing. There’s a big win right away in that you can spend more time writing useful code and less time filling out the same web form over and over like a trained monkey. That’s already going to make you more productive and your day more enjoyable. Traditionally the introduction to testing has been at the unit test level, but I almost wonder whether it would be better, now that there are good tools for it, to start from full-stack acceptance tests right away and go as far with that as you can. You may end up with slow, very coarse-grained tests this way (and it’s for this reason that so many testing advocates will tell you it’s wrong), but at least they will exercise most of the system and you will catch defects and regressions you were likely to miss otherwise. Of course any developer/team working in this way will end up experiencing some pain when the test uncovers a bug but can’t pinpoint where in the system it is originating; but that’s a good pain point to have if it can be turned into a motivation to dig into those deeper levels of testing.

Convincing developers to test shouldn’t be as hard as it looks like it’s been made. It’s time to simplify the pitch: Testing is a path to reduce suffering. You will be learning it forever.

Occasionally I surprise myself and end up feeling a desire to write about it and toot my own horn a little bit. What better place to do that than on a professional blog at least part of the purpose of which is to show prospective employers or clients that I’m good at stuff?

I’m pretty good, I guess

note: personal background jabber, skip this section at will

I’m largely self-taught in the area of databases and SQL. The only course I ever took on the subject was a quarter-length database class, circa 1999, at Hamilton College (since bought up by Kaplan, I think) as part of their two-year IT degree program. It used Microsoft Access and was very beginner-level and I think I might have been out sick on joins day. Later when pursuing my Computer Science degree I avoided the databases course out of dislike for the professor who taught it; the alternative course to meet the same requirement had more to do with text indexing, information theory – search-engine kind of stuff – and oddly enough, the course taught and used an open-source multi-dimensional hierarchical database and MUMPS compiler developed by the course’s professor (multi-dimensional databases are quite good at storing and comparing things like vectors of the occurrences of hundreds of different words in a bunch of textual articles). So, yes, I learned MUMPS in college instead of SQL. Actually, you can download and make-install the C++ code for the MUMPS compiler we used yourself, which compiles MUMPS into C++, if you ever get a wild urge to do such a thing. In fact, I’d recommend it to my fellow programming language nerds, especially those interested in old, obscure, or just plain weird languages. At the very least you’ll have a little fun with it; and I believe MUMPS is even still in use in some corners of the health care industry, so you’d be picking up a skill that’s in some demand yet increasingly difficult to hire for. (While you’re at it, check out Dr. O’Kane’s MUMPS book and his rollicking, action-packed novel.)

At my first real programming job, I started out coding in Actionscript 2.0 but when a particular developer left the company, someone was needed to take over server-side development in PHP, so I took it upon myself to learn PHP, and, as it turned out, also ended up needing to learn SQL and relational databases. I read a PHP book or two and a whole lot of blogs, but mostly just dove right in to the existing code and gradually made sense out of it. Eventually I was working back and forth between Actionscript and PHP pretty regularly. That kind of pick-it-up-as-needed approach is pretty much how I roll, though it’s hard to explain this kind of adaptability to recruiters who are looking to basically keyword-match your experience against a job description, which can be a real drag if you’re the type of person who craves new experiences. When at UNI I had been the kind of student that made a point of taking the more theoretical computer-sciencey courses, on the rationale that things like programming languages are certain to change in the future, but they will most likely continue to build on the same underlying theory dating at least as far back as good ol’ Alan Turing. I would say that approach has paid off well for me in the years since. My first boss described me in a LinkedIn endorsement as being capable of working in multiple programming languages simultaneously, “something which drives most of us insane.”

But I digress (often). Like I said starting out this post, sometimes I still surprise myself. When I pull off something new or just more complex than I’m used to, it feels good, and I like to share it, not just to strut about, but also because I am sure others are out there trying to solve similar problems, and also to give credit to others whose work I drew on to arrive at my solution. And like I said, my SQL skills are largely the product of a few old blog posts and experience so I was pretty stoked at what I pulled off this week.

The assignment

I was given the task of populating a “related articles” part of a page on a news website. Naturally the first thing I thought we needed to hash out was how the system should conclude that two articles are related. After some discussion we arrived at this idea: we would score two articles’ relatedness based on:

  • The number of keyword tags they have in common (this was the same site using acts_as_taggable_on from which I drew this recent post)
  • The number of retailers they have in common (Article HABTM Retailer)
  • How close or far apart their published_at timestamps are (in months)

How this turns out to be slightly difficult

This sounds perfectly reasonable, even like it would be pretty easy to express in an OO/procedural kind of way in Ruby or any other mainstream programming language. But once this site gets a long history of articles, it’s likely that looping or #map-ing through all of them to work this out is going to get way too time- and memory-intensive to keep the site running responsively.
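Just to illustrate, the straight-line Ruby version might look something like this – a sketch, with the weighting and method names assumed, that compares the current article against every other article on every call:

# a naive in-Ruby sketch of the scoring; fine for a handful of articles,
# painful once the site has a long history
def relatedness(a, b)
  shared_tags      = (a.tag_list & b.tag_list).size
  shared_retailers = (a.retailer_ids & b.retailer_ids).size
  months_apart     = ((a.published_at.year * 12 + a.published_at.month) -
                      (b.published_at.year * 12 + b.published_at.month)).abs
  shared_tags + shared_retailers - months_apart
end

def related_articles(article)
  (Article.all - [article]).sort_by { |other| -relatedness(article, other) }
end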

Another alternative is to store relatedness scores in a database table and update them only when they need to change; we could hook in to Rails’s lifecycle callbacks like after_save so that when an article is created or saved, we insert or update a record for its relatedness to every other article. That still sounds intensive but we could at least kick off a background worker to handle it. However, I got the feeling that there was potential for errors caused by overlooking some event that would warrant recalculating this table, or missing some pairs.
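For what it’s worth, that approach would have looked roughly like this (a sketch – RelatednessWorker is a hypothetical Sidekiq-style worker that would recompute this article’s scores against every other article):

class Article < ActiveRecord::Base
  after_save :schedule_relatedness_update

  private

  # push the heavy recalculation off to a background job
  def schedule_relatedness_update
    RelatednessWorker.perform_async(id)
  end
end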

And there was still another wrinkle to work out: the relatedness scores pertain to pairs of articles, and those pairs should be considered un-ordered: the concept of article A’s relatedness to article B is identical to B’s relatedness to A. I don’t know if any databases have an unordered tuple data type and even if they did whether ActiveRecord would know how to use it. It seems wasteful and error-prone to maintain redundant records so as to have the pairings both ways around. Googling about for good ways to represent a symmetrical matrix in a SQL database didn’t bear much fruit. So it would probably be best to enforce an ordering (“always put the article with the lower ID first” seems reasonable). But then this means to look up related articles, we need to find the current article’s ID in one of two association columns, rather than just one, and then use the other column to find the related article. I’m pretty sure ActiveRecord doesn’t have a way to express this kind of thing as an association. Which is too bad, because ideally, if possible, we’d like to get the relatedness scores and related articles in the form of a Relation so that we can chain other operations like #limit or #order onto it. (Possibly we could write it as a scope with a lambda and give the model a method that passes self.id to that, but I’m still not sure we would get a Relation rather than an Array. The point at which ActiveRecord’s magic decides to convert from one to the other is something I find myself constantly guessing on, guessing wrong, and getting confused and annoyed trying to come up with a workaround.) But so it goes.
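For the record, the scope idea might look like this (a sketch against the relatedness table we’re contemplating; in Rails 3 this does stay a chainable Relation of score rows, though mapping them to the “other” article of each pair would still leave you with a plain Array):

class ArticleRelation < ActiveRecord::Base
  # match pairs where the given article appears on either side
  scope :involving, lambda { |article_id|
    where('first_article_id = ? or second_article_id = ?',
          article_id, article_id)
  }
end

ArticleRelation.involving(23).order('score desc').limit(10) # still a Relation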

Any way we look at this, it looks like we’re going to be stuck writing some pretty serious SQL “by hand”.

I’m not going to show my whole solution here, but you probably don’t need all of it anyway. I think the most useful bit of it to share is the shared-tags calculation.

Counting shared tags in SQL

acts_as_taggable_on has some methods for matching any (or all) of the tags on a list, and versions of this that are aware of tag contexts (the gem supports giving things different kinds/contexts of tags, which I’m not going into here but it’s a cool feature). So obviously you can call #tagged_with using an Article’s tag list to get Articles that share tags with it, but the documentation doesn’t mention anything about ordering the results according to how many tags are matched, or even finding out that number. Well, here’s the SQL query I arrived at that uses acts_as_taggable_on’s taggings table to build a list of article pairs and counts of their shared tags. One nifty thing about it is that it involves joining a table to itself. To do this, you have to alias the tables so that you can specify which side of the join you mean when specifying columns, otherwise you’ll either get an ambiguous column name error or you’ll just get confused. You’ll see I’ve also added a condition in the join that the “first” id be lower than the “second,” forcing an ordering to the ID pairs so as to eliminate duplicate/reversed-order rows and also eliminate comparing any article with itself, since we don’t care to consider an article related to itself. (Also, the way this is written Article pairings with no shared tags won’t be returned at all. Maybe try a left join if you want that.)

select
  first.taggable_id as first_article_id,
  second.taggable_id as second_article_id,
  count(first.tag_id) as shared_tags
from taggings as first
join taggings as second
on
  first.tag_id = second.tag_id and
  first.taggable_type = second.taggable_type and
  first.taggable_id < second.taggable_id
where first.taggable_type = 'Article'
group by first_article_id, second_article_id

Add and (first.taggable_id = 23 or second.taggable_id = 23) to the where clause here and you’ll get just the rows pertaining to article 23 – note the parentheses, since and binds tighter than or, and note that MySQL won’t let you use the select aliases in the where clause, so you have to spell out the real column names there. Add an order by shared_tags desc and the rows will come back with the highest shared-tag counts, the “most related,” at the top. If you’re looking to know the number of shared acts_as_taggable_on tags among your articles or whatever other model you have, here you are.
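Spelled out, that variant looks like this:

select
  first.taggable_id as first_article_id,
  second.taggable_id as second_article_id,
  count(first.tag_id) as shared_tags
from taggings as first
join taggings as second
on
  first.tag_id = second.tag_id and
  first.taggable_type = second.taggable_type and
  first.taggable_id < second.taggable_id
where first.taggable_type = 'Article'
  and (first.taggable_id = 23 or second.taggable_id = 23)
group by first_article_id, second_article_id
order by shared_tags desc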

Building a leaning tower of SQL

So, for the other two relatedness factors, I did a similar query to this against the articles_retailers table to count shared retailers, and another on articles to compute the number of months apart that pairs of articles were published to the site. Each query used the same “first id less than second id” constraint. Then I pulled the three queries together as subqueries of one larger query, joining them by first_article_id and second_article_id, and added a calculated column whose value was the shared tags count plus the shared retailers count minus the months-apart count, calling this their score – a heuristic, arbitrary measure of “how related” each pairing of articles is. (The coalesce function came in mighty handy here. Despite its esoteric-sounding name, all it does is exchange a null value for something else you specify, like you might do with || in Ruby – so coalesce(shared_tags, 0) returns 0 if shared_tags is null, or otherwise returns whatever shared_tags is, for example.)

As you are probably picturing in your head, the resulting master relatedness-score query is huge. It took me a good couple hours at a MySQL command-line prompt composing the subqueries and overall query a little bit at a time. It felt awesome. But still: the result was one seriously big glob of SQL. (Incidentally iTerm2 acted up in a really weird way when I tried pasting these large blocks of code into it, but not when I was SSHed into a remote server; if this rings a bell to you, drop me a line.) I’m going to spare you the eye-bleeding caused by seeing the whole thing. You’re going to drop that big nasty thing in the middle of some ActiveRecord model? Yikes!

Views to the rescue

In a forum thread where I was looking for help on the implementation of all this, Frank Rietta suggested I consider using a database view. To be perfectly honest, I hadn’t used a view in years, if ever. I didn’t even think MySQL had them (yes, I’m using MySQL, don’t judge) – maybe some older version I used in the past didn’t and they’ve been added since? At first I wasn’t sure how this could help me, but then Frank wrote this excellent blog post on the subject. I read it, and the more I thought about it, the better the idea sounded.

Basically, a view acts like a regular database table, at least when it comes to querying it with a select. But underneath it’s based on some query you come up with of other tables and views. You can’t write to it, but it provides you with a different “view” of your data by what I would describe as “abstracting a query.” And because the view can be read from like any other table, it can also act as the table behind an ActiveRecord model (at least, until you try to #save to it). Go read Frank’s post so I don’t have to recap it here. You’ll be glad you did.

The great advantage of using a view to hold the relatedness scoring is that I don’t have to think about writing Ruby code to maintain the table of relatedness scores, I don’t have to think about background jobs or hooking into ActiveRecord lifecycle callbacks to maintain the data or any of that – the database itself keeps this “table” updated. Any time the tables it depends on change, it changes right along with them automatically. Plus it gets the big hairy SQL query out of my Ruby code where it won’t distract or confuse anyone; and it handles the issue of making sure first_article_id is always lower than second_article_id because that’s expressed right in the query it’s based on.

So that settles it, I create a view out of my big relatedness-scoring query and an ActiveRecord model over top of it! Only one problem, and it turned out to be pretty minor, but as I mentioned, my big relatedness query involved a join over three subqueries. Turns out that in MySQL, a view’s defining query can’t contain subqueries in its from clause. Perhaps it can in other database engines, I would not be surprised, but not in MySQL. The workaround for this is to create views for the subqueries and query those views. Honestly that probably makes the SQL read more easily anyway. On the other hand, I ended up creating four views. That was definitely the longest Rails migration I have ever written, by far.
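To give you the flavor without the full eye-bleed, a reconstruction of that migration might look like this (a sketch, not my actual migration – the articles_retailers and published_at names are assumed, and the score is the simple sum-minus-months described above):

class CreateArticleRelationViews < ActiveRecord::Migration
  def up
    execute <<-SQL
      create view article_shared_tags as
      select first.taggable_id as first_article_id,
             second.taggable_id as second_article_id,
             count(first.tag_id) as shared_tags
      from taggings as first
      join taggings as second
        on first.tag_id = second.tag_id
        and first.taggable_type = second.taggable_type
        and first.taggable_id < second.taggable_id
      where first.taggable_type = 'Article'
      group by first.taggable_id, second.taggable_id
    SQL

    execute <<-SQL
      create view article_shared_retailers as
      select first.article_id as first_article_id,
             second.article_id as second_article_id,
             count(first.retailer_id) as shared_retailers
      from articles_retailers as first
      join articles_retailers as second
        on first.retailer_id = second.retailer_id
        and first.article_id < second.article_id
      group by first.article_id, second.article_id
    SQL

    execute <<-SQL
      create view article_months_apart as
      select a1.id as first_article_id,
             a2.id as second_article_id,
             abs(timestampdiff(month, a1.published_at, a2.published_at)) as months_apart
      from articles as a1
      join articles as a2 on a1.id < a2.id
    SQL

    # every pair of articles appears in article_months_apart, so left-join
    # the other two and coalesce their missing counts to zero
    execute <<-SQL
      create view article_relations as
      select m.first_article_id,
             m.second_article_id,
             coalesce(t.shared_tags, 0)
               + coalesce(r.shared_retailers, 0)
               - m.months_apart as score
      from article_months_apart as m
      left join article_shared_tags as t
        on t.first_article_id = m.first_article_id
        and t.second_article_id = m.second_article_id
      left join article_shared_retailers as r
        on r.first_article_id = m.first_article_id
        and r.second_article_id = m.second_article_id
    SQL
  end

  def down
    %w(article_relations article_months_apart
       article_shared_retailers article_shared_tags).each do |view|
      execute "drop view #{view}"
    end
  end
end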

The models and other miscellaneous thoughts

So, now that I have a table called article_relations that contains pairs of Article IDs and their relatedness scores, I can give it a model like this:

class ArticleRelation < ActiveRecord::Base
  belongs_to :first_article,  class_name: 'Article'
  belongs_to :second_article, class_name: 'Article'

  def other_article(source)
    [first_article, second_article].find{|a| a != source}
  end

  def readonly?
    true
  end
end

And give the Article model a couple methods like this:

  def article_relations
    ArticleRelation.where(
      'first_article_id = ? or second_article_id = ?', id, id).order('score desc')
  end

  def related_articles
    article_relations.map{|r| r.other_article(self)}
  end

Or something to this effect. You’ll likely want to have your view only contain records where the score is above 0, for instance, or give the above methods an optional parameter to use in a limit so you can limit the number of related articles you show.

Which reminds me, speaking of #limit… as I alluded to before, it would be great if I could do things like @article.related_articles.limit(10) here but I can’t. This bugs me a little bit, because it means that some of my queries to the Article class are going to call #limit and others will have to pass the limit as a parameter, or slice the array like [0..9] or something, so I have code where doing the “same” thing reads completely differently. (I am also unfortunate enough to still be working with Rails 2 regularly, where limit goes in an options hash. It appears if you try that syntax in Rails 3, it just ignores it.) There are other gems like punching_bag where this itches at me a little as well (not to mention, I’d like to be able to give my model a method or scope with a name more appropriate to my domain such as popular or hot and have that delegate to most_hit). I think this might just be a product of the usual leakiness of ORM abstractions and I’ll just have to get over it.

One caveat that should be pointed out is that Rails’s generation of schema.rb doesn’t handle views “properly” and probably can’t be made to, when you think about it, depending on what you think the proper thing for it to do would be. Rails will dump the structure of your views out as regular tables, so if you use rake db:schema:load you’ll get tables rather than views with all their cool magic. At this point it’s probably a good idea to uncomment that config.active_record.schema_format = :sql line in your application.rb configuration file, which will make rake db:migrate spit out a structure.sql file instead of schema.rb, and get rid of schema.rb altogether.

Another thing worth considering, depending on the complexity of your view(s), is whether to make them materialized views. This is a view that’s backed by a physical table, refreshed as needed rather than computed on every read. It’s more efficient to query, but the refresh has a cost, so the effects of a change to one of the tables it depends on might not be reflected right away; that may be a worthwhile trade-off to make. MySQL doesn’t support materialized views natively (you’d have to emulate one with a real table and some refresh code), but PostgreSQL, for one, has them built in.
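In PostgreSQL the commands look like this, for example – here just materializing the relatedness view from earlier under an assumed name:

-- materialize the relatedness query into a physical table...
create materialized view article_relations_mat as
  select first_article_id, second_article_id, score
  from article_relations;

-- ...then re-run the underlying query whenever freshness matters
refresh materialized view article_relations_mat;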

Join me next time when I talk about technical debt or something like that.

I wanted a really fly keyword tagging input in my app that let me do what I’m already pretty used to doing with WordPress’s tagging: auto-complete existing tags to help me maintain consistency, but also let me make up new tags on the spot.

Select2 is nice as heck, and has a tagging functionality that does just what I’m looking for and is even prettier than what WordPress has. The section on “Tagging Support” on the website looked like pretty much exactly what I wanted, but there were a few things to iron out: Firstly, I didn’t want to have to stick all the existing tags in the javascript or in the view. Yeah it’s cool that the asset pipeline lets us do .js.erb but it just feels wrong; and that list of all the existing tags could get pretty big, so jamming it all into an HTML attribute feels even more wrong. What I wanted was that AJAXy searching autocomplete where you start typing and it fetches a list from the server and that list narrows down as you type more letters. And on top of it all, I was doing this in Active Admin in a Rails 3 app.

"Select2 docs screenshot

select2-rails takes care of a good bit of that last bit, though it doesn’t do too much more than package it up for the asset pipeline. I had to wrestle with it quite a bit more, hacking bits and pieces together from different documentation, blogs, and StackOverflow threads, before everything would behave like I wanted, even though I didn’t think what I was after was particularly exotic. So naturally the right thing to do once I got it all working was to write it up here. I think even if you’re not using Active Admin, a lot of this will still help without too much adjustment, especially if you’re using Formtastic.

First off you’ll want gem 'select2-rails' and gem 'acts-as-taggable-on' in your Gemfile and bundle install’d. Then pull the select2 javascript into your app by putting //= require select2 in your active_admin.js – or application.js if you want to also have it available in non-admin parts of your app – and that same line in active_admin.css.scss. If some stuff still looks visually out of whack later on, try adding this at the end of active_admin.css.scss:

body.active_admin {
  @import 'select2';
}

So now we get into how to put this in your Active Admin form. We’ll make it an input for acts_as_taggable_on’s tag_list accessor because it does such a nice job of Doing What You Mean with very little fuss. Here’s a somewhat redacted excerpt from my app/admin/articles.rb:

form do |f|
  f.inputs do
    f.input :title
    f.input :content, as: :rich
    f.input :tag_list,
      label: "Tags",
      input_html: {
        data: {
          placeholder: "Enter tags",
          saved: f.object.tags.map{|t| {id: t.name, name: t.name}}.to_json,
          url: autocomplete_tags_path },
        class: 'tagselect'
      }
  end
  f.buttons
end

As you can see, there’s quite a few attributes being given to the input’s HTML element, which Select2 will then hide and manipulate behind the scenes while presenting us the very cool tagging widget we love. The class could be whatever we want, but it’s what we’ll be using to find this element in the javascript we’ll get to momentarily.

The data hash gets placed on the input as data attributes. This is data we want to make available to said javascript. saved is for the article’s current tags, so that the widget can render those right away. Select2 expects to work with a JSON array of objects, but you’re probably wondering why I’m passing both an id and a name but setting both values to the tag’s name.

The thing is, since we’re using the tag_list accessor, we don’t really care about the tags’ IDs. I think that’s fine; after all, conceptually, a tag’s name is its identifying attribute. It would be a perfectly reasonable design for the tags database table to not have an id column at all and have name be the primary key – that would match our mental model of tags – but this is Rails where everything has to have an id. More to the point, Select2 won’t render the tags right, or at all, if they don’t have an id attribute with something in it. But when I used the tags’ actual IDs there, the IDs were ending up among the array of tag names in the params coming in to the Rails app, causing me to end up with extraneous tags getting created whose names were those IDs, and that was awful. There might be other ways around this.

The url data attribute is there to tell Select2 where to find the remote service to look up tags in for the auto-complete. It’s up to you whether you want to set this up in another controller, what you want to name it, and so on. In my case, just keeping it simple, I added it to Active Admin’s controller in my app/admin/articles.rb, like so:

controller do
  def autocomplete_tags
    @tags = ActsAsTaggableOn::Tag.
      where("name LIKE ?", "#{params[:q]}%").
      order(:name)
    respond_to do |format|
      format.json { render json: @tags, only: [:id, :name] }
    end
  end
end

and correspondingly, in config/routes.rb:

get '/admin/autocomplete_tags',
  to: 'admin/articles#autocomplete_tags',
  as: 'autocomplete_tags'

Fairly straightforward what’s going on here, we’ll be having Select2 pass in what we’ve typed so far in the q param and using a SQL LIKE query to give back tags to offer in the little auto-complete list.

And now, the javascript to fire up Select2’s tag input magic. Right now I just have this tacked on the end of active_admin.js but it’s a significant enough piece of code that I’d feel justified putting it in a separate file and //= require-ing it.

$(document).ready(function() {
    $('.tagselect').each(function() {
        var placeholder = $(this).data('placeholder');
        var url = $(this).data('url');
        var saved = $(this).data('saved');
        $(this).select2({
            tags: true,
            placeholder: placeholder,
            minimumInputLength: 1,
            initSelection : function(element, callback){
                saved && callback(saved);
            },
            ajax: {
                url: url,
                dataType: 'json',
                data:    function(term) { return { q: term }; },
                results: function(data) { return { results: data }; }
            },
            createSearchChoice: function(term, data) {
                if ($(data).filter(function() {
                    return this.name.localeCompare(term)===0;
                }).length===0) {
                    return { id: term, name: term };
                }
            },
            formatResult:    function(item, page){ return item.name; },
            formatSelection: function(item, page){ return item.name; }
        });
    });
});

So at the top you can see I start with a jQuery selector of that “tagselect” class I put on in the input_html option, then grab the values off those data attributes, then call select2 on the element with a whole mess of the options it accepts. The most interesting bits:

  • tags: true is the simplest way to tell Select2 this is a tagging input without having to tell it what tags to autocomplete for up front.
  • minimumInputLength is how many letters we want the user to type before we start trying to suggest completions.
  • initSelection is used to set up the tagging input at the start, to get it to display what we brought in the saved data attribute.
  • ajax sets up the call to our autocomplete_tags action described before.
  • createSearchChoice is where we tell Select2 how to put the results of that call in the autocomplete list. The snarly-looking conditional here is just to filter out duplicates of tags we’ve already got picked out. As long as it’s not a duplicate, we whip up another id/name object just like we did when we set up the saved data attribute.
  • formatResult and formatSelection look for a text attribute if you don’t tell them otherwise so I’m telling them to use name.

And that’s pretty much all it takes. I had to complicate it up pretty heavily in order to see how to get it this simple, now you don’t have to. Have fun!

update 6 September 2014: Samo Zeleznik writes in:

When I create a new post with a tag that is the same as a tag that was already created prior to that, it does not save it by its name, but by its id. So it creates a new entry in the tags table that has a unique id, but the name of that tag is the id of the real tag.

What I just wrote is probably a little bit confusing so let me explain it with an example: I have a post tagged with “math” and this tag has an ID of 5. Now I create a new post and I also tag it with “math”. Now when I save this post it will be tagged with 5. So it creates a new tag with a unique id (6 for example) and names it 5 (the id of math). Do you have any idea what could be causing this issue?

Around the same time, David Sigley tweets me with what appears to be the same issue.

As it’s been quite a while, all I could offer was that I sorta remembered having trouble with tags getting named their IDs instead of their names before, and there was some hack I had to do that I may not have done enough to point out and explain above. Later Samo sent me this StackOverflow question where he got it worked out, and the solution comports with the Ruby code above that looks like this: f.object.tags.map{|t| {id: t.name, name: t.name}}.to_json. Note how the hash/JSON has an id key and a name key, but the value at both is the tag’s name. Later the Javascript does something similar: return { id: term, name: term }; Then David figured it out too. I don’t have a really clear idea of why it has to be this way, it’s a hack, but there you have it.