Mixing Code Styles with org-babel

I've really started making good use of org-babel recently as part of my job, and org-babel-tangle has been particularly invaluable. I never honestly expected to get any use out of org-babel outside of taking lecture notes or working through textbooks. Don't get me wrong, it's a very useful tool - and I want to explain how I'm using it now that I've found a really solid use-case for it.

If you're unfamiliar, org-babel is a literate programming extension for org-mode. On a surface level, it works similarly to a Python notebook: you specify which language (and therefore which major mode) each code block uses, and you can execute code directly from within your org-mode file. It also allows you to rip code out of code-blocks and place it into another file, which is called tangling. So, for example, I can have two code-blocks at different points in my org file, both with a :tangle file.cpp header, and have them both be inserted (in order) into file.cpp. This feature is often used for literate configs within Emacs - where one org file contains all the configuration code, with suitable annotations.
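
As a minimal sketch of that tangling behaviour (the headings and the C++ contents here are made up purely for illustration), consider an org file like this:

* Includes
#+begin_src cpp :tangle file.cpp
  #include <iostream>
#+end_src

Any prose written here stays in the org file.

* Entry point
#+begin_src cpp :tangle file.cpp
  int main() {
      std::cout << "hello\n";
      return 0;
  }
#+end_src

Running org-babel-tangle (C-c C-v t) on it should write both blocks to file.cpp in the order they appear, leaving the prose between them behind.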

I don't use a literate config (currently). Actually, when it comes to Lisp I'm generally against being too literate in your code, either in code comments or literal literate programming. There's a quote out of Riastradh's Lisp Style Rules that I generally adhere to:

Write comments only where the code is incapable of explaining itself. Prefer self-explanatory code over explanatory comments. Avoid `literate programming' like the plague.

Rationale: If the code is often incapable of explaining itself, then perhaps it should be written in a more expressive language. This may mean using a different programming language altogether, or, since we are talking about Lisp, it may mean simply building a combinator language or a macro language for the purpose. `Literate programming' is the logical conclusion of languages incapable of explaining themselves; it is a direct concession of the inexpressiveness of the computer language implementing the program, to the extent that the only way a human can understand the program is by having it rewritten in a human language.

In Lisp, via macros, we usually have the luxury of completely changing the "language" (not literally) of the code we write, which means that code can, in my opinion, actually be self-documenting. Whenever I feel the impulse to annotate something with a comment, I try to take time to reflect on whether that line could instead be written in a better way that requires less external explanation. There are exceptions, like marking where a section of some logic begins and ends, but that's my general philosophy. It's also why I like Racket so much - being able to switch as required between effectively any syntax you want for a particular problem is very useful for maintaining readability in my opinion. Or just because I want to write something using APL syntax.

However, at work, I don't have the luxury of writing in Lisp. Currently I'm doing data engineering with dbt, and I'm mostly writing in SQL (with Jinja macros sprinkled on top). There's also been a lot of hubbub around code style recently, for a number of reasons I won't get into, and now we are recommended to use sqlfluff to fix up formatting on SQL files. So, my two issues here (plus a bonus one) are:

  1. SQL can be very opaque without sufficient commentary (something that is very true for a lot of older dbt models in this codebase).
  2. I do not like the standard sqlfluff formatting rules. This is really just a personal gripe, but having functions and keywords be CAPITALISED has always seemed a bit pointless to me, as though it's a relic from a time before syntax highlighting.
  3. Bonus: dbt tests are often undocumented and (this is a problem with a lot of test suites, not just dbt) are abstracted away from the context of the code they're testing.

org-babel helps me fix all of these issues in one fell swoop. Better yet, it lets me do so without ever bothering other people working in the repository with org-mode or Emacs.

Let's tackle the first issue: opaqueness and literate programming. There is more to this than just explaining that "Oh, line X does Y and Z". The business context of a particular decision can also be very helpful to keep track of, and so is the story or task in Jira that a model or code block is a part of. While the former of these two could be done with just code comments, the latter benefits greatly from org's structural editing. For example, if a bunch of user stories are related and should be (per the system design) in one folder in the tree, we can structure the file like so (DE- is the user story prefix in Jira):

* DE-1: Building some pipeline
** Model 1
** Model 2

Even better, some models are made up of a massive chain of SQL CTEs. We can add those to our structure too:

* DE-1: Building some pipeline
** Model 1
*** CTE 1
*** CTE 2

For each model, we can specify the file we want its code to tangle to by setting header arguments as properties on the model's heading. Every code-block beneath that heading inherits them.

* DE-1: Building some pipeline
** Model 1
:PROPERTIES:
:header-args:sql: :tangle model_1.sql :comments no :padline no
:header-args:yaml: :tangle _model_1.yml :comments no :padline no
:END:

*** CTE 1
*** CTE 2

This way, every SQL code-block within Model 1 automatically gets tangled into the right file, and the YAML (which is used for test specification in dbt) goes to its own file. I can write as much as I want around a code block, and it will also be ignored during the tangle. But, if I want a specific piece of text to be inserted as a comment in the output file (which I might want to do, since the output is going into version control and no one but me will ever see the 'master file'), I can do that by adding :comments org to the header of the code-block.
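
To illustrate, here is a rough sketch of a block that opts in to :comments org. The CTE, table and column names are hypothetical:

*** CTE 1
Orders are deduplicated here because the upstream feed can resend rows.
#+begin_src sql :comments org
  with orders as (
      select distinct order_id, customer_id from raw_orders
  )
#+end_src

The header on the block itself takes precedence over the :comments no set in the heading's properties, so tangling should carry the sentence before the block into model_1.sql as a SQL comment, while blocks without that header keep tangling as bare code.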

OK, so literate programming is solved. Let's look at the bonus issue next since that's still more in the realm of literate programming as a whole. One benefit of this setup is that I can write my YAML tests directly next to the actual SQL source code. For example:

#+begin_src sql
  select something from somewhere
#+end_src

:testing:
This goes to the yaml.
#+begin_src yaml
  - name: some_column
    data_type: INTEGER
    tests:
      - not_null
#+end_src
:end:

This of course requires setting up the boilerplate for the YAML file, which I do in a separate code-block before the first CTE section. I also use this preface to specify the dbt materialization and table alias, as well as add a general comment as to the purpose of the model.
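
A sketch of what such a preface might look like, assuming the header properties shown earlier and a model called model_1 (the description and alias are placeholders, not what I actually use):

** Model 1
#+begin_src yaml
version: 2
models:
- name: model_1
  description: One-line summary of what the model is for.
  columns:
#+end_src

#+begin_src sql
  {{ config(materialized='table', alias='model_1_alias') }}
#+end_src

Because blocks are tangled in document order, the YAML preface lands at the top of _model_1.yml and each later column entry is appended beneath columns:. The preface is written flush left here so that column entries indented by two spaces, like the one in the example above, line up with the columns: key, which YAML accepts as list items belonging to that key.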

If you're familiar with YAML, you will know that indentation is used to determine scope, and tangling org files tends to mess up the whitespace rules. The solution here is to set the org-src-preserve-indentation variable to t - but we want this to be on a per-file basis. And we have some other Elisp that we want to run, so that we can fix the second issue of code style. I have an Elisp code block that looks like this at the start of the org file:

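#+name: startup
#+begin_src elisp :results silent
  ;; Keep the indentation of source blocks exactly as written (YAML depends on it).
  (setq org-src-preserve-indentation t)

  (defun sql-format ()
    "Run sqlfluff on the file visited by the current buffer."
    (call-process-shell-command
     (concat "sqlfluff fix --dialect snowflake "
             (shell-quote-argument buffer-file-name))
     nil
     nil))

  ;; After tangling, tidy up each generated SQL file and then format it.
  (add-hook 'org-babel-post-tangle-hook
            (lambda ()
              (when (string-match-p "\\.sql\\'" (buffer-file-name))
                (flush-lines "-- :end:" (point-min) (point-max))
                (flush-lines "^$" (point-min) (point-max))
                (save-buffer)
                (sql-format))))
#+end_src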

The sql-format function just runs sqlfluff over the file the current buffer is visiting; our team is using Snowflake, so that's the dialect I've selected. I want this function to run on the output whenever I tangle an SQL file, so that's what the add-hook block is doing. The two flush-lines calls get rid of any :end: comments (which come up when I want a comment after a :testing: section) and remove any empty lines (since sqlfluff will handle padding on its own). This means that I can write the SQL however I like, and then not care about formatting the output. In fact, until I send the PR for review, I never even have to read the tangled output. I can just work in the master file the whole time.

We can run this code-block from within org-mode with C-c C-c, but it's a bit of a chore to do this each time I enter the file. There is a better way: with local variables I can execute this code-block when I enter the file. This is also the reason for #+name: startup at the top:

# Local Variables:
# org-confirm-babel-evaluate: nil
# eval: (progn (org-babel-goto-named-src-block "startup") (org-babel-execute-src-block))
# End:

Chuck this at the end of the org file, and now every time you open it that startup block will run.

That's basically it. There are a few caveats: obviously, once this gets merged into version control, any changes made there won't synchronise with the org file. This could be fixed with link comments and org-babel-detangle, but that goes against making sure no one else in the team ever needs to interact with org. This workflow is supposed to be for myself only, and not interrupt anyone else in their reading or writing of the code. This does mean that once the code starts going through review, and especially once any of it is actually merged, the master file is essentially useless as it won't stay up to date. But that's OK, because usually any changes that are made after merge are minor, and I can still keep the org file updated with any information that shouldn't be a file comment. Or, I can export the whole thing as a Word document to describe the implementation in plain English to the reviewer. Not sure if that will ever come up, but it's nice to have the option.