Hacker News

I hate the experience of a new code base when everything is so alien that at first it seems horrible. Then when you learn the ideas they're using you start to like the way everything is put together. Then you start finding navigation really easy. You're able to make changes quickly and can be sure they're safe. Then that project ends and you start again. Each time it gets easier as you see the same patterns, but so much time gets sunk into learning something that you'll never see again.


> I hate the experience of a new code base when everything is so alien

Personally I love that experience, of a fresh new codebase with some patterns and other things I haven't seen before, seeing what works well and what doesn't, what stinks and what shines.

Once the exploration has been done, and how to achieve things starts to become obvious, I lose a bit of motivation: I can solve the problems in my head, and now the boring part of actually typing it out begins.


Ha, I wish there was a way we could split the labor on that. You do the learning, somehow transfer that to me and I do the bug fixing. I love fixing up a legacy system but don't enjoy the first few weeks of feeling unproductive. That has more to do with the pressure to get things done than the actual exploration.


I do that with my teams. I have a technique where I turn existing code bases into literate programs. Largest I ever did this on was around 250k SLOC, but that didn't take me too long (you get fast with it if you practice). The result was a set of documents that I turned into presentations and diagrams (better than autogenerated ones) on the system to share with the team.

I'll be doing that with one part of our system starting tomorrow, actually, because no one understands it and everyone is afraid of it (there is one automated test, a massive end-to-end test that theoretically exercises every capability and takes about 2 hours to run).

The underlying code and repo are unaltered; this is a collection of org files that sits alongside them. I run org-tangle to verify that I haven't actually altered the code in my various adjustments to the literate-program version. So people still see their regular cpp or whatever files and don't have to learn the tools I use (though I'd share if they asked; no one has ever asked). But they get the benefit of understanding the programs better.

EDIT: By "didn't take me too long", it still took a few weeks. But it was a new-to-us system that the original contractor provided sans tests (there were directories and references to numerous automated tests that we weren't given and they wouldn't provide) and with useless auto-generated UML diagrams as "documentation". But a few focused weeks was probably a lot better than a few years of confusion and frustration.


Ok I'll ask, what tools do you use?

Is your technique described somewhere?


Tools used: emacs, org mode, org babel, and git. `git grep` or an IDE (these days mostly an IDE because it works well) to do code searches to find references to functions/classes/structs/whatevers.

Org mode lets you create code blocks like:

  #+BEGIN_SRC cpp :noweb yes :tangle yes
    // place all code here
  #+END_SRC
By default, org-tangle tangles a file foo.org into foo.cpp (or whatever the appropriate extension is), one output per language that has blocks set to tangle. You can be more explicit with:

  #+BEGIN_SRC cpp :noweb yes :tangle foobar.cpp
Useful for explicitness, or if the name is different from the org file's (for my purposes, I try to keep it one-to-one).

For every source file I generate a .org file that contains a single code block, which starts as the contents of the original corresponding file. I don't always do this automatically; sometimes I do it manually (it takes a few seconds) as I step through, if it's a more focused effort (versus trying to understand the entire program).
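The author doesn't say how they automate that wrapping step, but it's easy to script; here's a minimal sketch in Python (the extension-to-language map and the escaping rule are my assumptions, not the author's tooling):

```python
from pathlib import Path

# Assumed mapping from file extension to org-babel language name; extend as needed.
LANGS = {".cpp": "cpp", ".h": "cpp", ".py": "python"}

def wrap_in_org(src: Path) -> str:
    """Wrap one source file's contents in a single tangle-ready org block."""
    lang = LANGS[src.suffix]
    lines = src.read_text().splitlines()
    # Conservatively comma-escape lines org could parse as its own markup
    # ('*' headings, '#+' keywords; the thread escapes '#include' the same way).
    escaped = [("," + l) if l.startswith(("#", "*")) else l for l in lines]
    body = "\n".join(escaped)
    return (
        f"* {src.name}\n\n"
        f"#+BEGIN_SRC {lang} :noweb yes :tangle {src.name}\n"
        f"{body}\n"
        f"#+END_SRC\n"
    )

def wrap_tree(root: Path, out: Path) -> None:
    """Mirror the source tree as .org wrappers in a parallel directory."""
    for src in root.rglob("*"):
        if src.is_file() and src.suffix in LANGS:
            dest = (out / src.relative_to(root)).with_suffix(".org")
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_text(wrap_in_org(src))
```

Writing the wrappers into a parallel tree matches the separate-repo setup described further down, so the real project stays untouched.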

After that I select various points of interest. I find `main` or other entry points; if there's a known issue I may dive in there to start, but eventually it gets back to some equivalent of `main`. I generate a todo list, which is just a list of all the files; it will be expanded over time. In org mode you can link a file with:

  [[file:path/to/foo.org][foo.org]]
So the todo list actually becomes an index to various points in the program. I can add text if appropriate, though a lot of files are named well enough that it's not always necessary. Sometimes I delete things that aren't really that important but are linked elsewhere. I may create a table of contents that's more focused than the raw index if I want to preserve the raw index (it is convenient).
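A sketch of what such an index might look like (file names and annotations are hypothetical, not from the author):

```org
* TODO [[file:src/main.org][main.org]] -- entry point, argument parsing
* TODO [[file:src/scheduler.org][scheduler.org]]
** TODO job queue handling
* TODO [[file:src/io.org][io.org]] -- mostly boilerplate, low priority
```

Because these are ordinary org headings, the usual TODO-state cycling and folding work on the index for free.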

Diving into a particular source file, I start extracting portions out. Org supports noweb syntax for references. Naming a block, I can reference it in its original location by surrounding the name with `<<` and `>>`:

  #+NAME: main
  #+BEGIN_SRC cpp
    // copy of main()
  #+END_SRC

The rest of the source:

  #+BEGIN_SRC cpp :noweb yes :tangle footer.cpp
    // bunch of code

    <<main>>

    // remaining source
  #+END_SRC
Periodically run org-tangle and git diff. If the code has changed by more than whitespace (sometimes I lose blank lines, which doesn't change the meaning of programs in any language I use), then I botched the extraction: go back and fix it. You can give a file path, not just a name, in the `:tangle` parameter, so you can do this in a parallel git repo; the work is under version control without impacting the real project repo.
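The whitespace-tolerant comparison can also be done without git (handy in a quick script around a batch tangle); a minimal sketch of that check, my own helper rather than the author's tooling:

```python
import re

def same_modulo_whitespace(original: str, tangled: str) -> bool:
    """True if two sources differ only in whitespace or blank lines --
    the tolerance described for checking a tangle round-trip.
    Coarser than `git diff -w`: it also ignores whitespace inside string
    literals, so treat a pass as 'probably fine', not proof."""
    def normalize(text: str) -> list[str]:
        # Drop blank lines and collapse internal runs of whitespace.
        return [re.sub(r"\s+", " ", line).strip()
                for line in text.splitlines()
                if line.strip()]
    return normalize(original) == normalize(tangled)
```

In practice `git diff -w` in the parallel repo does the same job with better reporting; this is just the same idea made explicit.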

Repeat this, shifting file contents around to draw attention to interesting, important, or complex bits. Uninteresting and boilerplate stuff gets shoved to the bottom in "appendices", as I usually name them:

  * Appendix I: All the includes, nothing interesting

  #+BEGIN_SRC cpp
    // `,` needed before # within org-babel blocks, but doesn't show up in generated file
    ,#include <iostream>
  #+END_SRC
I use links and references to cross-reference most of it, but probably not all since, as with most things, there is a point of diminishing returns. Org can generate HTML and other document formats, so I take advantage of that to produce something shareable. I add documentation that covers critical things, especially non-obvious or complex ones. Got a complex set of equations? I write them out in TeX notation so it's clearer than the raw code, explain the variables, or add a reference document.

The todo list gets expanded with subtasks (under the containing file) as I see the contents of files. These might be class names, function names, or a name capturing some trait or purpose of a collection of functions. Not every function is worth documenting; many are obvious. But any longer or more complex ones will usually get an entry and a link and be extracted to their own source block. Their contents may be further extracted, since literate programming permits it, so I can draw attention to the things that I think are most important.

----------

Primary deficiency of this method: I'm the only one who does it, it's a separate repo, and if I'm not a primary contributor to the actual project it will not be maintained. If the system is somewhat stable, that's not a big deal, but it will become out of date and be scrapped eventually. It's good for kickstarting a project, though, because you can either guess at what you're doing or try to understand the system and be deliberate about your changes, extensions, etc. I prefer to be deliberate. The guessing approach never worked out well for me.

----------

This approach also lets me identify certain critical points in the system, like "seams" in Michael Feathers' terms. This is helpful for getting to the actual work (writing or changing code), which will usually require introducing tests that don't exist or were removed by some asshole contractors. I'll document these things; since I'm not usually actively changing the code yet, these are notes. I'll also draw attention to potentially insecure or otherwise questionable code.

----------

This isn't the only thing I do (or try to do). I've described it in the past as dissecting and vivisecting. The above is dissection; the code isn't "live". The whole process can, and should, be paired with various tracing and debugging tools to actually exercise the system and get the real control flow, especially if how a function/method is triggered isn't obvious from looking at the static system. Which series of actions on the real system will bring us to this point? OK, I can make a note of it; maybe it's actually statically traceable and I just missed something, but the trace, or stepping through with the debugger, gives me the details I missed.

----------

For smaller programs (I deal with systems of systems these days, so individual programs may actually be pretty small even if the whole system is still "large" by some definition) I may put this into a single org file for all the source files. The explicit filename parameter to `:tangle` is helpful here, but it's also a lot easier to link everything together.

I'll also use a single org file to create a focused document presenting some critical thing. How do we get from main to X? Here's the path, eliminating everything else. This version may not tangle into a compilable solution because of what I leave out, but it's a good presentation format.

----------

Text-based graphical presentation tools like graphviz and plantuml play well with this method. I can embed graphs and diagrams as text in the same org file, and they will be rendered when exported to HTML or another format.
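For instance, a graphviz block embedded in the org file (node names hypothetical) renders to an image on export, assuming the dot language is enabled in org-babel:

```org
#+BEGIN_SRC dot :file callgraph.png :exports results
  digraph calls {
    main -> parse_args;
    main -> run_loop;
    run_loop -> handle_event;
  }
#+END_SRC
```

The diagram source lives in the same version-controlled text as the narrative, so it stays editable alongside the code it describes.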


You could totally get a great article out of that.


Thanks, maybe one day. I've started pushing for "lunch & learn" events at work again so I may dig out a smaller program to demo the process on and a medium sized one to show a more substantial result. If I use non-proprietary source code for it then I could throw it on my github to share.


I'd say that is one way to have a 10x impact.


I like the section on Document and verify.

Particularly in startup situations without processes (e.g. formal product requirements documentation, software architecture descriptions), the codebase can be largely undocumented.

As a new member of the team, it can be beneficial to start generating this documentation to:

- it buys more time to review the code: having some sort of work product assures management that you are indeed still coming up to speed on the code rather than playing video games

- it is actually beneficial to the team, especially as it grows, since new additions to the team will likely need onboarding documentation



