In LCDC: Fundamental Knowledge, I explained how hard it is to specify a minimum level of knowledge or experience for all programmers. That minimum level would be needed to determine what is allowable in Lowest Common Denominator Code (LCDC). Anyone who has been programming for any length of time is probably shouting at the screen, calling me an idiot, because programmers don't really need to know the internals of some of this stuff. We can rely on well-written libraries to handle the hard parts.
I'm going to look at this from two different directions.
The problem with assuming that a library hides all of the hard bits is that no library is a perfect abstraction. In some cases, you can ignore the internals. In others, the fundamental properties of the library are more evident.
In recent years, I've been doing quite a bit of Perl programming. In Perl, as in most of the dynamic languages, one of the fundamental data types is a hash, which is implemented as a hash table. To make sure we are on the same page (because I can't know your background), the following is a list of the important characteristics of a Perl hash.
I have regularly seen a pattern in code where a programmer wants to see if a string exists in a large array of strings. So, they use the following approach:
They reason that looking up a string in a hash is fast, so this seems like a good idea. Unfortunately, for a single lookup this is actually slower than a straightforward linear search of the array, because building the hash requires touching every element first. If the programmer understood the way hashes work (and a little bit about algorithmic complexity), they never would have made this mistake.
In multiple languages, I've seen people try to randomize an array by passing the standard rand function as the comparison function to the standard library's sort. Without knowing how sort works under the hood, you may not realize that this can result in anything from a mostly unsorted array to a run that never terminates. (In really unusual cases, it could even modify memory outside the array.)
A large number of security holes have been caused by misuse of the C standard library functions strcat and strcpy. Some people blame the language for not being robust. Another way to look at it is that people are using the library without understanding how it works.
One last example dates from early in my programming career. I found the following line in a C program.
str[strlen(str)] = '\0';
In fact, this same idiom was repeated in many places in the code. It turns out that the programmer had come to C from another language. When learning C, he had read that every C string must be terminated with a nul character. He intended this line to set the character after the end of the string to nul. Unfortunately, he didn't realize that strlen works by scanning for the nul. This makes the line an expensive no-op.
The more complex the library, the more likely that some programmer will not understand it. This means that hiding complicated code by putting it in libraries may not solve your problem.
Let's say that somehow we could argue that the library solution really would make complicated algorithms and data structures usable for everyone. Shouldn't that same argument apply to your project's code? Shouldn't your programmers be able to write a library that wraps up complicated logic and makes it usable to the entry-level people?
If the library is well designed, with good abstractions and thorough documentation, it can definitely abstract away some of the complex problems in the code. This approach makes the code easier for junior programmers to understand and maintain.
The problem, of course, is that you can't use a library to encapsulate knowledge and still write the internals of that library without the knowledge. The critical functionality of the code is usually entrusted to the more senior people, who must understand the internals in order to write the library code. So, at a minimum, the library itself cannot be LCDC.
Libraries are not a panacea for the LCDC problem. Programmers can find ways to misuse libraries if they don't understand the algorithms and assumptions used by the library. Moreover, if libraries could solve the problem, then your project should be able to use the same approach by hiding knowledge in libraries. But, that violates the LCDC assumption because the library cannot be written without that knowledge.
In the next post, we'll start looking at a way to get rid of the LCDC assumption.
For the rest of the posts in this series, check out The Myth of Code Anyone Can Read.
In The Myth of Code Anyone Can Read, I introduced the idea that least common denominator code (LCDC) is not a good approach to writing software. One reason is the knowledge base of your average programmer.
Programming is still a relatively new field. It's also a pretty broad field. A person claiming to be a programmer or software engineer could have learned their craft in any of several ways:
Each of these can result in either really good or not-so-good programming skills. In addition, the terms programming and software development can also be applied in very different areas.
Each of these different areas has a very different idea of what knowledge and skills are fundamental. You can't necessarily take a website developer and have them be productive on an embedded systems project. You might not want a game developer working on software for pacemakers.
Given different backgrounds, specifying a minimum level of knowledge becomes much harder.
Let's start simple. If we want to write LCDC, we can't use any data structures that aren't understood by everyone. So, we can probably guess that most people would understand arrays. That is pretty fundamental. What about others[1]:
Most programmers in my experience are not familiar with many of the data structures above, much less all of them. Some of these data structures underlie programming tools we use every day. Others are more specialized. Some are extremely well known in one industry or company and virtually unknown in others.
If we really want LCDC, these data structures and the advantages they provide would be unavailable to us. After all, most programmers don't know how a red-black tree or a hash table works, so how can we write code that uses them?
Data structures aren't the only fundamentals that we can't rely on everyone understanding. Many of the algorithms that we depend on are opaque to the average developer.[2]
In some fields, each of these algorithms is commonly used. In others, each is completely unknown. Even in the fields where a particular algorithm is used, most developers probably don't understand all of the algorithms used in that field. According to the LCDC premise, we cannot use any algorithm that isn't understood by everyone.
Because of the breadth of the programming field and the many different ways that individuals came to work in the field, it is very hard to describe a subset of knowledge that we can claim is known by everyone.
Not all of these apply to every business, but most programs end up touching one or more of these areas somewhere. Our code would be slower, less correct, and harder to maintain without being able to take advantage of well-known and well-tested algorithms, even if they are beyond the grasp of your most junior people.
In the next post, I'll explore using libraries to solve this problem. We'll also see how they would be affected by the LCDC idea.
I got into a conversation recently coming out of the Houston.pm user group meeting. As usual, we wandered over numerous technical topics, but one stuck out in my mind: whether or not to use more advanced or more complicated language idioms.
I've written about programming idioms and advanced code many times in the past (see below). Part of the reason for revisiting this topic repeatedly is a mindset that I have seen throughout my career. The idea is to write the code so that anyone can read it. Although this sounds reasonable at first, lowest common denominator code (LCDC) almost always results in a hard-to-maintain code base.
There are a number of reasons why this simple idea falls apart. The most obvious comes from your experience of reading text in a human language. If we wanted to keep text at a level that anyone could read, everything would need to be written at a first-grade level. That's about the lowest level at which you can claim everyone can read.
In human languages, text is written at different levels depending on the context and expected audience. Why would you expect programming to be different? Over the next few entries, I plan to cover some of the contexts that change the way code should be written, and to show different places where writing lowest common denominator code (LCDC) would harm the project and, possibly, your business.