Back in February, I ran across the post robkinyon on Perl: How useful is a debugger, really? and was again reminded of the great debugging debate. I thought about commenting on it at the time, but I got distracted.
Well, Matt Trout reminded me of it again today with his post Simple Debugging. So, I decided it's time to throw my thoughts into the wind.
When I was a very new programmer (a long, long time ago) I only used print statements to debug. This was partially due to a lack of tools and partially due to a lack of experience. I was introduced to a couple of debuggers, but was not really convinced.
Then I was introduced to a really good interactive debugger. It completely changed the way I worked. I could inspect data structures interactively, chase down long chains of links, and basically do the same sort of stuff I did with print, just faster. I could also set a breakpoint near where I thought things would happen and kind of explore the area to see what was wrong.
It didn't take me long to decide that this was definitely the way to work. Unfortunately, I couldn't really use this wonderful tool at work, where I was working on a background program that could not comfortably be placed under the debugger's control. (This was an old TSR program running under MS-DOS.)
I moved on to another job and was working in an environment without an interactive debugger. More importantly, it was the server-side of a client server system. Suddenly, the concept of pausing the whole server to track down some bugs did not seem like such a great idea. Our client-side programmers often used an interactive debugger for their work and I could easily compare their experiences with mine.
I started noticing something interesting. A few people were really amazing with the debugger. They could zero in on a problem with a small number of breakpoints and a little judicious inspection. Others spent enormous amounts of time single-stepping through the code, eventually finding a problem.
Over on the host side, I noticed something equivalent. There were a few people who wielded their print statements like a scalpel. They would examine the code, use a few well-placed prints to uncover surprising behavior, iterate a few times, and solve the problem. Others would scatter-shot print statements all over the code, printing out reams of data in the hopes of seeing something of interest.
With either tool, the key insight appeared to be the method chosen for placing either a print statement or a breakpoint. The effective programmers strove to understand the code and form a hypothesis of what problem might cause the current symptom. They would then use this hypothesis to determine where a print statement or breakpoint could be used to prove or disprove it.
Those people would usually make effective progress toward finding and squashing the bug. The ones that also had the most experience or understanding of the code were particularly effective in this technique. Their initial guesses tended to be much closer to the target.
The other group used either print statements or single-stepping to explore the code. They seemed to almost be on a voyage of discovery. This is admirable, and can be an effective way of learning the code. But, this approach does not lend itself to rapid bug fixing.
In watching several people over a period of years, I've noticed that people using print statements often discover this approach on their own. This is caused by the simple fact that adding and removing print statements is time-consuming and boring. Good programmers soon learn to reduce this cycle with better analysis.
On the other hand, I've seen many developers using the interactive debugger single-step technique continue because it feels like they are making progress. They don't notice the boring stuff as much, because they appear to be moving. It often takes input from a mentor to get them to understand that a better breakpoint choice would make all the difference.
I tend to prefer the print statement approach. This has fit well with the kinds of programming I've done during my career. However, I'm not a fanatic about it. A good interactive debugger can be a joy to use on a code-base that I'm really familiar with. But, as with most of the debates in programming, the most important and powerful debugging tool in your arsenal is the one between your ears.
There has been a lot of talk lately about projects moving to git. So far, the write-ups from people converting to git have all been glowing endorsements of the new one, true way. There's almost a religious fervor related to the subject.
Since my experience has not been quite that good, I thought it was worth documenting what I have seen. Based on some of the responses I've seen to any negative comments I expect to be blasted if anyone actually reads this. But, if anyone else runs into these kinds of problems, you'll at least know one person has seen the same.
I have fairly simple needs from a version control system based on the last couple of decades of software development.
I don't have a particular agenda or approach I care about. I just want to be able to work with my code and have the VCS help me. At present, I don't have a major need for distributed version control, but it might be nice. For me, a VCS is a tool, not a religion.
As time has gone on, I have used several VCS or SCM systems. These include: RCS, CVS, Subversion, and ClearCase. (I also had to help support some people using SourceSafe long ago.) I've done branching and merging in all of those (except RCS), so I'm fairly conversant with the general issues.
A couple of years ago, I tried to use git and was badly frustrated. I could not get data committed. I wasn't able to follow my normal workflow. The tool forced me to completely change the way I was working. I immediately dropped it. I had a similar experience with another distributed VCS tool (I don't remember which one), and so I discounted the whole mess as a bad idea.
Later, a fellow Perl Monger started talking a lot at the local meetings about how git was working well for him, and even gave a talk on the subject. It seemed like git might be worth trying again. With the information from that talk and better online resources that had become available in the intervening time, I was able to use git for a few minor projects I was working on.
It turns out that my original problem had to do with the index feature. This had apparently been a problem for many people and the newer tutorials made a point of explaining this feature better. I eventually got used to the extra step of re-adding files I had changed (or using the -a switch).
I was becoming somewhat comfortable with the tool.
Because I was comfortable with my Subversion repositories and I had been told about the git svn tool, I was using Subversion as my remote repository. This allowed me to have it backed up with all of my other repositories and fit my comfort zone better than having the whole repository in the local directory. (I know a lot of people swear by that feature. But I remember disasters in the old RCS days when the repository was also stored with your sandbox. It was too easy to lose both your current work and all of the history with one mistake.)
Things seemed to be going along okay until one day when I decided to push some changes from my laptop to the Subversion repository and pull them to my working directory on my desktop.
At the time, I think that my desktop was up to date with the master branch. I had been doing some history rewriting to clean up a few commits on a (local) branch. I merged a branch on my laptop to master and pushed the changes to the Subversion repository. A day or two later I pulled from the Subversion repository to my desktop machine. (The details are a little hazy since I expect the version control to keep up with what I've saved and when.) When I did, there were merge conflicts like crazy. Almost every file was conflicted somewhere. I tried to resolve the conflicts by hand and could not get everything back into a stable state.
This was quite surprising, because I've merged multi-month long branches in CVS (with much pain and suffering) as well as resolved merges in Subversion without much problem. Given the hype about how easy merges are with git, I was not expecting this.
Eventually, I came to the conclusion that the best thing to do was to blow away my working directory and start over with git svn in a new working directory. This was not a good feeling. Although I'm pretty sure I lost nothing from my desktop working directory, this was not the kind of behavior I expected from a VCS.
With some research, I eventually convinced myself that I must have messed up somewhere in the history rewriting and that was the cause of my mistake. Maybe history rewriting and Subversion weren't compatible or something.
A few months later, I was working on another project. I was still using Subversion as the remote repository for working with git. Honestly, despite all of the assurances that everything that goes into git comes back out again, I was still more comfortable with Subversion for safety.
Once again I was working on the code from two different machines. I had just finished some relatively hairy work and pushed to the remote (Subversion) repository. A day or two later, I pulled on the other machine and BAM, I'm in conflict hell again. I tried to resolve the issues without a whole lot of success. The conflicts did not seem to match up with what I could see on either machine. This time I'm sure I hadn't done any history rewriting. (I wasn't using that feature after the previous disaster.)
After fighting with the mess (and pulling two or three more times), I eventually gave up and blew away my working directory and rebuilt it.
I checked for similar stories on-line and talked with my local expert to no avail.
The biggest problem I had with it was that the actions that blew up that day were identical to things I had been doing all along. I couldn't track down exactly what I was doing wrong or even localize it to a sequence of steps that caused the problem. As a developer myself, I know that a bug report about something that randomly blows up after working the last dozen times in a row is not going to be taken too seriously.
At this point, it's worth reminding you that I've been using version control tools for almost 20 years. I've recovered from disasters in almost every one that I've used. I have never been left in this situation before.
Despite a few bad experiences so far, most of my usage of git has done what I needed. I was still not completely comfortable with the new workflow. But the ability to add aliases and script new commands is quite addictive. I knew that my experiences had to be odd, otherwise people would be reporting them and dropping git like a hot rock.
I had a new project that I wanted to work on, so I decided to do things a little differently. This time, I worked entirely in git, with no Subversion repository. I made a bare repository alongside my Subversion (and CVS) repositories to give me a single spot to back up and began working on the new project.
Things went along fine for a month or so, until I needed to get ready for a conference. I had been working mostly on my desktop, but would need my laptop updated before I could go to the conference. I merged a couple of feature branches back to master and made certain everything was working fine. I pushed the master branch to the remote (git) repository. I went immediately to the laptop and did a pull. BAM, my working directory was suddenly a smoking crater with conflict shrapnel everywhere.
Unfortunately, this was a Catalyst project and I ran into a new kind of conflict hell. The Catalyst system uses a Perl ORM that creates class descriptions for data stored in a database. The main description of each class is protected by an MD5 sum to show it hasn't changed. Merging files that contain this sum is guaranteed to produce conflicts. Unfortunately, I couldn't get the code to match up with either MD5. By this point, I had learned about the git reset command. But I still wasn't able to completely recover.
In any normal circumstance, I would probably have thrown away any tool that caused me this much grief. I have not yet had the religious conversion that many seem to have where git is concerned.
The only saving grace is that despite the disasters, I've always been able to recover my code (if not my working directory). I also haven't been able to nail down the problem. My latest attempt to stop the problems is to stop using the git that comes with Ubuntu and update to the latest.
I'm not convinced that git is as wonderful as everyone says, but it does have features I like.
Unlike most of the people writing about git, I'm not a true believer. It's got some advantages over the systems I've used before. But, despite their flaws, I've never had either Subversion or CVS leave me with a smoldering crater where my working directory was.
I'm sure that someone (if anyone actually reads this) will tell me that I'm doing something horribly wrong, or that I just don't understand the beautiful elegance of Linus's vision. Frankly, I don't care. Elegance of design doesn't matter if the implementation blows up. As Richard Feynman once said,
It doesn't matter how beautiful your theory is, it doesn't matter how smart you are. If it doesn't agree with experiment, it's wrong.
I use version control to support my code. If it doesn't, I will switch to a new system. I haven't given up on git yet (it hasn't yet lost any code I've committed), but one more explosion may be the last.
The Productive Programmer
Neal Ford
O'Reilly, 2008
When I found out about this book, I knew I had to read it. I have been a programmer for a long time and I'm always on the lookout for tips about how to become more productive. I expected a list of tricks that would make individual development tasks easier. This book does have some of those, although not as many as I expected.
The book is divided into two major parts. The first section, Mechanics, contains a number of useful tricks and tips organized in four major categories: Acceleration, Focus, Automation, and Canonicality. This organization is the first useful feature of the book.
Instead of just listing random tips, Ford spends a bit of time on what makes us productive and groups his tips accordingly. This approach not only gives specific tips, but also gives you a framework into which you can put your own tips. By reviewing the tips in a particular section and thinking about the tasks you do every day, you open up the possibility of spotting ways to make yourself more productive.
Although the book has a strong developer focus, many of the tips would be useful for any computer user that wants to be more productive. Much of Ford's advice flies in the face of the GUI/user friendly path taken by modern operating systems and programs. He makes a strong argument that a command line is more effective in the hands of someone who has learned to make use of it.
The second section, Practice, looks at higher-level practices that can make you more productive. In this section, Ford touches on Agile practices (TDD and YAGNI), good design, philosophy, meta-programming, and tools, among other things. In each case, he focuses on each practice in terms of productivity rather than methodology or dogma. This section will not teach you how to develop in an Agile fashion, but it does show how some practices make you more productive.
This section suggests some answers to the high-level questions we all struggle with:
* How do you make certain the code you write is the best it can be and solves the problem you need to solve?
* How do you avoid writing code that does not need to be written?
* How do you get the most out of your tools?
Many programmers will be familiar with many of the tips or, at least, the ideas behind them. Unless you have decades of experience on multiple kinds of systems, you will probably still find a few new items. At the very least, it may remind you of tricks you had forgotten.
I would definitely recommend this book to any programmer or computer power user. If you've only worked on Windows or always develop with a powerful IDE, prepare to have your beliefs challenged. If you have experience with the command line, this book still provides a lot of benefits in tools to improve the more GUI-based OSes, if you need to work there.
Last week, I wrote about a technique for debugging without using a debugger in Debugging Without a Debugger. I talked a bit about the advantages of instrumenting code and how it can be used to supplement the use of a debugger.
Those of you who have always used a debugger might consider this vaguely interesting, but not particularly critical. I first learned this technique in an environment where we could not use a debugger even if we had wanted to. Several times in my career I have been in situations where a debugger was not available and this technique was the only way to debug code.
The first debugger-intolerant environment I developed in was writing a TSR program under DOS that communicated with a foreground program written by another company. The equivalent type of program would be daemon processes under various flavors of Unix or Windows services.
At the time, debuggers did not support the ability to attach to a running process. Since we couldn't start the process under the debugger, there was no way to run the program inside the debugger. Additionally, the program we were working on was communicating with a device that was not under our control. Stopping the program in the debugger would not have been an option even if we could have run under one. Which leads us to the next class of programs that do not work well with debuggers.
If the program has real-time requirements, running under a debugger may not be an option. Real-time does not always mean that it has to be lightning fast, but almost all real-time systems have requirements on the amount of time they are allowed to let certain operations wait. Stepping through code in a debugger or stopping on a breakpoint will almost certainly violate these requirements.
If the only real-time requirement is user responsiveness, this is not a real problem. But, if the program is communicating with another program or external device, pausing in the debugger might cause the whole application to fail or behave mysteriously. The TSR I talked about above communicated with another computer over a special interface card. If the card wasn't serviced in a timely fashion messages would be lost and the application would fail.
If the program has really tight, hard real-time requirements, even printing to a log might be too disruptive. In those cases, I've seen systems that log to a buffer in memory that is written to disk when the system has time.
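The buffered approach described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from any of the systems I worked on; the `BufferedLog` class, its `capacity` parameter, and the choice to drop the oldest records when the buffer fills are all assumptions made for the example.

```python
import time
from collections import deque


class BufferedLog:
    """Keep log records in memory and write them out only when asked."""

    def __init__(self, capacity=1000):
        # A bounded deque drops the oldest record once capacity is reached,
        # so memory use stays fixed (an assumed policy for this sketch).
        self._records = deque(maxlen=capacity)

    def log(self, message):
        # Appending a tuple is cheap: no formatting or I/O on the hot path.
        self._records.append((time.monotonic(), message))

    def flush(self, stream):
        # Call this when the system has time. Returns the number written.
        count = 0
        while self._records:
            stamp, message = self._records.popleft()
            stream.write(f"{stamp:.6f} {message}\n")
            count += 1
        return count
```

The time-critical code only ever calls `log()`. Something less time-sensitive, such as an idle loop or a shutdown handler, calls `flush()` with an open file.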
Although there are a few debuggers on the market that deal with multi-threaded code, this kind of system plays havoc with debuggers. First of all, there is the question of what happens when a breakpoint is hit. Do we stop all threads or just the one? If we allow the other threads to continue, what happens if a second thread hits a breakpoint while we are looking at the breakpoint on the first thread?
If that isn't confusing enough, think about the changes to the timing of the interactions between threads. Race conditions may appear and disappear at random because of the interactions we are having with one or more threads. How does access to a shared object work when a thread changes the object we are inspecting in the debugger?
The next kind of system that I worked on without debugger help was a server in an on-line system. Like many on-line systems, this one had multiple threads to deal with incoming requests. There was a real-time component in the time required to service the request and respond to the client. If that weren't enough, the servers needed to stay running pretty close to 24/7. We could rotate servers into and out of service to load new code and fix problems, but they tended to run for hours, days, or weeks at a time.
A debugger is practically useless in this scenario. How do you watch a breakpoint that is only hit once every few hours? How do you catch problems that only occur on certain kinds of requests when you aren't sure which request triggers the problem?
In each of these scenarios, we found that by carefully instrumenting the code we were able to troubleshoot and solve problems despite the lack of a visual debugging environment. In some cases, we logged lots of information in the hopes of spotting the problem in the reams of collected data. In other cases, we put very specific instrumentation in place to catch the rare times when the problem occurred.
One benefit of this approach is the ability to bring the full power of your programming language to bear on recognizing a problem and logging the appropriate information. If we knew that the problem was related to a certain area of memory becoming corrupted, it was possible to write a function that tested that area of memory. We could then call the test at various points in the code and log when the error was detected. Most debuggers today support some form of conditional breakpoint. In most cases, though, they support only counting or simple conditional expressions. If your debugger supports a condition based on a function call, the debugger can match this feature.
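As a rough sketch of the corruption test described above, here is one way it might look in Python. Everything here is hypothetical: the guard-byte scheme, the `GUARD` value, and the `make_buffer`, `buffer_intact`, and `checkpoint` names are inventions for illustration, standing in for whatever check fits the memory area in question.

```python
import logging

logger = logging.getLogger("instrument")

# Assumed marker bytes placed on either side of the data being protected.
GUARD = b"\xde\xad\xbe\xef"


def make_buffer(size):
    """Return a bytearray with guard bytes on either side of the payload."""
    return bytearray(GUARD + bytes(size) + GUARD)


def buffer_intact(buf):
    """Return True if both guard regions still hold their original bytes."""
    return buf[:len(GUARD)] == GUARD and buf[-len(GUARD):] == GUARD


def checkpoint(buf, label):
    """Test the buffer and log the label if it is found damaged here."""
    ok = buffer_intact(buf)
    if not ok:
        logger.error("buffer corrupted, detected at %s", label)
    return ok
```

Calls such as `checkpoint(buf, "after parse")` are then scattered through the suspect code. The first label that shows up in the log brackets where the damage happened.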
You can also easily write a test that saves earlier state of the program to compare with the current state to see when things change. For example, you might only want to log in the destructor of the ABC object that was created by function abc(), not the other hundred or so ABC objects in the system. If this information is not already in a variable, most debuggers could not track this change. Most debuggers support some method of testing if a small number of simple variables change.
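Both ideas in the paragraph above can be sketched in a few lines. This is only an illustration under assumptions: Python has no deterministic destructors, so a `close()` method stands in for one, and the `tracked` flag and `ChangeWatcher` class are hypothetical names. The `ABC` class and `abc()` function mirror the example names used above.

```python
class ABC:
    def __init__(self):
        self.tracked = False
        self.value = 0

    def close(self, log):
        # Stand-in for a destructor: log only for the tagged instance.
        if self.tracked:
            log.append(f"tracked ABC closed with value={self.value}")


def abc():
    """Create the one ABC object we want to follow and tag it."""
    obj = ABC()
    obj.tracked = True
    return obj


class ChangeWatcher:
    """Remember earlier state and log when the current state differs."""

    def __init__(self, read_state):
        self._read = read_state
        self._last = read_state()

    def check(self, label, log):
        current = self._read()
        if current != self._last:
            log.append(f"{label}: {self._last!r} -> {current!r}")
            self._last = current
            return True
        return False
```

The tag costs one boolean per object, and the hundred untagged objects stay silent in the log.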
With the ability to write arbitrarily complex tests and the ability to log anything that you can access from the code, instrumenting the code is a very powerful technique.
Between this article and the last, I hope I've given you some reasons to consider troubleshooting without a debugger. Maybe the next time you find yourself bouncing on the step or next command in your debugger, you might consider a more automated way to troubleshoot.
I've noticed something about the programmers I have dealt with in the last few years. Many of them seem to equate debugging skill with ability to use a debugger. In fact, in some instances, the concept of being able to troubleshoot a problem outside a debugger is so foreign it would never occur to them.
A debugger is very helpful for many troubleshooting tasks. If there is a logical error in a localized area of code, a debugger can help you quickly explore the logic and find the problem. This sharp focus is the most important feature of the debugger. However, if you don't have any idea where the problem is located, this narrow focus is more of a hindrance than a help.
Most people who only have debugger-based troubleshooting experience end up scattering breakpoints throughout the code hoping that one of the breakpoints will get them close. Unfortunately, if the problem is based on the relationships between different portions of the code, the narrow focus may hide the actual problem. Much of the difficulty is that some defects are related to the sequence in which different pieces of code are called or to relationships between multiple pieces of code.
In this case, the debugger only gives part of the story. You need to track these relationships separate from the debugger session. The debugger's narrow focus and the need to track relationships and sequencing separately makes this kind of troubleshooting difficult. This is not a problem with debuggers, it is the result of using the wrong tool for the job.
A completely different approach is to instrument the code with logging statements. Because of the nature of logging, the sequencing information is explicitly tracked in the log. Proper choice of information to write to the log can help tracking the relationships as well. Instrumenting the code does not provide as easy a method of focusing in on more localized problems, but it is much better at troubleshooting non-localized problems.
Either technique can be used on many problems. Some problems are easier to solve with one technique or the other. A few problems are easiest to solve by combining the techniques. You can use instrumentation to find the general shape of the problem. Instrumentation may help to discover which methods are being called in which order or which methods are called more often than others. This approach is really helpful in trying to find when code is not called.
Since the kinds of problems that work best with instrumenting are problems with relationships between calls, just looking at the output is not always enough to find the problems. Sometimes you need to summarize the data in some way. Maybe you need to count calls to particular routines, or verify that every call to method A has a corresponding call to method B. These kinds of relationships are often easier to see after the data has been re-ordered in some way.
You can use tools like sort and uniq to do simple reorganization of the data to look for patterns. Sometimes you will need more powerful tools like AWK or Perl to extract relationships from the code. If you format the output of your logging statements appropriately, you can even use a spreadsheet program like Excel to re-organize the output to provide better understanding.
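When sort and uniq run out of steam, a short script can do the counting and matching described above. The sketch below is in Python rather than AWK or Perl, and it assumes a made-up log format of one `event id` pair per line, with `open` and `close` playing the roles of method A and method B.

```python
from collections import Counter


def summarize(lines):
    """Count events by name and report ids opened but never closed."""
    counts = Counter()
    open_ids = set()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip lines that are not in the assumed format
        event, ident = parts
        counts[event] += 1
        if event == "open":
            open_ids.add(ident)
        elif event == "close":
            open_ids.discard(ident)
    return counts, sorted(open_ids)
```

Feeding it the lines of a log file, for example `summarize(open("trace.log"))` with a hypothetical file name, gives the call counts and the list of ids that never saw their matching call.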
Instrumenting code provides a different kind of information than you normally get from a debugger. This technique is very useful for dealing with problems that require seeing relationships between multiple different portions of the code. Another place where instrumenting can be more useful than using a debugger is when investigating long loops. If a loop runs a dozen times, setting a breakpoint and inspecting the code on each pass can be useful. If the loop runs half a million times, the breakpoint is basically useless.
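For the long-loop case, the instrumentation can do the filtering that a breakpoint cannot. The following is a hedged sketch: the `process` function, the "negative value" condition, and the `every` interval are placeholders for whatever loop and whatever suspicion you actually have.

```python
def process(items, log, every=100_000):
    """Sum items, logging only suspicious passes and occasional progress."""
    total = 0
    for i, item in enumerate(items):
        if item < 0:
            # The hypothesis under test: negative values should not occur.
            log.append(f"iteration {i}: unexpected negative value {item}")
        elif i % every == 0:
            # A sparse heartbeat so the log shows how far the loop got.
            log.append(f"iteration {i}: running total {total}")
        total += item
    return total
```

Half a million passes then produce a handful of log lines instead of half a million stops at a breakpoint.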
The main use for instrumenting code is getting an overview of the code. Using a debugger gives a highly focused way to inspect the code. However, instrumenting is a better tool for getting a broader view of the code. In some cases, once you have digested this broader view, you may find a particular piece of code that needs more focused attention. Switching back to the debugger can be very effective at this point.
Once you are comfortable with both techniques, you will often find yourself switching back and forth between the two techniques. You might use some instrumenting to test an idea of why a problem is occurring and then switch to the debugger to look more carefully at a method that has attracted your attention. After doing some focused examination with the debugger, you might decide that another area might be more fruitful. Then, you instrument a different piece of the code to explore another idea. In some situations, the two techniques complement each other.
One of the main problems with instrumenting code is the need to change the code to add the statements needed to log information. This requires some recompilation and is not as quick as adding and removing breakpoints. Because of the recompilation cost, some people ignore this technique.
Obviously, avoiding a useful technique because it has a cost is not reasonable. Many of the decisions we make in software development are about trade-offs. You should be able to evaluate your debugging tools in terms of costs and benefits, as well. Obviously, you wouldn't use an expensive technique for a trivial problem. But, if the problem is complicated enough, the cost is less than the benefits.
Another way to reduce the cost of recompiling is to try to add as much logging as possible to avoid recompiling again. Printing out too much information with each logging statement or adding lots of instrumentation just in case may reduce the amount of time spent recompiling, but it increases the amount of time you spend summarizing and analyzing the output. It's always important to remember that the output from the instrumentation and the summarizing code are not the goal, they are just tools used to find an actual problem.
Instrumenting the code in a way that tests one or a couple of ideas is much better than generating so much output that you will spend a day wading through the logs looking for something important. Eventually, you reach the point where you are spending more time looking at the logs, than you would have running another compile.
One final downside of instrumenting code is the risk of accidentally leaving the logging code in place after you have found the problem. Your version control system is your friend at this point. Always check the changes you are going to commit to verify that you are only adding actual fixes and not debugging code.
Software Configuration Management Patterns
Stephen P. Berczuk and Brad Appleton
Addison-Wesley, 2003
The first three chapters define the problem space. We get a solid description of Software Configuration Management (SCM) and an introduction to patterns and pattern languages. This section of the book sets up the context that you will need to understand the rest.
Much like the GOF book, this book gives names to different practices that you may now be using. It explains each of these practices as patterns. More importantly, this book relates these patterns to one another as a pattern language that gives more of a big picture understanding of SCM. In other words, the book not only presents patterns such as Mainline, Integration Build, and Release Line; it also explains how these and other patterns relate to each other to make a strong SCM policy.
I have been using various version control systems for nearly two decades. In that time, I have stumbled my way toward understanding many of these patterns. If you have worked in software for a long time, you might feel that you already know everything you need. One of the things I found most useful in this book (besides the standardized naming) was justification for some of the practices I had come to accept. The way the book related different practices to make the combination stronger was also quite revealing.
If you have not been doing SCM for long or have just begun using some version control tool, this book can give you insight into what you should be doing. Unfortunately, I suspect that some experience is needed to properly appreciate the patterns in the book. If you already know everything you need to about SCM, the book still provides standardized names and relationships that can help when explaining your practices to others.
Overall, I recommend this book for anyone working in software development. While it is not the most exciting topic to read, it is practical and useful to working developers and their support teams.
I don't often write about something quite this geeky, but it is just too cool not to share.
Several months ago, I read about an interesting module for Linux called SSHFS. This module allows you to mount a file system over SSH. This gives you a way to access files on another server with which you don't have any form of shared access such as NFS or Windows shares. All you need is SSH access to the machine that you want to reach. It sounded interesting, but I didn't really have a need for it.
A few months ago, I had been doing a lot of work that involved comparing two source code trees. Depending on various factors, I would move some changes from one of the trees to another. I spend a lot of my time in the gvim editor and I've found the gvimdiff mode to be quite helpful for this sort of editing.
Unfortunately, gvimdiff only compares two files, not two directory trees. I had wrapped some scripting around my gvimdiff call to provide some support for the directory merge, but it was awkward and not very pleasant. While looking for something else, I stumbled across the DirDiff vim script. This did exactly what I wanted. I mapped a couple of function keys to diffget and diffput to simplify my work. Life was good.
Then one day, I needed to work on some files on another server where I had SSH access. I used SSHFS to mount the other system and found working that way to be quite comfortable. I was looking at files on the two systems and without thinking executed DirDiff to compare the differences. After working my way through a few files, it hit me what I was doing. I had files from one machine open in the left window and files from another open in the right window and I was copying bits of code back and forth between the two machines without any real effort.
In some circumstances (like the one I was in), this is incredibly useful. It is much more useful than copying files back and forth or making little changes and uploading files. I wasn't completely replicating the files on both servers, so an rsync approach would not work.
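For anyone who wants to try this, the workflow can be sketched as a small shell function. This is only a sketch under several assumptions: the function names `run` and `remote_dirdiff` and the example host and paths are hypothetical, the DirDiff script is already installed in vim, the paths contain no spaces, and the system unmounts FUSE file systems with `fusermount -u` (as Linux does).

```shell
#!/bin/sh
# Sketch: mount a remote tree over SSH, compare it with a local tree in gvim,
# then unmount. Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

remote_dirdiff() {
    remote=$1      # e.g. user@otherhost:/srv/project (hypothetical)
    local_dir=$2   # local source tree
    mnt=$3         # directory used as the mount point

    run mkdir -p "$mnt" || return 1
    run sshfs "$remote" "$mnt" || return 1
    # -f keeps gvim in the foreground so the unmount waits for the editor to exit
    run gvim -f -c "DirDiff $local_dir $mnt"
    run fusermount -u "$mnt"
}

# Example (not run here):
#   remote_dirdiff user@otherhost:/srv/project "$HOME/project" "$HOME/mnt/project"
```

Running it once with DRY_RUN=1 is a cheap way to see exactly which commands would be issued before touching the remote machine.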
I know this won't be useful (or interesting) to everyone, but it really made my life easier.
Practical Subversion, second edition
Daniel Berlin and Garrett Rooney
Apress, 2006
Two years ago, I reviewed the first edition of Practical Subversion. The second edition has substantially updated the reference information about the Subversion commands. The previous edition had been based on a pre-1.0 version of the program and had somewhat incomplete coverage of the program options. So the update was sorely needed. The production quality of the new edition is also much better. Better font choice and improvements in editing make this edition seem much more professional than the previous edition.
Unfortunately, this edition keeps the structure of the first edition, which I found awkward. The book tries to be both a definitive reference and a practical guide to the usage of Subversion. These very different goals would have been better served by providing the reference information in one section and the guide in another. However, the authors chose to intermingle the two streams of material. As in the first edition, this tends to break the flow of the text. If you are just interested in the practical advice, you are suddenly interrupted with a reference section detailing all of the options for a given subcommand. If you are looking for the options for a single command, you may need to skip over some advice sections to reach the detail that you need.
The bad news is that this structure prevents the book from being either a great reference or a great practical guide. With Version Control with Subversion freely available, there is not as much need for a definitive reference.
I would almost like to see all of the reference material removed or put in an appendix. If you want that information, the definitive on-line reference would be better anyway. More material exploring best practices and giving example usage of some of the commands would have been much more useful to me.
On the other hand, the book still provides a substantial amount of benefit. The sections on the programming API and using the SVN libraries from other languages are once again appreciated. The appendix comparing Subversion to other version control software is also very handy.
All in all, I think I would still recommend this book. The practical advice is effective if not quite as extensive as I would like. The Best Practices chapter is still one of the highlights of the book.
In Conversion to Subversion: Tags, I explained how I cleaned up the tags portion of the CVS dump file in order to generate a subversion repository in the format I wanted. Lars Mentrup emailed to say that I had not been clear enough on a few points. After re-reading what I wrote, I have to agree.
In the original article, I referred to working on the initial copy from trunk. This was obviously missing a subtle point. All of my changes started with the dumpfile that had been filtered to contain the project and tag entries. Given this filtered file, I was talking about the initial copy of the tags. This is the equivalent of
cp 'trunk' 'tags/FIRST_RELEASE'
Now, I really did not want the entire tree copied to the tags directory, and I wanted the tags directory structured differently. So, I changed the equivalent of the above command to the equivalent of
cp 'project1/trunk' 'project1/tags/FIRST_RELEASE'
Now the dumpfile has a large number of extraneous delete commands that are the equivalent of
rm tags/FIRST_RELEASE/project2
rm tags/FIRST_RELEASE/project3
rm tags/FIRST_RELEASE/project4
...
Since I changed the initial copy, these are no longer needed. So, I just delete them from the dumpfile.
The result of all of these changes is a single dumpfile with the 'project1' project information and the 'FIRST_RELEASE' tag information. I can use svnadmin load to load the project and its tags into the repository.
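The change to the initial copy can be sketched as a filter over the dump file. This is not the script I actually used, just a minimal illustration. It assumes the copy appears in the dump as a `Node-path:` header for the tag directory followed, in the same header block, by a `Node-copyfrom-path: trunk` header. The function name `retarget_tag_copy` is hypothetical.

```shell
#!/bin/sh
# Sketch: rewrite the tag's initial copy so that it copies
# PROJECT/trunk to PROJECT/tags/TAG instead of trunk to tags/TAG.
# Reads a dump on stdin and writes the edited dump to stdout.

retarget_tag_copy() {
    # $1: project name, $2: tag name
    awk -v project="$1" -v tag="$2" '
        $0 == ("Node-path: tags/" tag) {
            print "Node-path: " project "/tags/" tag
            in_copy = 1            # now inside the header block of the copy
            next
        }
        in_copy && $0 == "Node-copyfrom-path: trunk" {
            print "Node-copyfrom-path: " project "/trunk"
            next
        }
        $0 == "" { in_copy = 0 }   # a blank line ends the header block
        { print }
    '
}

# Example (not run here):
#   retarget_tag_copy project1 FIRST_RELEASE < project1.dump > project1-fixed.dump
```

Only exact header lines are matched, so a `Node-copyfrom-path: trunk` header belonging to any other node is left alone.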
Hopefully, this will clarify what I meant. Thanks again to Lars for taking the time to let me know where I was unclear.
In the first article of this series, Conversion to Subversion, Part I, I described the problem I found in trying to convert a project from my CVS repository to Subversion. In my last article, Conversion to Subversion: The Project's Trunk, I described the solution that I used to convert a basic project with no tags or branches. This time I'll discuss converting the tags on a project from CVS to Subversion.
As stated before, the tag directory structure generated by cvs2svn was not what I wanted. Assuming a CVS module of project1 and a tag of FIRST_RELEASE, the dump file would have a directory structure of tags/FIRST_RELEASE/project1. I wanted project1/tags/FIRST_RELEASE.
At first, I thought I could use the same approach that I had used for the main project. I would rename tags/FIRST_RELEASE/project1 to project1/tags/FIRST_RELEASE.
Unfortunately, searching through the dump file did not turn up a tags/FIRST_RELEASE/project1. So I began looking at the dump file a little harder. The result was a little confusing. Apparently, cvs2svn treated each tag as if the entire repository had been tagged (everything in trunk was copied to tags/FIRST_RELEASE). Then, everything except the project1 directory was deleted. This generated a large number of extraneous revisions that do not accurately reflect what happened in the repository. The end result would have been correct in the repository with the old directory structure, but it wouldn't work with the new structure.
I modified the initial copy from trunk to copy from project1/trunk to project1/tags/FIRST_RELEASE in the dump file. Then, I deleted all of the extraneous delete directory commands in the dump file.
The newly modified dump file would build the project with the tags I required. Just as importantly, the extraneous manipulations used to clean up the initial strange tagging request have been removed. This also solves the problem that would have been caused by attempting to change directories that had been filtered out of the dump.
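Removing the extraneous deletes by hand gets tedious, so here is a rough sketch of how that step could be scripted. It assumes each delete appears in the dump as a `Node-path:` header immediately followed by a `Node-action: delete` header, and it only drops deletes of directories directly under the tag directory. The function name `drop_tag_deletes` is hypothetical, and this is not necessarily how my own fix-up script does it.

```shell
#!/bin/sh
# Sketch: remove delete records for direct children of a tag directory.
# Reads a dump on stdin and writes the edited dump to stdout.

drop_tag_deletes() {
    # $1: tag directory as it appears in the dump, e.g. tags/FIRST_RELEASE
    awk -v prefix="Node-path: $1/" '
        skipping && $0 == "" { next }   # swallow blank lines after a dropped record
        { skipping = 0 }
        held != "" {
            if ($0 == "Node-action: delete") {
                held = ""               # drop the Node-path line and this line
                skipping = 1
                next
            }
            print held                  # not a delete: emit the held line unchanged
            held = ""
        }
        index($0, prefix) == 1 && index(substr($0, length(prefix) + 1), "/") == 0 {
            held = $0                   # direct child of the tag: wait for the next line
            next
        }
        { print }
        END { if (held != "") print held }
    '
}

# Example (not run here):
#   drop_tag_deletes tags/FIRST_RELEASE < project1.dump > project1-clean.dump
```

Because the filter works line by line, a file whose contents happen to contain the same two lines would be altered as well, so it is worth diffing the result against the input before loading it.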
I incorporated this change into my script that fixes up the dump file before I do the load. It seems to be working quite well. The new projects I've added with these changes appear to be intact with the appropriate tags in place. If I had any branches I wanted to keep, I could apply an equivalent approach to fix up the branches before loading.
This entry has been updated a bit in the entry Conversion to Subversion: Tags Revisited to answer questions I've received by email.
In Conversion to Subversion, Part I, I described the problems I found when I began converting my CVS repository to Subversion. In this article, I describe the work and surprises that came from the first project migration.
My first idea was to build the repository from the dump file and then fix the result using moves inside the repository. Unfortunately, that would have left the previous history in the wrong places in the hierarchy. Although this would not have prevented me from doing further development, looking at previous versions would be messier than I'd like.
So, obviously, I needed a way to make the repository right when projects were added for the first time. The section in Practical Subversion on importing from other systems suggested that the format was relatively easy to modify. Reviewing the relevant sections of Version Control with Subversion confirmed this information. If I could make the required changes to the dump file, then I could create a repository laid out the way I wanted it.
The first step in this process was to dump a particular project with its related tags and branches. Examining the projects I wanted to move gave me one that had neither tags nor branches. This would be about as simple a case as I could start with. To hide irrelevant details, let's call this project smallproject.
As I said in the previous article, the path for this project would have the form: trunk/Repository/smallproject. To extract this project from the main dump file named cvs2svn-dump, I used the following command:
svndumpfilter include trunk/Repository/smallproject \
--drop-empty-revs --renumber-revs \
< cvs2svn-dump > smallproject.dump
The --drop-empty-revs option removes revisions that have no relation to the project I want. The --renumber-revs option cleans up the numbering in the file. I found it more convenient to have contiguous revision numbers when examining the file.
Since I needed to do a relatively simple fixup to the new dump to change the path, I used a Perl one-liner to make the change:
perl -pe's!trunk/Repository/smallproject!smallproject/trunk!g;' \
smallproject.dump > smallproject2.dump
This just uses Perl's substitute operator (with '!' as a delimiter) to change the old path into the new path everywhere in the file. I put the output in a different file so I could compare them and make certain that there were no unexpected differences. After I verified that the paths in the file looked correct, I was ready to go.
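Beyond eyeballing the two files, a quick sanity check is to list the top-level directories named in the dump's `Node-path:` headers. The helper below is a sketch of that idea; the name `dump_roots` is hypothetical and it assumes paths are recorded in `Node-path:` header lines.

```shell
#!/bin/sh
# Sketch: print the distinct first path components found in Node-path headers.

dump_roots() {
    # $1: path to a dump file
    awk 'index($0, "Node-path: ") == 1 {
        path = substr($0, 12)      # text after "Node-path: "
        sub("/.*", "", path)       # keep only the first path component
        print path
    }' "$1" | sort -u
}

# Example (not run here):
#   dump_roots smallproject2.dump
#   diff smallproject.dump smallproject2.dump
```

For the rewritten dump, the only root listed should be smallproject, while the diff should show nothing but path changes.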
One of the reasons I had picked this project was that I had decided that I wanted it in a different repository than the source code I had from my earlier experimentation. So I created a new repository using the command:
svnadmin create /home/svn/newrepos
where newrepos was actually the real name of this repository. But, we'll stick with this pseudonym for now. Then, I loaded the project with the following command:
svnadmin load /home/svn/newrepos < smallproject2.dump
This promptly failed with a message that smallproject/trunk was not found. Of course it wasn't found; I was trying to create it.
After a bit more experimentation, I realized that the load was failing because the path /smallproject did not exist in the repository yet, so load could not create a subdirectory. So I recreated the repository and prepared to begin again.
With a clean repository, I created the beginning of the project with the following command:
svn mkdir file:///home/svn/newrepos/smallproject \
file:///home/svn/newrepos/smallproject/tags \
file:///home/svn/newrepos/smallproject/branches \
-m "Migrate smallproject project."
I left off the creation of the trunk subdirectory; otherwise, the load would still fail when it attempted to create that directory. Then, I reran the load successfully. I used the svn tools to check out this project in the new repository and verified that everything appeared to be as I expected.
The first actual migration worked. To simplify my work for later steps, I converted several of the command lines listed above into shell scripts to make running them a little less error prone. One other piece of insurance I started was to do a dump of any repository right before adding a new project to it. This gave me an easy way to recreate the previous state if/when something went wrong.
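The insurance step can be sketched as a small wrapper that dumps the repository before every load. This is an illustration rather than my actual script: the function name `load_with_backup`, the backup file naming, and the `SVNADMIN` override variable are all assumptions made for the example.

```shell
#!/bin/sh
# Sketch: dump a repository to a timestamped file, then load a new project.
# SVNADMIN can be overridden, e.g. to point at a different installation.
SVNADMIN=${SVNADMIN:-svnadmin}

load_with_backup() {
    repo=$1         # repository path, e.g. /home/svn/newrepos
    dump=$2         # dump file to load, e.g. smallproject2.dump
    backup_dir=$3   # existing directory that receives the safety dump

    stamp=$(date +%Y%m%d-%H%M%S)
    backup="$backup_dir/$(basename "$repo")-$stamp.dump"

    # Stop if the safety dump fails, so nothing is loaded without a backup.
    "$SVNADMIN" dump "$repo" > "$backup" || return 1
    "$SVNADMIN" load "$repo" < "$dump"
}

# Example (not run here):
#   load_with_backup /home/svn/newrepos smallproject2.dump /home/svn/backups
```

If a load goes wrong, the previous state can be rebuilt by creating a fresh repository and loading the most recent safety dump into it.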
Next time, I'll explain how I dealt with a project with tags.
Thanks to Lars Mentrup for catching my cvsadmin/svnadmin goof. The text has been corrected.
For about a year now, I've been playing with Subversion on small projects. In order to protect my main repository in CVS from my experiments, I just created new projects under Subversion and worked with them there. All of my real projects continued under CVS control. This way if my experiments with Subversion were a disaster, I would only lose revisions from the new work.
Now, I've finally reached the point where I want to move some of my old projects over to Subversion. I could just add all of the projects in their current state, but I do not want to lose the history. Since this turned out not to be quite as easy as I expected, I figured it might be useful to document the process I am going through in case anyone wants to learn from my mistakes.<grin/>
To understand the examples, you will need a little background on the CVS repository that I am working from. This repository holds about thirty projects that I have worked on over the last few years. Some of the projects are big, some are small. Some are currently undergoing work, some are effectively dead. Some of these projects date back over ten years, some are relatively new.
The repository lives on a Linux box in the directory /home/cvs. The directory where the actual repository is stored is called Repository. I started keeping my repository under /home when I started keeping my /home on a separate filesystem. This makes backups and upgrades easier. Moreover, some of the items in the repository could be considered private, so putting the repository with the home directories reminds me to treat it with the same care as I treat my home directory.
My goal is to move my current projects to Subversion repositories. The move must also meet the following additional goals:
Although I consider tags to be important, I have no work currently going on in any branches and all code from any branches has been merged into the trunk. I would prefer not to lose those branches, but it's not a requirement like the others. Additionally, I am experimenting with multiple Subversion repositories. So I may want to separate some projects into different repositories.
My first idea was to just use the cvs2svn script that comes with Subversion to convert directly. While examining the program, I found that it has an option to just make a dump file without changing the Subversion repository. This would allow me to do some poking around before actually moving the data to the new repository.
From reading Practical Subversion recently, I was aware that the installation should include a program called svndumpfilter that allows extracting parts of a dump file. This could allow me to move individual projects instead of moving everything at once.
I needed to look at the dump file to determine the paths needed for svndumpfilter to extract my projects. This was when I found my first surprise. The structure of the revision tree in the dump file did not match the structure of the repository I wanted to create. As an example, assume that I have a module in the CVS repository named project1. That project has a tag named RELEASE1. Finally, the project has a branch named major_rewrite. The directory structure from the dump file for this configuration would be:
/trunk/Repository/project1
/tags/RELEASE1/Repository/project1
/branches/major_rewrite/Repository/project1
Unfortunately, this does not match the recommendations from any of the articles or books I have read on Subversion. Based on those recommendations, the structure of the Subversion repository should be more like:
/project1
    /trunk
    /tags/RELEASE1
    /branches/major_rewrite
with the history stored in the /project1/trunk directory. In the time I've been working with Subversion, I have become accustomed to this structure and wanted to continue to use it.
The second surprise came when I examined the tags and branches. Both branches and tags are made in a strange way in the dump file. The entire repository is copied for each tag (or branch), then any modules that are not supposed to be part of that tag (or branch) are deleted separately. This means that there will be a series of revisions in the repository with tags/branches applied to projects that were never part of those tags/branches. None of this is visible in the final version of the repository, but it seems a bit inelegant.
In summary, this approach would result in all of the history from the CVS repository being copied to a new Subversion repository, but there are a few problems.
None of these is a killer problem. I would just like to set up the new repositories in a cleaner way. Come back next time to see how I fix it.
If you haven't tried Subversion yet, you really owe it to yourself to give it a try. I've used CVS for over a decade now and I've been trying Subversion for a little less than a year. I haven't yet moved most of my home projects to Subversion, but it's looking more probable every day.
The ability to rename and reorganize your files and directories without losing history is wonderful. The separation of status from update is great. I'm slowly coming to appreciate the properties system. It's really great to have a mime-type associated with each file and all the potential that goes along with that.
If you want to get started with Subversion, you can download a version at the URL above. You'll also want to read Version Control with Subversion, which is available on-line or in hard copy.
Compiler Design in C
Allen I. Holub
Prentice Hall, 1990
I decided to take a break from the relatively new books I've been reviewing and hit a real classic.
Over a decade ago, I saw Compiler Design in C when I was interested in little languages. A quick look through the book convinced me that it might be worth the price. I am glad I took the chance. This book describes the whole process of compiling from a programmer's point of view. It is light on theory and heavy on demonstration. The book gave an address where you could order the source code. (This was pre-Web.) All of the source was in the book and could be typed in if you had more time than money.
Holub does a wonderful job of explaining and demonstrating how a compiler works. He also implements alternate versions of the classic tools lex and yacc with different tradeoffs and characteristics. This contrast allows you to really begin to understand how these tools work and how much help they supply.
The coolest part for me was the Visible Parser mode. Compilers built with this mode displayed a multi-pane user interface that allowed you to watch a parse as it happened. This mode serves as an interactive debugger for understanding what your parser is doing. This quickly made me move from vaguely knowing how a parser works to really understanding the process.
Many years later, I took a basic compilers course in computer science and the theory connected quite well with what I learned from this book. Although the Dragon Book covers the theory quite well, I wouldn't consider it as fun to read. More importantly, nothing in the class I took was nearly as effective as the Visible Parser in helping me to understand the rules and conflicts that could arise.
Although this book is quite old, I would recommend it very highly for anyone who wants to understand how parsers work, in general. Even if you've read the Dragon Book cover to cover and can build FAs in your sleep, this book will probably still surprise you with some fundamentally useful information.
The book appears to be out of print, but there are still copies lurking around. If you stumble across one, grab it.
I was doing a little research on the Java JUnit test framework and ran across the article The Third State of your Binary JUnit Tests.
The author points out that in many test sets there are ignored tests as well as the passing and failing tests. As the author says, you may want to ignore tests that show bugs that you can't fix at this time. He makes a pretty good case for this concept.
The Perl Test::More framework takes a more flexible approach. In this framework you can also have skipped tests and todo tests in addition to tests that actually need to pass. These two different types of tests have very different meanings.
Skipped tests are tests that should not be run for some reason. Many times tests will be skipped that don't apply to a particular platform, or rely on an optional module for functionality. This allows the tests to be run if the conditions are right, but skipped if they would just generate spurious test failures.
Todo tests have a very different meaning. These tests describe the way functionality should work, even if it doesn't at this time. The test is still executed. But, if the test fails, it is not treated as a failure. More interestingly, if a todo test passes, the harness flags it as an unexpected success because the test was not expected to pass. This allows bugs and unfinished features to be tracked in the test suite with a reminder to update the tests when they are completed.
Unlike the idea in the referenced article, these two separate mechanisms don't ignore tests that cannot or should not pass. Instead, we can document two different types of non-passing tests and still monitor them for changes.