Primitive Lists in Perl

Oct. 16, 2002

When most programmers think about a programming language, they think about its syntax and maybe its libraries. However, a language is more than just the syntax you need to make the compiler happy. Each programming language develops a set of idioms, as well as a culture. This essay explores the idiom of primitive lists. When used correctly, this idiom can improve the readability, maintainability, and sometimes even the efficiency of your code.

What Is a List?

Most languages have some form of array or list data type. Usually languages treat these lists as second-class citizens, at best. Perl lists actually combine the best of arrays and lists into one type. People coming to Perl with a background in languages like C, C++, or Java start off by seeing Perl lists as the same sort of arrays they are familiar with. But Perl lists are actually much more.

According to Programming Perl, Third Edition, a list is "an ordered set of scalars." Notice that this definition says nothing about the types of the scalars. In a language like C, C++, or Java, an array must contain elements of the same type. You can have an array of ints, but your array could not contain multiple types.1 A single Perl list, on the other hand, can contain numbers, strings, and references all at the same time. Perl lists are also not static in size. You can add and remove items from the list at will.

So, Perl lists can be homogeneous -- all elements are the same type -- or heterogeneous -- elements are different types. Lists can be almost any size, from empty to completely filling memory. Lists can remain static or can change in size. In a way, we can say that traditional arrays are a subset of the behavior of lists. After a little exposure to Perl lists, most programmers learn to use some of the enhanced functionality.

1Of course, in an object-oriented language, you can have an array of references or pointers to a base class and store references or pointers to different derived classes. Even in this case, the array still only contains one type.

Lists Are a Primitive Data Type

A Perl list is actually a primitive data type just like a number or string. In order for a data type to be primitive, it must have some special support from the language. An integer is a good example of a primitive data type. Most languages support some form of integer. The special support provided by a language for a primitive data type includes

Variables for storage
Easy creation/initialization
Use as a literal
Special operators

Since almost every data type in a language has some form of variables for storage, that support is not unique to primitive data types. Since support for variables doesn't gain us much, we'll ignore that issue. However, the other support is usually only provided to primitives. So let's look at those features in Perl that support lists as a primitive data type.

Creation and Initialization

Probably the most basic language support for primitive data types is creation. Perl supports many ways of creating lists, both statically and dynamically.

In addition to special operators like split and regular expressions that create lists from strings, Perl supports push and unshift operators for adding items to either end of a list variable. Of course, assignment is useful for initializing a list variable as well. One very useful mechanism for creating lists is the quote words operator (qw//). This allows you to easily declare a list of single word items separated by whitespace. For example, the expression qw/Sun Mon Tue Wed Thu Fri Sat/ returns a list of abbreviations for the days of the week.

Literals

Another characteristic of primitive data types is the ability to create them on-the-fly, or as a literal. For example, arrays in C can only be created as an array variable. This variable may be initialized with a list in curly braces. Unfortunately, this is the only place in the language that you can use a literal list. On the other hand, in Perl we can define literal lists wherever we need them. To create a list of items, you can use the values in parentheses:

  @arr = ( 1, 2, 3, 10, 20 );

Or you can use square brackets to generate a reference to an anonymous list:

  $ref = [ 1, 2, 3, 10, 20 ];

However, in Perl these expressions are not just for initializing variables. You can use an anonymous list almost anywhere a list is allowed. For example,

  foreach my $prime ((2, 3, 5, 7, 11, 13, 17, 19))
   {
    # do something
   }

In this case, both sets of parentheses are required. The inner set creates the list. The outer set is part of the syntax of the foreach.

In fact, Perl even has the ability to assign to a list in a way that makes sense. Assigning to a list variable works as expected. But an anonymous list on the left side of the = is legal provided all of its elements can be assigned to. This allows us to generate interesting expressions like:

  ($x, $y) = ($y, $x);  # swap the values in these variables

and

  ($id, $pass, $uid, $gid) = split; # split line and set variables

Expressions such as these can often simplify a piece of code.

One piece of confusion about lists in Perl involves array variables. An array, or variable starting with @, may contain a list. And, when the whole array (or an array slice) is used it returns a list. But an array variable is not a list any more than a scalar variable is a number. A scalar variable may contain a number (or a string, or a reference) but that is slightly different than being a number. Early in your use of arrays and lists in Perl, this distinction will not make any difference. As you begin to use lists as primitives, this distinction begins to clarify what you can and can't do with lists.

Operators

In order to be treated as a primitive data type, a list must be easy to create and have a number of operators for working with that data type. The following is a list of Perl operators for lists:

( ), [ ], qw//, and .. - create lists
push, unshift - add to a list
pop, shift - extract from a list
splice - add, extract, and replace elements of a list at any position
map, grep, sort, reverse - mutate or transform the list elements
split, m// - convert strings into lists
join - stringize a list
foreach - list walking
print - output a list
= - assignment
[] - subscript operator
array slice

The subscript operator ([]) and the splice operator are usually used when set of data is to be treated as an array. An array slice is another useful operation where part of a list is treated as a separate list.

Kinds of Lists

As stated above, Perl lists can be homogeneous or heterogeneous. But, not all lists are created equal. Some of the operators listed above work more naturally with some kinds of lists than with others. For instance, it probably never makes sense to sort the list created from the fields in a web server log. Each field has a particular position and type. No one would care if the User Agent field sorts before or after the URL. Likewise, you would probably be hard pressed to find a use for map when processing a list consisting of the parts of a mailing address. Very few expressions could be usefully applied to a person's name, street address, and state of residence. However, both of these lists would often be accessed by subscript. The position of the item in the list is as important as its value.

Some list operators work fine with all lists, but some do not. For instance, the creation operators, join, split, and print tend to be useful anytime you want to create lists or convert them to or from string form. However, you may find that some lists are more suited to the use of push, pop, shift, and unshift than the subscript operator, [].

There is a subset of homogeneous lists in Perl that is particularly useful. I will call this type of list a primitive list. Some of Perl's list operators are very effective when used on primitive lists. These operators are map, grep, and sort.

Is It a Primitive List?

Depending on how it is used, a Perl list can either be an array of elements or a primitive list. The emphasis here is on how the list is used. If the elements are the important thing, it is not a primitive list. If the set of elements is more important than the individual elements, you probably have a primitive list. A list may be primitive if

The list elements are all the same type,
All items are treated in the same way,
The number of items is not significant, and
The order of the items is not significant.

Some examples of primitive lists are:

A list of names: ('Fred', 'Barney', 'Larry', 'Dilbert')
A list of filenames: ('index.html', 'perl.html', 'style.css')
A list of symbols: ('IBM', 'T', 'HPQ', 'C')
A list of tokens: ('$a', '=', 'time', '+', '4')
A range of values: (1 .. 100)

Some lists that seem to meet these criteria may not be primitive. If any of these criteria are violated the list is definitely not primitive. In other words, a list is not primitive if

The list elements are fundamentally different types,
Particular items in the list are treated differently than others, or
The number of items is significant.

Some examples of non-primitive lists are:

(1, "name", \&func )
fields from a web server log
corners of a rectangle
(x, y, z) coordinates

Why Do We Care?

Most Perl lists are useful. However, primitive lists can be even more useful. In fact, Perl provides some very powerful list operators that work particularly well with primitive lists. These operators are map, grep, and sort.

These operators are sometimes cumbersome and non-intuitive when used with non-primitive lists. However, with primitive lists, these operators really shine. By learning to recognize the difference between primitive lists and non-primitive lists, you will find it easier to decide when to use these operators effectively. Let's look at some examples.

A map Example

What happens when you need to perform the same operation on every item in a list? Most people coming to Perl from another language would tend to treat the list as an array when solving this kind of problem.

Assume we have performed an experiment and that some of the data we have collected is a list of elapsed times in seconds. If we wanted to convert these time values into minutes, we need to divide every element in the list by 60. If you are relatively new to Perl, you might code something like the following:

  my @minutes = ();

  for(my $i = 0;@times > $i;++$i)
   {
    $minutes[$i] = $times[$i] / 60;
   }

There's nothing particularly wrong with this approach, but there are a few nagging oddities:

Why do we need a new variable $i in order to modify the value of @times?
Why is more code time spent manipulating this extra variable than is spent on the problem we are trying to solve? (After stripping unnecessary whitespace, 2/3 of the characters in this solution are devoted to this manipulation.)
It is less obvious that we are transforming the entire array as opposed to just manipulating some elements. You must read the code to be sure. Granted, on a piece of code this small, that doesn't take long to read. But, in a large program, it is still time wasted.

Someone with a little more experience in Perl might code this as:

  my @minutes = ();
  
  foreach my $t (@times)
   {
    push @minutes, $t / 60;
   }

This is a little better. Although there is still a new temporary variable, much less code time is spent manipulating it. The syntax still focuses on the elements in order, even though this is not important in the current operation. It is still not obvious that the @minutes list is the result of this computation without looking carefully at the code. Once again, in a small piece of code, that is not too hard. In a larger program, inspection could be a significant problem.

On the other hand, we can use the map operator to do this in a more obvious way:

  @minutes = map { $_ / 60 } @times;

If you haven't used map very much, this may look a little foreign. This approach has several features to recommend it:

The only new variable is the default $_ variable. It's fairly obvious, at a glance, that the variable itself is not important.
It is obvious that we are manipulating the array as a whole.
The order of the elements is not emphasized in this approach.
There is minimal extraneous syntax.
It is obvious, at a glance, that @minutes contains the result of this operation.

As you can see, this approach can generate clearer code in some circumstances.

A grep Example

Another operation that you might want to perform on a list is to filter out data that does not meet some criteria. Once again, with only a little Perl experience, a programmer is likely to treat the list as an array.

Using the same @minutes list from the previous example, let's discard any data that is less than 1 minute in duration.

Someone with a strong background in another language might code something like this:

  @longMinutes = ();
  
  for(my $i = 0; @minutes > $i;++$i)
   {
    if($minutes[$i] >= 1)
     {
      push @longMinutes, $minutes[$i];
     }
   }

As with the map example, there's a lot of extra syntax and an extra variable that is needed just to hold temporary information. You might be tempted to try to walk the array and delete items from the middle as you find them. This is actually a lot harder than it appears and will result in more code.

Of course, we could also use the foreach my $t (@times) approach here, as well. Instead, let's look at the equivalent solution using grep:

  @longMinutes = grep { $_ >= 1 } @minutes;

Once again, the intent of the code is not obscured by a lot of extraneous syntax. At least, the intent is clear once you know how to read a grep expression. Basically, grep aliases each element in the list to $_ and performs the test in curly braces on it. Any element that passes the test is passed to the new list. Any that fail are not in the new list.

A sort Example

Another operation that you might want to perform on a list is sorting. Since sorting is something that is usually done very badly when done by hand, most people coming from another language expect to find some kind of function for sorting. Sure enough, Perl offers the sort operator.

Now let's assume that we want to change the @longMinutes list so that it is sorted numerically from lowest to highest.

The appropriate Perl code for this would be

  @sortedMinutes = sort { $a <=> $b } @longMinutes;

Pairs of elements are aliased to the variables $a and $b. The code in the braces is executed and the result is interpreted as follows:

result < 0 - $a is before $b
result == 0 - $a equal to $b
result > 0 - $a is after $b

Combined Example

Each of the examples above showed one part of a larger problem. Each of the individual pieces of code could be made simpler through the use of Perl's list operators. But most programs are bigger than these examples. What would happen in a larger example?

Let's say that you wanted to do everything from the last few examples. If you don't use the list operators, the code would look something like this:

  my @minutes = ();

  for(my $i = 0;@times > $i;++$i)
   {
    $minutes[$i] = $times[$i] / 60;
   }

  @longMinutes = ();
  
  for(my $i = 0; @minutes > $i;++$i)
   {
    if($minutes[$i] >= 1)
     {
      push @longMinutes, $minutes[$i];
     }
   }

  @sortedMinutes = sort { $a <=> $b } @longMinutes;

After combining the loops, removing unneeded temporaries, and a little cleanup, the code looks like this:

  @longMinutes = ();
  
  for(my $i = 0;@times > $i;++$i)
   {
    my $minute = $times[$i] / 60;
    if($minute >= 1)
     {
      push @longMinutes, $minute;
     }
   }

  @sortedMinutes = sort { $a <=> $b } @longMinutes;

If you prefer the foreach my $t (@times) version, it would look like this:

  @longMinutes = ();
  
  foreach my $t (@times)
   {
    my $minute = $t / 60;
    if($minute >= 1)
     {
      push @longMinutes, $minute;
     }
   }

  @sortedMinutes = sort { $a <=> $b } @longMinutes;

This code is not too bad. But the list operators seemed to clean the individual stages up quite a bit. So let's look at what they would look like for the whole example:

  @minutes = map { $_ / 60 } @times;
  
  @longMinutes = grep { $_ >= 1 } @minutes;
  
  @sortedMinutes = sort { $a <=> $b } @longMinutes;

In the larger solution, we were able to combine some operations and reduce repeated assignments. We can do something similar with this approach. However, the result can be a little confusing without a little more information. Each of these operators expects a list as its right-most argument. (It's not really a single argument, but this simplification helps you think about this part.) Each of the operators also returns a list.

Putting these two ideas together shows us that these operators can be chained right to left. You put the operation you want to do first on the right and work backwards. Since this is somewhat awkward for most people, you might want to stop with the previous code. But, for the sake of the example, we'll continue. The final result would be

  @sortedMinutes = sort { $a <=> $b } # ascending order
                   grep { $_ >= 1 }   # discard short times
	           map  { $_ / 60 }   # convert to minutes
	           @times;

By putting one operation on each line, we make the overall operation a little easier to follow. To read it, you start at the far right, or in this case bottom. Take the @times list, transform the list from seconds to minutes, filter out any that are less than one, and sort the remainder from lowest to highest.

Not Just for Numbers

These operators are also useful for lists that contain things other than numbers. For example, let's assume you want the months of the year in full length, short form, and short form all caps. Just apply two map expressions and you are done:

  @Months = qw(January February March April May June July
               August September October November December);

  @ShortMonths = map { substr( $_, 0, 3 ) } @Months;

  @MonthsUC    = map { uc $_ } @ShortMonths;

The first two lines are not a typo. I used the quote words operator to build a list from a space-separated list of words.

One of my favorite tricks involves using a map expression to help fill a hash. The map operator evaluates its expression in list context. Consequently, the result of the operator does not have to have a one-to-one correspondence with the input of the operator. The map expression can evaluate to an empty list, one item, or a list of items. This means that the output list can be made bigger or smaller than the input list. In this case, the size of the list is doubled to generate a key => value pair for each input item. So, if we wanted a translation hash between uppercased short month names and full month names, the following code would build the translation:

  %ShortToMonth = map { ( uc substr( $_, 0, 3 ) => $_ ) } 
                  qw(January February March April May June July August
                     September October November December);

The great thing is that you don't have to type anything twice; the map does the magic for you. Without this trick, you would probably build something like this:

  %ShortToMonth = 
     (
      'JAN' => 'January',
      'FEB' => 'February',
      'MAR' => 'March',
      'APR' => 'April',
      'MAY' => 'May',
      'JUN' => 'June',
      'JUL' => 'July',
      'AUG' => 'August',
      'SEP' => 'September',
      'OCT' => 'October',
      'NOV' => 'November',
      'DEC' => 'December'
     );

User-defined List Operations

Perl has two features that make building your own list operations very easy. First, Perl passes all subroutine arguments as a single list. Any list operations can be applied to this list (@_) or any subset of this list. Second, Perl supports returning a list from a subroutine, as well. Creating a user-defined list operator uses both of these features. User-defined list operators can also be quite powerful for use with primitive lists.

For example, what if you wanted a routine to convert a list into another list that contains items from the first list without duplicates. This routine would change (1,2,5,2,3,1,2,1,4,1,4) into (1,2,5,3,4). Note that the order of the items is not changed. A routine to accomplish this could be defined as

sub  unique
 {
  my %seen = ();
  
  grep { !$seen{$_}++ } @_;
 }

In order to decide which items to remove from the list, we need to keep up with items we have already seen. The %seen hash is used to store the items as we see them. The input list is provided in the special variable @_. Since the purpose of our function is to remove elements that meet a certain criteria, the grep operator seems a natural choice.

Although using one of the built-in list operators in our user defined function may seem a little like cheating, nothing could be further from the truth. Most of user-written functions with numeric arguments use built-in number operators. The same applies to list functions.

The grep expression here is a little more subtle than what we've worked with before. Working from the inside, for each element ($_) we check the %seen hash. If the current value of this hash element is zero, we pass the list element through. Finally, we increment the value of the element, so that it will be greater than 0 next time. The list returned by grep is then returned by the function.

You may wonder if this little function is worth the effort. One of the best uses of a function is to hide the actual implementation of an operation. This works especially well if the algorithm you are hiding is messy or complicated. So let's look at how it would clean up a small section of code. Assume we have an application that would have a list of host names and we want to generate a hash mapping these host names to IP addresses. Assume also that we have a function get_host_address() that takes a host name and does whatever is needed to generate an IP address. Since this function may result in DNS requests, we really don't want to call it more often than necessary.

In the first example, we create a set of unique host names from the function arguments inside this function. The code to generate the translation hash occurs in the last line of the function. Everything else is the code to remove duplications.

  sub  get_host_ip_translation
   {
    my @hosts = @_;

    my %seen = ();

    my @uniqHosts = grep { !$seen{$_}++ } @hosts;

    return map { $_ => get_host_address( $_ ) } @uniqHosts;
   }

By using the unique() function, we can simplify this subroutine dramatically. This actually reads very much like the built-in list operator code we have already discussed.

  sub  get_host_ip_translation
   {
    return map { $_ => get_host_address( $_ ) } unique( @_ );
   }

As you become more comfortable with primitive lists, you will find other situations where problems can be more elegantly solved using this kind of notation.

Conclusion

Like anything in Perl, there's always more than one way to do it. But treating some lists as primitive lists instead of arrays of scalars can sometimes result in much more elegant, maintainable, and readable code.

Acknowledgments

I'd like thank Debbie Campbell and Burke Gallagher for ripping the first few drafts to shreds.