Primitive Lists in Perl
When most programmers think about a programming language, they think about its syntax and maybe its libraries. However, a language is more than just the syntax you need to make the compiler happy. Each programming language develops a set of idioms, as well as a culture. This essay explores the idiom of primitive lists. When used correctly, this idiom can improve the readability, maintainability, and sometimes even the efficiency of your code.
What Is a List?
Most languages have some form of array or list data type. Usually languages treat these lists as second-class citizens, at best. Perl lists actually combine the best of arrays and lists into one type. People coming to Perl with a background in languages like C, C++, or Java start off by seeing Perl lists as the same sort of arrays they are familiar with. But Perl lists are actually much more.
According to Programming Perl, Third Edition, a list is "an
ordered set of scalars." Notice that this definition says nothing about the
types of the scalars. In a language like C, C++, or Java, an array must contain
elements of the same type. You can have an array of int
s, but your
array could not contain multiple types.1 A
single Perl list, on the other hand, can contain numbers, strings, and
references all at the same time. Perl lists are also not static in size. You
can add and remove items from the list at will.
So, Perl lists can be homogeneous -- all elements are the same type -- or heterogeneous -- elements are different types. Lists can be almost any size, from empty to completely filling memory. Lists can remain static or can change in size. In a way, we can say that traditional arrays are a subset of the behavior of lists. After a little exposure to Perl lists, most programmers learn to use some of the enhanced functionality.
Lists Are a Primitive Data Type
A Perl list is actually a primitive data type just like a number or string. In order for a data type to be primitive, it must have some special support from the language. An integer is a good example of a primitive data type. Most languages support some form of integer. The special support provided by a language for a primitive data type includes
- Variables for storage
- Easy creation/initialization
- Use as a literal
- Special operators
Since almost every data type in a language has some form of variables for storage, that support is not unique to primitive data types. Since support for variables doesn't gain us much, we'll ignore that issue. However, the other support is usually only provided to primitives. So let's look at those features in Perl that support lists as a primitive data type.
Creation and Initialization
Probably the most basic language support for primitive data types is creation. Perl supports many ways of creating lists, both statically and dynamically.
In addition to special operators like split
and regular
expressions that create lists from strings, Perl supports push
and
unshift
operators for adding items to either end of a list
variable. Of course, assignment is useful for initializing a list variable as
well. One very useful mechanism for creating lists is the quote words
operator (qw//
). This allows you to easily declare a list of
single word items separated by whitespace. For example, the expression
qw/Sun Mon Tue Wed Thu Fri Sat/
returns a list of abbreviations
for the days of the week.
Literals
Another characteristic of primitive data types is the ability to create them on-the-fly, or as a literal. For example, arrays in C can only be created as an array variable. This variable may be initialized with a list in curly braces. Unfortunately, this is the only place in the language that you can use a literal list. On the other hand, in Perl we can define literal lists wherever we need them. To create a list of items, you can use the values in parentheses:
@arr = ( 1, 2, 3, 10, 20 );
Or you can use square brackets to generate a reference to an anonymous list:
$ref = [ 1, 2, 3, 10, 20 ];
However, in Perl these expressions are not just for initializing variables. You can use an anonymous list almost anywhere a list is allowed. For example,
foreach my $prime ((2, 3, 5, 7, 11, 13, 17, 19)) { # do something }
In this case, both sets of parentheses are required. The inner set creates
the list. The outer set is part of the syntax of the foreach
.
In fact, Perl even has the ability to assign to a list in a way that makes
sense. Assigning to a list variable works as expected. But an anonymous list on
the left side of the =
is legal provided all of its elements can be
assigned to. This allows us to generate interesting expressions like:
($x, $y) = ($y, $x); # swap the values in these variables
and
($id, $pass, $uid, $gid) = split; # split line and set variables
Expressions such as these can often simplify a piece of code.
One piece of confusion about lists in Perl involves array variables. An
array, or variable starting with @
, may contain a list. And, when
the whole array (or an array slice) is used it returns a list. But an array
variable is not a list any more than a scalar variable is a number. A scalar
variable may contain a number (or a string, or a reference) but that
is slightly different than being a number. Early in your use of arrays and lists
in Perl, this distinction will not make any difference. As you begin to use
lists as primitives, this distinction begins to clarify what you can and can't
do with lists.
Operators
In order to be treated as a primitive data type, a list must be easy to create and have a number of operators for working with that data type. The following is a list of Perl operators for lists:
( )
,[ ]
,qw//
, and..
- create listspush
,unshift
- add to a listpop
,shift
- extract from a listsplice
- add, extract, and replace elements of a list at any positionmap
,grep
,sort
,reverse
- mutate or transform the list elementssplit
,m//
- convert strings into listsjoin
- stringize a listforeach
- list walkingprint
- output a list=
- assignment[]
- subscript operator- array slice
The subscript operator ([]
) and the splice
operator
are usually used when set of data is to be treated as an array. An array slice
is another useful operation where part of a list is treated as a separate list.
Kinds of Lists
As stated above, Perl lists can be homogeneous or heterogeneous. But, not all
lists are created equal. Some of the operators listed above work more naturally
with some kinds of lists than with others. For instance, it probably never
makes sense to sort the list created from the fields in a web server log. Each
field has a particular position and type. No one would care if the User Agent
field sorts before or after the URL. Likewise, you would probably be hard
pressed to find a use for map
when processing a list consisting of
the parts of a mailing address. Very few expressions could be usefully applied
to a person's name, street address, and state of residence. However, both of
these lists would often be accessed by subscript. The position of the item in
the list is as important as its value.
Some list operators work fine with all lists, but some do not. For instance,
the creation operators, join
, split
, and
print
tend to be useful anytime you want to create lists or convert
them to or from string form. However, you may find that some lists are more
suited to the use of push
, pop
, shift
,
and unshift
than the subscript operator, []
.
There is a subset of homogeneous lists in Perl that is particularly useful. I
will call this type of list a primitive list. Some of Perl's list
operators are very effective when used on primitive lists. These operators are
map
, grep
, and sort
.
Is It a Primitive List?
Depending on how it is used, a Perl list can either be an array of elements or a primitive list. The emphasis here is on how the list is used. If the elements are the important thing, it is not a primitive list. If the set of elements is more important than the individual elements, you probably have a primitive list. A list may be primitive if
- The list elements are all the same type,
- All items are treated in the same way,
- The number of items is not significant, and
- The order of the items is not significant.
Some examples of primitive lists are:
- A list of names: ('Fred', 'Barney', 'Larry', 'Dilbert')
- A list of filenames: ('index.html', 'perl.html', 'style.css')
- A list of symbols: ('IBM', 'T', 'HPQ', 'C')
- A list of tokens: ('$a', '=', 'time', '+', '4')
- A range of values: (1 .. 100)
Some lists that seem to meet these criteria may not be primitive. If any of these criteria are violated the list is definitely not primitive. In other words, a list is not primitive if
- The list elements are fundamentally different types,
- Particular items in the list are treated differently than others, or
- The number of items is significant.
Some examples of non-primitive lists are:
- (1, "name", \&func )
- fields from a web server log
- corners of a rectangle
- (x, y, z) coordinates
Why Do We Care?
Most Perl lists are useful. However, primitive lists can be even more useful.
In fact, Perl provides some very powerful list operators that work particularly
well with primitive lists. These operators are map
,
grep
, and sort
.
These operators are sometimes cumbersome and non-intuitive when used with non-primitive lists. However, with primitive lists, these operators really shine. By learning to recognize the difference between primitive lists and non-primitive lists, you will find it easier to decide when to use these operators effectively. Let's look at some examples.
A map Example
What happens when you need to perform the same operation on every item in a list? Most people coming to Perl from another language would tend to treat the list as an array when solving this kind of problem.
Assume we have performed an experiment and that some of the data we have collected is a list of elapsed times in seconds. If we wanted to convert these time values into minutes, we need to divide every element in the list by 60. If you are relatively new to Perl, you might code something like the following:
my @minutes = (); for(my $i = 0;@times > $i;++$i) { $minutes[$i] = $times[$i] / 60; }
There's nothing particularly wrong with this approach, but there are a few nagging oddities:
- Why do we need a new variable
$i
in order to modify the value of@times
? - Why is more code time spent manipulating this extra variable than is spent on the problem we are trying to solve? (After stripping unnecessary whitespace, 2/3 of the characters in this solution are devoted to this manipulation.)
- It is less obvious that we are transforming the entire array as opposed to just manipulating some elements. You must read the code to be sure. Granted, on a piece of code this small, that doesn't take long to read. But, in a large program, it is still time wasted.
Someone with a little more experience in Perl might code this as:
my @minutes = (); foreach my $t (@times) { push @minutes, $t / 60; }
This is a little better. Although there is still a new temporary variable,
much less code time is spent manipulating it. The syntax still focuses on
the elements in order, even though this is not important in the current
operation. It is still not obvious that the @minutes
list is the
result of this computation without looking carefully at the code. Once again,
in a small piece of code, that is not too hard. In a larger program, inspection
could be a significant problem.
On the other hand, we can use the map
operator to do this in a
more obvious way:
@minutes = map { $_ / 60 } @times;
If you haven't used map
very much, this may look a little
foreign. This approach has several features to recommend it:
- The only new variable is the default
$_
variable. It's fairly obvious, at a glance, that the variable itself is not important. - It is obvious that we are manipulating the array as a whole.
- The order of the elements is not emphasized in this approach.
- There is minimal extraneous syntax.
- It is obvious, at a glance, that
@minutes
contains the result of this operation.
As you can see, this approach can generate clearer code in some circumstances.
A grep Example
Another operation that you might want to perform on a list is to filter out data that does not meet some criteria. Once again, with only a little Perl experience, a programmer is likely to treat the list as an array.
Using the same @minutes
list from the previous example, let's
discard any data that is less than 1 minute in duration.
Someone with a strong background in another language might code something like this:
@longMinutes = (); for(my $i = 0; @minutes > $i;++$i) { if($minutes[$i] >= 1) { push @longMinutes, $minutes[$i]; } }
As with the map
example, there's a lot of extra syntax and an
extra variable that is needed just to hold temporary information. You might be
tempted to try to walk the array and delete items from the middle as you find
them. This is actually a lot harder than it appears and will result in more
code.
Of course, we could also use the foreach my $t (@times)
approach
here, as well. Instead, let's look at the equivalent solution using
grep
:
@longMinutes = grep { $_ >= 1 } @minutes;
Once again, the intent of the code is not obscured by a lot of extraneous
syntax. At least, the intent is clear once you know how to read a
grep
expression. Basically, grep
aliases each element
in the list to $_
and performs the test in curly braces on it.
Any element that passes the test is passed to the new list. Any that fail are
not in the new list.
A sort Example
Another operation that you might want to perform on a list is sorting. Since
sorting is something that is usually done very badly when done by hand, most
people coming from another language expect to find some kind of function for
sorting. Sure enough, Perl offers the sort
operator.
Now let's assume that we want to change the @longMinutes
list
so that it is sorted numerically from lowest to highest.
The appropriate Perl code for this would be
@sortedMinutes = sort { $a <=> $b } @longMinutes;
Pairs of elements are aliased to the variables $a
and
$b
. The code in the braces is executed and the result is
interpreted as follows:
- result < 0 - $a is before $b
- result == 0 - $a equal to $b
- result > 0 - $a is after $b
Combined Example
Each of the examples above showed one part of a larger problem. Each of the individual pieces of code could be made simpler through the use of Perl's list operators. But most programs are bigger than these examples. What would happen in a larger example?
Let's say that you wanted to do everything from the last few examples. If you don't use the list operators, the code would look something like this:
my @minutes = (); for(my $i = 0;@times > $i;++$i) { $minutes[$i] = $times[$i] / 60; } @longMinutes = (); for(my $i = 0; @minutes > $i;++$i) { if($minutes[$i] >= 1) { push @longMinutes, $minutes[$i]; } } @sortedMinutes = sort { $a <=> $b } @longMinutes;
After combining the loops, removing unneeded temporaries, and a little cleanup, the code looks like this:
@longMinutes = (); for(my $i = 0;@times > $i;++$i) { my $minute = $times[$i] / 60; if($minute >= 1) { push @longMinutes, $minute; } } @sortedMinutes = sort { $a <=> $b } @longMinutes;
If you prefer the foreach my $t (@times)
version, it would look
like this:
@longMinutes = (); foreach my $t (@times) { my $minute = $t / 60; if($minute >= 1) { push @longMinutes, $minute; } } @sortedMinutes = sort { $a <=> $b } @longMinutes;
This code is not too bad. But the list operators seemed to clean the individual stages up quite a bit. So let's look at what they would look like for the whole example:
@minutes = map { $_ / 60 } @times; @longMinutes = grep { $_ >= 1 } @minutes; @sortedMinutes = sort { $a <=> $b } @longMinutes;
In the larger solution, we were able to combine some operations and reduce repeated assignments. We can do something similar with this approach. However, the result can be a little confusing without a little more information. Each of these operators expects a list as its right-most argument. (It's not really a single argument, but this simplification helps you think about this part.) Each of the operators also returns a list.
Putting these two ideas together shows us that these operators can be chained right to left. You put the operation you want to do first on the right and work backwards. Since this is somewhat awkward for most people, you might want to stop with the previous code. But, for the sake of the example, we'll continue. The final result would be
@sortedMinutes = sort { $a <=> $b } # ascending order grep { $_ >= 1 } # discard short times map { $_ / 60 } # convert to minutes @times;
By putting one operation on each line, we make the overall operation a little
easier to follow. To read it, you start at the far right, or in this case
bottom. Take the @times
list, transform the list from seconds to
minutes, filter out any that are less than one, and sort the remainder from
lowest to highest.
Not Just for Numbers
These operators are also useful for lists that contain things other than
numbers. For example, let's assume you want the months of the year in full
length, short form, and short form all caps. Just apply two map
expressions and you are done:
@Months = qw(January February March April May June July August September October November December); @ShortMonths = map { substr( $_, 0, 3 ) } @Months; @MonthsUC = map { uc $_ } @ShortMonths;
The first two lines are not a typo. I used the quote words operator to build a list from a space-separated list of words.
One of my favorite tricks involves using a map
expression to
help fill a hash. The map
operator evaluates its expression in
list context. Consequently, the result of the operator does not have
to have a one-to-one correspondence with the input of the operator. The
map
expression can evaluate to an empty list, one item, or a list
of items. This means that the output list can be made bigger or smaller than
the input list. In this case, the size of the list is doubled to generate a
key => value pair for each input item. So, if we wanted a translation
hash between uppercased short month names and full month names, the following
code would build the translation:
%ShortToMonth = map { ( uc substr( $_, 0, 3 ) => $_ ) } qw(January February March April May June July August September October November December);
The great thing is that you don't have to type anything twice; the
map
does the magic for you. Without this trick, you would probably
build something like this:
%ShortToMonth = ( 'JAN' => 'January', 'FEB' => 'February', 'MAR' => 'March', 'APR' => 'April', 'MAY' => 'May', 'JUN' => 'June', 'JUL' => 'July', 'AUG' => 'August', 'SEP' => 'September', 'OCT' => 'October', 'NOV' => 'November', 'DEC' => 'December' );
User-defined List Operations
Perl has two features that make building your own list operations very easy.
First, Perl passes all subroutine arguments as a single list. Any list
operations can be applied to this list (@_
) or any subset of this
list. Second, Perl supports returning a list from a subroutine, as well.
Creating a user-defined list operator uses both of these features. User-defined
list operators can also be quite powerful for use with primitive lists.
For example, what if you wanted a routine to convert a list into another list that contains items from the first list without duplicates. This routine would change (1,2,5,2,3,1,2,1,4,1,4) into (1,2,5,3,4). Note that the order of the items is not changed. A routine to accomplish this could be defined as
sub unique { my %seen = (); grep { !$seen{$_}++ } @_; }
In order to decide which items to remove from the list, we need to keep up
with items we have already seen. The %seen
hash is used to store
the items as we see them. The input list is provided in the special variable
@_
. Since the purpose of our function is to remove elements that
meet a certain criteria, the grep
operator seems a natural
choice.
Although using one of the built-in list operators in our user defined function may seem a little like cheating, nothing could be further from the truth. Most of user-written functions with numeric arguments use built-in number operators. The same applies to list functions.
The grep
expression here is a little more subtle than what we've
worked with before. Working from the inside, for each element ($_
)
we check the %seen
hash. If the current value of this hash element
is zero, we pass the list element through. Finally, we increment the value of
the element, so that it will be greater than 0 next time. The list returned by
grep
is then returned by the function.
You may wonder if this little function is worth the effort. One of the best
uses of a function is to hide the actual implementation of an operation. This
works especially well if the algorithm you are hiding is messy or complicated.
So let's look at how it would clean up a small section of code. Assume we have
an application that would have a list of host names and we want to generate a
hash mapping these host names to IP addresses. Assume also that we have a
function get_host_address()
that takes a host name and does
whatever is needed to generate an IP address. Since this function may result
in DNS requests, we really don't want to call it more often than necessary.
In the first example, we create a set of unique host names from the function arguments inside this function. The code to generate the translation hash occurs in the last line of the function. Everything else is the code to remove duplications.
sub get_host_ip_translation { my @hosts = @_; my %seen = (); my @uniqHosts = grep { !$seen{$_}++ } @hosts; return map { $_ => get_host_address( $_ ) } @uniqHosts; }
By using the unique()
function, we can simplify this subroutine
dramatically. This actually reads very much like the built-in list operator code
we have already discussed.
sub get_host_ip_translation { return map { $_ => get_host_address( $_ ) } unique( @_ ); }
As you become more comfortable with primitive lists, you will find other situations where problems can be more elegantly solved using this kind of notation.
Conclusion
Like anything in Perl, there's always more than one way to do it. But treating some lists as primitive lists instead of arrays of scalars can sometimes result in much more elegant, maintainable, and readable code.
Acknowledgments
I'd like thank Debbie Campbell and Burke Gallagher for ripping the first few drafts to shreds.