- perl n:
- 1. A computer programming language borrowing the best features from some of the best languages around 2. practical extraction and report language 3. programmer's eclectic rubbish lister
An all-important feature of every language, the comment in perl is just like the comments you are probably familiar with in shell scripting. Any # occurring in a perl script that is not part of a constant (such as a string) begins a comment that extends to the end of the current line.
A popular comment to put at the top of a perl file is the "shebang" comment, which looks like this:
#!/path/to/perl
On a *nix machine, this comment combined with the fact that the file is executable will cause it to be automatically processed by the perl interpreter if it is invoked as a command.
Variables in perl are not like variables in most compiled (and many interpreted!) programming languages. Perl variables have no intrinsic type, do not have to be defined beforehand, and their memory does not have to be freed when you are done with them. Variables can have any name that is a valid perl identifier, which is to say that it must start with a letter or underscore and may contain any of A-Z, a-z, 0-9 and _. Variable names in perl are case sensitive. There are three basic types of variables in perl, as well as references. We will discuss the three basic variable types presently, but references are a little out of the scope of this paper and will have only a cursory examination.
A scalar is a variable that holds a single value. That value can be a number, a letter, a sentence, or binary data. The perl interpreter does not care in any way what the value of a scalar is, so you can store whatever you like. A scalar is denoted by the character $. Here are some example scalar variables and some associated values:
$x = 4;
$pi = 3.1415927;
$quote = "There's more than one way to do it.";
$character = "g";
As you can see, perl follows the C convention of ending syntactic lines with a semicolon. A line break does not end a statement, so:
$x
=
3
;
is syntactically identical to:
$x = 3;
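To make the "perl does not care what a scalar holds" point concrete, here is a minimal sketch. The variable names are made up for the illustration, and the . operator (string concatenation) is not covered in the text above.

```perl
use strict;
use warnings;

my $n = "5";            # assigned as a string

my $sum    = $n + 3;    # + treats its operands as numbers: 8
my $joined = $n . 3;    # . treats its operands as strings: "53"

print "$sum $joined\n";
```

The same scalar is used as a number on one line and as a string on the next; the operator, not the variable, decides how the value is interpreted.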
This data type is interchangeably called an "array" or "list", and it behaves just like you might expect an array or list to behave. A nice thing about perl arrays is that you can treat them as an array on one line, indexing elements by number, and then treat them as a list on the next line by removing the first item as you would remove the head of a linked list. Perl hides all the array/list ugliness from you. An array can hold any number of scalar values, which are typically retrieved by referencing their numerical position in the array. An array is denoted by the character @. When you access a single element of an array, however, you use the $ character the same as you would for a scalar. The best way to explain this is with an example, so here are some array examples:
@list = ("hello", "world");
@foo = (); # The empty list
$array[0] = "hello";
$array[1] = "world";
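The array-on-one-line, list-on-the-next idea might look something like the following sketch. The array @langs and its contents are invented for the illustration; shift, push, and scalar are perl builtins.

```perl
use strict;
use warnings;

# @langs and the values in it are made up for this illustration.
my @langs = ("perl", "awk", "sed");

my $second = $langs[1];      # index it like an array: "awk"
my $head   = shift @langs;   # remove the head like a list: "perl"
push @langs, "sh";           # append to the end
my $count  = scalar @langs;  # number of elements left: 3

print "second=$second head=$head count=$count\n";
print "remaining: @langs\n";
```

Note that a single element is written $langs[1], while the whole array is @langs.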
Perl hashes are one of its most powerful constructs. A perl hash is basically an array whose elements are referenced by an arbitrary key rather than a number. The name hash comes from the way they are stored internally by the perl interpreter. Hashes also have the desirable property that it takes, on average, a constant amount of time to retrieve any value within the structure. (If you know what I'm talking about, you can be suitably impressed at this point.) A hash is denoted by the % character. As with arrays, a single element of a hash is accessed using the $ character. Unlike arrays, however, if a list is assigned to a hash it has some more structure than a simple list. Every other value is a key, and the remaining values are the values associated with their respective keys. This will make some more sense in the examples. In order to make large hash assignments more readable, perl has defined the => operator to be equivalent to , (with the added convenience that a bareword on its left is treated as a string). Some examples of hashes:
%empty = ();
%member = (name => "Ethan", office => "president");
$foo{bar} = "heh";
$foo{d} = "hamburger";
It is worth noting here that the key used to look up an element can be stored in a scalar. For instance:
$x = "bar";
$foo{$x} = "heh";
This will assign heh into $foo{bar} the same as in the above example.
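As a slightly larger sketch of working with a hash, the following adds a pair, looks one up through a scalar key, and walks over all the keys. The hash %office and the names in it are hypothetical; exists, keys, and sort are perl builtins not discussed above.

```perl
use strict;
use warnings;

# %office is a hypothetical hash invented for this example.
my %office = (Ethan => "president", Alice => "treasurer");

$office{Bob} = "secretary";          # add a key/value pair

my $who   = "Alice";
my $title = $office{$who};           # key held in a scalar: "treasurer"
my $has_carol = exists $office{Carol} ? "yes" : "no";

# keys returns the keys in no particular order, so sort them for stable output.
for my $name (sort keys %office) {
    print "$name is $office{$name}\n";
}
print "Carol listed? $has_carol\n";
```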
Reference variables perform the duty of pointers in C, and are most closely akin to C++ or Java references. They allow you to access a variable with a name other than its own. We will not discuss references at this time, as they are somewhat more complex than the other variable types.
We will leave more complex variable topics such as anonymous variables for some other time. Suffice to say that we have barely scratched the surface of the power of the perl builtin data types.
Also worth mentioning on the topic of variables in perl are the special variables. There are a whole slew of these, and they have cryptic names such as $_, $., and @_. We will discuss several of these in more detail later.
Perl variables are, by default, global. This is probably not what you want if you have been programming in a language that uses tighter scopes. Variable scoping in perl is very flexible, however, so we can make it behave in a sane fashion with ease.
The kind of local variable you want 99% of the time is a my variable in perl. my variables behave just like variables in C; they are local to the block (most often the function body) in which they are declared. To create a my variable, simply prepend its assignment with the keyword my. For example:
my $heh = "heh";
my @foo = ("bar", "baz", "d");
my $intro = "Four score and seven years ago";
The kind of local scoping you probably do not want is local variables. These variables are within scope not only in the current function, but in all functions called by this function. If you can't figure out why you would want this, you don't. There are, however, some very good reasons to use it; we won't talk about them here.
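If the difference between my and local is hard to picture, the following sketch may help. All of the names here ($level, show, with_local, with_my) are invented for the illustration.

```perl
use strict;
use warnings;

our $level = "global";       # a package (global) variable

sub show { return $level }   # reads whatever the global $level currently holds

sub with_local {
    local $level = "local";  # temporarily replaces the global's value
    return show();           # show() sees "local"
}

sub with_my {
    my $level = "my";        # a brand new variable, visible only in this block
    return show();           # show() still sees the global
}

my @seen = (with_local(), with_my(), show());
print "@seen\n";
```

The local value is visible inside the called function and is restored once with_local returns, while the my variable never leaks out of the block that declared it.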
Quoting in perl is most similar to that of many command shells. Double quotes are typically used to denote string constants, and allow variable interpolation. Unlike C, single quotes do not denote a character constant; they also reference a string, but without variable interpolation. For instance:
$x = 5;
print "There are $x apples in the box.\n";
print 'There are $x apples in the box.\n';
This code will produce the following result:
There are 5 apples in the box.
There are $x apples in the box.\n
This is important to remember, and tremendously handy. Typically you will want variable interpolation, and will thus use double quotes. Single quotes will allow you to easily use more sinister characters such as $, however, without worrying about what might happen to your beautiful string when variable substitution comes around.
Variables within a double-quoted string can be escaped with the \ character if necessary. For instance, \$x would equal the expected $x rather than 5 in the above example.
The final form of quoting we will discuss is q// and qq// quoting. This syntax allows you to choose quoting characters that are convenient for you, not the perl interpreter. If you wish to embed both " and ' characters in a string, the q// operators can be a lifesaver. q// acts as a single-quoted string, and qq// acts as a double-quoted string. The / character can be replaced by any symbol that is neither alphanumeric nor whitespace. If the chosen symbol is one of a pair (e.g. (), [], etc.) it is closed with the matching character. Some examples:
$x = qq(He's fond of saying "heh."\n);
$foo = q/The literal $foo is within the scalar foo/;
$bar = qq:":;
print $x;
print $foo, "\n";
print "$bar\n";
Executing this code will produce the following result:
He's fond of saying "heh."
The literal $foo is within the scalar foo
"
The perl builtin functions act more or less the way you would expect builtin functions to work, and their syntax is generally very similar to C. A typical function call takes the form of function(arg1, arg2, arg3), and thus should look very familiar to programmers of C, C++, Java, and many other languages.
A notable distinction about many perl functions is that they can also take the form of commands, where they have syntax more similar to shell commands that you might type at the prompt. For instance:
print("x = ", $x, "\n");
print "x = ", $x, "\n";
These two lines are equivalent, and both print the string "x = " followed by the value of $x.
You can define your own functions in perl using the keyword sub. The function body is defined between curly brackets ({}) as it would be in any other C-like language. The function can then be called by either prepending its name with the & character, or appending () to it:
&function;
function();
For instance, suppose that we find ourselves printing the string "Ethan is a wonderful and talented programmer.\n" a lot and want to save ourselves some keystrokes. We could define the function ethan_rules to do this for us:
sub ethan_rules {
    print "Ethan is a wonderful and talented programmer.\n";
}
We could later call this function by inserting &ethan_rules anywhere in our code.
Unlike many other interpreted languages, a perl function definition does not have to occur physically before its invocation in the source code. Perl precompiles its scripts before execution, so it is sufficient that the function occurs anywhere in the text file. I tend to put my function declarations at the end for clarity.
A function that can take no arguments is somewhat less than useful. Fortunately, perl allows us to pass arguments into our function in a simple manner, although one that may seem somewhat bizarre to programmers of other languages.
For this, we will need to talk about the special variable @_ and the shift operator.
@_ is used by perl (among other things) to pass a function its arguments. You will never see a perl function that specifies its arguments in the manner that you would see for a C function; since perl does not have strongly typed (or even practically typed at all!) variables, there is no point. Instead, perl populates the @_ array with the arguments given when a function was called. It is up to the programmer to get them out.
We have several methods at our disposal to get at these arguments. One is a simple assignment such as these:
my($arg1, $arg2) = @_;
my $arg1 = $_[0]; my $arg2 = $_[1];
Both of these assignments copy the first two arguments out of the @_ array and place them into $arg1 and $arg2. Note the use of my to avoid cluttering our global namespace.
The second, and sometimes preferred, way is to use the shift operator. shift pulls the first member out of the array specified and returns it. If no array is specified, it operates on @_. Two equivalent examples:
my ($arg1, $arg2) = (shift, shift);
my $arg1 = shift; my $arg2 = shift;
These two assignments give the same result as the above assignments that specifically reference the @_ array, except that shift also removes the arguments from @_ as it goes.
Functions in perl return values in the same way as functions in C, with a few exceptions. The first exception is that in perl every function returns a value. If you have no return statement in a function, the value returned will be the value of the last expression evaluated within the function.
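Putting argument passing and return values together, a function might be sketched as follows. The function area and its behavior of treating a missing second argument as a square are hypothetical, chosen only to show the mechanics.

```perl
use strict;
use warnings;

# area is a hypothetical helper invented for this example.
sub area {
    my ($width, $height) = @_;                  # copy the arguments out of @_
    $height = $width unless defined $height;    # only one argument: assume a square
    $width * $height;                           # no return statement: the last value is returned
}

my $square = area(3);
my $rect   = area(3, 4);
print "$square $rect\n";
```

Since perl does not check the number of arguments for us, the function itself decides what a missing argument means.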
When we said before that perl does not care about the value assigned to a scalar, that was a little bit misleading. There is one case where perl does care, and that is comparison. Consider this problem: Is 9 greater or smaller than 10? If we go by ASCII ordering, 9 is clearly larger, because the character 9 sorts after the character 1. If we go by numerical ordering, 10 is larger. What we need here is a way to syntactically differentiate between string comparisons and numeric comparisons. Fortunately, perl has just such a distinction.
When comparing numerical values, we use the == operator, just as we would in C. For instance, $x == 4 would return 1 only if $x is truly equal to the integer 4, just as expected. The equivalent comparison operator for strings is eq, and you would use it as $x eq "foo", which would return true only if $x contained the literal string foo. The commonly used C-style comparison operators and their string counterparts are:
| | Numeric | String |
|---|---|---|
| Equality | == | eq |
| Inequality | != | ne |
| Less than | < | lt |
| Greater than | > | gt |
| 3-way Comparison | <=> | cmp |
There are other operator pairs (such as <= and le), but I think you get the point. When in doubt, consult the documentation.
Remembering what comparison operator to use where is a little hard at first, but using the right operator can be critical.
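To see how much the choice of operator matters, here is a small sketch comparing the same values both ways. The values 9, 10, and 100 are picked purely for illustration, and sort is a perl builtin not covered above.

```perl
use strict;
use warnings;

my ($x, $y) = ("9", "10");

my $numeric = $x <  $y ? "smaller" : "larger";   # 9 < 10 as numbers
my $string  = $x lt $y ? "smaller" : "larger";   # "9" sorts after "1" as strings

print "numerically 9 is $numeric than 10\n";
print "as a string 9 is $string than 10\n";

# The three-way operators are what sort expects from its comparison block.
my @nums      = (10, 9, 100);
my @by_number = sort { $a <=> $b } @nums;
my @by_string = sort { $a cmp $b } @nums;
print "by number: @by_number\n";
print "by string: @by_string\n";
```

The same three values come back in a different order depending on whether <=> or cmp does the comparing.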
Perl control structures share many similarities with those of the C programming language, but they also have their share of quirks and features that C lacks. Here is a quick rundown of some of those structures.
The perl while loop is almost identical to the C while loop. It has the syntax of:
while(EXPRESSION) {
    BODY
}
This while loop will continue to execute the instructions in its body as long as the conditional expression remains true. It can also take the form of:
do {
    BODY
} while(EXPRESSION);
The difference here is that even if EXPRESSION is false, it will execute BODY at least once.
The for loop has two common forms in perl. The first of these forms behaves just like the C for loop, much as the while loop behaves like its C counterpart. For example:
for($i = 0; $i < 10; $i++) {
print "$i\n";
}
The second form of for loop is more perl-specific, and can be terribly useful. It is intended for the purpose of iterating over a list, and can save you a lot of time if that is what you need to do. It looks like this:
for $element (@array) {
    print "$element\n";
}
This snippet of code will print out each element in @array. The basic concept here is that the scalar specified will be filled with each element of the array between the parentheses in turn, and can be referenced as such in the for loop body.
Perl also provides the foreach keyword, which is syntactically identical to the for keyword in the latter example, but can make your code easier to read.
The if statement, in its most basic form, is also identical to the C if statement. The interesting thing about the perl if statement is that it can also occur after its conditional in many cases. Yet again, this is best shown by example:
if($x > 0) {
    print "heh\n";
}
print "heh\n" if $x > 0;
As you can see, the latter syntax is more concise in many instances, but can be less readable. As with many things in perl, you must weigh readability versus brevity when using this construct. Remember also that in some cases the shorter code might be more readable despite its apparent clutter.
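The loop forms and the trailing if can be combined in a short sketch like this one. The array @readings holds made-up sample data, and push is a perl builtin not discussed above.

```perl
use strict;
use warnings;

my @readings = (3, -1, 4, -5, 9);   # made-up sample data
my $total = 0;
my @squares;

foreach my $r (@readings) {
    $total += $r if $r > 0;          # trailing if: add only the positive values
}

for (my $i = 1; $i <= 3; $i++) {     # C-style for loop
    push @squares, $i * $i;
}

print "total=$total squares=@squares\n";
```

Here the trailing if keeps the loop body to a single readable line, which is the sort of case where the shorter form tends to win.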
Files in perl are pretty much like files in every other programming language, except for the inherent coolness of the fact that they are perl files.
Files are opened with the open function and closed with the close function, just as veteran programmers would expect. The syntax seems a little odd at first, as there are no commas in places that you would expect them. You'll see that in just a bit.
Filehandles are a special kind of variable that has no delimiter (the $, %, and @s that we saw before). For this reason, they are typically written with all caps to set them apart. By inserting these filehandles in the appropriate places, you can make perl write to the file referenced.
Here is an example of opening a file, reading a line from it, and closing it again:
open(FILE, "data.txt") or die "Couldn't open data.txt";
$x = <FILE>;
print $x;
close(FILE);
This example illustrates several new topics. The first is, of course, the open command itself, which should be fairly obvious. What follows it is a common perl trick that bears some explaining. Perl uses (as C does) "short-circuit" evaluation for boolean conditions, which means that as soon as the value of the entire expression can be determined, perl moves on. In this case, open returns a true value if it successfully opens its file and a false value if it cannot; therefore if open succeeds in this case, the die command will never execute. die does exactly what you might think it would do from this example; it exits the perl interpreter after displaying the given error message.
The next interesting topic in this snippet is the <FILE> construct. This retrieves one line of text from FILE and returns it, in this case assigning it to $x. The terminating newline is returned, as well. The final close should require no explanation at this point. ;-)
We are now at an opportune point to talk about another one of those "special variables" I mentioned before, $_. $_ is what I like to think of as a "magic" value in perl; many operations put their result into $_, and many functions operate on $_ if you do not specify otherwise. For instance, the above snippet of code can be written entirely without $x:
open(FILE, "data.txt") or die "Couldn't open data.txt";
$_ = <FILE>;
print;
close(FILE);
print with no arguments prints $_, so this code is functionally identical to the previous example. Note that a bare <FILE>; statement would read a line and throw it away; perl assigns the line to $_ automatically only when <FILE> is the sole condition of a while loop, as we will see in a moment. In those cases the variable $_ never actually appears in the code, yet it is being used behind the scenes to make our lives easier. Seldom will you have to actually use this variable in your code; for the most part it will be "understood" by the perl interpreter.
This can be used in many, many useful and interesting ways. As an example, this code will read every line of a file one at a time and print it to the screen:
while(<FILE>) {
    print;
}
Also demonstrated here is a niftiness of the <> operator; when called repeatedly, it returns successive lines of the file until there are no more. When there are no more, it returns a false value (undef).
Files can also be opened for writing, as you might expect. The syntax to do so is a little different from what you are probably used to seeing, but not unintuitive. Instead of using:
open(FILEHANDLE, "filename");
We would use:
open(FILEHANDLE, ">filename");
In order to actually write to this file, we can use the same print statement that we have been using to write to the screen. For instance:
print FILEHANDLE "Yay for the camel\n";
Notice that there is no comma between FILEHANDLE and the string to be printed; this is because barewords are technically legal string constants in perl, but two consecutive barewords are not. Therefore, if perl sees this construction it knows that FILEHANDLE should be written to if it is the name of a filehandle, as it is a syntax error otherwise.
Appending to a file is similar to opening for writing, except that >> is used instead of >.
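The whole write, append, and read-back cycle might be sketched like this. The file name camel_demo.txt is a hypothetical scratch file that the script deletes when it is done; $! (the system error message), chomp, push, and unlink are perl builtins not discussed above.

```perl
use strict;
use warnings;

# "camel_demo.txt" is a hypothetical scratch file name; it is deleted at the end.
my $file = "camel_demo.txt";

open(OUT, ">$file") or die "Couldn't open $file for writing: $!";
print OUT "Yay for the camel\n";
close(OUT);

open(OUT, ">>$file") or die "Couldn't open $file for appending: $!";
print OUT "Yay again\n";
close(OUT);

my @lines;
open(IN, $file) or die "Couldn't open $file for reading: $!";
while (<IN>) {
    chomp;              # strip the trailing newline from $_
    push @lines, $_;
}
close(IN);
unlink $file;           # clean up the scratch file

print scalar(@lines), " lines: @lines\n";
```

Newer perl code often prefers the three-argument form of open with a lexical filehandle, as in open(my $fh, "<", $file), but the two-argument form is used here to match the text above.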
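For the interested, there is also a special filehandle named merely _ (in the spirit of $_ and @_). It refers to the file most recently examined by stat or a file test operator, and is not used for reading or writing.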
Regular expressions are one of the most useful and powerful features of perl, but they can be somewhat difficult to understand. A full explanation of regular expressions is well beyond the scope of this document, but I will try to provide some simple explanations. The perlre manpage explains perl regular expressions in detail, and is certainly worth your while if you intend to do much perl programming.
If you have ever used filename globbing (e.g. * and ? at the prompt), you have used something similar to regular expressions. Regular expressions allow you, among other things, to define a complex set of strings with one expression. You will see just how useful this can be as we progress.
The most basic unit of a regular expression is the assertion. An assertion is merely a character, with the exception that the zero-width assertion is the absence of a character. For our purpose this is irrelevant, so we will consider the basic unit of a regular expression the character.
Regular expressions are commonly used for pattern matching, or determining if a particular constant string matches a defined pattern. For this reason, we will consider most of the elements of a regular expression in light of what characters they "match" in such a situation.
Most characters match themselves in a regular expression; certainly all alphanumeric characters do. For instance, the perfectly valid regular expression heh matches exactly one thing: The word "heh". As you can imagine, however, this is not particularly useful or interesting; a similar effect can be easily achieved in C using strcmp().
Where regular expressions start to get interesting is when you add assertions that can match different characters, or a multiple of characters, or even a multiple of different characters. A simple example of this is the special character .. That's right, the period has special meaning to a regular expression, in which we call it 'dot'. The dot matches any one character (other than a newline, by default); for instance, the regular expression h.h matches the constant strings "heh", "hah", "huh", and even "hzh" or "h$h". A literal period may be matched by the assertion \..
Aside from this all-powerful universal match, perl defines many lesser categories to be matched called character classes. A character class could be whitespace, for instance, or any number. These predefined character classes typically take the form of \x, where x is one single alphanumeric character. Here is a small subset of the more useful predefined character classes:
| Symbol | Meaning | Matches |
|---|---|---|
| \s | Whitespace | Any whitespace character: Space, tab, newline, etc. |
| \w | Word character | Any of A-Z, a-z, 0-9, and _ |
| \d | Digit | Any of 0-9 |
Most of these classes can be "reversed" by capitalizing the letter; for instance, any non-whitespace character is matched by \S. As with the comparison operators, there are enough of these that I suggest you consult the documentation if you think there should be a predefined character class for what you want to do. There probably is.
The most useful character class is one that you can define yourself. Perl provides a simple mechanism for you to do so, by way of the [] operator. In the most simple sense, any group of characters enclosed in []'s matches any one of those individual characters. Additionally, ranges of consecutive characters can be specified as m-n, where m is the first character you wish to match and n is the last character you wish to match. For instance, [a-z] matches any lowercase letter, and [1-5] matches any of the digits 1, 2, 3, 4, or 5. To match a literal -, make sure it is the last character within the square brackets, as in [a-zA-Z0-9-]. \x predefined classes may also be included within the [] operator.
The [] operator can also be reversed by including a circumflex (^) as the first character in the class. For instance, [^a-z] matches anything but the lowercase letters. To include a literal ^, ensure that it is not the first character in the set.
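Here is a rough sketch of predefined and hand-made character classes at work. It uses the =~ matching operator, which is described a little further on, and the sample strings are invented for the illustration.

```perl
use strict;
use warnings;

# The sample strings are invented for this illustration.
my @samples = ("x9", "a-b", "HEH", "two words");

my %result;
for my $s (@samples) {
    my @hits;
    push @hits, "digit"      if $s =~ /\d/;        # contains one of 0-9
    push @hits, "whitespace" if $s =~ /\s/;        # contains a space, tab, newline...
    push @hits, "lowercase"  if $s =~ /[a-z]/;     # contains a lowercase letter
    push @hits, "other"      if $s =~ /[^\w\s]/;   # neither a word character nor whitespace
    $result{$s} = "@hits";
    print "'$s' has: $result{$s}\n";
}
```

Note that "HEH" matches none of these classes: it has no digit, no whitespace, no lowercase letter, and all of its characters are word characters.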
Symbols can be grouped in regular expressions by using the () operator. This works just how you would expect it to, with the addition that any matched group can later be referenced as $n, where n is the ordinal number of the opening parenthesis you wish to match. This is called a backreference, and bears an example:
([HB]([ea])h)!, I say\.
Suppose that we apply this regular expression to the string "Bah!, I say." In this case, $1 (the outer parentheses) will be set to "Bah", and $2 (the inner parentheses) will be set to "a". For the string "Heh!, I say.", we would get $1 = "Heh" and $2 = "e".
Repetition is an important part of regular expressions, and therefore there are some powerful repetition operators. The simplest of these are *, +, and ?, which mean respectively to match zero or more times, one or more times, and zero or one times. Repetition operators modify the immediately preceding character or group only, not the whole regular expression. To clarify: Suppose that we have the regular expression he*h. This expression will match any of "hh", "heh", "heeh", and "heeeeeeeeeeeeeeh". he+h will match any of these but "hh", as it must find at least one e to return true. he?h will match "hh" and "heh" only, as it must find exactly zero or one e.
The complex repetition operator is {m,n}. This operator matches at least m repetitions, but no more than n repetitions. Each of the other repetition operators can be built with this operator, as {0,} (no specification for n equals infinity), {1,}, and {0,1}.
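The repetition operators can be tried out with a sketch like the following, which uses the =~ matching operator described just below. The word list is made up, and the ^ and $ anchors (which tie the pattern to the start and end of the string) are not covered in this document.

```perl
use strict;
use warnings;

my @words = ("hh", "heh", "heeh", "heeeh");   # made-up test words
my (@star, @plus, @quest, @range);

for my $w (@words) {
    # ^ and $ anchor the pattern to the whole string.
    push @star,  $w if $w =~ /^he*h$/;        # zero or more e's
    push @plus,  $w if $w =~ /^he+h$/;        # one or more e's
    push @quest, $w if $w =~ /^he?h$/;        # zero or one e
    push @range, $w if $w =~ /^he{2,3}h$/;    # two or three e's
}

print "star:  @star\n";
print "plus:  @plus\n";
print "quest: @quest\n";
print "range: @range\n";
```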
All of this "matching" we have been talking about is accomplished with the m// operator, which works like the q// family in that the / characters can be replaced by convenient characters. If you wish to use /, the m is understood and can be left out. To match a regular expression against a variable, the =~ operator is used. For instance:
$fullname =~ /(\w+), (\w+) (\w)\./;
A regular expression like this might be used to extract the portions of a name from a form. Suppose that $fullname eq "Blanton, Ethan L.". This regular expression would return a true value, as "Blanton, Ethan L." is something it can match, and $1 would equal "Blanton", $2 would equal "Ethan", and $3 would equal "L". Note that each + sits inside the parentheses; written as (\w)+, the group would capture only the last letter of each word. If a string that does not match were tried, such as "Ethan L. Blanton", the expression would return a false value and $1, $2, and $3 would not be set by it (they keep whatever values the last successful match left in them). This true/false behavior can be exploited for situations such as this:
exit if $command =~ /quit/;
I think you can see why this is useful.
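A sketch of the name-extraction idea as a complete script follows. The variable names $last, $first, and $initial are invented here, and the + is placed inside each pair of parentheses so that a group captures the whole word rather than a single letter.

```perl
use strict;
use warnings;

my $fullname = "Blanton, Ethan L.";
my ($last, $first, $initial);

if ($fullname =~ /(\w+), (\w+) (\w)\./) {
    # Copy $1, $2 and $3 right away; the next successful match overwrites them.
    ($last, $first, $initial) = ($1, $2, $3);
    print "$first $initial. $last\n";
} else {
    print "no match\n";
}
```

Testing the result of the match before touching $1 and friends is a good habit, since a failed match leaves them holding values from some earlier match.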
While there are many other useful regular expression operators, the only other one we will discuss is s///. The s/// operator is used to replace the text matched by a regular expression with something else. Whatever is matched by the regular expression between the first two /'s will be replaced with the string between the second two, which is interpolated like a double-quoted string. In the simplest example, you might do something like:
$sentence =~ s/Ethan/elb/g;
The g switch tells s/// to do its thing as many times as possible; in our example, it would replace every occurrence of "Ethan" in $sentence with "elb".
More complex replacements can be made by utilizing the wildcard and backreference operators to do cool things:
$foo =~ s/(.*)/heh, $1\n/;
Suppose that $foo eq "Perl is fun!" going into this expression. After evaluation, $foo will contain "heh, Perl is fun!\n". This is a useful script to bind your X-Chat "/say" command to, so you can avoid having to type 'heh' all the time.
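To see the effect of the g switch and of backreferences in a replacement, consider this sketch. The sample sentence and date string are made up for the illustration.

```perl
use strict;
use warnings;

my $sentence = "Ethan wrote this. Ethan likes perl.";   # made-up sample text

(my $once = $sentence) =~ s/Ethan/elb/;    # copy, then replace the first occurrence only
(my $all  = $sentence) =~ s/Ethan/elb/g;   # copy, then replace every occurrence

my $date = "2001-09-30";                   # hypothetical date string
$date =~ s/(\d+)-(\d+)-(\d+)/$2\/$3\/$1/;  # reorder the captured pieces

print "$once\n$all\n$date\n";
```

Because s/// changes the variable it is applied to, the first two lines copy $sentence into a new variable before substituting, leaving the original untouched.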
If you have never used regular expressions before, I'm sure they still make no sense at this point. The best (and practically only!) way to learn to use regular expressions effectively is to write some and see how they work. Remember, man perlre is your friend.