Perl and Multiple Line Ending Characters

Perl uses \n (the linefeed) as its default end-of-line character (record separator). You can change this with the -0 option on the command line to be \r (carriage return), \r\n (carriage return linefeed pair) or something else. For example, this command sets the record separator to \r before replacing every occurrence of the string foo with the string bar:

$ perl -015 -pi -e 's/foo/bar/g' test.html
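
Inside a script, the same effect comes from assigning to `$/` directly. The following is only a minimal sketch and not part of the original post; the sub name `read_records` and the in-memory sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read all records from a string using the given separator.
sub read_records {
    my ($data, $separator) = @_;
    local $/ = $separator;            # record separator, restored on scope exit
    open my $fh, '<', \$data or die "Can't open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    chomp @records;                   # chomp strips the current value of $/
    return @records;
}

# Classic Mac text: lines end in a bare carriage return.
my @lines = read_records("foo one\rfoo two\r", "\r");
s/foo/bar/g for @lines;
print "$_\n" for @lines;
```

Using `local` keeps the change from leaking into other code that reads files with the default separator.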

However, my files are a weird mix of Unix, Mac, and Windows conventions. A few files may even use several line ending conventions in one file. Most modern text editors can autodetect and deal with this without any problem, as can XML parsers. However, as near as I can figure, Perl cannot. It expects me to know in advance what kind of file I’m feeding it.

Is there any simple way around this? There’s more than one way to do it, but is there more than one $/?

This entry was posted on Saturday, January 6th, 2007 at 4:22 PM and is filed under Perl.

7 Responses to “Perl and Multiple Line Ending Characters”

John Cowan Says:
January 7th, 2007 at 10:36 PM
Things are worse than you think. There’s no way using -0 to express \r\n, though Windows Perl will translate \r\n to just \n on input (though not so Cygwin Perl). Similarly, Mac Classic Perl interprets \n as meaning CR rather than LF, unlike all other Perls (except maybe on EBCDIC mainframes, I don’t know about those), though Mac OS X Perl is Unix Perl and doesn’t do that.

Your best hope is to use one of the various text filters that adjusts line endings.
James Orenchak Says:
January 9th, 2007 at 3:00 PM
After reading this, I don’t understand why I’ve never experienced any problems with this! I’ve written Perl programs for configuring various systems, checking the configurations of systems and for slicing and dicing data without having to cope with Access and Excel. I’ve gone back and forth between Solaris, MS Windows and Linux systems.

Fred Mallot suggests at http://searchopensource.techtarget.com/tip/0,289483,sid39_gci881400,00.html solving the problem you’re having with something like this, which converts “whatever” to single newlines:
********************************************
perl -i.bak -pe 's/[\r\n]+$/\n/;' filenames-here
********************************************
Frederick Ducatella from the University of Edinburgh suggests at http://www.inf.ed.ac.uk/teaching/courses/dme/html/perllist.html something that I’ve often used without giving it too much thought:

When you read data in from a file, you will read line by line. So the variable $in will contain an end-of-line character. When you process input you will usually want to get rid of this character. You can do this with the following command:
chomp($in);

If the end-of-line character is always removed when data is read in, I don’t think you’ll have the problem you’ve described, but I’m not sure. Could that be why I never ran across this problem?
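
One detail worth checking here: chomp removes only the current value of `$/`, which is "\n" by default, so a Windows-style line read on Unix keeps its trailing carriage return. The sketch below illustrates this under that assumption; the helper name `strip_eol` is hypothetical and does not come from either of the pages quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: strip any trailing CR/LF characters from a line,
# regardless of which convention produced them.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/[\r\n]+\z//;
    return $line;
}

my $in = "some text\r\n";    # a Windows-style line as read on Unix

my $chomped = $in;
chomp $chomped;              # removes only "\n", the default $/
printf "chomp leaves a carriage return: %s\n",
    ($chomped =~ /\r\z/ ? "yes" : "no");

printf "strip_eol leaves a carriage return: %s\n",
    (strip_eol($in) =~ /\r\z/ ? "yes" : "no");
```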
J Donald Says:
January 9th, 2007 at 3:45 PM
I cheat – I run the file through dos2unix and unix2dos when going back and forth between Mac/Unix and Windows. I don’t know whether this would be of any help, or whether you need to preserve the line endings as-is.
Elliotte Rusty Harold Says:
January 9th, 2007 at 3:46 PM
You probably never noticed the problem because you didn’t have Mac files in the mix. Windows files have carriage returns and line feeds. Thus the lines would still be split correctly, even if there was an extra \r at the end of each line. Mac text files have only carriage returns at the end of each line, so the entire file gets slurped up and treated as a single line.
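
The difference can be seen in a small sketch. This is an illustration only, assuming the whole file fits in memory; the helper name `split_any_eol` and the sample text are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into lines, accepting CRLF, LF, or bare CR.
sub split_any_eol {
    my ($text) = @_;
    # Try \r\n first so a Windows pair is not counted as two line breaks.
    return split /\r\n|\n|\r/, $text;
}

my $mixed = "unix line\nwindows line\r\nmac line\rlast line\n";

# With the default $/ of "\n", the bare carriage return does not end a record.
open my $fh, '<', \$mixed or die "Can't open in-memory file: $!";
my @default_records = <$fh>;
close $fh;

my @lines = split_any_eol($mixed);

printf "records with default \$/: %d\n", scalar @default_records;
printf "lines with split_any_eol: %d\n", scalar @lines;
```

With the default separator, the Mac-style line and the line after it come back as one record, while the regex split separates them. Note that split discards trailing empty fields, so blank lines at the very end of the text are dropped.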
Peter Says:
January 10th, 2007 at 10:40 AM
Looks like http://search.cpan.org/~audreyt/PerlIO-eol-0.14/eol.pm would sort this out. Pity modules aren’t very one liner friendly.
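
For one-liners, a core-only alternative might be to slurp each file and normalize every line ending in a single substitution. This is a sketch under the assumption that the files are small enough to hold in memory; the sub name `normalize_eol` is hypothetical, and this is not how PerlIO::eol itself works.

```perl
use strict;
use warnings;

# Hypothetical helper: convert CRLF and bare CR line endings to LF.
sub normalize_eol {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;    # "\r\n" or a lone "\r" becomes "\n"
    return $text;
}

# The same substitution as a one-liner; -0777 makes Perl read each file
# as a single record, so the regex sees every line ending at once:
#
#   perl -0777 -pi.bak -e 's/\r\n?/\n/g' filenames-here

print normalize_eol("unix\nwindows\r\nmac\r");
```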
Elliotte Rusty Harold Says:
January 11th, 2007 at 1:29 PM
That is indeed the answer, at least for non-one-liners. Thanks. For everyone else’s reference the trick is to add use PerlIO::eol; right after the shebang line:
```
#!/usr/bin/perl
use PerlIO::eol;
#...
```
Then, after you open the file, use the binmode command to convert all separators to linefeeds, like so:
```
open (FILE, $ARGV[0]) or die("Can't open $ARGV[0]\n");
binmode FILE, ":raw:eol(LF)";
```
Most people will probably need to install PerlIO::eol from CPAN first, since it does not seem to be included with most Perl distributions by default.
