Perl and Multiple Line Ending Characters
Perl uses \n (the linefeed) as its default end of line character (record separator). You can change this with the -0
option on the command line to be \r (carriage return), \r\n (carriage return linefeed pair) or something else. For example, this command sets the record separator to \r before replacing every occurrence of the string foo with the string bar:
$ perl -015 -pi -e 's/foo/bar/g' test.html
However, my files are a weird mix of Unix, Mac, and Windows conventions. A few files may even use several line ending conventions in one file. Most modern text editors can autodetect and deal with this without any problem, as can XML parsers. However, as near as I can figure, Perl cannot. It expects me to know in advance what kind of file I’m feeding it.
Is there any simple way around this? There’s more than one way to do it, but is there more than one $/?
January 7th, 2007 at 10:36 PM
Things are worse than you think. There’s no way using -0 to express \r\n, though Windows Perl will translate \r\n to just \n on input (though not so Cygwin Perl). Similarly, Mac Classic Perl interprets \n as meaning CR rather than LF, unlike all other Perls (except maybe on EBCDIC mainframes, I don’t know about those), though Mac OS X Perl is Unix Perl and doesn’t do that.
Your best hope is to use one of the various text filters that adjusts line endings.
January 9th, 2007 at 3:00 PM
After reading this, I don’t understand why I’ve never experienced any problems with this! I’ve written Perl programs for configuring various systems, checking the configurations of systems and for slicing and dicing data without having to cope with Access and Excel. I’ve gone back and forth between Solaris, MS Windows and Linux systems.
Fred Mallot suggests at http://searchopensource.techtarget.com/tip/0,289483,sid39_gci881400,00.html solving the problem you’re having with something like this, which converts “whatever” to single newlines:
********************************************
perl -i.bak -pe 's/[\r\n]+$/\n/;' filenames-here
********************************************
Frederick Ducatella from the University of Edinburgh suggests at http://www.inf.ed.ac.uk/teaching/courses/dme/html/perllist.html something that I’ve often used without giving it too much thought:
If the end-of-line character is always removed when data is read in, I don’t think you’ll have the problem you’ve described, but I’m not sure. Could that be why I never ran across this problem?
January 9th, 2007 at 3:45 PM
I cheat – I run the file through dos2unix and unix2dos when going back and forth between Mac/Unix and Windows. I don’t know whether this would be of any help, or whether you need to preserve the line endings as-is.
January 9th, 2007 at 3:46 PM
You probably never noticed the problem because you didn’t have Mac files in the mix. Windows files have carriage returns and line feeds. Thus the lines would still be split correctly, even if there was an extra \r at the end of each line. Mac text files have only carriage returns at the end of each line, so the entire file gets slurped up and treated as a single line.
January 10th, 2007 at 10:40 AM
Looks like http://search.cpan.org/~audreyt/PerlIO-eol-0.14/eol.pm would sort this out. Pity modules aren’t very one-liner friendly.
January 11th, 2007 at 1:29 PM
That is indeed the answer, at least for non-one-liners. Thanks. For everyone else’s reference the trick is to add
use PerlIO::eol;
right after the shebang line. Then, after you open the file, use the binmode command to convert all separators to linefeeds.
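A minimal sketch of that binmode step, wrapped in a shell check because PerlIO::eol is not a core module. The layer string ":raw:eol(LF)" follows the module's documented usage as I understand it, not necessarily the exact code this comment originally showed; the sample file is invented.

```shell
# Skip quietly if the (non-core) PerlIO::eol module is not installed.
if perl -MPerlIO::eol -e1 2>/dev/null; then
  tmp=$(mktemp)
  printf 'one\r\ntwo\rthree\n' > "$tmp"
  # ":raw:eol(LF)" asks for every CR, LF, or CRLF to be read as a single LF.
  eol_lines=$(perl -MPerlIO::eol -e '
      open(my $fh, "<", $ARGV[0]) or die "open: $!";
      binmode($fh, ":raw:eol(LF)");
      my $n = 0;
      $n++ while <$fh>;   # default $/ now works for all three conventions
      print $n;
  ' "$tmp")
  echo "lines seen through :raw:eol(LF): $eol_lines"
  rm -f "$tmp"
else
  eol_lines=skipped
  echo "PerlIO::eol is not installed"
fi
```

Loading the module with -M on the command line is one way to use it from a one-liner, at the cost of a longer command.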
February 9th, 2007 at 2:59 PM
[…] With a little luck, the book should be on store shelves sometime this summer. I’ve already posted a number of questions that arose while writing it. I’m going to be posting a lot more over the next couple of months. I also plan to post many small excerpts from the book for your perusal and comment. I hope you’ll help out by commenting on, caviling, and correcting the draft pieces I’ll be posting here. […]