Not Substring Regular Expressions
I’m trying to devise a regular expression that will find all or most img tags that don’t have alt attributes.
<img[^>]*/>
will find all the img elements (or at least most of them). And I can easily find those that do contain an alt attribute. However, I’m stumped when it comes to finding those that do not contain the substring alt. Any ideas?
Note that this expression does not have to be perfect. I can live with some false positives and negatives. This is just meant as a quick first pass over documents that will later be validated, so any cases I miss will be found then, and nothing will be changed or replaced without human inspection first. This is a pure search, not a search and replace.
It feels like I need some sort of not operator in regular expressions. What am I missing?
February 19th, 2007 at 1:29 PM
This works pretty well for me (in BBEdit), ignoring case:
<img([^>a]|a[^l]|al[^t]|".*"|'.*')*?>
It allows
any character not a ‘>’ or ‘a’
any ‘a’ not followed by an ‘l’
any ‘al’ not followed by a ‘t’
anything in quotes
So it disallows ‘alt’ outside of quotes. It will give false results for any other tags containing ‘alt’.
You might also want to search for: alt="".
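To see how this alternation behaves on a few tags, here is a small sketch in Python (an assumption; the comment above used BBEdit). ORIGINAL is the pattern from the comment with straight quotes. TIGHTER is a hypothetical variant that is not from the thread: it only lets quotes be consumed as matched pairs and stops a quoted value at its own closing quote.

```python
import re

# Pattern from the comment above, with straight quotes, ignoring case.
ORIGINAL = re.compile(r"""<img([^>a]|a[^l]|al[^t]|".*"|'.*')*?>""", re.IGNORECASE)

# Hypothetical tighter variant (not from the thread): a quoted value ends at
# its own closing quote, and a quote character is never consumed on its own.
TIGHTER = re.compile(r"""<img(?:"[^"]*"|'[^']*'|[^>a"']|a[^l]|al[^t])*>""", re.IGNORECASE)

samples = [
    '<img src="x.png">',          # no alt attribute
    '<img alt="X" src="x.png">',  # alt before any quoted value
    '<img src="x.png" alt="X">',  # alt after a quoted value
]

for tag in samples:
    # Print whether each pattern matches the whole tag.
    print(tag, bool(ORIGINAL.fullmatch(tag)), bool(TIGHTER.fullmatch(tag)))
```

One thing worth checking in your own engine: with the original pattern the third sample still matches, because ".*" is allowed to stretch from the closing quote of src to the closing quote of alt. The variant rejects it, at the cost of being a little harder to read.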
February 19th, 2007 at 1:46 PM
yeah that’s a good one. here’s what i came up with:
(?!<img[^>]*\salt[^>]*>$)<img[^>]*>
seems to work.
February 19th, 2007 at 1:48 PM
hey rusty – no ‘preview’ button? let’s try that again:
(?!<img[^>]*\salt[^=>]*=[^>]*>)<img[^>]*>
the negative lookahead makes sure there is no ‘ alt=’ inside the tag, and then the rest matches the tag itself, or something to that effect
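A quick way to try the lookahead pattern above is a few lines of Python. This is only a sketch: the choice of Python, the function name imgs_without_alt, and the sample markup are all assumptions made for illustration.

```python
import re

# The pattern from the comment above, ignoring case.
NO_ALT = re.compile(r"(?!<img[^>]*\salt[^=>]*=[^>]*>)<img[^>]*>", re.IGNORECASE)

def imgs_without_alt(html):
    """Return every img tag in html that the pattern flags as lacking alt."""
    # The pattern has no capture groups, so findall returns whole matches.
    return NO_ALT.findall(html)

sample = (
    '<p><img src="a.png">'
    '<img src="b.png" alt="B"/>'
    '<img class="salty" src="c.png"></p>'
)

for tag in imgs_without_alt(sample):
    print(tag)
```

The first and third tags are reported. The word "salty" does not trip the check, because \salt requires whitespace directly before alt. An attribute such as alternate="..." would be treated as an alt attribute, since [^=>]* lets the name continue up to the equals sign.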
February 19th, 2007 at 7:21 PM
What am I missing? XPath?
Sorry, couldn’t help it.
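For documents that are already well-formed XML, the XPath route can be sketched as follows. In a full XPath 1.0 engine the expression would be //img[not(@alt)]. Python's standard xml.etree.ElementTree supports only a subset of XPath that lacks not(), so this sketch filters in plain Python instead. The function name and sample document are hypothetical, and the input is assumed to have no XML namespace.

```python
import xml.etree.ElementTree as ET

def find_imgs_missing_alt(xhtml):
    """Return img elements with no alt attribute in a well-formed document."""
    root = ET.fromstring(xhtml)
    # Equivalent in spirit to the XPath //img[not(@alt)].
    return [img for img in root.iter("img") if "alt" not in img.attrib]

doc = '<div><img src="a.png"/><img src="b.png" alt="B"/></div>'

for img in find_imgs_missing_alt(doc):
    print(img.get("src"))
```

This only helps once the documents parse as XML, which is why a regex first pass can still be useful on messier input.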
February 20th, 2007 at 4:18 AM
Something like this should be ok
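while( $html =~/
(
  <img         #an img tag, reconstructed: the blog software stripped this part
  (?!          #not followed by, also reconstructed
    [^>]*      #0 or more non >
    \s+alt\s*= #and an alt attribute
  )
  [^>]*        #instead followed by 0 or more non >
  >            #and a tag close
)/xg)
{
  print $1."\n\n";
}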
February 23rd, 2007 at 3:59 AM
I think Minute Bol’s second suggestion above uses Java RegEx’s negative look-behind assertion. My copy of “Java in a Nutshell (4th ed)” says that the pattern “must be a…fixed number of characters”, so I don’t think his suggestion will work exactly. There’s no such restriction on the negative look-ahead assertion, but I can’t think how to re-write the regex that way.
HTH. Hwyl,
Neil.
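For reference, (?!...) is the negative lookahead syntax and (?<!...) is the negative lookbehind in Java, Perl, and Python alike, so the fixed-length restriction quoted above would apply only to a lookbehind version of the pattern. The sketch below checks this in Python (an assumption; Java's exact rules for lookbehind length differ somewhat), using the pattern from the earlier comment.

```python
import re

# Variable-length negative lookahead: compiles without complaint.
lookahead = re.compile(r"(?!<img[^>]*\salt[^=>]*=[^>]*>)<img[^>]*>")

# The same body written as a negative lookbehind is rejected by Python's
# re module, because its length is not fixed.
try:
    re.compile(r"(?<!<img[^>]*\salt[^=>]*=[^>]*>)<img[^>]*>")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

print("lookbehind accepted:", lookbehind_ok)
```

So the lookahead form is the one that expresses "not followed by an alt attribute", and it needs no rewriting to avoid the restriction.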