Show HN: I wrote a book on Python regular expressions

Maro · on Aug 8, 2019

You might like:

So you can write:

  username = rxe.one_or_more(rxe.set([rxe.alphanumeric(), '.', '%', '+', '-']))
  domain = rxe.one_or_more(rxe.set([rxe.alphanumeric(), '.', '-']))
  tld = rxe.at_least_at_most(2, 6, rxe.set([rxe.range('a', 'z'), rxe.range('A', 'Z')]))
  email = (rxe
    .exactly(username)
    .literal('@')
    .exactly(domain)
    .literal('.')
    .exactly(tld)
  )

sweeneyrod · on Aug 8, 2019

This seems cool, but I think that's mostly because the regex you think of as a point of comparison is something like

    r'^[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}$'

which does look awful. But there's no reason you can't do

    username = r'[\w.%+-]+'
    domain = r'[\w.-]+'
    tld = r'[a-zA-Z]{2,6}'
    email = username + r'@' + domain + r'\.' + tld

which is arguably easier to read than the rxe version for someone familiar with regex.
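A minimal sketch of how the concatenated pieces might be compiled and used. The names `EMAIL_RE` and `is_email` are made up for illustration, and the pattern is the simplified one from this comment, not a full RFC-compliant email check:

```python
import re

# The fragments are ordinary (raw) strings, so + simply joins them.
username = r'[\w.%+-]+'
domain = r'[\w.-]+'
tld = r'[a-zA-Z]{2,6}'
email = username + r'@' + domain + r'\.' + tld

# Compile once, reuse many times.
EMAIL_RE = re.compile(email)

def is_email(text):
    """Return True if the whole string matches the simplified pattern."""
    return EMAIL_RE.fullmatch(text) is not None
```

One thing to watch when gluing fragments together: if a fragment contains a top-level `|`, wrap it in a non-capturing group `(?:...)` first, otherwise the alternation will swallow its neighbours.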

nerdponx · on Aug 9, 2019

And/or use verbose mode -- re.VERBOSE or (?x) -- for comments and whitespace.
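For example, the inline `(?x)` form could look roughly like this. It is only a sketch reusing the simplified email pattern from above, and `EMAIL_VERBOSE` is a made-up name:

```python
import re

# (?x) at the very start of the pattern has the same effect as re.VERBOSE:
# unescaped whitespace is ignored and '#' starts a comment.
EMAIL_VERBOSE = re.compile(r'''(?x)
    [\w.%+-]+        # local part
    @                # separator
    [\w.-]+          # domain labels
    \.               # literal dot before the TLD
    [a-zA-Z]{2,6}    # top-level domain
''')
```

In verbose mode a literal space or `#` has to be escaped or put inside a character class, since otherwise it is treated as layout or as the start of a comment.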

jazzyjackson · on Aug 8, 2019

wow I didn't know you could concatenate regexes, thanks

gnulinux · on Aug 8, 2019

Why not? They're just strings until they're compiled. "Code is data"

CaptainMarvel · on Aug 8, 2019

The thought of doing this literally didn’t occur to me... I think this is great!

jazzyjackson · on Aug 9, 2019

Well I suppose the r in front of them in python made them look just enough "not a normal string" for me to forget how the + operator might act.

My day-to-day experience is in nodejs, where adding one regex object to another coerces them to normal strings first

[edit: hey neat, the hackernews form strips emoji from comments, I wonder if that's just ranges of unicode or if there's some crazy regex going on :D]

baudehlo · on Aug 9, 2019

In Javascript you just need to wrap in `new RegExp(r1 + r2 + r3)`.

Maro · on Aug 8, 2019

Yes, an rxe is longer. But: I've been using regexps for 20 years and I can't remember them (both reading and writing). My brain swaps it out, partly bc I know it's one SO away. But that's bad for code readability.

asicsp · on Aug 9, 2019

I actually find these too verbose. Maybe because I didn't know about them when I started out and was using regex in CLI tools and Perl scripts.

There's also a module [1] which already has a collection of common regexes to match dates, links, emails, etc.

[1] https://github.com/madisonmay/CommonRegex

just_myles · on Aug 8, 2019

I use the below to extract email addresses. Took it from some website some time ago and made super light changes to it. Not sure if this is the best, but it has served me fairly well.

    import re

    email_reg = re.compile(r'''
        ([a-zA-Z0-9._%+-]+   # first name and last name
        @                    # @ sign
        [a-zA-Z0-9._%+-]+    # domain name
        \.[a-zA-Z]{2,10})    # .com
    ''', re.VERBOSE)

_eht · on Aug 8, 2019

You should allow “$” on the left; as per the RFC...

just_myles · on Aug 8, 2019

Why I hate regex :). You think you know it, but you never really do.

nerdponx · on Aug 9, 2019

That's about the email RFC, not about regex syntax.

just_myles · on Aug 9, 2019

Ok cool.

wwwhizz · on Aug 8, 2019

I would not allow the % sign.

ape4 · on Aug 8, 2019

To me "exactly" means the same as "literal". (I understand here it's a full pattern match)

foo_in_bar · on Aug 8, 2019

Never heard about the rxe but it looks like something I could use - thanks.

killaken2000 · on Aug 9, 2019

I'll have to look into this. In some limited cases verbosity that leads to a more obvious and understandable presentation is preferable, at least to me.

gnulinux · on Aug 8, 2019

Too verbose, I prefer good old regular expression.

felixr · on Aug 8, 2019

Completely agree. If you know good old regular expressions, then this rxe definition is actually harder to read. This is how APL programmers must feel when they look at code in popular programming languages.

PieUser · on Aug 8, 2019

This is excellent, I prefer this.

qwsxyh · on Aug 8, 2019

10 lines to replace maybe 20 characters. How horribly verbose.

tW4r · on Aug 8, 2019

But that is exactly what I like about it. I rarely have to use regex, but when I do have to write or change something I always have to spend 15 mins rereading (remembering) most of the things about it. This would help me understand my old code way better.

mikorym · on Aug 8, 2019

This is a good point, and without trying to debate anything, regex is something of an art form for some people, and I guess as many people are averse to it as there are high school students who hate Shakespeare.

But to give a bit of substance, you can often use a dictionary type approach in a situation where regex is needed. Example: replacing accented latin with normal (ascii) latin.
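A rough sketch of that dictionary approach using `str.translate`. The mapping below is a small hand-picked sample, and `ACCENT_MAP` and `strip_accents` are made-up names:

```python
# Hypothetical, deliberately incomplete mapping of accented to plain letters.
ACCENT_MAP = {
    'á': 'a', 'é': 'e', 'í': 'i', 'ó': 'o', 'ú': 'u',
    'ñ': 'n', 'ü': 'u', 'ç': 'c',
}

# str.maketrans turns the dict into a table keyed by code point.
TABLE = str.maketrans(ACCENT_MAP)

def strip_accents(text):
    """Replace each mapped character; everything else passes through."""
    return text.translate(TABLE)
```

With this table, `strip_accents('café señor')` gives `'cafe senor'`, with no regex involved.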

I do sometimes pride myself on my necromancing skills of resurrecting old Perl scripts from perlmonks.com, but I suspect it is more of a hobby than out of absolute necessity. I find the memes about Perl/Python and Star Wars to be pretty funny and much more entertaining than people actually debating programming languages. [1]

[1] https://www.python.org/doc/humor/#python-vs-perl-according-t...

ken · on Aug 8, 2019

That describes all of Python, to an APL programmer.

natpalmer1776 · on Aug 8, 2019

I personally learned RegEx with PowerShell for a work initiative that required heavy usage of it. Most of the documentation regarding RegEx was pretty language agnostic, so it's interesting when I run across guides that are specific to a particular language.

Is there anything about Regular Expressions in Python that creates a unique need for its own domain-specific guide?

asicsp · on Aug 8, 2019

There are enough differences in terms of syntax and features across languages that I felt separate books would help the user better than a common one. If you consider Python's built-in 're' module, it doesn't support possessive quantifiers, \K, variable-length lookbehind, set operations, control verbs like SKIP, \p{} unicode character sets and so on.
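A quick way to see a few of these gaps is to try compiling such patterns with the built-in module. This is only a sketch: the sample patterns and the helper `compile_error` are made up, and the behaviour shown is what current CPython releases do as far as I know, which could change in later versions:

```python
import re

# Sample patterns using features that the built-in 're' module rejects.
UNSUPPORTED = {
    'variable-length lookbehind': r'(?<=a+)b',
    'keep-out escape': r'foo\Kbar',
    'unicode property class': r'\p{L}+',
}

def compile_error(pattern):
    """Return the error message if 're' rejects the pattern, else None."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

results = {name: compile_error(p) for name, p in UNSUPPORTED.items()}

for name, message in results.items():
    print(f'{name}: {message}')
```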

Another angle is with regards to usage of functions. When to use 're.findall' vs 're.finditer'. How does capture group affect 're.split' and 're.findall' and so on.
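A small illustration of those differences, using made-up sample text:

```python
import re

text = 'apple=1, banana=22, cherry=333'

# No capture groups: findall returns the full matches as strings.
numbers = re.findall(r'\d+', text)

# With capture groups: findall returns one tuple of groups per match.
pairs = re.findall(r'(\w+)=(\d+)', text)

# finditer yields match objects lazily, so positions and groups
# are both available without building the whole list first.
starts = [(m.group(1), m.start()) for m in re.finditer(r'(\w+)=(\d+)', text)]

# split drops the separators by default...
plain = re.split(r',\s*', text)

# ...but keeps them in the result when the separator is a capture group.
kept = re.split(r'(,\s*)', text)
```

Here `numbers` is `['1', '22', '333']`, `pairs` is `[('apple', '1'), ('banana', '22'), ('cherry', '333')]`, and `kept` is `['apple=1', ', ', 'banana=22', ', ', 'cherry=333']`.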

nurettin · on Aug 8, 2019

It is not really the regex syntax itself that changes (not too many python specific constructs), but rather, how you apply it. For example, you might use capture groups and re.findall, which will return a list of tuples, or re.search, which will return a match object, both of which have different ways of iterating.

just_myles · on Aug 8, 2019

None that I know of. My thought process has always been that regex was language agnostic. I think Perl has its own version that is widely used. But other than that, I do not know.

caymanjim · on Aug 8, 2019

Perl's is a superset ("PCRE" - Perl Compatible Regular Expression). It's now supported by a lot of other tools, not least of which is GNU grep.

Lordarminius · on Aug 8, 2019

> My thought process has always been that regex was language agnostic.

+1

I was surprised to find that regex in Python was not much different from the language I use, Ruby. Would anyone with sufficient knowledge care to eli15 why this is and how it is implemented?

cutler · on Aug 9, 2019

Ruby's regex engine is Onigmo which is very similar to Perl 5.10. Perl was one of Ruby's main influences, along with Smalltalk and Lisp, whereas Python's BDFL was never a fan of Perl. However, Perl's implementation of regular expressions (PCRE) has been widely adopted hence the similarity you refer to. If you compare with Javascript and PHP you'll find they're all similar.

redis_mlc · on Aug 9, 2019

Perl is a superior language. Besides being the fastest scripting language by far ...

Ruby was inspired by it.

Python still can't do forward references, amongst many other puzzling limitations. I'm sure Python 3's split() improvements were copied from Perl.

danso · on Aug 8, 2019

Thanks for sharing! I only skimmed it but I liked how you include usage of the external regex module, which I hadn't realized allowed for the use of variable-length look-behinds.

ErikCorry · on Aug 8, 2019

The engine has support but the language doesn't expose it?

danso · on Aug 8, 2019

I don't think I understand your question (nor am I an expert on Python regex!)...but just to be clear, Python's regular expression standard library is named `re`. But there is an external lib – ostensibly a drop-in replacement – that goes by the name of `regex`. It is the `regex` library that supports variable lookbehind, not Python's standard library `re`

https://pypi.org/project/regex/
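A hedged sketch of what that looks like in practice, assuming the third-party `regex` package is installed (`pip install regex`). The pattern, the sample text, and the helper `find_amounts` are made up for illustration:

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None  # the sketch degrades gracefully if it is missing

# Lookbehind of variable length: a dollar sign followed by any amount
# of whitespace. The standard 're' module rejects this pattern.
AMOUNT = r'(?<=\$\s*)\d+'

def find_amounts(text):
    """Return the digit runs that follow a '$', or None without 'regex'."""
    if regex is None:
        return None
    return regex.findall(AMOUNT, text)
```

With the package available, `find_amounts('cost: $ 42 and $7')` should return `['42', '7']`.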

spazzy81 · on Aug 9, 2019

This is super helpful! I've always felt Python regex was one of the more complicated parts of the language.

nathanbarry · on Aug 8, 2019

Congrats on writing and publishing this book!

smitshah0014 · on Aug 8, 2019

Thank you very much! This is really helpful.

dhairya · on Aug 8, 2019

This is a great resource. Thank you for taking the time to write this and making it available to us!

AlexanderDhoore · on Aug 8, 2019

There are two kinds of text search/parse problems:

- The really easy ones. A simple string search/split will do and a regex would be overkill.
- The really hard ones. You'll need to fully parse this and using a regex will result in fragile/hard to understand code.

Please don't use regexes in production software. Learn how to write simple parsing code.

IceDane · on Aug 8, 2019

I couldn't agree more. I honestly find it rather baffling that someone would write a book on not just regular expressions in general, but one particular variant in one particular language.

I've been writing code for over 15 years and if I said I had reached for regular expressions maybe 10 times in that entire time, I'm pretty sure I wouldn't be off even by an order of magnitude.

There are very few cases where they're the appropriate tool, I think. They don't compose (unless you use a higher level library that lets you construct them declaratively maybe), they are hard to read and hard to debug and nearly impossible to extend.

In my opinion, parser combinators are state of the art for building parsers. They can read like prose, are very easy to extend and usually easy to debug as well since you can easily get detailed error messages.
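To make that concrete, here is a tiny hand-rolled sketch in Python. It is not any particular library's API: `ParseError`, `char_in`, `many1`, `seq` and `parse_date` are all made up for illustration. A parser here is a function taking `(text, pos)` and returning `(value, new_pos)`:

```python
class ParseError(Exception):
    """Raised with a message saying what was expected and where."""

def char_in(chars, label):
    """Parser for a single character out of the given set."""
    def parse(text, pos):
        if pos < len(text) and text[pos] in chars:
            return text[pos], pos + 1
        raise ParseError(f'expected {label} at position {pos}')
    return parse

def many1(parser):
    """Apply a parser one or more times and join the results."""
    def parse(text, pos):
        value, pos = parser(text, pos)  # the first match is mandatory
        values = [value]
        while True:
            try:
                value, pos = parser(text, pos)
            except ParseError:
                return ''.join(values), pos
            values.append(value)
    return parse

def seq(*parsers):
    """Run parsers one after another and collect their values."""
    def parse(text, pos):
        values = []
        for parser in parsers:
            value, pos = parser(text, pos)
            values.append(value)
        return values, pos
    return parse

# The grammar reads almost like its description: digits, dash, digits, ...
digits = many1(char_in('0123456789', 'a digit'))
dash = char_in('-', "'-'")
iso_date = seq(digits, dash, digits, dash, digits)

def parse_date(text):
    values, pos = iso_date(text, 0)
    if pos != len(text):
        raise ParseError(f'unexpected input at position {pos}')
    year, _, month, _, day = values
    return int(year), int(month), int(day)
```

`parse_date('2019-08-08')` returns `(2019, 8, 8)`, while `parse_date('2019/08/08')` raises `ParseError` with the message `expected '-' at position 4`, which is the kind of error reporting a bare regex match does not give you.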

quickthrower2 · on Aug 9, 2019

Regex is handy for client side JS validation of fields such as a field needs to be an integer etc. You could use parser combinator, but why combinators (pun!) when a simple regex will do?

gravitas · on Aug 8, 2019

I need to validate an input string from the user to ensure it's a 32 character hex (I'm validating it's a NAA format 6 WWN) and regex is just the easy, fast and efficient way to do this. It's not a complicated regex and is quite readable to anyone with basic regex knowledge, so why is this wrong?
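The comment doesn't show the actual regex, so this is only a guess at what such a check might look like in Python. `HEX32_RE` and `is_hex32` are made-up names, and the check covers only "exactly 32 hex characters", not any further WWN structure:

```python
import re

# Exactly 32 hexadecimal digits, upper or lower case.
HEX32_RE = re.compile(r'[0-9a-fA-F]{32}')

def is_hex32(value):
    """Return True if the whole string is 32 hex characters."""
    return HEX32_RE.fullmatch(value) is not None
```

Using `fullmatch` avoids having to remember the `^...$` anchors, and also avoids the case where `$` matches just before a trailing newline.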

asicsp · on Aug 9, 2019

I don't have experience with parsers, so I cannot comment how easy/difficult it is compared to regex.

Regex can indeed lead to issues [1] but so could any other piece of code. So, I disagree that regex shouldn't be used in production. This article [2] by Jeff Atwood gives a balanced view of when to use regex and some nice tips.

[1] https://new.blog.cloudflare.com/details-of-the-cloudflare-ou...

[2] https://blog.codinghorror.com/regular-expressions-now-you-ha...

madelyn · on Aug 8, 2019

Can you elaborate a bit on this please? I'd be interested in resources on writing better parsers!

Quickly looking at the python standard lib (urlparse, shlex, etc) and Python packages (NLTK Treebank tokenizer), a lot of packages related to slicing, dicing and parsing strings use a mashup of regex and rule based code.

cben · on Aug 14, 2019

No experience with them in python, but look also for PEG grammars. They are significantly simpler than the traditional tower of lexer + ambiguous limited lookahead grammar.

Plus the memoized Packrat algorithm allows throwing in functions with custom conditions. (Somewhat like parser combinators, they also support custom logic.)

IceDane · on Aug 8, 2019

Parser combinators are the way to go. I don't use python often enough to know if such a library exists for python, but I would assume so.

castis · on Aug 9, 2019

> regex will result in fragile/hard to understand code

Sure, if you don't know what you're doing.