Show HN: I wrote a book on Python regular expressions
193 points by asicsp on Aug 8, 2019 | 50 comments
My book titled "Python re(gex)?" is free to download through this weekend [1][2]

The book covers both the 're' and 'regex' modules and has plenty of examples. Chapters also have cheatsheets and exercises.

Code snippets, exercises, sample chapters, etc. are available in the GitHub repo [3]

I used pandoc+xelatex [4] to generate the pdf.

[1] https://gumroad.com/l/py_regex

[2] https://leanpub.com/py_regex

[3] https://github.com/learnbyexample/py_regular_expressions

[4] https://learnbyexample.github.io/tutorial/ebook-generation/customizing-pandoc/



You might like:

https://github.com/mtrencseni/rxe

So you can write:

  username = rxe.one_or_more(rxe.set([rxe.alphanumeric(), '.', '%', '+', '-']))
  domain = rxe.one_or_more(rxe.set([rxe.alphanumeric(), '.', '-']))
  tld = rxe.at_least_at_most(2, 6, rxe.set([rxe.range('a', 'z'), rxe.range('A', 'Z')]))
  email = (rxe
    .exactly(username)
    .literal('@')
    .exactly(domain)
    .literal('.')
    .exactly(tld)
  )


This seems cool, but I think that's mostly because the regex you think of as a point of comparison is something like

    r'^[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}$'
which does look awful. But there's no reason you can't do

    username = r'[\w.%+-]+'
    domain = r'[\w.-]+'
    tld = r'[a-zA-Z]{2,6}'
    email = username + r'@' + domain + r'\.' + tld
which is arguably easier to read than the rxe version for someone familiar with regex.
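
A minimal sketch of the concatenation approach above, showing that the pieces compile into one ordinary pattern. The sample addresses are made up for illustration, and `fullmatch` is used here in place of explicit `^`/`$` anchors:

```python
import re

# Each piece is just a raw string until it is compiled.
username = r'[\w.%+-]+'
domain = r'[\w.-]+'
tld = r'[a-zA-Z]{2,6}'
email = re.compile(username + r'@' + domain + r'\.' + tld)

# fullmatch requires the whole string to match the pattern.
print(bool(email.fullmatch('user.name+tag@example.com')))  # True
print(bool(email.fullmatch('not an email')))               # False
```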


And/or use verbose mode -- re.VERBOSE or (?x) -- for comments and whitespace.


wow I didn't know you could concatenate regexes, thanks


Why not? They're just strings until they're compiled. "Code is data"


The thought of doing this literally didn’t occur to me... I think this is great!


Well, I suppose the r in front of them in Python made them look just enough "not a normal string" for me to forget how the + operator might act.

My day-to-day experience is in Node.js, where adding one regex object to another coerces them to normal strings first.

[edit: hey neat, the hackernews form strips emoji from comments, I wonder if that's just ranges of unicode or if there's some crazy regex going on :D]


In JavaScript you just need to join the sources and wrap them: `new RegExp(r1.source + r2.source + r3.source)`.


Yes, an rxe is longer. But: I've been using regexps for 20 years and I can't remember them (both reading and writing). My brain swaps it out, partly because I know it's one SO search away. But that's bad for code readability.


I actually find these too verbose. Maybe that's because I didn't know about them when I started out, and was using regex in CLI tools and Perl scripts.

There's also a module [1] which already has a collection of common regexes to match dates, links, emails, etc.

[1] https://github.com/madisonmay/CommonRegex


I use the regex below to extract email addresses. I took it from some website a while ago and made very light changes to it. I'm not sure it's the best, but it has served me fairly well.

    import re

    email_reg = re.compile(r'''
        ([a-zA-Z0-9._%+-]+    # username (local part)
        @                     # @ sign
        [a-zA-Z0-9._%+-]+     # domain name
        \.[a-zA-Z]{2,10})     # top-level domain, e.g. .com
        ''', re.VERBOSE)


You should allow “$” on the left; as per the RFC...


This is why I hate regex :). You think you know it, but you never really do.


That's about the email RFC, not about regex syntax.


Ok cool.


I would not allow the % sign.


To me "exactly" means the same as "literal". (I understand that here it's a full pattern match.)


Never heard about the rxe but it looks like something I could use - thanks.


I'll have to look into this. In some limited cases verbosity that leads to a more obvious and understandable presentation is preferable, at least to me.


Too verbose, I prefer good old regular expression.


Completely agree. If you know good old regular expressions, then this rxe definition is actually harder to read. This is how APL programmers must feel when they look at code in popular programming languages.


This is excellent, I prefer this.


10 lines to replace maybe 20 characters. How horribly verbose.


But that is exactly what I like about it. I rarely have to use regex, but when I do have to write or change something I always have to spend 15 mins reading (remembering) most of the things about it again. This would help me understand my old code way better.


This is a good point, and without trying to debate anything, regex is something of an art form for some people, and I guess as many people are averse to it as there are high school students who hate Shakespeare.

But to give a bit of substance, you can often use a dictionary type approach in a situation where regex is needed. Example: replacing accented latin with normal (ascii) latin.
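
A rough sketch of that dictionary-style approach in Python, using `str.maketrans` and `str.translate`. The mapping below is a small hypothetical sample, not a complete table, and it assumes the input uses precomposed accented characters:

```python
# Hypothetical sample mapping; a real table would cover many more characters.
ACCENT_MAP = str.maketrans({
    'á': 'a', 'é': 'e', 'í': 'i', 'ó': 'o', 'ú': 'u',
    'ñ': 'n', 'ü': 'u', 'ç': 'c',
})

def strip_accents(text):
    """Replace each mapped accented character with its ASCII counterpart."""
    return text.translate(ACCENT_MAP)

print(strip_accents('canción señor'))  # cancion senor
```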

I do sometimes pride myself on my necromancy skills in resurrecting old Perl scripts from perlmonks.com, but I suspect it is more of a hobby than an absolute necessity. I find the memes about Perl/Python and Star Wars to be pretty funny and much more entertaining than people actually debating programming languages. [1]

[1] https://www.python.org/doc/humor/#python-vs-perl-according-t...


That describes all of Python, to an APL programmer.


I personally learned RegEx with PowerShell for a work initiative that required heavy usage of it. Most of the documentation regarding RegEx was pretty language agnostic, so it's interesting when I run across guides that are specific to a particular language.

Is there anything about Regular Expressions in Python that creates a unique need for its own domain-specific guide?


There are enough differences in terms of syntax and features across languages that I felt separate books would help the user better than a common one. If you consider Python's built-in 're' module, it doesn't support possessive quantifiers, \K, variable-length lookbehind, set operations, control verbs like SKIP, \p{} unicode character sets and so on.

Another angle is with regards to usage of functions. When to use 're.findall' vs 're.finditer'. How does capture group affect 're.split' and 're.findall' and so on.
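
To make the function differences above concrete, here is a small sketch with the built-in 're' module. The sample string is made up for illustration:

```python
import re

text = 'a1-b22-c333'

# re.split: a capture group in the pattern keeps the separators in the result.
print(re.split(r'-', text))     # ['a1', 'b22', 'c333']
print(re.split(r'(-)', text))   # ['a1', '-', 'b22', '-', 'c333']

# re.findall: with one capture group, only the group's text is returned.
print(re.findall(r'[a-z]\d+', text))    # ['a1', 'b22', 'c333']
print(re.findall(r'[a-z](\d+)', text))  # ['1', '22', '333']

# re.finditer yields match objects, so positions are available too.
spans = [m.span() for m in re.finditer(r'\d+', text)]
print(spans)  # [(1, 2), (4, 6), (8, 11)]
```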


It is not really the regex syntax itself that changes (not too many python specific constructs), but rather, how you apply it. For example, you might use capture groups and re.findall, which will return a list of tuples, or re.search, which will return a match object, both of which have different ways of iterating.
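
A short sketch of that difference, using a made-up input string: with two capture groups, `re.findall` gives a list of tuples, while `re.search` gives a single match object (or `None` when nothing matches).

```python
import re

log = 'x=10, y=20'
pattern = r'(\w)=(\d+)'

# findall: one tuple per match, one element per capture group.
pairs = re.findall(pattern, log)
print(pairs)  # [('x', '10'), ('y', '20')]

# search: only the first match, as a match object that must be checked for None.
m = re.search(pattern, log)
if m is not None:
    print(m.group(0))  # x=10
    print(m.groups())  # ('x', '10')
```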


None that I know of. My thought process has always been that regex was language agnostic. I think Perl has its own version that is widely used. But other than that, I do not know.


Perl's is a superset ("PCRE" - Perl Compatible Regular Expression). It's now supported by a lot of other tools, not least of which is GNU grep.


> My thought process has always been that regex was language agnostic.

+1

I was surprised to find that regex in Python was not much different from the language I use, Ruby. Would anyone with sufficient knowledge care to eli15 why this is and how it is implemented?


Ruby's regex engine is Onigmo which is very similar to Perl 5.10. Perl was one of Ruby's main influences, along with Smalltalk and Lisp, whereas Python's BDFL was never a fan of Perl. However, Perl's implementation of regular expressions (PCRE) has been widely adopted hence the similarity you refer to. If you compare with Javascript and PHP you'll find they're all similar.


Perl is a superior language. Besides being the fastest scripting language by far ...

Ruby was inspired by it.

Python still can't do forward references, amongst many other puzzling limitations. I'm sure Python 3's split() improvements were copied from Perl.


Thanks for sharing! I only skimmed it but I liked how you include usage of the external regex module, which I hadn't realized allowed for the use of variable-length look-behinds.


The engine has support but the language doesn't expose it?


I don't think I understand your question (nor am I an expert on Python regex!)...but just to be clear, Python's regular expression standard library is named `re`. But there is an external lib – ostensibly a drop-in replacement – that goes by the name of `regex`. It is the `regex` library that supports variable lookbehind, not Python's standard library `re`

https://pypi.org/project/regex/
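
A small sketch of the difference. The helper name `stdlib_accepts` and the sample patterns are made up for illustration, and the last part only runs if the third-party `regex` package happens to be installed:

```python
import re

def stdlib_accepts(pattern):
    """Return True if the standard library 're' module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

fixed = r'(?<=ab)c'      # fixed-width lookbehind
variable = r'(?<=ab+)c'  # variable-length lookbehind

print(stdlib_accepts(fixed))     # True
print(stdlib_accepts(variable))  # False

# Optional: try the same pattern with the external module, if available.
try:
    import regex
except ImportError:
    regex = None

if regex is not None:
    print(regex.findall(variable, 'abbbc abc ac'))
```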


This is super helpful! I've always felt Python regex was one of the more complicated parts of the language.


Congrats on writing and publishing this book!


Thank you very much! This is really helpful.


This is a great resource. Thank you for taking the time to write this and making it available to us!


There are two kinds of text search/parse problems:

- The really easy ones. A simple string search/split will do and a regex would be overkill.
- The really hard ones. You'll need to fully parse this and using a regex will result in fragile/hard to understand code.

Please don't use regexes in production software. Learn how to write simple parsing code.


I couldn't agree more. I honestly find it rather baffling that someone would write a book on not just regular expressions in general, but one particular variant in one particular language.

I've been writing code for over 15 years and if I said I had reached for regular expressions maybe 10 times in that entire time, I'm pretty sure I wouldn't be off even by an order of magnitude.

There are very few cases where they're the appropriate tool, I think. They don't compose (unless, maybe, you use a higher level library that lets you construct them declaratively), they are hard to read and hard to debug and nearly impossible to extend.

In my opinion, parser combinators are state of the art for building parsers. They can read like prose, are very easy to extend and usually easy to debug as well since you can easily get detailed error messages.
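
For readers who haven't seen the idea, here is a deliberately tiny, hand-rolled sketch in Python. It is not any particular library's API; every name (`char_in`, `many1`, `seq`, `parse_date`) is hypothetical, and error reporting is left out to keep it short:

```python
# A parser is a function (text, pos) -> (value, new_pos), or None on failure.

def char_in(allowed):
    """Parser that accepts a single character from the allowed set."""
    def parse(text, pos):
        if pos < len(text) and text[pos] in allowed:
            return text[pos], pos + 1
        return None
    return parse

def many1(p):
    """Parser that applies p one or more times and joins the results."""
    def parse(text, pos):
        values = []
        result = p(text, pos)
        while result is not None:
            value, pos = result
            values.append(value)
            result = p(text, pos)
        return ("".join(values), pos) if values else None
    return parse

def seq(*parsers):
    """Parser that runs each parser in order and collects the values."""
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

# Small parsers combine into bigger ones, which is the point of the technique.
digits = many1(char_in("0123456789"))
dash = char_in("-")
date = seq(digits, dash, digits, dash, digits)

def parse_date(text):
    """Parse 'YYYY-MM-DD'-shaped input into a tuple of ints, or return None."""
    result = date(text, 0)
    if result is None or result[1] != len(text):
        return None
    year, _, month, _, day = result[0]
    return int(year), int(month), int(day)

print(parse_date("2019-08-08"))  # (2019, 8, 8)
print(parse_date("2019/08/08"))  # None
```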


Regex is handy for client-side JS validation of fields, such as checking that a field is an integer. You could use a parser combinator, but why combinators (pun!) when a simple regex will do?


I need to validate an input string from the user to ensure it's a 32-character hex string (I'm validating it's a NAA format 6 WWN), and regex is just the easy, fast and efficient way to do this. It's not a complicated regex and is quite readable to anyone with basic regex knowledge, so why is this wrong?
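
A minimal sketch of that kind of check in Python. It only tests for exactly 32 hex digits; the names `HEX32` and `is_hex32` are made up, and any further WWN-specific rules are not covered here:

```python
import re

# Exactly 32 hexadecimal digits, upper or lower case.
HEX32 = re.compile(r'[0-9a-fA-F]{32}')

def is_hex32(value):
    """Return True if the whole string is exactly 32 hex digits."""
    return HEX32.fullmatch(value) is not None

print(is_hex32('6' + '0' * 31))  # True
print(is_hex32('6000'))          # False
print(is_hex32('g' * 32))        # False
```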


I don't have experience with parsers, so I cannot comment on how easy/difficult it is compared to regex.

Regex can indeed lead to issues [1] but so could any other piece of code. So, I disagree that regex shouldn't be used in production. This article [2] by Jeff Atwood gives a balanced view of when to use regex and some nice tips.

[1] https://new.blog.cloudflare.com/details-of-the-cloudflare-ou...

[2] https://blog.codinghorror.com/regular-expressions-now-you-ha...


Can you elaborate a bit on this please? I'd be interested in resources on writing better parsers!

Quickly looking at the python standard lib (urlparse, shlex, etc) and Python packages (NLTK Treebank tokenizer), a lot of packages related to slicing, dicing and parsing strings use a mashup of regex and rule based code.


No experience with them in python, but look also for PEG grammars. They are significantly simpler than the traditional tower of lexer + ambiguous limited lookahead grammar.

Plus the memoized Packrat algorithm allows throwing in functions with custom conditions. (Somewhat like parser combinators, they also support custom logic.)


Parser combinators are the way to go. I don't use python often enough to know if such a library exists for python, but I would assume so.


> regex will result in fragile/hard to understand code

Sure, if you don't know what you're doing.



