Regex: find all sentences with a citation

I've found this code to detect all citations in a text:

author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"
year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"
regex = fr'\b(?!(?:Although|Also)\b){author}{additional}*{year}'

It's actually working great, but I need to find the whole sentence that contains each citation (from where it starts, after the previous period, to the period that ends it). So in this example:

"Nothing is here. In this line, actually, there is a citation (Author et al., 2022). Once again, In this line there is nothing."

I'd like to get this: "In this line, actually, there is a citation (Author et al., 2022)."

How should I edit the above code to achieve this?



Solution 1:[1]

You can use the following regular expression:

r"\s*([^.]+(?=\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?))(\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?)"
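As a quick check, here is a minimal sketch that applies this pattern to the example text from the question. The variable names (`pattern`, `text`, `found`) are only for illustration; the pattern itself is copied unchanged from above.

```python
import re

# Pattern copied verbatim from the answer above, split over two lines for readability.
pattern = re.compile(
    r"\s*([^.]+(?=\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?))"
    r"(\([\w ,.]+(, *\?)?(\d{4}|\d{2})\)\.?)"
)

text = ("Nothing is here. In this line, actually, there is a citation "
        "(Author et al., 2022). Once again, In this line there is nothing.")

# group(0) includes the whitespace consumed by the leading \s*, so strip it.
found = [m.group(0).strip() for m in pattern.finditer(text)]
print(found)
```

Note that this pattern only finds a parenthesized citation that closes the sentence, and the text before the citation must not contain a period.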


Solution 2:[2]

You need to solve the problem in two steps: a) break the text into sentences, b) detect sentences with a citation. Sentence tokenization is non-trivial to do right, so use a library to do it. For example:

>>> import nltk
>>> text = "Nothing is here. In this line, actually, there is a citation (Author et al., 2022). Once again, In this line there is nothing."
>>> sentences = nltk.sent_tokenize(text)
>>> print(sentences)
['Nothing is here.', 'In this line, actually, there is a citation (Author et al., 2022).', 'Once again, In this line there is nothing.']

Then, using your definitions:

>>> import re
>>> citation = fr"{author}{additional}*{year}"
>>> for s in sentences:
...     if re.search(citation, s):
...         print(s)
...
In this line, actually, there is a citation (Author et al., 2022).

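PS. If you've never used nltk before, you'll need a one-time download of the data for the sentence tokenizer. You'll see an error message telling you what to run; do it once and you're done. (Newer nltk releases may ask for a different resource name, such as 'punkt_tab'; use whatever name the error message gives.)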

nltk.download('punkt')
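If you want to try the two-step idea before installing anything, here is a minimal sketch using only the standard library. It assumes a naive splitter, `naive_split`, which is a hypothetical helper and not part of nltk. It is much weaker than `nltk.sent_tokenize` (for example, it would split "Smith et al. (2020)" after "al."), so treat it as a placeholder. The function name `sentences_with_citations` is also just for illustration.

```python
import re

# Citation pattern from the question, written with raw strings throughout.
author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = f"(?:,? (?:(?:and |& )?{author}|{etal}))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"
year = fr"(?:, *{year_num}{page_num}| *\({year_num}{page_num}\))"
CITATION = re.compile(fr"\b(?!(?:Although|Also)\b){author}{additional}*{year}")


def naive_split(text):
    """Hypothetical stand-in splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentences_with_citations(text, split=naive_split):
    """Return the sentences of `text` that contain at least one citation."""
    return [s for s in split(text) if CITATION.search(s)]


text = ("Nothing is here. In this line, actually, there is a citation "
        "(Author et al., 2022). Once again, In this line there is nothing.")
result = sentences_with_citations(text)
print(result)
```

Because the splitter is a parameter, you can pass `nltk.sent_tokenize` as `split` once nltk is set up, and the rest stays the same.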

Solution 3:[3]

Try with this one:

(?<=\. )[^(\.]+\(([^)]+)\).*?\. 

Explanation:

  • (?<=\. ): lookbehind that requires a dot and a space immediately before the match
  • [^(\.]+ : one or more characters other than an open parenthesis or a dot
  • \( : open parenthesis
  • ([^)]+) : capturing group of one or more characters other than a closing parenthesis
  • \) : closing parenthesis
  • .*? : lazy match of any further characters (possibly none)
  • \. : dot followed by a space

Corner cases that this solution is not able to address:

  • <space><dot><word> (like .dotnet) appearing inside the sentence before the parenthesis: the dot is treated as a sentence boundary.
  • <word><dot><space> (like e.g.) appearing inside the sentence after the parenthesis: the <dot><space> is treated as the end of the sentence.

One way to address these corner cases is to preprocess the raw text first, transforming or removing any abbreviations.
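Here is a small sketch of how this pattern behaves, using the character class `[^(\.]+` exactly as listed in the explanation. The second input, `text2`, is a made-up sentence used only to reproduce the "e.g." corner case.

```python
import re

# Pattern as broken down in the explanation; note the trailing space after the last dot.
pattern = re.compile(r"(?<=\. )[^(\.]+\(([^)]+)\).*?\. ")

text = ("Nothing is here. In this line, actually, there is a citation "
        "(Author et al., 2022). Once again, In this line there is nothing.")
m = pattern.search(text)
sentence = m.group(0).strip()   # the match ends with a space, so strip it
inside = m.group(1)             # text between the parentheses
print(sentence)
print(inside)

# Corner case: an abbreviation after the parenthesis cuts the sentence short.
text2 = "Start. See the result (Smith, 2020), e.g. in the appendix. End."
truncated = pattern.search(text2).group(0).strip()
print(truncated)
```

Keep in mind that this pattern matches any parenthesized text, not only citations, so the citation regex from the question is still needed if other parentheticals appear in the text.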


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3