How to extract all Date variants from real-world data using python.

Joachim Kuleafenu
Analytics Vidhya
Published in
4 min readJun 28, 2021

--

Your practical definitive guide to mining all Date variants from both structured and unstructured data using data mining techniques in python.

Photo by insuring Yoon on Unsplash

Whether you are a Data Scientist or an Analysts, you may one day encounter a problem that has to do with date and time.

Lifes becomes easier when the date is well cleaned and structured in a single column of a data frame like 06/18/2021, 09/27/2000. Unfortunately, it doesn’t always happen that way.

How do you extract dates when you get something like 7/20 or even in a sentence like ‘He came to the Hospital on 2/21 at dawn’

Before we get overwhelmed, let get started!!

Project Goal
Our goal is to correctly identify all of the different date variants encoded in a dataset and to properly normalize and sort them.

Things to learn
1. Perform basic operations on medical data using Pandas.
2. Extract all date variants using python regular expressions ( Regex).
3. Sort the dates in ascending chronological order according to some outlined rules.
4. Summary

Below are some of the date variants we may have to deal with:

  • 04/20/2009; 04/20/09; 4/20/09; 4/3/09
  • Mar-20–2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
  • 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
  • Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
  • Feb 2009; Sep 2009; Oct 2010
  • 6/2008; 12/2009
  • 2009; 2010

Let’s learn some useful regular expressions to prepare us for the journey

[]: matches one of the sets of characters within []

[a-z]: matches one of the range characters a,b, …,z

a|b: matches either a or b, where a and b are strings

() : Scoping for operators

\: Escape character for special characters (\t, \n, \b)

\b: Matches word boundary

\d: Any digit, equivalent to [0–9]

\w: Alphanumeric character, equivalent to [a-zA-Z0–9_]

\s: Any whitespace

*’: matches zeros or more occurrences

+’: matches one or more occurrences

?: matches zero or one occurrence

{n} :exactly n repetitions,n≥0

{n,} :at least n repetitions

{,n} :at most n repetitions

{m,n} :at least m and at most n repetitions

Some useful functions

re.search(pattern,string,flags=0) : This function searches for the first occurrence of the RE pattern within a string with optional flags.

re.match(pattern, string, flags=0) : This function attempts to match the RE pattern to string with optional flags.

  1. Import libraries

NB: re is the python regular expression

2. Loading data

3. Date Extraction

i. We extract dates with the following format: 04/20/2009; 04/20/09; 4/20/09; 4/3/09

ii. Extract dates with formats # Mar-20–2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;

re.I Performs case-insensitive matching.

re.M Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).

iii. Extract dates with format # 6/2008; 12/2009

iv. Extract dates with the format; # 2009; 2010

v. Concatenate all the series

vi. Let's put all the above together in a single function

4. Reformating the extracted dates

Although we have extracted our dates, there are some outlined rules to put the date into good shape.

Rule #1
Assume all dates in xx/xx/xx format is mm/dd/yy

Rule #2
Assume all dates where year is encoded in only two digits are years from the 1900s (e.g. 1/5/89 is January 5th, 1989). Let’s do this conversion.

Rule #3
If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009). Let's do this:

Rule #4
If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010). Let's do this:

Rule #5
Watch out for potential typos as this is a raw, real-life derived dataset.

5. Converting date in words to numbers

We need to sort the data into a specific chronological way. Because of that, all dates written in words would have to be converted to numbers. eg. January as 1, February as 2 … December as 12.

Let us write some function to tackle this.

Now let’s Combine the entire code parts into one whole

By calling the function date_sorter() , all the dates would be extracted from the text and be arranged in ascending order of magnitude as shown below.

6. Summary

  • we learnt the required regular expression for this project.
  • We introduced some useful re functions thus match and search.
  • All the date variants were extracted from the given document.
  • With specific rules, the data was reformated
  • All the dates in words were converted to digits for sorting
  • The extracted dates were then sorted in ascending order of magnitude.

--

--

Joachim Kuleafenu
Analytics Vidhya

Software Engineer. I build smart features with Machine Learning techniques.