12 Regex on Excel Files:

Download sample.xlsx excel file to use in the upcoming practice session:

Download the sample file from the above link and let’s start learning the usage of regular expression with excel files.

12.1 `Openpyxl Module`:

Install openpyxl module and then run import openpyxl in python shell/Python IDLE.

pip install openpyxl

The openpyxl module is a Python library used for working with Excel files (both .xlsx and .xlsm formats). It allows you to create, modify, and extract data from Excel spreadsheets. Some key features and use cases of openpyxl include:

Reading and Writing Excel Files: You can use openpyxl to create new Excel files, modify existing ones, and read data from Excel files.
Working with Worksheets: It provides functionality to create, delete, and manipulate worksheets within an Excel workbook.
Cell Manipulation: You can read and write cell values, apply formatting, and perform various operations on individual cells.
Formatting: openpyxl supports formatting options, such as font styles, colors, borders, and more.
Charts: You can create charts and graphs within Excel using the data provided through openpyxl.
Formula Support: It allows you to set and manipulate cell formulas within Excel.
Styling: You can apply styles to cells, rows, and columns to enhance the visual appearance of the Excel sheet.

Example of creating Excel files, read & write some of its data.

import openpyxl

# Create a new workbook
workbook = openpyxl.Workbook()

# Create a sheet inside the workbook
sheet1 = workbook.create_sheet('asadpro')

# Write data to a cell
for i in range(1,21):
    sheet1[f'A{i}'] = i

# Save the workbook
workbook.save('demo.xlsx')

# To open an existing workbook

read_sheet = openpyxl.load_workbook('demo.xlsx')

# Display the name of sheets inside the workbook
print(read_sheet.sheetnames)

# Accessing the sheet of our choice from the above 
reading = read_sheet['asadpro']

# Read data from a specific cell
print(reading['A3'].value) # Output: 3

# Another alternative to the above method of reading cell value using 'cell'

print(reading.cell(row=3,column=1).value) # Output: 3

Here’s another simple example of how you might use openpyxl to load an Excel file and read its data:

import openpyxl

path_to_file = r'C:\Users\User\Desktop\Machine Learning Foundations A Comprehensive Guide from Python to Mathematics\project files\sample.xlsx'

# Loading the existing Workbook
workbook = openpyxl.load_workbook(filename=path_to_file)

# Displaying all the worksheets in a single workbook
# print(workbook.sheetnames)


sheet = workbook['EmployeeData']

# Displaying the number of rows and columns
print(sheet.dimensions)

data = []

for row in sheet.values:
    a,b,c,d,e,f,g = row
    data.append(f"{a};{b};{c};{d};{e};{f};{g}")

print('\n'.join(data))

12.2 Practice 1:

Condition: Fetch the last name of the employee with a salary between 24,000 and 29,999 rupees.

import openpyxl
import re

path_to_file = r'C:\Users\User\Desktop\Machine Learning Foundations A Comprehensive Guide from Python to Mathematics\project files\sample.xlsx'

# Loading the existing Workbook
workbook = openpyxl.load_workbook(filename=path_to_file)

sheet = workbook['EmployeeData']

data = []

for value in sheet.values:
    a,b,c,d,e,f,g = value
    data.append(f'{a};{b};{c};{d};{e};{f};{g}')

employees = '\n'.join(data)

pattern = r'\d{1,};.+?;(.+?);.+;(2[4-9]\d{3})'

result = re.findall(pattern, employees)

for element in result:
    print(element)

Let’s break down the given regular expression step by step:

\d{1,};.+?;(.+?);.+;(2[4-9]\d{3})

\d{1,}: This part matches one or more digits. \d represents any digit (0-9), and {1,} specifies that there should be at least one digit.
.+?;: This part matches one or more of any character (.) in a non-greedy manner (+?) until a semicolon (;) is encountered.
(.+?);: This part captures one or more of any character in a non-greedy manner until a semicolon is encountered. The parentheses () denote a capturing group, allowing you to extract the content matched by this part later.
.+;: This part matches one or more of any character until a semicolon is encountered.
(2[4-9]\d{3}): This part captures a pattern that starts with a 2 followed by a digit from 4 to 9 ([4-9]) and is followed by exactly three more digits (\d{3}). This is enclosed in parentheses to create a capturing group.

12.3 Practice 2:

Fetch all the employees from the Excel file whose last names are no more than 5 characters long & who work in the IT or Marketing department. Condition 1: Last name <= 5 characters Condition 2: IT or Marketing department

import openpyxl
import re

path_to_file = r'C:\Users\User\Desktop\Machine Learning Foundations A Comprehensive Guide from Python to Mathematics\project files\sample.xlsx'

# Loading the existing Workbook
workbook = openpyxl.load_workbook(filename=path_to_file)

sheet = workbook['EmployeeData']

data = []

for value in sheet.values:
    a,b,c,d,e,f,g = value
    data.append(f'{a};{b};{c};{d};{e};{f};{g}')

employees = '\n'.join(data)

pattern = r'\d{1,};.+?;(.{1,5});(?:IT|Marketing);.+'

result = re.findall(pattern, employees)
print(result)

Let’s break down the given regular expression step by step:

\d{1,};.+?;(.{1,5});(?:IT|Marketing);.+

\d{1,}: This part matches one or more digits. \d represents any digit (0-9), and {1,} specifies that there should be at least one digit.
;: This matches a semicolon.
.+?;: This matches one or more of any character in a non-greedy manner until the next semicolon.
(.{1,5}): This part captures a group of characters with a length between 1 and 5. The capturing group is denoted by parentheses. The dot . matches any character, and {1,5} specifies a length between 1 and 5 characters.
;: This matches another semicolon.
(?:IT|Marketing): This is a non-capturing group (?: ... ) that matches either “IT” or “Marketing”.
;: This matches another semicolon.
.+: This matches one or more of any character.

12.4 Practice 3:

Condition #1: First name should start with a letter P or a subsequent letter in the english alphabet. Condition #2: First digit of the phone no should starts with even digit and the ending digit should end with odd digit.

import openpyxl
import re

path_to_file = r'C:\Users\User\Desktop\Machine Learning Foundations A Comprehensive Guide from Python to Mathematics\project files\sample.xlsx'

# Loading the existing Workbook
workbook = openpyxl.load_workbook(filename=path_to_file)

sheet = workbook['EmployeeData']

data = []

for value in sheet.values:
    a,b,c,d,e,f,g = value
    data.append(f'{a};{b};{c};{d};{e};{f};{g}')

employees = '\n'.join(data)
print(employees[:150])

pattern = r'\d{1,};([P-Z].+?);.+;[2468]\d{8}[13579];.+'

result = re.findall(pattern, employees)
print(result)

Let’s break down the given regular expression step by step:

\d{1,};([P-Z].+?);.+;[2468]\d{8}[13579];.+

\d{1,}: This part matches one or more digits. \d represents any digit (0-9), and {1,} specifies that there should be at least one digit.
;: This matches a semicolon.
([P-Z].+?): This part captures a group of characters. The capturing group is denoted by parentheses. It starts with a character range [P-Z], meaning any uppercase letter from ‘P’ to ‘Z’. The dot . matches any character, and +? makes it match one or more characters in a non-greedy manner (matching the minimum possible).
;: This matches another semicolon.
.+;: This matches one or more of any character until the next semicolon.
[2468]\d{8}[13579]: This part matches a specific pattern of digits:
- [2468]: Any one digit among 2, 4, 6, or 8.
- \d{8}: Exactly 8 digits.
- [13579]: Any one digit among 1, 3, 5, 7, or 9.
;: This matches another semicolon.
.+: This matches one or more of any character.

12.5 Practice 4:

Condition #1: Obtain all the employees who works in Sales department and live in New York. Condition #2: Fetch each employee his/her first & last name as well as his/her phone number.

import openpyxl
import re

path_to_file = r'C:\Users\User\Desktop\Machine Learning Foundations A Comprehensive Guide from Python to Mathematics\project files\sample.xlsx'

# Loading the existing Workbook
workbook = openpyxl.load_workbook(filename=path_to_file)

sheet = workbook['EmployeeData']

data = []

for value in sheet.values:
    a,b,c,d,e,f,g = value
    data.append(f'{a};{b};{c};{d};{e};{f};{g}')

employees = '\n'.join(data)

pattern = r'\d{1,};(.+);(.+);(Sales);(\d{10});(?:.*New York.*);.+'

result = re.findall(pattern, employees)

for element in result: 
    print(element)

Let’s break down the given regular expression step by step:

\d{1,};(.+);(.+);(Sales);(\d{10});(?:.*New York.*);.+

\d{1,}: This part matches one or more digits at the beginning. \d represents any digit (0-9), and {1,} specifies that there should be at least one digit.
;: This matches a semicolon.
(.+): This is the first capturing group, which matches one or more of any character (except for a newline) and captures it for later use.
;: This matches another semicolon.
(.+): This is the second capturing group, similar to the first one, capturing one or more of any character.
;: This matches another semicolon.
(Sales): This matches the exact string “Sales”.
;: This matches another semicolon.
(\d{10}): This is the third capturing group, which matches exactly 10 digits and captures them.
;: This matches another semicolon.
(?:.*New York.*): This is a non-capturing group that matches any characters (zero or more) followed by the string “New York” and then any characters (zero or more).
;: This matches another semicolon.
.+: This matches one or more of any character.

12.6 Practice 5:

Condition: Match any employee as long as his/her Address does not end in Miami

import openpyxl
import re

path_to_file = r'C:\Users\User\Desktop\Machine Learning Foundations A Comprehensive Guide from Python to Mathematics\project files\sample.xlsx'

# Loading the existing Workbook
workbook = openpyxl.load_workbook(filename=path_to_file)

sheet = workbook['EmployeeData']

data = []

for value in sheet.values:
    a,b,c,d,e,f,g = value
    data.append(f'{a};{b};{c};{d};{e};{f};{g}')

employees = '\n'.join(data)

print(employees[:150])
pattern = r'\d{1,};.+;(.+);.+;.+;.+, (?!Miami)'

result = re.findall(pattern, employees)

for element in result: 
    print(element)
    
print(len(result))

Let’s break down the given regular expression with negative lookahead:

\d{1,};.+;(.+);.+;.+;.*, (?!Miami)

\d{1,}: This part matches one or more digits at the beginning. \d represents any digit (0-9), and {1,} specifies that there should be at least one digit.
;: This matches a semicolon.
.+: This matches one or more of any character (except for a newline), representing the content between the first and second semicolons.
;: This matches another semicolon.
(.+): This is the first capturing group, which matches one or more of any character and captures it for later use.
;: This matches another semicolon.
.+: This matches one or more of any character, representing the content between the second and third semicolons.
;: This matches another semicolon.
.+: This matches one or more of any character, representing the content between the third and fourth semicolons.
;: This matches another semicolon.
.+: This matches one or more of any character, representing the content between the fourth and fifth semicolons.
,: This matches a comma followed by a space.
(?!Miami): This is a negative lookahead assertion. It checks that what follows the comma and space is not the exact string “Miami”.
.*: This matches zero or more of any character after the negative lookahead.

12.1 Openpyxl Module:

12.2 Practice 1:

12.3 Practice 2:

12.4 Practice 3:

12.5 Practice 4:

12.6 Practice 5:

12.1 `Openpyxl Module`: