10 Projects of Regular Expressions:
10.1 Project Bookshelf:
Download the book.txt text file to use in the upcoming practice session:
import re
with open(r'project files\book.txt','r') as file:
    string = file.read()
# Exercise 1: Match all the authors whose book titles are shorter than 25 characters
# {1,24} allows 1 to 24 characters, i.e. strictly fewer than 25
pattern = r'.+?;(.{1,24});.+?'
result = re.findall(pattern, string)
print(result)
# Exercise 2: Match the authors who published their books after 2000
# pattern0 = r'.+;.+?;(2[0-9][0-9][0-9])'
# Alternative to the above regex
pattern0 = r'.+;.+?;(2\d\d\d)'
result0 = re.findall(pattern0, string)
print(result0)
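The two exercises can also be tried on a small in-memory sample. The following is a minimal sketch, assuming an "author;title;year" layout with one book per line; the sample rows and that layout are hypothetical, and the real book.txt may be organised differently. It anchors each match to one line and captures the author field directly.

```python
import re

# Hypothetical sample in an assumed "author;title;year" layout;
# the real book.txt may differ.
sample = """Jane Doe;Short Title;1999
John Roe;A Considerably Longer Book Title Than Most;2005
Ann Poe;Brief;2010"""

# ^...$ with re.MULTILINE ties each match to a single line, and [^;\n]
# keeps a field from running past a semicolon or a line break.
short_title = re.compile(r'^([^;\n]+);[^;\n]{1,24};\d{4}$', re.MULTILINE)
after_2000 = re.compile(r'^([^;\n]+);[^;\n]+;(2\d{3})$', re.MULTILINE)

authors_short = short_title.findall(sample)    # authors with titles under 25 characters
authors_recent = after_2000.findall(sample)    # (author, year) pairs for years 2000-2999

print(authors_short)
print(authors_recent)
```

On this sample the first list holds 'Jane Doe' and 'Ann Poe', and the second holds the pairs for 'John Roe' (2005) and 'Ann Poe' (2010).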
10.2 Project Phone Book:
Download the phone.txt text file to use in the upcoming practice session:
import re
with open(r'project files\phone.txt','r') as file:
    string = file.read()
# Exercise 1: Access the last name and the phone number of entries whose area code ends with zero
# In short (LastName + area code ending with 0 + phone number)
pattern = r'.+ (.+)\s{1,}\(\d\d0\)\s(.+)'
# result = re.findall(pattern, string)
# for tup in result:
#     print(tup)
# Exercise 2: Fetch all the area codes of phone numbers whose last digit is 7
pattern0 = r'.+\s+(\(\d\d\d\)) \d{3}-\d{3}7'
result = re.findall(pattern0, string)
# With a single capturing group, findall returns plain strings
for area_code in result:
    print(area_code)
10.3 Project Date and Time:
Download the logs.txt text file to use in the upcoming practice session:
import re
with open(r'project files\logs.txt','r') as file:
    string = file.read()
# Exercise 1: Search for all the log entries generated between 11 and 16 Jan 2020 and extract the source part of the
# entry, meaning the component that causes the error e.g.
# Here ?: marks a non-capturing group, which omits the time from the result
# The inner (?:AM|PM) limits the alternation to the AM/PM marker
pattern = r'(Critical 1/1[1-6]/2020) (?:\d+:\d+:\d+ (?:AM|PM)) (.+) \(\d+\) .+'
result = re.findall(pattern, string)
for tup in result:
    print(tup)
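One detail about grouping is worth illustrating: `|` splits everything inside the group that encloses it, so where the alternation sits changes what is matched. Below is a minimal sketch on two hypothetical log lines (the real logs.txt layout may differ) comparing a loosely scoped alternation with a nested one.

```python
import re

# Hypothetical log lines; the real logs.txt layout may differ.
sample = """Critical 1/12/2020 9:15:02 AM Disk (7) Volume is corrupt
Critical 1/14/2020 4:40:51 PM Kernel-Power (41) Unexpected reboot"""

# Inside this group the two choices are "<time> AM" or the bare text "PM".
loose = r'(Critical 1/1[1-6]/2020) (?:\d+:\d+:\d+ AM|PM) (.+) \(\d+\) .+'

# Nesting a second non-capturing group limits the choice to AM versus PM.
scoped = r'(Critical 1/1[1-6]/2020) (?:\d+:\d+:\d+ (?:AM|PM)) (.+) \(\d+\) .+'

loose_hits = re.findall(loose, sample)
scoped_hits = re.findall(scoped, sample)

print(len(loose_hits), len(scoped_hits))
for entry in scoped_hits:
    print(entry)
```

On this sample the loose pattern finds only the AM entry, while the scoped pattern finds both entries and returns the date part and the source ('Disk', 'Kernel-Power') for each.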
10.4 Project Web Scraping:
Download the web.txt text file to use in the upcoming practice session:
import re
with open(r'project files\web.txt','r') as file:
    string = file.read()
# Extract the web addresses of e-commerce sites only
pattern = r'.+\s+(https://.+\.\w{2,}\s.+Online shopping.+)'
result = re.findall(pattern, string)
for line in result:
    print(line)
10.5 Project Stock:
Download the stocks.txt text file to use in the upcoming practice session:
import re
with open(r'project files\stocks.txt','r') as file:
    string = file.read()
# Exercise 1: Match all the companies whose revenue is less than 50 billion dollars
# [1-4]?[0-9] accepts a one- or two-digit integer part from 0 to 49
pattern = r'(.+)\s+\d+\.\d+M\s+[1-4]?[0-9]\.\d+B\s+.+'
result = re.findall(pattern, string)
# With a single capturing group, findall returns plain strings
for company in result:
    print(company)
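Regular expressions match text, not numbers, so a numeric limit has to be spelled out digit by digit. An alternative is to capture the figure and compare it in Python. The sketch below assumes hypothetical rows in a "name, value in M, revenue in B, change" layout; the column meanings and sample values are illustrative only and the real stocks.txt may differ.

```python
import re

# Hypothetical rows; the column layout of the real stocks.txt may differ.
sample = """Acme Corp     12.50M   7.20B    +1.2%
Globex Inc    8.10M    49.90B   -0.4%
Initech       3.30M    125.00B  +0.9%"""

# Capture the company name and the revenue figure (the number before "B").
row = re.compile(r'^(.+?)\s+\d+\.\d+M\s+(\d+\.\d+)B\s+.+$', re.MULTILINE)

# Do the "less than 50" check numerically instead of inside the regex.
under_50b = [name for name, revenue in row.findall(sample) if float(revenue) < 50]

print(under_50b)
```

Here the threshold lives in ordinary Python, so changing it to another value does not require rewriting the pattern.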
10.6 Project: Log File Analysis Tool
Overview:
Develop a Python script that analyzes a server log file. The script should read a log file containing HTTP requests and extract useful information such as IP addresses, request types (GET, POST), URLs, status codes, and timestamps.
Steps:
- Read the Log File: Open and read the log file line by line.
- Pattern Matching:
  - IP Address: Use a regex to match and extract IP addresses.
  - Timestamp: Extract the date and time of each request.
  - Request Type: Extract the request method (GET, POST, etc.).
  - Status Code: Extract the HTTP status code.
  - Optionally, extract the URL requested.
Tips:
- Start by writing regex patterns to accurately extract each piece of information.
- Use Python’s re module to find matches within each log line.
- Store extracted data in an appropriate data structure (e.g., dictionaries, lists).
This project will challenge your understanding of regular expressions and give you practical experience with data extraction and analysis. It’s a great way to see the power of regex in parsing and understanding large volumes of text data.
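The steps above can be sketched as follows. This is a minimal starting point, not a finished tool: it assumes the log uses the Common Log Format, and the names `LOG_LINE`, `parse_log` and the sample lines are hypothetical. If your server writes a different format, the pattern needs adjusting.

```python
import re
from collections import Counter

# Assumed Common Log Format; adjust the pattern if the server logs differ.
LOG_LINE = re.compile(
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) \S+ \S+ '   # client IP, then two ignored fields
    r'\[(?P<timestamp>[^\]]+)\] '                 # timestamp between square brackets
    r'"(?P<method>[A-Z]+) (?P<url>\S+)[^"]*" '    # request method and URL
    r'(?P<status>\d{3})'                          # three-digit status code
)

def parse_log(lines):
    """Return one dict per line that matches LOG_LINE; other lines are skipped."""
    entries = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            entries.append(match.groupdict())
    return entries

# Hypothetical sample lines standing in for a real log file.
sample_lines = [
    '192.168.1.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.7 - - [10/Oct/2023:13:56:01 +0000] "POST /login HTTP/1.1" 302 512',
    'malformed line without the expected fields',
]

entries = parse_log(sample_lines)
status_counts = Counter(entry['status'] for entry in entries)

for entry in entries:
    print(entry['ip'], entry['method'], entry['url'], entry['status'])
print(status_counts)
```

Because `parse_log` accepts any iterable of lines, an open file object can be passed in directly to read a real log line by line. Named groups keep each extracted field labelled, which makes the resulting dictionaries easy to count or filter.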
10.7 Project Email & Phone Number Extraction:
Overview:
This project involves extracting phone numbers and email addresses from a text source, cleaning up the data, and storing it in tabular form in both an Excel file (.xlsx) and a text file (.txt). The extracted information is first added to a dictionary, where phone numbers are used as keys and email addresses as values. Carriage return characters (\r) are removed from the keys in the dictionary to ensure clean data. The final step saves the processed data to both the Excel file and the text file for easy reference.
Steps:
- Copy the content of the ‘phonebook.txt’ file to the clipboard.
- Use regular expressions to find and extract phone numbers and email addresses from the clipboard content.
- Organize the extracted data into a dictionary, where phone numbers are keys and email addresses are values.
- Clean up the dictionary by removing carriage return characters from the keys.
- Create an Excel workbook and sheet, inserting the phone numbers into one column and email addresses into another.
- Save the Excel workbook as ‘phoneEmail.xlsx’.
- Create a PrettyTable to display the data in a tabular format.
- Write the tabular data to a text file named ‘phoneEmail.txt’.
- Optionally, use Pyperclip to copy the cleaned dictionary to the clipboard for easy access.
Tips:
- Ensure that the ‘pyperclip’, ‘openpyxl’, and ‘prettytable’ modules are installed before running the script.
- Verify that the ‘phonebook.txt’ file contains the expected data and is accessible.
- Review the generated ‘phoneEmail.xlsx’ Excel file and ‘phoneEmail.txt’ text file to confirm the output.
- Experiment with the code and modify it based on specific project requirements or preferences.
By following these steps, you can efficiently extract, clean, and organize phone numbers and email addresses from a text source, storing the information in both Excel and text files for convenient reference.
First, copy the text of phonebook.txt to the clipboard. It must stay on the clipboard while the script runs, because the script reads its input from there.
Download phonebook.txt text file to use it in the upcoming practice session:
Condition 1: Get the text off the clipboard.
Condition 2: Find all phone numbers and email addresses in the text.
Condition 3: Paste them onto the clipboard.
import re
import pyperclip
# Click on the above link and copy the content of phonebook.txt to the clipboard.
# Step 1: Get the text off the clipboard.
content = pyperclip.paste()
# Step 2: Find all phone numbers and email addresses in the text.
# Create a regex for email address
email_regex = r"(?:Email: )(.*@.+\.\w{2,3})"
# Create a regex for phone number
phone_regex = r'(?:Phone: )(.*)'
# Search for phone number in string
result1 = re.findall(phone_regex, content)
# Search for email address in string
result2 = re.findall(email_regex, content)
# Combine both lists into a dictionary (phone number -> email address)
d = dict(zip(result1,result2))
# Remove the carriage return character (\r) from every key to clean up the dictionary
clean_d = {key.rstrip() : value for key, value in d.items()}
# Iterating through the key-value pairs of the dictionary
# for key, value in clean_d.items():
#     print(key, " : ", value)
# Further expanding this project by adding the above content to an Excel sheet and a text file (in tabular format)
# ------------------------------------------------------------
# Adding the above dictionary to an Excel file in tabular format
from openpyxl import Workbook
# Create workbook/spreadsheet
workbook = Workbook()
# Select the active (default) sheet
sheet = workbook.active
# Inserting the phone numbers in one column and the email addresses in another column
# The following loop is explained below.
for index, value in enumerate(clean_d.items(),start=1):
    sheet[f'A{index}'] = value[0]
    sheet[f'B{index}'] = value[1]
# Saving the excel file
workbook.save('phoneEmail.xlsx')
# ------------------------------------------------------------
# Now adding the above content to a text file in tabular format
from prettytable import PrettyTable
table = PrettyTable(['Phone Number','Email Address'])
for key, value in clean_d.items():
    table.add_row([key, value])
with open('phoneEmail.txt', 'w') as file:
    file.write(str(table))
# ------------------------------------------------------------
# Step 3: Paste them onto the clipboard.
# pyperclip.copy(str(clean_d))
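To see why the dictionary keys need cleaning, the extraction step can be tried without the clipboard. The sketch below uses a hypothetical in-memory text with Windows-style line endings; its "Name/Phone/Email" layout is an assumption and the real phonebook.txt may differ.

```python
import re

# Hypothetical stand-in for the clipboard text; the layout is an assumption
# and the real phonebook.txt may differ.
content = (
    "Name: Ada\r\nPhone: 555-0100\r\nEmail: ada@example.com\r\n"
    "Name: Bob\r\nPhone: 555-0199\r\nEmail: bob@example.org\r\n"
)

phones = re.findall(r'Phone: (.*)', content)
emails = re.findall(r'Email: (.*@.+\.\w{2,3})', content)

# "." matches "\r" but not "\n", so each captured phone number still ends
# with a carriage return at this point.
raw = dict(zip(phones, emails))

# rstrip() removes the trailing "\r" from every key.
clean = {phone.rstrip(): email for phone, email in raw.items()}

print(repr(phones[0]))
print(clean)
```

The email pattern stops at the top-level domain, so only the phone numbers pick up the stray `\r`. Note that `zip` pairs the two lists by position, so this approach relies on every record containing both a phone number and an email address.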
Explanation of the loop that fills the Excel sheet:
for index, value in enumerate(clean_d.items(),start=1):
    sheet[f'A{index}'] = value[0]
    sheet[f'B{index}'] = value[1]
Let’s break down the code step by step:
- enumerate(clean_d.items(), start=1): iterates over the items of the clean_d dictionary and returns pairs of the form (index, (key, value)). The start=1 argument makes the enumeration start from 1 instead of 0, which matches the row numbering of the Excel sheet.
- for index, value in ...: loops through the enumerated items. index is the row number, and value is the (key, value) tuple, so value[0] is the key (the phone number) and value[1] is the value (the email address). Writing the loop header as for index, (key, value) in ... unpacks the tuple directly and is equivalent.
- sheet[f'A{index}'] = value[0]: assigns the key to the cell in column ‘A’ and row index of the Excel sheet. The f-string builds the cell reference (for example 'A1') from the value of index.
- sheet[f'B{index}'] = value[1]: similarly, assigns the value to the cell in column ‘B’ and row index of the Excel sheet.
The loop continues for each key-value pair in clean_d, and the corresponding entries are added to the ‘A’ and ‘B’ columns in the Excel sheet.
In summary, the code iterates through the key-value pairs of the dictionary, assigns each key to column ‘A’ and each value to column ‘B’ in the Excel sheet, and increments the row index for each iteration. This way, each key-value pair gets its own row in the Excel sheet.
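The same iteration can be tried without openpyxl. In this minimal sketch a plain dictionary named `cells` stands in for the worksheet, and the sample entries are hypothetical; the loop unpacks each (key, value) pair directly in its header.

```python
# Hypothetical cleaned dictionary standing in for the script's result.
clean_d = {'555-0100': 'ada@example.com', '555-0199': 'bob@example.org'}

# A plain dict stands in for the worksheet, so openpyxl is not needed here.
cells = {}
for row, (phone, email) in enumerate(clean_d.items(), start=1):
    cells[f'A{row}'] = phone   # column A holds the key
    cells[f'B{row}'] = email   # column B holds the value

print(cells)
```

After the loop, 'A1' and 'B1' hold the first pair and 'A2' and 'B2' hold the second, which mirrors how the rows of the Excel sheet are filled.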