13 Pandas module:
Let’s break down the pandas library in simple terms with analogies.
Pandas as a Data Organizer:
Imagine you have a big box of LEGO bricks. Each LEGO brick is like a piece of data. Now, you want to build something cool with these bricks, but managing them in the box can be messy. Here’s where pandas comes in.
Series - The Single LEGO Stack:
- A Series is like a single stack of LEGO bricks. It’s organized and labeled. Each brick (data point) has its place, and you can easily refer to them by their position or label.
import pandas as pd
# Creating a Series
temperatures = pd.Series([25, 28, 24, 30, 22], name='Temperature')
temperatures
Just like a stack of LEGO bricks neatly arranged, a Series keeps your data in order.
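To make the "position or label" idea concrete, here is a minimal sketch. It reuses the same temperatures but adds day names as the index; those labels are an assumption made for illustration and are not part of the example above.

```python
import pandas as pd

# Same temperatures as above; the day labels are a hypothetical index added for illustration
temperatures = pd.Series([25, 28, 24, 30, 22],
                         index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
                         name='Temperature')

by_position = temperatures.iloc[1]   # the second brick in the stack
by_label = temperatures.loc['Tue']   # the brick labeled 'Tue'

print(by_position, by_label)
```

Both lookups reach the same brick: .iloc counts positions from zero, while .loc uses the label.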
DataFrame - The LEGO Structure:
- Now, imagine you want to build something more complex, like a spaceship. A DataFrame is like a structured LEGO creation. It consists of multiple stacks (Series), each representing a specific aspect of your project.
import pandas as pd
# Creating a DataFrame
data = {'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
        'Temperature': [25, 28, 24, 30, 22]}
weather_df = pd.DataFrame(data)
weather_df
In this analogy, your spaceship (DataFrame) has different sections (columns) for the day and temperature, and each section is like a well-organized stack of LEGO bricks (Series).
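The claim that each section of the spaceship is itself a stack can be checked directly. The sketch below rebuilds the same weather_df and pulls out one column; the variable name temperature_column is only illustrative.

```python
import pandas as pd

data = {'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
        'Temperature': [25, 28, 24, 30, 22]}
weather_df = pd.DataFrame(data)

# Pulling out one column gives back a single stack: a Series
temperature_column = weather_df['Temperature']

print(type(temperature_column).__name__)
print(temperature_column.max())
```

Because the column is a Series, everything that works on a Series (such as max) works on a single DataFrame column too.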
Let’s look at another simple example of creating a DataFrame and break the code down step by step:
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame([[1000, 'steve', 86],
                   [1001, 'mathew', 91],
                   [1002, 'jose', 72],
                   [1003, 'patty', 69],
                   [1004, 'vin', 88]],
                  columns=['Regd.No', 'Name', 'Marks%'],
                  index=['ID1', 'ID2', 'ID3', 'ID4', 'ID5'])
# Displaying the DataFrame
print(df)
Explanation:
- Importing Pandas: import pandas as pd imports the pandas library and gives it the alias pd. This alias is commonly used for brevity.
- Creating a DataFrame: pd.DataFrame(...) creates a DataFrame. The data is provided as a list of lists, where each inner list represents one row of data.
- Data Values: The inner lists contain the values for ‘Regd.No’ (Registration Number), ‘Name’, and ‘Marks%’ respectively for each student.
- Columns: columns=['Regd.No', 'Name', 'Marks%'] specifies the column names for the DataFrame.
- Index: index=['ID1', 'ID2', 'ID3', 'ID4', 'ID5'] sets custom index values for the DataFrame. Each index value labels one row.
- Displaying the DataFrame: print(df) prints the DataFrame to the console.
Resulting DataFrame:
     Regd.No    Name  Marks%
ID1     1000   steve      86
ID2     1001  mathew      91
ID3     1002    jose      72
ID4     1003   patty      69
ID5     1004     vin      88
So, the DataFrame df is a table with columns ‘Regd.No’, ‘Name’, and ‘Marks%’, and custom row indices ‘ID1’ through ‘ID5’. It holds each student’s registration number, name, and percentage marks, and each row corresponds to a different student.
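The custom row indices are not just decoration; they can be used to look rows up. A minimal sketch, assuming the same df as above (the names student and top_mark are illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1000, 'steve', 86],
                   [1001, 'mathew', 91],
                   [1002, 'jose', 72],
                   [1003, 'patty', 69],
                   [1004, 'vin', 88]],
                  columns=['Regd.No', 'Name', 'Marks%'],
                  index=['ID1', 'ID2', 'ID3', 'ID4', 'ID5'])

# Look up one student by the custom row label
student = df.loc['ID3']
print(student['Name'], student['Marks%'])

# Look up a single cell by row label and column name
top_mark = df.loc['ID2', 'Marks%']
print(top_mark)
```

Selecting a whole row with .loc returns a Series whose labels are the column names, which is why student['Name'] works.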
Note: From now on I will be using Jupyter Notebook for data analysis, because it is made for that purpose. To install Jupyter Notebook, go to the Jupyter Notebook website.
Creating a DataFrame and adding some data to it:
# Give the pandas module the alias pd, just as you would give a nickname to a person
import pandas as pd
df = pd.DataFrame([[1000, 'steve', 86],
                   [1001, 'mathew', 91],
                   [1002, 'jose', 72],
                   [1003, 'patty', 69],
                   [1004, 'vin', 88]],
                  columns=['Regd.No', 'Name', 'Marks%'],
                  index=['ID1', 'ID2', 'ID3', 'ID4', 'ID5'])
df
Reading the data from a webpage
import pandas as pd
url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
# read_html returns a list of DataFrames, one per table found on the page
d = pd.read_html(url)
d[1]
Now we will use the above data to perform various operations on it.
Regular expressions work on strings, so you have to convert the DataFrame into a string first.
content = d[1].to_string()
content[:1000]
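To see what to_string() produces without needing a network connection, here is a small sketch that uses two rows of the student DataFrame from earlier instead of the Wikipedia table. The pattern at the end is a hypothetical example written for this small table only.

```python
import re
import pandas as pd

df = pd.DataFrame([[1000, 'steve', 86],
                   [1001, 'mathew', 91]],
                  columns=['Regd.No', 'Name', 'Marks%'],
                  index=['ID1', 'ID2'])

content = df.to_string()
print(type(content).__name__)   # a plain str, so the re module can search it

lines = content.split('\n')     # header line plus one line per row
print(len(lines))

# Each data line starts with its row label, so a pattern can anchor on '\n' plus the label
labels = re.findall(r'\n(ID\d+)\s', content)
print(labels)
```

This is why the patterns in the scenarios below begin with \n followed by digits: in the string form of the Wikipedia table, every row starts on a new line with its numeric index.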
Scenario 1: Return the data types that are mutable
import re
content = d[1].to_string()
pattern = r'\n\d{1,}\s{1,}(.+?)\s{1,}mutable.+'
result = re.findall(pattern, content)
result
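The sketch below runs the same pattern on a small hand-written string, so the behavior can be checked offline. The sample text is hypothetical: it only imitates the layout of to_string() output and is not the actual content of the Wikipedia table.

```python
import re

# Hypothetical sample imitating to_string() output (not the real Wikipedia table)
sample = (
    "    Type     Mutability    Description\n"
    "0   bool     immutable     Boolean value\n"
    "1   list     mutable       Can contain mixed types\n"
    "2   set      mutable       Unordered set\n"
    "3   str      immutable     A character string"
)

pattern = r'\n\d{1,}\s{1,}(.+?)\s{1,}mutable.+'
result = re.findall(pattern, sample)
print(result)
```

Rows marked immutable are skipped because the pattern requires whitespace immediately before the word mutable, and in "immutable" that word is preceded by "im".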
Scenario 2: Return the data types that use the curly braces format
pattern2 = r'\d{1,}\s{1,}(\w+?)\s{1,}.+\{.*\}'
result = re.findall(pattern2, content)
result
Scenario 3: Extract data type names that are 3 or 4 characters long, e.g. int, set, dict, list and so on.
pattern3 = r'\n\d{1,}\s{1,}(\w{3,4})\s{1,}.+'
result = re.findall(pattern3, content)
result
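As a quick check of how the {3,4} quantifier behaves, the sketch below applies the same pattern to a hypothetical sample string (again only imitating the to_string() layout, not the real table).

```python
import re

# Hypothetical sample imitating to_string() output (not the real Wikipedia table)
sample = (
    "    Type        Mutability\n"
    "0   bool        immutable\n"
    "1   bytearray   mutable\n"
    "2   int         immutable\n"
    "3   float       immutable\n"
    "4   dict        mutable"
)

pattern3 = r'\n\d{1,}\s{1,}(\w{3,4})\s{1,}.+'
result = re.findall(pattern3, sample)
print(result)
```

Longer names such as bytearray and float are not truncated to their first four letters: the \s{1,} right after the group demands whitespace, so a name only matches when the whole name is 3 or 4 characters.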
Scenario 4: Return the description for the rows whose ID is odd
pattern4 = r'\n(\d*[13579])\s{1,}.+?\s{1,}\w+\s{1,}(.{,10}).+'
result = re.findall(pattern4, content)
result
Scenario 5: Return all the data types whose syntax part contains at least one decimal number
pattern5 = r'\n\d{1,}\s{1,}(\w+).+\s{1,}.+(\d+\.\d+)'
result = re.findall(pattern5, content)
result