Wednesday, 19 March 2025

Pandas

5.1 Pandas

Pandas is one of the most powerful and widely used libraries in Python for data manipulation and analysis. It provides easy-to-use data structures and functions to efficiently handle large datasets.

Why Use Pandas?

  • Efficient data handling with DataFrames and Series
  • Supports CSV, Excel, SQL, JSON, and many other file formats
  • Powerful data cleaning, manipulation, and transformation tools
  • Built-in statistical and analytical functions
  • Easy integration with NumPy, Matplotlib, and other libraries
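The last point can be seen in a minimal sketch (the data and column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

# NumPy and Pandas interoperate directly: a NumPy ufunc applied to a
# whole Series returns a new Series, no explicit loop needed.
df = pd.DataFrame({'price': [10.0, 20.0, 30.0]})
df['log_price'] = np.log(df['price'])
print(df)
```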

Installing Pandas

If you don’t have Pandas installed, you can install it using:


pip install pandas

1. Importing Pandas


import pandas as pd

2. Creating DataFrames

A DataFrame is a table-like structure that consists of rows and columns.

a) Creating a DataFrame from a Dictionary


data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

Output:


      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

b) Creating a DataFrame from a CSV File


df = pd.read_csv('data.csv')
print(df.head())  # Display first 5 rows

3. Basic DataFrame Operations

a) Checking Data Information


print(df.info())      # Summary of the dataset
print(df.describe())  # Statistical summary
print(df.shape)       # Rows and columns count

b) Selecting Columns


print(df['Name'])           # Selecting a single column
print(df[['Name', 'Age']])  # Selecting multiple columns

c) Selecting Rows


print(df.iloc[0])              # Selecting the first row
print(df.loc[df['Age'] > 30])  # Filtering rows where Age > 30

4. Data Manipulation

a) Adding a New Column


df['Salary'] = [50000, 60000, 70000]
print(df)

b) Updating Values


df.loc[df['Name'] == 'Alice', 'Age'] = 26

c) Dropping a Column


df.drop(columns=['Salary'], inplace=True)

d) Handling Missing Data


df.fillna(0, inplace=True)  # Replace NaN values with 0
df.dropna(inplace=True)     # Remove rows with NaN values

5. Grouping and Aggregation


df_grouped = df.groupby('City')['Age'].mean()
print(df_grouped)

6. Merging and Joining DataFrames


df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 2], 'Salary': [50000, 60000]})
df_merged = pd.merge(df1, df2, on='ID')
print(df_merged)

7. Exporting Data


df.to_csv('output.csv', index=False)     # Save as CSV
df.to_excel('output.xlsx', index=False)  # Save as Excel

Conclusion

Pandas is an essential tool for data analysis and manipulation in Python. It simplifies handling and processing of structured data, making it a must-learn library for data science and machine learning.

Pandas Series

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floating point numbers, Python objects, etc.). It is one of the core data structures in Pandas and serves as the building block for DataFrames.


Key Characteristics

  • Labeled Data: Each element in a Series has an associated label (called an index). This makes data access and alignment easier.
  • Homogeneous Data: Unlike DataFrames, a Series is designed to store data of a single type.
  • Vectorized Operations: A Series supports vectorized operations, so you can operate on an entire Series at once without explicitly looping over its elements.
  • Integration with NumPy: A Series is built on top of NumPy arrays, providing both speed and functionality.
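Two of these properties, labels and vectorization, combine in a behavior worth knowing: arithmetic between two Series aligns on index labels, not on positions. A small sketch with made-up values:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['c', 'b', 'a'])

# Addition matches labels, so 'a' pairs 1 with 30, not 1 with 10
total = s1 + s2
print(total)
```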

Creating a Pandas Series

1. From a List

You can create a Series from a list. By default, the index will be a range of integers starting at 0.


import pandas as pd

# Creating a Series from a list
data = [10, 20, 30, 40]
series_from_list = pd.Series(data)
print(series_from_list)

Output:


0    10
1    20
2    30
3    40
dtype: int64

2. With Custom Index Labels

You can also specify custom labels for the Series.


# Creating a Series with custom index labels
series_custom_index = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(series_custom_index)

Output:


a    10
b    20
c    30
d    40
dtype: int64

3. From a Dictionary

When creating a Series from a dictionary, the keys become the index labels and the values become the data.


# Creating a Series from a dictionary
data_dict = {'apple': 3, 'banana': 5, 'orange': 2}
series_from_dict = pd.Series(data_dict)
print(series_from_dict)

Output:

apple     3
banana    5
orange    2
dtype: int64

Basic Operations on Series

Accessing Elements

Accessing data in a Series can be done using the index label or the integer position.


# Accessing elements using custom index labels
print(series_custom_index['b'])     # Output: 20

# Accessing elements by position
print(series_custom_index.iloc[2])  # Output: 30

Slicing

You can slice a Series much like a list. Note one difference: label-based slices include the end label, whereas positional slices exclude the endpoint.


# Slicing the Series (label-based slicing includes the end label)
print(series_custom_index['b':'d'])

Output:


b    20
c    30
d    40
dtype: int64

Vectorized Operations

Series allow arithmetic operations directly on the entire array.


# Arithmetic operations on Series
series_doubled = series_custom_index * 2
print(series_doubled)

Output:


a    20
b    40
c    60
d    80
dtype: int64

Handling Missing Data

If the Series contains missing data, Pandas provides methods to handle them.


# Create a Series with missing data
data_with_nan = [1, 2, None, 4]
series_with_nan = pd.Series(data_with_nan)
print(series_with_nan)

# Fill missing values
filled_series = series_with_nan.fillna(0)
print(filled_series)

Output:


0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64
0    1.0
1    2.0
2    0.0
3    4.0
dtype: float64

Advanced Usage

Applying Functions

You can apply functions to each element in the Series using the apply() method.


# Define a function to square a number
def square(x):
    return x * x

# Apply the function to the Series
squared_series = series_from_list.apply(square)
print(squared_series)

Output:


0     100
1     400
2     900
3    1600
dtype: int64

Boolean Indexing

Filtering a Series based on a condition is straightforward.


# Filter Series where values are greater than 20
filtered_series = series_from_list[series_from_list > 20]
print(filtered_series)

Output:

2    30
3    40
dtype: int64
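Conditions can also be combined. A quick sketch (note the parentheses and the & / | operators, which Pandas requires in place of Python's and / or):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

# Each condition must be parenthesized before combining with & or |
filtered = s[(s > 15) & (s < 40)]
print(filtered.tolist())
```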

Conclusion

Pandas Series are a fundamental component for data manipulation and analysis. Their integration with Pandas DataFrames and NumPy makes them a versatile and efficient tool for handling one-dimensional data. Whether you’re performing data cleaning, transformation, or statistical analysis, mastering Series is essential for any data science or analytical task in Python.

Pandas DataFrames are a core data structure in the pandas library, designed for handling and manipulating structured data efficiently. They resemble tables in databases or spreadsheets and provide powerful tools for data analysis.

Creating a DataFrame

You can create a DataFrame from various sources:

1. From a Dictionary

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

2. From a List of Lists

data = [['Alice', 25, 'New York'],
        ['Bob', 30, 'Los Angeles'],
        ['Charlie', 35, 'Chicago']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

3. From a CSV File

df = pd.read_csv('data.csv')
print(df.head())  # Display the first 5 rows

Basic Operations on DataFrames

1. Viewing Data

df.head()      # First 5 rows
df.tail()      # Last 5 rows
df.info()      # Summary of DataFrame
df.describe()  # Statistical summary of numeric columns

2. Selecting Columns

print(df['Name'])           # Select a single column
print(df[['Name', 'Age']])  # Select multiple columns

3. Selecting Rows

print(df.iloc[0])          # Select first row (by position)
print(df.loc[0])           # Select first row (by index label)
print(df[df['Age'] > 28])  # Filter rows

4. Adding and Removing Columns

df['Salary'] = [50000, 60000, 70000]     # Add new column
df.drop('Salary', axis=1, inplace=True)  # Remove column

5. Sorting Data

df.sort_values(by='Age', ascending=False, inplace=True)

6. Handling Missing Data

df.fillna(0)  # Replace NaN with 0 (returns a new DataFrame; assign it or pass inplace=True)
df.dropna()   # Remove rows with NaN (also returns a new DataFrame)

Merging and Grouping

1. Merging Two DataFrames


df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Salary': [50000, 60000, 70000]})
merged_df = pd.merge(df1, df2, on='ID')
print(merged_df)

2. Grouping and Aggregation


df.groupby('City')['Age'].mean()  # Group by city and get average age

Pandas DataFrames are extremely powerful for data manipulation and analysis.

Reading Various Types of Data Using Pandas

Pandas provides functions to read various types of data files and convert them into DataFrames for easy manipulation. Here’s how you can read different file formats using pandas.


1. Reading CSV Files (.csv)

CSV (Comma-Separated Values) files are the most common data format.

import pandas as pd


df = pd.read_csv('data.csv')  # Read a CSV file

print(df.head())  # Display first 5 rows

  • Common Parameters:
    • sep=',' → Specify a different delimiter (e.g., sep=';' for semicolon-separated files).
    • header=None → If there’s no header row.
    • names=['col1', 'col2'] → Specify column names.
    • index_col=0 → Use the first column as the index.
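A sketch combining these parameters, using an in-memory string in place of a file (the contents and column names are made up):

```python
import io

import pandas as pd

# A semicolon-separated file with no header row (hypothetical contents)
raw = "1;Alice;25\n2;Bob;30\n"

df = pd.read_csv(
    io.StringIO(raw),             # a file path works the same way
    sep=';',                      # semicolon delimiter
    header=None,                  # the file has no header row
    names=['ID', 'Name', 'Age'],  # supply column names ourselves
    index_col=0,                  # use the ID column as the index
)
print(df)
```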

2. Reading Excel Files (.xlsx, .xls)

Pandas can read Excel files using read_excel().

df = pd.read_excel('data.xlsx', sheet_name='Sheet1')  # Read an Excel file

print(df.head())

  • Common Parameters:
    • sheet_name='Sheet1' → Specify which sheet to read.
    • usecols=['A', 'B'] → Read specific columns.

Requires the openpyxl library for .xlsx files:
Install with pip install openpyxl
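A round-trip sketch (it writes a small throwaway workbook first so the read has something to load; requires openpyxl):

```python
import pandas as pd

# Create a demo workbook with three columns (illustrative data)
pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]}).to_excel(
    'demo.xlsx', sheet_name='Sheet1', index=False
)

# Read back only columns A and B from Sheet1
df = pd.read_excel('demo.xlsx', sheet_name='Sheet1', usecols=['A', 'B'])
print(df)
```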


3. Reading JSON Files (.json)

JSON (JavaScript Object Notation) is commonly used in web APIs.

df = pd.read_json('data.json')  # Read a JSON file

print(df.head())

  • If the JSON file is a list of records rather than a column-oriented mapping, specify the orientation:

df = pd.read_json('data.json', orient='records')


4. Reading SQL Databases

You can directly load data from SQL databases into a DataFrame.

import sqlite3

conn = sqlite3.connect('database.db')  # Connect to a database

df = pd.read_sql_query("SELECT * FROM table_name", conn)

print(df.head())

For MySQL, PostgreSQL, etc., use the sqlalchemy library.
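A sketch of the SQLAlchemy route, using an in-memory SQLite database so it runs anywhere; for MySQL or PostgreSQL you would only change the connection URL:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite; e.g. 'postgresql://user:pass@host/dbname' for Postgres
engine = create_engine('sqlite://')

# Seed a table so the query below has something to read (made-up data)
pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']}).to_sql(
    'employees', engine, index=False
)

df = pd.read_sql_query('SELECT * FROM employees', engine)
print(df)
```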


5. Reading Parquet Files (.parquet)

Parquet is a high-performance binary format optimized for large datasets.

df = pd.read_parquet('data.parquet')

print(df.head())

Requires the pyarrow or fastparquet library:
Install with pip install pyarrow fastparquet


6. Reading HTML Tables

If a webpage contains tables, pandas can extract them.

url = "https://example.com/data"

df_list = pd.read_html(url)

df = df_list[0]  # Select the first table

print(df.head())


7. Reading XML Files (.xml)

df = pd.read_xml('data.xml')

print(df.head())

Requires a parser: lxml by default (install with pip install lxml), or the standard library's xml.etree.ElementTree via parser='etree'.


8. Reading Plain Text Files (.txt)

If data is stored in a text file with delimiters, use read_csv() with sep.

df = pd.read_csv('data.txt', sep='\t')  # Read tab-separated values

print(df.head())


9. Reading Pickle Files (.pkl)

Pickle files store Python objects in a serialized format.

df = pd.read_pickle('data.pkl')

print(df.head())


Summary Table

File Format   Function
CSV           pd.read_csv('file.csv')
Excel         pd.read_excel('file.xlsx')
JSON          pd.read_json('file.json')
SQL           pd.read_sql_query('SQL QUERY', connection)
Parquet       pd.read_parquet('file.parquet')
HTML          pd.read_html(url)
XML           pd.read_xml('file.xml')
TXT           pd.read_csv('file.txt', sep='\t')
Pickle        pd.read_pickle('file.pkl')

  
