Wednesday, 19 March 2025

Pandas

5.1 Pandas

Pandas is one of the most powerful and widely used libraries in Python for data manipulation and analysis. It provides easy-to-use data structures and functions to efficiently handle large datasets.

Why Use Pandas?

  • Efficient data handling with DataFrames and Series
  • Supports CSV, Excel, SQL, JSON, and many other file formats
  • Powerful data cleaning, manipulation, and transformation tools
  • Built-in statistical and analytical functions
  • Easy integration with NumPy, Matplotlib, and other libraries
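The last point can be seen in a minimal sketch (the data and column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

# NumPy and Pandas interoperate directly: a NumPy ufunc applied to a
# whole Series returns a new Series, no explicit loop needed.
df = pd.DataFrame({'price': [10.0, 20.0, 30.0]})
df['log_price'] = np.log(df['price'])
print(df)
```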

Installing Pandas

If you don’t have Pandas installed, you can install it using:


pip install pandas

1. Importing Pandas


import pandas as pd

2. Creating DataFrames

A DataFrame is a table-like structure that consists of rows and columns.

a) Creating a DataFrame from a Dictionary


data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

Output:


      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

b) Creating a DataFrame from a CSV File


df = pd.read_csv('data.csv')
print(df.head())  # Display first 5 rows

3. Basic DataFrame Operations

a) Checking Data Information


print(df.info())      # Summary of the dataset
print(df.describe())  # Statistical summary
print(df.shape)       # Rows and columns count

b) Selecting Columns


print(df['Name'])           # Selecting a single column
print(df[['Name', 'Age']])  # Selecting multiple columns

c) Selecting Rows


print(df.iloc[0])              # Selecting the first row
print(df.loc[df['Age'] > 30])  # Filtering rows where Age > 30

4. Data Manipulation

a) Adding a New Column


df['Salary'] = [50000, 60000, 70000]
print(df)

b) Updating Values


df.loc[df['Name'] == 'Alice', 'Age'] = 26

c) Dropping a Column


df.drop(columns=['Salary'], inplace=True)

d) Handling Missing Data


df.fillna(0, inplace=True)  # Replace NaN values with 0
df.dropna(inplace=True)     # Remove rows with NaN values

5. Grouping and Aggregation


df_grouped = df.groupby('City')['Age'].mean()
print(df_grouped)

6. Merging and Joining DataFrames


df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 2], 'Salary': [50000, 60000]})
df_merged = pd.merge(df1, df2, on='ID')
print(df_merged)

7. Exporting Data


df.to_csv('output.csv', index=False)     # Save as CSV
df.to_excel('output.xlsx', index=False)  # Save as Excel

Conclusion

Pandas is an essential tool for data analysis and manipulation in Python. It simplifies handling and processing of structured data, making it a must-learn library for data science and machine learning.

Pandas Series

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floating point numbers, Python objects, etc.). It is one of the core data structures in Pandas and serves as the building block for DataFrames.


Key Characteristics

  • Labeled Data: Each element in a Series has an associated label (called an index). This makes data access and alignment easier.
  • Homogeneous Data: Unlike DataFrames, a Series is designed to store data of a single type.
  • Vectorized Operations: A Series supports vectorized operations, so you can operate on an entire Series at once without explicitly looping over its elements.
  • Integration with NumPy: A Series is built on top of NumPy arrays, providing both speed and functionality.
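Two of these properties, labels and vectorization, combine in a behavior worth knowing: arithmetic between two Series aligns on index labels, not on positions. A small sketch with made-up values:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['c', 'b', 'a'])

# Addition matches labels, so 'a' pairs 1 with 30, not 1 with 10
total = s1 + s2
print(total)
```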

Creating a Pandas Series

1. From a List

You can create a Series from a list. By default, the index will be a range of integers starting at 0.


import pandas as pd

# Creating a Series from a list
data = [10, 20, 30, 40]
series_from_list = pd.Series(data)
print(series_from_list)

Output:


0    10
1    20
2    30
3    40
dtype: int64

2. With Custom Index Labels

You can also specify custom labels for the Series.


# Creating a Series with custom index labels
series_custom_index = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(series_custom_index)

Output:


a    10
b    20
c    30
d    40
dtype: int64

3. From a Dictionary

When creating a Series from a dictionary, the keys become the index labels and the values become the data.


# Creating a Series from a dictionary
data_dict = {'apple': 3, 'banana': 5, 'orange': 2}
series_from_dict = pd.Series(data_dict)
print(series_from_dict)

Output:

apple     3
banana    5
orange    2
dtype: int64

Basic Operations on Series

Accessing Elements

Accessing data in a Series can be done using the index label or the integer position.


# Accessing elements using custom index labels
print(series_custom_index['b'])     # Output: 20

# Accessing elements by position
print(series_custom_index.iloc[2])  # Output: 30

Slicing

You can slice a Series much like a list. Note one difference: label-based slices include the end label, whereas positional slices exclude the endpoint.


# Slicing the Series (label-based slicing includes the end label)
print(series_custom_index['b':'d'])

Output:


b    20
c    30
d    40
dtype: int64

Vectorized Operations

Series allow arithmetic operations directly on the entire array.


# Arithmetic operations on Series
series_doubled = series_custom_index * 2
print(series_doubled)

Output:


a    20
b    40
c    60
d    80
dtype: int64

Handling Missing Data

If the Series contains missing data, Pandas provides methods to handle them.


# Create a Series with missing data
data_with_nan = [1, 2, None, 4]
series_with_nan = pd.Series(data_with_nan)
print(series_with_nan)

# Fill missing values
filled_series = series_with_nan.fillna(0)
print(filled_series)

Output:


0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64
0    1.0
1    2.0
2    0.0
3    4.0
dtype: float64

Advanced Usage

Applying Functions

You can apply functions to each element in the Series using the apply() method.


# Define a function to square a number
def square(x):
    return x * x

# Apply the function to the Series
squared_series = series_from_list.apply(square)
print(squared_series)

Output:


0     100
1     400
2     900
3    1600
dtype: int64

Boolean Indexing

Filtering a Series based on a condition is straightforward.


# Filter Series where values are greater than 20
filtered_series = series_from_list[series_from_list > 20]
print(filtered_series)

Output:

2    30
3    40
dtype: int64
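Conditions can also be combined. A quick sketch (note the parentheses and the & / | operators, which Pandas requires in place of Python's and / or):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

# Each condition must be parenthesized before combining with & or |
filtered = s[(s > 15) & (s < 40)]
print(filtered.tolist())
```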

Conclusion

Pandas Series are a fundamental component for data manipulation and analysis. Their integration with Pandas DataFrames and NumPy makes them a versatile and efficient tool for handling one-dimensional data. Whether you’re performing data cleaning, transformation, or statistical analysis, mastering Series is essential for any data science or analytical task in Python.

Pandas DataFrames are a core data structure in the pandas library, designed for handling and manipulating structured data efficiently. They resemble tables in databases or spreadsheets and provide powerful tools for data analysis.

Creating a DataFrame

You can create a DataFrame from various sources:

1. From a Dictionary

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

2. From a List of Lists

data = [['Alice', 25, 'New York'],
        ['Bob', 30, 'Los Angeles'],
        ['Charlie', 35, 'Chicago']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

3. From a CSV File

df = pd.read_csv('data.csv')
print(df.head())  # Display the first 5 rows

Basic Operations on DataFrames

1. Viewing Data

df.head()      # First 5 rows
df.tail()      # Last 5 rows
df.info()      # Summary of DataFrame
df.describe()  # Statistical summary of numeric columns

2. Selecting Columns

print(df['Name'])           # Select a single column
print(df[['Name', 'Age']])  # Select multiple columns

3. Selecting Rows

print(df.iloc[0])          # Select first row (by position)
print(df.loc[0])           # Select first row (by index label)
print(df[df['Age'] > 28])  # Filter rows

4. Adding and Removing Columns

df['Salary'] = [50000, 60000, 70000]     # Add new column
df.drop('Salary', axis=1, inplace=True)  # Remove column

5. Sorting Data

df.sort_values(by='Age', ascending=False, inplace=True)

6. Handling Missing Data

df.fillna(0)  # Replace NaN with 0 (returns a new DataFrame; assign it or pass inplace=True)
df.dropna()   # Remove rows with NaN (also returns a new DataFrame)

Merging and Grouping

1. Merging Two DataFrames


df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Salary': [50000, 60000, 70000]})
merged_df = pd.merge(df1, df2, on='ID')
print(merged_df)

2. Grouping and Aggregation


df.groupby('City')['Age'].mean()  # Group by city and get average age

Pandas DataFrames are extremely powerful for data manipulation and analysis.

Reading Various Types of Data Using Pandas

Pandas provides functions to read various types of data files and convert them into DataFrames for easy manipulation. Here’s how you can read different file formats using pandas.


1. Reading CSV Files (.csv)

CSV (Comma-Separated Values) files are the most common data format.

import pandas as pd


df = pd.read_csv('data.csv')  # Read a CSV file

print(df.head())  # Display first 5 rows

  • Common Parameters:
    • sep=',' → Specify a different delimiter (e.g., sep=';' for semicolon-separated files).
    • header=None → If there’s no header row.
    • names=['col1', 'col2'] → Specify column names.
    • index_col=0 → Use the first column as the index.
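A sketch combining these parameters, using an in-memory string in place of a file (the contents and column names are made up):

```python
import io

import pandas as pd

# A semicolon-separated file with no header row (hypothetical contents)
raw = "1;Alice;25\n2;Bob;30\n"

df = pd.read_csv(
    io.StringIO(raw),             # a file path works the same way
    sep=';',                      # semicolon delimiter
    header=None,                  # the file has no header row
    names=['ID', 'Name', 'Age'],  # supply column names ourselves
    index_col=0,                  # use the ID column as the index
)
print(df)
```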

2. Reading Excel Files (.xlsx, .xls)

Pandas can read Excel files using read_excel().

df = pd.read_excel('data.xlsx', sheet_name='Sheet1')  # Read an Excel file

print(df.head())

  • Common Parameters:
    • sheet_name='Sheet1' → Specify which sheet to read.
    • usecols=['A', 'B'] → Read specific columns.

Requires the openpyxl library for .xlsx files:
Install with pip install openpyxl
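A round-trip sketch (it writes a small throwaway workbook first so the read has something to load; requires openpyxl):

```python
import pandas as pd

# Create a demo workbook with three columns (illustrative data)
pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]}).to_excel(
    'demo.xlsx', sheet_name='Sheet1', index=False
)

# Read back only columns A and B from Sheet1
df = pd.read_excel('demo.xlsx', sheet_name='Sheet1', usecols=['A', 'B'])
print(df)
```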


3. Reading JSON Files (.json)

JSON (JavaScript Object Notation) is commonly used in web APIs.

df = pd.read_json('data.json')  # Read a JSON file

print(df.head())

  • If the JSON file is a list of records rather than a column-oriented mapping, specify the orientation:

df = pd.read_json('data.json', orient='records')


4. Reading SQL Databases

You can directly load data from SQL databases into a DataFrame.

import sqlite3

conn = sqlite3.connect('database.db')  # Connect to a database

df = pd.read_sql_query("SELECT * FROM table_name", conn)

print(df.head())

For MySQL, PostgreSQL, etc., use the sqlalchemy library.
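A sketch of the SQLAlchemy route, using an in-memory SQLite database so it runs anywhere; for MySQL or PostgreSQL you would only change the connection URL:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite; e.g. 'postgresql://user:pass@host/dbname' for Postgres
engine = create_engine('sqlite://')

# Seed a table so the query below has something to read (made-up data)
pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']}).to_sql(
    'employees', engine, index=False
)

df = pd.read_sql_query('SELECT * FROM employees', engine)
print(df)
```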


5. Reading Parquet Files (.parquet)

Parquet is a high-performance binary format optimized for large datasets.

df = pd.read_parquet('data.parquet')

print(df.head())

Requires the pyarrow or fastparquet library:
Install with pip install pyarrow fastparquet


6. Reading HTML Tables

If a webpage contains tables, pandas can extract them.

url = "https://example.com/data"

df_list = pd.read_html(url)

df = df_list[0]  # Select the first table

print(df.head())


7. Reading XML Files (.xml)

df = pd.read_xml('data.xml')

print(df.head())

Requires a parser: lxml by default (install with pip install lxml), or the standard library's xml.etree.ElementTree via parser='etree'.


8. Reading Plain Text Files (.txt)

If data is stored in a text file with delimiters, use read_csv() with sep.

df = pd.read_csv('data.txt', sep='\t')  # Read tab-separated values

print(df.head())


9. Reading Pickle Files (.pkl)

Pickle files store Python objects in a serialized format.

df = pd.read_pickle('data.pkl')

print(df.head())


Summary Table

File Format   Function
CSV           pd.read_csv('file.csv')
Excel         pd.read_excel('file.xlsx')
JSON          pd.read_json('file.json')
SQL           pd.read_sql_query('SQL QUERY', connection)
Parquet       pd.read_parquet('file.parquet')
HTML          pd.read_html(url)
XML           pd.read_xml('file.xml')
TXT           pd.read_csv('file.txt', sep='\t')
Pickle        pd.read_pickle('file.pkl')

  
