5.1 Pandas
Pandas is one of the most powerful and widely used libraries in Python for data manipulation and analysis. It provides easy-to-use data structures and functions to efficiently handle large datasets.
Why Use Pandas?
- Efficient data handling with DataFrames and Series
- Supports CSV, Excel, SQL, JSON, and many other file formats
- Powerful data cleaning, manipulation, and transformation tools
- Built-in statistical and analytical functions
- Easy integration with NumPy, Matplotlib, and other libraries
Installing Pandas
If you don’t have Pandas installed, you can install it using:
pip install pandas
1. Importing Pandas
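Once installed, Pandas is conventionally imported under the alias pd. A minimal check:

```python
import pandas as pd

# Print the installed version to confirm the import works
print(pd.__version__)
```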
2. Creating DataFrames
A DataFrame is a table-like structure that consists of rows and columns.
a) Creating a DataFrame from a Dictionary
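A minimal sketch, with illustrative column names and values:

```python
import pandas as pd

# Build a DataFrame from a dictionary: keys become column names,
# values become the column data.
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris']
}
df = pd.DataFrame(data)
print(df)
```

Each list must have the same length; Pandas assigns a default integer index (0, 1, 2, ...) to the rows.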
b) Creating a DataFrame from a CSV File
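A sketch of reading a CSV; the file name 'data.csv' is a placeholder, and the example writes a small file first so it runs end to end:

```python
import pandas as pd

# Write a small CSV so the example is self-contained;
# in practice you would already have a file such as 'data.csv'.
with open('data.csv', 'w') as f:
    f.write('Name,Age\nAlice,25\nBob,30\n')

df = pd.read_csv('data.csv')
print(df.head())  # head() shows the first 5 rows
```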
3. Basic DataFrame Operations
a) Checking Data Information
b) Selecting Columns
c) Selecting Rows
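The three operations above, sketched with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# a) Checking data information
df.info()                      # Column types and non-null counts
print(df.describe())           # Summary statistics for numeric columns

# b) Selecting columns
ages = df['Age']               # Single column -> Series
subset = df[['Name', 'Age']]   # List of columns -> DataFrame

# c) Selecting rows
first_row = df.iloc[0]         # By integer position
by_label = df.loc[0]           # By index label (default labels are 0..n-1)
adults = df[df['Age'] > 28]    # By condition
print(adults)
```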
4. Data Manipulation
a) Adding a New Column
b) Updating Values
c) Dropping a Column
d) Handling Missing Data
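One sketch covering the four manipulation steps above (data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# a) Adding a new column
df['Salary'] = [50000, 60000]

# b) Updating values
df.loc[df['Name'] == 'Alice', 'Age'] = 26

# c) Dropping a column
df = df.drop(columns=['Salary'])

# d) Handling missing data
df.loc[2] = ['Charlie', None]      # Introduce a row with a missing Age
df['Age'] = df['Age'].fillna(0)    # Replace missing values with a default
# Alternatively: df = df.dropna() drops rows containing missing values
print(df)
```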
5. Grouping and Aggregation
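Grouping splits rows by a key column and aggregates each group. A sketch with made-up department data:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', 'IT', 'HR', 'HR'],
    'Salary': [60000, 70000, 50000, 55000]
})

# Group rows by department and compute the mean of each group
avg_salary = df.groupby('Department')['Salary'].mean()
print(avg_salary)

# Several aggregations at once
summary = df.groupby('Department')['Salary'].agg(['mean', 'max', 'count'])
print(summary)
```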
6. Merging and Joining DataFrames
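Merging combines two DataFrames on a shared key column, like a SQL join. A sketch with hypothetical employee tables:

```python
import pandas as pd

employees = pd.DataFrame({'EmpID': [1, 2, 3],
                          'Name': ['Alice', 'Bob', 'Charlie']})
salaries = pd.DataFrame({'EmpID': [1, 2, 4],
                         'Salary': [50000, 60000, 70000]})

# Inner join: keep only EmpIDs present in both frames
merged = pd.merge(employees, salaries, on='EmpID', how='inner')
print(merged)

# Left join: keep all employees; missing salaries become NaN
left = pd.merge(employees, salaries, on='EmpID', how='left')
print(left)
```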
7. Exporting Data
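Writing a DataFrame back out mirrors the read functions; the file names below are placeholders:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Write to CSV; index=False omits the row-index column
df.to_csv('output.csv', index=False)

# Write to JSON as a list of row records
df.to_json('output.json', orient='records')

# Read back to confirm the round trip
print(pd.read_csv('output.csv'))
```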
Conclusion
Pandas is an essential tool for data analysis and manipulation in Python. It simplifies handling and processing of structured data, making it a must-learn library for data science and machine learning.
Pandas Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floating point numbers, Python objects, etc.). It is one of the core data structures in Pandas and serves as the building block for DataFrames.
Key Characteristics
- Labeled Data: Each element in a Series has an associated label (called an index). This makes data access and alignment easier.
- Homogeneous Data: Unlike DataFrames, a Series is designed to store data of a single type.
- Vectorized Operations: Series support vectorized operations, meaning you can operate on the entire series without explicitly looping over its elements.
- Integration with NumPy: A Series is built on top of NumPy arrays, providing both speed and functionality.
Creating a Pandas Series
1. From a List
You can create a Series from a list. By default, the index will be a range of integers starting at 0.
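A minimal sketch with sample values:

```python
import pandas as pd

# Default integer index 0..3 appears on the left of the printed output
s = pd.Series([10, 20, 30, 40])
print(s)
```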
2. With Custom Index Labels
You can also specify custom labels for the Series.
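For example, using letter labels:

```python
import pandas as pd

# The index argument replaces the default integer labels
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s)
```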
3. From a Dictionary
When creating a Series from a dictionary, the keys become the index labels and the values become the data.
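A sketch (the population figures are illustrative, not authoritative):

```python
import pandas as pd

# Dictionary keys become the index labels, values become the data
populations = {'Tokyo': 37, 'Delhi': 31, 'Shanghai': 27}  # millions, illustrative
s = pd.Series(populations)
print(s)
```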
Basic Operations on Series
Accessing Elements
Accessing data in a Series can be done using the index label or the integer position.
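Both access styles in one sketch:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s['a'])     # By index label
print(s.iloc[1])  # By integer position
```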
Slicing
You can slice a Series similar to how you slice a list.
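One subtlety worth showing: positional slices exclude the endpoint, while label slices include it.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

print(s.iloc[1:3])  # Positional slice: excludes the endpoint -> b, c
print(s['b':'d'])   # Label slice: includes the endpoint -> b, c, d
```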
Vectorized Operations
Series allow arithmetic operations directly on the entire array.
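For example:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

print(s * 2)    # Element-wise multiplication
print(s + 10)   # Element-wise addition
print(s.sum())  # Built-in aggregation
```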
Handling Missing Data
If the Series contains missing data, Pandas provides methods to handle them.
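The common missing-data methods, sketched:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan])

print(s.isna())     # Boolean mask marking missing values
print(s.fillna(0))  # Replace missing values with a default
print(s.dropna())   # Drop missing values entirely
```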
Advanced Usage
Applying Functions
You can apply functions to each element in the Series using the apply() method.
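For example, squaring every element:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# apply() calls the function once per element
squared = s.apply(lambda x: x ** 2)
print(squared)
```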
Boolean Indexing
Filtering a Series based on a condition is straightforward.
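For example:

```python
import pandas as pd

s = pd.Series([5, 15, 25, 35])

# Keep only elements satisfying the condition;
# the original index labels are preserved
filtered = s[s > 20]
print(filtered)
```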
Conclusion
Pandas Series are a fundamental component for data manipulation and analysis. Their integration with Pandas DataFrames and NumPy makes them a versatile and efficient tool for handling one-dimensional data. Whether you’re performing data cleaning, transformation, or statistical analysis, mastering Series is essential for any data science or analytical task in Python.
Pandas DataFrames are a core data structure in the pandas library, designed for handling and manipulating structured data efficiently. They resemble tables in databases or spreadsheets and provide powerful tools for data analysis.
Creating a DataFrame
You can create a DataFrame from various sources:
1. From a Dictionary
2. From a List of Lists
3. From a CSV File
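One sketch covering all three sources; the file and column names are illustrative, and the CSV is written first so the example runs end to end:

```python
import pandas as pd

# 1. From a dictionary: keys become column names
df_dict = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# 2. From a list of lists: column names supplied separately
df_list = pd.DataFrame([['Alice', 25], ['Bob', 30]],
                       columns=['Name', 'Age'])

# 3. From a CSV file (created here for a self-contained example)
with open('people.csv', 'w') as f:
    f.write('Name,Age\nAlice,25\nBob,30\n')
df_csv = pd.read_csv('people.csv')

print(df_dict.equals(df_list))  # Same data either way
```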
Basic Operations on DataFrames
1. Viewing Data
2. Selecting Columns
3. Selecting Rows
4. Adding and Removing Columns
5. Sorting Data
6. Handling Missing Data
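Sorting and missing-data handling, sketched with hypothetical data (viewing, selection, and column changes are shown in the examples earlier in this chapter):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Charlie', 'Alice', 'Bob'],
    'Age': [35, np.nan, 30]
})

# 5. Sorting data
by_name = df.sort_values('Name')                 # Ascending by column
by_age = df.sort_values('Age', ascending=False)  # Descending; NaN sorts last

# 6. Handling missing data
filled = df.fillna({'Age': df['Age'].mean()})    # Impute with the column mean
dropped = df.dropna()                            # Or drop incomplete rows
print(filled)
```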
Merging and Grouping
1. Merging Two DataFrames
2. Grouping and Aggregation
Pandas DataFrames are extremely powerful for data manipulation and analysis.
Reading Various Types of Data Using Pandas
Pandas provides functions to read various types of data files and convert them into DataFrames for easy manipulation. Here’s how you can read different file formats using pandas.
1. Reading CSV Files (.csv)
CSV (Comma-Separated Values) files are the most common data format.
import pandas as pd
df = pd.read_csv('data.csv')  # Read a CSV file
print(df.head())  # Display first 5 rows
- Common Parameters:
- sep=',' → Specify a different delimiter (e.g., sep=';' for semicolon-separated files).
- header=None → If there’s no header row.
- names=['col1', 'col2'] → Specify column names.
- index_col=0 → Use the first column as the index.
2. Reading Excel Files (.xlsx, .xls)
Pandas can read Excel files using read_excel().
df = pd.read_excel('data.xlsx', sheet_name='Sheet1') # Read an Excel file
print(df.head())
- Common Parameters:
- sheet_name='Sheet1' → Specify which sheet to read.
- usecols=['A', 'B'] → Read specific columns.
Requires the openpyxl library for .xlsx files:
Install with pip install openpyxl
3. Reading JSON Files (.json)
JSON (JavaScript Object Notation) is commonly used in web APIs.
df = pd.read_json('data.json') # Read a JSON file
print(df.head())
- If the JSON file is a list of record objects rather than a column-oriented mapping, specify the orientation:
df = pd.read_json('data.json', orient='records')
4. Reading SQL Databases
You can directly load data from SQL databases into a DataFrame.
import sqlite3
conn = sqlite3.connect('database.db') # Connect to a database
df = pd.read_sql_query("SELECT * FROM table_name", conn)
print(df.head())
For MySQL, PostgreSQL, etc., use the sqlalchemy library.
5. Reading Parquet Files (.parquet)
Parquet is a high-performance binary format optimized for large datasets.
df = pd.read_parquet('data.parquet')
print(df.head())
Requires the pyarrow or fastparquet library:
Install with pip install pyarrow fastparquet
6. Reading HTML Tables
If a webpage contains tables, pandas can extract them.
url = "https://example.com/data"
df_list = pd.read_html(url)
df = df_list[0]  # Select the first table
print(df.head())
7. Reading XML Files (.xml)
df = pd.read_xml('data.xml')
print(df.head())
Requires the lxml library (or pass parser='etree' to use the built-in xml.etree.ElementTree):
Install with pip install lxml
8. Reading Plain Text Files (.txt)
If data is stored in a text file with delimiters, use read_csv() with the sep parameter.
df = pd.read_csv('data.txt', sep='\t') # Read tab-separated values
print(df.head())
9. Reading Pickle Files (.pkl)
Pickle files store Python objects in a serialized format.
df = pd.read_pickle('data.pkl')
print(df.head())
Summary Table
| File Format | Function |
| --- | --- |
| CSV | pd.read_csv('file.csv') |
| Excel | pd.read_excel('file.xlsx') |
| JSON | pd.read_json('file.json') |
| SQL | pd.read_sql_query('SQL QUERY', connection) |
| Parquet | pd.read_parquet('file.parquet') |
| HTML | pd.read_html('url') |
| XML | pd.read_xml('file.xml') |
| TXT | pd.read_csv('file.txt', sep='\t') |
| Pickle | pd.read_pickle('file.pkl') |