- 🍨 This post is a learning-log entry for the 🔗 365-day deep learning training camp
- 🍖 Original author: K同学啊
I. Knowledge Overview
1. Data Loading and Preprocessing
Key points:
- `pd.read_csv()`: Pandas' core method for reading CSV files
- `pd.to_datetime()`: converts strings to datetime objects
- `dt` accessor: extracts date features (year, month, day)
- `drop()`: removes unneeded columns (Date)
Why it matters: raw data usually contains redundant information and formats that need conversion; preprocessing is the foundation of modeling
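These date-handling steps can be sketched on a tiny toy DataFrame (hypothetical values, not the weatherAUS data):

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2008-12-01", "2010-06-15"], "Rainfall": [0.6, 3.2]})
df["Date"] = pd.to_datetime(df["Date"])   # string -> datetime64
df["year"] = df["Date"].dt.year           # .dt accessor pulls out date parts
df["Month"] = df["Date"].dt.month
df["day"] = df["Date"].dt.day
df = df.drop("Date", axis=1)              # raw column no longer needed
print(df)
```

After this, the frame holds only numeric date features plus the original measurements.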
2. Exploratory Data Analysis (EDA)
Core points:
- Correlation analysis:
  - `select_dtypes()`: selects numeric features only
  - `corr()`: computes pairwise correlation coefficients
  - `sns.heatmap()`: visualizes the correlation matrix
- Distribution analysis:
  - `sns.countplot()`: visualizes the distribution of categorical variables
  - `pd.crosstab()`: builds cross-tabulations for conditional probabilities
- Geographic analysis:
  - `sort_values()`: sorts cities by rain probability
  - `plot(kind="barh")`: horizontal bar chart of per-city rain probability
- Feature relationships:
  - `sns.scatterplot()`: scatter plots of feature pairs
  - `hue` parameter: colors points by the target variable
Why it matters: understanding distributions, spotting patterns, and identifying important features provides the basis for feature engineering
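A minimal sketch of conditional probabilities via `pd.crosstab` on toy data (hypothetical rows); `normalize='index'` converts each row to a conditional distribution in one step:

```python
import pandas as pd

toy = pd.DataFrame({
    "RainToday":    ["No", "No", "No", "Yes", "Yes"],
    "RainTomorrow": ["No", "No", "Yes", "Yes", "No"],
})
# normalize='index' divides each row by its total, giving
# P(RainTomorrow | RainToday) as percentages
cond = pd.crosstab(toy["RainToday"], toy["RainTomorrow"], normalize="index") * 100
print(cond)
```

With these toy counts, the "No" row comes out roughly 66.7% / 33.3% and the "Yes" row 50% / 50%.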
3. Missing-Value Handling
Key techniques:
- `isnull().sum()`: counts missing values per column
- Filling strategies:
  - Random fill: `np.random.choice()` (suits continuous variables)
  - Mode fill: `mode()[0]` (suits categorical variables)
  - Median fill: `median()` (suits numeric variables with outliers)
- `select_dtypes()`: distinguishes numeric columns from categorical ones
Why it matters: real-world data often contains missing values, and sensible imputation avoids biasing the model
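The three filling strategies can be sketched side by side on a toy DataFrame (hypothetical column names; the random fill here uses NumPy's `default_rng` for a seeded draw):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Sunshine": [7.0, np.nan, 9.0, np.nan],   # continuous -> random fill
    "WindDir":  ["N", "N", None, "S"],        # categorical -> mode fill
    "Rainfall": [0.0, 100.0, np.nan, 2.0],    # outlier-prone -> median fill
})

# Random fill: draw replacements from the observed (non-missing) values
pool = toy["Sunshine"].dropna()
toy["Sunshine"] = toy["Sunshine"].fillna(
    pd.Series(rng.choice(pool, size=len(toy)), index=toy.index))

# Mode fill for the categorical column
toy["WindDir"] = toy["WindDir"].fillna(toy["WindDir"].mode()[0])

# Median fill is robust to the 100.0 outlier (median here is 2.0, not the mean)
toy["Rainfall"] = toy["Rainfall"].fillna(toy["Rainfall"].median())

print(toy.isnull().sum().sum())  # 0
```

The median fill is the notable choice: the mean of [0.0, 100.0, 2.0] is 34, while the median 2.0 ignores the outlier.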
4. Feature Encoding
Core methods:
- `LabelEncoder()`: maps categorical labels to integers
- `fit_transform()`: learns the mapping and applies it
Why it matters: neural networks only accept numeric input, so text features must be encoded as numbers
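A minimal sketch of `LabelEncoder` on toy labels; note that the learned classes are sorted, so the integer codes follow alphabetical order:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["Yes", "No", "No", "Yes"])
print(codes)        # integer codes per label
print(le.classes_)  # sorted classes: 'No' -> 0, 'Yes' -> 1
```

Because classes are sorted, "No" becomes 0 and "Yes" becomes 1, which is exactly the 0/1 target a binary classifier expects.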
5. Dataset Preparation
Key steps:
- `train_test_split()`: splits train/test sets (75%/25%)
- `MinMaxScaler()`: scales features to the [0, 1] range
- `fit_transform()`: computes and applies the scaling on the training set
- `transform()`: applies the same scaling to the test set
- PyTorch conversion:
  - `torch.tensor()`: creates tensors
  - `unsqueeze(1)`: reshapes the labels to (N, 1)
- Data loaders:
  - `TensorDataset()`: wraps features and labels together
  - `DataLoader()`: loads data in batches (batch_size=32)
Why it matters: correctly prepared data is a prerequisite for training, and normalization speeds up convergence
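A tiny sketch (toy numbers) of why `fit_transform` belongs on the training set and `transform` on the test set: the min/max are learned once and reused, so test values outside the training range can legitimately fall outside [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0], [5.0], [10.0]])
X_test = np.array([[2.5], [12.0]])          # 12.0 lies outside the train range

scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)   # min/max learned here only
X_test_s = scaler.transform(X_test)         # same parameters reused
print(X_train_s.ravel())  # [0.  0.5 1. ]
print(X_test_s.ravel())   # [0.25 1.2 ]
```

Fitting the scaler on the test set instead would leak test statistics into preprocessing and make the two sets incomparable.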
6. The Neural Network Model
Core PyTorch components:

class WeatherPredictor(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_size, 24),
            nn.Tanh(),
            ...
        )

    def forward(self, x):
        return self.network(x)

- `nn.Module`: base class for all neural networks
- `nn.Sequential`: composes layers in order
- Layer types:
  - `nn.Linear`: fully connected layer
  - `nn.Tanh` / `nn.Sigmoid`: activation functions
  - `nn.Dropout`: regularization against overfitting
- Forward pass: `forward()` defines how data flows through the network

Architecture highlights:
- Input layer → 4 hidden layers (24-18-23-12) → output layer
- Tanh activations: zero-centered, smooth nonlinearity
- Dropout: randomly deactivates neurons at rates 0.5 / 0.2
- Sigmoid output: probability for binary classification
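The architecture can be sketched directly with `nn.Sequential` for a quick shape check (a standalone sketch: the 23-feature input size is assumed to match the prepared dataset here, and the random batch is only for illustration):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(23, 24), nn.Tanh(),
    nn.Linear(24, 18), nn.Tanh(),
    nn.Linear(18, 23), nn.Tanh(), nn.Dropout(0.5),
    nn.Linear(23, 12), nn.Tanh(), nn.Dropout(0.2),
    nn.Linear(12, 1),  nn.Sigmoid(),
)
net.eval()                 # disable Dropout for a deterministic pass
x = torch.randn(4, 23)     # a batch of 4 samples with 23 features
out = net(x)
print(out.shape)           # one probability per sample: (4, 1)
```

The sigmoid at the end guarantees every output lies in [0, 1], which is what `nn.BCELoss` expects.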
7. Training Configuration
Key elements:
- Loss function: `nn.BCELoss()` (binary cross-entropy)
- Optimizer: `optim.Adam(lr=1e-4)`
  - adaptive learning rates
  - momentum terms accelerate convergence
- Training parameters:
  - epochs=30: maximum number of iterations
  - patience=25: early-stopping patience
  - min_delta=0.001: minimum improvement threshold
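The patience / min_delta rule can be isolated as plain Python (a sketch with a hypothetical loss sequence; `early_stop_trace` is an illustrative helper, not part of the training code):

```python
def early_stop_trace(val_losses, patience=3, min_delta=0.001):
    """Return the epoch at which early stopping would trigger."""
    best, counter = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:   # meaningful improvement: reset counter
            best, counter = loss, 0
        else:                         # no (sufficient) improvement
            counter += 1
            if counter >= patience:
                return epoch          # stop here
    return len(val_losses)

print(early_stop_trace([0.40, 0.38, 0.379, 0.379, 0.379]))  # stops at epoch 5
```

Note that an improvement smaller than min_delta (0.38 → 0.379 above) still counts as "no improvement", which keeps tiny noise-level gains from resetting the patience counter.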
8. The Training Loop
Core flow:

for epoch in range(epochs):
    # training phase
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    # validation phase
    model.eval()
    with torch.no_grad():
        for inputs, labels in test_loader:
            ...
    # early-stopping check (pseudocode)
    if val_loss improved:
        save_model()
    else:
        patience_counter += 1

- Training mode: `model.train()` (enables Dropout)
- Evaluation mode: `model.eval()` (disables Dropout)
- Gradient management:
  - `zero_grad()`: clears accumulated gradients
  - `backward()`: backpropagation
  - `step()`: updates the weights
- Gradient-free evaluation: `with torch.no_grad()`
- Early stopping: prevents overfitting and keeps the best checkpoint
9. Performance Evaluation
Metrics:
- Classification metrics:
  - `classification_report`: precision / recall / F1
  - `confusion_matrix`: TP / FP / TN / FN
- Regression-style metrics (applied to the 0/1 predictions):
  - accuracy
  - `r2_score`: coefficient of determination
  - MAE / MSE / RMSE: error measures
  - MAPE: percentage error (unreliable here: MAPE divides by the true value, so 0 labels blow it up)
Visualizations:
- Training curves: loss and accuracy over epochs
- Confusion matrix: distribution of classification errors
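A self-contained sketch of these metrics on toy labels (not the model's actual predictions); note in particular how MAPE explodes when the true labels contain zeros:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, mean_absolute_error,
                             mean_absolute_percentage_error)

y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0, 1])

print(confusion_matrix(y_true, y_pred))  # rows: true 0/1, cols: predicted 0/1
print(mean_absolute_error(y_true, y_pred))  # 2 wrong out of 6 -> 0.333...

# MAPE divides by |y_true|; scikit-learn clips zero denominators to a tiny
# epsilon, so any mistake on a 0-label produces an astronomically large score.
print(mean_absolute_percentage_error(y_true, y_pred))
```

This is why MAPE is not a meaningful metric for 0/1 classification targets, even though MAE and MSE remain interpretable.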
10. Engineering Practices
- Reproducibility: `torch.manual_seed(42)`, `np.random.seed(42)`
- Resource management:
  - mini-batch training (batch_size)
  - tensor dtype (float32)
- Model persistence: `torch.save(model.state_dict(), 'best_model.pth')` and `model.load_state_dict(torch.load('best_model.pth'))`
- Better visualizations:
  - titles / labels / grid lines
  - color schemes / layout tuning
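The reproducibility point can be demonstrated in a few lines (a sketch: re-seeding both libraries reproduces the same random draws):

```python
import numpy as np
import torch

torch.manual_seed(42)
np.random.seed(42)
a = torch.rand(3)          # first run of random draws
n1 = np.random.rand(3)

torch.manual_seed(42)      # re-seed: the generators restart identically
np.random.seed(42)
b = torch.rand(3)          # second run reproduces the first exactly
n2 = np.random.rand(3)

print(torch.equal(a, b), np.allclose(n1, n2))  # True True
```

Seeding matters here because weight initialization, Dropout masks, and the shuffled DataLoader all consume random numbers; fixed seeds make training runs comparable.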
II. Preliminaries
1. Importing Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, r2_score, mean_absolute_error, \
    mean_absolute_percentage_error, mean_squared_error

# PyTorch-related libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
2. Loading the Dataset
# 1. Data loading and preprocessing
data = pd.read_csv("./data/weatherAUS.csv")
df = data.copy()
3. Date Handling
# Convert the Date column to datetime format
data['Date'] = pd.to_datetime(data['Date'])
data['year'] = data['Date'].dt.year
data['Month'] = data['Date'].dt.month
data['day'] = data['Date'].dt.day
data.drop('Date', axis=1, inplace=True)
III. Data Analysis
1. Exploring Feature Correlations
# Compute correlations on numeric columns only
numeric_cols = data.select_dtypes(include=['float64', 'int64']).columns
numeric_data = data[numeric_cols]

plt.figure(figsize=(15, 13))
ax = sns.heatmap(numeric_data.corr(),square=True,annot=True,fmt='.2f')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.title('Feature Correlation Heatmap')
plt.show()
2. Will It Rain?
# Rain distribution plots
sns.set(style="whitegrid", palette="Set2")
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
title_font = {'fontsize': 14, 'fontweight': 'bold', 'color': 'darkblue'}

sns.countplot(x='RainTomorrow', data=data, ax=axes[0], edgecolor='black')
axes[0].set_title('Rain Tomorrow', fontdict=title_font)
axes[0].set_xlabel('Will it Rain Tomorrow?', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)

sns.countplot(x='RainToday', data=data, ax=axes[1], edgecolor='black')
axes[1].set_title('Rain Today', fontdict=title_font)
axes[1].set_xlabel('Did it Rain Today?', fontsize=12)
axes[1].set_ylabel('Count', fontsize=12)
sns.despine()
plt.tight_layout()
plt.show()
3. Rain-Conditional Analysis
# Conditional probability analysis
x = pd.crosstab(data['RainTomorrow'], data['RainToday'])
y = x / x.transpose().sum().values.reshape(2, 1) * 100
print("Conditional Probability Table:")
print(y)
Conditional Probability Table:
RainToday            No        Yes
RainTomorrow
No            84.616648  15.383352
Yes           53.216243  46.783757

From the table (each row is normalized by its RainTomorrow total): among days where it does not rain tomorrow, 84.6% also had no rain today, i.e. dry days tend to cluster together.
y.plot(kind="bar", figsize=(4, 3), color=['#006666', '#d279a6'])
plt.title('Rain Tomorrow Probability by Rain Today')
plt.ylabel('Probability (%)')
plt.show()
4. Rain by Geographic Location
# Per-city rainfall analysis
x = pd.crosstab(data['Location'], data['RainToday'])
y = x / x.transpose().sum().values.reshape((-1, 1)) * 100
y = y.sort_values(by='Yes', ascending=True)

color = ['#cc6699', '#006699', '#006666', '#862d86', '#ff9966']
y.Yes.plot(kind="barh", figsize=(15, 20), color=color)
plt.title('Rain Probability by City')
plt.xlabel('Probability of Rain (%)')
plt.show()
5. Humidity and Pressure vs. Rain
# 3. Feature-relationship scatter plots
plt.figure(figsize=(8, 6))
sns.scatterplot(data=data, x='Pressure9am', y='Pressure3pm', hue='RainTomorrow', alpha=0.6)
plt.title('Pressure at 9am vs 3pm')
plt.show()

plt.figure(figsize=(8, 6))
sns.scatterplot(data=data, x='Humidity9am', y='Humidity3pm', hue='RainTomorrow', alpha=0.6)
plt.title('Humidity at 9am vs 3pm')
plt.show()
6. Temperature vs. Rain
plt.figure(figsize=(8, 6))
sns.scatterplot(x='MaxTemp', y='MinTemp', data=data, hue='RainTomorrow', alpha=0.6)
plt.title('Max Temperature vs Min Temperature')
plt.show()
7. Data Preprocessing
# 4. Missing-value handling
print("\nMissing Values Percentage:")
print(data.isnull().sum() / data.shape[0] * 100)

# Randomly fill selected columns from their observed values
lst = ['Evaporation', 'Sunshine', 'Cloud9am', 'Cloud3pm']
for col in lst:
    fill_list = data[col].dropna()
    data[col] = data[col].fillna(pd.Series(np.random.choice(fill_list, size=len(data.index))))

# Fill categorical columns with the mode
object_cols = data.select_dtypes(include=['object']).columns
for col in object_cols:
    data[col].fillna(data[col].mode()[0], inplace=True)

# Fill numeric columns with the median
num_cols = data.select_dtypes(include=['float64']).columns
for col in num_cols:
    data[col].fillna(data[col].median(), inplace=True)

print("\nAfter Missing Value Handling:")
print(data.isnull().sum().sum(), "missing values remaining")

# 5. Feature encoding
label_encoder = LabelEncoder()
for col in object_cols:
    data[col] = label_encoder.fit_transform(data[col])
8. Building the Dataset
# 6. Dataset preparation
X = data.drop(['RainTomorrow', 'day'], axis=1)
y = data['RainTomorrow']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)

# Normalize the data
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).unsqueeze(1)

# Create data loaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
Missing Values Percentage:
Location 0.000000
MinTemp 1.020899
MaxTemp 0.866905
Rainfall 2.241853
Evaporation 43.166506
Sunshine 48.009762
WindGustDir 7.098859
WindGustSpeed 7.055548
WindDir9am 7.263853
WindDir3pm 2.906641
WindSpeed9am 1.214767
WindSpeed3pm 2.105046
Humidity9am 1.824557
Humidity3pm 3.098446
Pressure9am 10.356799
Pressure3pm 10.331363
Cloud9am 38.421559
Cloud3pm 40.807095
Temp9am 1.214767
Temp3pm 2.481094
RainToday 2.241853
RainTomorrow 2.245978
year 0.000000
Month 0.000000
day 0.000000
dtype: float64
IV. Building the Neural Network
class WeatherPredictor(nn.Module):
    def __init__(self, input_size):
        super(WeatherPredictor, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(input_size, 24),
            nn.Tanh(),
            nn.Linear(24, 18),
            nn.Tanh(),
            nn.Linear(18, 23),
            nn.Tanh(),
            nn.Dropout(0.5),
            nn.Linear(23, 12),
            nn.Tanh(),
            nn.Dropout(0.2),
            nn.Linear(12, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.network(x)

# Initialize the model
input_size = X_train.shape[1]
model = WeatherPredictor(input_size)

# 8. Loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

# 9. Training parameters
epochs = 30
best_val_loss = float('inf')
patience = 25
patience_counter = 0
min_delta = 0.001

# Training history
history = {
    'train_loss': [],
    'val_loss': [],
    'train_acc': [],
    'val_acc': []
}

# 10. Training loop
for epoch in range(epochs):
    # Training phase
    model.train()
    train_loss, train_correct, total_samples = 0, 0, 0
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        train_loss += loss.item() * inputs.size(0)
        predicted = (outputs > 0.5).float()
        train_correct += (predicted == labels).sum().item()
        total_samples += labels.size(0)

    train_loss = train_loss / total_samples
    train_acc = train_correct / total_samples

    # Validation phase
    model.eval()
    val_loss, val_correct, val_samples = 0, 0, 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            val_loss += loss.item() * inputs.size(0)
            predicted = (outputs > 0.5).float()
            val_correct += (predicted == labels).sum().item()
            val_samples += labels.size(0)

    val_loss = val_loss / val_samples
    val_acc = val_correct / val_samples

    # Record history
    history['train_loss'].append(train_loss)
    history['val_loss'].append(val_loss)
    history['train_acc'].append(train_acc)
    history['val_acc'].append(val_acc)

    # Print progress
    print(f'Epoch {epoch + 1}/{epochs} | '
          f'Train Loss: {train_loss:.4f} Acc: {train_acc:.4f} | '
          f'Val Loss: {val_loss:.4f} Acc: {val_acc:.4f}')

    # Early stopping logic
    if val_loss < best_val_loss - min_delta:
        best_val_loss = val_loss
        patience_counter = 0
        torch.save(model.state_dict(), 'best_weather_model.pth')
        print(f'Validation loss improved to {val_loss:.4f}, saving model...')
    else:
        patience_counter += 1
        print(f'Validation loss did not improve. Patience: {patience_counter}/{patience}')
        if patience_counter >= patience:
            print(f'Early stopping triggered after {epoch + 1} epochs')
            break

# 11. Load the best model
model.load_state_dict(torch.load('best_weather_model.pth'))
print("Loaded best model weights")
After Missing Value Handling:
0 missing values remaining
Epoch 1/30 | Train Loss: 0.4622 Acc: 0.8055 | Val Loss: 0.4021 Acc: 0.8252
Validation loss improved to 0.4021, saving model...
Epoch 2/30 | Train Loss: 0.3945 Acc: 0.8334 | Val Loss: 0.3813 Acc: 0.8349
Validation loss improved to 0.3813, saving model...
Epoch 3/30 | Train Loss: 0.3845 Acc: 0.8383 | Val Loss: 0.3776 Acc: 0.8388
Validation loss improved to 0.3776, saving model...
Epoch 4/30 | Train Loss: 0.3822 Acc: 0.8383 | Val Loss: 0.3746 Acc: 0.8393
Validation loss improved to 0.3746, saving model...
Epoch 5/30 | Train Loss: 0.3804 Acc: 0.8391 | Val Loss: 0.3733 Acc: 0.8397
Validation loss improved to 0.3733, saving model...
Epoch 6/30 | Train Loss: 0.3794 Acc: 0.8399 | Val Loss: 0.3723 Acc: 0.8401
Validation loss did not improve. Patience: 1/25
Epoch 7/30 | Train Loss: 0.3784 Acc: 0.8404 | Val Loss: 0.3716 Acc: 0.8404
Validation loss improved to 0.3716, saving model...
Epoch 8/30 | Train Loss: 0.3777 Acc: 0.8400 | Val Loss: 0.3711 Acc: 0.8403
Validation loss did not improve. Patience: 1/25
Epoch 9/30 | Train Loss: 0.3775 Acc: 0.8395 | Val Loss: 0.3709 Acc: 0.8403
Validation loss did not improve. Patience: 2/25
Epoch 10/30 | Train Loss: 0.3774 Acc: 0.8402 | Val Loss: 0.3706 Acc: 0.8402
Validation loss did not improve. Patience: 3/25
Epoch 11/30 | Train Loss: 0.3768 Acc: 0.8402 | Val Loss: 0.3707 Acc: 0.8399
Validation loss did not improve. Patience: 4/25
Epoch 12/30 | Train Loss: 0.3764 Acc: 0.8404 | Val Loss: 0.3699 Acc: 0.8400
Validation loss improved to 0.3699, saving model...
Epoch 13/30 | Train Loss: 0.3765 Acc: 0.8400 | Val Loss: 0.3701 Acc: 0.8405
Validation loss did not improve. Patience: 1/25
Epoch 14/30 | Train Loss: 0.3763 Acc: 0.8405 | Val Loss: 0.3701 Acc: 0.8402
Validation loss did not improve. Patience: 2/25
Epoch 15/30 | Train Loss: 0.3755 Acc: 0.8402 | Val Loss: 0.3705 Acc: 0.8402
Validation loss did not improve. Patience: 3/25
Epoch 16/30 | Train Loss: 0.3756 Acc: 0.8403 | Val Loss: 0.3691 Acc: 0.8407
Validation loss did not improve. Patience: 4/25
Epoch 17/30 | Train Loss: 0.3756 Acc: 0.8400 | Val Loss: 0.3692 Acc: 0.8407
Validation loss did not improve. Patience: 5/25
Epoch 18/30 | Train Loss: 0.3752 Acc: 0.8396 | Val Loss: 0.3690 Acc: 0.8408
Validation loss did not improve. Patience: 6/25
Epoch 19/30 | Train Loss: 0.3747 Acc: 0.8402 | Val Loss: 0.3690 Acc: 0.8405
Validation loss did not improve. Patience: 7/25
Epoch 20/30 | Train Loss: 0.3747 Acc: 0.8403 | Val Loss: 0.3687 Acc: 0.8401
Validation loss improved to 0.3687, saving model...
Epoch 21/30 | Train Loss: 0.3747 Acc: 0.8394 | Val Loss: 0.3685 Acc: 0.8401
Validation loss did not improve. Patience: 1/25
Epoch 22/30 | Train Loss: 0.3744 Acc: 0.8402 | Val Loss: 0.3706 Acc: 0.8389
Validation loss did not improve. Patience: 2/25
Epoch 23/30 | Train Loss: 0.3750 Acc: 0.8397 | Val Loss: 0.3688 Acc: 0.8401
Validation loss did not improve. Patience: 3/25
Epoch 24/30 | Train Loss: 0.3744 Acc: 0.8410 | Val Loss: 0.3693 Acc: 0.8399
Validation loss did not improve. Patience: 4/25
Epoch 25/30 | Train Loss: 0.3739 Acc: 0.8409 | Val Loss: 0.3694 Acc: 0.8406
Validation loss did not improve. Patience: 5/25
Epoch 26/30 | Train Loss: 0.3740 Acc: 0.8409 | Val Loss: 0.3680 Acc: 0.8411
Validation loss did not improve. Patience: 6/25
Epoch 27/30 | Train Loss: 0.3740 Acc: 0.8408 | Val Loss: 0.3682 Acc: 0.8403
Validation loss did not improve. Patience: 7/25
Epoch 28/30 | Train Loss: 0.3735 Acc: 0.8407 | Val Loss: 0.3677 Acc: 0.8399
Validation loss improved to 0.3677, saving model...
Epoch 29/30 | Train Loss: 0.3732 Acc: 0.8401 | Val Loss: 0.3675 Acc: 0.8413
Validation loss did not improve. Patience: 1/25
Epoch 30/30 | Train Loss: 0.3736 Acc: 0.8411 | Val Loss: 0.3677 Acc: 0.8404
Validation loss did not improve. Patience: 2/25
Loaded best model weights
2. Visualization and Evaluation Metrics
# 12. Plot training curves
plt.figure(figsize=(14, 5))

# Accuracy curves
plt.subplot(1, 2, 1)
plt.plot(history['train_acc'], label='Training Accuracy')
plt.plot(history['val_acc'], label='Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title('Training and Validation Accuracy')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)

# Loss curves
plt.subplot(1, 2, 2)
plt.plot(history['train_loss'], label='Training Loss')
plt.plot(history['val_loss'], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

# 13. Model evaluation
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)
        preds = (outputs > 0.5).float()
        all_preds.extend(preds.cpu().numpy().flatten())
        all_labels.extend(labels.cpu().numpy().flatten())

# Classification report and confusion matrix
print("\nClassification Report:")
print(classification_report(all_labels, all_preds))

print("\nConfusion Matrix:")
print(confusion_matrix(all_labels, all_preds))

# Additional evaluation metrics
print(f"\nAccuracy: {np.mean(np.array(all_labels) == np.array(all_preds)):.4f}")
print(f"R2 Score: {r2_score(all_labels, all_preds):.4f}")
print(f"MAE: {mean_absolute_error(all_labels, all_preds):.4f}")
print(f"MAPE: {mean_absolute_percentage_error(all_labels, all_preds):.4f}")
print(f"MSE: {mean_squared_error(all_labels, all_preds):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(all_labels, all_preds)):.4f}")
Classification Report:
              precision    recall  f1-score   support

         0.0       0.87      0.94      0.90     28343
         1.0       0.69      0.50      0.58      8022

    accuracy                           0.84     36365
   macro avg       0.78      0.72      0.74     36365
weighted avg       0.83      0.84      0.83     36365

Confusion Matrix:
[[26560  1783]
 [ 4040  3982]]

Accuracy: 0.8399
R2 Score: 0.0687
MAE: 0.1601
MAPE: 220814469234688.0000
MSE: 0.1601
RMSE: 0.4002