Python Development from Beginner to Master: Getting Started with Machine Learning and Deep Learning
"Python Development from Beginner to Master" Design Guide, Part 7: Getting Started with Machine Learning and Deep Learning

I. Learning Objectives and Key Points
💡 Learning objectives: master the basic concepts of machine learning and deep learning and understand common algorithms and models; learn to use core libraries such as Scikit-learn, TensorFlow, and Keras; solve real problems through hands-on case studies.
⚠️ Key points: basic machine learning concepts, common Scikit-learn algorithms, TensorFlow/Keras fundamentals, neural network basics, and hands-on deep learning.
7.1 Overview of Machine Learning
7.1.1 What Is Machine Learning
Machine learning is a way of letting computers learn patterns from data without being explicitly programmed. Its core idea is "learning from experience": by analysing large amounts of data, it discovers hidden patterns and regularities and uses them to make predictions on new data.
7.1.2 Types of Machine Learning
- Supervised learning: trains on labelled data and predicts labels for new data.
- Unsupervised learning: trains on unlabelled data and discovers patterns and structure within it.
- Reinforcement learning: learns how to maximise reward by interacting with an environment.
7.1.3 Common Machine Learning Algorithms
- Linear regression: predicts continuous variables.
- Logistic regression: binary classification.
- Decision trees: classification and regression.
- Random forests: an ensemble method that improves prediction accuracy.
- Support vector machines: classification.
- K-nearest neighbours: classification and regression.
- Clustering algorithms: e.g. K-means and hierarchical clustering (a minimal example follows this list).
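To make one of these algorithms concrete, here is a minimal K-means sketch. It assumes scikit-learn is installed and uses the built-in Iris dataset purely for illustration; the labels are ignored because clustering is unsupervised.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
# Load a small built-in dataset; the labels are discarded (unsupervised setting)
X, _ = load_iris(return_X_y=True)
# Group the samples into 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(f"Cluster sizes: {[int((labels == k).sum()) for k in range(3)]}")
print(f"Cluster centers:\n{kmeans.cluster_centers_}")
Most scikit-learn estimators follow this same fit/predict pattern, which is why switching between the algorithms in this chapter requires very little code.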
7.2 Machine Learning Basics
7.2.1 Data Preparation
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Read the data (data.csv is a placeholder for your own dataset)
data = pd.read_csv('data.csv')
# Separate the features from the target column
X = data.drop('target', axis=1)
y = data['target']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features (fit the scaler on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
7.2.2 Model Training and Evaluation
7.2.2.1 Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Fit a linear regression model on the standardized features
model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
# Evaluate with mean squared error and R²
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse}")
print(f"R2 score: {r2}")
7.2.2.2 Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# Fit a logistic regression classifier (binary classification)
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
# Common classification metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 score: {f1}")
print(f"Confusion matrix:\n{cm}")
7.2.2.3 Decision Trees
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# Fit a decision tree classifier (tree models do not need scaled features, but scaling does no harm)
model = DecisionTreeClassifier()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 score: {f1}")
print(f"Confusion matrix:\n{cm}")
7.2.2.4 Random Forests
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# Fit a random forest classifier (an ensemble of decision trees)
model = RandomForestClassifier()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 score: {f1}")
print(f"Confusion matrix:\n{cm}")
7.2.3 Model Tuning
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Candidate hyperparameter values to search over
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 4, 6]
}
model = RandomForestClassifier()
# Exhaustive search with 5-fold cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_}")
# Evaluate the best model on the held-out test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test set accuracy: {accuracy}")
7.3 Deep Learning Basics
7.3.1 Overview of Deep Learning
Deep learning is a branch of machine learning that uses neural network models to imitate the learning process of the human brain. It has achieved remarkable results in computer vision, natural language processing, speech recognition, and other fields.
7.3.2 Neural Network Basics
- Neuron: the basic unit of a neural network; it receives inputs and produces an output.
- Layers: the depth of the network, consisting of an input layer, hidden layers, and an output layer.
- Activation functions: introduce non-linearity, e.g. ReLU, Sigmoid, and Tanh (compared numerically in the sketch after this list).
- Loss functions: measure the error between predictions and true values, e.g. mean squared error and cross-entropy loss.
- Optimizers: update the network's weights, e.g. SGD, Adam, and RMSProp.
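To make the activation functions above concrete, the short NumPy sketch below evaluates ReLU, Sigmoid and Tanh on the same handful of inputs; notice how each one treats negative values differently.
import numpy as np
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
relu = np.maximum(0, x)           # ReLU: max(0, x), zero for negative inputs
sigmoid = 1 / (1 + np.exp(-x))    # Sigmoid: squashes values into (0, 1)
tanh = np.tanh(x)                 # Tanh: squashes values into (-1, 1)
print(f"x:       {x}")
print(f"ReLU:    {relu}")
print(f"Sigmoid: {np.round(sigmoid, 3)}")
print(f"Tanh:    {np.round(tanh, 3)}")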
7.3.3 Building Neural Networks with TensorFlow/Keras
7.3.3.1 Installing TensorFlow/Keras
pip install tensorflow
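A quick way to confirm the installation worked is to import the package and print its version:
import tensorflow as tf
print(tf.__version__)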
7.3.3.2 Building a Simple Neural Network
import tensorflow as tf
from tensorflow import keras
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load the California Housing regression dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Build a network with one hidden layer
model = keras.Sequential([
    keras.layers.Dense(30, activation='relu', input_shape=X_train_scaled.shape[1:]),
    keras.layers.Dense(1)
])
# Compile: SGD optimizer, mean squared error loss
model.compile(optimizer='sgd', loss='mean_squared_error', metrics=['mean_squared_error'])
# Train for 30 epochs, monitoring the test set as validation data
history = model.fit(X_train_scaled, y_train, epochs=30, validation_data=(X_test_scaled, y_test))
# Evaluate on the test set
test_loss, test_mse = model.evaluate(X_test_scaled, y_test)
print(f"Test set MSE: {test_mse}")
7.3.3.3 Building a Deeper Neural Network
import tensorflow as tf
from tensorflow import keras
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load the California Housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Build a deeper network with two hidden layers
model = keras.Sequential([
    keras.layers.Dense(30, activation='relu', input_shape=X_train_scaled.shape[1:]),
    keras.layers.Dense(15, activation='relu'),
    keras.layers.Dense(1)
])
# Compile the model
model.compile(optimizer='sgd', loss='mean_squared_error', metrics=['mean_squared_error'])
# Train the model
history = model.fit(X_train_scaled, y_train, epochs=30, validation_data=(X_test_scaled, y_test))
# Evaluate on the test set
test_loss, test_mse = model.evaluate(X_test_scaled, y_test)
print(f"Test set MSE: {test_mse}")
7.4 Hands-On Deep Learning
7.4.1 Image Classification
7.4.1.1 Preparing the Dataset
Use the MNIST dataset that ships with Keras:
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
# Load the MNIST handwritten digit dataset
mnist = keras.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Flatten the 28x28 images and scale pixel values to [0, 1]
X_train = X_train.reshape(-1, 28 * 28).astype('float32') / 255.0
X_test = X_test.reshape(-1, 28 * 28).astype('float32') / 255.0
# Hold out part of the training data as a validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
7.4.1.2 Building an Image Classification Model
import tensorflow as tf
from tensorflow import keras
# Fully connected classifier: 784 pixel inputs -> 10 digit classes
model = keras.Sequential([
    keras.layers.Dense(300, activation='relu', input_shape=X_train.shape[1:]),
    keras.layers.Dense(100, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])
# Sparse categorical cross-entropy works directly with integer labels 0-9
model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train and monitor accuracy on the validation split
history = model.fit(X_train, y_train, epochs=30, validation_data=(X_val, y_val))
# Final evaluation on the untouched test set
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test set accuracy: {test_accuracy}")
7.4.1.3 Image Classification with a CNN
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
# Load the MNIST dataset
mnist = keras.datasets.mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Keep the 2D image shape, add a channel dimension, and scale pixels to [0, 1]
X_train = X_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
X_test = X_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0
# Hold out part of the training data as a validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
# Convolutional network: stacked Conv2D/MaxPooling blocks followed by dense layers
model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=X_train.shape[1:]),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])
# Compile with the Adam optimizer
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train for 10 epochs
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_val, y_val))
# Evaluate on the test set
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test set accuracy: {test_accuracy}")
7.4.2 Natural Language Processing
7.4.2.1 Text Classification
import tensorflow as tf
from tensorflow import keras
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
# Load the 20 Newsgroups text dataset
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
X = newsgroups.data
y = newsgroups.target
# Turn the raw text into TF-IDF feature vectors (top 10,000 terms)
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
X_vectorized = vectorizer.fit_transform(X)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)
# Dense classifier over the TF-IDF features; 20 output classes
model = keras.Sequential([
    keras.layers.Dense(100, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.Dense(50, activation='relu'),
    keras.layers.Dense(20, activation='softmax')
])
# Compile the model
model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Keras needs dense arrays, so convert the sparse TF-IDF matrices with toarray()
history = model.fit(X_train.toarray(), y_train, epochs=30, validation_data=(X_test.toarray(), y_test))
# Evaluate on the test set
test_loss, test_accuracy = model.evaluate(X_test.toarray(), y_test)
print(f"Test set accuracy: {test_accuracy}")
7.5 Hands-On Case Study: Predicting House Prices
7.5.1 Requirements
Predict house prices using the California Housing dataset.
7.5.2 Implementation
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
# Load the data
housing = fetch_california_housing()
X = housing.data
y = housing.target
# Split and standardize
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train and evaluate the models
# Linear regression
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)
lr_pred = lr_model.predict(X_test_scaled)
lr_mse = mean_squared_error(y_test, lr_pred)
lr_r2 = r2_score(y_test, lr_pred)
# Random forest
rf_model = RandomForestRegressor()
rf_model.fit(X_train_scaled, y_train)
rf_pred = rf_model.predict(X_test_scaled)
rf_mse = mean_squared_error(y_test, rf_pred)
rf_r2 = r2_score(y_test, rf_pred)
# Compare the results
print("Linear regression:")
print(f"MSE: {lr_mse}")
print(f"R2 score: {lr_r2}")
print()
print("Random forest:")
print(f"MSE: {rf_mse}")
print(f"R2 score: {rf_r2}")
# Visualize actual vs. predicted prices
plt.figure(figsize=(10, 6))
plt.scatter(y_test, lr_pred, label='Linear regression', alpha=0.5)
plt.scatter(y_test, rf_pred, label='Random forest', alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=2)
plt.xlabel('Actual price')
plt.ylabel('Predicted price')
plt.title('House price prediction')
plt.legend()
plt.show()
7.5.3 Implementation Steps
- Load the data: load the California Housing dataset with fetch_california_housing().
- Preprocess the data: split it with train_test_split() and standardize it with StandardScaler().
- Train and evaluate: fit linear regression and random forest models and compare their metrics.
- Visualize: plot actual versus predicted prices with Matplotlib.
7.5.4 Results
Comparing the results of the linear regression and random forest models shows that the random forest makes noticeably better predictions on this dataset.
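One way to understand why the random forest does better is to look at which features it relies on. A minimal sketch using the fitted rf_model and the feature names from the dataset loaded above:
import numpy as np
# Rank the California Housing features by how much the forest uses them
importances = rf_model.feature_importances_
for i in np.argsort(importances)[::-1]:
    print(f"{housing.feature_names[i]}: {importances[i]:.3f}")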
Summary
✅ This article introduced the basic concepts of machine learning and deep learning, covered commonly used algorithms and models, showed how to use the core libraries Scikit-learn and TensorFlow/Keras, and solved real problems through hands-on case studies.
✅ Readers are encouraged to practise as they go; writing the code yourself is the best way to deepen your understanding of these topics.











