Building an E-book Word-Frequency and Visualization Tool in Python: The Full Pipeline from Text to Word Cloud

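Background

When reading an e-book, we often want a quick sense of its core themes and vocabulary style. Word-frequency statistics and visualization turn abstract text into intuitive charts (such as bar charts and word clouds), which makes it easy to analyze how a book's vocabulary is distributed. This article shows how to build a Python tool that covers the whole flow: reading the e-book text, segmenting and filtering words, and visualizing word frequencies.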

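Approach

The tool's core flow has four steps:
1. File reading: read the contents of a local e-book in TXT format.
2. Segmentation and stop-word filtering: use the jieba segmenter to split the text into words, and filter out stop words that carry little meaning (such as “的” and “了”).
3. Word-frequency counting: count how often each word occurs, then sort to obtain the high-frequency words.
4. Visualization: draw a bar chart of the high-frequency words with matplotlib and generate a word cloud with wordcloud, so the frequency distribution can be seen at a glance.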

Implementation

1. Import dependencies

Install the jieba (Chinese word segmentation), matplotlib (plotting), and wordcloud (word cloud) libraries:

pip install jieba matplotlib wordcloud

Import the required libraries in the code:

import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import os

2. Load stop words

A set of common Chinese stop words is built in, and users can add their own by passing a file path:

def load_stopwords(stopwords_path=None):
    stopwords = set()
    # Built-in common Chinese stop words
    default_stopwords = {'的', '了', '在', '是', '我', '你', '他', '她', '它', '和', '就', '也', '都', '这', '那', '啊', '呀', '呢', '吧', '吗',
                         '之', '以', '于', '而', '且', '为', '对', '到', '来', '去', '上', '下', '出', '入', '大', '小', '多', '少', '一', '二',
                         '三', '四', '五', '六', '七', '八', '九', '十', '个', '只', '片', '条', '本', '把', '张', '件', '位', '群', '堆', '些',
                         '点', '种', '样', '类', '项', '等', '等等', '及', '以及', '并', '或', '与', '同', '跟', '向', '往', '朝', '从', '自',
                         '由', '被', '给', '让', '使', '叫', '会', '能', '可', '应', '该', '要', '想', '将', '得', '地', '着', '过'}
    stopwords.update(default_stopwords)
    # If the user supplies a stop-word file, add its words (one per line)
    if stopwords_path and os.path.exists(stopwords_path):
        with open(stopwords_path, 'r', encoding='utf-8') as f:
            for line in f:
                word = line.strip()
                if word:  # skip blank lines so the empty string is not added
                    stopwords.add(word)
    return stopwords

3. Read the file and process the text

Read the e-book text, then segment it and filter out stop words:

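def process_text(file_path, stopwords):
    # Read the file contents (assumes UTF-8 encoding)
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    # Segment the text (jieba for Chinese; English could be split on whitespace; Chinese is used here)
    words = jieba.lcut(text)
    # Filter out stop words, single characters (optional), and tokens with no letters or digits
    filtered_words = [word for word in words
                      if word not in stopwords
                      and len(word) > 1  # drop single-character tokens
                      and any(ch.isalnum() for ch in word)]  # drop whitespace and pure punctuation such as '……'
    return filtered_words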
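
process_text opens the file as UTF-8, but a Chinese TXT e-book may have been saved in GBK or GB18030, in which case reading it raises UnicodeDecodeError. The sketch below tries several encodings in turn. The helper name read_text_with_fallback and the default encoding list are assumptions for illustration and are not part of the original tool.

```python
def read_text_with_fallback(file_path, encodings=("utf-8-sig", "gb18030")):
    """Return the file's text, trying each candidate encoding in order.

    Hypothetical helper: the name and the default encoding list are
    illustrative assumptions. "utf-8-sig" also reads plain UTF-8 and
    strips a leading BOM; "gb18030" is a superset of GBK and GB2312.
    """
    last_error = None
    for encoding in encodings:
        try:
            with open(file_path, "r", encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError as exc:
            last_error = exc  # remember the failure and try the next encoding
    raise ValueError(
        f"Could not decode {file_path!r} with any of {encodings}"
    ) from last_error
```

Inside process_text, the two lines that open and read the file could then be replaced by text = read_text_with_fallback(file_path). Trying the stricter encoding first matters, because a permissive one may decode the wrong bytes without raising an error.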

4. Count and sort word frequencies

Count how many times each word occurs and sort by frequency in descending order:

def count_frequency(words, top_n=20):
    freq_dict = {}
    for word in words:
        freq_dict[word] = freq_dict.get(word, 0) + 1
    # Sort by frequency, highest first
    sorted_items = sorted(freq_dict.items(), key=lambda x: x[1], reverse=True)
    # Keep the top_n most frequent words
    top_words = sorted_items[:top_n]
    return freq_dict, top_words
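
The same counting can also be written with collections.Counter from the standard library. The following is a minimal sketch of that alternative, keeping the same (freq_dict, top_words) return shape. The function name count_frequency_counter is hypothetical and the sample word list is made up for the demonstration.

```python
from collections import Counter


def count_frequency_counter(words, top_n=20):
    """Counter-based variant of count_frequency (hypothetical name)."""
    counter = Counter(words)                 # word -> number of occurrences
    top_words = counter.most_common(top_n)   # [(word, count), ...], highest first
    return dict(counter), top_words


freq_dict, top_words = count_frequency_counter(
    ["王子", "狐狸", "王子", "玫瑰", "王子", "狐狸"], top_n=2
)
print(top_words)  # [('王子', 3), ('狐狸', 2)]
```

Counter.most_common does the sort-and-slice step in one call, so the explicit sorted(...) and slicing are no longer needed.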

5. Visualization: bar chart and word cloud

Generate the charts with matplotlib and wordcloud respectively:

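def plot_bar(top_words, title="高频词柱状图"):
    # Nothing to draw if no words survived filtering (zip(*[]) cannot be unpacked)
    if not top_words:
        print("No words to plot.")
        return
    # Split into words and frequencies
    words, freqs = zip(*top_words)
    # Set the figure resolution
    plt.rcParams['figure.dpi'] = 300
    # Enable Chinese text (requires the SimHei font to be installed)
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False
    # Draw the bar chart
    x = range(len(words))
    plt.bar(x, freqs, color='#1f77b4')
    plt.xticks(x, words, rotation=45, ha='right')  # rotate x-axis labels to avoid overlap
    plt.xlabel('词语')
    plt.ylabel('出现次数')
    plt.title(title)
    plt.tight_layout()  # adjust the layout automatically
    plt.show()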

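def generate_wordcloud(freq_dict, background_color='white', title="词云图"):
    # Build the word cloud; font size reflects word frequency
    wc = WordCloud(
        font_path='simhei.ttf',  # Chinese support (requires a local SimHei font file)
        background_color=background_color,
        width=800,
        height=600,
        max_words=200,  # maximum number of words shown
        max_font_size=150,
        min_font_size=10,
    )
    # Generate the word cloud from the frequency dictionary
    wc.generate_from_frequencies(freq_dict)
    # Display the word cloud
    plt.rcParams['font.sans-serif'] = ['SimHei']  # so the Chinese title renders even if plot_bar was not called first
    plt.figure(figsize=(10, 8))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')  # hide the axes
    plt.title(title)
    plt.tight_layout()
    plt.show()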

6. Main function: tying the steps together

Chain the functions above into a complete tool:

def main():
    # 1. Choose the e-book file (a Chinese TXT of "The Little Prince" is used as an example; replace with a real path)
    file_path = "xiaowangzi.txt"  # replace with the path to your e-book
    # 2. Load stop words
    stopwords = load_stopwords()  # a custom stop-word file can be passed, e.g. load_stopwords("my_stopwords.txt")
    # 3. Process the text (segmentation + filtering)
    filtered_words = process_text(file_path, stopwords)
    # 4. Count word frequencies
    freq_dict, top_words = count_frequency(filtered_words, top_n=20)
    # 5. Visualize
    plot_bar(top_words, title="《小王子》高频词柱状图（前20）")
    generate_wordcloud(freq_dict, title="《小王子》词云图")
    # Print the top 20 high-frequency words
    print("前20个高频词：")
    for word, freq in top_words:
        print(f"{word}（{freq}次）")

if __name__ == "__main__":
    main()
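
main() hard-codes the file path and the number of top words. One possible way to avoid editing the source for each book is to read these values from the command line. The sketch below uses the standard-library argparse module. The function name parse_args and the option names (--stopwords, --top-n) are assumptions chosen for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options (option names are illustrative assumptions)."""
    parser = argparse.ArgumentParser(
        description="Word-frequency statistics for a TXT e-book"
    )
    parser.add_argument("file_path", help="path to the TXT e-book")
    parser.add_argument("--stopwords", default=None,
                        help="optional stop-word file, one word per line")
    parser.add_argument("--top-n", type=int, default=20,
                        help="number of high-frequency words to show")
    return parser.parse_args(argv)


# Passing a list makes the example reproducible; with argv=None,
# argparse reads the real command line instead.
args = parse_args(["xiaowangzi.txt", "--top-n", "10"])
print(args.file_path, args.stopwords, args.top_n)  # xiaowangzi.txt None 10
```

Inside main(), args.file_path, load_stopwords(args.stopwords), and top_n=args.top_n would then take the place of the hard-coded values.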

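Notes and extensions

Extending the stop words: to use custom stop words, create a text file with one word per line and load it with load_stopwords("my_stopwords.txt").
Adapting the segmentation: to analyze an English book, replace jieba.lcut with text.split() (whitespace tokenization) and switch to an English stop-word list. Note that split() leaves punctuation attached to words, so it should be stripped as well.
Visualization: wordcloud's font_path must point to a local Chinese font (such as simhei.ttf) so that Chinese text renders correctly.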
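
For the English case mentioned above, a regular expression is one simple way to lowercase the text and strip punctuation in a single pass. The following is a minimal sketch. The function name tokenize_english and the tiny stop-word set are assumptions for illustration, and a real stop-word list would be much longer.

```python
import re

# A tiny illustrative English stop-word list (an assumption, not a complete list).
ENGLISH_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "was"}


def tokenize_english(text, stopwords=ENGLISH_STOPWORDS):
    """Lowercase the text, extract alphabetic words, and drop stop words."""
    # [a-z]+(?:'[a-z]+)? matches a word and keeps an internal apostrophe ("don't")
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Mirror process_text: drop stop words and single-character tokens
    return [w for w in words if w not in stopwords and len(w) > 1]


print(tokenize_english("The little prince tamed the fox, and the fox was happy."))
# ['little', 'prince', 'tamed', 'fox', 'fox', 'happy']
```

The returned list has the same shape as the output of process_text, so it can be passed to count_frequency unchanged.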

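Summary

With this tool we have covered the full flow of "reading text → segmenting and filtering → counting word frequencies → visualizing". Besides helping analyze an e-book's vocabulary, it is good practice for Python file handling, string processing, data structures, and visualization. You can extend it as needed, for example with multi-language support or custom visual styles, to make it more useful.

I hope this article helps you understand the full path from text analysis to visualization. Go try it on an e-book you like!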

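Running the tool

Install the dependencies: pip install jieba matplotlib wordcloud.
Replace file_path with the path to your e-book TXT file.
Make sure a Chinese font file (such as simhei.ttf) is available locally, or change WordCloud's font_path to a Chinese font that exists on your system.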
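
Since the font file location differs between systems, one option is to check a few candidate paths and use the first one that exists. The sketch below does this with os.path only. The function name find_chinese_font and every path in the candidate list are assumptions about common installations and may not match your machine.

```python
import os

# Candidate font files. These paths are assumptions about common default
# installations and may not exist on a given machine.
CANDIDATE_FONTS = [
    r"C:\Windows\Fonts\simhei.ttf",                    # Windows: SimHei
    "/System/Library/Fonts/PingFang.ttc",              # macOS: PingFang
    "/usr/share/fonts/truetype/wqy/wqy-microhei.ttc",  # Linux: WenQuanYi Micro Hei
]


def find_chinese_font(candidates=CANDIDATE_FONTS):
    """Return the first candidate path that exists, or None if none do."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


font_path = find_chinese_font()
if font_path is None:
    print("No candidate font found; set font_path manually.")
else:
    print(f"Using font: {font_path}")
```

The returned path could then be passed as the font_path argument of WordCloud in place of the hard-coded 'simhei.ttf'.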

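Running the code above produces output along these lines:
– A bar chart comparing the counts of the top 20 high-frequency words.
– A word cloud in which high-frequency words such as “小王子” (the little prince) and “狐狸” (fox) appear in larger type.
– A list of the top 20 high-frequency words printed to the console.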
