# Text Word-Frequency Counting and Visualization Tool: A Complete Workflow from File Reading to Data Visualization

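In text analysis, literary research, log monitoring, and similar settings, word-frequency counting is a basic way to understand what a text is about. For example, the most frequent words in a novel can hint at its themes, and the most frequent terms in a log can help locate system problems. This article walks through building a word-frequency counting and visualization tool that supports both Chinese and English, covering file reading, text preprocessing, frequency counting, and visualization.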

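Background and Requirements

The tool should support the following:
- read any text file (English or Chinese);
- preprocess the text (tokenize, strip punctuation, filter stop words);
- count word frequencies and sort them;
- visualize the top N most frequent words as a bar chart;
- offer a command-line interface (file path, number of top words, stop-word filtering).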

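Technical Approach

The core workflow has four steps:
1. File reading: load a UTF-8 encoded text file;
2. Text preprocessing:
- English: strip punctuation with a regex → split on whitespace → filter stop words;
- Chinese: strip punctuation with a regex → tokenize with jieba → filter stop words;
3. Frequency counting: count words with Counter and sort by frequency;
4. Visualization: draw a bar chart with matplotlib to show the most frequent words.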
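
The four steps can be sketched end to end on a short in-memory English string. This is a minimal illustration rather than the tool itself: `count_words` and `DEMO_STOP_WORDS` are hypothetical names introduced here, the stop-word set is deliberately tiny, and jieba and matplotlib are left out so the snippet runs with the standard library alone.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set used only for this sketch.
DEMO_STOP_WORDS = {"the", "a", "and", "on"}


def count_words(text, stop_words):
    """Strip punctuation, lowercase, split on whitespace, filter, then count."""
    # Replace anything that is not a word character or whitespace with a space.
    cleaned = re.sub(r"[^\w\s]", " ", text).lower()
    tokens = [t for t in cleaned.split() if t not in stop_words]
    return Counter(tokens)


sample = "The cat sat on the mat. The cat and a dog!"
counts = count_words(sample, DEMO_STOP_WORDS)
print(counts.most_common(1))  # [('cat', 2)]
```

The same shape carries over to Chinese text; only the tokenization step would change from `split()` to a segmenter such as jieba.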

Implementation (Python)

The complete code for the tool follows, with detailed comments:

import os
import re
import jieba
import matplotlib.pyplot as plt
from collections import Counter
import argparse

# English stop words (common words that carry little meaning)
english_stop_words = {'the', 'a', 'an', 'and', 'or', 'is', 'are', 'was', 'were', 'to', 'of', 'in', 'for', 'on', 'at',
                      'by', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
                      'above', 'below', 'from', 'up', 'down', 'in', 'out', 'over', 'under', 'again', 'further', 'then',
                      'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few',
                      'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than',
                      'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now'}

# Chinese stop words (common words that carry little meaning)
chinese_stop_words = {'的', '了', '是', '在', '和', '就', '也', '都', '不', '人', '一', '一个', '上', '下', '大', '小', '中',
                      '到', '来', '去', '有', '着', '没有', '你', '我', '他', '她', '它', '我们', '你们', '他们', '她们', '它们',
                      '这', '那', '这些', '那些', '这里', '那里', '什么', '怎么', '为什么', '所以', '因为', '因此', '如果',
                      '虽然', '但是', '不过', '而且', '或者', '还是', '只要', '只有', '除非', '假如', '即使', '既然', '尽管',
                      '宁可', '即使', '要么', '何况', '况且', '而', '与', '则', '却', '正', '同', '把', '被', '给', '让', '使',
                      '由', '从', '向', '朝', '往', '自', '以', '为', '对', '关于', '对于', '由于', '因为', '为了', '所以',
                      '因此', '之', '乎', '者', '也', '矣', '哉', '焉', '乎', '欤', '耶', '哉', '兮', '夫', '盖', '惟', '其',
                      '斯', '是', '此', '彼', '何', '曷', '胡', '奚', '焉', '安', '恶', '乌', '乎', '哉', '矣', '也', '者',
                      '欤', '耶', '哉', '兮', '夫', '盖', '惟', '其', '斯', '之', '的', '了', '呢', '啊', '呀', '哇', '哈',
                      '吗', '吧', '啊', '哦', '嗯', '哎', '喂', '唉', '咦', '哟', '呵', '哈', '嘿', '嘻', '噫', '嗬', '哦',
                      '嗯', '哎', '喂', '唉', '咦', '哟', '呵', '哈', '嘿', '嘻', '噫', '嗬'}


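def read_file(file_path):
    """Read a text file and return its content (expects UTF-8 encoding)."""
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            return f.read()
    except (OSError, UnicodeDecodeError) as e:
        # OSError covers missing files and permission problems;
        # UnicodeDecodeError is raised when the file is not valid UTF-8.
        print(f"Failed to read file: {e}")
        return None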


def is_chinese_text(text, threshold=0.3):
    """Treat the text as Chinese if the share of Chinese characters reaches the threshold (default 30%)."""
    chinese_chars = re.findall(r'[\u4e00-\u9fa5]', text)
    total_chars = len(text)  # includes whitespace and punctuation
    if total_chars == 0:
        return False
    return len(chinese_chars) / total_chars >= threshold


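def preprocess_text(text, stop_words, is_chinese=False):
    """Preprocess text: strip punctuation, tokenize, and filter stop words."""
    # Replace every character that is not a letter, digit, underscore, whitespace,
    # or Chinese character with a space, so that words joined only by punctuation
    # (e.g. "end.Start" or "don't") are split apart instead of fused together.
    text = re.sub(r'[^\w\s\u4e00-\u9fa5]', ' ', text)
    # Lowercase (only affects cased letters; Chinese characters are unchanged)
    text = text.lower()
    # Tokenize: jieba for Chinese, whitespace splitting for English
    words = jieba.lcut(text) if is_chinese else text.split()
    # Drop whitespace-only tokens and stop words
    return [word for word in words if word.strip() and word not in stop_words]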


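def main():
    # Parse command-line arguments
    parser = argparse.ArgumentParser(description='Text word-frequency counting and visualization tool')
    parser.add_argument('--file_path', type=str, required=True, help='Path to the text file (e.g. novel.txt)')
    parser.add_argument('--top_n', type=int, default=10, help='Number of top words to display (default: 10)')
    # BooleanOptionalAction (Python 3.9+) also generates --no-use_stopwords, so filtering
    # can actually be switched off; store_true combined with default=True never could.
    parser.add_argument('--use_stopwords', action=argparse.BooleanOptionalAction, default=True,
                        help='Filter stop words (default: enabled)')
    args = parser.parse_args()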

    # Read the file
    text = read_file(args.file_path)
    if not text:
        return

    # Detect the text language (Chinese or English)
    is_chinese = is_chinese_text(text)

    # Pick the stop-word list; an empty set disables filtering
    if not args.use_stopwords:
        stop_words = set()
    elif is_chinese:
        stop_words = chinese_stop_words
    else:
        stop_words = english_stop_words

    # Preprocess the text
    words = preprocess_text(text, stop_words, is_chinese)

    # Count frequencies (most_common returns entries in descending order of count)
    word_counts = Counter(words)
    top_words = word_counts.most_common(args.top_n)

    # Print the results as text
    print(f"Top {args.top_n} Frequent Words (After Preprocessing):")
    for word, count in top_words:
        print(f"{word}: {count}")
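    # Visualization
    if not top_words:
        # zip(*[]) cannot be unpacked into two names, so stop before plotting
        print("No words left after preprocessing; nothing to plot.")
        return
    labels, counts = zip(*top_words)  # unzip into parallel tuples of words and counts

    # Configure matplotlib so Chinese labels render
    plt.rcParams['font.sans-serif'] = ['SimHei']  # font with Chinese glyphs (must be installed)
    plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly with this font

    # Draw the bar chart
    plt.figure(figsize=(10, 6))
    plt.bar(labels, counts, color='skyblue')
    plt.title(f"Top {args.top_n} Frequent Words in {os.path.basename(args.file_path)}")
    plt.xlabel("Word")
    plt.ylabel("Frequency")
    plt.xticks(rotation=45)  # rotate x-axis labels to avoid overlap
    plt.tight_layout()       # adjust the layout automatically
    plt.show()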


if __name__ == "__main__":
    main()

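Code Walkthrough

File reading: read_file handles UTF-8 encoded text, which covers both Chinese and English files.
Language detection: is_chinese_text decides the language from the share of Chinese characters in the text (30% by default), so the language does not have to be hard-coded.
Text preprocessing:
- a regular expression removes characters that are not letters, digits, whitespace, or Chinese characters;
- the text is lowercased, which affects English only; Chinese is left as is;
- tokenization uses jieba.lcut for Chinese and split() for English;
- stop words and empty strings are filtered out.
Frequency counting: collections.Counter counts the words, and its most_common method returns the top N directly.
Visualization: matplotlib draws the bar chart, with the SimHei font configured so Chinese labels render and the x-axis labels rotated to avoid overlap.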
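
To make the language-detection heuristic concrete, the sketch below computes the same character ratio on three short strings. `chinese_ratio` is a hypothetical helper written for this illustration (the tool itself only exposes `is_chinese_text`), and 0.3 is the default threshold described above.

```python
import re


def chinese_ratio(text):
    """Return the fraction of characters in the range U+4E00..U+9FA5."""
    if not text:
        return 0.0
    return len(re.findall(r"[\u4e00-\u9fa5]", text)) / len(text)


samples = {
    "english": "word frequency",
    "chinese": "词频统计工具",
    "mixed": "Python词频",
}
for name, text in samples.items():
    ratio = chinese_ratio(text)
    print(f"{name}: ratio={ratio:.2f}, treated as Chinese: {ratio >= 0.3}")
```

The mixed sample has 2 Chinese characters out of 8, a ratio of 0.25, so it falls below the threshold and would be split on whitespace like English text. Documents that mix the two languages heavily may therefore need a lower threshold.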

Running the Tool

Command-line invocation (English text):

python word_freq_visualizer.py --file_path "the_little_prince.txt" --top_n 10

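Output:

Text output (sample):
Top 10 Frequent Words (After Preprocessing):
the: 120
of: 85
and: 78
a: 65
to: 59
in: 48
he: 45
was: 42
it: 39
for: 37
Note that the, of, and, a, to, in, was, and for are all in english_stop_words, so a listing like this corresponds to counts taken without stop-word filtering. With filtering applied, those entries are removed and content words move up the ranking.
Visual output: a bar chart window opens, with words on the x-axis, frequencies on the y-axis, and the file name in the title.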

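Extensions and Improvements

Custom stop words: load the stop-word list from a file (e.g. --stopwords_path stopwords.txt);
Data export: write the frequency results to CSV or Excel (with pandas or openpyxl);
Multi-file support: batch-process every text file in a folder;
Interactive UI: build a graphical interface with tkinter or PyQt to make the tool easier to use.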
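
The first two extensions could look roughly like the sketch below. It assumes a stop-word file with one word per line and, to stay dependency-free, uses the standard-library csv module instead of pandas or openpyxl. `load_stop_words`, `export_counts_csv`, and the file names are hypothetical and not part of the tool above.

```python
import csv
import os
import tempfile
from collections import Counter


def load_stop_words(path):
    """Read one stop word per line (UTF-8); blank lines are ignored."""
    with open(path, "r", encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def export_counts_csv(word_counts, path, top_n=None):
    """Write (word, count) rows, most frequent first, to a CSV file."""
    # utf-8-sig writes a BOM, which helps Excel detect the encoding of Chinese text.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        writer.writerows(word_counts.most_common(top_n))


# Small round trip in a temporary directory, so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    stop_path = os.path.join(tmp, "stopwords.txt")
    with open(stop_path, "w", encoding="utf-8") as f:
        f.write("the\n\nof\n")
    stop_words = load_stop_words(stop_path)

    tokens = ["the", "prince", "of", "rose", "prince"]
    counts = Counter(t for t in tokens if t not in stop_words)

    csv_path = os.path.join(tmp, "word_counts.csv")
    export_counts_csv(counts, csv_path)
    with open(csv_path, "r", encoding="utf-8-sig", newline="") as f:
        rows = list(csv.reader(f))

print(rows)  # [['word', 'count'], ['prince', '2'], ['rose', '1']]
```

In the tool, the loaded set would take the place of the built-in stop-word list passed to preprocess_text, and the export would run on word_counts after counting.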

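Summary

The tool combines four core capabilities (file handling, string processing, frequency counting, and visualization) and supports both Chinese and English text. Extending the stop-word lists, supporting more file formats, or adding an interactive interface would make it more practical. It is a good exercise for Python beginners who want to learn the basics of file processing, natural language processing, and data visualization.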

