NLP to classify/label the content of a sentence (Ruby binding necessary)

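I am analysing a few million emails. My goal is to classify them into groups. Groups could be, for example:

Delivery problems (slow delivery, slow handling before dispatch, incorrect availability information, etc.)
Customer service problems (slow email response time, impolite responses, etc.)
Return issues (slow handling of return requests, lack of helpfulness from customer service, etc.)
Pricing complaints (hidden fees discovered, etc.)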

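To perform this classification, I need an NLP tool that can recognize combinations of word groups such as:

"[they|the company|the firm|the website|the merchant]"
"[did not|didn't|no]"
"[response|respond|answer|reply]"
"[before the next day|fast enough]"
etc.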

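A combination of some of these example groups should match sentences like:

"They did not respond"
"They didn't respond at all"
"No response at all"
"I received no response from the website"

The sentence should then be classified as a customer service problem.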
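For comparison with the NLP libraries discussed below, the word-group idea can be tried directly with regular expressions. This is only a minimal sketch, not something from the thread: the class name `WordGroupMatcher`, the word-boundary anchors, and the rule that up to three filler words may sit between the negation and the response word are all assumptions. The subject group is left out because two of the sample sentences have no subject.

```java
import java.util.List;
import java.util.regex.Pattern;

final class WordGroupMatcher {
    // Word groups copied from the question.
    private static final String NEGATION = "(?:did not|didn't|no)";
    private static final String RESPONSE = "(?:response|respond|answer|reply)";
    // Assumption: allow at most three filler words between the two groups.
    private static final String GAP = "(?:\\s+\\w+){0,3}?\\s+";

    private static final Pattern NO_RESPONSE = Pattern.compile(
            "\\b" + NEGATION + GAP + RESPONSE + "\\b", Pattern.CASE_INSENSITIVE);

    // True if the sentence contains a negation followed closely by a response word.
    static boolean isCustomerServiceProblem(String sentence) {
        return NO_RESPONSE.matcher(sentence).find();
    }

    public static void main(String[] args) {
        List<String> sentences = List.of(
                "They did not respond",
                "They didn't respond at all",
                "No response at all",
                "I received no response from the website",
                "The parcel arrived a day early");
        for (String s : sentences) {
            System.out.println(isCustomerServiceProblem(s) + "  " + s);
        }
    }
}
```

Hand-written patterns like this get brittle quickly as the number of phrasings grows, which is the main reason to look at a trained classifier instead.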

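Which NLP tool would be able to handle such a task? From what I have read, these are the most relevant:

Stanford CoreNLP
OpenNLP

See also these suggested NLP tools.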

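Using the OpenNLP doccat API, you can create training data and then build a model from that training data. The advantage of this over something like a naive Bayes classifier is that it returns a probability distribution over the set of categories.

So if you create a file in this format: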

    customerserviceproblems They did not respond
    customerserviceproblems They didn't respond
    customerserviceproblems They didn't respond at all
    customerserviceproblems They did not respond at all
    customerserviceproblems I received no response from the website
    customerserviceproblems I did not receive response from the website

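and so on. Each line starts with the category name, followed by whitespace and the sample text. Provide as many samples as you can, and make sure each line ends with a \n newline.

With this approach you can add anything you want that means "customer service problem", and you can add any other categories as well, so you don't have to be too deterministic about which data belongs to which category.

Here is what the Java to build the model looks like: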

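    // Targets the older OpenNLP API used in this answer (1.5.x era);
    // newer releases changed some of these signatures.
    import java.io.*;
    import opennlp.tools.doccat.*;
    import opennlp.tools.util.*;

    DoccatModel model = null;
    try (InputStream dataIn = new FileInputStream(yourFileOfSamplesLikeAbove)) {
        // One training sample per line: category, whitespace, text
        ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
        ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
        model = DocumentCategorizerME.train("en", sampleStream);
        // Close the output stream so the serialized model is fully flushed to disk
        try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelOutFile))) {
            model.serialize(modelOut);
        }
        System.out.println("Model complete!");
    } catch (IOException e) {
        // Failed to read or parse the training data, or to write the model
        e.printStackTrace();
    }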

Once you have the model, you can use it like this:

    // Load the serialized model once, for example in a constructor.
    DoccatModel doccatModel = new DoccatModel(new File(pathToModelYouJustMade));
    DocumentCategorizerME documentCategorizerME = new DocumentCategorizerME(doccatModel);

    /**
     * Returns a map from each category to its score for the given text.
     */
    private Map<String, Double> getScore(String text) {
        Map<String, Double> scoreMap = new HashMap<>();
        // One score per category, indexed by the categorizer's category index
        double[] categorize = documentCategorizerME.categorize(text);
        int catSize = documentCategorizerME.getNumberOfCategories();
        for (int i = 0; i < catSize; i++) {
            String category = documentCategorizerME.getCategory(i);
            scoreMap.put(category, categorize[documentCategorizerME.getIndex(category)]);
        }
        return scoreMap;
    }

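The returned hash map then contains every category you modeled along with its score, and you can use the scores to decide which category the input text belongs to.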
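One way to turn the score map into a decision is to take the highest-scoring category and reject it when the score is too low. The sketch below is an illustration only: the class name `BestCategory`, the `minScore` threshold, and the scores in `main` are made up, and it works on a plain `Map<String, Double>` so it does not depend on OpenNLP.

```java
import java.util.Map;
import java.util.Optional;

final class BestCategory {
    // Returns the highest-scoring category, or empty if no score reaches minScore.
    static Optional<String> pick(Map<String, Double> scoreMap, double minScore) {
        return scoreMap.entrySet().stream()
                .filter(e -> e.getValue() >= minScore)
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        // Illustrative scores, not real model output.
        Map<String, Double> scores = Map.of(
                "customerserviceproblems", 0.71,
                "deliveryproblems", 0.18,
                "pricingcomplaints", 0.11);
        System.out.println(pick(scores, 0.5)); // prints Optional[customerserviceproblems]
    }
}
```

Emails for which no category clears the threshold can be set aside for manual review or used as new training samples.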

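I'm not entirely sure, but I can think of two ways to approach your problem:

Standard machine learning

As mentioned in the comments, extract only keywords from each email and train a classifier on them. Define a set of relevant keywords beforehand and extract only those keywords from an email, if they are present.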

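This is a simple but powerful technique that should not be underestimated, as it yields very good results in many cases. You might want to try this first, since more complex algorithms may be overkill.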
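The keyword extraction step could look roughly like the following sketch, which turns an email into keyword counts that a classifier can be trained on. The class name `KeywordFeatures` and the keyword list are hypothetical; a real list would come from looking at the emails.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class KeywordFeatures {
    // Hypothetical keyword list, defined up front as the answer suggests.
    static final List<String> KEYWORDS =
            List.of("delivery", "refund", "response", "fee", "return");

    // Counts how often each predefined keyword occurs; all other words are ignored.
    static Map<String, Integer> extract(String email) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String keyword : KEYWORDS) {
            counts.put(keyword, 0);
        }
        // Lower-case the text and split on anything that is not a letter.
        for (String token : email.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            counts.computeIfPresent(token, (k, v) -> v + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(extract("No response to my refund request. Where is the refund?"));
    }
}
```

The counts form a fixed-length feature vector, one entry per keyword, which is the input format most standard classifiers expect.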

Grammar

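If you really want to dive deeper into NLP, then based on your problem description you might try to define some kind of grammar and parse the emails against it. I don't have much experience with Ruby, but I am fairly sure some lex-yacc equivalent exists. A quick web search turns up this question and this question. Having identified these phrases, you can judge which category an email belongs to by calculating the proportion of phrases per category.

For example, intuitively, some productions in the grammar could be defined as: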
```
 {organization}{negative}{verb} :- delivery problems 
```
where organization = [they|the company|the firm|the website|the merchant], etc.
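The phrase-proportion idea can be approximated without a parser generator by using one regular expression per category. This is a rough sketch under stated assumptions, not a real grammar: the class name `PhraseRuleClassifier`, the `negative` group, and the verb lists are invented for illustration, and only the `organization` group comes from the text above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PhraseRuleClassifier {
    // One pattern per category, each following {organization}{negative}{verb}.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        String organization = "(?:they|the company|the firm|the website|the merchant)";
        String negative = "(?:did not|didn't|never)";          // assumed group
        RULES.put("customer service problems", Pattern.compile(
                organization + "\\s+" + negative + "\\s+(?:respond|reply|answer)",
                Pattern.CASE_INSENSITIVE));
        RULES.put("delivery problems", Pattern.compile(
                organization + "\\s+" + negative + "\\s+(?:deliver|ship|send)",
                Pattern.CASE_INSENSITIVE));
    }

    // Returns, per category, the share of all matched phrases that belong to it.
    static Map<String, Double> categoryShares(String email) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int total = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(email);
            while (m.find()) {
                counts.merge(rule.getKey(), 1, Integer::sum);
                total++;
            }
        }
        Map<String, Double> shares = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            shares.put(e.getKey(), e.getValue() / (double) total);
        }
        return shares;
    }

    public static void main(String[] args) {
        System.out.println(categoryShares(
                "They did not respond to my emails. The merchant didn't ship the order, "
                + "and the website never answered."));
    }
}
```

An email with no matching phrase yields an empty map, so the caller has to decide how to handle unclassified mail.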

These approaches might be a way to get started.
