Removing stop words with NLTK in Python

إياد أحمد · 9 December 2021

I am doing some data preprocessing, and right now I want to filter out the uninformative words, the so-called "stop words" (such as "the", "a", "an", "in", etc.), that is, prepositions, pronouns and the like. Is there a way to filter them out of a text?

Ali Haidar Ahmad · 9 December 2021

You can do this with nltk, which ships a list of these words. You can access and inspect it (and modify it if you like, by adding or removing words) as follows:

# Import the stopwords corpus reader
# (run nltk.download('stopwords') once beforehand if the corpus is not installed)
from nltk.corpus import stopwords
# Call the words function on stopwords, passing 'english' as the argument
sw = stopwords.words('english')
# We now have the list of basic English stop words
print(sw)
"""
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself',
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those',
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of',
'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before',
'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',
'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why',
'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such',
'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't',
'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll',
'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't",
'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't",
'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',
"mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't",
'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

"""
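
As a rough illustration of the "add or remove words" idea mentioned above, here is a minimal sketch. The helper name `customize_stop_words` and the `extra`/`keep` parameters are hypothetical, and a small hand-copied subset of the printed list stands in for `stopwords.words('english')` so that the snippet runs without the NLTK corpus.

```python
# A small subset copied from the list printed above; in real use this
# would be stopwords.words('english') (stand-in so the sketch runs offline).
base_stop_words = ['i', 'me', 'the', 'a', 'an', 'in', 'not', 'no', 'is', 'of', 'to']

def customize_stop_words(base, extra=(), keep=()):
    """Return a set of stop words with `extra` added and `keep` removed."""
    return (set(base) | set(extra)) - set(keep)

# Add a domain-specific word and keep the negations in the text
custom_sw = customize_stop_words(base_stop_words, extra={'etc'}, keep={'not', 'no'})
print(sorted(custom_sw))
```

Keeping negations such as 'not' can matter for tasks like sentiment analysis, where dropping them may flip the meaning of a sentence.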

Now let's write an example that uses this list to remove stop words:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Define the text
text = """Stopwords are the English words which does not add much
meaning to a sentence. They can safely be ignored without sacrificing
the meaning of the sentence. For example, the words like the, he, have etc. 
Such words are already captured this in corpus named corpus. We first download 
it to our python environment. """
# Build a set of stop words (set lookups are fast)
sw = set(stopwords.words('english'))
# Tokenize the text
word_tokens = word_tokenize(text)
# Filter the stop words out of word_tokens (compared in lowercase)
filteredText = [word for word in word_tokens if word.lower() not in sw]
print(filteredText)
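
The filtered list above still contains punctuation tokens such as '.' and ','. Below is a minimal sketch of one way to drop those as well and rebuild a string. The helper name `remove_stop_words`, the regex tokenizer, and the small `SW` set are assumptions for illustration: they stand in for `word_tokenize` and `stopwords.words('english')` so that the snippet runs without any NLTK data.

```python
import re

# Hypothetical stand-in for set(stopwords.words('english')):
# a few entries taken from the list printed earlier.
SW = {'are', 'the', 'which', 'does', 'not', 'to', 'a', 'they', 'can', 'be', 'of'}

def remove_stop_words(text, stop_words):
    """Return `text` with stop words and punctuation removed."""
    # Simple regex tokenizer (assumption): keeps runs of letters and apostrophes only
    tokens = re.findall(r"[A-Za-z']+", text)
    kept = [tok for tok in tokens if tok.lower() not in stop_words]
    return " ".join(kept)

print(remove_stop_words("They can safely be ignored, without sacrificing the meaning.", SW))
```

With the real NLTK tokenizer, a similar effect could be had by also filtering out tokens for which `str.isalpha()` is false, at the cost of dropping contractions and numbers.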

Ahmed Sharshar · 10 December 2021

Suppose we have a sequence of English words that we want to filter, for example:

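from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Tokenize the sentence first so that we work with a list of words
word_list = word_tokenize("Nick likes to play football, however he is not too fond of tennis.")

The following code shows a simple way to do it:

sw = set(stopwords.words('english'))  # build the stop-word set once
filtered_word_list = word_list[:]  # make a copy of the word list
for word in word_list:  # go through every word
    if word in sw:
        filtered_word_list.remove(word)  # remove any word that is a stop word
print(filtered_word_list)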

The output is:

['Nick', 'likes', 'play', 'football', ',', 'however', 'fond', 'tennis', '.']
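
One thing to keep in mind: a plain membership test is case-sensitive, and the NLTK list is all lowercase, so a capitalized token such as 'He' would not be matched unless it is lowercased first. Here is a small sketch of the difference, using a hand-picked subset of the stop-word list (an assumption made so the snippet runs without the NLTK corpus):

```python
# Hand-picked subset of the English stop-word list
# (stand-in for stopwords.words('english')).
sw = {'he', 'is', 'not', 'too', 'of', 'to'}
tokens = ['He', 'is', 'not', 'too', 'fond', 'of', 'tennis', '.']

# Case-sensitive test: 'He' survives because only 'he' is in the set
case_sensitive = [t for t in tokens if t not in sw]
# Lowercasing each token before the test also catches 'He'
case_insensitive = [t for t in tokens if t.lower() not in sw]

print(case_sensitive)
print(case_insensitive)
```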
