How can we tokenize text using NLTK in Python?

إياد أحمد · 10 December 2021

I am working on a collection of texts and want to tokenize them. Are there helper functions for this?

Ali Haidar Ahmad · 10 December 2021

With the nltk library, you can do this using the word_tokenize function, as follows:

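# Import the word_tokenize function
# (it relies on NLTK's Punkt data; if a LookupError is raised, run nltk.download('punkt') once,
# or nltk.download('punkt_tab') on recent NLTK releases)
from nltk.tokenize import word_tokenize
# Define the text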
text = """Stopwords are the English words which does not add much
meaning to a sentence. They can safely be ignored without sacrificing
the meaning of the sentence. For example, the words like the, he, have etc. 
Such words are already captured this in corpus named corpus. We first download 
it to our python environment. """
# Tokenize the text
tokens = word_tokenize(text)
print(tokens)
"""
['Stopwords', 'are', 'the', 'English', 'words', 'which', 'does', 'not', 'add',
'much', 'meaning', 'to', 'a', 'sentence', '.', 'They', 'can', 'safely', 'be',
'ignored', 'without', 'sacrificing', 'the', 'meaning', 'of', 'the', 'sentence',
'.', 'For', 'example', ',', 'the', 'words', 'like', 'the', ',', 'he', ',',
'have', 'etc', '.', 'Such', 'words', 'are', 'already', 'captured', 'this', 'in',
'corpus', 'named', 'corpus', '.', 'We', 'first', 'download', 'it', 'to', 'our',
'python', 'environment', '.']

"""

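The token list returned above still contains punctuation marks such as '.' and ','. If only the words are needed, one possible follow-up step is to filter those tokens out. The sketch below is a minimal illustration using only the standard library; the short token list is copied by hand from the output above, and the name PUNCTUATION is a hypothetical helper, not part of NLTK.

```python
import string

# A few tokens copied by hand from the word_tokenize output shown above
tokens = ['Stopwords', 'are', 'the', 'English', 'words', '.',
          'For', 'example', ',', 'the', 'words']

# Hypothetical helper: the set of ASCII punctuation characters
PUNCTUATION = set(string.punctuation)

# Keep every token that is not a single punctuation character
words = [tok for tok in tokens if tok not in PUNCTUATION]
print(words)
```

This only drops tokens that consist of one ASCII punctuation character; multi-character or non-ASCII punctuation tokens would need a broader rule.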
You can also use the RegexpTokenizer class, which lets you tokenize with regular expressions, as follows:

# Import the RegexpTokenizer class
from nltk.tokenize import RegexpTokenizer
# Define the text
text = """Stopwords are the English words which does not add much
meaning to a sentence. They can safely be ignored without sacrificing
the meaning of the sentence. For example, the words like the, he, have etc. 
Such words are already captured this in corpus named corpus. We first download 
it to our python environment. """
# Create an object of this class.
# The first argument is the regular expression; gaps=True means the pattern
# matches the separators between tokens rather than the tokens themselves.
tk = RegexpTokenizer(r'\s+', gaps=True)
# Now call the tokenize method defined in this class to perform the tokenization
tokens = tk.tokenize(text)
print(tokens)
"""
['Stopwords', 'are', 'the', 'English', 'words', 'which', 'does', 'not', 'add', 'much',
'meaning', 'to', 'a', 'sentence.', 'They', 'can', 'safely', 'be', 'ignored', 'without',
'sacrificing', 'the', 'meaning', 'of', 'the', 'sentence.', 'For', 'example,', 'the',
'words', 'like', 'the,', 'he,', 'have', 'etc.', 'Such', 'words', 'are', 'already', 'captured',
'this', 'in', 'corpus', 'named', 'corpus.', 'We', 'first', 'download', 'it', 'to', 'our',
'python', 'environment.']

"""

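To make the role of the gaps argument more concrete, here is a minimal sketch that uses only Python's standard re module instead of NLTK. It assumes that gaps=True amounts to splitting the text on the pattern and dropping empty pieces, while the default mode returns the substrings the pattern matches. The function names tokenize_by_gaps and tokenize_by_matches are hypothetical and are not part of NLTK.

```python
import re

def tokenize_by_gaps(text, pattern=r"\s+"):
    """Split the text wherever the pattern matches and drop empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]

def tokenize_by_matches(text, pattern=r"\w+"):
    """Return every substring that the pattern matches."""
    return re.findall(pattern, text)

# One sentence taken from the sample text, including its trailing space
sample = "For example, the words like the, he, have etc. "

gap_tokens = tokenize_by_gaps(sample)
match_tokens = tokenize_by_matches(sample)
print(gap_tokens)    # punctuation stays attached: 'example,', 'etc.'
print(match_tokens)  # punctuation is dropped: 'example', 'etc'
```

Splitting on whitespace keeps punctuation glued to the neighbouring word, whereas matching \w+ discards it entirely; which behaviour is preferable depends on what the tokens are used for.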
Ahmed Sharshar · 10 December 2021

If the text is simple, you can tokenize it with split, which separates the words from one another at whitespace, as follows:

text = """Founded in 2002, SpaceX’s mission is to enable humans
to become a spacefaring civilization and a multi-planet
species by building a self-sustaining city on Mars. 
In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# Split on whitespace
text.split()

Or using NLTK, as follows:

from nltk.tokenize import word_tokenize 
text = """Founded in 2002, SpaceX’s mission is to enable humans 
to become a spacefaring civilization and a multi-planet
species by building a self-sustaining city on Mars. 
In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
word_tokenize(text)

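The output of text.split() is shown below. Note that word_tokenize does not return exactly the same list: it also separates punctuation into its own tokens, so '2002,' becomes '2002' and ','.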

Output : ['Founded', 'in', '2002,', 'SpaceX’s', 'mission', 'is', 'to', 'enable', 'humans', 
          'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi-planet', 
          'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on', 'Mars.', 'In', 
          '2008,', 'SpaceX’s', 'Falcon', '1', 'became', 'the', 'first', 'privately', 
          'developed', 'liquid-fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth.']
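
Whitespace splitting leaves punctuation attached to the words, as in '2008,' and 'Earth.' in the output above. As a rough illustration using only the standard re module, a regular expression can keep hyphenated and apostrophe-joined words whole while splitting other punctuation into separate tokens. This is not NLTK's algorithm, and TOKEN_PATTERN is a hypothetical pattern chosen for this sample sentence only.

```python
import re

# Hypothetical pattern: a word, optionally extended by parts joined with a
# hyphen or apostrophe, OR any single character that is neither a word
# character nor whitespace (i.e. a punctuation mark).
TOKEN_PATTERN = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")

# The second sentence of the sample text, joined onto one line
text = ("In 2008, SpaceX’s Falcon 1 became the first privately developed "
        "liquid-fuel launch vehicle to orbit the Earth.")

split_tokens = text.split()
regex_tokens = TOKEN_PATTERN.findall(text)
print(split_tokens)  # contains '2008,' and 'Earth.'
print(regex_tokens)  # contains '2008', ',' and 'Earth', '.'
```

For real text a dedicated tokenizer such as word_tokenize is usually the safer choice, since a single pattern like this one does not handle abbreviations, numbers with separators, and similar cases.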
