
Extracting Paragraph Text from Web Pages with BeautifulSoup in Python

إياد أحمد

Question

I'm trying to collect some information from various websites, and I'd like to use bs4 to extract some paragraphs from them. How can I do that?
For example, I have the following web page:
https://undergrad.cs.umd.edu/what-computer-science
I want to scrape the paragraph text from this page.
 


You can do it as follows:

# Import the modules
from bs4 import BeautifulSoup
import requests

# Target URL
url = "https://undergrad.cs.umd.edu/what-computer-science"
# Send a GET request and fetch the page
page = requests.get(url)
# Parse the page with BeautifulSoup, using the lxml parser
# (requires the lxml package to be installed)
soup = BeautifulSoup(page.content, "lxml")
# Extract every paragraph and print its text
for para in soup.find_all("p"):
    print(para.get_text())

Output:

Computer Science is the study of computers and computational systems. Unlike electrical and computer engineers, computer scientists deal mostly with software and software systems; this includes their theory, design, development, and application.
Principal areas of study within Computer Science include artificial intelligence, computer systems and networks, security, database systems, human computer interaction, vision and graphics, numerical analysis, programming languages, software engineering, bioinformatics and theory of computing. 
Although knowing how to program is essential to the study of computer science, it is only one element of the field. Computer scientists design and analyze algorithms to solve programs and study the performance of computer hardware and software. The problems that computer scientists encounter range from the abstract-- determining what problems can be solved with computers and the complexity of the algorithms that solve them – to the tangible – designing applications that perform well on handheld devices, that are easy to use, and that uphold security measures. 
Graduates of University of Maryland’s Computer Science Department are lifetime learners; they are able to adapt quickly with this challenging field.
Contact Our Office
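
The question asks for one particular paragraph, while the loop above prints every <p> on the page, including unrelated ones such as "Contact Our Office". As a minimal sketch of how the result could be narrowed down, the example below picks a paragraph by a keyword it contains, or restricts the search to one container with a CSS selector. The SAMPLE_HTML string, the "content" and "footer" ids, and the two helper functions are hypothetical and are not taken from the real page; inspect the actual page in your browser to find the right selector.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
# The ids "content" and "footer" are assumptions, not the real page's markup.
SAMPLE_HTML = """
<html><body>
  <div id="content">
    <p>Computer Science is the study of computers and computational systems.</p>
    <p>Principal areas of study include artificial intelligence and security.</p>
  </div>
  <div id="footer"><p>Contact Our Office</p></div>
</body></html>
"""


def find_paragraph(html, keyword):
    """Return the text of the first <p> that contains keyword, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for para in soup.find_all("p"):
        text = para.get_text(strip=True)
        if keyword in text:
            return text
    return None


def paragraphs_in(html, css_selector):
    """Return the text of every element matched by css_selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.select(css_selector)]


# Pick a single paragraph by a phrase it is known to contain.
print(find_paragraph(SAMPLE_HTML, "study of computers"))

# Or keep only the paragraphs inside one container, skipping the footer.
print(paragraphs_in(SAMPLE_HTML, "#content p"))
```

With a live page, the same functions could be given page.content from the requests example above instead of SAMPLE_HTML.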

 


Besides using lxml as the parser, you can also use Python's built-in html.parser, together with urllib.request as an alternative to the requests library, as follows:

# Import the libraries
import urllib.request
from bs4 import BeautifulSoup

# The page URL
url = "https://undergrad.cs.umd.edu/what-computer-science"

# Open the URL and get the response
html = urllib.request.urlopen(url)

# Parse the response with Python's built-in HTML parser
htmlParse = BeautifulSoup(html, 'html.parser')

# Get all the paragraphs and print their text
for para in htmlParse.find_all("p"):
    print(para.get_text())

It returns the following:

Computer Science is the study of computers and computational systems. Unlike electrical and computer engineers, computer scientists deal mostly with software and software systems; this includes their theory, design, development, and application.
Principal areas of study within Computer Science include artificial intelligence, computer systems and networks, security, database systems, human computer interaction, vision and graphics, numerical analysis, programming languages, software engineering, bioinformatics and theory of computing. 
Although knowing how to program is essential to the study of computer science, it is only one element of the field. Computer scientists design and analyze algorithms to solve programs and study the performance of computer hardware and software. The problems that computer scientists encounter range from the abstract-- determining what problems can be solved with computers and the complexity of the algorithms that solve them – to the tangible – designing applications that perform well on handheld devices, that are easy to use, and that uphold security measures. 
Graduates of University of Maryland’s Computer Science Department are lifetime learners; they are able to adapt quickly with this challenging field.
Contact Our Office
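
The two answers differ mainly in the parser: "lxml" is a separate package that has to be installed, while "html.parser" ships with Python. As a rough sketch, assuming you would like to prefer lxml but still have the script work when it is missing, the parser choice could be wrapped as below. The helper names make_soup and extract_paragraphs are hypothetical; FeatureNotFound is the exception bs4 raises when the requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser (for example lxml) is not installed.
        return BeautifulSoup(markup, "html.parser")


def extract_paragraphs(markup):
    """Return the stripped text of every non-empty <p> in markup."""
    soup = make_soup(markup)
    texts = (p.get_text(strip=True) for p in soup.find_all("p"))
    return [t for t in texts if t]


# Small inline document so the sketch runs without network access.
print(extract_paragraphs("<p>One</p><p> </p><p>Two</p>"))
```

For well-formed pages both parsers should give the same paragraphs; they can differ in how they repair broken markup.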

 
