Karim Aya

Karim Aya

1564631749

Web Scraping news articles in Python

This article is the second of a series in which I will cover the whole process of developing a machine learning project. If you have not read the first one, I strongly encourage you to do it here.

The project involves the creation of a real-time web application that gathers data from several newspapers and shows a summary of the different topics that are being discussed in the news articles.

This is achieved with a supervised machine learning classification modelthat is able to predict the category of a given news article, a web scraping method that gets the latest news from the newspapers, and an interactive web application that shows the obtained results to the user.

As I explained in the first post of this series, the motivation behind writing these articles is that a lot of the articles or content published on the internet, books or literature regarding data science and machine learning models focus on the modelling part with the training data. However, a machine learning project is much more than that: once you have a trained model, you need to feed new data to it and what is more important, you need to provide useful insights to the final user.

The whole process is divided in three different posts:

·        Classification model training (link)

·        News articles web scraping (this post)

·        App creation and deployment (will be published soon)

The github repo can be found here. It includes all the code and a complete report.

In the first article, we developed the text classification model in Python, which allowed us to get a certain news article text and predict its category with an overall good accuracy.

This post covers the second part: News articles web scraping. We’ll create a script that scrapes the latest news articles from different newspapers and stores the text, which will be fed into the model afterwards to get a prediction of its category. We’ll cover it in the following steps:

1.    A brief introduction to webpages and HTML

2.  Web scraping with BeautifulSoup in Python

1. A brief introduction to webpage design and HTML

If we want to be able to extract news articles (or, in fact, any other kind of text) from a website, the first step is to know how a website works. We will follow an example with the Towards Data Science webpage.

When we insert an url into the web browser (i.e. Google Chrome, Firefox, etc…) and access to it, what we see is the combination of three technologies:

1.    HTML (HyperText Markup Language): it is the standard language for adding content to a website. It allows us to insert text, images and other things to our site. In one word, HTML determines the content of a webpage.

2.  CSS (Cascading Style Sheets): this language allows us to set the visual design of a website. This means, it determines the style of a webpage.

3.  JavaScript: JavaScript allows us to make the content and the style interactive.

Note that these three are programming languages. They will allow us to create and manipulate every aspect of the design of a webpage.

However, if we want a website to be accessible to every one in a browser, we need to know about additional things: standing up a web server, using a certain domain, etc… But since we are only interested in extracting content from a webpage, this will be enough for today.

Let’s illustrate these concepts with an example. When we visit the Towards Data Science homepage, we see the following:

If we deleted the CSS content from the webpage, we would see something like this:

And if we disabled JavaScript, we would not be able to use this pop-up no more:

At this point, I’ll ask the following question:

“If I want to extract the content of a webpage via web scraping, where do I need to look up?”

If your answer was the HTML code, then you’re absolutely getting it. In the above example we can see that after disabling CSS, the content (text, images, etc…) is still there.

So, the last step before performing web scraping methods is to understand a bit of the HTML language.

HTML is, from a really basic point of view, composed of elements that have attributes. An element could be a paragraph, and an attribute could be that the paragraph is in bold letter.

There are a lot of different types of elements, each one with its own attributes. To identify an element (this means, as an example, to set if some text is a heading or a paragraph) we use tags. These tags are represented with the <> symbols (for example, a 

 tag means a certain text is acting as a paragraph).

For example, this HTML code below allows us to change the alignment of the paragraphs:

Consequently, when we visit a website, we will be able to find the contentand its properties in the HTML code.

Once we have presented these concepts, we are ready for some web scraping!

2. Web scraping with BeautifulSoup in Python

There are several packages in Python that allow us to scrape information from webpages. One of the most common ones is BeautifulSoup. The official package information can be found here.

BeautifulSoup allows us to parse the HTML content of a given URL and access its elements by identifying them with their tags and attributes. For this reason, we will use it to extract certain pieces of text from the websites.

It is an extremely easy-to-use yet powerful package. With almost 3–5 lines of code we will be able to extract any text we want from the internet.

To install it, please type the following code into your Python distribution:

! pip install beautifulsoup4

So as to provide BeautifulSoup with the HTML code of any page, we will also need to import the requests module. In order to install it if it’s not already included in your python distribution, please type:

! pip install requests

We will use the requests module to get the HTML code from the page and then navigate through it with the BeautifulSoup package. We will learn to use two commands that will be enough for our task:

·        find_all(element tag, attribute): it allows us to locate any HTML element from a webpage introducing its tag and attributes. This command will locate all the elements of the same type. In order to get only the first one, we can use find() instead.

·        get_text(): once we have located a given element, this command will allow us to extract the text inside.

So, at this point, what we need to do is to navigate through the HTML code of our webpage (for example, in Google Chrome we need to enter the webpage, press right click button and go to See source code) and locate the elements we want to scrape. We can simply do this searching with Ctrl+F or Cmd+F once we are seeing the source code.

Once we have identified the elements of interest, we will get the HTML code with the requests module and extract those elements with BeautifulSoup.

We will carry out an example with the El Pais English newspaper. We will first try to web scrape the news articles titles from the frontpage and then extract the text out of them.

Once we enter the website, we need to inspect the HTML code to locate the news articles. After a fast look we can see that each article in the frontpage is an element like this:

The title is an <h2> (heading-2) element with itemprop=”headline" and class=”articulo-titulo" atributes. It has an <a> element with an hrefattribute which contains the text. So, in order to extract the text, we need to code the following commands:

# importing the necessary packages
import requests
from bs4 import BeautifulSoup

With the requests module we can get the HTML content and save into the coverpage variable:

r1 = requests.get(url)
coverpage = r1.content

Next, we need to create a soup in order to allow BeautifulSoup to work:

soup1 = BeautifulSoup(coverpage, ‘html5lib’)

And finally, we can locate the elements we are looking for:

coverpage_news = soup1.find_all(‘h2’, class_=‘articulo-titulo’)

This will return a list in which each element is a news article (because with find_all we are getting all ocurrences):

If we code the following command, we will be able to extract the text:

coverpage_news[4].get_text()

If we want to access the value of an attribute (in this case, the link), we can type the following:

coverpage_news[4][‘href’]

And we’ll get the link in plain text.

If you have understood until this point, you are ready to web scrape any content you want.

The next step would be to access each of the news articles content with the href attribute, get the source code again and find the paragraphs in the HTML code to finally get them with BeautifulSoup. It’s the same idea as before, but we need to locate the tags and attributes that identify the news article content.

The code of the full process is the following. I will show the code but won’t enter in the same detail as before since it’s exactly the same idea.

# Scraping the first 5 articles
number_of_articles = 5

Empty lists for content, links and titles

news_contents = []
list_links = []
list_titles = []

for n in np.arange(0, number_of_articles):

# only news articles (there are also albums and other things)
if "inenglish" not in coverpage\_news\[n\].find('a')\['href'\]:  
    continue

# Getting the link of the article
link = coverpage\_news\[n\].find('a')\['href'\]
list\_links.append(link)

# Getting the title
title = coverpage\_news\[n\].find('a').get\_text()
list\_titles.append(title)

# Reading the content (it is divided in paragraphs)
article = requests.get(link)
article\_content = article.content
soup\_article = BeautifulSoup(article\_content, 'html5lib')
body = soup\_article.find\_all('div', class\_='articulo-cuerpo')
x = body\[0\].find\_all('p')

# Unifying the paragraphs
list\_paragraphs = \[\]
for p in np.arange(0, len(x)):
    paragraph = x\[p\].get\_text()
    list\_paragraphs.append(paragraph)
    final\_article = " ".join(list\_paragraphs)
    
news\_contents.append(final\_article)

All the details can be found in my github repo.

It is important to mention that this code is only useful for this webpage in particular. If we want to scrape another one, we should expect that elements are identified with different tags and attributes. But once we know how to identify them, the process is exactly the same.

At this point, we are able to extract the content of different news articles. The final step is to apply the machine learning model we trained in the first post to predict its categories and show a summary to the user. This will be covered in the final post of this series.

Originally published by Miguel Fernández Zafra at towardsdatascience.com

#python #web-development #data-science

What is GEEK

Buddha Community

Web Scraping news articles in Python
Sival Alethea

Sival Alethea

1624402800

Beautiful Soup Tutorial - Web Scraping in Python

The Beautiful Soup module is used for web scraping in Python. Learn how to use the Beautiful Soup and Requests modules in this tutorial. After watching, you will be able to start scraping the web on your own.
📺 The video in this post was made by freeCodeCamp.org
The origin of the article: https://www.youtube.com/watch?v=87Gx3U0BDlo&list=PLWKjhJtqVAbnqBxcdjVGgT3uVR10bzTEB&index=12
🔥 If you’re a beginner. I believe the article below will be useful to you ☞ What You Should Know Before Investing in Cryptocurrency - For Beginner
⭐ ⭐ ⭐The project is of interest to the community. Join to Get free ‘GEEK coin’ (GEEKCASH coin)!
☞ **-----CLICK HERE-----**⭐ ⭐ ⭐
Thanks for visiting and watching! Please don’t forget to leave a like, comment and share!

#web scraping #python #beautiful soup #beautiful soup tutorial #web scraping in python #beautiful soup tutorial - web scraping in python

How POST Requests with Python Make Web Scraping Easier

When scraping a website with Python, it’s common to use the

urllibor theRequestslibraries to sendGETrequests to the server in order to receive its information.

However, you’ll eventually need to send some information to the website yourself before receiving the data you want, maybe because it’s necessary to perform a log-in or to interact somehow with the page.

To execute such interactions, Selenium is a frequently used tool. However, it also comes with some downsides as it’s a bit slow and can also be quite unstable sometimes. The alternative is to send a

POSTrequest containing the information the website needs using the request library.

In fact, when compared to Requests, Selenium becomes a very slow approach since it does the entire work of actually opening your browser to navigate through the websites you’ll collect data from. Of course, depending on the problem, you’ll eventually need to use it, but for some other situations, a

POSTrequest may be your best option, which makes it an important tool for your web scraping toolbox.

In this article, we’ll see a brief introduction to the

POSTmethod and how it can be implemented to improve your web scraping routines.

#python #web-scraping #requests #web-scraping-with-python #data-science #data-collection #python-tutorials #data-scraping

Shardul Bhatt

Shardul Bhatt

1626775355

Why use Python for Software Development

No programming language is pretty much as diverse as Python. It enables building cutting edge applications effortlessly. Developers are as yet investigating the full capability of end-to-end Python development services in various areas. 

By areas, we mean FinTech, HealthTech, InsureTech, Cybersecurity, and that's just the beginning. These are New Economy areas, and Python has the ability to serve every one of them. The vast majority of them require massive computational abilities. Python's code is dynamic and powerful - equipped for taking care of the heavy traffic and substantial algorithmic capacities. 

Programming advancement is multidimensional today. Endeavor programming requires an intelligent application with AI and ML capacities. Shopper based applications require information examination to convey a superior client experience. Netflix, Trello, and Amazon are genuine instances of such applications. Python assists with building them effortlessly. 

5 Reasons to Utilize Python for Programming Web Apps 

Python can do such numerous things that developers can't discover enough reasons to admire it. Python application development isn't restricted to web and enterprise applications. It is exceptionally adaptable and superb for a wide range of uses.

Robust frameworks 

Python is known for its tools and frameworks. There's a structure for everything. Django is helpful for building web applications, venture applications, logical applications, and mathematical processing. Flask is another web improvement framework with no conditions. 

Web2Py, CherryPy, and Falcon offer incredible capabilities to customize Python development services. A large portion of them are open-source frameworks that allow quick turn of events. 

Simple to read and compose 

Python has an improved sentence structure - one that is like the English language. New engineers for Python can undoubtedly understand where they stand in the development process. The simplicity of composing allows quick application building. 

The motivation behind building Python, as said by its maker Guido Van Rossum, was to empower even beginner engineers to comprehend the programming language. The simple coding likewise permits developers to roll out speedy improvements without getting confused by pointless subtleties. 

Utilized by the best 

Alright - Python isn't simply one more programming language. It should have something, which is the reason the business giants use it. Furthermore, that too for different purposes. Developers at Google use Python to assemble framework organization systems, parallel information pusher, code audit, testing and QA, and substantially more. Netflix utilizes Python web development services for its recommendation algorithm and media player. 

Massive community support 

Python has a steadily developing community that offers enormous help. From amateurs to specialists, there's everybody. There are a lot of instructional exercises, documentation, and guides accessible for Python web development solutions. 

Today, numerous universities start with Python, adding to the quantity of individuals in the community. Frequently, Python designers team up on various tasks and help each other with algorithmic, utilitarian, and application critical thinking. 

Progressive applications 

Python is the greatest supporter of data science, Machine Learning, and Artificial Intelligence at any enterprise software development company. Its utilization cases in cutting edge applications are the most compelling motivation for its prosperity. Python is the second most well known tool after R for data analytics.

The simplicity of getting sorted out, overseeing, and visualizing information through unique libraries makes it ideal for data based applications. TensorFlow for neural networks and OpenCV for computer vision are two of Python's most well known use cases for Machine learning applications.

Summary

Thinking about the advances in programming and innovation, Python is a YES for an assorted scope of utilizations. Game development, web application development services, GUI advancement, ML and AI improvement, Enterprise and customer applications - every one of them uses Python to its full potential. 

The disadvantages of Python web improvement arrangements are regularly disregarded by developers and organizations because of the advantages it gives. They focus on quality over speed and performance over blunders. That is the reason it's a good idea to utilize Python for building the applications of the future.

#python development services #python development company #python app development #python development #python in web development #python software development

Sarah Adina

1604646180

A Beginner's Guide to Web Scraping in Python

In this article, you’re going to learn the basics of web scraping in python and we’ll do a demo project to scrape quotes from a website.

What is web scraping?

Web scraping is extracting data from a website programmatically. Using web scraping you can extract the text in HTML tags, download images & files and almost do anything you do manually with copying and pasting but in a faster way.

Should you learn web scraping?

Yeah, absolutely as a programmer in many cases you might need to use the content found on other people’s websites but those website doesn’t give you API to that, that’s why you need to learn web scraping to be able to that.

#python #python-tutorials #web-scraping-with-python #python-programming #python-tips

Arvel  Parker

Arvel Parker

1593156510

Basic Data Types in Python | Python Web Development For Beginners

At the end of 2019, Python is one of the fastest-growing programming languages. More than 10% of developers have opted for Python development.

In the programming world, Data types play an important role. Each Variable is stored in different data types and responsible for various functions. Python had two different objects, and They are mutable and immutable objects.

Table of Contents  hide

I Mutable objects

II Immutable objects

III Built-in data types in Python

Mutable objects

The Size and declared value and its sequence of the object can able to be modified called mutable objects.

Mutable Data Types are list, dict, set, byte array

Immutable objects

The Size and declared value and its sequence of the object can able to be modified.

Immutable data types are int, float, complex, String, tuples, bytes, and frozen sets.

id() and type() is used to know the Identity and data type of the object

a**=25+**85j

type**(a)**

output**:<class’complex’>**

b**={1:10,2:“Pinky”****}**

id**(b)**

output**:**238989244168

Built-in data types in Python

a**=str(“Hello python world”)****#str**

b**=int(18)****#int**

c**=float(20482.5)****#float**

d**=complex(5+85j)****#complex**

e**=list((“python”,“fast”,“growing”,“in”,2018))****#list**

f**=tuple((“python”,“easy”,“learning”))****#tuple**

g**=range(10)****#range**

h**=dict(name=“Vidu”,age=36)****#dict**

i**=set((“python”,“fast”,“growing”,“in”,2018))****#set**

j**=frozenset((“python”,“fast”,“growing”,“in”,2018))****#frozenset**

k**=bool(18)****#bool**

l**=bytes(8)****#bytes**

m**=bytearray(8)****#bytearray**

n**=memoryview(bytes(18))****#memoryview**

Numbers (int,Float,Complex)

Numbers are stored in numeric Types. when a number is assigned to a variable, Python creates Number objects.

#signed interger

age**=**18

print**(age)**

Output**:**18

Python supports 3 types of numeric data.

int (signed integers like 20, 2, 225, etc.)

float (float is used to store floating-point numbers like 9.8, 3.1444, 89.52, etc.)

complex (complex numbers like 8.94j, 4.0 + 7.3j, etc.)

A complex number contains an ordered pair, i.e., a + ib where a and b denote the real and imaginary parts respectively).

String

The string can be represented as the sequence of characters in the quotation marks. In python, to define strings we can use single, double, or triple quotes.

# String Handling

‘Hello Python’

#single (') Quoted String

“Hello Python”

# Double (") Quoted String

“”“Hello Python”“”

‘’‘Hello Python’‘’

# triple (‘’') (“”") Quoted String

In python, string handling is a straightforward task, and python provides various built-in functions and operators for representing strings.

The operator “+” is used to concatenate strings and “*” is used to repeat the string.

“Hello”+“python”

output**:****‘Hello python’**

"python "*****2

'Output : Python python ’

#python web development #data types in python #list of all python data types #python data types #python datatypes #python types #python variable type