Scraping A Web Page Using Node.js And Puppeteer


Puppeteer is a Node library that provides the ability to control Chrome or Chromium. It has many uses: it can automate interactions inside a web page (submitting forms, simulating clicks, and so on), and it can also scrape data from a web page. In this article, I will explain how to scrape data using Puppeteer.

To begin scraping, we need to create a Node project. In this case, we’ll put the project inside a folder named scraping. Open a command line interface and change into that folder. Then, we’ll run this command:

npm init

Next up, we’ll install Puppeteer. To install it in the project, run this command:

npm install puppeteer

After that, create a .js file. We’ll name it index.js. This is where our code will be. After this step, our project directory should contain index.js next to the package.json file and the node_modules folder that npm created.

Next, we’ll include the puppeteer module inside our index.js file. We’ll also create an async function and call it immediately after defining it. We use an async function because we want to use the await syntax inside it.

const puppeteer = require('puppeteer');
(async function () {

})();
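To see why the async wrapper matters, here is a minimal sketch that runs without Puppeteer. The fakeStep helper is a hypothetical stand-in for any promise-returning call, such as launching a browser. The named main function with a catch handler is one possible alternative to the immediately invoked function above, not the only way to write it.

```javascript
// Hypothetical stand-in for a promise-returning step (for example, a page load).
function fakeStep(value) {
  return new Promise(resolve => setTimeout(() => resolve(value), 10));
}

// await pauses this function until each promise settles, so the steps
// run in order without nested callbacks.
async function main() {
  const first = await fakeStep('launched');
  const second = await fakeStep('loaded');
  return [first, second];
}

main()
  .then(steps => console.log(steps))
  .catch(error => {
    // Any rejected step ends up here instead of failing silently.
    console.error('Script failed:', error);
    process.exitCode = 1;
  });
```

The catch handler is the main practical difference: with the bare immediately invoked function, a failed step surfaces only as an unhandled promise rejection.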

Inside the function, we’ll launch a headless Chrome instance and open a new page. Then, we can make the page go to the URL we want. In this case, we’ll use CNBC’s economy news page.

const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto('https://www.cnbc.com/economy/')
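While developing, it can help to watch what the browser is doing instead of running it headless. The sketch below builds the options object for puppeteer.launch. The headless and slowMo fields are documented launch options, while the DEBUG_SCRAPER environment variable and the launchOptions helper are names made up for this example.

```javascript
// Builds the options passed to puppeteer.launch().
// DEBUG_SCRAPER is a hypothetical environment variable chosen for this sketch.
function launchOptions(env = process.env) {
  const debug = env.DEBUG_SCRAPER === '1';
  return {
    headless: !debug,        // show the browser window only while debugging
    slowMo: debug ? 100 : 0, // delay each operation by 100 ms while debugging
  };
}

console.log(launchOptions({ DEBUG_SCRAPER: '1' }));

// Usage inside the async function:
// const browser = await puppeteer.launch(launchOptions());
```

Running the script with DEBUG_SCRAPER=1 set would then open a visible window, and leaving it unset keeps the default headless behavior.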

Suppose we want to get the title and URL of each trending news item. Inspecting the page shows that each title is rendered as an <a> tag with the class “TrendingNow-title”. The <a> tag also links to the URL of the news item.
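Some pages add content with JavaScript after the initial load, so the links may not exist yet when goto resolves. Whether that applies to this page is an assumption I have not verified. If it does, one option is to wait for the selector before reading it. waitForSelector is part of Puppeteer’s Page API, while the waitForTrending helper and its 10-second default are hypothetical.

```javascript
// `page` is expected to be a Puppeteer Page (or anything with waitForSelector).
// Resolves to true when an element matching `selector` appears in time.
async function waitForTrending(page, selector, timeoutMs = 10000) {
  try {
    await page.waitForSelector(selector, { timeout: timeoutMs });
    return true;
  } catch (error) {
    // Puppeteer rejects with a TimeoutError when nothing matches in time.
    if (error.name === 'TimeoutError') return false;
    throw error; // anything else is unexpected, so let it surface
  }
}

// Usage inside the async function, after page.goto(...):
// const found = await waitForTrending(page, '.TrendingNow-title');
```

Returning a boolean lets the script print a clear message when the class name has changed, instead of silently producing an empty list.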

To get the elements, we’ll use the evaluate method of our page instance. This method accepts a pageFunction argument, which is a function to be evaluated in the page context, and returns a promise that resolves to pageFunction’s return value. So, inside pageFunction, we’ll collect the link and text of the “.TrendingNow-title” elements. Then, we’ll pass pageFunction to the evaluate method.

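// Runs in the page context and returns [title, url] pairs
let pageFunction = () => {
  let elements = [...document.querySelectorAll(".TrendingNow-title")]
  return elements.map(element => [ element.innerHTML, element.href ])
}
let result = await page.evaluate( pageFunction )
console.log(result)
// Close the browser so that the Node process can exit
await browser.close()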
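One detail worth knowing is that pageFunction runs inside the browser, so it cannot see variables from the Node script. Extra arguments given to evaluate are passed through to it instead. The sketch below uses that to make the selector a parameter. The collectLinks and scrapeLinks names are hypothetical, and using textContent rather than innerHTML is my own substitution to drop any nested markup from the title.

```javascript
// Runs inside the page. The selector arrives as an argument because
// variables from the Node script are not visible in the browser context.
function collectLinks(selector) {
  return Array.from(document.querySelectorAll(selector), element => ({
    title: element.textContent.trim(),
    url: element.href,
  }));
}

// Runs in Node. `page` is a Puppeteer Page; arguments after the function
// are forwarded to it by page.evaluate.
function scrapeLinks(page, selector) {
  return page.evaluate(collectLinks, selector);
}

// Usage inside the async function:
// const links = await scrapeLinks(page, '.TrendingNow-title');
```

The return value has to survive the trip back to Node, which is why the function returns plain objects with strings rather than the DOM elements themselves.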

The result variable will now contain the titles and URLs of the trending news. That’s all we need to do. To run the script, we’ll go back to our command line interface and execute the command below. It will print the trending news to the console.

node index.js
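Printing to the console is fine for a quick check. If the results need to be kept, a small sketch like the one below could turn the [title, url] pairs into objects and write them to a JSON file. The toRecords and saveRecords helpers, the trending.json file name, and the sample data are all made up for illustration; real output depends on the live page.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Converts the [title, url] pairs returned by page.evaluate into objects.
function toRecords(pairs) {
  return pairs.map(([title, url]) => ({ title, url }));
}

// Writes the records as pretty-printed JSON and returns the file path.
function saveRecords(records, filePath) {
  fs.writeFileSync(filePath, JSON.stringify(records, null, 2));
  return filePath;
}

// Sample data invented for this example, written to the temp directory.
const sample = [['Example headline', 'https://example.test/story']];
const outputPath = path.join(os.tmpdir(), 'trending.json');
saveRecords(toRecords(sample), outputPath);
console.log(fs.readFileSync(outputPath, 'utf8'));
```

In index.js, the sample array would be replaced by the result variable from the scraping step.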

Output of the Code

The resulting code is simply the snippets above placed, in order, inside the async function.

Feel free to ask me any questions. Happy Coding 😀
