Scraping A Dynamic Web Page Using Puppeteer


In my previous blog post, I explained how to use Puppeteer to scrape data from a static web page. Now, what if we need to scrape data from a dynamic web page? Fortunately, Puppeteer can also automate the browser: we can simulate clicks, scrolls, key presses, and so on.

We’re going to use the same page as in the previous post, since it also contains dynamic content. We’re going to collect the news items that appear after we click the Load More button 3 times.

To do that, we’ll need to launch Puppeteer and create a new page. This time, we’ll use headless: false so that we can see what the browser is doing. After that, we’ll tell the page to navigate to the target URL, then wait for 3 seconds to give everything on the page time to load. Here’s the code to do that:

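const puppeteer = require('puppeteer')
// The statements below use await, so run them inside an async function
const browser = await puppeteer.launch({headless: false}) // show the browser window
const page = await browser.newPage()
await page.goto('https://www.cnbc.com/economy/')
await page.waitFor(3000) // give the page 3 seconds to finish loading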
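
If page.waitFor() is not available in the Puppeteer version you have installed (it was deprecated in later releases), a plain JavaScript pause can do the same job. The sleep helper below is a name I made up, not part of Puppeteer; treat it as a minimal sketch built on setTimeout.

```javascript
// Hypothetical helper (not a Puppeteer API): resolves after `ms` milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Usage inside an async function, in place of page.waitFor(3000):
// await sleep(3000)
```

A fixed pause is the simplest option, but it always waits the full time even when the page is ready sooner.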

Next up, we’ll create a function that scrolls the page almost to the end. The scrolling is for demonstration purposes only, so that you can see what’s going on in the browser. The list below is our sequence of actions. We’ll repeat it 3 times.

  1. Click the load more button
  2. Wait 2s to ensure the content is loaded
  3. Scroll the page

To write the actions above as code, we first need to know the selector of the Load More button. The image below shows the details of the Load More button. Here we can see that the button has the “LoadMoreButton-loadMore” class. We’ll use this class as the selector in the code.

Selector of Load More Button

Now that we know which selector to use, we can use the page.click() method to simulate a click. To click the Load More button, we can simply run this command:

await page.click('.LoadMoreButton-loadMore')
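
page.click() throws an error if no element matches the selector at that moment. One way to guard against that is to wait for the button with page.waitForSelector() first. Here is a small sketch; clickWhenReady is a hypothetical helper name, and the 5000 ms timeout is an arbitrary value I picked.

```javascript
// Hypothetical helper: wait until the element exists, then click it.
// `page` is a Puppeteer Page; the 5000 ms default timeout is an arbitrary choice.
async function clickWhenReady(page, selector, timeout = 5000) {
  await page.waitForSelector(selector, { timeout }) // rejects if the timeout passes
  await page.click(selector)
}

// Usage inside an async function:
// await clickWhenReady(page, '.LoadMoreButton-loadMore')
```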

Next, we’ll create a scrolling function to be evaluated with the page.evaluate() method. The function looks like the code below. The window.scrollTo() method accepts 2 arguments: the first is the horizontal scroll position and the second is the vertical scroll position. We’ll set the vertical scroll position to 90 percent of the document height.

let scrollToBottom = () => window.scrollTo(0, document.body.scrollHeight * 0.9)
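
One thing to keep in mind: the function given to page.evaluate() runs inside the browser page, so it can’t see variables from the Node script. Values have to be passed as extra arguments to page.evaluate(). The sketch below shows the idea with a parameterized version of the scroll function; scrollToFraction is a name I made up for illustration.

```javascript
// Hypothetical variant of scrollToBottom that takes the fraction as a parameter.
// It runs in the browser, where window and document are available.
const scrollToFraction = (fraction) =>
  window.scrollTo(0, document.body.scrollHeight * fraction)

// Usage inside an async function; 0.9 is passed into the page as `fraction`:
// await page.evaluate(scrollToFraction, 0.9)
```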

After that, we can pass this function to the page.evaluate() method. Combined with the rest of our code, it looks like the code below. We’re going to repeat these statements 3 times.

let scrollToBottom = () => window.scrollTo(0, document.body.scrollHeight * 0.9)
for (let i = 0; i < 3; i++) {
  await page.click('.LoadMoreButton-loadMore') // 1. click the Load More button
  await page.waitFor(2000) // 2. wait 2s for the new content
  await page.evaluate(scrollToBottom) // 3. scroll the page
}
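
The loop above assumes the Load More button is still there on every pass. If the list runs out and the button disappears, page.click() will fail. A possible variation is to look the button up with page.$() before each click and stop when it’s gone. This is only a sketch under that assumption; loadMore and its parameters are hypothetical names, not something from Puppeteer.

```javascript
// Hypothetical helper: click the button up to `times` times, stopping early
// if the button is no longer on the page. Returns the number of clicks made.
async function loadMore(page, selector, times, delayMs) {
  let clicks = 0
  for (let i = 0; i < times; i++) {
    const button = await page.$(selector) // resolves to null when nothing matches
    if (!button) break
    await button.click()
    await new Promise((resolve) => setTimeout(resolve, delayMs)) // wait for new content
    clicks++
  }
  return clicks
}

// Usage inside an async function:
// const clicks = await loadMore(page, '.LoadMoreButton-loadMore', 3, 2000)
```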

With all that code written, we can run it and see the browser in action. Here’s how it looks:

The Code In Action

Now that we know our code correctly loads more data, we can move on to extracting it. First, we need to find the right selector. We can see from the image below that we can get the title and link with the “.Card-title” selector.

The Target Selector

There’s a problem, though: elements outside this section also use the same selector. In that case, we need to narrow our search. To do that, we can use the parent element.

Container Selector

We can use the text “More In Economy” to identify the parent container. The image above shows that the whole section has the “PageBuilder-pageRow” class. So, we’ll look for the element matching “.PageBuilder-pageRow” that contains “More In Economy”.

After all that, we can build our evaluate function to get our data. The code below shows the function that retrieves it. First, we get all elements matching “.PageBuilder-pageRow”. After that, we use the find() method to get the one element that has “More In Economy” inside. Then, we look for elements matching “.Card-title” inside it and extract their text and links.

let pageFunction = () => {
  let selectors = document.querySelectorAll(".PageBuilder-pageRow")
  selectors = [...selectors] //Convert to array
  let container = selectors.find((element) => {
    return element.textContent.includes("More In Economy")
  })
  let titleElements = container.querySelectorAll(".Card-title")
  titleElements = [...titleElements] //Convert to array
  return titleElements.map((element) => [element.innerText, element.href])
}
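
Note that find() returns undefined when no element matches, and in that case container.querySelectorAll() would throw. If the page layout changes, a defensive version could return an empty list instead. Below is a sketch of that idea; safePageFunction is a hypothetical name, and the selectors are the same ones used above.

```javascript
// Hypothetical variant of pageFunction that returns an empty list when the
// "More In Economy" section is not found. It runs in the browser page.
const safePageFunction = () => {
  const rows = [...document.querySelectorAll('.PageBuilder-pageRow')]
  const container = rows.find((row) => row.textContent.includes('More In Economy'))
  if (!container) return [] // find() gives undefined when nothing matches
  return [...container.querySelectorAll('.Card-title')]
    .map((element) => [element.innerText, element.href])
}

// Usage inside an async function:
// const data = await page.evaluate(safePageFunction)
```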

Scraping result
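
The function returns an array of [title, link] pairs. If you’d rather work with objects, you could reshape the result on the Node side after page.evaluate() resolves. The helper below is only a sketch: toRecords is a hypothetical name, and skipping entries without a link or with a repeated link is my own assumption about what a clean result should look like.

```javascript
// Hypothetical helper: turn [title, link] pairs into objects, dropping
// entries that have no link or whose link was already seen.
function toRecords(pairs) {
  const seen = new Set()
  const records = []
  for (const [title, url] of pairs) {
    if (!url || seen.has(url)) continue
    seen.add(url)
    records.push({ title: title.trim(), url })
  }
  return records
}

// Usage inside an async function:
// const records = toRecords(await page.evaluate(pageFunction))
```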

Here’s the whole code if you want to try it out yourself.

Feel free to ask me any questions. Happy Coding 😀
