Where AI ACTUALLY belongs in Web Scraping

Don't fall for the AI buzzword in web scraping. It's not the solution you're

being told, but it does have a place in our scraping workflow. Let me show you.

We'll take this spider example I've written and slowly improve it using AI.

But first, we need to understand what our main pain points are when we're

scraping data. And I've categorized these into three different categories.

The first is being blocked or banned. There are a few ways around this, but

this is definitely one of the main things that's going to get in our way when

scraping data: actually being blocked by the website. The second one is selecting

the data we need from that raw HTML, or whatever we get back. Writing

selectors against this can become very brittle, and they are a real pain to manage

and maintain going forward. Anytime the site changes, they break. And the third

is setup time. The spider that I've written for this example probably took

me about 45 minutes, which is not that big a deal, but it was quite an easy

site to scrape. But even if they were easy all the time, imagine having to do

a hundred. That's an awful lot of time. And then plus the time you need to

maintain them, you're looking at a full-time developer job just to keep

those ones going, let alone adding any new ones. So, I reached out to Daniel at

Zyte. I've worked with them in the past to produce sponsored videos, and I know

that they're doing a lot in this space. He suggested I should take their AI

tools and share my honest opinion. So, this is a bit of a treat because I've

had full access to play with the latest tools, and I've had some time to really

consider where I think they fit in. You see, as LLMs get better and better and

more affordable, naturally, people are going to be looking for ways to utilize

them in their area. And this is no different to web scraping. But honestly,

I think most people are doing it wrong. Every time I see someone trying to send

pages and pages of HTML to an LLM to parse it, I just feel like this is

totally inefficient and not what we should be doing, right? So, let's

upgrade my spider. Here's my starting spider. As you can see, it's pretty

simple, but you know, fairly typical of what you might find. We go through all

the product links on the pages over here in start_requests, and we're going to

yield out some information. And this is how I got it. You know, CSS selectors,

finding the information on the page. Pretty typical of what you might see.

So, if we run this, scrapy crawl products, we're going to zip through

all of these. And you can see we're going to get some products

back here. And I think we have 48, which I think is about right given the fact

that I'm only asking for a few pages. We want to really improve on this. We want

to make sure that (a) we avoid being blocked and banned, because at the moment

we're just going through my standard IP, which will get blocked very quickly, and

(b) we avoid having to type out all of these selectors to yield the data.

These are going to be super brittle and I've only got three data points here.

Imagine if we had 5, 10, 15, 20. We want to utilize Zyte's ML model to do that. So

what you need to do is pip install the scrapy-zyte-api package.

And then we can come over to our settings and we want to put these at the

bottom somewhere. You need your Zyte API key, and you need to enable the add-on

like so. Like this. That's going to do everything for you to get that working.
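
As a rough sketch, the lines added to the bottom of settings.py could look like this. The setting names are the ones I understand scrapy-zyte-api to use; reading the key from an environment variable and the placeholder fallback are my own assumptions, not something shown in the video.

```python
import os

# Zyte API key. Read from the environment here so it stays out of version
# control; "YOUR_API_KEY" is only a placeholder fallback.
ZYTE_API_KEY = os.environ.get("ZYTE_API_KEY", "YOUR_API_KEY")

# Enable the scrapy-zyte-api add-on, which routes the spider's requests
# through Zyte API.
ADDONS = {
    "scrapy_zyte_api.Addon": 500,
}
```

With these two settings in place the spider code itself does not need to change for the requests to go through the API.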

Now, I'm actually just going to change this and get

rid of my user agent. Every request that I make now is going to go through the

Zyte API. So, we're going to have all of that ban protection. It's going to

rotate all the IPs for us. So we don't have to worry about proxies or anything

like that. It's all handled now. So let's modify this to use the ML model.

So I'm going to get rid of the getall because we're going to have a slightly

different approach to this. I don't think we're going to need... do we need

this or not? I can't remember. We'll find out. So let's put that in there.

We want to get rid of this because we want to do follow_all. So I'm going to

do for request in response.follow_all, which is like a shortcut for

creating requests within Scrapy. You could do it the other

way if you wanted to, but we might as well use these shortcuts whilst

they're there. And we still want to have our callback to

self.parse_item (not parser, parse_item), like so. Then you want to add in a

meta key. So I'm going to do request.meta, and the key is going to be

zyte_api_automap, like so. And then the value for this is going to be product, and that is

true. So what we're saying is every request we make we're going to go ahead

and add the meta for this. It's going to tell the API hey we want to use the

product ML model. So we don't want to have to worry about any of this stuff

here. We just want we want you to give us back that extracted data that matches

that schema. Then we yield out our request like so. Now we can get rid of

all of this. We don't need it. We can simply yield out response.raw_api_response,

and we want to ask for the product, like so. And this is just part

of the response that comes back from the API. So that's it. We've really

shortened our parsing here. And we're not even using it for the product

links in this case, but we're just basically going to say, hey, we'll use

this as an example. So, I'm going to save this, and I'm going to change the

name of my spider because I created a duplicate. We'll just call this one Z

products, like so. And I'm going to run scrapy crawl on my new

spider. I'll add an output file, and we'll call this one all.jsonl, for JSON lines. We'll

call this extract. So now when I run this, we're going to

basically go through the API, which is going to give us all of that ban

protection, all of the rotating IPs that we need. It's going to do it

asynchronously. You can see it all coming through here. It's going to

handle any retries and everything like that. So we're pretty much, you know,

going to get the information. We're going to get that data back. Then we're

also going to utilize the ML model through the AI to actually match the

data on that page to the schema. Now this ML model is really specifically

trained on, I should imagine, hundreds of thousands of product pages

at the moment. So it knows all the common points and it knows how to

extract them. You can see it's doing it right here. So once this is finished,

we'll have a quick look at the data and then we'll move on to using the LLM to

really fine-tune it. Okay, so let's open up our extract all. And we can see that

we've got all this information, all these data

points: name, price, currency, currencyRaw, availability,

including all of the images, all these additional properties,

everything. So there's loads of information here, pretty much

most of what you could ask for. As you can see, we achieved quite a

lot here with just that supervised ML model, allowing us to hand over all of

our HTML parsing to this part of the AI that's highly trained and highly

specific with that designated schema. So, we don't really have to think about

our selectors breaking or anything like that, which is really going to solve one

of the main problems for us. So, by just adding in the Zyte API, it's going to

utilize all of their technology, their AI behind the scenes, to allow us to

naturally bypass any bans. So, we don't have to worry about that. If you think

about it, you're going to need to use proxies anyway because you can't send a

load of requests to any website from the same IP without being blocked. So,

utilizing the Zyte API in this way, which fits really neatly into Scrapy

with their own package, it just makes a lot of sense and gives us that extra

edge and avoids that one pain point of getting banned. But the problem with

this ML model is that it can only return a fixed schema. And that's where the

custom attributes come in. So, what these are going to do is it's going to

allow us to pass off that little bit of information and write a prompt to pass

it on to the LLM. And that LLM is then going to look at what we've got and it's

going to try and extract that data for us. But what if there were extra data

points in here that you didn't get for whatever reason? So, I'm looking at this

data right here and this is a table, but for some reason in here there is no

category. So, we can see additional properties which is great. color, size,

description, HTML description, full URL, all this information. And here's another

one: drawers, storage bed. There's no category. So, what we're going to do is

we're going to add that in as an LLM touchpoint. So, let's close this and

let's come back to our code. This is very simple to do. We're already adding

in request meta here. So, we just need to add in another one. So, this

one is the custom attributes. So we want customAttributes here,

and this in itself is a dictionary as well. So I save this in my code editor.

Hopefully it will format. It won't. It'll keep going. So what we want to do

is we want to say what are we going to call this one? We're going to call this

one category. So this is how it's going to appear on our export. There we go.

Formatted a bit better. And now we want to have a type and the description. So

type is going to be a string because this is a string response. And the

description is essentially our prompt. So, I'm going to put in here, what

should we say? "What type of furniture item

is this?" Cool. So, now we've added this in. And you can have multiple custom

attributes. I'm going to stick with one for the moment. What we want to do is

when we actually yield this out, we need to yield out the custom attributes as

well. So, what I'm going to do is I'm just going to do that first. So, we'll

do instead of the product, we'll just do our

custom attributes, like so. Let's save that. So, now when we get the

data out, we're not going to have all that product information, but we will

have the custom attributes. I'm just going to see if this is working to what

I was expecting to see. Okay, we can see it start coming back. So, I'm actually

going to stop this. And we can see here this one has categorized itself as chest

of drawers. And, well, we can see the tokens available

there. So now I know that that's working. What I want to do is I want to

add this in. So instead of just the custom attributes or just the

response, what I'm going to do is I'm going to create a new dictionary, and I'm

just going to call this item. Let's make this a list, I

think, and I probably want to copy

this. Let the code editor format itself nicely. So the first thing we're going to have

is our product data and then the custom attributes. So, I'm going to save this

and we're going to come back over here, clear this up, and we're going to remove

our data files that I had that we don't actually need now. And we're going to

run this with scrapy crawl again. I'm going to give this one an output, and we'll

just call this all.jsonl because I want this as a JSON lines file. So, I'm going

to let this run. And what we should get now is this: not only have we got AI in the

anti-ban and block protection, we've got AI using the smaller ML model, which

is going to know the data points based on the type that we've given it. And

there's like lots of different types. We're using the product one and then

we're using the wider LLM just on this really specific bit because we

didn't get that attribute, that information from the ML model. So, we're

using the larger LLM to tweak it so we can get that category put into our data.

And we're going to let this just run now. And we'll have that information out

at the end. Great. So, let's have a look at our data. So, now we've got the

individual item with all the information that we had before. And then underneath

we have our value, our custom attribute that we created using the LLM. And it

gives us the information about the tokens as well. So, we can see this

one's categorized as bedside table. I hope that's

right. What does it say in the name? Yeah, three drawer bedside table. So,

you can really start to see how you can use it to really drill down and get more

information about the product that you know is on the page somewhere but isn't

coming through in the main description. And this is this is very different to

utilizing an LLM to just parse a massive chunk of HTML because we're using so

much less information here. It's already got most of what we want. So, it's much

more targeted and much more specific. And I think that's the real key here.
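
To make the custom attribute step concrete, here is a small sketch written as two plain functions so it can be run without a crawl. The function names, the output key names and the sample response are made up for illustration, and the shape of the customAttributes part of the response (values plus metadata) is my assumption based on what is shown on screen.

```python
def build_automap_meta():
    """Value for request.meta["zyte_api_automap"] in the spider."""
    return {
        # Fixed-schema extraction from the product ML model.
        "product": True,
        # Extra fields for the LLM; the description acts as the prompt.
        "customAttributes": {
            "category": {
                "type": "string",
                "description": "What type of furniture item is this?",
            },
        },
    }


def build_item(raw_api_response):
    """Combine the product data and the custom attributes into one item."""
    return {
        "product": raw_api_response.get("product"),
        "custom_attributes": raw_api_response.get("customAttributes"),
    }


# Made-up response, only to show how the two parts sit side by side.
sample = {
    "product": {"name": "Three drawer bedside table"},
    "customAttributes": {"values": {"category": "bedside table"}},
}
item = build_item(sample)
print(item["custom_attributes"]["values"]["category"])
```

In the spider this would be used as request.meta["zyte_api_automap"] = build_automap_meta() when creating the request, and yield build_item(response.raw_api_response) in the callback.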

Obviously, you can have multiple custom attributes as you see fit. I

think I'd stay away from getting it to do any kind of calculations or

anything like that, and simply use it to pull extra bits of information from the page

that you may not get already. Now, for me, this is a really specific tool and

it's going to have a lot of good use cases, but I really want to be careful

about not overdoing it and not trying to ask it to do too much stuff. I really

like to separate out my scrapers as much as possible. But being able to have

access to this LLM, this big model, this AI just to get help with that little bit

of data extraction, I think is really, really important. It lets you expand on

what the ML model can already do, giving you that flexibility and making sure

that you don't miss any data off the page. Now, of course, if you don't want

to do any coding whatsoever, you can come to Zyte and you can create a new

project and use the AI-powered spider. So, I'm just going to call this one

made. And we can create our project here. And this couldn't be simpler. See,

here are the ML models that we want to use. This is an e-commerce site. That's

kind of what I do. So we'll select that, and you can add your own templates to

this as well. So if you've got something specific, you can then create that

template to reuse, which is really cool. Or you can modify these

templates. They're all open source, so you can pull them out and modify them

as much as you need to. But I found this to be pretty straightforward and pretty

good. So what I'm going to do is I'm going to come back over here. Excuse the

mess. And we'll go and put in the URL. And I'll just do P1. And we'll call

this one furniture. Now, I'm going to do

automatic crawl strategy. Or what shall we do? Let's make this a bit bigger so

you guys can see. This gives you all the options. So, I'm just going to do

this will be kind of like the full thing. So, this isn't totally going to

be comparable, but we'll leave it on like 100 max requests for the moment

so we don't do too many. And here's our custom attributes. So, again I'm

going to do

category, and I'll say, what type of item... What did I say it was? Let's go

back to my mess of code. Let's put this on a different

screen. There we go. Now, we can actually go back to our "What type of

furniture item is this?" Okay.

And that should be fine. We'll use generative. That's fine. And I'm going

to do save and run. And what this is going to do is basically

go off and do exactly what we did within Scrapy, except all through the

UI here, utilizing Zyte's API. It's going to use

the same ML model that we're using. It's obviously just going to generate all the

spiders for us. So, we can really see here how you could save yourself so much

time just by creating them here. Or if you wanted to do them in Scrapy, which

is what I tend to do, you can do it that way because then you can manage all

your spiders however you do already just by dropping these things in. So, I'll

let this run and we'll have a look at some of the data at the end. So, it's

still running, but I can see the information as it goes

through. And we're going to just check the first product here. So, our custom

attributes come out on top. See, it's a wardrobe, and it's got all the same

information as we pulled from our Scrapy spider

there. So, this is a really, very easy way to scrape data. This utilizes

AI in all the right places in my opinion. It's not

super reliant on it. So, you know, you're not going to be sending loads of

HTML to the LLM unless you need to. You can create your own custom attributes.

So, it's a very efficient, quick way of doing things. I've had a bit more time

now to test and have a play around with the product and I think it's fantastic.

It's very, very powerful. But I kind of came away thinking, well, what's

this really going to cost me? But my conclusion was pretty straightforward.

We need to use proxies anyway, often residential ones, which have a higher

cost associated with them. So why not just swap out and use the Zyte API, which is

very similar in cost, and have it handle all of that for you? Then we have

access to the AI tools as and when we need them. And that's not to mention all

the time that you'll save with them not having to fix broken selectors and swap

out proxies when they stop working. So with all that on board, where do I

actually see AI fitting in the industry? So Dan's take could be boiled down to

one main idea really, and that was if you want to use AI to help you with your

scraping, you really just have to think about using the right tool for the

right job. AI could be anything from unblocking sites to using it to help you

parse the data. We really don't want to be sending a load of stuff to an LLM when

we really don't need to. So, this all totally makes sense to me. You know,

understanding the use case for it and where it fits in and then starting to

apply it to my own jobs and my own workflow. You kind of have to think

about who's going to be using this. And it's for people that want to really

scrape at scale. We have to understand that the needs of a company that's

scraping hundreds or thousands of websites are much different to those of one

person scraping on their own. So, if you just need a little bit of help with

bans, for example, instead of proxies, use the Zyte API and you're going to be

well on your way, and you can implement that wherever your workflow

dictates. But if you're doing loads and loads of data, you might want to go for

the whole package like I showed you here. You can just run these AI spiders

on your Zyte account and you'll be well away and you'll have no problems. That's

the real power and benefits of the whole thing. I think that it all links

together. You can go incrementally in stages and you can then work with what

you need to and have that fit your specific use case. So should you use AI

in web scraping? Yes, I think you should. But there is a massive caveat. It

really depends on what you're trying to achieve and what you actually want out

of it and how you use it. For a lot of people, you know,

there are some places you should use it 100%: maybe for the bans, through the API,

or maybe the ML model is going to make your life a lot easier. But, you know,

it's a very specialized tool, very specific. And I think that given the pain

points that we talk about in web scraping, with the hardest one actually

being getting the data, maybe you should really focus on that first and then

think about parsing it afterwards, once you've solved that issue. It's a very

powerful, very specific tool that I think requires quite a specific use case.

So, let me know what you think down below.