Don't fall for the AI buzzword in web scraping. It's not the solution you're
being told it is, but it does have a place in our scraping workflow. Let me
show you. We'll take this example spider I've written and gradually improve it
using AI. But first, we need to understand what our main pain points are when
we're scraping data. I've grouped these into three categories.
The first is being blocked or banned. There are a few ways around this, but
actually being blocked by the website is definitely one of the main things
that's going to get in our way when scraping data. The second is selecting
the data we need from the raw HTML, or whatever we get back. Selectors written
against it can become very brittle and are a real pain to manage and maintain
going forward. Any time the site changes, they break. And the third is setup
time. The spider that I've written for this example probably took me about 45
minutes, which is not that big a deal, but it was quite an easy site to scrape.
Even if they were all that easy, imagine having to do a hundred. That's an
awful lot of time. Add the time you need to maintain them, and you're looking
at a full-time developer job just to keep those going, let alone adding any
new ones. So, I reached out to Daniel at
Zyte. I've worked with them in the past to produce sponsored videos, and I know
that they're doing a lot in this space. He suggested I should take their AI
tools and share my honest opinion. So, this is a bit of a treat because I've
had full access to play with the latest tools, and I've had some time to really
consider where I think they fit in. You see, as LLMs get better and better and
more affordable, people are naturally going to look for ways to use them in
their own area, and web scraping is no different. But honestly, I think most
people are doing it wrong. Every time I see someone sending pages and pages of
HTML to an LLM to parse it, I just feel like this is totally inefficient and
not what we should be doing. So, let's
upgrade my spider. Here's my starting spider. As you can see, it's pretty
simple, but fairly typical of what you might find. We go through all the
product links on the pages over here in start_requests, and we yield out some
information. And this is how I got it: CSS selectors, finding the information
on the page. Pretty typical of what you might see.
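
For reference, a selector-based callback of the kind described here might look
like the sketch below. The selectors and the three field names are hypothetical
placeholders, not the ones used in the video. The callback is written as a
plain function so the sketch has no dependencies; in the real spider it would
be a method on the scrapy.Spider subclass.

```python
def parse_item(response):
    """Product-page callback: one hand-written CSS selector per field.

    Every selector here is a made-up placeholder. Each one is tied to the
    site's current markup, which is why this style breaks when a page changes.
    """
    yield {
        "name": response.css("h1.product-title::text").get(),
        "price": response.css("span.price::text").get(),
        "sku": response.css("span.sku::text").get(),
    }
```

With only three fields this is manageable, but every extra data point adds
another selector to write and maintain.
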
So, if we run this, scrapy crawl products, we're going to zip through all of
these. And you can see we're going to get some products back here. I think we
have 48, which is about right given that I'm only asking for a few pages. We
want to really improve on this. First, we want to avoid being blocked and
banned, because at the moment we're just going through my standard IP, which
will get blocked very quickly. Second, we want to avoid having to type out all
of these selectors to yield the data. These are going to be super brittle, and
I've only got three data points here. Imagine if we had 5, 10, 15, 20. We want
to use Zyte's ML model to do that. So what you need to do is pip install the
scrapy-zyte-api package. Then we can come over to our settings, and we want to
put these at the bottom somewhere. You need your Zyte API key, and you need to
enable the add-on, like this. That's going to do everything for you to get
that working.
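
As a rough sketch, the settings fragment described here could look like the
following. The add-on path and priority are as I understand them from the
scrapy-zyte-api documentation, and the key value is a placeholder. Scrapy
add-ons also need a reasonably recent Scrapy release (2.10 or later, as far as
I know).

```python
# settings.py (fragment, placed at the bottom of the file)

# Placeholder: replace with your own Zyte API key.
ZYTE_API_KEY = "YOUR_ZYTE_API_KEY"

# Enabling the add-on lets the package configure itself, so that requests
# made by the spider are sent through Zyte API.
ADDONS = {
    "scrapy_zyte_api.Addon": 500,
}
```
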
Now, every request I make... actually, I'm just going to change this and get
rid of my user agent. Every request that I make now is going to go through the
Zyte API. So, we're going to have all of that ban protection. It's going to
rotate all the IPs for us, so we don't have to worry about proxies or anything
like that. It's all handled now. So let's modify this to use the ML model.
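
The edits narrated next end up roughly as in the sketch below. It assumes the
scrapy-zyte-api conventions as I understand them: a zyte_api_automap key in
request.meta and a raw_api_response attribute on the response. The link
selector is a hypothetical placeholder, and the callbacks are written as plain
functions so the sketch runs without Scrapy installed; in the spider they are
methods, so the callback is self.parse_item.

```python
def parse_item(response):
    """Product-page callback: no selectors, just the API's extracted product."""
    # The extracted product is one part of the JSON that Zyte API returns.
    yield response.raw_api_response["product"]


def parse(response):
    """Listing-page callback: follow product links and request extraction."""
    # "a.product-link" is a made-up selector for the product links.
    for request in response.follow_all(css="a.product-link", callback=parse_item):
        # Ask Zyte API to run its product model on the page behind this request.
        request.meta["zyte_api_automap"] = {"product": True}
        yield request
```

Note that the product links themselves are still found with an ordinary
selector here; only the product pages go through the extraction model.
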
So I'm going to get rid of the getall because we're going to take a slightly
different approach to this. I don't think we're going to need this. Do we need
it or not? I can't remember. We'll find out. So let's put that in there.
We want to get rid of this because we want to use follow_all. So I'm going to
do for request in response.follow_all, which is a shortcut for creating
requests within Scrapy. You could do it the other way if you wanted to, but we
might as well use these shortcuts while they're there. And we still want our
callback to be self.parse_item, like so. Then, on this, you want to add a meta
key. So I'm going to do request.meta, and the key is going to be
zyte_api_automap, like so. The value for this is a dictionary where product is
set to true. So what we're saying is that for every request we make, we add
this meta. It tells the API, hey, we want to use the product ML model. We
don't want to have to worry about any of this stuff here. We just want you to
give us back the extracted data that matches that schema. Then we yield out
our request, like so. Now we can get rid of all of this. We don't need it. We
can simply yield out response.raw_api_response, and we want to ask for the
product, like so. This is just part of the response that comes back from the
API. So that's it. We've really
shortened our parsing here. We're not even using it for the product links in
this case, but we'll use this as an example. So, I'm going to save this, and
I'm going to change the name of my spider because I created a duplicate. We'll
just call this one Z products, like so. And I'm going to run scrapy crawl on
my new spider with an output file. We'll call this one all.jsonl, JSON Lines.
Or rather, we'll call this one extract. So now when I run this, we're going
through the API, which gives us all of that ban protection and all of the
rotating IPs that we need. It's going to do it asynchronously. You can see it
all coming through here. It's going to handle any retries and everything like
that, so we're pretty much going to get that data back. Then we're also using
the ML model to match the data on that page to the schema. Now, this ML model
is specifically trained on, I should imagine, hundreds of thousands of product
pages, so it knows all the common data points and how to extract them. You can
see it's doing it right here. Once this is finished, we'll have a quick look
at the data and then we'll move on to using the LLM to really fine-tune it.
Okay, so let's open up our extract all. And we can see that
we've got all this information, all these data points: name, price, currency,
currency raw, availability, all of the images, all these additional
properties, everything. So there's loads of information here, most of what you
could ask for. As you can see, we achieved quite a lot here with just that
supervised ML model, which lets us hand all of our HTML parsing over to this
part of the AI that's highly trained and highly specific to that designated
schema. So we don't really have to think about our selectors breaking or
anything like that, which solves one of the main problems for us. And just by
adding in the Zyte API, we use all of their technology, their AI behind the
scenes, to get past any bans, so we don't have to worry about that either. If
you think about it, you're going to need to use proxies anyway, because you
can't send a load of requests to any website from the same IP without being
blocked. So using the Zyte API in this way, which fits really neatly into
Scrapy with their own package, just makes a lot of sense, gives us that extra
edge, and avoids that one pain point of getting banned. But the problem with
this ML model is we can only return a fixed schema. And that's where the
custom attributes come in. What these do is allow us to hand off a little bit
of information, along with a prompt that we write, to the LLM. That LLM then
looks at what we've got and tries to extract that data for us. So what if
there were extra data points in here that you didn't get, for whatever reason?
I'm looking at this data right here, and this is a table, but for some reason
there is no category in here. We can see additional properties, which is
great: color, size, description, HTML description, full URL, all this
information. And here's another one: drawers, storage bed. There's no
category. So what we're going to do is add that in as an LLM touchpoint. So,
let's close this and
let's come back to our code. This is very simple to do. We're already adding
to request.meta here, so we just need to add another key. This one is the
custom attributes. So here we want custom attributes, and this in itself is a
dictionary as well. So I save this in my code editor. Hopefully it will
format. It won't. I'll keep going. So what are we going to call this one?
We're going to call this one category. This is how it's going to appear in our
export. There we go.
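
With the type and description from the next few sentences filled in, the
request meta ends up looking something like the sketch below. The
customAttributes key name follows the Zyte API request field as I understand
it, and product_request_meta is a hypothetical helper introduced only for this
illustration.

```python
# Schema for the extra field. The key ("category") is the name the value is
# exported under; the description doubles as the prompt for the LLM.
CUSTOM_ATTRIBUTES = {
    "category": {
        "type": "string",
        "description": "What type of furniture item is this?",
    },
}


def product_request_meta(custom_attributes=None):
    """Build request.meta for product extraction, optionally with custom attributes."""
    params = {"product": True}
    if custom_attributes:
        params["customAttributes"] = custom_attributes
    return {"zyte_api_automap": params}


meta = product_request_meta(CUSTOM_ATTRIBUTES)
```

In the spider, this would be applied to each followed request, for example
with request.meta.update(meta).
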
Formatted a bit better. And now we want to have a type and a description. The
type is going to be string, because this is a string response. And the
description is essentially our prompt. So I'm going to put in here, what
should we say? What type of furniture item is this? Cool. So now we've added
this in. You can have multiple custom attributes, but I'm going to stick with
one for the moment. When we yield the data out, we need to yield out the
custom attributes as well. So I'm just going to do that first: instead of the
product, we'll just yield our custom attributes, like so. Let's save that. So
now when we get the data out, we're not going to have all that product
information, but we will have the custom attributes. I'm just going to see if
this is working the way I was expecting. Okay, we can see it starting to come
back, so I'm going to stop this. And we can see here this one has been
categorized as chest of drawers. And we can see the token information there as
well. So now I know that that's working. What I want to do is I want to
add this in. So instead of just the custom attributes or just the product, I'm
going to create a new dictionary, and I'm just going to call this item. I'll
probably want to copy this in, and the code editor will format it nicely. The
first thing we're going to have is our product data, and then the custom
attributes. So, I'm going to save this and we're going to come back over here,
clear this up, and remove the data files from before that we don't need now.
And we're going to run this with scrapy crawl again, with an output file.
We'll just call this all.jsonl because I want this as a JSON Lines file. So,
I'm going to let this run. What we've got now is AI in the anti-ban and block
protection, and AI in the smaller ML model, which knows the data points based
on the type that we've given it. There are lots of different types; we're
using the product one. And then we're using the wider LLM just on this really
specific bit, because we didn't get that attribute, that information, from the
ML model. So we're using the larger LLM to tweak it so we can get that
category into our data. We're going to let this run now, and we'll have that
information out at the end. Great. So, let's have a look at our data. So, now
we've got the
individual item with all the information that we had before. And underneath we
have our value, the custom attribute that we created using the LLM, and it
gives us the information about the tokens as well. We can see this one's
categorized as bedside table. I hope that's right. What does it say in the
name? Yeah, three drawer bedside table. So you can really start to see how you
can use it to drill down and get more information about the product that you
know is on the page somewhere but isn't coming through in the main
description. And this is very different to using an LLM to just parse a
massive chunk of HTML, because we're sending so much less information here.
We've already got most of what we want, so it's much more targeted and much
more specific. And I think that's the real key here.
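
The combined item described above can be sketched as follows. build_item is a
hypothetical helper, and the sample response is hand-written to mirror the
shape I understand Zyte API to return (the product alongside a
customAttributes object holding the values); the token usage metadata is left
out of the sample.

```python
def build_item(raw_api_response):
    """Combine the model-extracted product with the LLM-extracted attributes."""
    return {
        "product": raw_api_response["product"],
        # .get() keeps this working for requests made without custom attributes.
        "custom_attributes": raw_api_response.get("customAttributes"),
    }


# Hand-written stand-in for response.raw_api_response (not real API output).
sample = {
    "product": {"name": "3 Drawer Bedside Table"},
    "customAttributes": {"values": {"category": "bedside table"}},
}

item = build_item(sample)
print(item["custom_attributes"]["values"]["category"])
```

In the spider's item callback this would be used as
yield build_item(response.raw_api_response).
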
Obviously, you can have multiple custom attributes as you see fit. I think I'd
stay away from getting it to do any kind of calculations or anything like
that, and simply use it to pull extra bits of information from the page that
you may not get already. Now, for me, this is a really specific tool, and it's
going to have a lot of good use cases, but I want to be careful about not
overdoing it and not asking it to do too much. I really like to separate out
my scrapers as much as possible. But being able to have access to this LLM,
this big model, just to get help with that little bit of data extraction is, I
think, really important. It lets you expand on what the ML model can already
do, giving you that flexibility and making sure that you don't miss any data
on the page. Now, of course, if you don't want
to do any coding whatsoever, you can come to Zyte, create a new project, and
use the AI-powered spiders. So, I'm just going to call this one Made, and we
can create our project here. This couldn't be simpler. Here are the ML models
that we can use. This is an e-commerce site, which is kind of what I do, so
we'll select that. You can add your own templates to this as well, so if
you've got something specific, you can create that template and reuse it,
which is really cool. Or you can modify these templates. They're all open
source, so you can pull them out and modify them as much as you need to. But I
found this to be pretty straightforward and pretty good. So what I'm going to
do is come back over here, excuse the mess, and put in the URL. I'll just do
P1, and we'll call this one furniture. Now, I'm going to do the automatic
crawl strategy. Or what shall we do? Let's make this a bit bigger so you can
see. This gives you all the options. This will be kind of like the full thing,
so it isn't going to be totally comparable, but we'll leave it on 100 max
requests for the moment so we don't do too many. And here are our custom
attributes. Again, I'm going to add category, and I'll say, what type of
item... What did I say it was? Let's go back to my mess of code. Let's put
this on a different screen. There we go. Now we can go back to our prompt:
What type of furniture item is this? Okay.
And that should be fine. We'll use generative. That's fine. And I'm going to
do save and run. What this is going to do is basically go off and do exactly
what we did within Scrapy, except all through the UI here, using Zyte's API.
It's going to use the same ML model that we're using. It's just going to
generate all the spiders for us. So we can really see here how you could save
yourself so much time just by creating them here. Or if you wanted to do them
in Scrapy, which is what I tend to do, you can do it that way, because then
you can manage all your spiders however you do already, just by dropping these
things in. So, I'll let this run and we'll have a look at some of the data at
the end. It's still running, but I can see the information as it comes
through. We're going to just check the first product here. Our custom
attributes come out on top. See, it's a wardrobe, and it's got all the same
information as we pulled from our Scrapy spider there. So, this is a really
easy way to scrape data. This uses AI in all the right places, in my opinion.
It's not super reliant on it, so you're not going to be sending loads of HTML
to the LLM unless you need to. You can create your own custom attributes.
So, it's a very efficient, quick way of doing things. I've had a bit more time
now to test and play around with the product, and I think it's fantastic. It's
very, very powerful. But I came away thinking, well, what's this really going
to cost me? My conclusion was pretty straightforward.
We need to use proxies anyway, often residential ones, which have a higher
cost associated with them. So why not just swap them out and use the Zyte API,
which is very similar in cost, and have it handle all of that for you? Then we
have access to the AI tools as and when we need them. And that's not to
mention all the time that you'll save by not having to fix broken selectors
and swap out proxies when they stop working. So with all that on board, where
do I actually see AI fitting in the industry? Dan's take could be boiled down
to one main idea, really: if you want to use AI to help you with your
scraping, you have to think about using the right tool for the right job. AI
could be anything from unblocking sites to helping you parse the data. We
really don't want to be sending a load of stuff to an LLM when we don't need
to. So, this all totally makes sense to me: understanding the use case for it
and where it fits in, and then starting to apply it to my own jobs and my own
workflow. You kind of have to think
about who's going to be using this. It's for people who want to really scrape
at scale. We have to understand that the needs of a company that's scraping
hundreds or thousands of websites are very different from those of one person
scraping on their own. So, if you just need a little bit of help with bans,
for example, then instead of proxies, use the Zyte API and you're going to be
well on your way, and you can implement that wherever your workflow dictates.
But if you're handling loads and loads of data, you might want to go for the
whole package like I showed you here. You can just run these AI spiders on
your Zyte account and you'll be up and running right away with no problems.
That's the real power and benefit of the whole thing, I think: it all links
together. You can go incrementally, in stages, work with what you need, and
have that fit your specific use case. So should you use AI
in web scraping? Yes, I think you should. But there is a massive caveat. It
really depends on what you're trying to achieve, what you actually want out of
it, and how you use it. For a lot of people, there are some places you should
use it 100%: maybe for the bans, through the API, or maybe the ML model is
going to make your life a lot easier. But it's a very specialized, very
specific tool. And given the pain points that we talked about in web scraping,
with the hardest one actually being getting the data, maybe you should focus
on that first and then think about parsing it afterwards, once you've solved
that issue. It's a very powerful, very specific tool that I think requires
quite a specific use case. So, let me know what you think down below.