okay look at this I got a basic spider here I didn't change anything this is
straight from the box I did a thousand requests which is what I actually asked
for so it stopped on that rather than a limit 756 items no errors full logs and
it only took half an hour I mean that's just mindblowing so coming up this
morning I have a really interesting call it's with a guy called Ian from Zyte and
he is the chief product officer over there and they have a new Scrapy product
that we're going to talk about and the thing that makes this particularly
interesting is that it's AI backed definitely when AI came out originally
and sort of hit the forefront with ChatGPT I couldn't really see how we could
use that as web scrapers and data extractors because the problems that we
had couldn't really be solved by that sort of a tool my main problem I always
thought about that was the fact that the LLMs are heavy on compute and heavy on
compute cost so it's going to be interesting to see how they get around
that this video is sponsored by Zyte they gave me access to the AI tool and
asked me for my genuine feedback as well as to share it with you guys what I want to
know is the motivation behind creating this tool our motivation
is that we think that web scraping is too slow there's really both the challenge of
getting set up for new sites and what does that look like when there's
multiple sites it's setup for each and every one and there's the challenge of
maintenance and when we speak to engineering teams they often say we're
under pressure we've got to add 100 sites we are maintaining the ones that
we've already got adding 100 is difficult and so our motivation is how
can we make this quicker really interesting because rather
than trying to make it easier to scrape it's trying to get you to a base where
you can start the rest of your project quicker so what we've looked to do is
produce a solution which has automation where the automation does the routine
for the most common types of website like product for the most common data
points in a schema we've got about 20 for the most common crawling strategies
like whole site or category or search keywords we can get you to data
immediately but then to give people control that's good to know that it's
still going to be open source or at least the spider part is and you'll be
able to actually take that and build upon it and then have your own sort of
spider that you can put into their system host and then run that kind of
makes sense too do you think that part of it creating all that automation
to take that big chunk of work out of the start makes this different to other
generic AI I'm going to use that term loosely web scraping solutions where
people are like just asking ChatGPT as an example because this is a
more controlled use case we've been able to develop models that are specific to
this and so we have a machine learning model for product type websites we have
another for articles for jobs and because they're much more focused
they're more accurate for that task but they're also a lot cheaper to be honest
the spider itself is relatively simple so that's doing the crawling
part or the controlling of the crawling and so that's then calling Zyte
API which is solving the bans which is allowing people to retrieve the HTML in
the format they want do they need it rendered or not do they need to do
browser actions like scripting it's the machine learning of the parsing of
turning the HTML into JSON but it's controlled by a relatively
simple Scrapy spider and typically what we find is that if people
want to make a modification like add a filter to a crawl it's a couple of lines
of code but it's also quite a nice way for people to get started because they
can start something where they can see code and they can see what it does yeah
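
As a rough illustration of the "couple of lines of code" filter mentioned here, a minimal sketch in plain Python follows. This is not Zyte's spider template code: should_follow, the path prefixes, the blocked keyword, and the example.com URLs are all hypothetical. In a real Scrapy spider a check like this would sit wherever discovered links are chosen for following.

```python
from urllib.parse import urlparse


def should_follow(url: str,
                  allowed_path_prefixes: tuple[str, ...],
                  blocked_keywords: tuple[str, ...] = ()) -> bool:
    """Return True if a discovered link passes the (hypothetical) crawl filter."""
    path = urlparse(url).path.lower()
    # Reject any link whose path contains a blocked keyword.
    if any(keyword in path for keyword in blocked_keywords):
        return False
    # str.startswith accepts a tuple and matches any of the prefixes.
    return path.startswith(allowed_path_prefixes)


# Hypothetical links a crawl might discover on a shop site.
links = [
    "https://example.com/products/shoes/123",
    "https://example.com/blog/2024/news",
    "https://example.com/products/clearance/456",
]

# Keep product pages only, and skip the clearance section.
kept = [u for u in links if should_follow(u, ("/products/",), ("clearance",))]
print(kept)
```

The filter is kept as a pure function so it can be tested without running a crawl, which fits the idea of starting from working code and changing one small piece.
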
and so modifying that is very straightforward the only scenario we find where
this isn't a great solution is the very very largest sites where the cost of
setup and maintenance is dwarfed by the volume of requests in those scenarios
it's worth it to create a spider because running static code will be the
cheapest solution in general for anyone who is looking to set up sites
where they're not yet in the billions of requests per month it's merely up to
tens of millions we're finding it's a great solution right so I need to know
is this going to change the landscape forever or is this going to make me
redundant so what I'm going to do I'm going to take 2 hours I'm going to build
two spiders and I'm going to compare myself to the AI
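
Ian's point a little earlier, that a hand-written static spider only wins on cost at very large request volumes, can be sketched as a break-even calculation. Every figure below is invented purely to make the arithmetic easy to follow and is not Zyte's pricing; monthly_cost and break_even_requests are hypothetical helper names.

```python
def monthly_cost(requests: int, price_per_1k: float, fixed_monthly: float = 0.0) -> float:
    """Total monthly cost: fixed upkeep plus a per-request charge."""
    return fixed_monthly + (requests / 1000) * price_per_1k


def break_even_requests(fixed_monthly: float, static_per_1k: float, ai_per_1k: float) -> float:
    """Monthly request volume above which the hand-written spider is cheaper."""
    if ai_per_1k <= static_per_1k:
        raise ValueError("no break-even: the AI route is never more expensive per request")
    return fixed_monthly / (ai_per_1k - static_per_1k) * 1000


# Hypothetical figures (assumptions, not real prices).
FIXED_MAINTENANCE = 3000.0  # monthly engineering cost of keeping a hand-written spider alive
STATIC_PER_1K = 0.25        # per-1,000-request cost when running static parsing code
AI_PER_1K = 1.0             # per-1,000-request cost when ML extraction is included

threshold = break_even_requests(FIXED_MAINTENANCE, STATIC_PER_1K, AI_PER_1K)
print(f"break-even at {threshold:,.0f} requests per month")

for volume in (1_000_000, 10_000_000):
    ai = monthly_cost(volume, AI_PER_1K)
    static = monthly_cost(volume, STATIC_PER_1K, FIXED_MAINTENANCE)
    cheaper = "AI spider" if ai < static else "hand-written spider"
    print(f"{volume:>12,} requests/month: {cheaper} is cheaper")
```

With these made-up numbers the fixed maintenance cost dominates at low volume and the per-request difference dominates at high volume, which is the shape of the trade-off described in the interview.
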
[Music]
okay so here's the results for the two spiders here's mine I got
1634 items and it took 1250 seconds which is about 20 minutes I got a
decent amount of data points so on the left side here is the overview for the
AI spider job now this one took about an hour I thought this was interesting
because this was much slower than the other sites that I did in my test we did
get all of the data points though that I was expecting to get in fact more than I
pulled because I didn't want to parse all of that information I would
definitely check with Zyte as to why this one took an hour to run slightly
slower we had no issues it did it all there and given the fact that mine took
an hour to write we're definitely in the positive here so we have a quick look at
the items you can see here's all the data points and all the items of which
you can download and export in these formats if you want to and there are
full logs and stats as well so the stats are going to be similar to the ones here
as you can see request counts etc etc everything that you can want to see from
your spider comes back to you here and here's the results from the other spider
I got about the same amount of items 1547 and I think I got 1566 mine
took about 37 minutes to run though and this one was 26 minutes and that's
because the AI was deciding when to use browser rendering and when not whereas I
was forcing browser rendering all the time which caused mine to be a bit
slower plus it took me about an hour to write as well so you know this is much
much quicker again let's have a quick look at the stats over here just so you
can see all the information here that you want to know
about all your general spider stats the logs as well and the items here very similar
to the item data that I pulled out you can see I have it here with the SKU
slightly different format but it's all here available you can see all of
the information and again you can download as you need here all in all
very very positive experience with this one given how easy it was to create
these all I had to do was put in the URL and configure two or three settings and
set it to run the fact that it managed to get all of these data points out in
such a short amount of time with very very little setup default out of the box
is very impressive now there's a lot to be said about getting to that point so
you can actually build upon it if you need to but in these cases I wouldn't
have to so the time saved would be huge especially considering this was just
over two use cases if I had to do that tenfold the amount of time I would save
by using the AI spider is astronomical so what are my thoughts
about this and has it changed how I feel about AI in this space well I think the
tool kind of speaks for itself it's pretty incredible the fact that you can
go from nothing to data with just one URL that quickly is really impressive
the fact that it handles all the bans through the Zyte API as well it's
just an extension of that I think it makes it a really really great package
I've used it quite a lot now over the last week or so since I recorded the
rest of this video and I've been very impressed I think that it's important to
know that tools like this are supposed to be supplementary to your work not
supposed to take over what you do whilst it does take out quite a lot of the
knowledge it does mean for me I think that I can provide a better service to
my clients by being able to give them the data they need in much less time and
I think I'm just going to honestly adapt to it and work with it and build upon it
myself and then use it to enhance the services that I can deliver how do I
feel about AI in this space still mixed one thing that I think is important is
that this is a much smaller model this isn't the large language models that we
would associate with AI this is a very different thing it's much more along
the machine learning type of style here where it's been fed data specific
data it's been trained on lots of different specific sites to pull out
information and I think that is definitely the right place to be in when
we're talking about this sort of thing AI/ML in web scraping using it in
conjunction with other tools to pull out the data that you need to save yourself
a load of time parsing out a load of data if you want to try this tool for
yourself there'll be a link in the description and there'll be a code as
well for you to use to get you going and hopefully you enjoy it and
use it and like it as much as I did so yeah once again thanks to Zyte for this
opportunity and thank you all very much for watching join the Discord like
comment subscribe it all makes a lot of difference to me thank you very much
I'll see you in the next one