A friend came up to me and said, "Hey, I want to start a coffee business, but I don't know what to price my coffee beans at. Can you help me?" I said, of course I can — we can do some market research and scrape a few sites; it's dead easy. In this video I'm going to show you how easy it is to scrape generic e-commerce sites to pull price information you can use for your own analysis.

All I did here was Google "coffee beans" and pick a few shops — one, two, three, four, five, six. For the first one, we're going to look straight away at the URL. We're checking for two different things. First, is this a Shopify store? If it is, you're likely to see something like /collections/ and /products/ in the URL. If so, just type .json at the end of the product URL. Now, you can see it sort of loaded but then redirected back on itself — that's interesting. So we've got two choices: try to go a bit further back and find the JSON, or do View Page Source, turn on line wrap, make it a bit bigger so we can all see, and then search for one keyword: "offers". What we're looking for is the schema — inside this script tag with type="application/ld+json" we have the @context, the schema.org Product, and the offers. Here we go: it tells you all the prices for the SKUs, and so on. So we can log this, and we can visit this page as many times as we want: come back to it, check the SKU, check the current price, log it, and make a note.

Let's go to the next site. This one isn't a Shopify store — there's no /collections/ or /products/ — but let's do View Page Source, line wrap, and search our keyword, "offers". I missed it; it was the very first match. Here we go. This is exactly the same JSON, just formatted differently — all on one line rather than neatly split out. Right here it has the offers: the type, the price, the currency, and the SKU. So again, we can come back to this page as many times as we need, every day, and check the price of this product.

Scaling up your
code is one of the biggest problems we face when scraping data, and I always recommend that the first step is to use a high-quality residential proxy. With ProxyScrape, the sponsor of today's video, we have access to high-quality, secure, fast, and ethically sourced proxies that are perfect for our use case. There are 10 million plus proxies in the pool, all with unlimited concurrent sessions, from countries all over the globe, enabling us to scrape quickly and efficiently. I exclusively use residential proxies, as these are the best option for beating any anti-bot protection on the sites we're scraping, and with the auto-rotation this is the simplest but most effective way to avoid our projects being blocked and to get access to the data we need. There's only one line of code to add to your project, and then we let ProxyScrape handle the rest. Any traffic you purchase is yours to use whenever you need, as it doesn't ever expire. If you just want throughput, though, datacenter proxies — with unlimited bandwidth, 99% uptime, no rate limits, reputable countries, and IP authentication — are a very easy-to-use and attractive option for the right use case. So if this all sounds good to you, go ahead and check out ProxyScrape at the link in the description below.

Now let's get on with our project. Let's go to another one and try the
.json trick on this one. OK, that worked even better: we can just go to the product, put .json afterwards, and here's all the product information — literally everything. This is the whole Shopify payload; it gives you everything the retailer has put in — the SKU, the price, all of the variations. These should each have a SKU — yep, there we go — so you can come back to this product as many times as you need with that unique identifier.

Here's another one. This also looks like a Shopify store, but again, if we add .json it redirects us back. So we have a choice — let's try /products.json. OK, the main endpoint worked: this gives us every product on the website instead of just the one we were looking at, so we can loop through it. You can also put query parameters after it — I think it's limit, and I think 250 is the highest (that needs to be an equals sign). You can see it's taking longer now, and if I shrink this down: 148 products. I've now got every product on this website, with all the product information we could ever want.
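The /products.json trick with its limit parameter lends itself to a small helper. Here's a minimal sketch using only the standard library — the store URL is a made-up placeholder, and the 250 cap and the page parameter reflect how Shopify's endpoint generally behaves:

```python
import json
from urllib.request import Request, urlopen

def products_json_url(store: str, limit: int = 250, page: int = 1) -> str:
    # Shopify caps `limit` at 250 per request; step `page` to walk the catalogue.
    return f"{store.rstrip('/')}/products.json?limit={limit}&page={page}"

def fetch_all_products(store: str) -> list[dict]:
    """Page through /products.json until a page comes back empty."""
    products: list[dict] = []
    page = 1
    while True:
        req = Request(products_json_url(store, page=page),
                      headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(req) as resp:
            batch = json.load(resp)["products"]
        if not batch:
            return products
        products.extend(batch)
        page += 1

print(products_json_url("https://example-coffee-shop.com"))
# → https://example-coffee-shop.com/products.json?limit=250&page=1
```

The URL builder is pure, so it's easy to test; only fetch_all_products touches the network.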
There's another one — here we go again, another /products/ URL, so this would be a Shopify store and we could do the .json. I'll also show you here that if we search for "offers" again — here we go, here's the data we wanted, and it's in exactly the same format as every other site. This is a schema: a generic way of sending this information from their backend to the front end. That's really important, because it means we can create one single model, pull this schema out of any of these sites, stick it straight in, and it will work just fine.

I've got one more to show you. Again, there's /products/ in the URL, so this is probably a Shopify store — it is, there we go — so you can do it that way. Or, if I go to View Page Source and search for "offers"... it's not in this one. Let's try "schema". So this one isn't going to have that information — oh, there it is. This one's different: it has the question schema, you can see it right here, FAQPage. That's interesting. We don't need it, though, because this is a Shopify store, so we can put .json at the end and it will work.

So now that you know how to do this, what I'm going to do is go through and code out an example that pulls each product's information from all of these stores. We'll do it one by one, get all that information, and then I'll show you how to create a model to save it all — to a neat CSV file or something similar, maybe a database — so you get the idea of what we're trying to achieve.

There's one important caveat: quite a lot of the really big e-commerce stores do use this schema method, but it will be slightly different and probably a bit more difficult to scrape, because those massive companies have a lot more bot protection — again, that's where the proxies we talked about earlier come in. What we're trying to do here is pull from lots of smaller, more authentic stores — not your Amazon, but things like this Rave Coffee (I mean, I don't know if I want to drink this, but maybe you want to try it). So keep in mind what you're trying to achieve.

Let's go ahead and write some code, scrape all this data, and save it out, so we have a nice, neat piece of code we can run every day to get these prices. Let's start off
with the simplest version we can. I've created a project folder, started my virtual environment, and installed a few things, so you'll need to do the same: curl_cffi, a really good wrapper around requests that makes our requests to the server a bit easier because we can impersonate a browser; selectolax to parse things; Rich to help with printing; and I chose orjson to parse the JSON in this particular project — you could of course use the standard json library if you wanted to.

Inside main.py we import what we need to start with — let me make this a little bigger — just these things for the moment. As I said, curl_cffi is a wrapper around requests, so it imports as requests and you use it as standard; you need Rich as well. I've set the URL to one of the products we just looked at. It's a Shopify one, but we're not going to use the Shopify products.json — we're going to use the schema, because that's the consistent thing across sites and what I'll use for this project.

I create my request with response = requests.get(), and you'll notice that with curl_cffi we have this impersonate argument, which sets some extra parameters to make us look a bit more like a Chrome browser — possibly not necessary here, but good practice. Then we hand response.text to the HTML parser to create our html variable. From here we just need to find all of those script tags we looked at that hold the JSON data — they're usually type="application/ld+json" — and this is the CSS selector to get all of them: selectolax's .css() returns a list of everything that matches. We loop through that list, because this tag can hold all sorts of information, and we only want the one that has the word "offers" in script.text — that matches the product schema, and it's a good key to use. From there we load that data into our data variable. Again, this is where I decided to try orjson; you can absolutely use the standard json package instead — there's no benefit to orjson here, I just wanted to try it out. Then I just print data.get("offers"), so I can see the offers section.

So I save this, come over to the terminal, run python main.py, and it returns that small section of data holding the offer. If I instead print the whole thing and run it again, we get the entire schema back; in scroll mode you can see the offers section I printed earlier and the rest of the information that goes with it. So this is the easiest way to do it — 13 or 14 lines of code, depending on what you're doing. But we want to expand on this and make it much more robust: take the data from multiple URLs and store it in a database, with timestamps against the scraped parts of the offer data, so we can start to collect those prices — and that's what we're going to do next.
Now that we've seen a basic example, let's expand it into something we can actually use, by adding a database and getting to the point where we could run it every day, or whatever, and keep adding the prices in.

We'll create a new project in this folder. I'm using uv for my project management at the moment — it's a pretty cool tool by the same people who make Ruff. I'll do uv venv — actually, I have a shortcut that creates the virtual environment and activates it for me. Then uv pip install: we'll need quite a few things. I'm going to use httpx this time, which I prefer; SQLAlchemy for the database; rich again; and a package called extruct, which I'll show you when we get there — it's a nice, neat way of doing those extractions. And if you spell "sqlalchemy" correctly, it installs fine.

OK, cool. Let's think about the structure of the project. We'll have a few different files, which I'll create now: a main.py file to run everything; a database.py file, which holds the main database connection we can use; models.py; and — I think that's probably it for now — our urls.csv, so let's put that in as well. I'll open this in my code editor. Actually, the first thing I'll do is git init — we're going to use git — and a .gitignore that ignores any .venv and __pycache__ folders. Cool. Now when I open up Helix we see just our working files. So let's start with
the database file. The idea is to create somewhere we can initialize our database and hold all that good stuff. From sqlalchemy we import create_engine, and from sqlalchemy.orm we import sessionmaker — the session is what we'll use to actually act on the database. We also import our models: from models we import Base, Product, and Offer. I haven't created these yet; we will in a minute. Then our DATABASE_URL: we're using an SQLite3 database, so it's sqlite:/// — one, two, three slashes — followed by the filename, products.db. Does that look right? I think so.

Now engine = create_engine() with that DATABASE_URL. I'll put echo=True on for the moment so we can see everything happening. Then SessionLocal = sessionmaker() with autocommit=False and autoflush=False — I'm not sure those are strictly necessary, I've just got into the habit — and the one that definitely matters, bind=engine. So now we have a session we can use to act on the database. Next, a function called init_db: when we call it, it runs Base.metadata.create_all(bind=engine). So once we've created our models, we call this function from main.py and it creates the tables — and the database itself, if it doesn't exist — and gives us a session we can actually query with. Cool; save that.

On to models.py. From sqlalchemy.orm we import what we need: the declarative base class, Mapped, and mapped_column, plus relationship — and we'll import a few other things as we go. Our first class is our Base class, which inherits from DeclarativeBase; we can just pass on this for now. Then we can create the models we want. The first is our Product model, which inherits from the Base class we just created. What's cool here is we can set __tablename__ to something else — we'll call it "products", which makes a bit more sense.

We need to decide what information we want from the data we'll pull out. Looking back at the schemas, there's all sorts of information, but what I'd suggest is: an id we'll use internally, plus the name, description, and SKU. We'll also have a foreign key — one-to-many — so one product can have many offers, and that's how we'll compare prices as we go: every time we scrape more data, we add more offers against the same product, so we can see any changes over time.

Our first column is id, a Mapped[int] equal to mapped_column(primary_key=True) — this is our internal ID. Then name: Mapped[str], a mapped_column with the String type, which we need to import from sqlalchemy — we'll do that now. Every product needs a name, so nullable=False on this one. Next url, which is also going to be useful to hold: Mapped[str] again — I think there's a URL type you could maybe use, but I don't think it matters because we're using an SQLite database — mapped_column(String), and I'll put nullable=True for the moment; maybe some of them don't have a URL, so we'll leave it like that. Description — should I leave it out? No, let's put it in: Mapped[str], mapped_column(String), nullable=True — it can be null if it needs to be. (What's wrong here? Oh, I missed an equals sign. Right.)

Then sku: Mapped[str] = mapped_column(String) — a string, because SKUs aren't always just digits; they're usually letters and numbers, sometimes with dashes. This one is important: unique=True, because we only want one product per SKU — otherwise we'd end up with loads of duplicate entries in the database — and nullable=False, because this has to exist. (And I need to change that to lowercase.) Then brand, which is usually inside that schema and probably useful for us: mapped_column(String), nullable=True — if there's no brand, it's not that relevant to us.

Now, this next one is always a good field to have: created_at. We can fill it in automatically. I need the actual Python datetime — let's import it ourselves, since the editor doesn't want to do it for me — so we can always see when things were created, which will be really useful. We use mapped_column with the DateTime type from SQLAlchemy and server_default=func.now() — func also comes from SQLAlchemy — and nullable=False on this one as well. This automatically stamps created_at whenever we add a record, so we know exactly when each row went into our database.

Then offers: a Mapped list (capital M) relating to the Offer table, which we're just about to create, via relationship(back_populates="product") — this brings the related information back to us. That should be it for Product.

Now the Offer model — it inherits from Base too, and __tablename__ = "offers". First, again, our own id — in fact it can be the same as before, so the same line goes here. Then price, which is a Mapped Decimal — I normally use a decimal type for prices; SQLAlchemy has a DECIMAL type, so let's try that — mapped_column(DECIMAL) (don't know why it's in caps), and nullable=False: if there's no price, the record's no use to us. availability is a good one to store as well, so we can see how availability changes over time: mapped_column(String), nullable=True — if it's not there, that's fine, we don't need to know. And created_at again, same as before, so we can see when each record was added.

Now the mapping back to the products table — our relationship. We want product_id: Mapped[int] = mapped_column() with the foreign key in it: ForeignKey("products.id"). This refers to the table name — because the table name is "products" — plus the id column, and again nullable=False, because this can't be null. Then product: Mapped["Product"] = relationship(back_populates="offers"). With the newer style of SQLAlchemy you can have the tables in any order in your code, because we just map the two together and it knows: it's mapping a list on one side, so each product can have a list of offers attached — the "many" side — and each offer maps back to one specific product.

That was a lot of work, but this is the core of your project, because you need somewhere to put your data — your data has to go into your database, and you need to map it out. Once this is done — and this is the hard bit — it's pretty much plain sailing from there. You just have to figure out which bits of data you want and create your database for it.
Now I'm going to head over to the main file, which is currently empty. From db we import the SessionLocal we created and the init_db function; from models we import Product and Offer, both of which we'll need. Then import httpx, which is what we'll use to make the requests; from rich import print; and from rich.logging import RichHandler, because we're going to use Rich logging — it's really cool and gives you brilliant-looking logging straight away for no real cost. And then we just need to import logging as well. We'll need some other imports as we go, but that's our start.

What I'll do now is set up the logging so it's done and out of the way — so that whenever we get to running our code, it's ready. All it does is give the messages a format: basicConfig with level=DEBUG (which is everything), this format string with the date, and RichHandler as the handler. I've got into the habit of sticking this into pretty much all of my code, and it makes a world of difference to being able to see and clearly understand what's happening in your project when you run it — which is also crucial when you run things on the cloud or similar; you need to be able to see what's going on.

Right — the first thing I want to do is import our URLs, and we'll scaffold out the project and build it up from there. The first function gets the URLs, because we want them in our code so we can loop through them. So: with open("urls.csv", "r") as f — we'll need the csv module, and since the editor doesn't want to auto-import it, let's just import csv ourselves — then reader = csv.reader(f), remembering to pass in the file object. Then urls uses a list comprehension — url[0] for url in reader, taking the first index — and we return urls, a nice, neat list.
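The get_urls function just described, sketched out in full (the filename matches the urls.csv we created earlier):

```python
import csv

def get_urls(path: str = "urls.csv") -> list:
    with open(path, "r", newline="") as f:
        reader = csv.reader(f)
        # Each row comes back as a list of fields; index [0] for the first
        # (only) column so we return a flat list rather than a list of lists.
        return [url[0] for url in reader]
```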
When you import from a CSV like this, you need to index the first element, otherwise you get a list of lists — or a list of tuples, I can't remember which. Let's fill out urls.csv: I'll just grab the URLs over from my other window and bring them across — these are the same URLs we were working with before.

Back in our main file, let's create a main function and the if __name__ == "__main__": guard, which just says "if we run this file, run this". Normally I'd call main() in there, but I'll comment that out for now and just print(get_urls()). Now I need to run it, so I'll create a new tmux session — however you choose to get to your terminal. Let's run this... oh, I need to run main.py... and I've got some errors: mapped_column, nullable — I spelled "nullable" wrong. OK, fixed. There we go. Fantastic — that's our URLs, a nice, neat list.

Back in the main file, let's build out the next function (and get rid of that print; we don't need it). I'm thinking about getting the data now, so: get_html — and I'll show you something cool in here as well. It takes url, a string, and we say response = httpx.get(url). I'll quickly put some headers in: we need a User-Agent header, so I'll paste my user agent in. I would normally do this with a client, but we're only making one request from here, so there's no real need — we can just pass headers=headers.

Now, if response.status_code != 200: we don't want our program to crash or end if we get, say, a 400 — maybe the URL disappeared and we get a 404 — and if you used raise_for_status() it would just end your program. So I'll do logging.info() with something like "url responded with bad status code" plus response.status_code. That just means it tells us when something goes wrong. Then in the else branch we want to extract the data.

This is a really cool package that I didn't know existed, but really should have: from extruct.jsonld we import the JsonLdExtractor. If you watched me earlier talk about the JSON data inside those script tags, this gives us a really easy way to extract it without doing any of it ourselves. We create an instance of JsonLdExtractor and just give it response.text — we don't have to give it a parser or anything, it's all built in — and we return the extracted data. That data is a list of all the application/ld+json blocks, with all the information neatly formatted as a list of dictionaries, so we don't have to do any of it ourselves. Yes, it's another dependency, but it's going to save us time in the long run, I think, and so far it seems to be working really well.
Now let's try it out. Back in main: urls = get_urls(), and then for url in that list, let's just print get_html(url) for now. Slinging back over to the terminal and running it, you can see all of the debug logging and the data coming back. Scrolling up, here's the information from one of the shops under its URL — that schema, that lovely formatted data. You can see it also picks up the other JSON-LD blocks, so we'll need to filter those out, but that's really easy. And because I left the logging on DEBUG, you can see all of the request and response detail that httpx is handling for us.

Here's more of the data — and this one's interesting: it has the product and the offers, but it also has all sorts of different products within the same parent product. So what we'll want to do — it's just flipped around this way for some reason — is match the SKU against the offer, because those are all different products that we're not necessarily interested in. Did I see one fail? No, that seems fine... more of the data we wanted... I'll trim the logging down once we're happy... did we get a good response from everyone? I thought I saw a failure — nope, all fine. Perfect.

We'll turn the debug logging off once we've got a bit further, but now that we know this is working and we're getting the data back, we can start thinking about using it — and to use it, we basically want to save it into our database. Let's give it a variable: data is probably fine, or let's call it product_data — and we need the equals sign there. Now, this gives us a list, so we want for data in product_data:, and this is where we add the if "offers" in data: check, because we only want a dictionary back from the list if "offers" is in it. If we now print data here — and if I go back up to the top and change the logging level down to INFO, which should be fine — and run again, we should only get that information back. Back to scroll mode, please, tmux... there we go: now we're only printing the schema that has the offers in it, and skipping all the extra blocks we don't need. (This one might be None — OK, that's fine, I don't care.) So now that we know that's good, we can move on to how we're going to store this information in our database.
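That filtering step — keeping only the JSON-LD blocks that actually contain offers — is small enough to pull into a helper:

```python
def offers_only(blocks: list) -> list:
    # A page's JSON-LD usually holds several schemas (FAQPage, breadcrumbs,
    # the Product itself...); keep only the dicts carrying an "offers" key.
    return [data for data in blocks if "offers" in data]
```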
So the first thing we want to do is load the product, because we need to have the product loaded before we can load any offers against it. I'm going to call this one load_product, and it's going to take in a session and some data. Let's give ourselves a little bit of room here to work with. I'm going to say that our new product, let's call it new_product (it technically might not be a new product), is going to be equal to an instance of our Product class, which is our database model and our database table. So we say name is going to be equal to data["name"], and we just need to fill the rest in here. For the URL, I think I made that nullable in the models; let's go back up... url, nullable=True, okay. What that means is, let's come back to our main file: when you use the square brackets and the key like this on a dictionary (this data is going to be a dictionary), it will fail if the key doesn't exist. But if you do data.get("url") instead, then if "url" doesn't exist this just becomes None, in which case it simply won't go into our database. Then we can do description; I think I did the same there, so we'll do data.get("description"), and it doesn't matter, we'll leave it like this. Then sku is equal to data["sku"], and brand is equal to data["brand"]["name"]. I mixed these up; I really should have used the same approach consistently throughout. We'll see if it comes back to bite us, and maybe it needs to be a bit better than that, but that'll do for now. So now we want to add them to
our database. So basically you do session.add (session being our database session) and put in our new_product. But the problem is that if this tries to add a product that already exists, i.e. because our SKU, which we set as unique, has to be unique, it'll fail, we'll get an error, and our program will crash.
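As a quick aside on that square-brackets versus .get() point, here is a tiny illustration; the dictionary is hypothetical:

```python
data = {"name": "Espresso Roast", "sku": "ESP-01"}  # note: no "url" key

# Square brackets raise KeyError when the key is missing...
try:
    data["url"]
except KeyError:
    print("KeyError raised")

# ...whereas .get() quietly returns None, which is fine for a nullable column.
print(data.get("url"))  # -> None
```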
Now, the error that we get is an IntegrityError, so what we want to do is something like try here, and we'll do session.add with our new_product. If that's successful... it's not enjoying my indenting, does this work? Yeah, there we go. Then if that works we can do session.commit. Now, I'm just going to do a commit for each one; you could do bulk commits, which is probably a better idea. And I'll say that our new product... I don't know if I'm going to need this, so we'll just do session.refresh and give it the new_product as well. This just refreshes our session, which should give us the new product data back, and then we can just return that product. I think that's how that works. Cool, so that's if it works. Now if it doesn't, we need to handle the exception. When I started typing IntegrityError it wasn't importing it itself, so I'm just going to go back up to the top, and I think it's from sqlalchemy.exc that we import IntegrityError. There we go, that's the one we want. Let's go back down here. So our exception is going to be the IntegrityError as error. Now we have a couple of choices; it doesn't like my try/except indenting... We'll print the error out. In fact, no, we don't want to print the error, we want logging.warning, and we need to make this an f-string for our logging, and we'll just log the error. That'll be fine, I'm not too worried about that. But the most important thing we want to do is a session.rollback, because the session has already tried to do this add, and if it fails we need to roll the session back so we still have access to it. Then I'm just going to return new_product back out, so it's there; I don't think we need it, but just in case. So this is essentially what we're going to do: we take the product data, we push it into our model, and we try to add and commit. If that fails with an IntegrityError, we say hey, this has failed with this error, and we roll back. That's essentially what we have to do, and we're going to need to do the same for the offers part as well. So let's look at trying to get some products into our database. At the moment we're just printing out the data here, so we don't need the print; we're going to do load_product instead, and let's have this as a variable. What do we want to call this? My brain is not functioning... let's just call this one p, that will be fine. Then we can give it the session, which should have come from our... I've missed that out, that's fine, that's cool. So up here we need to initialize our database; I imported it but I never ran it, of course. Now we can create our session, which is going to be equal to the SessionLocal that we imported at the top. Perfect. So now we have our database initialized when we run this, and we have our product here. This could throw an error now, because I never ran this before, so if I did anything wrong with the database we might get some issues.
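Before running it, here is roughly the add/commit/rollback-on-IntegrityError pattern end to end. Since the SQLAlchemy models aren't reproduced in this section, this sketch uses the stdlib sqlite3 module instead, and the table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Illustrative table: a unique SKU column, like the Product model in the video.
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, sku TEXT UNIQUE, name TEXT)")

def load_product(conn, data):
    try:
        conn.execute(
            "INSERT INTO products (sku, name) VALUES (?, ?)",
            (data["sku"], data.get("name")),  # .get() -> None if missing, fine for a nullable column
        )
        conn.commit()
    except sqlite3.IntegrityError as err:
        # Duplicate SKU: warn and roll back so the connection stays usable.
        print(f"warning: {err}")
        conn.rollback()

load_product(conn, {"sku": "ESP-01", "name": "Espresso Roast"})
load_product(conn, {"sku": "ESP-01", "name": "Espresso Roast"})  # duplicate, warns
print(conn.execute("SELECT COUNT(*) FROM products").fetchone()[0])  # -> 1
```

The second call hits the UNIQUE constraint, logs the warning, and rolls back, which is exactly the behaviour we want instead of a crash.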
Let's run it. Cool, I didn't see any warnings or anything like that, and it looks like we put the data in, so let's check. You can use a few different ways to check what's in your database; there's DB Browser for SQLite, which is a graphical user interface and pretty good. I'm just going to use sqlite3 in my terminal, which gives us this, and we can do .open products.db, then .tables, and it shows our two tables. Then we can just run SQL queries from here. So I'm going to do SELECT * FROM products, just to check that we've got something in there. We do. This is why I wasn't sure about the descriptions; I think we probably want to redo this and strip the whitespace from them. But let's just select... what did I call it, sku? Let's just do id, sku, name. Cool, so we now have four products, and I think I put in five URLs: one, two, three, four, five. Okay, so one is failing, fine. So now we have four products that we can see in our database, and we can just get out of this here. I didn't mean to close that tmux window, so let's just get that back.
There we go, cool. Right, so now we know that we're loading the products and it is working fine, that's good. I'm going to get rid of this print statement; I don't need it at the moment. So now that we're loading the products, we need to load the offers that are associated with these products. In fact, let's run this again so we can see that we should hopefully be getting that warning... there we go, there's our first warning: IntegrityError, UNIQUE constraint failed, which is exactly what we're expecting, because it's saying hey, this already exists in our database. So now we can think about loading in the offers.
So we're going to build ourselves a load_offers function, like so. Let's call it load_offers, and again it's going to take in the session and also the same set of data. We need to think about this a little bit more, because as I showed you in some of the JSON schemas earlier, in some of them the offers was in a list and in some of them it was not. We'll cover that in just a second, because the first thing we need to do is find the product that is associated with the offer. By the time we get here we will either have just loaded the product, or we've skipped over it because the product already exists. So I'm going to say our product is going to be equal to session.query on our Product model, then we can do .filter against Product.sku being equal to the SKU from the data, and then we want to do .first so it returns us the first match. This is going to get us that product, because obviously the SKU is unique, and the SKU is going to come from the data that we just scraped.
So we're going to do that here. What we want to do now is check what sort of data the offers is: in some of them it was a list, in others it was just a single dictionary. There's a few different ways you can do this; the way that I solved it was with isinstance, so we can say data["offers"] (this is the key inside the dictionary that we've scraped) and check if it's a list. What I'm going to do is duplicate a little bit of stuff here, which isn't ideal, but it does work, and given a bit more time I'd probably sort this out. I'm going to do for offer in data["offers"], and we can now go ahead and create the new offer. But what I want to do here is what I talked about when I showed you before: the offers for that one parent product had lots of different SKUs, and we were focusing on one. We don't want to have one product SKU and then loads of offers that have different product SKUs in them. So I'm just going to say if offer["sku"] is equal to our product.sku, which is the product we pulled from the database here. So now we can create our new offer, which is going to be equal to an instance of our Offer model: price is going to be equal to the offer price, and availability is equal to offer.get("availability"); I'm using .get here because this was not available in every one, I think, so this will again handle that for us. And then the product_id that this is associated with (again, this is another database table) is going to be equal to product.id from the product that we pulled. This part here is that field in the database; if I go quickly back to the models, this is the field that we're mapping to the foreign key, this one, and we're saying it's going to be equal to the product id that we pulled from our database here. So we're mapping everything across nice and neatly. Now what we can do is we can have this
again. And I know I'm duplicating some code... put this in here. Oh, I didn't copy it. Oh, I did actually, or did I? I did. So we can put this in, indent it, put this here, and this needs to be new_offer. I don't think we need this bit, because I was planning on doing something with the data that comes back once it's been added in, but I'm not; we'll just leave it like that anyway. So this is going to handle it if this is an instance of a list; now we just need to handle it if it's not an instance of a list, which is where we're going to do a little bit more code duplication. Maybe you'd want to split this out into two separate things, or maybe we'd split this out into a loading function that we could call. But basically what I'm going to do now is just this, and I know, I know what you're going to say, but we're just going to do that. We'll say else because... no, I need to zoom out a bit. So if it's an instance of the list we're going to do this, else we're going to do this, and the only thing that needs to change is that some of this is not the same. Let's just fix my indenting. Because this is now inside the data directly, we do data["offers"]["price"], and availability is going to be data["offers"].get, like so. This is going to be the same, this should be the same, and we can just get rid of this; I don't need this, don't need that, we don't need this, I'm going to get rid of it, it's clogging up my stuff. There we
go. Cool. Right, so again, I know that I've duplicated a lot of stuff here, and that was only because one of them was coming back as a list and I needed a way to handle it; I think tidying this up would probably be a good idea. However, this should work, and we can now call load_offers with our data. Let's come back here, and we'll probably have something that we need to fix, because I just typed all of that out... "availability is an invalid keyword", okay, so I've done something wrong. I misspelled availability, no problem, let's run again. And we're getting stuff added in... nope, okay, so this is where I needed to do the .get, because it doesn't exist for one of the items. There we go, and now when we run it, it's just going to put None in there instead. There we go, cool.
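One way to tidy up the duplication mentioned above is the split-out loading idea: normalize offers to a list first, then loop once over both shapes. This is a sketch using the stdlib sqlite3 module rather than the video's SQLAlchemy models, with an illustrative table layout:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, sku TEXT UNIQUE);
CREATE TABLE offers (id INTEGER PRIMARY KEY, price TEXT,
                     availability TEXT,
                     product_id INTEGER REFERENCES products(id));
INSERT INTO products (sku) VALUES ('ESP-01');
""")

def load_offers(conn, data):
    # schema.org "offers" can be a single dict or a list of dicts,
    # so normalize to a list and handle both shapes with one loop.
    offers = data["offers"]
    if not isinstance(offers, list):
        offers = [offers]
    # Look up the parent product by its unique SKU.
    product_id = conn.execute(
        "SELECT id FROM products WHERE sku = ?", (data["sku"],)
    ).fetchone()[0]
    for offer in offers:
        # Skip child-SKU offers that belong to a different variant.
        if offer.get("sku", data["sku"]) != data["sku"]:
            continue
        conn.execute(
            "INSERT INTO offers (price, availability, product_id) VALUES (?, ?, ?)",
            (offer["price"], offer.get("availability"), product_id),
        )
    conn.commit()

load_offers(conn, {"sku": "ESP-01", "offers": {"price": "9.99"}})
load_offers(conn, {"sku": "ESP-01",
                   "offers": [{"sku": "ESP-01", "price": "9.99",
                               "availability": "InStock"},
                              {"sku": "ESP-02", "price": "12.50"}]})
print(conn.execute("SELECT COUNT(*) FROM offers").fetchone()[0])  # -> 2
```

The second call shows both behaviours at once: the ESP-01 offer is stored, the ESP-02 variant is skipped.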
So let's check out our SQLite again, and do .open products.db, and then select all from our offers table: SELECT * FROM offers WHERE product.id = 1... oh, am I doing the wrong thing? It's this key here, product_id. There we go, cool. So now I have three entries for this product that are the same price; the first one was when I had the issue with the availability, which is why it isn't in there. So what I'm going to do is exit out of this, remove my products.db, clear, and run main again (oh, it catches me out), and we're going to start adding some data in. So this should be one lot of products, and we should now be adding in the same offer over and over again, because obviously you'd run it at different times to maybe get some different prices, but I'm going to put some data into the database so it's available for us. Let's give it another run. Sweet, so let's go sqlite3, .open our products.db database, and SELECT * FROM offers WHERE product_id = 2. Cool, so there we have it. Now we've got these entries, and the date and time that they were entered. Let's close out this and come back
to our code. That's pretty much it, really. This is kind of a bare-bones skeleton for a project: we take the URLs, we take what we looked at earlier where we could find the data in that schema as JSON-LD, and we use the JSON-LD extractor tool to do the handling for us. This one function here is the scraping part, that's it; the rest of it is parsing it and adding it to the database, with our models etc. So that's just kind of the way it is, and then we just run it all down here, which is quite neat and tidy.
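The overall shape of that "run it all" part looks roughly like this. Everything here is a hypothetical stand-in: scrape_product, load_product, and load_offers are stubbed out so the control flow is visible on its own, and the URL is made up:

```python
# Hypothetical skeleton of the script's main loop; the stubs record calls
# instead of touching a real page or database.
loaded = []

def scrape_product(url):
    # Stand-in for the JSON-LD extraction step; always "finds" one product.
    return {"sku": "ESP-01", "offers": {"price": "9.99"}}

def load_product(session, data):
    loaded.append(("product", data["sku"]))

def load_offers(session, data):
    loaded.append(("offers", data["sku"]))

def run(urls, session=None):
    for url in urls:
        data = scrape_product(url)
        if data is None or "offers" not in data:
            continue  # page had no usable schema block
        load_product(session, data)
        load_offers(session, data)

run(["https://example.com/products/espresso-roast"])
print(loaded)  # -> [('product', 'ESP-01'), ('offers', 'ESP-01')]
```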
It's basically a hundred-odd lines of code. To improve this, if I was going to be running it a lot, I'd be tempted to put in some more validation in here and maybe handle this load_offers a bit better, but this will work fine. We are pulling in specific structured schema data, so this isn't going to change, which is good compared with trying to parse a load of HTML. But maybe we would want to put in something like Pydantic to validate it first, so we can avoid any extra errors, although that's generally more useful when you're taking in user-submitted data. We didn't need print here; I imported that by accident, there we go. So what I'll do is put this code on my GitHub for you to have a look at, and maybe you can decide what bits you like, learn from some pieces, take something away from it. This has been quite a long video, more of a full project, so hopefully you've enjoyed it and got something out of it. If you still want some more coding after this, which of course you do, make sure you subscribe to my channel and check out this video here, which is more on web scraping data rather than building something like this to save it.