How to Scrape Data for Market Research (full project)
54:48 | 11.0K | 401 | 2024-08-25
Check Out ProxyScrape here: https://proxyscrape.com/?ref=jhnwr ➡ WEB https://johnwr.com ➡ COMMUNITY https://discord.gg/C4J2uckpbR https://www.patreon.com/johnwatsonrooney ➡ PROXIES https://proxyscrape.com/?ref=jhnwr ➡ HOSTING (Digital Ocean) https://m.do.co/c/c7c90f161ff6 If you are new, welcome. I'm John, a self taught Python developer working in the web and data space. I specialize in data extraction and automation. If you like programming and web content as much as I do, you can subscri...
Subtitles

A friend came up to me and said, "Hey, I want to start a coffee business, but I don't know what to price my coffee beans at. Can you help me?" I said of course I can: we can do some market research and scrape a few sites, it's dead easy. In this video I'm going to show you how easy it is to scrape generic e-commerce sites to pull price information that you can use for your own analysis.

All I did here was Google "coffee beans" and pick a few shops; we have one, two, three, four, five, six. For the first one, we're going to look straight away at the URL. We're looking for two different things. First, is this a Shopify store? If it is, you're likely to see something like /collections/ and /products/ in the URL. If so, just type .json at the end of the product URL. Now, you can see it sort of loaded but then redirected back on itself, which is interesting. So we've got two choices: you can try to go a bit further back and find the JSON, or you can do View Page Source, turn on line wrap (and make this a bit bigger so we can all see), and then search for one keyword: "offers".

What we're looking for is within the schema, which is this. I'll zoom back out a little, go back up to this one, and look right here: inside this script tag, application/ld+json, we have the @context, the schema.org Product, and the offers. Here we go, it tells you all the prices for the SKUs and so on. So we can log this, and we can visit this page as many times as we want, come back to it, check the SKU, check the current price, log it, make a note.
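For reference, here is a rough illustration of the shape of that schema.org Product JSON-LD; the field names follow schema.org, but every value below is made up.

    import json

    # Hypothetical example of the application/ld+json product block
    example = json.loads("""
    {
      "@context": "https://schema.org/",
      "@type": "Product",
      "name": "Example Espresso Blend 250g",
      "sku": "EXAMPLE-SKU-250",
      "brand": {"@type": "Brand", "name": "Example Roasters"},
      "description": "A made-up coffee for illustration only.",
      "offers": {
        "@type": "Offer",
        "price": "8.95",
        "priceCurrency": "GBP",
        "availability": "https://schema.org/InStock",
        "url": "https://example-shop.com/products/espresso-blend"
      }
    }
    """)

    print(example["sku"], example["offers"]["price"])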

Again, let's go to the next site. This one isn't a Shopify store; there's no /collections/ or /products/ in the URL. But let's go back to View Page Source, turn on line wrap, and search our keyword, "offers". I missed it, it was the very first one, here we go. Now, this is exactly the same JSON, it's just formatted differently; it's all on one line as opposed to being neatly split out. Right here it's going to have the offers: the type, the price, the currency and the SKU. So again, we can come back to this page as many times as we need to, every day, and check the price of this product.

Scaling up your code is one of the biggest problems we face when scraping data, and I always recommend that the first step is to use a high-quality residential proxy. With ProxyScrape, the sponsor of today's video, we have access to high-quality, secure, fast and ethically sourced proxies that are perfect for our use case. There are 10 million plus proxies in the pool, all with unlimited concurrent sessions, from countries all over the globe, enabling us to scrape quickly and efficiently. I exclusively use residential proxies, as these are the best option for beating any anti-bot protection on the sites we're scraping, and with the auto-rotation this is the simplest but most effective way to avoid our projects being blocked and to get access to the data we need. There's only one line of code to add to your project, and then we let ProxyScrape handle the rest; any traffic you purchase is yours to use whenever you need, as it doesn't ever expire. Another option, if you just want throughput, is datacenter proxies: unlimited bandwidth, 99% uptime, no rate limits, reputable countries, and IP authentication make them a very easy-to-use and attractive option for the right use case. So if this all sounds good to you, go ahead and check out ProxyScrape at the link in the description below.
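As a rough illustration of that "one line of code", here is how a requests-style call might route through a rotating proxy; the endpoint, port and credentials below are placeholders, not real ProxyScrape values.

    from curl_cffi import requests

    # Placeholder proxy endpoint and credentials -- substitute whatever your
    # provider gives you; rotation happens on the provider's side.
    proxies = {
        "http": "http://USERNAME:PASSWORD@proxy.example.com:8080",
        "https": "http://USERNAME:PASSWORD@proxy.example.com:8080",
    }

    resp = requests.get(
        "https://example-shop.com/products/espresso-blend",
        impersonate="chrome",
        proxies=proxies,
    )
    print(resp.status_code)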

Now, on with our project. Let's go to another one and try the .json trick on this one. OK, that worked even better: we can just go to the product, put .json afterwards, and here's all the product information, literally everything. This is just the whole Shopify payload; it gives you everything the retailer has put in, the SKU, the price, including all of the variations. Again, these should have a SKU; yep, there we go, so you can come back to this product as many times as you need to with that unique identifier. Here's another one. This one looks like a Shopify store as well, but again, if we add .json it redirects us back. So we have a choice; let's try /products.json. OK, the main one worked, so this gives us every product on this website instead of just the one we were looking at, and we can loop through it. What you can also do with this one is put query parameters after it: I think it's limit, and I think 250 is the most (that needs to be an equals sign). You can see it takes even longer, and if I shrink this down: 148 products, I've got every product on this website with all the product information we could ever want.
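As a sketch of that trick (the domain here is a placeholder), pulling the whole catalogue looks roughly like this; the "products"/"variants" field names follow Shopify's public products.json format.

    import httpx

    # limit=250 is the most Shopify will return per request
    resp = httpx.get("https://example-shop.com/products.json", params={"limit": 250})
    products = resp.json()["products"]

    print(len(products))
    for product in products:
        for variant in product.get("variants", []):
            print(variant.get("sku"), variant.get("price"), product.get("title"))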

So, there's another one. Here we go again, another /products/ URL, so this would be a Shopify store and we could do the .json trick. I'll also show you here that if we search for "offers" again: here we go, here's the data that we wanted, and it's in exactly the same format for every site. This is a schema, a generic way of sending this information from their backend data to the front end. That's really important, because it means we can create one single model, pull this schema out of any of these sites, stick it straight into that model, and it's going to work just fine. I've got one more to show you. Again, there's /products/ in the URL, so this is probably a Shopify store, and it is, there we go. You can do it that way, or, if I go to View Page Source and search "offers"... it doesn't have it in this one. Let's try "schema". OK, so this one isn't going to have that piece of information. Oh, there is one, but this one's different: it has the question schema, you can see it right here, FAQPage. That's interesting, but we don't need it, because this was a Shopify store, so we can do .json at the end and it will work.

Now that you know how to do this, I'm going to go through and code out an example that pulls the product information from each of these stores. We'll do it one by one, get all that information, then I'll show you how to create a model to save it all, and we'll save it to a neat CSV file or something similar, maybe a database, so you get the idea of what we're trying to achieve. There's one thing that's really important: quite a lot of the really big e-commerce stores do use this schema method, but it's going to be slightly different, and it's probably going to be a bit more difficult to scrape, because those massive companies have a lot more bot protection; again, we'd need those proxies we talked about earlier. What we're trying to do here is pull from lots of different smaller stores, the more authentic, independent kind rather than your Amazon: this Rave coffee, for example; I don't particularly want to drink it, but maybe you'd want to try it, that sort of thing. So keep in mind what you're trying to do and what you're trying to achieve.

with the simplest version that we can so I have created a project folder and

started my virtual environment and I've installed a few things so you're going

to need to do uh kl- cffi this is a really good wrapper around requests that

makes our uh requesting to the server a bit easier because we can impersonate a

browser I'm going to use select Ox to pass some things Rich to help me with

Printing and I chose to use or Json to pass the Json in this um particular

project you could of course use the standard Json Library if you wanted to

Inside my main.py file we're going to import what we need to start with, just these things for the moment, and as I said, curl_cffi is a wrapper around requests, so it imports as requests and you use it as standard; we need rich as well. I've set the URL to one of the products we just looked at. This is a Shopify one, but we're not actually going to go to the Shopify .json endpoint; we're going to use the schema, because that's the consistent thing I'm going to use for this project. I create my request with response = requests.get(), and you'll notice with curl_cffi we have this impersonate argument, which sets some extra parameters to make us look a bit more like a Chrome browser; not necessarily needed for this, but good practice. Then we take the HTML parser, give it response.text, and create our html variable. From here we just need to find all of those script tags we just looked at that hold the JSON data; they are usually application/ld+json, and this is the CSS selector to get all of them. With selectolax, .css() gets everything that matches, so it returns a list. We then want to loop through that list, because this tag can hold all sorts of different bits of information; we only want the one that has the word "offers" in the script's text, which matches the product schema and is a good key to use. From there we take that data and load it into our data variable. Again, this is where I decided I wanted to try orjson; you can absolutely use the standard json package in Python to do this, there's no benefit to orjson here, I just wanted to try it out. Then I'm just going to print data.get("offers"), just so I can see the offers section.

So I'm going to save this, come over to where we run the code, and do python3 main.py, and this returns that small section of data that holds the offer. If I come back and print the whole thing instead and run it, we get all of that schema back. If I go to scroll mode you can see here's the offers section that I printed out earlier, and here's the rest of the information that goes with it. So this is the easiest way to do it; it's 13 or 14 lines of code, depending on what you're trying to do.
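Roughly, that short first version looks like this; a sketch rather than the exact code from the video, with a placeholder product URL.

    import orjson
    from curl_cffi import requests
    from rich import print
    from selectolax.parser import HTMLParser

    url = "https://example-shop.com/products/espresso-blend"  # placeholder

    # impersonate="chrome" makes the request look like a Chrome browser
    response = requests.get(url, impersonate="chrome")
    html = HTMLParser(response.text)

    # every JSON-LD block on the page
    for script in html.css('script[type="application/ld+json"]'):
        # only the product schema contains the "offers" key
        if "offers" in script.text():
            data = orjson.loads(script.text())
            print(data.get("offers"))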

But we want to expand on this and make it much better and more robust: we want to take the data from multiple URLs and store it in a database, with timestamps against the scraped parts of the offer data, so we can start to collect those prices, and that's what we're going to do next. Now that we've seen a basic example, let's expand on it and make it something we can actually use by adding in a database and getting it to a point where we could run it every day, or whenever, and add the prices in.

We're going to create a new project here in this folder. I'm using uv for my project management at the moment; it's a pretty cool tool by the same people that make Ruff. So I'm going to do uv venv, and I have a little shortcut that creates the virtual environment and activates it for me. Then I'm going to do uv pip install; we need quite a few things here. I'm going to use httpx this time, which I prefer; SQLAlchemy for the database; we also need rich; and we're going to use a package called extruct, which I'll show you when we get there, it's just a nice, neat way of doing that extraction step. And if you spell SQLAlchemy correctly, it all installs fine.

OK, cool. Let's think about the structure of our project. We're going to have a few different files, and I'll create them now: a main.py file to run everything, a database.py file which is going to hold the main database connection that we can use, models.py, and, I think that's probably it for now, we'll also have our urls.csv, so let's put that in as well. I'm going to open this in my code editor; actually, the first thing I'm going to do is git init, because we're going to use git, and add a .gitignore that ignores any .venv and any __pycache__ folders.

Now when I open up Helix we get just our working files, so let's start with the database one. The idea is that this gives us somewhere to initialize our database and hold all that good stuff. From sqlalchemy we import create_engine, and from sqlalchemy.orm we import sessionmaker; the session is what we'll use to actually act on the database. We're also going to import the models: from models we'll import the Base model, a Product model and an Offer model; I haven't created these yet, but we will in a minute. I'll say our database URL is going to be SQLite, we're using an SQLite 3 database, so it's sqlite, colon, three slashes, then the file in this folder, products.db; does that look right? I think so. Now we can say our engine is equal to create_engine using our database URL, and I'll put echo=True on for the moment so we can see everything happening. Then our SessionLocal is equal to sessionmaker; what do we need... autocommit=False, autoflush=False, I'm not sure those are necessary, I've just got in the habit of adding them, but this one is: bind=engine. So we now have a session we can act on to do things on our database. Let's create a new function, init_db: when we call it, it runs Base.metadata.create_all with bind=engine. So we're basically just setting things up: once we create our models, we call this function from our main.py file, and it will create the tables for us when we need them, create the database file if it doesn't exist, and give us a session that we can actually work with to query it.
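Put together, the database file described above might look something like this sketch (assuming the Base/Product/Offer models live in models.py):

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    # importing the models registers the tables on Base before create_all runs
    from models import Base, Product, Offer  # noqa: F401

    DATABASE_URL = "sqlite:///products.db"

    # echo=True prints every SQL statement while we're developing
    engine = create_engine(DATABASE_URL, echo=True)

    SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)


    def init_db() -> None:
        # creates the tables (and the SQLite file) if they don't exist yet
        Base.metadata.create_all(bind=engine)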

Cool, so let's save this and move on to our models. In our models.py file, let's start with the imports from sqlalchemy.orm: we need the declarative base class, Mapped and mapped_column, and we're also going to need relationship; we'll import a few other things as we go. Our first class is the Base class, which inherits from DeclarativeBase, and we can just do pass in the body for now. Now we can create the models we actually want. The first one is our Product model, which inherits from the Base class we've just created. What's cool is that we can set __tablename__ to something else, so we can call the table "products", for example, which makes a bit more sense. Next we need to decide what information we want from the data we're going to pull out. Looking back at the schemas there's all sorts of information, but what I'm going to suggest is that we have an id, which we'll use internally, the name, description and SKU, and we'll also have a foreign key, a one-to-many relationship: one product can have many offers, and that's how we're going to compare prices as we go. Every time we scrape more data we add more offers against the same product, so we can see any changes over time. The first column is our id, a Mapped[int] equal to a mapped_column with primary_key=True; this is going to be our internal ID.

Then we'll have name, a Mapped[str] mapped_column; I'm going to say this is a String, which we need to import, we'll do that in just a second, there we go, imported. We're always going to need a name, so we'll say nullable=False on this one. Next we can have a url, which is also going to be useful to hold, another mapped String; I think there's a URL type you could maybe use, but I don't think it matters because we're using an SQLite database. This one can be nullable, so I'll put nullable=True for the moment; I'm not sure if that's relevant for us, but maybe some of them won't have a URL, so we'll leave it like that. Description: I might leave the description out... no, let's put it in, so description is again a Mapped[str] mapped_column, nullable, it can be null if it needs to be (what's wrong here? oh, I missed an equals sign, right). SKU: a Mapped[str] equal to a mapped_column, and this is a String because SKUs aren't integers, they're usually letters and numbers, sometimes dashes. This one is important: we want unique=True, because we only want one product per SKU, otherwise we'll end up with loads of entries in the database that we don't want, and nullable=False because it has to be something (and I need to change this to a lowercase u). Then I'll have brand, which is usually inside that schema and will probably be useful as well, again a mapped_column String, nullable; so you get the idea, I'm just going through and creating all of the columns. Brand can be nullable; if there's no brand it's not that relevant to us.

Now, a created_at field is always a good one to have, and we can populate it automatically. I need datetime; let me see, no, I need the actual Python datetime, OK, let's import it ourselves, seeing as the editor doesn't want to do it for me. This lets us always see when things were created, which is going to be really useful, and it happens automatically if we do this: a mapped_column with DateTime, which comes from SQLAlchemy, and server_default equal to func.now(), where func is also from SQLAlchemy; there we go, func.now. We'll have nullable=False on this one as well. So this automatically adds our created_at date whenever we add a record to the database, so we know this row went in at this time, which is really useful to know. Then we'll have offers, which is a Mapped (capital M) list relating to the Offer table, which we're about to create; it's a relationship with back_populates="product", so it brings the information back to us here. That should be it for the Product model.

Now we need to create the Offer table. It inherits from Base, and the __tablename__ can be "offers"; there we go, cool. The first thing, again, is our own id; in fact it can be the same as the Product one, so I'll put the same thing here. Then we'll have price: I normally use decimal for prices; do we have a decimal type? We do, SQLAlchemy's DECIMAL, cool, let's try that, so a mapped_column with DECIMAL (I don't know why it's in caps) and nullable=False; if there's no price it's no use to us. Availability is a good one to store as well, so we can see how availability changes over time: a mapped_column String, and this can be nullable=True; if it's not there, that's fine, we don't need to know. We'll have created_at as well, the same as before, so we can see when the record was added. Now we need to do the mapping back to the products table, our relationship. We'll map it back with product_id, a Mapped[int] equal to a mapped_column, and this is where we put our ForeignKey, which goes back to "products.id"; it references the table name, because the table name is "products", plus the id column, and again nullable=False because this can't be null. Then product is mapped back to the Product model; this is our relationship, and we do back_populates="offers", which matches the attribute on Product. With the new way of doing SQLAlchemy you can have the tables in any order in your code, because we just map the two together and it knows about both. It's mapping a list here, so each product can have a list of offers attached to it, which is the "many" side, and each offer maps back to one specific product. That was a lot of work (and we don't need that stray mapped_column, by the way), but this is the core of the project: you need somewhere to put your data, your data has to go into your database, and you need to map it out. Once this is done, and this is the hard bit, it's pretty much plain sailing from there. You've just got to figure out what bits of data you want and create your database tables accordingly.
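Pulled together, models.py along the lines described probably looks something like this (a sketch using the 2.0-style declarative mapping; exact column options may differ slightly from the video):

    from datetime import datetime

    from sqlalchemy import DECIMAL, DateTime, ForeignKey, String, func
    from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship


    class Base(DeclarativeBase):
        pass


    class Product(Base):
        __tablename__ = "products"

        id: Mapped[int] = mapped_column(primary_key=True)
        name: Mapped[str] = mapped_column(String, nullable=False)
        url: Mapped[str] = mapped_column(String, nullable=True)
        description: Mapped[str] = mapped_column(String, nullable=True)
        # unique so the same product only ever gets one row
        sku: Mapped[str] = mapped_column(String, unique=True, nullable=False)
        brand: Mapped[str] = mapped_column(String, nullable=True)
        # filled in by the database when the row is inserted
        created_at: Mapped[datetime] = mapped_column(
            DateTime, server_default=func.now(), nullable=False
        )

        # one product -> many offers
        offers: Mapped[list["Offer"]] = relationship(back_populates="product")


    class Offer(Base):
        __tablename__ = "offers"

        id: Mapped[int] = mapped_column(primary_key=True)
        price: Mapped[float] = mapped_column(DECIMAL, nullable=False)
        availability: Mapped[str] = mapped_column(String, nullable=True)
        created_at: Mapped[datetime] = mapped_column(
            DateTime, server_default=func.now(), nullable=False
        )

        # foreign key back to products.id
        product_id: Mapped[int] = mapped_column(ForeignKey("products.id"), nullable=False)
        product: Mapped["Product"] = relationship(back_populates="offers")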

Now I'm going to go over to the main file, which is currently empty. From our db module I'm going to import the SessionLocal we created and the init_db function; then from models I'll import Product and Offer, both of our models, which we're going to need. Then we just import httpx, which is what we'll use to make the requests; from rich I'll import print, and from rich.logging I'll import RichHandler, because we're going to be using Rich logging, which is really cool: it gives you brilliant-looking logging straight away for no real cost. Then we just need to import logging as well. We'll start with that; we will need to import some other things as we go. What I'm going to do now is set up the logging bit so it's done and out of the way, so whenever we get to running our code we have it ready. All it does is give you a format for the messages: our basicConfig has logging level DEBUG, which shows everything, the format is just the message with the date, and our handler is the RichHandler. I've got into the habit of sticking this into pretty much all of my code, and it makes a world of difference to being able to see and clearly understand what's happening in your project when you run it, which is also extremely crucial when you run this in the cloud or anywhere like that; you need to be able to see what's going on.
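The logging setup described is essentially the standard RichHandler boilerplate; something along these lines (the exact format string in the video may differ):

    import logging

    from rich.logging import RichHandler

    # DEBUG shows everything, including httpx's own request/response logging
    logging.basicConfig(
        level="DEBUG",
        format="%(message)s",
        datefmt="[%X]",
        handlers=[RichHandler()],
    )

    logging.info("logging is ready")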

Right, the first thing I want to do is import our URLs, scaffold out the project, and build it up from there. The first function is going to be get_urls, because we want the URLs imported into our code so we can loop through them. I'll do with open("urls.csv", "r") as f, and we need to import the csv module; then reader = csv.reader(f) — will it import it for me? No, it doesn't want to, so let's just import csv here, cool — and pass in the file f. Then I'll say our urls is equal to a list comprehension, url[0] for url in reader, and we just return our urls, which is a nice, neat list; we don't need this extra line here. When you import from a CSV like this, you need to index the first element, otherwise you get a list of lists, or a list of tuples, I can't remember which.
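As a sketch, get_urls comes out to something like this (assuming the file is named urls.csv with one URL per row):

    import csv


    def get_urls() -> list[str]:
        # csv.reader yields a list per row, so take the first column
        # to end up with a flat list of URL strings
        with open("urls.csv", "r") as f:
            reader = csv.reader(f)
            return [row[0] for row in reader]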

So let's go and fill out our urls.csv. I'll just open that up, grab the URLs over from my other window and bring them across; these are the same URLs we were working with before. Let's go back to our main file, create a main function, and then add the if __name__ == "__main__" guard, which just says: if we run this file, run this. Normally I'd call main() in there, but I'm going to comment that out for now and just print get_urls(). Now I need to run this, so I'm going to exit this for the moment (I always forget to do this first) and create a new tmux session, or however you choose to get to your terminal. OK, let's run it. Oh, I need to run main.py. I've got some errors: mapped_column, nullable... I spelled nullable wrong. OK, fine, there we go. Right, fantastic: that's our URLs, a nice, neat list.

got there okay so let's go back to our main file and let's build out another

let's get rid of that we don't need that there let's build out our next function

so I'm thinking about getting the data now so I'm going to do get HTML I'm

going to show you something cool in here as well so we have URL which is going to

be our string and I'm going to say our response is going to be

httpx do get and we going to say the URL um is going to be equal to the

URL and I'm just going to quickly put some headers in here we're going to need

a user agent header I would normally do this as a client but we're actually only

going to be doing um I put my user agent in here I would normally do this as a

client but we're actually going to be making one request from here so I don't

think there's no there's no real need to we can just do this headers is equal to

headers and now we can go and say um let's do if uh response. status code

does not equal 200 so we don't want to we don't want our program to crash we

don't want our program to end if we get a like a 400 or something like that

because you know maybe the URL disappeared we're going to get a 404 so

so if you did raise for status it would just end your program so I'm just going

to do uh loging do info and we'll just say

um something like uh url url

responded with Bay bad stat code uh response do status code and I

need a cool so that should be fine so that just

means it's going to tell us this this went wrong when when things happen and

now we can just put our else in here and we can say let's do um we want to

extract the data now this is a really cool package which I didn't know existed

but I really should have done so we're going to do from

extract. Json LD we're going to import in the Json LD extractor so if you

watched me earlier talk about how the there's a load of Json data within

that tag this um actually gives us a really easy way to extract that without

having to do any of it ourself and it says uh to do something like this we can

do create a Json LD extractor an instance of the class and then all we do

is we give it the data we give it the response. text so we don't have to do

any we don't have to give it a par or anything like that we can just say it's

it all comes with it in in built-in so we can just say yeah here's the here's

the response. text and we can just return out the

data and inside this data is going to have a list of all of the Json LD plus

uh the application Json LD plus whatever it is those LD tags is going to have all

of the information neatly formatted as a list of dictionaries so we don't have to

do any of it ourselves yes it's another dependency but it's going to save us

time in the long run I think and so far it seems to be working really well so
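Roughly, get_html ends up looking like this sketch; the user-agent value is a placeholder, and the package is extruct, whose extruct.jsonld module provides the JsonLdExtractor described above.

    import logging

    import httpx
    from extruct.jsonld import JsonLdExtractor

    # placeholder user-agent string -- use a real browser UA of your choice
    HEADERS = {"user-agent": "Mozilla/5.0 (X11; Linux x86_64)"}


    def get_html(url: str) -> list[dict] | None:
        response = httpx.get(url, headers=HEADERS)

        # don't crash the whole run on a 404/500 -- log it and move on
        if response.status_code != 200:
            logging.info(f"{url} responded with bad status code {response.status_code}")
            return None

        # returns every application/ld+json block as a list of dicts
        return JsonLdExtractor().extract(response.text)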

Now I'm going to try this out. We'll come back here and say that urls is equal to get_urls(), and now we can do for url in that list; let's just print get_html(url) for now. Let's sling back over to the terminal side of the project and run it. You can see all of the debugging, all the logging, and you can see the data coming back. Let's scroll back up: here's the information from one of the shops, under the URL, so we have this schema, this lovely formatted data. You can see it does pick up the other JSON-LD block too, so we're going to need to filter that out, but that's really easy. Because I left the logging on DEBUG, you can also see all of the request/response data that httpx is handling for us. Here's more of the data, and here's the information we're after; I think it might be up here, actually, yes, it's this one. This one's interesting: we have the product and the offers, but it has all sorts of different products within the same, like, parent product. So what we'll want to do, if I scroll down (it's just flipped around this way for some reason), is match the SKU against the offer, because these are all different products that we're not necessarily interested in. Here we go. Did I see one that failed? No, that seems to be fine. There's more of the data that we wanted; I'm going to trim the logging down once we're happy. So did we get a good response from everyone? I thought I saw one fail... nope, seems to be fine. Perfect, great.

something like I think data is probably fine or let's let's call it product data

Maybe we need add equals here cool now what we want to do is we

want to say for um let's think so this is going to give

us a list so we want to say for I'll just

call this data in product data we want to do this is where we want to have our

if offers because we want to only get the uh dictionary from back from the

list if it's in data so if it's in there like this like so so now if we were to
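At this point the loop in main looks roughly like this (a sketch reusing the get_urls and get_html functions from above):

    urls = get_urls()

    for url in urls:
        product_data = get_html(url)
        if not product_data:
            continue  # bad status code -- already logged in get_html
        for data in product_data:
            # only keep the JSON-LD block that actually contains the offers
            if "offers" in data:
                print(data)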

Now, if we print out the data here, go back up to the top, change the logging level down to INFO (that should be fine) and run this again, we should see that we only get this information back. Back to scroll mode please, tmux; there we go. So now we're only printing out the information for the schema that has the offer in it, and we're skipping all of the extra ones we don't need. Now we know that's good, we can proceed to thinking about how we're going to store this information ("this might be None", OK, that's fine, I don't care) into our database.

The first thing we want to do is load the product, because we need the product loaded before we can load any offers against it. I'm going to call this one load_product, and it takes a session and some data; let's give ourselves a bit of room to work with. I'll say new_product (it might not technically be a new product) is equal to an instance of our Product class, which is our database model and table. We say name is equal to data["name"], and we just need to fill the rest in. For the URL, I think I set nullable=True in the models, let's go back up, yes, url nullable=True. What that means, coming back to our main file, is this: when you use square brackets and the key on a dictionary (and this data is going to be a dictionary), it fails if the key doesn't exist; but if you do data.get("url") instead, then if "url" doesn't exist this just becomes None, in which case it simply won't go into our database. Then we do description, which I think I'll do the same way, data.get("description"); it doesn't matter much, we'll leave it like this. Then sku is data["sku"], and brand is data["brand"]["name"]. I've mixed the two styles up, and I probably should have used the same one consistently throughout; we'll see if it comes back to bite us. Maybe it needs to be a bit better than that, but it'll do for now.

Now we want to add them to our database. Basically you do session.add with our new product, but the problem is that if this tries to add a product that already exists (because our SKU, which we set to unique, has to be unique) it will fail, we'll get an error, and our program will crash. The error we get is an IntegrityError, so we want something like a try block: session.add(new_product), and, if that's successful (it's not enjoying my typing... does this work? yeah, there we go), then we can do session.commit(). I'm just going to commit for each one; you could do batch commits, which is probably a better idea. Then I'll do session.refresh(new_product), which just refreshes our session and should give us the new product data back, and then we can return that product; I think that's how it works, cool. That's if it works. If it doesn't, we need an exception handler, our IntegrityError. When I start typing "Integrity" it isn't importing it for me, so I'll go back up to the top; it comes from sqlalchemy.exc, and we import IntegrityError, there we go, that's the one we want. Back down here, our exception is going to be except IntegrityError as e. We have a couple of choices here (it doesn't like my try/except indenting). We'll print the error out... in fact, no, we don't want to print it, we want logging.warning, and we need to make this an f-string for our logging, and we'll just log the error; that'll be fine, I'm not too worried about the wording. The most important thing is that we do a session.rollback(), because the session has already tried to add this, and if it fails we need to roll it back so we still have access to the session. Then I'm just going to return new_product out anyway; I don't think we need it, but just in case. So this is essentially what we're doing: we take the product data, push it into our model, try to add and commit, and if that fails with an IntegrityError we log that it failed and roll back. That's essentially what we have to do, and we're going to need to do the same for the offers part as well. So let's look at trying to get some products into our database.

going to need to do that for the offers part as well so let's look at trying to

get some products into our database uh so at the moment we're uh just printing

out the data here so I'm going to do we'll uh we don't need the print we're

going to do load product and let's have this as a

um variable because this uh what do we want to call this my brain is not

functioning um let's just call this one p um that will be fine and then we

can give it the session uh which should have come from

our um I've missed that out that's fine that's cool so up here we need to do um

we need to initialize our database I imported it but I never ran it that's of

course now we can create our session which is going to be equal to the

session local that we imported at the top perfect so now we now we have our

database initialized uh when we run this and we have our
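So the wiring in main now looks roughly like this (module names follow the "from db import" described earlier; adjust if your database file is named differently):

    from db import SessionLocal, init_db


    def main() -> None:
        init_db()                  # create the tables / database file if needed
        session = SessionLocal()
        urls = get_urls()

        for url in urls:
            product_data = get_html(url)
            if not product_data:
                continue
            for data in product_data:
                if "offers" in data:
                    p = load_product(session, data)
                    # load_offers(session, data) gets added here later on


    if __name__ == "__main__":
        main()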

This could throw an error now, because I've never run it before, so if I got anything wrong with the database we might see some issues. Let's run it. Cool, I didn't see any warnings or anything like that, and it looks like we put the data in, so let's check. You can use a few different ways to check what's in your database; there's DB Browser for SQLite, which is a graphical user interface and is pretty good, but I'm just going to run sqlite3 in my terminal. We do .open products.db, then .tables, and it shows our two tables, and from here we can just run SQL queries. I'll do SELECT * FROM products, just to check we've got something in there, and we do. This is why I wasn't sure about the descriptions: I think we probably want to redo this and strip the whitespace from them. But let's just select, what did I call it, sku? Let's do id, sku, name. Cool, so we now have four products, and I think I put in four, no, five URLs: one, two, three, four, five; OK, so one is failing, fine. So now we have four products that we can see in our database. We can get out of this here (and I didn't mean to close that tmux window, so let's just get that back, there we go). Cool. So now we know we're loading the products and it's working fine, that's good. I'm going to get rid of this print statement, I don't need it at the moment. Now that we're loading the products, we need to load the offers that are associated with these products. In fact, let's run it again so we can see that we should hopefully get that warning... there we go, there's our first warning: IntegrityError, unique constraint failed, which is exactly what we're expecting, because it's saying, hey, this already exists in our database. So now let's think about loading in the offers.

already exists in our database so now you think about loading in the offers um

so we're going to build ourselves a load offers function

like so let's call it load offers and again it's going to take

in the session and also the same set of data we need to think about this a

little bit more because as I showed you in some of the um uh in the in the Json

schemas earlier that some of them the offers was in a list and some of them it

was not in a list we'll cover that in just a second because the first thing

that we need to do is find the product that is associ assciated to the offer

now when we get this in we will have just load either we have just loaded the

product or we've skipped over it because the product doesn't already exists so

I'm going call say our product is going to be equal to session.

query on our product model then we can do filter against the product. skew is

equal to the data skew here and then we want to do DOT first so it returns us

the first thing this is is going to get us that product because obviously the

skew is unique and the skew is going to come from the data that we just scraped

so we're going to do that here what we want to do now is we want to check to

see what sort of data the offers is now in some of them there was a list in

others it was just a single dictionary there's a few different ways you can do

this the way that I solved it was um if uh is is instance and then we can say

data offers this is the key in inside our dictionary that we've scraped we say

if this is a list so what I'm going to do is I'm going to duplicate a little

bit of stuff here which isn't ideal but this does work and given a bit more time

I probably would maybe sort this out a little bit I'm going to do for offer in

So, for offer in data["offers"], we can now go ahead and create the new offer. But I want to do what I talked about before: the offers for that one page had one big parent product with lots of different SKUs, and we were only focusing on one. We don't want one product SKU and then loads of offers with different product SKUs attached to it, so I'm just going to say if the offer's sku is equal to product.sku, the product we pulled from the database. Now we can create our new_offer, an instance of our Offer model: price is equal to the offer's price, availability is offer.get("availability") (I'm using .get() because this wasn't available in every one, I think, so that handles it for us), and then the product_id that this is associated with, which is another database column, is equal to the ID of the product we pulled. If I go quickly back to the models, this is the field we're mapping to the foreign key, and we're saying it's going to equal the product ID we pulled from our database, so we're mapping everything across nice and neatly.

Now we can reuse that try/except again; I know I'm duplicating some code. Put this in here, oh, I didn't copy it, or did I? I did. So we can put this in, indent it, and this needs to be new_offer. I don't think we need the refresh bit, because I was planning on doing something with the data that comes back once it's been added, but I'm not; we'll just leave it like that anyway. So this handles it if offers is an instance of a list; now we just need to handle it if it's not, which is where we do a little more code duplication. Maybe you'd want to split this out into two separate pieces, or have a single loading function we could call, but basically what I'm going to do now is just this, and I know what you're going to say. We add an else: if it's an instance of a list we do this, else we do that, and the only thing that needs to change is that, because the offer is now directly inside the data, we do data["offers"]["price"], and availability is data["offers"].get("availability"); the rest should be the same. Let me just fix my indenting, get rid of the bits I don't need, they're clogging things up, there we go, cool. So again, I know I've duplicated a lot of code here, and that was only because one of them was coming back as a list and I needed a way to handle it; I think tidying this up would probably be a good idea.
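One way to tidy that duplication (a sketch, not the exact code from the video) is to normalise offers to a list first and keep the SKU check:

    import logging

    from sqlalchemy.exc import IntegrityError

    from models import Offer, Product


    def load_offers(session, data) -> None:
        # find the product row we just inserted (or that already existed)
        product = session.query(Product).filter(Product.sku == data["sku"]).first()
        if product is None:
            return

        offers = data["offers"]
        if not isinstance(offers, list):
            offers = [offers]  # some sites give a single dict, others a list

        for offer in offers:
            # parent products list offers for sibling SKUs -- keep only the
            # offer matching the product we looked up (when a SKU is present)
            if offer.get("sku") is not None and offer.get("sku") != product.sku:
                continue
            new_offer = Offer(
                price=offer["price"],
                availability=offer.get("availability"),
                product_id=product.id,
            )
            try:
                session.add(new_offer)
                session.commit()
            except IntegrityError as e:
                logging.warning(f"{e}")
                session.rollback()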

However, this should work, so we can now call load_offers with our data. Let's come back here and run it; we'll probably have something to fix because I just typed all of that out. "availability is an invalid keyword", OK, so I've done something wrong: I misspelled availability, no problem, let's run it again. We're getting stuff added in... nope, OK, so this is where I needed the .get(), because availability doesn't exist for one of the items; there we go. Now when we run it, it just puts None in there instead. There we go, cool.

Let's check SQLite again: .open products.db, then SELECT * FROM offers WHERE product_id = 1. Oh, am I querying the wrong thing? It's this key here, there we go. Cool, so now I have three entries for this product, all the same price; the first run was when I had the issue with availability, which is why it isn't in there. So what I'm going to do is exit out of this, remove my products.db, clear the screen, and run main again (oh, it catches me out) and start adding some data in. This should be one lot of products, and we should now be adding the same offer over and over again, because obviously you'd run it at different times to get different prices, but I want to put some data into the database so it's available for us. Let's give it one more run. Sweet. So, sqlite3 again, .open products.db, and SELECT * FROM offers WHERE product_id = 2. Cool, so there we have it: now we've got these entries, with the date and time they were entered.

Let's close this out and come back to our code. That's pretty much it, really. This is a bare-bones skeleton of a project: we take the URLs, we use what we looked at earlier, where we could find the data in that schema as JSON-LD, and we use the JsonLdExtractor tool to do the handling for us. This one function here is the scraping part, that's it; the rest of it is parsing it and adding it to the database, our models and everything like that, and then we just run it all down here, which is quite neat and tidy. It's basically a hundred-odd lines of code. To improve it, if I were going to run this a lot, I'd be tempted to put in some more validation and handle the load_offers part a bit better, but this will work fine. We're pulling in specific, structured schema data, so this isn't going to change, which is good compared with trying to parse a load of HTML. Maybe we'd want to put in something like Pydantic to validate it first, so we can avoid any extra errors, but that's generally more useful when you're taking in user-submitted data. (We didn't need print here; I imported that by accident, there we go.) What I'll do is put all of this code on my GitHub for you to have a look at, so you can decide what bits you like and take something away from it. This has been quite a long video, more of a full project, so hopefully you've enjoyed it and got something out of it. If you still want more coding after this, which of course you do, make sure you subscribe to my channel and check out this video here, which is more on web-scraping data rather than building something like this to save it.