Using Browser Cookies & Headers to Scrape Data (14:49)
2025-03-12
Check Out ProxyScrape here: https://proxyscrape.com/?ref=jhnwr ➡ JOIN MY MAILING LIST https://johnwr.com ➡ COMMUNITY https://discord.gg/C4J2uckpbR https://www.patreon.com/johnwatsonrooney ➡ PROXIES https://proxyscrape.com/?ref=jhnwr ➡ HOSTING (Digital Ocean) https://m.do.co/c/c7c90f161ff6 If you are new, welcome. I'm John, a self-taught Python developer working in the web and data space. I specialize in data extraction and automation. If you like programming and web content as much as I do...
Subtitles

I talk about this web scraping method a lot: intercepting the network requests from the front-end site to the back-end API and then mimicking those requests. In this video I'm not only going to show you how to do that, I'm also going to show you how to find which headers you actually need, and then how to reliably get them each and every time using a stealth browser.

So here's what I'm going to do: I'll open up the dev tools, come over to the Network tab, and load up this page, and we're going to find the API call that we want to make ourselves. How you do this on the site you're looking at is up to you; the best way, in my opinion, is just to start clicking around and looking for things. I'm going to go with this one here. To me, this one has a good response: product data, prices, everything. So this is the request we want. But if we just copy this URL and paste it into our browser, you can see we get "authentication denied". Quite often that won't be the case, but when it is, we need to figure out what's stopping us: what do we need in order to be authenticated?

This video is sponsored by ProxyScrape, friends of the channel, and their proxies are the ones I use myself. Whenever I scrape data I'm always behind a proxy; these days it's just a necessity to be able to rotate through IPs. I pretty much always use residential ones, but there are use cases for the datacenter proxies too; you just need to figure out what's going to work best for you. I've actually been using the mobile proxies a lot recently and found them to be pretty good, especially when you're utilizing them within the country you're expecting to scrape. They're very easy to add to your project: I use an environment variable to pull mine in, so I don't have to worry about things like making sure the credentials are there; it's all done nice and neatly, and then you just let ProxyScrape handle the rest for you. You can choose to rotate them if you want a new proxy on every request, or you can have sticky sessions, which hold on to the same proxy for a few minutes; that can also be pretty useful, and it's something I'd consider doing in the project I'm working on here, although in this case I'm just going to rotate through the mobile proxies. Also, any traffic you purchase is yours to use whenever, as it never expires, which is a nice touch too. So if this all sounds good to you, go ahead and check out ProxyScrape at the link in the description below.
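Pulling a proxy in from an environment variable, as described above, might look something like this minimal sketch. The variable name PROXY_URL and the connection string format are assumptions; your proxy provider's dashboard gives you the real one.

```python
# Minimal sketch: load a rotating proxy from an environment variable
# into a requests session. PROXY_URL is a hypothetical variable name;
# the value would be something like "http://user:pass@host:port".
import os
import requests

proxy_url = os.environ.get("PROXY_URL")

session = requests.Session()
if proxy_url:
    # Route both HTTP and HTTPS traffic through the proxy.
    session.proxies = {"http": proxy_url, "https": proxy_url}
```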

Now, let's get back to our project. I'm going to come over here, select the request, and do "Edit and Resend"; I'll move this over a little and bring this panel out here. Right away, the eagle-eyed amongst you will see that there's this client ID and client secret down here. These were generated by my browser when it loaded up the page. So if I untick these two and hit send, you'll see down here that our response is "authentication denied", which is what we got when we made the request in the plain browser. If I put, I think it's just the client ID, back in and send again, now we have the response. So we know we need to get this client ID header, whatever it's going to be, to be able to make the request properly.
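You can run the same elimination in code rather than in the dev tools: replay the captured request repeatedly, dropping one header at a time, and watch which removal breaks it. A rough sketch, with a placeholder URL and hypothetical header names:

```python
# Sketch: find which captured headers are actually required by
# dropping one at a time and replaying the request.
# The URL and header names below are placeholders, not the real site's.
import requests

API_URL = "https://www.example.com/api/products"
captured_headers = {
    "user-agent": "Mozilla/5.0 ...",
    "client-id": "abc123",      # hypothetical
    "client-secret": "def456",  # hypothetical
}

for name in list(captured_headers):
    trimmed = {k: v for k, v in captured_headers.items() if k != name}
    resp = requests.get(API_URL, headers=trimmed)
    print(f"without {name}: {resp.status_code}")
```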

So how do we do that? Well, we need to go through a browser: load the page up in a browser, find this client ID, grab it, and add it to our session, so we can then make the requests we need. If I come over here, do "Copy as cURL", come back to my browser, open curlconverter and paste it in, you'll see that we have a whole load of cookies and we also have the client ID here. To demonstrate, I'm going to copy this out and put it into my code editor real quick, and we'll have a quick look at it to make sure it works as we're expecting; then we can double-check what we actually need. Let's open this up and save. What I'm going to do is just blast away all the cookies, so we're not sending the cookies at all if we remove them here, but we are keeping the header in. Then we'll print the response (this is all me just figuring out what we need): response.json(). If we run this one, okay: this is the data we actually want back. So we know from this experiment that we don't need these cookies; they're not relevant to getting this information back. That's really important, because if they are required, you need to include them, but if they aren't, you can simply omit them completely.
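The curlconverter output is roughly a headers dict plus a cookies dict; replaying the request with the headers alone looks something like this sketch. The URL and the header values are placeholders, since curlconverter fills in the real ones for you:

```python
# Sketch of the curlconverter-style replay with the cookies stripped out.
import requests

headers = {
    "user-agent": "Mozilla/5.0 ...",
    "accept": "application/json",
    "client-id": "abc123",  # the header the API actually checks (hypothetical name)
}

# cookies = {...}  # deleted: the experiment above showed they aren't needed

response = requests.get("https://www.example.com/api/products", headers=headers)
print(response.json())
```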

What we want to do now is build up our project so that we make one request to the main site with our stealth browser, capture those cookies and headers (or just the headers, in this case), and load them into a requests session. Then we can make all the requests we need with that session. So, like I have here, I'm going to create a session that has these headers in it, with this client ID; every time we run it we'll get a new ID and so on, so it will keep working for us. We make one request with the browser, which takes a while, and then we can make subsequent requests to this API however you decide to; it depends on whatever website you're working with. This one just looks like it takes in a group of product IDs, which is fine; we're not going to focus on that, we're going to focus on how to capture the actual headers.

So I'm going to create a new project in my projects folder, and we'll just call this one "LV edit". I'll cd into the folder, and we actually need to create a virtual environment first. Then the tools we're going to use: requests, but also this stealth browser here, Camoufox. I'm going to do a separate video on it, because I think it's really powerful, and possibly the best one I've seen so far; there's some really good information on its site. The only thing I'll say about stealth browsers is that everyone's kind of got their own way of doing things, and eventually the tricks a stealth browser uses to avoid detection tend to get patched. This one is open source, which brings the benefit that it's free to use, whenever and wherever you like; the flip side is that if something changes, it's up to the one maintainer to fix and patch things. But for me, right now, this works really well, so if you're struggling with the stealth browser you use, definitely try this. So we activate our environment and do pip3 install, and we'll have Camoufox, requests, and rich as well. One of the good things about Camoufox is that it's built around Playwright's API, so anything you can do in Playwright, you can pretty much do with Camoufox.
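As a quick sketch of that point, assuming the camoufox package is installed (the project's docs also describe a fetch step to download the browser binary itself), the familiar Playwright page methods work unchanged:

```python
# Minimal sketch: Camoufox exposes Playwright's sync API, so the usual
# page methods (goto, title, etc.) work as-is.
from camoufox.sync_api import Camoufox

with Camoufox(headless=True) as browser:
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
```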

Another thing I'm going to do differently here is create a couple of separate files: an extractor class, which I'll put in here, and then my main.py file, which is going to call that extractor. The reason I'm doing this is that I want to separate things out, because I want to be able to build upon it; everything I put inside this class, and then access from main, you could do separately if you wanted to. So let's open up our extractor class and import what we need. We create our extractor class and then initialize it, and we want to give it some information to initialize with. The first thing is a session object on the class, because we want to be able to update that session with the headers and the proxies we're going to be pulling in: the proxies from my environment variables, and the headers from the requests we're making. This is the proxy I'm going to be using, my mobile proxy, the UK version. These are one of my favorites at the moment; they're much harder to block, because there's such a wide range of use behind them. Then there's a check on whether the environment variable exists, and then we want to call our headers-from-browser method.

What this method involves is loading up the page using the Camoufox browser, then creating a handler to actually pull the requests out; this is the standard Playwright way, if you've ever seen or done something like that. We create a handler, and what I'm going to do is look for the letters "api" inside the request URL. When we load the page, that page makes a request over XHR to the back end, and it has the keyword "api" in it, which we saw over here. When we find it, we pull the headers from it, go through each of the header items, and build a little dict, excluding these ones. Now, this is important: if you include set-cookie this way, you're going to get an error, because it won't be formatted the right way; if you want to carry cookies across as well, you have to do that through the cookie jar, I believe. Then I update our session with all of the headers. So what this code is doing right here is going through all of the headers it finds when the page makes this request, turning them into a key-value dictionary, and updating our session. Then we attach our handlers to the page we've created through Camoufox (which is essentially Playwright), and then we go to our URL, wait for load state, and reload. Now, I had some issues here, and I don't know if this is typical or something that happens a lot, but I basically had to go to the page first and then reload it, which was interesting; if I didn't do this, it didn't actually catch those API requests. So that's something worth bearing in mind. I've just put a print statement in here so I can see that the session has been updated with the headers. And this last part is just my function doing a GET request using the session, so the session will have all of the extra headers in it. So that's pretty simple; that's pretty much all there is to it.
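Putting those pieces together, a sketch of the extractor might look like the following. This is my reconstruction of what's described above, not the exact code from the video: the class name, the PROXY_URL variable, the excluded header set, and the "api" keyword match are all illustrative.

```python
# Sketch of the extractor: one Camoufox page load captures the XHR
# headers, which are loaded into a requests session for all
# subsequent API calls.
import os
import requests
from camoufox.sync_api import Camoufox

# Headers that requests manages itself or that won't transfer cleanly.
EXCLUDED = {"host", "content-length", "cookie", "set-cookie"}


class Extractor:
    def __init__(self):
        self.session = requests.Session()
        proxy_url = os.environ.get("PROXY_URL")  # hypothetical variable name
        if proxy_url:
            self.session.proxies = {"http": proxy_url, "https": proxy_url}

    def headers_from_browser(self, url: str) -> None:
        def handle_request(request):
            # Only capture the XHR call to the back-end API.
            if "api" in request.url:
                new_headers = {
                    k: v for k, v in request.headers.items()
                    if k.lower() not in EXCLUDED
                }
                self.session.headers.update(new_headers)
                print("session updated:", new_headers)

        with Camoufox(headless=False) as browser:
            page = browser.new_page()
            page.on("request", handle_request)
            # Going to the page and then reloading was needed for the
            # handler to catch the API requests in my testing.
            page.goto(url)
            page.wait_for_load_state()
            page.reload()

    def get(self, url: str) -> requests.Response:
        # Plain GET, but through the session that now carries the headers.
        return self.session.get(url)
```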

Now, as I said, if you don't want to use a class, you don't have to; you can structure this however you need it to be. I wanted to put it in a class because it was going to be easier for me to go through, explain, and show. Now we can import that class into our main.py file and go from there. I did put in a step to go to Google first; I don't think it's actually necessary, it's just something I left in for the sake of it. So what I'm going to do now is go to my main.py file and import our extractor class: from extractor, import the Extractor class, and I think that's all we need. I'll say that e is an instance of Extractor, and here are the two URLs: this is the page for the browser to go to, and this is the one with the API. This could be anything you want it to be; this is just the one with those products on it. This is the part where, once we've got it working, we'd actually start changing the API URL to try to get more data, less data, or more specific data; here, we're just focusing on getting the headers, and specifically the client ID header we need to make this work. I'm just going to call e.headers_from_browser and pass in the browser URL. There we go. Now, at this point in our code, our session will be active with all of the headers we need in it to make requests, so we should just be able to do response = e.get(api_url); that's the function we created that uses our session, called with the API URL. It should be that simple. Then I'll do for item in response.json(), and we'll just print the item for the moment, then narrow it down so we can actually see what we're getting.
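Assuming the Extractor class sketched above, main.py ends up as just a few lines. The URLs here are placeholders for the page and API endpoint found in the dev tools:

```python
# Sketch of main.py: one browser visit to capture headers, then
# API calls through the prepared session.
from extractor import Extractor

BROWSER_URL = "https://www.example.com/products"     # page the browser visits
API_URL = "https://www.example.com/api/products"     # endpoint from dev tools

e = Extractor()
e.headers_from_browser(BROWSER_URL)

response = e.get(API_URL)
for item in response.json():
    print(item)  # narrow down to specific fields once the shape is known
```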

So when I run this now, we should see our browser open up. A good thing about Camoufox is that it just handles all of the different fingerprints for us, so you might see it open at different sizes and whatnot; that's just it doing its thing. Let's run it. Oh, we need to sort something out first, since this is the first time I've run it here. There we go. Cool, so it's loading up Google, and you can see it's a different kind of size; then it's going to load up the actual page we were looking at. We'll make this a little bit smaller, because there's going to be loads of data on this side of the screen in just a minute; in fact, we can make it like this. I didn't bother with any of the clicking, and that's worked, like so. I'm going to scroll back up.

In here are the headers (and the cookies, or rather just the headers) that I pulled out, and this is what we've updated our session with. We've got the browser user agent, we've got the referers, and we've got the client ID, which is the most important thing, all of this information here. We do have the cookie header, because that was set; I don't think we needed it, but it was captured anyway just by doing it like that. And that's it: we were then able to make this request carrying that information. So instead of printing all of this, we'll print out the product ID, or whatever the key was actually called, just so we can see: the item's product ID, and I think the name as well. Again, you can get whatever information you want from here. And just to show you, if I copy this, let's say we were going to make different requests; I'm going to make the same request, but we'll do it twice, just so you get the idea that it's working. So again we run this, and now we have to wait a few seconds for our browser to do its thing, because we need that data; once that's done, we should be able to make a decent number of calls to their API with that client ID that came from our browser. And there, you can see it's worked.

Now, it's worth noting that I'm not doing anything here that my browser isn't doing. Instead of accessing the data through the browser, I'm just loading my browser up, taking the information that it sends over, and then accessing the data that way. In my opinion that's still perfectly fine: I'm not pulling any information that's behind a login or behind anything that isn't outwardly visible on the main page; I'm merely doing it in a more targeted, quicker, and more efficient manner by utilizing what my browser is already doing. So, in my opinion, it's absolutely fine. From here, what you'd want to do is update this API URL to whatever you want it to be, go ahead and explore it, find out what you need, and then just make the request using the session. This should work on quite a lot of sites as well; you might find that some need different bits of information, but generally, pulling the headers from the browser is going to work well enough.