Video Thumbnail 10:13
Python Web Scraping: JSON in SCRIPT tags
43.9K
796
2020-05-28
This video covers a simple and easy way to web scrape with Python by getting the data out in JSON format from the HTML script tags. This inline JSON is commonplace on lots of websites and can be used to our advantage in just a few lines of code! JSON Formatter: https://jsonformatter.org/ Python Web Scraping Guide: https://youtu.be/J91bHusPatc ------------------------------------- twitter https://twitter.com/jhnwr code editor https://code.visualstudio.com/ WSL2 (linux on windows) https://docs....
Subtitles

hi everyone and welcome today's video is going to be about how to extract the

JSON data from inside a script tag on an HTML page so this is the page that

we're going to be using it's a student accommodation and I've picked London so

if we have a look you can see that the accommodations all loaded up like this

and if you go to view source you will see that there is no useable HTML data

for us to actually scrape out using BeautifulSoup but what we do have is

this long list here of what looks like JSON data we can see that it's got lots

of the information that we're interested in so you can see that it's got the

address the area of the city postcode picture

links etc etc and the value so what we want to be able to do is use Python to

extract this information and parse it with json just to get out what we want

or to make our own data set so the first thing we need to do is get over to our

text editor and copy the URL so I'll get this we're going to need to import a few

different libraries so we're going to import requests to go out and get the

information then we're going to import json because we're going to need that

to work with the data and then we are going to do from bs4 import

BeautifulSoup because we're going to use that as well to parse the HTML take that

it's all good right so we'll set the URL that's a nice long one there we go and

what we need to do now is r is equal to, go and get the data, requests.get the URL so

now if we do print, I think it's r.status_code, we're getting a 200 which

means we are connecting fine ok so what we want to do is we want to create our

soup variable and pass that information into BeautifulSoup so we can extract

the data from that script tag so we're going to do soup is equal to

BeautifulSoup and then r.content and we're going to specify the HTML

parser like this as I tend to always do let's print something out so we know

that we are in the right place I tend to just do the title or something like that
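The setup described so far can be sketched roughly as follows. This is a minimal sketch rather than the exact code typed in the video: the function names `fetch_soup` and `page_title` and the `timeout` value are assumptions added for illustration, and the listing URL itself is not reproduced here.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url):
    """Download a page and parse it with Python's built-in html.parser."""
    r = requests.get(url, timeout=10)  # timeout is an added assumption
    print(r.status_code)               # 200 means the request succeeded
    return BeautifulSoup(r.content, "html.parser")


def page_title(soup):
    """Return the <title> text, a quick check that we are on the right page."""
    return soup.title.string if soup.title else None
```

Calling `soup = fetch_soup(url)` and then `print(page_title(soup))` mirrors the status code and title checks described above.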

okay now we know that we're on the right page what we need to do now is

just a bit of recon and we need to find out where this information is and how

we're going to plan to get it out so the first thing we need to do is if we go

back to our source code we can see here that it's inside a script tag but

there's nothing else that defines what this script tag is there's no ID or it's

not in a div or anything like that so in these cases the easiest way to do it is

to literally count down how many script tags you are in and then use an index

when we do find_all with BeautifulSoup so I'm going to start up here and I see

this is the first one, that's closed, that's two, that's the third one, that's the

fourth one and this is the fifth one so this is the one that has our data in so

we need to use BeautifulSoup to find all the script tags, index out the fifth one

which would be number four because it's a zero index and then get the

information out so I'm going to do script is equal to soup.find_all and we're

going to look for the script tags and we're saying that we need index four so if we now

go and print out that and see what we get you can see right away that we are

in the right place and we are getting back all of this information so

that's great but the problem is that this here has got a lot of

extra data around it which means we can't just dump that straight into the json

library in Python and get the information that we want so what I tend

to do is I like to copy all of this so let's copy all of this out all of it we

go all the way down to everything inside the script tags and I put it into an

online JSON formatter, this is the one I use because it will tell you what the

problems are so if we paste this in we can see right away it's saying that it

is an error and is expecting a string or blah blah so this means that

if we try to load this in as a JSON object into a Python script it will just

fail so we need to change this string up a bit before we can load it in so the

first thing I can see here is there's a lot of white space at the beginning so

that's fine we can get rid of that nice and easily so we can do .strip, first

of all we'll do .text, sorry, just so we get the text from this and then we're

going to do .strip and this is going to remove, if I make this a bit bigger

and come up to the top, the .text is going to remove the

script tags and the .strip will remove the leading white space so let's

do that again okay so now we are just down to this
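A rough sketch of the steps up to this point: find every script tag, pick the fifth by index, then take its text and strip the whitespace. The HTML below is an invented stand-in and `window.__DATA__` is a hypothetical variable name, not taken from the real page. The video uses `.text`; this sketch uses `.string` instead, which returns the script's single text node and does not depend on how a given BeautifulSoup release treats script contents in `.text`.

```python
from bs4 import BeautifulSoup

# Stand-in HTML (invented for illustration): five script tags, data in the fifth.
HTML = """
<html><head>
<script src="a.js"></script>
<script>var one = 1;</script>
<script src="b.js"></script>
<script>var two = 2;</script>
<script>
      window.__DATA__ = {"properties": {"count": 1}};
</script>
</head><body></body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")
scripts = soup.find_all("script")

# Fifth script tag -> index 4, because Python lists are zero-indexed.
raw = scripts[4].string    # contents only, without the <script> tags
script = raw.strip()       # remove leading and trailing whitespace

print(len(scripts))
print(script)
```

Counting tags by hand is brittle: if the site adds or removes a script tag, the index changes, so it is worth re-checking the page source when the scraper stops working.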

that's good so we go back to our formatter and we go well we've got that

we've got rid of the leading whitespace but we're still not quite there yet

what we need to do is we need to basically chop off the beginning and

anything at the end of this string to make it match the JSON parser so we can

get that information so what we want to do is basically we want to count how

many characters including white space that we want to get rid of at the front

of our string and JSON will always start, if you look here, it'll start

with something like this and we can see that the first thing that we match is

the curly bracket right there so we want to get rid of everything before this bracket so

I've just counted this out and I think it's about 55 so what I'm going to do is

I'm going to go ahead and I'm gonna put an index on the text, sorry,

a slice on the text and I'm gonna go ahead and say remove the first 55

characters from this string you can see here 55 that means start 55 characters

in so let's run this again let's see what we get okay so I'm not quite there yet

I've still got a few left so let's say 55 and a bit of white space 56 57 58

so let's go for 58 and then go again that's great so there's nothing before

our leading curly bracket there so now we know we're getting that we can

get rid of all of this at the start now if we try and validate again we're

getting an error at the end so we can see that it should end with the curly bracket

not this semicolon well that's nice and easy we can apply the same method

and we're taking 58 off the front if we do minus 1 that means we're going to

leave one off at the end so we're going to come in one from the end so if we run

this again you can see now the semicolon's gone and there's nothing at the

start so what we want to do is if we come back here and we get rid of this

semicolon and validate and sometimes this doesn't work so we need to copy it

delete it and we paste it in there we go so now this is telling me that if we now

that we've cut our string down to this we can pass this into the json library

in Python and we can then extract information from it as we would do
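The trimming step can be sketched on a small stand-in string. The sample text and the `window.__DATA__` name are invented for illustration. In the video the prefix length was hand-counted as 58 on the real page; here it is computed from the made-up prefix only so the example stays consistent.

```python
# Stand-in for the stripped script text (invented; the real page differs).
script_text = 'window.__DATA__ = {"properties": {"count": 1}};'

# Number of characters to drop from the front. This was 58 on the page in the
# video; for this made-up sample it is the length of the assignment prefix.
prefix_len = len('window.__DATA__ = ')

# Slice: start prefix_len characters in, and stop one character before the end
# so the trailing semicolon is left off.
cleaned = script_text[prefix_len:-1]

print(cleaned)
```

A slice like `[58:-1]` only stays correct while the page keeps exactly the same prefix and suffix, so it is worth re-validating the result if the site changes.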

so let's do that now so what we want to do, move this down, what we

want to do let's call it JSON object it's equal to actually no let's call it

data is equal to json.loads because we're loading a string into it and we're

going to do script like this and now if we print our data we should get exactly

this back again there we go so now we basically have a JSON object loaded in

and saved into our data variable we can now go ahead and manipulate as we would
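As a small illustration of the loading step, `json.loads` turns the trimmed string into ordinary Python dictionaries and lists, while the untrimmed text is rejected. The sample strings are invented for this sketch.

```python
import json

# Trimmed text: starts with { and ends with }, so it is valid JSON.
cleaned = '{"properties": {"count": 1}}'
data = json.loads(cleaned)  # loads = "load string"

print(type(data))
print(data)

# The untrimmed text still has the JavaScript assignment and the semicolon,
# so the parser raises an error, as the online formatter reported.
untrimmed = 'window.__DATA__ = ' + cleaned + ';'
try:
    json.loads(untrimmed)
except json.JSONDecodeError as err:
    print("not valid JSON:", err)
```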

so what I'll do is we'll just come back here and we can see that it's

inside our main bracket we've got properties which is where we want to be

and then listings then groups which then becomes a list that's a list and then

results and then property so we need to go all the way through this first so if

I just quickly do that so we want to do properties then it was listings group

listings, groups, nope, can't spell listings

groups and then zero for the zero index and then results should give all the results

and then if we pick the first one that is essentially the first one here which

is this and that's all the information in it so you could go even further and you

could get just the addresses out or you could go and just get say the postcode

and then the price you can create your own data set or you could scrape this

every day and see if any new properties come up or something like that so that's

how I would go about approaching this we can use BeautifulSoup to find

the script tag and we've counted how many script tags down because there was

no ID, if there's an ID you can find it that way and we're basically just

removing characters from the end and the beginning of the string to make it into

a JSON format so that we can then manipulate it with

json.loads and going through that way and I always find that the online parsers are

really useful I'll leave a link to that one that I use and also I'll leave a

link to a couple of my other videos where I explain some more of the other

concepts that we've used in here that I've probably glossed over really

quickly so hopefully this has been helpful to you guys let me know in the

comments any questions or queries give it a like if you liked the video

consider subscribing to my channel there's more web scraping content and

there is more to come cheers guys bye
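As a variation on the trimming step recapped above, the positions of the first opening and last closing curly bracket can be located in code instead of counted by hand. This is a swapped-in technique, not what the video does, and the helper name `extract_json` is hypothetical. It assumes the script holds a single JSON object and that no curly bracket appears before it.

```python
import json


def extract_json(script_text):
    """Parse the outermost {...} object found in a script's text."""
    start = script_text.find("{")   # first opening bracket
    end = script_text.rfind("}")    # last closing bracket
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON object found in script text")
    return json.loads(script_text[start:end + 1])


# Invented sample with leading whitespace, a JavaScript assignment and a
# trailing semicolon, similar in shape to the script text in the video.
sample = '   window.__DATA__ = {"properties": {"count": 1}};  '
print(extract_json(sample))
```

This avoids the hard-coded 58 and -1, at the cost of failing on scripts that contain other brackets before or after the data.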