Hi everyone, and welcome. Today's video is going to be about how to extract
JSON data from inside a script tag on an HTML page. This is the page that
we're going to be using: it's a student accommodation site, and I've picked
London. If we have a look, you can see that the accommodations have all loaded
up like this, and if you go to view source you will see that there is no
usable HTML data for us to actually scrape out using BeautifulSoup. What we do
have is this long block here of what looks like JSON data. We can see that it's
got lots of the information that we're interested in: the address, the area of
the city, postcode, picture links, etc., and the value. So what we want to be
able to do is use Python to extract this information and parse it with json,
just to pull out what we want or to make our own data set.

The first thing we need to do is get over to our text editor and copy the URL,
so I'll get this. We're going to need to import a few different libraries.
We're going to import requests to go out and get the information, then we're
going to import json because we're going to need that to work with the data,
and then we are going to do from bs4 import BeautifulSoup, because we're going
to use that as well to parse the HTML. Check that, it's all good.

Right, so we'll set the URL, that's a nice long one, there we go. What we need
to do now is r is equal to requests.get of the URL, to go and get the data. Now
if we do print, I think it's r.status_code, we're getting a 200, which means we
are connecting fine.

OK, so we want to create our soup variable and pass that information into
BeautifulSoup so we can extract the data from that script tag. We're going to
do soup is equal to BeautifulSoup, and then r.content, and we're going to
specify the html.parser like this, as I tend to always do. Let's print
something out so we know that we are in the right place. I tend to just do the
title or something like that.
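
As a rough sketch of the setup described so far, something like the following would do it. The URL is a hypothetical placeholder (the real one from the video isn't reproduced here), the helper names `parse_page` and `fetch_soup` are made up for illustration, and the parsing step is shown on a tiny stand-in document so it runs without a network connection.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical placeholder; swap in the page you actually want to scrape.
URL = "https://example.com/student-accommodation/london"


def parse_page(html):
    """Parse raw HTML with Python's built-in html.parser backend."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url):
    """Download a page and return (status_code, soup)."""
    r = requests.get(url, timeout=10)
    return r.status_code, parse_page(r.content)


# Offline check of the parsing step, using a made-up stand-in document.
sample = "<html><head><title>Student rooms in London</title></head></html>"
print(parse_page(sample).title)

# Against a live page you would do:
#   status, soup = fetch_soup(URL)
#   print(status)       # 200 means the request went through
#   print(soup.title)
```

Printing the title is just a quick sanity check that the right page came back before digging any further.
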
OK, now we know that we're on the right page, what we need to do is just a bit
of recon: we need to find out where this information is and how we're going to
get it out. If we go back to our source code, we can see here that it's inside
a script tag, but there's nothing else that defines what this script tag is.
There's no ID, it's not in a div or anything like that. In these cases the
easiest way to do it is to literally count down how many script tags in you
are, and then use an index when we do find_all with BeautifulSoup. So I'm going
to start up here: this is the first one, that's the second, that's the third
one, that's the fourth one, and this is the fifth one, which is the one that
has our data in. So we need to use BeautifulSoup to find all the script tags,
index out the fifth one, which would be number four because it's zero-indexed,
and then get the information out. So we're going to do script is equal to
soup.find_all, we're going to look for the script tags, and we're saying that
we need index four. If we now go and print that out and see what we get, you
can see right away that we are in the right place and we are getting back all
of this information.
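
Here's a minimal sketch of that counting approach. The HTML is an invented stand-in with five script tags (the `window.__DATA__` assignment and the file names are assumptions, not what the real page contains), so the fifth tag sits at index 4.

```python
from bs4 import BeautifulSoup

# Made-up page: four uninteresting script tags, then the one holding the data.
html = """
<html><head>
<script src="a.js"></script>
<script>var x = 1;</script>
<script src="b.js"></script>
<script>var y = 2;</script>
<script>
    window.__DATA__ = {"properties": {}};
</script>
</head></html>
"""

soup = BeautifulSoup(html, "html.parser")
scripts = soup.find_all("script")   # every script tag, in document order
script = scripts[4]                 # fifth tag, because indexing starts at 0

print(len(scripts))
print(script)
```

Counting by position is fragile: if the site adds or removes a script tag, the index shifts, so it's worth re-checking the printed output whenever the scrape starts failing.
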
That's great, but the problem is that this here has got a lot of extra data
around it, which means we can't just dump that straight into the json library
in Python and get the information that we want. What I tend to do is copy all
of this, everything inside the script tags, all the way down, and put it into
an online JSON formatter. This is the one I use because it will tell you what
the problems are. If we paste this in, we can see right away it's saying that
there is an error, expecting a string or blah blah. This means that if we try
to load this in as a JSON object in a Python script it will just fail, so we
need to change this string up a bit before we can load it in.

The first thing I can see here is there's a lot of white space at the
beginning. That's fine, we can get rid of that nice and easily. First of all we
need to do .text, sorry, just so we get the text from this, and then we're
going to do .strip. If I make this a bit bigger and come up to the top: .text
is going to remove the script tags, and .strip will remove the leading white
space. So let's run that again. OK, so now we are just down to this, so that's
good. We go back to our formatter and, well, we've got rid of the leading
whitespace, but we're still not quite there yet.
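
To illustrate where things stand at this point, here's a small sketch using an invented script tag (the `window.__DATA__` prefix is an assumption for the example). It shows `.text` and `.strip()` tidying up the string, and that `json.loads` still rejects it because of the extra characters around the JSON.

```python
import json
from bs4 import BeautifulSoup

# Made-up script tag with leading whitespace, a JavaScript prefix and a semicolon.
html = '<script>\n    window.__DATA__ = {"properties": {}};\n</script>'
script = BeautifulSoup(html, "html.parser").find_all("script")[0]

text = script.text.strip()   # .text drops the tags, .strip() the outer whitespace
print(repr(text))

try:
    json.loads(text)
    loaded = True
except json.JSONDecodeError as err:
    loaded = False
    print("Not valid JSON yet:", err)
```
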
What we need to do is basically chop off the beginning and anything at the end
of this string to make it valid for the JSON parser, so we can get that
information. We want to count how many characters, including white space, we
want to get rid of at the front of our string. JSON will always start, if you
look here, with a bracket or something like this, and we can see that the
first thing that we match is the curly bracket here, so we want to get rid of
everything before this bracket. I've just counted this out and I think it's
about 55. So I'm going to go ahead and put an index on the text, sorry, a slice
on the text, and say remove the first 55 characters from this string. You can
see here, 55, that means start 55 characters in. Let's run this again and see
what we get. OK, so I'm not quite there yet, I've still got a few left: 55, and
a bit of white space, 56, 57, 58. So let's go for 58 and go again. That's
great, there's nothing before our leading curly bracket there, so now we know
we can get rid of all of this at the start.

Now if we try and validate again we're getting an error at the end: it should
end with the curly bracket, not this semicolon. Well, that's nice and easy, we
can apply the same method
and we're taking 58 off the front; if we do minus 1, that means we're going to
leave one off at the end, so we're going to come in one from the end. If we run
this again you can see now the semicolon's gone and there's nothing at the
start. So if we come back here, get rid of this semicolon and validate, and
sometimes this doesn't work so we need to copy it, delete it and paste it in,
there we go. This is telling me that now we've cut our string down to this, we
can pass it into the json library in Python and then extract information from
it as we would do normally.

So let's do that now. Let's call it JSON object, actually no, let's call it
data, is equal to json.loads, because we're loading a string into it, and we're
going to pass in script like this. Now if we print our data we should get
exactly this back again. There we go. So now we basically have a JSON object
loaded in and saved into our data variable, and we can go ahead and manipulate
it as we would normally.

If we come back here, we can see that inside our main bracket we've got
properties, which is where we want to be, and then listings, then groups, which
is a list, and then results, and then property. So we need to go all the way
through this first. If I just quickly do that: we want properties, then it was
listings, groups, nope, can't spell listings, groups, and then zero for the
zero index, and then results should give all the results. If we pick the first
one, that is essentially the first one here, and that's all the information in
it. You could go even further and get just the addresses out, or just the
postcode and the price, say. You can create your own data set, or you could
scrape this every day and see if any new properties come up, or something like
that. So that's
how I would go about approaching this. We can use BeautifulSoup to find the
script tag, and we've counted how many script tags down it is because there was
no ID; if there's an ID you can find it that way. Then we're basically just
removing characters from the end and the beginning of the string to make it
into valid JSON, so that we can manipulate it with json.loads and go through it
that way. I always find that the online parsers are really useful, so I'll
leave a link to the one that I use, and I'll also leave a link to a couple of
my other videos where I explain some of the other concepts that we've used in
here that I've probably glossed over really quickly. Hopefully this has been
helpful to you guys. Let me know in the comments any questions or queries, give
it a like if you liked the video, and consider subscribing to my channel.
There's more web scraping content and there is more to come. Cheers guys, bye.
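
As a recap of the slicing and loading steps, here's a minimal sketch on invented data. The `window.__DATA__` prefix, the postcode and the price are all made up; only the key path (properties, listings, groups, results, property) follows what was described above. On the page in the video the front offset came out at 58, whereas here it's simply the length of the made-up prefix.

```python
import json

# Invented stand-in for the stripped script text; values are illustrative only.
PREFIX = "window.__DATA__ = "
text = PREFIX + (
    '{"properties": {"listings": {"groups": [{"results": ['
    '{"property": {"postcode": "AB1 2CD", "price": 250}}'
    ']}]}}};'
)

start = len(PREFIX)        # characters to drop at the front (58 in the video)
trimmed = text[start:-1]   # slice off the prefix and the trailing semicolon

data = json.loads(trimmed)  # loads = "load string"

# Walk down to the list of results, then take the first entry.
results = data["properties"]["listings"]["groups"][0]["results"]
first = results[0]
print(first["property"]["postcode"], first["property"]["price"])
```

Counting characters by hand works but breaks if the prefix ever changes; `text.index("{")` gives the same starting offset here without counting, which may be a more forgiving option.
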