Hello, I am writing this post to share how I created a sitemap for my blog using Python.
I will first explain what a sitemap is, then why I used my own Python script instead of the free web apps available online, and finally end this post with a tutorial on how to create a sitemap for your blog using Python. Okay folks, fasten your seat belts and get ready for takeoff ;)
What is a sitemap?
According to Wikipedia:

A site map (or sitemap) is a list of pages of a web site accessible to crawlers or users. It can be either a document in any form used as a planning tool for Web design, or a Web page that lists the pages on a Web site, typically organized in hierarchical fashion.

Simply put, a sitemap gives you a list of all the links on your website. For example, you can see the sitemap of this blog here: Radiusofcircle
But why do we need a sitemap?
Sitemaps are usually used to improve a website's search engine results, but it is also good practice to maintain a sitemap that is human readable. This gives visitors a friendly way to see all the links and content on your website, so we will create an HTML sitemap as well.
Why should we create our own sitemap?
When I wanted to create a human-readable sitemap, I googled and found this website: XML Sitemaps. They provide a free service, but they only create a sitemap for the first 500 web pages. That was not sufficient in my case; I would have had to pay to cover more than 500 pages, so I had to find another option. I chose Python :)
Can we create a sitemap for your blog using Python?
If you use blogger.com, wordpress.com, tumblr.com, blog.com or wordpress.org to host your website, you can apply the concept we will discuss directly, without any modifications. If you use any other service, you can apply the same concept with some modifications to generate your sitemap.html, provided you already have a generated XML sitemap.
Creating the sitemap
To create the HTML version of our sitemap with Python we will be using a module named xmltodict. This module converts XML into dictionary format, which is very easy to work with.
To get xmltodict open your command prompt/terminal and type the following:
$ pip install xmltodict
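As an aside: if you cannot install third-party packages, the standard library's xml.etree.ElementTree can parse a sitemap too. We will stick with xmltodict in this post, but here is a minimal Python 3 sketch of the stdlib route, using a shortened inline XML string (with a made-up URL) in place of a real downloaded sitemap:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for a real downloaded sitemap.xml (URL is made up)
xml_string = ('<?xml version="1.0" encoding="UTF-8"?>'
              '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
              '<url><loc>http://example.com/post-1.html</loc>'
              '<lastmod>2016-03-02T14:10:39Z</lastmod></url>'
              '</urlset>')

# Sitemap elements live in the sitemaps.org namespace, so we map a prefix to it
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
root = ET.fromstring(xml_string)

# Collect every <loc> value, honouring the namespace
locs = [url.find('sm:loc', ns).text for url in root.findall('sm:url', ns)]
print(locs)
```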
For our practice I have created a dummy blog, which can be found here: rocdummy; the corresponding sitemap can be seen here: rocdummy Sitemap
Before we write a script that generates the HTML sitemap, we will first walk through an interactive session using Python IDLE. Open IDLE and type the following:
Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import urllib2
>>> url = 'http://rocdummy.blogspot.in/sitemap.xml' #Url for the xml site
>>> content = urllib2.urlopen(url)
>>> content.read()
'<?xml version=\'1.0\' encoding=\'UTF-8\'?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-5.html</loc><lastmod>2016-03-02T14:11:25Z</lastmod></url><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-4.html</loc><lastmod>2016-03-02T14:11:17Z</lastmod></url><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-3.html</loc><lastmod>2016-03-02T14:11:05Z</lastmod></url><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-2.html</loc><lastmod>2016-03-02T14:10:54Z</lastmod></url><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-1.html</loc><lastmod>2016-03-02T14:10:39Z</lastmod></url></urlset>'
>>>
In the above code we are importing the urllib2 library, which lets us connect to the internet and download the file. Next we open the URL and fetch the data with urllib2.urlopen, which saves the website data as a file-like object. Finally we read the content using content.read().
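A side note for Python 3 readers: urllib2 was merged into urllib.request, and urlopen there also returns a file-like response. The sketch below uses io.BytesIO as an offline stand-in for that response object, so it runs without network access:

```python
import io

# In Python 3 the real call would be:
#   from urllib.request import urlopen
#   content = urlopen(url)   # a file-like HTTP response
# BytesIO stands in for that response here so the example runs offline.
content = io.BytesIO(b'<?xml version="1.0" encoding="UTF-8"?><urlset></urlset>')

data = content.read()        # like content.read() above: returns the whole body
print(data.decode('utf-8'))
```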
>>> import xmltodict
>>> import json
>>> xml_string = '<?xml version=\'1.0\' encoding=\'UTF-8\'?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-5.html</loc><lastmod>2016-03-02T14:11:25Z</lastmod></url><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-4.html</loc><lastmod>2016-03-02T14:11:17Z</lastmod></url><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-3.html</loc><lastmod>2016-03-02T14:11:05Z</lastmod></url><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-2.html</loc><lastmod>2016-03-02T14:10:54Z</lastmod></url><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-1.html</loc><lastmod>2016-03-02T14:10:39Z</lastmod></url></urlset>'
>>> xml_json = json.dumps(xmltodict.parse(xml_string))
>>> xml_json
'{"urlset": {"@xmlns": "http://www.sitemaps.org/schemas/sitemap/0.9", "url": [{"loc": "http://rocdummy.blogspot.com/2016/03/test-post-5.html", "lastmod": "2016-03-02T14:11:25Z"}, {"loc": "http://rocdummy.blogspot.com/2016/03/test-post-4.html", "lastmod": "2016-03-02T14:11:17Z"}, {"loc": "http://rocdummy.blogspot.com/2016/03/test-post-3.html", "lastmod": "2016-03-02T14:11:05Z"}, {"loc": "http://rocdummy.blogspot.com/2016/03/test-post-2.html", "lastmod": "2016-03-02T14:10:54Z"}, {"loc": "http://rocdummy.blogspot.com/2016/03/test-post-1.html", "lastmod": "2016-03-02T14:10:39Z"}]}}'
>>>
In the above code we have imported xmltodict and json. We assigned the same string that we downloaded with urllib2 to xml_string. Next I used json.dumps, which serializes the parsed object to a JSON-formatted str. Typing xml_json in the interpreter shows what my JSON data looks like. Finally, to convert the data to dict format, we do the following:

>>> json.loads(xml_json)
{u'urlset': {u'@xmlns': u'http://www.sitemaps.org/schemas/sitemap/0.9', u'url': [{u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-5.html', u'lastmod': u'2016-03-02T14:11:25Z'}, {u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-4.html', u'lastmod': u'2016-03-02T14:11:17Z'}, {u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-3.html', u'lastmod': u'2016-03-02T14:11:05Z'}, {u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-2.html', u'lastmod': u'2016-03-02T14:10:54Z'}, {u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-1.html', u'lastmod': u'2016-03-02T14:10:39Z'}]}}
>>>
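The result of json.loads is a plain dict, so ordinary dict and list operations apply. A small Python 3 sketch using a hand-built dict that mirrors the structure above (with shortened, made-up URLs):

```python
# Hand-built stand-in for the dict that json.loads(xml_json) produces
urlset = {
    'urlset': {
        '@xmlns': 'http://www.sitemaps.org/schemas/sitemap/0.9',
        'url': [
            {'loc': 'http://example.com/post-2.html', 'lastmod': '2016-03-02T14:10:54Z'},
            {'loc': 'http://example.com/post-1.html', 'lastmod': '2016-03-02T14:10:39Z'},
        ],
    }
}

# Pull out just the post URLs with a list comprehension
locs = [entry['loc'] for entry in urlset['urlset']['url']]
print(locs)
```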
As you can see, we only have URLs; we still need to find the title of each post, so we will again use urllib2 to get the data (let us assume the URL of the post is 'http://rocdummy.blogspot.com/2016/03/test-post-4.html'):

>>> post_url = 'http://rocdummy.blogspot.com/2016/03/test-post-4.html'
>>> for line in urllib2.urlopen(post_url):
	if '<title>' in line:
		line = line.strip()
		print 'Post title is : '+line[7:-8]

Post title is : Dummy Blog for Radius of Circle sitemap practice: Test post 4
>>>
As you may know, if you want to read a file line by line in a for loop you can write for line in file:, which yields one line at a time. The same concept is used in the above code, because urllib2.urlopen saves the data as a file-like object. We have also used line[7:-8] (string slicing) to remove the <title> at the front and the </title> at the end.

Okay, now that we have seen all the basics required, we will put all of the above code together into a script file that prints a link for every URL in the sitemap (sitemap.py):
1   import xmltodict, urllib2, json
2
3   website_url = 'http://rocdummy.blogspot.in/sitemap.xml'
4
5   content = urllib2.urlopen(website_url)
6
7   #Read all the data from the content
8   website_text = content.read()
9
10
11  #Using the xmltodict to get the dict as string
12  webJson = json.dumps(xmltodict.parse(website_text))
13  urlset = json.loads(webJson)
14
15  urls = urlset['urlset']['url']
16
17  print(len(urls))
18
19  for element in urls:
20      url = element['loc']
21      for line in urllib2.urlopen(url):
22          if '<title>' in line:
23              line = line.strip()
24              print '<a href="'+url+'" >'+line[7:-8]+'</a><br /><br />\n'
Line 1: We have imported all the required modules.

Line 3: We have entered the website URL and stored it in website_url.

Lines 5-8: We have used urllib2 to get the data and stored it in website_text.

Lines 11-13: We are converting the data from XML to dict.
Line 15: If you look at the structure of the data, it looks like this:

{u'urlset': {u'@xmlns': u'http://www.sitemaps.org/schemas/sitemap/0.9', u'url': [{u'lastmod': u'2016-03-02T14:11:25Z', u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-5.html'}, {u'lastmod': u'2016-03-02T14:11:17Z', u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-4.html'}, {u'lastmod': u'2016-03-02T14:11:05Z', u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-3.html'}, {u'lastmod': u'2016-03-02T14:10:54Z', u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-2.html'}, {u'lastmod': u'2016-03-02T14:10:39Z', u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-1.html'}]}}

So to get the URLs we use urlset['urlset']['url'], which outputs the following:

[{u'lastmod': u'2016-03-02T14:11:25Z', u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-5.html'}, {u'lastmod': u'2016-03-02T14:11:17Z', u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-4.html'}, {u'lastmod': u'2016-03-02T14:11:05Z', u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-3.html'}, {u'lastmod': u'2016-03-02T14:10:54Z', u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-2.html'}, {u'lastmod': u'2016-03-02T14:10:39Z', u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-1.html'}]

If we loop through this list, each element gives us the url of a post along with its lastmod data.

Line 19: We are looping through the above list to get the URL of each post.
Line 21: A for loop for getting the title of the post, which we have already seen.

If you run sitemap.py you will get the following output:
5
<a href="http://rocdummy.blogspot.com/2016/03/test-post-5.html" >Dummy Blog for Radius of Circle sitemap practice: Test post 5</a><br /><br />
<a href="http://rocdummy.blogspot.com/2016/03/test-post-4.html" >Dummy Blog for Radius of Circle sitemap practice: Test post 4</a><br /><br />
<a href="http://rocdummy.blogspot.com/2016/03/test-post-3.html" >Dummy Blog for Radius of Circle sitemap practice: Test Post 3</a><br /><br />
<a href="http://rocdummy.blogspot.com/2016/03/test-post-2.html" >Dummy Blog for Radius of Circle sitemap practice: Test post 2</a><br /><br />
<a href="http://rocdummy.blogspot.com/2016/03/test-post-1.html" >Dummy Blog for Radius of Circle sitemap practice: Test Post 1</a><br /><br />
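One caveat about the title extraction: line[7:-8] only works when the whole <title> element sits alone on one line with nothing around it, exactly as Blogger serves it. A slightly more defensive Python 3 sketch (extract_title is a hypothetical helper, not part of the original scripts) locates the tags instead of assuming fixed offsets:

```python
def extract_title(line):
    # Find the tag positions rather than assuming the line starts with <title>
    start = line.find('<title>')
    end = line.find('</title>')
    if start == -1 or end == -1:
        return None  # this line does not contain a complete title element
    return line[start + len('<title>'):end]

print(extract_title('  <title>Test post 4</title>'))
```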
If you know the Python requests module you can skim the following code; otherwise skip to the next section.
If you want to rewrite the same script using the requests module, it will look like this (sitemap_requests.py):

1   import xmltodict, json, requests
2
3   website_url = 'http://rocdummy.blogspot.in/sitemap.xml'
4
5   content = requests.get(website_url)
6
7   #Read all the data from the content
8   website_text = content.text
9
10
11  #Using the xmltodict to get the dict as string
12  webJson = json.dumps(xmltodict.parse(website_text))
13  urlset = json.loads(webJson)
14
15  urls = urlset['urlset']['url']
16
17  print(len(urls))
18
19  for element in urls:
20      url = element['loc']
21
22      r = requests.get(url)
23      for line in r.iter_lines():
24          if '<title>' in line:
25              line = line.strip()
26              print '<a href="'+url+'" >'+line[7:-8]+'</a><br /><br />\n'
The above code works similarly to sitemap.py; the only part I want to explain is r.iter_lines(). After we get the data, which is stored as a response object, we use the .iter_lines method to iterate through each line. As simple as that.

To get the whole code, you can get it from here: Github
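The scripts above print the anchor tags to stdout; to actually produce a sitemap.html you could collect them and write them out once. A Python 3 sketch under stated assumptions (the link list and the output file name are made up for illustration):

```python
# Hypothetical (url, title) pairs standing in for what the scripts extract
links = [
    ('http://example.com/post-1.html', 'Test post 1'),
    ('http://example.com/post-2.html', 'Test post 2'),
]

# Build the same anchor markup the scripts print, then write it to one file
lines = ['<a href="%s" >%s</a><br /><br />' % (url, title) for url, title in links]
html = '\n'.join(lines)

with open('sitemap.html', 'w') as f:
    f.write(html)

print(html)
```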
If you are using Tumblr, the different sitemaps can be found at yourwebsite.tumblr.com/sitemap.xml. As a Tumblr user you will have to add another for loop to loop through each sitemap.xml file.
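The extra loop for a Tumblr-style sitemap index can be sketched as follows. This Python 3 example uses inline XML strings (with made-up URLs) in place of the actual downloads, and the stdlib ElementTree in place of xmltodict, so it runs offline:

```python
import xml.etree.ElementTree as ET

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Stand-in for the top-level sitemap index, which points at child sitemaps
index_xml = ('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
             '<sitemap><loc>http://example.com/sitemap1.xml</loc></sitemap>'
             '<sitemap><loc>http://example.com/sitemap2.xml</loc></sitemap>'
             '</sitemapindex>')

# Stand-ins for the child sitemaps you would download in the inner loop
child_xml = {
    'http://example.com/sitemap1.xml':
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        '<url><loc>http://example.com/post-1.html</loc></url></urlset>',
    'http://example.com/sitemap2.xml':
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        '<url><loc>http://example.com/post-2.html</loc></url></urlset>',
}

all_urls = []
index_root = ET.fromstring(index_xml)
for sitemap in index_root.findall('sm:sitemap', NS):   # outer loop: each child sitemap
    child = ET.fromstring(child_xml[sitemap.find('sm:loc', NS).text])
    for url in child.findall('sm:url', NS):            # inner loop: each post URL
        all_urls.append(url.find('sm:loc', NS).text)

print(all_urls)
```

In the real script, fetching each child sitemap with urllib2 (or requests) would replace the child_xml lookup.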
As always, I have tried to explain everything in this post in a way that is easy to understand, even for beginners. If you haven't understood something or have any doubt, please comment in the comment box below or contact me, and I will try to sort out your problem. You can contact me from here: contact me
If you are using a CMS or service whose sitemap is similar to the one we used in this post, please let me know so that I can update the post to cover it.
Please do comment on how I can improve this post so that everyone is comfortable reading it. Also comment if I have forgotten any topic or made any mistake. I want to improve my writing skills so that I can share my knowledge with a wider group!
Thank you, and have a nice day!
References:
Python read website data line by line when available
Requests Python: iter_lines Documentation
Converting a string to dictionary using Python
Requests official Documentation
Python json module official documentation
Python urllib2 module documentation
Python official documentation
Sitemaps.org
Syntax highlighter used in this post to highlight the code is hilite.me