
How to generate sitemaps using Python?

Hello, I am writing this post because I want to share how I created a sitemap for my blog using Python.

I will first explain what a sitemap is, then why I used my own Python script instead of the free web apps available online, and finally end this post with a tutorial on how to create a sitemap for your blog using Python. Okay folks, fasten your seat belts and get ready for takeoff ;)

What is a sitemap?

According to Wikipedia:
A site map (or sitemap) is a list of pages of a web site accessible to crawlers or users. It can be either a document in any form used as a planning tool for Web design, or a Web page that lists the pages on a Web site, typically organized in hierarchical fashion.
Put simply, a sitemap gives you a list of all the links on your website. For example, you can see the sitemap of this blog here: Radiusofcircle

But why do we need a sitemap?

Even though sitemaps are usually used to improve a website's search engine results, it is good practice to also maintain a sitemap that is human readable. This gives visitors a friendly way to see all the links and content on your website. So you should also create a sitemap in HTML.

Why should we create our own sitemap?

When I wanted to create a human readable sitemap, I googled and found this website: XML Sitemaps. Even though they provide a free service, they only create a sitemap for the first 500 web pages. That was not sufficient in my case, and I would have had to pay to get a sitemap for more than 500 pages, so I had to find another option. I chose Python :)

Can you create a sitemap for your blog using Python?

If you are using blogger.com, wordpress.com, tumblr.com, blog.com or wordpress.org to host/create your website, then you can apply the concept we will discuss directly, without any modifications. If you are using any other service, you can apply the same concept with some modifications to generate your sitemap.html, provided you already have a generated XML sitemap.

Creating the sitemap 

To create the HTML version of our sitemap with Python we will be using a module named xmltodict. This module converts your XML into a dictionary format, which is very easy to work with.

I am assuming that you already have Python installed on your computer.

To get xmltodict, open your command prompt/terminal and type the following:

$ pip install xmltodict
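If you want to check that the install worked, you can parse a tiny XML snippet in the interpreter. This is just a quick sanity check (the example.com url below is only a placeholder):

>>> import xmltodict
>>> doc = xmltodict.parse('<urlset><url><loc>http://example.com/</loc></url></urlset>')
>>> doc['urlset']['url']['loc']
u'http://example.com/'
>>>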

For our practice I have created a dummy blog, which can be found here: rocdummy, and the corresponding sitemap can be seen here: rocdummy Sitemap

Now before we create a script that will generate the HTML sitemap, we will first go through an interactive session using Python IDLE. Open your IDLE and type the following:

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import urllib2
>>> url = 'http://rocdummy.blogspot.in/sitemap.xml' #Url for the xml site
>>> content = urllib2.urlopen(url)
>>> content.read()
'<?xml version=\'1.0\' encoding=\'UTF-8\'?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-5.html</loc><lastmod>2016-03-02T14:11:25Z</lastmod></url><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-4.html</loc><lastmod>2016-03-02T14:11:17Z</lastmod></url><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-3.html</loc><lastmod>2016-03-02T14:11:05Z</lastmod></url><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-2.html</loc><lastmod>2016-03-02T14:10:54Z</lastmod></url><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-1.html</loc><lastmod>2016-03-02T14:10:39Z</lastmod></url></urlset>'
>>> 
In the above code we import the urllib2 library to connect to the internet and download the file. Next we open the url and fetch the data using urllib2.urlopen. This returns the website data as a file-like object, which we then read using content.read().

>>> import xmltodict
>>> import json
>>> xml_string = '<?xml version=\'1.0\' encoding=\'UTF-8\'?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-5.html</loc><lastmod>2016-03-02T14:11:25Z</lastmod></url><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-4.html</loc><lastmod>2016-03-02T14:11:17Z</lastmod></url><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-3.html</loc><lastmod>2016-03-02T14:11:05Z</lastmod></url><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-2.html</loc><lastmod>2016-03-02T14:10:54Z</lastmod></url><url><loc>http://rocdummy.blogspot.com/2016/03/test-post-1.html</loc><lastmod>2016-03-02T14:10:39Z</lastmod></url></urlset>'
>>> xml_json = json.dumps(xmltodict.parse(xml_string))
>>> xml_json
'{"urlset": {"@xmlns": "http://www.sitemaps.org/schemas/sitemap/0.9", "url": [{"loc": "http://rocdummy.blogspot.com/2016/03/test-post-5.html", "lastmod": "2016-03-02T14:11:25Z"}, {"loc": "http://rocdummy.blogspot.com/2016/03/test-post-4.html", "lastmod": "2016-03-02T14:11:17Z"}, {"loc": "http://rocdummy.blogspot.com/2016/03/test-post-3.html", "lastmod": "2016-03-02T14:11:05Z"}, {"loc": "http://rocdummy.blogspot.com/2016/03/test-post-2.html", "lastmod": "2016-03-02T14:10:54Z"}, {"loc": "http://rocdummy.blogspot.com/2016/03/test-post-1.html", "lastmod": "2016-03-02T14:10:39Z"}]}}'
>>> 
In the above code we have imported xmltodict and json. We have copied the same string that we downloaded using urllib2 into xml_string.

Next I have used json.dumps, which serializes the object to a JSON formatted str. Typing xml_json in the interpreter shows what my JSON data looks like.

Finally, to convert the data to dict format we do the following:

>>> json.loads(xml_json)
{u'urlset': {u'@xmlns': u'http://www.sitemaps.org/schemas/sitemap/0.9', u'url': [{u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-5.html', u'lastmod': u'2016-03-02T14:11:25Z'}, {u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-4.html', u'lastmod': u'2016-03-02T14:11:17Z'}, {u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-3.html', u'lastmod': u'2016-03-02T14:11:05Z'}, {u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-2.html', u'lastmod': u'2016-03-02T14:10:54Z'}, {u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-1.html', u'lastmod': u'2016-03-02T14:10:39Z'}]}}
>>> 
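As a side note, xmltodict.parse already returns a dictionary-like object (an OrderedDict), so the round trip through json.dumps and json.loads is mainly there to show the structure as plain JSON. If you prefer, you could access the urls directly from the parsed result, something like this:

>>> parsed = xmltodict.parse(xml_string)
>>> parsed['urlset']['url'][0]['loc']
u'http://rocdummy.blogspot.com/2016/03/test-post-5.html'
>>>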
As you can see, we only have the urls, but we also need the title of each post, so we will again use urllib2 to get the data
(let us take 'http://rocdummy.blogspot.com/2016/03/test-post-4.html' as the post url):

>>> post_url = 'http://rocdummy.blogspot.com/2016/03/test-post-4.html'
>>> for line in urllib2.urlopen(post_url):
 if '<title>' in line:
  line = line.strip()
  print 'Post title is : '+line[7:-8]

  
Post title is : Dummy Blog for Radius of Circle sitemap practice: Test post 4
>>> 
As we know, if you want to read a file line by line in a for loop you can use for line in file:, which reads the lines one at a time. The same concept has been used in the above code, because urllib2.urlopen returns the data as a file-like object.

We have also used line[7:-8] (string slicing) to remove the <title> at the front and the </title> at the end.
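The slicing approach assumes that the whole <title>...</title> tag sits alone on a single line, which happens to be the case for these Blogger pages. If you want something a little more forgiving, a regular expression is one possible alternative (a sketch, not the approach used in the rest of this post):

>>> import re
>>> html = urllib2.urlopen(post_url).read()
>>> title = re.search(r'<title>(.*?)</title>', html, re.DOTALL).group(1).strip()
>>> print 'Post title is : ' + title
Post title is : Dummy Blog for Radius of Circle sitemap practice: Test post 4
>>>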

Okay, now that we have seen all the basics required, we will put all of the above code together into a script file that turns every link in the sitemap into an anchor tag with the post title (sitemap.py):

import xmltodict, urllib2, json

website_url = 'http://rocdummy.blogspot.in/sitemap.xml'

content = urllib2.urlopen(website_url)

#Read all the data from the content
website_text = content.read()

#Parse the xml with xmltodict and serialize it to a JSON string
webJson = json.dumps(xmltodict.parse(website_text))

urlset = json.loads(webJson)

urls = urlset['urlset']['url']

print(len(urls))

for element in urls:
    url = element['loc']
    for line in urllib2.urlopen(url):
        if '<title>' in line:
            line = line.strip()
            print '<a href="'+url+'" >'+line[7:-8]+'</a><br /><br />\n'
Line 1: We import all the required modules.
Line 3: We store the website url in website_url.
Line 5-8: We use urllib2 to get the data and store it in website_text.
Line 11-13: We convert the data from XML into a dict.
Line 15: If you look at the structure of the data, it looks like this:


{u'urlset': {u'@xmlns': u'http://www.sitemaps.org/schemas/sitemap/0.9',
             u'url': [{u'lastmod': u'2016-03-02T14:11:25Z',
                       u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-5.html'},
                      {u'lastmod': u'2016-03-02T14:11:17Z',
                       u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-4.html'},
                      {u'lastmod': u'2016-03-02T14:11:05Z',
                       u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-3.html'},
                      {u'lastmod': u'2016-03-02T14:10:54Z',
                       u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-2.html'},
                      {u'lastmod': u'2016-03-02T14:10:39Z',
                       u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-1.html'}]}}
So to get the urls we have to use urlset['urlset']['url'], which gives the following:

[{u'lastmod': u'2016-03-02T14:11:25Z',
                       u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-5.html'},
                      {u'lastmod': u'2016-03-02T14:11:17Z',
                       u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-4.html'},
                      {u'lastmod': u'2016-03-02T14:11:05Z',
                       u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-3.html'},
                      {u'lastmod': u'2016-03-02T14:10:54Z',
                       u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-2.html'},
                      {u'lastmod': u'2016-03-02T14:10:39Z',
                       u'loc': u'http://rocdummy.blogspot.com/2016/03/test-post-1.html'}]
So if we loop through each element, we can get the url of each post along with the lastmod data.
Line 19: We loop through the above list to get the url of each post.
Line 21: A for loop for getting the title of the post, which we have already seen.

If you run sitemap.py then you will get the following output:

5
<a href="http://rocdummy.blogspot.com/2016/03/test-post-5.html" >Dummy Blog for Radius of Circle sitemap practice: Test post 5</a><br /><br />

<a href="http://rocdummy.blogspot.com/2016/03/test-post-4.html" >Dummy Blog for Radius of Circle sitemap practice: Test post 4</a><br /><br />

<a href="http://rocdummy.blogspot.com/2016/03/test-post-3.html" >Dummy Blog for Radius of Circle sitemap practice: Test Post 3</a><br /><br />

<a href="http://rocdummy.blogspot.com/2016/03/test-post-2.html" >Dummy Blog for Radius of Circle sitemap practice: Test post 2</a><br /><br />

<a href="http://rocdummy.blogspot.com/2016/03/test-post-1.html" >Dummy Blog for Radius of Circle sitemap practice: Test Post 1</a><br /><br />

If you know the Python requests module you can skim the following code; otherwise skip to the next section.

If you want to rewrite the same script using the requests module, it will look like this (sitemap_requests.py):

import xmltodict, json, requests

website_url = 'http://rocdummy.blogspot.in/sitemap.xml'

content = requests.get(website_url)

#Read all the data from the content
website_text = content.text

#Parse the xml with xmltodict and serialize it to a JSON string
webJson = json.dumps(xmltodict.parse(website_text))

urlset = json.loads(webJson)


urls = urlset['urlset']['url']

print(len(urls))
for element in urls:
    url = element['loc']
    r = requests.get(url)
    for line in r.iter_lines():
        if '<title>' in line:
            line = line.strip()
            print '<a href="'+url+'" >'+line[7:-8]+'</a><br /><br />\n'
The above code works similarly to sitemap.py, but I just want to explain r.iter_lines(). After we have got the data, which is stored as a Response object, we use the .iter_lines method to iterate through it one line at a time. As simple as that.
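One small thing worth adding when you use requests is a status check, so the script stops with a clear error if a post could not be fetched. raise_for_status is part of the requests API; the snippet below is just a sketch of where it would go inside the loop:

r = requests.get(url)
r.raise_for_status()   #raises an HTTPError for 4xx/5xx responses
for line in r.iter_lines():
    if '<title>' in line:
        line = line.strip()
        print '<a href="'+url+'" >'+line[7:-8]+'</a><br /><br />\n'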

You can get the whole code from here: Github

If you are using tumblr, then the sitemaps can be found at yourwebsite.tumblr.com/sitemap.xml. As a tumblr user you will have to add another for loop to go through each and every child sitemap.xml file, as sketched below.
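In that case the top level sitemap.xml is usually a sitemap index, where each <sitemap><loc> entry points to a child sitemap with its own <urlset>. A rough sketch of that extra loop, assuming the index follows the standard sitemapindex format from sitemaps.org (replace yourwebsite with your own blog name):

import xmltodict, urllib2

index_url = 'http://yourwebsite.tumblr.com/sitemap.xml'
index = xmltodict.parse(urllib2.urlopen(index_url).read())

#Each entry in the index points to a child sitemap
sitemaps = index['sitemapindex']['sitemap']
if not isinstance(sitemaps, list):   #a single child sitemap is not wrapped in a list
    sitemaps = [sitemaps]

for sitemap in sitemaps:
    child = xmltodict.parse(urllib2.urlopen(sitemap['loc']).read())
    urls = child['urlset']['url']
    if not isinstance(urls, list):   #same caveat for a sitemap with a single url
        urls = [urls]
    for element in urls:
        print element['loc']   #from here on, fetching the titles works as before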

As always, I have tried to explain everything in this post in a way that is easy to understand even for beginners. If you haven't understood something or have any doubt, please comment in the comment box below or contact me, and I will try to sort out your problem. You can contact me from here: contact me

If you are using a CMS or service that has sitemaps similar to the one we are using in this post, please do let me know so that I can update the post with all the possibilities.

Please do comment on how I can improve this post so that everyone is comfortable reading it. Also comment if I have forgotten any topic or made any mistake. I want to improve my writing skills so that I can share my knowledge with a wider group!

Thank you, Have a nice day

References:
Python read website data line by line when available
Requests Python: iter_lines documentation
Converting a string to dictionary using Python
Requests official documentation
json Python official documentation
urllib2 Python docs
Python official documentation
Sitemaps.org
The syntax highlighter used in this post to highlight the code is hilite.me
