Getting the reading time of an online article, and more

If someone sends you a link to a page or article, have you ever wanted to know how long the article is before even opening it? Or to get a sense of the server response time, the page load time, or the number of images and paragraphs in that article? Well, I did, so I wrote this solution to get me some basic information about a page without opening it in a browser. In this post, I share that example implementation.

You can enter (type or paste) a web address (of a page or server) and you’ll get information such as the page load time, the word count on that page, the number of paragraphs, and the estimated reading time based on an average reading speed. Additionally, if the page author marked it accordingly, you’ll even get the publish date of the article.

An example output from my script, written in Python, is shown below:

You can run it right on this page by clicking Run on the widget below. You can run it as many times as you like (one web address per run) and get the information each time. You can enter the full URL or leave out the http:// or https:// part (my script will automatically add it as needed). While my code works on both http and https sites, the widget platform below only allows https sites for security reasons. So, be sure to enter a site that is actually served over https://. Otherwise, you’ll get a blank report.

For example, if you enter flyingsalmon.net (without specifying https://), the code will convert it to the full address https://flyingsalmon.net and it’ll work. If you enter https://flyingsalmon.net, it’ll also work, because you specified https:// and flyingsalmon.net is a secure site running on the https protocol. However, if you type http://flyingsalmon.net (without the ‘s’ after http), the backend of the trinket widget will not work. Go ahead and give it a try.
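The address handling described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name normalize_url is hypothetical, and defaulting to https:// when no scheme is given is an assumption based on the flyingsalmon.net example.

```python
def normalize_url(address: str) -> str:
    """Return the address with an https:// scheme if none was given.

    Hypothetical helper: the real script may decide differently.
    """
    address = address.strip()
    # Leave addresses that already name a scheme untouched.
    if address.lower().startswith(("http://", "https://")):
        return address
    # Assumption: default to https when the scheme is missing.
    return "https://" + address


print(normalize_url("flyingsalmon.net"))
print(normalize_url("https://flyingsalmon.net"))
```

Both calls print https://flyingsalmon.net. An address that starts with http:// is passed through unchanged, which is why it would still be rejected by an https-only platform.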

The Plumbing

The code is written in Python and uses the ‘requests’ and ‘bs4’ libraries to fetch the HTML or XML content and scrape it for the desired information. The estimated reading time is calculated from an average reading speed of 200 words per minute. It’s worth noting that not all pages can be scraped successfully: pages behind a paywall, pages the server does not allow to be scraped, content generated dynamically (for example, in real time as the user scrolls), and other non-standard HTML/XML pages may return a status_code such as 403 or yield inaccurate information. Extracting the date of the page or article is even trickier because no real standard is followed: sometimes the date is put inside time tags, sometimes within meta tags, and sometimes it is omitted altogether. My script checks both time and meta tags, and if it doesn’t find a publish time, it’ll simply say: “Publish date is not specified in the page.”
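To make the counting and date lookup more concrete, here is a rough sketch that works on an HTML string you already have. It is not my script: to stay dependency-free it swaps in Python's built-in html.parser instead of bs4 and skips the fetching step entirely. The names PageStats and summarize are made up for illustration, the meta property article:published_time is only one common convention (an assumption, since sites vary), and rounding the reading time up to a whole minute is also an assumption.

```python
import math
from html.parser import HTMLParser

WORDS_PER_MINUTE = 200  # average reading speed quoted in the post


class PageStats(HTMLParser):
    """Collect paragraph/image counts, a word count, and a publish date."""

    SKIP = {"script", "style"}  # text inside these tags is not readable content

    def __init__(self):
        super().__init__()
        self.paragraphs = 0
        self.images = 0
        self.words = 0
        self.publish_date = None
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "p":
            self.paragraphs += 1
        elif tag == "img":
            self.images += 1
        elif tag == "time" and self.publish_date is None:
            # <time datetime="..."> is one place a date may live.
            self.publish_date = attrs.get("datetime")
        elif tag == "meta" and self.publish_date is None:
            # Assumed property name; other sites use other meta names.
            if attrs.get("property") == "article:published_time":
                self.publish_date = attrs.get("content")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Count whitespace-separated words outside script/style blocks.
        if not self._skip_depth:
            self.words += len(data.split())


def summarize(html: str) -> dict:
    """Return basic statistics for an HTML document given as a string."""
    stats = PageStats()
    stats.feed(html)
    stats.close()
    return {
        "words": stats.words,
        "paragraphs": stats.paragraphs,
        "images": stats.images,
        # Assumption: round up so a short page reads as 1 minute, not 0.
        "reading_minutes": math.ceil(stats.words / WORDS_PER_MINUTE),
        "publish_date": stats.publish_date
        or "Publish date is not specified in the page.",
    }


# Made-up sample page, purely for demonstration.
sample = """<html><head>
<meta property="article:published_time" content="2024-01-15">
<style>p { color: black; }</style></head>
<body><p>One two three.</p><img src="a.png"><p>Four five.</p></body></html>"""
print(summarize(sample))
```

Note that this simple word count includes any text outside script and style tags (a page title, for instance), so it will usually run a little higher than the article body alone.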

I hope you found this post helpful and interesting. Explore this site for more tips and articles. Be sure to also check out my Patreon site where you can find free downloads and optional fee-based code and documentation. Thanks for visiting!