How to create your own SEO tool
A digital product of any type is great to have because you can sell it with very little overhead: usually there are no delivery costs, no production costs, and no warehouse.
Implementing your own idea, SEO or otherwise, is a process with many hardships, but it can bring immense satisfaction.
So how do you create your own SEO tool?
Types of SEO tools
“SEO tool” is a broad term, and there are many niches you can get into, each with its pros and cons. This article outlines the most typical niches and summarizes how to build each tool:
- Backlinks explorer tool.
- Keyword explorer tool.
- Rank checker tool.
- On-page SEO Audit tool.
How do SEO tools work?
Before going into the details of how to create each SEO tool, it’s crucial to understand how SEO tools work. Any SEO tool has multiple stages:
- Acquire the data from various sources: Public internet data, clickstream, 3rd party data, SERP.
- Store it in a database: MySQL, PostgreSQL, Redshift, or a custom database.
- Process the data using specific algorithms to get insights from it, such as the domain rank or which keywords a site is ranking for.
- Show the results to the end-user via a GUI, or provide them via an SEO tools API.
How to create backlinks explorer tool
In this section, we will go over what a backlinks explorer tool does, which parts are needed to develop such a tool, and how to develop each part.
What is a backlink explorer tool?
A backlinks explorer is a tool that allows the end-user to:
- Learn who links to his site, and how
- Learn where the competitors are getting links from
- Learn where sites link out to
- Help in the process of domain vetting for:
  - Purchasing live domains
  - Purchasing expired domains
  - Deciding where to do guest posting
How does the backlink explorer tool work?
Three parts make up the backlinks explorer tool: a crawler, a database, and a GUI.
A crawler is software that retrieves the content of public websites. It can be small, for small projects with a low number of webpages, or enterprise-grade, built to crawl the entire Internet like Google’s crawler, which is the largest that exists.
Once the crawler has the links, it stores them in a database, which allows fast retrieval of the data. The size and type of the database depend on the amount of data crawled: for a small project a single server can do, but Internet-wide crawls need a distributed database, most likely with in-house customizations.
A backlinks explorer tool is mostly sold as SAAS (Software As A Service), which means the GUI is web-based. The GUI should be intuitive, as there are many data points and end-users should quickly understand how to access the data they’re looking for.
How to create backlinks explorer tool
How to create backlinks crawler
There are a number of open-source crawlers available, written in different languages. When we started with our crawler, we had two programmers build a Python-based crawler, and it was able to crawl ten pages per second on a VPS with one core, taking 100% CPU.
We then had someone build a C-based crawler, and it was able to crawl 100 pages per second on one core, taking 1-3% CPU.
So the language you choose for the crawler affects the hardware you need; it’s a tradeoff between an expensive programmer and expensive hardware.
The essential traits of a crawler are:
- Fast and light, use as little CPU as possible per downloadable page.
- Honours robots.txt, to avoid abuse reports.
- Able to parse sitemaps to get new data.
- If it’s enterprise-grade, then it should be distributable.
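As an illustration of the robots.txt trait, here is a minimal Python sketch that filters candidate URLs through the standard library’s robots.txt parser. The user agent name and the sample robots.txt content are made up for the example, and a real crawler would download and cache the file once per host.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBacklinkBot"  # hypothetical crawler name


def build_robots_parser(robots_txt):
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls):
    """Keep only the URLs that robots.txt allows our user agent to fetch."""
    return [url for url in urls if parser.can_fetch(USER_AGENT, url)]


# Sample robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = build_robots_parser(ROBOTS_TXT)
candidates = [
    "https://example.com/blog/post-1",
    "https://example.com/private/report",
]
print(allowed_urls(parser, candidates))
```

Only the first URL survives the filter, because the second one falls under the disallowed /private/ path.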
Building a crawler with Python
Python is an easy-to-learn language with a rich ecosystem of libraries; it cost us $100 to have a freelancer build a basic crawler with Python and Scrapy. Python also has many libraries to parse the DOM and extract only the essential elements, such as the page title, page description, and links.
Development is fast and not too expensive; the downside is the need for more hardware.
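To give a feel for the parsing step, here is a small sketch that pulls the page title, meta description, and links out of an HTML string. It uses only Python’s standard library rather than Scrapy so that it runs without extra installs; the class name and the sample HTML are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collect the title, meta description and outgoing links of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.description = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


SAMPLE_HTML = """
<html><head>
<title>Example page</title>
<meta name="description" content="A short summary.">
</head><body>
<a href="/about">About</a>
<a href="https://other.example/page">Other</a>
</body></html>
"""

page = PageParser("https://example.com/")
page.feed(SAMPLE_HTML)
print(page.title, page.description, page.links)
```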
Building a crawler with Libcurl
Libcurl is the leading development library for HTTP/S communications, and it also supports asynchronous connections. Using it for single HTTP/S requests is straightforward; writing a crawler with it is more complex, but doable.
The preferred language would be C or C++, and you will also need libraries to parse the HTML and extract the links and other information from the page.
Build it all yourself
Our experience shows that core technologies that need to be highly customized and built for speed are better written yourself. This assumes you have a very experienced programmer who is up to the task, as we have in house.
We developed our own crawler, written in C with BSD sockets and OpenSSL, along with our own HTTP and HTML parsers. Writing your own crawler allows for the best performance, at the cost of extra development time.
Bandwidth cost is also an essential factor. Some hosts, like OVH, provide unmetered bandwidth based on the port speed (100 Mbps, 1 Gbps).
Other hosts, like Amazon, charge for the actual bandwidth used, and you may end up paying more for the bandwidth than for the hardware.
Because of bandwidth costs, you might need a host that requires more knowledge and expertise but offers cheap bandwidth, rather than an easy-to-use one such as Amazon.
How to build a database for backlinks
Once you have the data, it’s time to store it. How much data do you plan to have? Ten billion backlinks would take 4-5 terabytes of database storage; for comparison, the competitors (Ahrefs, Semrush) offer databases of over 1 trillion backlinks.
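To make those numbers easier to reason about, the short calculation below converts them into bytes per backlink and then scales up to a trillion backlinks. It assumes decimal terabytes and that the per-backlink size stays the same at a larger scale, which is a simplification.

```python
TB = 10**12  # decimal terabyte; an assumption about the unit used above


def bytes_per_backlink(total_bytes, backlinks):
    """Average storage used by one backlink row."""
    return total_bytes / backlinks


def storage_tb(backlinks, bytes_each):
    """Storage in terabytes for a given number of backlinks."""
    return backlinks * bytes_each / TB


ten_billion = 10 * 10**9
low = bytes_per_backlink(4 * TB, ten_billion)   # 400.0 bytes
high = bytes_per_backlink(5 * TB, ten_billion)  # 500.0 bytes
print(low, high)

one_trillion = 10**12
# Same per-backlink size at competitor scale: 400.0 to 500.0 terabytes.
print(storage_tb(one_trillion, low), storage_tb(one_trillion, high))
```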
The issue with open-source databases (MySQL, for example) is that, in our experience, insert speed degrades sharply once storage passes 50GB. A backlinks database needs a sustained insert speed; for our needs it was about 50,000 inserts per second. After 50GB of storage the speed started to go down, and at 200GB it was so slow that it was unusable.
An Amazon Redshift database can easily be scaled to large storage sizes and shouldn’t have the same problem as the open-source solution, but 10 terabytes of space would cost $30,000 monthly on SSD storage.
Some hosting providers offer clustered MySQL or PostgreSQL out of the box. A clustered solution can get around the insert speed issue if each node is limited to 200GB of storage.
The problem is that the clustered solution costs $10,000-20,000 per month for ten terabytes of storage, depending on the hosting provider.
There are commercial databases that can handle this load, like Oracle Database. We don’t know what the price is; it is an enterprise product and may not be suitable for startups.
Percona MySQL database
Percona is a company that specializes in database optimization, and they have custom versions of MySQL, MongoDB and PostgreSQL. We wrote a long article on Quora about the database ordeal, and someone asked if we had tried their TokuDB engine. We hadn’t, so it might help, but we never tested it.
Custom database solution
We decided that paying $10,000-20,000 per month for a database was too much. We hadn’t made a single dime yet and we don’t have VC money, so that amount would put great pressure on our finances.
We decided to develop a custom database that can handle the insert and select load. We can host it for $350 monthly, but it took us two years to build, so we traded time for money.
You can read our article about which database we use (hint: we built it); it covers in more detail the limitations of current databases and how we solved them.
Integration and GUI for backlinks tools
Now that you have the database and crawler, it’s time to integrate them and build a friendly GUI.
For web SAAS, people use several languages, such as PHP and NodeJS. We use PHP and a custom HTML theme; we cover more about GUI development later in this article.
How to create keywords explorer tool?
What is a keywords explorer tool?
A keywords explorer tool allows the end-user to gather keyword ranking information about his website and its competitors:
- Which keywords does his site rank for, and in which position?
- Which keywords do the competitors rank for?
- Keywords opportunities and ranking difficulty.
- Keywords search volume.
- Keywords PPC costs.
- Long-tail keywords.
- Similar keywords.
Why should someone use a keywords explorer tool?
A keyword explorer tool is used mostly for SEO and PPC:
- Research which keywords to focus on, when writing articles.
- Discover easy to rank keywords.
- Discover cheap PPC keywords.
How does a keywords tool work?
Keywords explorer tool has several components:
Keywords explorer scraper
The scraper is the component that gets the data from the search engines (mostly Google, but some products support more engines), parses it, and makes it ready for storage.
Keywords’ search volume data
Keyword search volume data is static data that is displayed next to the keywords, and it might also contain keyword ranking trends.
Keywords’ PPC data
Keyword PPC data is static data that is displayed next to the keywords, and it shows the average cost for Google AdWords PPC.
Keywords explorer database
The database stores the data in the previous sections:
- SERP results.
- Keywords’ search volume data and trends.
- Keywords’ PPC data.
GUI for keywords explorer
The answer is almost the same as for the backlinks explorer tool. The exception is that a keywords explorer is offered only as SAAS and not as local software, so the GUI is web-based.
How to create keywords explorer tool
How to write a scraper
Type of IPs
A scraper gets the search engine results. Google and the other search engines don’t like it, so they employ reCAPTCHA to stop the scraping. Even if someone uses anti-reCAPTCHA methods, Google will eventually block that IP anyway.
There are various methods to scrape Google successfully:
Datacenter IPs are easy to come by: you can spawn a VPS starting from $2 on low-cost providers, up to $5 on a leading provider.
The problem is that Google knows that too; it will display the captcha after about 100 requests.
There can be several solutions to this problem:
- Replace the IP every time it is blocked. Most VPS providers have an API to do so; check the provider’s TOS first to see that this is allowed.
- Make a request every x seconds, where x depends on the number of results you ask for: the more results, the bigger the x. The problem here is that you must maintain a large number of proxies.
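The two approaches above can be combined in a small scheduler. The sketch below is one possible design, not the way any particular product does it: it hands out proxies so that each one is used at most once every `delay` seconds, and it lets you drop a proxy once it is blocked. The class name is made up, and the addresses come from a range reserved for documentation.

```python
import heapq


class ProxyScheduler:
    """Hand out proxies so each is used at most once every `delay` seconds."""

    def __init__(self, proxies, delay):
        self.delay = delay
        # Heap of (time the proxy may be used again, proxy address).
        self._heap = [(0.0, proxy) for proxy in proxies]
        heapq.heapify(self._heap)

    def next_proxy(self, now):
        """Return (proxy, seconds to wait before sending the request)."""
        ready_at, proxy = heapq.heappop(self._heap)
        wait = max(0.0, ready_at - now)
        # The proxy becomes available again `delay` seconds after its use.
        heapq.heappush(self._heap, (now + wait + self.delay, proxy))
        return proxy, wait

    def drop(self, proxy):
        """Remove a proxy, for example after the search engine blocked it."""
        self._heap = [(t, p) for t, p in self._heap if p != proxy]
        heapq.heapify(self._heap)


scheduler = ProxyScheduler(["192.0.2.1", "192.0.2.2"], delay=10)
print(scheduler.next_proxy(now=0.0))  # ('192.0.2.1', 0.0)
print(scheduler.next_proxy(now=0.0))  # ('192.0.2.2', 0.0)
print(scheduler.next_proxy(now=1.0))  # ('192.0.2.1', 9.0)
```

In real use, `now` would come from `time.monotonic()` and the caller would sleep for the returned wait before sending the request through that proxy. The more proxies in the pool, the shorter the waits, which is why this approach needs a large number of them.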
Using residential IP
Since Google knows the request came from a residential IP, it will not block the requests as long as you wait a few seconds between them. The problem is how to get such IPs.
If you live in the place you plan to scrape results for, you can install several connections and use them, but this will not scale well for an extensive keyword database.
Another solution is to use a residential IP provider like Luminati. The problem is the price: they charge per bandwidth, which makes the operation very expensive.
Which language to develop the scraper?
Unlike the backlinks crawler, the scraper doesn’t have to be very efficient, so Python or even PHP can be the right choice.
If there’s a need to render JS, then using headless Chrome or headless Firefox will be preferred.
Where to get search volume data?
Clickstream data is collected from real users, and it contains every URL the users visited and every search phrase they searched.
The search volume of every keyword can be estimated by extrapolating from the number of panellists that are part of the clickstream data in a specific country.
This data gives only an estimate of the search volume, because the panel is just a part of the actual users searching. The way the data was collected also affects the results.
When an antivirus vendor provides the clickstream data (as was the case with Avast), the data represents the behavior of end-users who installed Avast, which may differ from that of users who wouldn’t install it.
If the budget allows, purchasing clickstream data from multiple sources will allow for better estimation of search volume.
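As a rough illustration of the extrapolation, the sketch below scales the searches seen in a panel up to a whole country and then averages the estimates from two sources. Both the linear scaling and the plain average are simplifying assumptions for the example, and all the numbers are invented; real providers likely weight their panels in more elaborate ways.

```python
def estimate_search_volume(panel_searches, panel_size, internet_users):
    """Scale searches observed in a panel up to the whole population."""
    if panel_size <= 0:
        raise ValueError("panel_size must be positive")
    return panel_searches * internet_users / panel_size


def combine_estimates(estimates):
    """Plain average of estimates coming from different clickstream sources."""
    return sum(estimates) / len(estimates)


INTERNET_USERS = 250_000_000  # made-up figure for one country

# Source A: 120 monthly searches among 1,000,000 panellists.
source_a = estimate_search_volume(120, 1_000_000, INTERNET_USERS)
# Source B: 45 monthly searches among 500,000 panellists.
source_b = estimate_search_volume(45, 500_000, INTERNET_USERS)

print(source_a, source_b)                       # 30000.0 22500.0
print(combine_estimates([source_a, source_b]))  # 26250.0
```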
Search volume from API
Some companies, such as ours, sell search volume metrics through an API. There are two types of commercial API usage:
Internal API usage
Internal usage means the data is not displayed publicly: it is used for the company’s internal needs or to show private reports to SEO clients.
This usage should be cheaper than white label usage.
Whitelabel API usage
Whitelabel usage applies when you develop a public app or SAAS and want to show 3rd party data. The best example would be Neil Patel’s Ubersuggest; he publicly revealed his monthly spending on the tool, part of which goes to development and part to the data.
Obviously, this usage is more expensive, and most companies will perform due diligence before providing such API functionality.
Google offers an API that provides PPC price information and search volume data, which is all the data available via their Keyword Planner tool. They provide the API after a vetting process, and it can only be used for specific goals.
Database for keyword explorer
Unlike the backlinks explorer, the data is not too big, and the insert speed requirement is low. One or two instances of MySQL would be enough. The challenges here are sorting by search volume, searching for similar keywords, and searching for similar phrases, each covered below.
How to sort big data
A site like eBay ranks for millions of keywords in the SERP. When showing eBay’s results (or any other site’s), they are sorted by search volume.
It doesn’t matter whether the application displays five results or a hundred: the database has to fetch all the results (millions) and sort them before providing an answer.
The speed depends on the performance of the database’s host machine and the number of jobs the database is currently running, but even on an idle server this takes at least a few seconds, which can make the site look slow.
The way we solved it is to precompute the sorting. We developed a multithreaded C++ program to precalculate sorted results, which means we have a table that holds all the keywords for each site, already sorted by search volume.
With this method, fetching the results is almost instantaneous.
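The idea can be sketched with Python and an in-memory SQLite database. This is only an illustration of the precomputation pattern, not our C++ program: the table names, column names, and sample rows are invented. An offline job sorts once and stores each keyword’s position, so the page query becomes a simple range lookup on the primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rankings (domain TEXT, keyword TEXT, volume INTEGER)")
conn.execute(
    """CREATE TABLE rankings_sorted (
           domain TEXT, position INTEGER, keyword TEXT, volume INTEGER,
           PRIMARY KEY (domain, position))"""
)


def precompute(conn):
    """Offline job: sort once and store every keyword's position per domain."""
    conn.execute("DELETE FROM rankings_sorted")
    rows = conn.execute(
        "SELECT domain, keyword, volume FROM rankings "
        "ORDER BY domain, volume DESC, keyword"
    ).fetchall()
    current, position, out = None, 0, []
    for domain, keyword, volume in rows:
        if domain != current:
            current, position = domain, 0
        position += 1
        out.append((domain, position, keyword, volume))
    conn.executemany("INSERT INTO rankings_sorted VALUES (?, ?, ?, ?)", out)
    conn.commit()


def top_keywords(conn, domain, limit):
    """Online query: read the first `limit` positions, no sorting of raw data."""
    return conn.execute(
        "SELECT keyword, volume FROM rankings_sorted "
        "WHERE domain = ? AND position <= ? ORDER BY position",
        (domain, limit),
    ).fetchall()


conn.executemany(
    "INSERT INTO rankings VALUES (?, ?, ?)",
    [
        ("shop.example", "laptop", 5000),
        ("shop.example", "phone case", 12000),
        ("shop.example", "usb cable", 800),
        ("blog.example", "seo tips", 300),
    ],
)
precompute(conn)
print(top_keywords(conn, "shop.example", 2))
# [('phone case', 12000), ('laptop', 5000)]
```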
How to sort similar keywords
Similar keywords are keywords that contain a seed keyword. For example, the keyword “seo services” has the following similar keywords: professional seo services, affordable seo services, white label seo services reviews.
We have 270 million keywords in our primary database alone, so a similar-keyword search can be time and resource consuming even with full-text indexes.
We solved it with a C++ program (again) that uses the Aho-Corasick algorithm to precalculate those keyword lists, which we then store in a table.
Aho-Corasick is an algorithm for fast multi-pattern string search; in one pass over a string, it finds all the keywords from our dictionary that appear in it.
Even with that algorithm in place, it takes the program up to 48 hours to go over all the keywords; think how long it would take to do it at the database level.
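For readers who have not seen the algorithm before, here is a compact Python sketch of it. It is an illustration only and not our C++ code: the seeds act as the dictionary, and every keyword is scanned once to find all the seeds it contains. Matching is done per character, so a seed can also match inside a longer word; a production version would probably match on whole words.

```python
from collections import deque


class AhoCorasick:
    """Minimal Aho-Corasick automaton over characters."""

    def __init__(self, patterns):
        self.goto = [{}]   # outgoing edges of each trie node
        self.fail = [0]    # failure link of each node
        self.out = [[]]    # patterns that end at each node
        for pattern in patterns:
            self._add(pattern)
        self._build_failure_links()

    def _add(self, pattern):
        node = 0
        for ch in pattern:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = nxt
            node = nxt
        self.out[node].append(pattern)

    def _build_failure_links(self):
        # Breadth-first walk; depth-1 nodes keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit matches that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def find_all(self, text):
        """Return the set of patterns that occur anywhere in `text`."""
        found, node = set(), 0
        for ch in text:
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            found.update(self.out[node])
        return found


def similar_keywords(seeds, keywords):
    """Map each seed to the keywords that contain it."""
    automaton = AhoCorasick(seeds)
    table = {seed: [] for seed in seeds}
    for keyword in keywords:
        for seed in automaton.find_all(keyword):
            if seed != keyword:
                table[seed].append(keyword)
    return table


table = similar_keywords(
    ["seo services", "pdf editor"],
    [
        "professional seo services",
        "affordable seo services",
        "free pdf editor mac download",
        "link building",
    ],
)
print(table["seo services"])
# ['professional seo services', 'affordable seo services']
```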
How to sort similar phrases
Similar phrases are phrases that contain the words of the seed keyword, just not in the same order. For example, the seed keyword “free mac pdf editor” has the following similar phrases: free pdf editor for mac without watermark, adobe pdf editor mac free, free pdf editor mac download, pdf editor mac free online software.
I’m not sure it’s even possible to design a database query to fetch such a list. The way we have done it is to use the same program as before, but this time we take each keyword, build all the permutations of its words, and then search for those permutations.
When we first ran the program, it was “stuck”. Looking deeper, we found the problem: a phrase with 12 words generates 12! = 479,001,600 permutations, which take two days just to search.
We had to limit the phrase building to 9 words, which generates “only” 9! = 362,880 permutations. We put this on a server with 40 logical threads, and it takes a month to go over all the keywords.
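The growth is factorial, which is easy to check in a few lines of Python. The snippet below reproduces the two counts from the text and shows what building the permutations looks like for a short three-word phrase picked for the example.

```python
import math
from itertools import permutations


def orderings(word_count):
    """Number of ways to order the words of a phrase."""
    return math.factorial(word_count)


print(orderings(12))  # 479001600
print(orderings(9))   # 362880
# Going from 9 to 12 words multiplies the work by 10 * 11 * 12.
print(orderings(12) // orderings(9))  # 1320

words = "free mac pdf".split()
phrases = [" ".join(p) for p in permutations(words)]
print(len(phrases))  # 6
print(phrases[:2])   # ['free mac pdf', 'free pdf mac']
```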
Integration and GUI for keywords tools
The GUI is always the same for all tools; later in the article, we provide a detailed explanation on how to write GUI.
How to create a rank tracker
What is a rank tracker?
A rank tracker is a tool that tracks a site’s rankings in the search engines. It will:
- Track keywords across different geographies.
- Track keywords in specific cities, which is needed for local SEO.
How does a rank tracker work?
Rank trackers are search engine scrapers that use proxies or other means to scrape the SERP from different IPs in a specific geography or city.
How to build a rank tracker?
The rank tracker should be the easiest tool to create; that’s why there are so many out there. The entry barrier is low.
Because there’s little data, any database will do; you can go with MySQL.
Since you need to track keywords across various geographies, using a residential IP proxy provider makes sense. The number of tracked keywords is tied to the number of subscriptions, so the cost of the IP provider should scale well with revenue.
The “How to write a scraper” section of the keywords explorer part of this article covers how to write a SERP scraper, and the same approach applies to the rank tracker scraper.
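Once the SERP is scraped, the tracking step itself is small. The sketch below assumes the scraper already returned the result URLs in order, and finds the position of a tracked domain. The function name and sample URLs are made up, and counting subdomains as a match is a design choice you may want to change.

```python
from urllib.parse import urlparse


def find_rank(result_urls, domain):
    """Return the 1-based position of the first result on `domain`, or None."""
    domain = domain.lower()
    for position, url in enumerate(result_urls, start=1):
        host = (urlparse(url).hostname or "").lower()
        # Count the domain itself and any of its subdomains as a match.
        if host == domain or host.endswith("." + domain):
            return position
    return None


serp = [
    "https://www.wikipedia.org/wiki/SEO",
    "https://blog.example.com/seo-guide",
    "https://example.com/tools",
]
print(find_rank(serp, "example.com"))      # 2
print(find_rank(serp, "missing.example"))  # None
```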
How to create a site audit tool
What is a site audit tool?
Site audit tools are used to detect on-page SEO issues like:
- Find Broken links.
- Check for missing or incorrect robots.txt.
- Discover missing or incorrect sitemap.
- Check for internal linking issues.
- Detect missing meta description tag.
- Find incorrect usage of header tags (for example, H1).
- Discover duplicate content.
- Generate a sitemap.
- Check for missing hreflang.
- And more.
How does a site audit tool work?
A site audit tool works by crawling the site it audits and analyzing the link structure and meta information on each page. Once it has all the data, it processes it to show the auditor the relevant information to fix.
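The processing step can be sketched as a function that takes the crawled pages and returns a list of issues. The record fields (`url`, `status`, `description`, `h1_count`, `text`) are hypothetical and stand in for whatever your crawler stores. The “exactly one H1” rule and the hash-based duplicate check are simplifications chosen for the example.

```python
import hashlib
from collections import defaultdict


def audit(pages):
    """Return a list of (url, issue) tuples for a list of crawled pages."""
    issues = []
    by_hash = defaultdict(list)
    for page in pages:
        url = page["url"]
        if page["status"] >= 400:
            issues.append((url, "broken link (HTTP %d)" % page["status"]))
            continue  # nothing else to check on a page that failed to load
        if not page["description"].strip():
            issues.append((url, "missing meta description"))
        if page["h1_count"] != 1:
            issues.append((url, "expected one H1, found %d" % page["h1_count"]))
        # Pages with identical text get the same digest.
        digest = hashlib.sha256(page["text"].encode("utf-8")).hexdigest()
        by_hash[digest].append(url)
    for urls in by_hash.values():
        for url in urls[1:]:
            issues.append((url, "duplicate content of " + urls[0]))
    return issues


pages = [
    {"url": "https://example.com/", "status": 200,
     "description": "Home page", "h1_count": 1, "text": "Welcome"},
    {"url": "https://example.com/old", "status": 404,
     "description": "", "h1_count": 0, "text": ""},
    {"url": "https://example.com/a", "status": 200,
     "description": "", "h1_count": 2, "text": "Same body"},
    {"url": "https://example.com/b", "status": 200,
     "description": "Copy", "h1_count": 1, "text": "Same body"},
]
for url, issue in audit(pages):
    print(url, "->", issue)
```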
How to build a site audit tool?
The core component of such a tool is the crawler. The first decision is whether the crawler is offered as SAAS (hosted on your servers) or as downloadable software.
Crawler as a SAAS
Unlike the crawler for the backlinks tool, this crawler usually needs to crawl less data, which means you can get away with a Python-based crawler, which will shorten development time.
Once you have your crawler in place, you’ll need to analyze all the factors mentioned. Python has many libraries to do so, so a fair Python programmer should do the job rather easily.
All the data goes to a database; since there’s not a lot of data, a single MySQL instance will do the job.
On CodeCanyon there are several scripts for SEO audits; they can be used as a base for the final SAAS software and save development time.
Crawler as a download
The main decision in this case is which OS to support. If the plan is to support multiple OSs, you should use a development language that is not OS-dependent, like Python, Java, or NodeJS. The drawback is the need to install the language runtime locally, which can fail on some computers for various reasons.
If the software is meant for a single OS, it can be written in C/C++ with minimal dependencies, which allows for a higher success rate when installing the software. Also, a good programmer can write C/C++ code that ports easily to Linux/Mac if that requirement is known upfront.
For a local database, you can use SQLite, which is the gold standard for local databases (I wouldn’t try to install MySQL locally; that would lower installation success rates).
Other SEO tools are not mainstream, and obviously we can’t cover everything. For example, our SAAS has built-in URL classification, which allows for better domain vetting; we integrated our URL Classification SAAS with the SEO SAAS.
For most other products, I would expect the core principles shown here to apply:
- If the tool needs fast data additions, it needs a distributed database solution, which is expensive, or a custom solution, which is costly to develop.
- If the tool uses AI, you can use a pretrained model such as GPT-2, or a library by Hugging Face.
Which hosting provider to use
When developing this kind of tool, you must have an experienced sysadmin, and this answer is written with that assumption in mind. For a project at that scale you need someone that experienced, or you’ll end up paying a premium for mistakes.
Database hosting provider
We host our custom database at Hetzner; in our experience, Hetzner is the best dedicated server hosting company we have worked with. We had a hard drive replaced within 5 minutes and an entire motherboard replaced within 30 minutes.
In general, a dedicated server will be cheaper and faster than any cloud VPS. For example, we use a server with 12 logical threads and 256GB that costs $80 per month; an equivalent in the cloud costs around $1,000 per month.
If you plan to use a distributed database, you can go with a provider that has a built-in option to deploy such a database, like DigitalOcean. You can also use Amazon Redshift, but keep in mind it may be more expensive than going with the distributed database.
If your team is highly technical and able to deploy a distributed database by itself, you can save money by using a cheap dedicated server host like Kimsufi.
VPS for proxies
For VPS, you can go with the big players: DigitalOcean, Amazon Lightsail, or Linode. Their smallest VPS costs $5.
There are smaller players that offer VPS for $2-4; the more technical your team is, the more leeway you have in picking providers.
Front end hosting
Unlike our recommendation to use dedicated servers for the database, for the front end a VPS might be a better pick. The reasons are that it’s easy to back up and to deploy more instances on the go, and at the start there’s usually no demand for high-performing hardware.
In later stages, when traffic is high, you can go with a dedicated server and a CDN solution like Cloudflare, which will help with the high traffic.
It’s essential to make sure your website loads fast from the geography of your end-users; you can use a tool like Pingdom to benchmark your site.
Which development language to use for the frontend?
Here at SEO Explorer, we love PHP. It’s very similar to C/C++, which is the language we use for our backend.
At the beginning we considered whether to use a framework. We read the reviews and decided against it; the tipping point was that the inventor of PHP, Rasmus Lerdorf, was against it, as written here.
Some people will swear by PHP frameworks. It’s a matter of opinion, and there’s no right or wrong answer.
For the design itself, we use a theme called Metronic, which saved us a lot of time since the theme is responsive and automatically supports desktop and mobile browsers. We paid about $80 for it, and the saving in development costs was crazy.
The good thing about HTML themes is that you can add any development language to it, PHP or others, which gives excellent flexibility.
There are many other good themes you can purchase at Themeforest for a low price.
Which framework to use for your blog?
For our blog (the one you are reading now), it was a no-brainer: we use WordPress. It’s the most supported blogging ecosystem, and you can’t go wrong with it.
One dilemma we had was whether, for SEO, the blog should live on a standalone subdomain or be part of the main domain. We decided it should be part of our domain for higher domain ranking.
It’s possible to develop your own SEO tool. Each type of tool has a different entry barrier; some are very hard, some are easier.
There’s an option to speed it up – white label SEO API. We provide an API that is available for private or Whitelabel use that supports:
- Backlinks checker API.
- Outlinks checker API.
- Keywords search volume API.
- Similar keywords API.
- And more.
It’s possible to integrate our API into an existing solution or build a new solution.