Microsoft adds Reddit data to Bing search results, Power BI analytics tool

Microsoft is bringing the self-proclaimed “front page of the internet” to the pages of its search results.

Microsoft has struck a deal with Reddit to pipe data from the social network into Bing’s search results, as well as Power BI’s analytics dashboard, the companies announced on Wednesday.

Now, when people search on Bing, posts published to Reddit may be included in the search results. For example, if a person’s query asks something like “what were the best video games released in 2017,” answers may be sourced from comments left in Reddit’s “gaming” subreddit, or topic-specific forum.

People will also be able to use Bing to specifically search for content from Reddit. Typing “reddit [subreddit name]” will return a link to that subreddit and a selection of top comments that have been posted to it. And typing “reddit AMAs” will return a collection of popular AMA (“Ask Me Anything”) sessions, which are live question-and-answer forums that people can host on Reddit. Additionally, if people search for the name of a person who has hosted an AMA on Reddit, a selection of responses from the Q&A session will appear among the non-Reddit results.

In addition to bringing Reddit’s data to Bing users, Microsoft is also opening that data up to brands. Brands will be able to access Reddit data through Microsoft’s Power BI analytics tool, with options to specify the keywords to track and toggle the time frames to examine. As a result, marketers will be able to monitor what people are saying about their brand or competing brands on Reddit and have that information processed using Power BI’s sentiment analysis feature and plotted into data visualizations.

The deal with Microsoft’s Power BI is similar to one that Reddit announced with social marketing platform Sprinklr last week in terms of accessing Reddit data. Brands will be able to see which subreddits they are mentioned on and then go buy ads targeted to those audiences.

How on-site search can drive holiday revenue & help e-commerce sites compete against major retailers

This holiday season is set to break a new record, with online sales reaching beyond $100 billion, according to Adobe’s recent predictions. Following Black Friday and Cyber Monday outcomes, most of that revenue will be divided among Amazon and a handful of large-scale e-commerce sites, including Walmart, Target and Best Buy.

With so many dollars at stake, there is still a sizeable amount of market share available for smaller online retailers. But what can e-commerce sites do to compete with the likes of Amazon or Walmart?

An optimized on-site search platform could very well be the answer to capturing more conversions and driving more sales during the holidays. Unfortunately, many e-commerce sites may be missing the boat by not paying enough attention to their on-site search efforts.

How on-site search impacts revenue

According to SLI Systems, which offers an AI-powered e-commerce solution, visitors who use on-site search make purchases at a 2.7x greater rate than website visitors who only browse products. If searchers have indicated exactly what they want — specifying a color, size or material within their query — SLI Systems says it’s the e-commerce site’s job to quickly deliver the product that best matches their search.

“Don’t make these folks navigate their way to what they want. No extra clicks. You’ll likely lose them even if you have a great price and an amazing free shipping offer,” says Bob Angus, an e-commerce consultant, in a post on SLI Systems’ company blog.

Eli Finkelshteyn, founder and CEO of an on-site search platform, says most of the on-site search market is still made up of companies that have built platforms in-house.

“I think there’s an erroneous belief among a lot of companies that search is really core to what they do,” says Finkelshteyn.

“At the end of the day, I think, for e-commerce websites, they’ve got things they need to build themselves, that no one can help them with — things like merchandising, making sure you have the lowest prices, quick delivery, that you have the product that customers want — but search is adjacent to that.”

Finkelshteyn says companies need to make sure their on-site search is optimized so that consumers find the products they want.

“I think that’s notoriously difficult to do,” says Finkelshteyn.

With an on-site search function, you may only be serving up a limited number of results. If a consumer is searching your site for a specific product, Finkelshteyn says it is imperative your on-site search knows how to deliver the most relevant products.

The technology driving an optimized on-site search experience

Finkelshteyn’s platform incorporates a number of technologies, including machine learning to improve personalized auto-suggestion results.

“Typo-tolerance is automatic with us. We do that using phonetic and typographic dissonances,” says Finkelshteyn. “What that means, essentially, is that we’re mapping how a word is pronounced to the canonical word in your data set.”

For example, if someone is searching for a Kohler faucet but enters a search for Koler, they will still receive the correct product match.

Finkelshteyn says another fairly common on-site search challenge is typographical misspellings — when someone simply enters a typo. An effective on-site search platform should be able to recognize common misspellings and still surface relevant products.
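As a rough illustration of the two ideas above, the sketch below first tries to match a query to a catalog term by a crude phonetic key and then falls back to edit distance for plain typos. The catalog, the `phonetic_key` rules and the `max_distance` threshold are assumptions made for this example; it is not the algorithm any particular vendor uses.

```python
def phonetic_key(word):
    """Crude phonetic key (illustrative only): keep the first letter,
    drop later vowels and h/w/y, and collapse repeated letters."""
    word = word.lower()
    if not word:
        return ""
    key = word[0]
    for ch in word[1:]:
        if ch in "aeiouhwy":
            continue
        if ch != key[-1]:
            key += ch
    return key


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def match_term(query, catalog, max_distance=2):
    """Return the catalog term that best matches query, or None."""
    if not query or not catalog:
        return None
    q = query.lower()
    # 1) Phonetic match: "koler" and "kohler" share the same key.
    q_key = phonetic_key(q)
    for term in catalog:
        if phonetic_key(term) == q_key:
            return term
    # 2) Typo fallback: closest term within max_distance edits.
    best = min(catalog, key=lambda t: edit_distance(q, t.lower()))
    return best if edit_distance(q, best.lower()) <= max_distance else None
```

With a hypothetical catalog of `["Kohler", "Moen", "Delta"]`, a search for "Koler" resolves through the phonetic key, while a transposition such as "Detla" is caught by the edit-distance fallback.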

On-site search from a brand’s point of view

Dennis Goedegebuure serves as the VP of growth and SEO for sporting apparel company Fanatics. The company operates more than 300 online and offline partner stores. A portion of those stores handle the e-commerce business for all major professional and sports leagues.

“I work very closely with the on-site search teams to make sure the sites differentiate themselves with the offers we give our users,” says Goedegebuure.

The VP of growth says on-site search plays a crucial role in Fanatics’ e-commerce business.

“When you capture a visit, you would like to offer your customer the best selection. So making sure they get the best selection at the best price for the best value to make the sale is obviously top priority,” says Goedegebuure.

According to Goedegebuure, it’s not only about product competition, but the competition among online retailers for share of wallet.

“The customer only has a certain amount of money to spend, you would like to make sure they spend it with you.”

Goedegebuure’s teams are constantly running tests to fine-tune their sites’ on-site search functions.

“We’re running a bunch of experiments all the time, from sizing of the pictures to the little icons that we add to the search, to sort-order, to the number of items in the search result page,” says Goedegebuure. “We’re running constant experiments to find an optimal configuration of our search and to improve the conversions we get out of the traffic.”

According to Goedegebuure, the on-site search tests his teams are running have helped identify a definite sweet-spot for the number of items displayed in search results, as well as determining how the sizing of a picture can impact conversion rates.

On-site search for the holidays

In terms of holiday preparation, Goedegebuure says Fanatics’ on-site search algorithms may be tweaked to align with holiday promotions.

“If we have a brand on sale, like our own Fanatics brand, these might be pushed up to the top because there are better pricing points,” says Goedegebuure. “If an item goes off sale, you need to adjust for that.”

Finkelshteyn says one of the major on-site search mistakes he sees companies making this time of year is failing to refresh their index rankings.

“If you have a search index with rankings you’ve built over the last year, you still might be optimizing for searches that are not really seasonal right now,” says Finkelshteyn. “For example, if somebody searches for the word ‘blanket’ during the summer, you probably want to give them a beach blanket. If somebody searches for the word ‘blanket’ during the winter, you probably want to give them a warm blanket.”
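One simple way to picture the seasonal adjustment Finkelshteyn describes is a boost applied on top of an existing relevance score. The sketch below is a minimal illustration only: the product tags, the multipliers and the Northern Hemisphere month ranges are all hypothetical.

```python
from datetime import date

# Hypothetical seasonal boosts: which months count as a season and
# which product tags get a score multiplier during that season.
SEASONAL_BOOSTS = {
    "summer": {"months": {6, 7, 8}, "tags": {"beach": 2.0}},
    "winter": {"months": {12, 1, 2}, "tags": {"warm": 2.0}},
}


def seasonal_score(base_score, product_tags, today):
    """Multiply the base relevance score by any boost active today."""
    score = base_score
    for season in SEASONAL_BOOSTS.values():
        if today.month in season["months"]:
            for tag, boost in season["tags"].items():
                if tag in product_tags:
                    score *= boost
    return score


def rank(products, today):
    """Sort products by seasonally adjusted score, best first."""
    return sorted(
        products,
        key=lambda p: seasonal_score(p["score"], p["tags"], today),
        reverse=True,
    )


blankets = [
    {"name": "beach blanket", "score": 1.0, "tags": {"beach"}},
    {"name": "fleece blanket", "score": 1.0, "tags": {"warm"}},
]
summer_results = rank(blankets, date(2017, 7, 1))
winter_results = rank(blankets, date(2017, 12, 15))
```

Here the same "blanket" query surfaces the beach blanket first in July and the fleece blanket first in December, without rebuilding the underlying index.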

Whether your company has built its on-site search platform in-house or is using a vendor platform, making sure it is optimized for the holiday e-commerce surge should be a top priority. As we enter the final days of the shopping season, there is still much revenue up for grabs.

Adobe’s latest reports found that holiday e-commerce had reached $50 billion by the end of November, leaving more than $50 billion of its predicted $100 billion in revenue to be claimed by the year’s end.

For many e-commerce companies, fine-tuning their on-site search algorithms may be the most profitable move they could make this holiday season — and beyond.

[This article first appeared on Marketing Land.]

What SEOs need to know about Baidu in 2017

The first half of 2017 was stressful for Baidu, which saw a decline in active advertisers and stagnant revenue. Nonetheless, we see the search giant putting a huge amount of resources into AI and into building China’s web ecosystem.

If you are in the business of inbound marketing to the Chinese market, this article is for you. I have wrapped up the most significant updates and tips officially given by Baidu Webmaster Tools (BWT) in the list below. Ready? Let’s get started.

Baidu MIP ramping up

Mobile Instant Pages (MIP) have reached several milestones in the first six months of 2017:

5,400 websites have built and submitted their MIP pages.
Over 1 billion mobile pages are now on MIP.
Every day, there are hundreds of millions of clicks to MIP pages from Baidu Search.

Moreover, MIP now has 215 components built for public use. The response time of the MIP cache has been optimized, with speed increases of 50 percent or more. And MIP now has enabled mip-install-serviceworker for offline caching.

In June, I spoke with Junjie Wang, the owner of Baidu MIP, at the Baidu VIP Conference in Shanghai. He explained that MIP, despite being derived from Google’s AMP, is optimized for internet users in China, who use different browsers and have different browsing behaviors from those in the West. Baidu and Google have collaborated for a faster web; in fact, Baidu helped Google set up its AMP CDN in China.

Baidu has indexed a considerable number of AMP pages, although these don’t display the lightning icon in Baidu’s search results the way MIP pages do (see screen shot below). For sites only serving the audience from Mainland China, I would recommend you deploy MIP instead of AMP.

The Flash icon for MIP results on Baidu SERP


The other improvement Baidu is driving in China is the secure web. Baidu Webmaster Tools launched a new feature of HTTPS Site Authentication in May that allows HTTPS sites to have a better presence on Baidu SERPs.

Previously, when HTTPS pages weren’t well supported, Baidu didn’t know whether to index a non-secure page or a secure page. Sites had to build two versions with different protocols to have a better result in indexation. Now, once you have been through this authentication, only secure pages of your website will be indexed and presented on the SERPs.

Authenticate an HTTPS site in Baidu Webmaster Tools

PWA and Lavas

PWA (Progressive Web Apps) for Baidu have finally arrived! Just like Google’s PWA, the Baidu version of PWA can have features like Desktop Icon, Full-screen Browsing, Offline Caches and Push Messages.

A “Hello World” of Lavas PWA

In order to help developers build their PWA instance effectively, Baidu has launched a framework based on Vue as a solution and named it Lavas. With Lavas, you will have a set of templates that accelerate your development and deployment.

Algorithm: Hurricane

Content scraping is undoubtedly the greatest threat to content marketers on China’s internet. While Baidu is still testing its Original Content Protection feature with a few selected websites, it has released an algorithm update, code-named Hurricane, which targets websites made up mostly of scraped content.

You will probably also find the copyright tag in Baidu Image Search results. This tag is meant to encourage content marketers to generate more original images and graphics.

The Copyright tag on Baidu Image Search


In order to better understand what a page will look like to users, Baidu started testing its new spider with page-rendering capabilities in March. Now, the search engine has two new spiders in operation.

For desktop version:

Mozilla/5.0 (compatible; Baiduspider-render/2.0; +

For mobile version:

Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1 (compatible; Baiduspider-render/2.0; +

It is easy to check whether an IP belongs to a real Baidu bot. You can run host on Linux or nslookup on Windows. See below:

nslookup for verifying the Baidu Spider
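The same reverse-DNS check can be scripted. The sketch below is a minimal illustration that assumes genuine Baiduspider hosts reverse-resolve to names under baidu.com or baidu.jp; that suffix list is an assumption not stated in this article and should be confirmed against Baidu's own documentation. The `reverse_lookup` parameter is a hypothetical hook added so the function can be exercised without network access.

```python
import socket

# Assumption: real Baiduspider IPs reverse-resolve to hostnames under
# these domains. Verify the current list with Baidu before relying on it.
BAIDU_SUFFIXES = (".baidu.com", ".baidu.jp")


def is_baidu_hostname(hostname):
    """True if hostname sits under one of the assumed Baidu domains."""
    return hostname.lower().rstrip(".").endswith(BAIDU_SUFFIXES)


def verify_baidu_ip(ip, reverse_lookup=socket.gethostbyaddr):
    """Reverse-resolve ip (like `host` or `nslookup`) and check the name."""
    try:
        hostname, _aliases, _addresses = reverse_lookup(ip)
    except OSError:
        # No PTR record or lookup failure: treat as not verified.
        return False
    return is_baidu_hostname(hostname)
```

In production, calling `verify_baidu_ip(ip)` with the default resolver performs a live DNS lookup, so results are worth caching per IP address.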

Baidu Mobile Search UX Whitepaper for Advertising 2.0

In mid-June, Baidu released its new UX Whitepaper for Mobile Search (v1.0 was released in March earlier this year). In it, Baidu published detailed mobile advertising guidelines. According to the whitepaper, the following types of ads will lead to a Baidu penalty:

    Pornography, seductive, gambling and other ads prohibited by laws
    Ads with scam and fraud messages
    Content with app-wall or auto-redirect to app stores
    Ads with massive size or large proportion of a page
    Ads covering content with layers
    Ads near the buttons on a page
    Auto-play video ads
    Ads between article heading and body text
    Ads between the body text and pagination

An example of an ad that would trigger the penalty from Baidu

SEO tips, straight from Baidu

In addition to the updates above, Baidu has also recently provided some SEO-specific guidance through its Webmaster Tools platform. I’ve summarized some of the most important advice below.

Page size/URL length

Baidu says your page size (the HTML) should not be larger than 128 KB. Pages that embed binary image data directly in the HTML can easily exceed 128 KB, and this causes issues for the Baidu spider when it attempts to parse the page. In fact, if you have a page that is too big, it is best practice (for Baidu SEO) to implement pagination. Another tip is to avoid adding unnecessary code to your output, in case it pushes the page over the limit.

In addition to page size, URL length plays a critical role in whether pages get indexed. At Merkle, we’ve observed that clean, short URLs get indexed more quickly and rank higher. The recommended URL length is 76 characters, excluding the protocol. Hence, when adopting a URL convention, you should avoid using Chinese characters in your URLs, because encoding makes those URLs much longer than they look in Chinese characters.
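Both limits are easy to check in a script. The sketch below is illustrative only: it assumes 128 KB means 128 × 1,024 bytes of UTF-8 encoded HTML and that the 76-character figure is an upper bound, and the function names and the example path are hypothetical. The last line shows why Chinese characters inflate a URL once they are percent-encoded.

```python
from urllib.parse import quote

MAX_HTML_BYTES = 128 * 1024  # assumption: 128 KB counted as 131,072 bytes
MAX_URL_CHARS = 76           # recommended length, excluding the protocol


def html_within_limit(html):
    """True if the UTF-8 encoded HTML fits inside the size limit."""
    return len(html.encode("utf-8")) <= MAX_HTML_BYTES


def url_length_without_protocol(url):
    """Length of the URL after stripping a leading 'scheme://'."""
    return len(url.split("://", 1)[-1])


def url_within_limit(url):
    return url_length_without_protocol(url) <= MAX_URL_CHARS


# A 7-character path containing five Chinese characters grows to
# 47 characters once percent-encoded (9 characters per Chinese character).
chinese_path = "/产品/水龙头"
encoded_path = quote(chinese_path)
```

Running the checks against rendered HTML rather than templates gives a more realistic size, since inlined assets only appear in the final output.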

404 pages/deleting pages

In May, Baidu posted an article on how to manage 404 pages (Chinese language). Handling 404 pages is different (and more complicated) in Baidu than in Google or Bing. Here is the suggested course of action:

    1. If you have website pages that no longer exist or that you need to delete, the first thing you need to do is to confirm that those pages are indexed by Baidu. You can search for the URL on Baidu or check your web analytics tools.
    2. The next step is to set the status code to 404 for those URLs. Of course, those URLs should not be disallowed in your robots.txt.
    3. Now, compile these pages into an XML or TXT file and make sure every single URL in this file is set to 404.
    4. Submit it to Baidu Webmaster Tools. The de-indexation will take effect in two to three days. Once the pages are no longer in the index, delete the XML or TXT you submitted.

Submit 404 files in Baidu Webmaster Tools.
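Before submitting the file, it can help to confirm that every listed URL really returns a 404. The following sketch uses only Python's standard library; the `get_status` parameter is a hypothetical hook for substituting another HTTP client or a stub, and connection errors are deliberately left to propagate.

```python
import urllib.error
import urllib.request


def fetch_status(url):
    """Return the HTTP status code for url using a HEAD request."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is what we want.
        return err.code


def urls_not_returning_404(urls, get_status=fetch_status):
    """Return the URLs that should NOT go in the file yet."""
    return [url for url in urls if get_status(url) != 404]
```

An empty result from `urls_not_returning_404` means every URL in the list is ready to be submitted; anything it returns still needs its status code fixed first.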

Alternatively, if you want to delete a folder or a set of URLs beginning with a string, you can submit the pattern to Baidu Webmaster Tools. This pattern must end with a slash (/) or a question mark (?) — e.g., or

Avoid cheap domains

If you are running your business on a top-level domain (TLD) such as .top or .win, you need to be aware that your site may look spammy to Baidu.

Other spammy TLDs include, but are not limited to, .bid, .pw, .party and .science. Those domains are cheap. Therefore, they look fishy to Baidu.

Domains under $3 per year (table columns: TLD; annual fee, first-time buy)

According to Baidu (Chinese language), these cheap TLDs are low priority for indexation. If you insist on using such a domain, you must verify it with Baidu Webmaster Tools so that it can be regarded as a legitimate site.

Baidu cache

For the first time, Baidu explained how cached pages (known as “Baidu snapshots”) work (Chinese language). Cached pages are generated when Baidu crawls the page and adds it to the index (or updates the indexed version). How fresh your cached page is will depend on your site’s crawl frequency, which can vary from several minutes up to a month (depending on the site).

If you’ve blocked Baidu’s spider from your .js and .css resources, or if you use relative URLs in your HTML, the snapshot will look odd and unformatted. If you want to have the snapshot deleted, you can report an inappropriately cached page.

Report inappropriate cache for deletion.

Launching a new site

The last tip I’m sharing is how to give Baidu a stunning first impression when launching a new website.

You may only have a handful of pages at launch, or perhaps you have lots of pages that are low in quality (short/empty or with duplicate content). Unfortunately, this is a disaster to Baidu. Having a robust, high-quality website at launch shows Baidu that you know how to organize your content and provide reliable information. If you fail to make a good “first impression,” Baidu then allocates fewer resources crawling your site in the future — and consequently, it is difficult to win back their trust.

To solve this problem, Baidu suggests (Chinese language) disallowing the website during the UAT (User Acceptance Test) or Invite-only period.
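One common way to disallow a site is a robots.txt rule. The sketch below is an assumption about how that advice might be applied, not Baidu's documented procedure: it shows a pre-launch robots.txt that blocks the `Baiduspider` user agent and uses Python's standard `urllib.robotparser` to confirm the rule behaves as intended. The example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical pre-launch robots.txt: block Baidu's crawler entirely.
# Remove or relax this rule once the site is ready for indexing.
PRELAUNCH_ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /
"""


def can_fetch(robots_txt, user_agent, url):
    """Check whether user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


baidu_allowed = can_fetch(
    PRELAUNCH_ROBOTS_TXT, "Baiduspider", "https://example.com/page"
)
other_allowed = can_fetch(
    PRELAUNCH_ROBOTS_TXT, "Googlebot", "https://example.com/page"
)
```

With this file in place, `baidu_allowed` is False while crawlers not named in the file remain unaffected.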

Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.

AdWords mobile extensions get major upgrades with tappable sitelinks & more visible content

So far in 2017, we’ve seen location extensions for display, automated call extensions, a global rollout of price extensions, and most recently, call extension bid adjustments. Google AdWords extensions are more powerful than ever, and they just got another major mobile upgrade. Mobile sitelinks are becoming interactive, while callouts and structured snippets are gaining more real estate.

Interactive, tappable sitelinks

Today, sitelinks are becoming tappable on mobile, allowing users to scroll further and click through to deeper parts of a site. This will allow searchers to choose which sitelinks pertain to them, while also diving directly into a specific page. Of course, these sitelinks are not automated links, but rather are chosen and crafted by the advertiser.

This will be a big upgrade for sitelinks, as previous mobile versions would regularly show cut-off text because they simply were not interactive. Giving searchers the power to view the sitelinks should be a major win for all sides: Google, advertisers and searchers.

So far, Google has reported early results that show people have been twice as likely to interact with the new formatting of these extensions.

Callouts & snippets fall ‘in-line’

Typically, a searcher would find callout and structured snippet extensions underneath an ad. Now these two extensions will be included in-line with the ad copy. According to Google, this will allow more of the callout and snippet text to be displayed.

We’ve seen callouts and structured snippets in swipeable formats in previous tests, but it makes sense to combine them into the heart of the ad text. From a uniformity standpoint, these aren’t interactive extensions like a location, sitelink or call extension. They can’t be interacted with, and they shouldn’t appear outside of the ad “paragraph.”

For more information, see the official release on the Inside AdWords blog.

DuckDuckGo surpasses 10 billion searches, with 4 billion happening in 2016 alone

DuckDuckGo, the privacy search engine that allows users to search without fear of queries being tracked, reached a major milestone last year, surpassing a cumulative 10 billion searches since the site’s inception.

The search engine says four billion of those searches happened in 2016. Already this year, DuckDuckGo says it had its biggest day ever when it served up 14 million searches on January 10.

“People are actively seeking out ways to reduce their digital footprint online,” says DuckDuckGo’s release, noting that a Pew Research study reported 40 percent of people believe search engines shouldn’t retain information about a user’s online activity.

As part of the announcement, DuckDuckGo named nine organizations it supported in 2016, with donations totaling $225,000, including Freedom of the Press Foundation, OpenBSD Foundation, Tor Project, Fight for the Future and Riseup Labs.

The trouble with truth

Since the election, there’s been a lot of discussion about fake news and its ability to sway masses into potentially false perceptions. Clearly, creating false perceptions in mass media is a dangerous thing, and it can sway public opinion and policy greatly.

But what about search engines and other content distributors? Even before the US election, German Chancellor Angela Merkel warned that search engine algorithms, “when they are not transparent, can lead to a distortion of our perception, they can shrink our expanse of information.” What responsibility, then, does a search engine have to produce truthful information?

Is Pluto a planet?

Getting at truth can be tough because not everything is black and white, especially in certain subjects. Take, for example, good old Pluto. Many of us grew up learning that Pluto is a planet. Then, in 2006, astronomers ruled that it was no longer a planet.

But in the last few years, Pluto’s planetary designation seems to have been in dispute. As I was helping my daughter with her solar system project for school, I questioned if we should add Pluto as a planet or if it should be left off. What is Pluto’s planetary status now?

Unfortunately, the answer still wasn’t clear. The International Astronomical Union (IAU) determined that Pluto is not a planet because it only meets two of their three criteria for planetary status:

    Orbit around the sun (true).
    Be spherical (true).
    Be the biggest thing in its orbit (not true).

In fall 2014, the Harvard-Smithsonian Center for Astrophysics held a panel discussion on Pluto’s planetary status with several leading experts: Dr. Owen Gingerich, chair of the IAU planet definition committee; Dr. Gareth Williams, associate director of the Minor Planet Center; and Dr. Dimitar Sasselov, director of the Harvard Origins of Life Initiative. Interestingly, even Gingerich, who is the chair of the IAU planet definition committee, argued that “a planet is a culturally defined word that changes over time,” and that Pluto is a planet. Two of the three members of the panel, including Gingerich, concluded that Pluto is indeed a planet.

Confused yet? Sometimes there are multiple reputable organizations who debate two potential truths. And that’s OK. Science is about always learning and discovering, and new discoveries may mean that we have to rethink what we once considered fact.

A problem bigger than the smallest planet

While my Pluto example is a fairly harmless and hopefully less controversial example, there are clearly topics in science and beyond that can create dangerous thinking and action based on little or no proven fact.

The issue, especially in science and research, delves much deeper, though. Even if a research study is performed and demonstrates a result, how dependable is that result? Were the methodology and sample size proper? All too often, we see sensational clickbait headlines for studies, as John Oliver shared earlier this year:

Hey, who doesn’t want to drink wine instead of going to the gym?

But the methodology around some of these research reports can be truly suspect. Sweeping generalizations, especially around the health of humans and the environment, can be incredibly dangerous.

In the video segment, Oliver shares a story published by Time magazine, which I would normally consider a reputable source. The article is about a study which, Time claims, suggests that “smelling farts can prevent cancer.” Now, while this particular study did not actually make that claim, if you search for “smelling farts can prevent cancer” on Google, here are the results:

Google has even elevated the false information to a Google Answer at the top of the page. In fact, the first result disputing this false claim doesn’t even appear above the scroll line.

The media and clickbait

As Oliver points out in the video, the problem is larger than just users sharing and buying into this information. Rather, there’s a deeper issue at play here, and it centers around what’s popular. Most of us are familiar with clickbait — outrageous headlines created to entice us to click on an article. In an effort to compete to get the most clicks (and thus ad revenue), media outlets have resorted to trying to share the most outrageous news first.

The problem for Google is that much of their algorithm relies on the authority of a site and inbound links to that website. So if a typically authoritative site, such as CNN, posts stories that are not fact-checked, and then we share those links, those two actions are helping to boost the SEO for the incorrect information.

But isn’t it much more fun to think that drinking wine will spare me from having to go to the gym? That’s essentially why we share it.

Why fact checking is hard, manual work

If media outlets and websites aren’t fact checking, how can Google do this? There are certainly a number of sites dedicated to fact checking and rumor validation, such as Snopes and PolitiFact, but they also rely on human editors to pore over articles and fact-check claims.

Last year, Google engineers outlined in a research paper how they might incorporate a truthfulness measurement into the ranking algorithm. But can that really be done? Can a simple algorithm separate truth from fiction?

There are many fact-checking organizations, and there’s even an international network of fact-checkers. But ironically, while there are some mutually agreed-upon best practices, there are no set standards for fact-checking — it can vary by organization. Further, to Oliver’s points in the video, fact-checking different topics requires different standards. A scientific study, for instance, may need to be judged on several standards: methodology, duplication of study and so on, whereas political articles likely require on-the-record quotes to verify.

Treating the cause

Google and Facebook both have started taking steps to eradicate fake news. Google announced it would no longer allow sites with fake news to publish ads on those pages, essentially seeking to cut off potential revenue streams for fake news producers that rely on false clickbait to generate income. That’s certainly one cause of false news generation, but is it the only one?

The issue is much deeper than just ad revenue. One of the scientists in Oliver’s video shares how scientists are incentivized to publish research. The competition in journalism and in science to get “eye-catching” results is real. So the root cause can often be more than just ad revenue — it may be just to get noticed. Or further, it may be to promote an agenda, which falls under the umbrella of propaganda.

The other side of the situation: bias and backlash

So what should Google do? The challenge the search engines (and Facebook) are confronted with is looking biased or being accused of promoting and favoring one side over another. Per Merkel’s comments, this stifles debate as well. And as Google has seen numerous times, and Facebook recently saw this summer with the accusation that its news feed was liberal-leaning, showing more or less of one side of a story may earn the platform a reputation of being biased.

Further, as we’ve established, truth is not always black and white. Merriam-Webster defines truth as “a statement or idea that is true or accepted as true.” So what if I accept something as true that you do not? Truth is not universal in all cases. For example, an atheist believes that God does not exist. This is truth for the atheist.

As one writer commented in an article on, “No one — not even Google — wants Google to step in and settle hash that scientists themselves can’t.” Is it really Google’s responsibility to promote only content it deems through an algorithm to be true?

It certainly puts Google in a tough bind. If they quell sites that they believe are not truthful, they may be accused of censorship. If they don’t, they have the power to potentially sway the beliefs of many people who will believe that what they see in Google is true.

Another answer: education

We’ll never stop clickbait. And we’ll never stop fake news. There’s always a way to work the system. Isn’t that what SEOs do? We figure out how to respond to the algorithm and what it wants to rank our sites higher in results. While Google can take steps to try to combat fake news, it will never stop it fully. But should Google stop it completely? That’s a slippery slope.

In conjunction with these efforts, we really have to hold journalism to a higher standard. It starts there. If it sounds too good to be true, you can bet it is. It starts with questioning what we read instead of simply sharing it because it sounds good.

Time for my daily workout: a glass of wine.

Common Search: The open source project bringing back PageRank

Over the last several years, Google has slowly reduced the amount of data available to SEO practitioners. First it was keyword data, then PageRank score. Now it is specific search volume from AdWords (unless you are spending some moola). You can read more about this in Russ Jones’s excellent article that details the impact of his company’s research and insights into clickstream data for volume disambiguation.

One item that we have gotten really involved in recently is Common Crawl data. There are several teams in our industry that have been using this data for some time, so I felt a little late to the game. Common Crawl is an open source project that scrapes the entire internet at regular intervals. Thankfully, Amazon, being the great company it is, pitched in to store the data to make it available to many without the high storage costs.

In addition to Common Crawl data, there is a non-profit called Common Search whose mission is to create an alternative open source and transparent search engine — the opposite, in many respects, of Google. This piqued my interest because it means that we all can play, tweak and mangle the signals to learn how search engines operate without the huge time investment of starting from ground zero.

Common Search data

Currently, Common Search uses the following data sources for calculating their search rankings (This is taken directly from their website):

Common Crawl: The largest open repository of web crawl data. This is currently our unique source of raw page data.
Wikidata: A free, linked database that acts as central storage for the structured data of many Wikimedia projects like Wikipedia, Wikivoyage and Wikisource.
UT1 Blacklist: Maintained by Fabrice Prigent from the Université Toulouse 1 Capitole, this blacklist categorizes domains and URLs into several categories, including “adult” and “phishing.”
DMOZ: Also known as the Open Directory Project, it is the oldest and largest web directory still alive. Though its data is not as reliable as it was in the past, we still use it as a signal and metadata source.
Web Data Commons Hyperlink Graphs: Graphs of all hyperlinks from a 2012 Common Crawl archive. We are currently using its Harmonic Centrality file as a temporary ranking signal on domains. We plan to perform our own analysis of the web graph in the near future.
Alexa top 1M sites: Alexa ranks websites based on a combined measure of page views and unique site users. It is known to be demographically biased. We are using it as a temporary ranking signal on domains.

Common Search ranking

In addition to these data sources, a look through the code shows that Common Search also uses URL length, path length and domain PageRank as ranking signals in its algorithm. Lo and behold, since July, Common Search has had its own data on host-level PageRank, and we all missed it.

I will get to the PageRank (PR) in a moment, but it is interesting to review the code of Common Search, especially the portion located here, because you really can get into the driver’s seat by tweaking the weights of the signals that it uses to rank the pages:

signal_weights = {
    "url_total_length": 0.01,
    "url_path_length": 0.01,
    "url_subdomain": 0.1,
    "alexa_top1m": 5,
    "wikidata_url": 3,
    "dmoz_domain": 1,
    "dmoz_url": 1,
    "webdatacommons_hc": 1,
    "commonsearch_host_pagerank": 1
}

Of particular note, as well, is that Common Search uses BM25 as the similarity measure of keyword to document body and meta data. BM25 is a better measure than TF-IDF because it takes document length into account, meaning a 200-word document that has your keyword five times is probably more relevant than a 1,500-word document that has it the same number of times.
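To make the document-length point concrete, here is a minimal Python sketch of the standard Okapi BM25 formula for a single query term. It is not taken from the Common Search code. The function name, the k1 and b defaults, and the corpus numbers (average length, document counts) are all assumptions chosen for illustration.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term, k1=1.2, b=0.75):
    """Okapi BM25 contribution of one query term for one document."""
    # Inverse document frequency (the variant with +1 inside the log, so it stays positive).
    idf = math.log((n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1)
    # Length normalization: documents longer than average are penalized.
    norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)

# Hypothetical corpus: 10,000 documents, 100 contain the keyword, average length 850 words.
short_doc = bm25_term_score(tf=5, doc_len=200, avg_doc_len=850, n_docs=10_000, docs_with_term=100)
long_doc = bm25_term_score(tf=5, doc_len=1500, avg_doc_len=850, n_docs=10_000, docs_with_term=100)

# Same keyword count, but the 200-word document scores higher.
print(short_doc > long_doc)
```

With b set to 0, the length penalty disappears and both documents would score the same, which is roughly the behavior the TF-IDF comparison is pointing at.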

It is also worthwhile to say that the number of signals here is very rudimentary and obviously missing many of the refinements (and data) that Google has integrated in their search ranking algorithm. One of the key things that we are working on is to use the data available in Common Crawl and the infrastructure of Common Search to do topic vector search for content that is relevant based on semantics, not just keyword matching.

On to PageRank

On the page here, you can find links to the host-level PageRank for the June 2016 Common Crawl. I am using the one entitled pagerank-top1m.txt.gz (top 1 million) because the other file is 3GB and covers over 112 million domains. My machine does not have enough memory to load that one in R without capping out.

After downloading, you will need to unzip the file and bring it into your working directory in R. The PageRank data from Common Search is not normalized and also is not in the clean 0-10 format that we are all used to seeing it in. Common Search uses “max(0, min(1, float(rank) / 244660.58))” (basically, a domain’s rank divided by Facebook’s rank) as the method of translating the data into a distribution between 0 and 1. But this leaves some definite gaps: LinkedIn’s PageRank, for example, would come out as a 1.4 when scaled by 10.
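As a quick illustration of that gap, here is the quoted expression wrapped in a small Python function. The divisor comes from the expression above. The function name and the example raw score are assumptions, and the raw score was picked only because it lands on 1.4, not because it is LinkedIn's actual value.

```python
FACEBOOK_RAW_SCORE = 244660.58  # divisor from the quoted Common Search expression

def linear_pagerank(raw_score):
    """Clamp raw_score / FACEBOOK_RAW_SCORE into the range 0 to 1."""
    return max(0, min(1, float(raw_score) / FACEBOOK_RAW_SCORE))

# Hypothetical raw score, chosen only to reproduce a 1.4 on the 0-10 scale.
example_raw = 34252.48
print(round(linear_pagerank(example_raw) * 10, 1))
```

A strong domain ends up near the bottom of the scale because the mapping is linear, while the old toolbar PR behaved more like a logarithmic scale.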

The following code will load the dataset and append a PR column with a better approximated PR:

#Grab the data
df <- read.csv("pagerank-top1m.txt", header = F, sep = " ")

#Log Normalize
logNorm <- function(x){
    #Normalize
    x <- (x-min(x))/(max(x)-min(x))
    10 / (1 - (log10(x)*.25))
}

#Append a Column named PR to the dataset
df$pr <- (round(logNorm(df$V2),digits = 0))
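For readers who prefer Python, the same log normalization can be sketched as follows. This is a rough port of the R function above, not a separate method. The sample scores are made up, and the explicit zero case stands in for R's handling of log10(0), which evaluates to -Inf and makes the lowest score come out as 0.

```python
import math

def log_norm(scores):
    """Min-max normalize raw scores, then map them onto a rounded 0-10 log scale."""
    lo, hi = min(scores), max(scores)  # assumes at least two distinct values
    result = []
    for s in scores:
        x = (s - lo) / (hi - lo)
        if x == 0:
            # In R, log10(0) is -Inf, so the lowest score maps to 0.
            result.append(0)
        else:
            result.append(round(10 / (1 - math.log10(x) * 0.25)))
    return result

# Made-up raw scores, highest to lowest.
print(log_norm([244660.58, 34252.48, 100.0, 1.0]))
```

The 0.25 multiplier is the knob that was tuned by hand: a larger value compresses the scale more aggressively, a smaller one pushes more domains toward 10.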

We had to play around a bit with the numbers to get it somewhere close (for several samples of domains that I remembered the PR for) to the old Google PR. Below are a few example PageRank results: (8) (6) (5) (9) (6)

Here is a plot of 100,000 random samples. The calculated PageRank score is along the Y-axis, and the original Common Search score is along the X-axis.

To grab your own results, you can run the following command in R (Just substitute your own domain):


Keep in mind that this dataset only has the top one million domains by PageRank, so out of 112 million domains that Common Search indexed, there is a good chance your site may not be there if it doesn’t have a pretty good link profile. Also, this metric includes no indication of the harmfulness of links, only an approximation of your site’s popularity with respect to links.
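A lookup along these lines can also be sketched in Python. This assumes each row of the file is a host name followed by a raw score, separated by a space, which matches the way the R code above reads the score from the second column. The function name and the sample rows are hypothetical stand-ins for the real file.

```python
import csv
import io

def load_scores(handle):
    """Read space-separated 'host score' rows into a dict of host -> raw score."""
    return {row[0]: float(row[1]) for row in csv.reader(handle, delimiter=" ") if row}

# Made-up rows standing in for pagerank-top1m.txt.
sample = io.StringIO("example.com 120.5\nexample.org 42.0\n")
scores = load_scores(sample)

print(scores.get("example.com"))      # raw score for a listed host
print(scores.get("not-listed.test"))  # None: the host is not in the file
```

To run it against the real data, the in-memory sample would be replaced with an opened file handle.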

Common Search is a great tool and a great foundation. I am looking forward to getting more involved with the community there and hopefully learning to understand the nuts and bolts behind search engines better by actually working on one. With R and a little code, you can have a quick way to check PR for a million domains in a matter of seconds. Hope you enjoyed!

Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.

DuckDuckGo Now Serves Up LGBT Legal Rights By Country Via Data From Equaldex

Privacy search engine DuckDuckGo is making it easy to find the legal rights for international LGBT communities using data from Equaldex, a collaborative database of LGBT rights.

Searching for “LGBT rights” or “gay rights” along with a country name will deliver an Equaldex tab at the top of the search engine’s results page, displaying the legal LGBT rights for the specific country:

(In a similar search for “gay rights United States,” the Equaldex tab at the top of the search results page was the last tab listed and had to be clicked to display the info.)

DuckDuckGo announced the search update this past weekend on Twitter:

Gay rights around the world? Find out instantly thanks to @Equaldex: #LGBT By @IsChrisW

— DuckDuckGo (@duckduckgo) November 15, 2015

Tumblr's New GIF Search Engine Takes The Pain Out Of Having To Use Actual Words

Because coming up with words can be so very difficult, Tumblr has added a GIF search engine to help users quickly locate an image that accurately expresses whatever it is they’re trying to write.

Tumblr says once a GIF is selected, it will be properly credited and the original GIF creator will be notified via their dashboard, phone and any other platform set up to receive Tumblr notifications.

“Since GIFs have replaced written language, we’re making it easier to turn your obsolete verbiage into modern moving pictures.”

To search for a GIF when writing a Tumblr post, click the “+” icon to the left of the screen, and then click the “GIF” button to find relevant images, as shown in the GIF here:

According to a report on TechCrunch, Tumblr is not relying on a third party, but is instead indexing GIFs that have been posted to the Tumblr platform.

“That means Tumblr users should be able to surface GIFs using less common keywords than on some other search services, including via unique Tumblr slang, sayings and other abbreviations that members of the various fandoms on Tumblr use,” writes TechCrunch reporter Sarah Perez.

TechCrunch reports that, with more than 239 million blogs and more than 80 million daily posts, Tumblr contains over 112 billion posts, many of which include GIFs that are now searchable.

How Search Engines Process Links

Have you ever wondered why 404s, rel=canonicals, noindex, nofollow, and robots.txt work the way they do? Or have you never been quite clear on how they all work? To help you understand, here is a very basic interpretation of how search engines crawl pages and add links to the link graph.

The Simple Crawl

The search engine crawler (let’s make it a spider for fun) visits a site. The first thing it collects is the robots.txt file.

Let’s assume that file either doesn’t exist or says it’s okay to crawl the whole site. The crawler collects information about all of those pages and feeds it back into a database. Strictly, it’s a crawl scheduling system that de-duplicates and shuffles pages by priority to index later.

While it’s there, it collects a list of all the pages each page links to. If they’re internal links, the crawler will probably follow them to other pages. If they’re external, they get put into a database for later.
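The link-collecting step can be sketched in a few lines of Python using only the standard library. This is a toy illustration of sorting links into internal and external lists, not how any real search engine crawler is built. The class name, function name, URLs and HTML snippet are all made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def split_links(page_url, html):
    """Return (internal, external) absolute URLs, de-duplicated, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    site = urlparse(page_url).netloc
    internal, external = [], []
    for href in collector.hrefs:
        url = urljoin(page_url, href)  # resolve relative links against the page URL
        bucket = internal if urlparse(url).netloc == site else external
        if url not in bucket:
            bucket.append(url)
    return internal, external

html = '<a href="/about">About</a> <a href="https://other.example/x">X</a> <a href="/about">Again</a>'
internal, external = split_links("https://site.example/", html)
print(internal)  # candidates to crawl next
print(external)  # stored for the link graph
```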

Processing Links

Later on, when the link graph gets processed, the search engine pulls all those links out of the database and connects them, assigning relative values to them. The values may be positive, or they may be negative. Let’s imagine, for example, that one of the pages is spamming. If that page is linking to other pages, it may be passing some bad link value on to those pages. Let’s say S=Spammer, and G=Good:

The page on the top right has more G’s than S’s. Therefore, it would earn a fairly good score. A page with only G’s would earn a better score. If the S’s outweighed the G’s, the page would earn a fairly poor score. Add to that the complications that some S’s and some G’s are worth more than others, and you have a very simplified view of how the link graph works.
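Here is a deliberately tiny Python sketch of that idea. The +1 and -1 values and the optional per-link weights are assumptions made up for illustration. Real link scoring is far more involved and is not public.

```python
# Hypothetical link values: a good page passes +1, a spammer passes -1.
LINK_VALUE = {"G": 1.0, "S": -1.0}

def page_score(inbound, weights=None):
    """Sum the value passed by each inbound link.

    inbound is a list of "G"/"S" labels; weights optionally scales each link,
    since some S's and some G's are worth more than others.
    """
    weights = weights or [1.0] * len(inbound)
    return sum(LINK_VALUE[kind] * w for kind, w in zip(inbound, weights))

print(page_score(["G", "G", "G", "S"]))            # more G's than S's: positive
print(page_score(["G", "S", "S"]))                 # S's outweigh G's: negative
print(page_score(["G", "S"], weights=[0.5, 2.0]))  # one heavy S beats one light G
```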

Blocking Pages With Robots.txt

Let’s go back to that original example. Suppose the robots.txt file had told the search engine not to access one of those pages.

That means that while the search engine was crawling through the pages and making lists of links, it wouldn’t have any data about the page that was blocked in the robots.txt file.

Now, go back to that super simple link graph example. Let’s suppose that the page on the top right was that page that was blocked by robots.txt:

The search engine is still going to take all of the links to that page and count them. It won’t be able to see what pages that page links to, but it will be able to add link value metrics for the page — which affects the domain as a whole.
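The following Python sketch shows the two halves of that statement side by side, using the standard library's robots.txt parser. The robots.txt rules, URLs and link list are hypothetical. The point is only that a page can be off limits to the crawler while links pointing at it are still on record.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a single page for every crawler.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private.html",
])

# Hypothetical (source, target) links gathered while crawling the allowed pages.
links = [
    ("https://site.example/", "https://site.example/private.html"),
    ("https://site.example/a.html", "https://site.example/private.html"),
    ("https://site.example/", "https://site.example/a.html"),
]

blocked = "https://site.example/private.html"
can_crawl = robots.can_fetch("*", blocked)                      # the crawler may not read the page
inbound = sum(1 for _, target in links if target == blocked)   # but links to it are still counted

print(can_crawl, inbound)
```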

Using 404 Or 410 To Remove Pages

Next, let’s assume that instead of blocking that page with robots.txt, we simply removed it. So the search engine would try to access it, but get a clear message that it’s not there anymore.

This means that when the link graph is processed, links to that page just go away. They get stored for later use if that page comes back.

At some other point (and likely by a different set of servers!), priority pages that are crawled get assigned to an index.

How The Index Works

The index identifies words and elements on a page that match with words and elements in the database. Do a search for “blue widgets.” The search engine uses the database to find pages that are related to blue, widgets, and blue widgets. If the search engine also considers widget (singular) and cornflower (a type of blue) to be synonyms, it may evaluate pages with those words on the page as well.
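A toy version of that lookup can be sketched with an inverted index in Python. The pages, the synonym table and the function names are all made up, and a real index stores much more than bare words.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

# Hypothetical synonym table, mirroring the example in the text.
SYNONYMS = {"blue": {"cornflower"}, "widgets": {"widget"}}

def search(index, query):
    """Return pages matching any query word or one of its synonyms."""
    hits = set()
    for word in query.lower().split():
        for term in {word} | SYNONYMS.get(word, set()):
            hits |= index.get(term, set())
    return hits

pages = {
    "/a": "blue widgets for sale",
    "/b": "cornflower widget catalog",
    "/c": "red gadgets",
}
index = build_index(pages)
print(sorted(search(index, "blue widgets")))
```

This only finds candidate pages. Ranking them is the separate step described next.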

The search engine uses its algorithm to determine which pages in the index have those words assigned to them, evaluates links pointing to the page and the domain, and processes dozens of other known and unknown metrics to arrive at a value. If the site is being filtered for poor behavior like Panda or Penguin, that is also taken into account. The overall value then determines where in the results the page will appear.

This is further complicated by things webmasters might do to manipulate values. For example, if two pages are very similar, a webmaster may decide to use rel=canonical to signal the search engine that only one of those pages has value. This is not definitive, though. If the “cornflower widget” page is rel=canonical-ed to the “blue widgets” page, but the cornflower widget page has more valuable links pointing to it, the search engine may choose to use the cornflower widget page instead. If the canonical is accepted, the values of both elements on the pages and links pointing to the pages are combined.

Removing Pages With NoIndex

Noindex is more definitive. It works similarly to robots.txt except that instead of being prevented from crawling that page, the search engine is able to access it, but then is told to go away. The search engine will still collect links on the page to add to the database (unless a directive on the page also indicates not to follow them, i.e. nofollow), and it will still assign value to links pointing to that page.

However, it will not consolidate value with any other pages, and it will not stop value from flowing through the page. All noindex does is request the search engine not assign the page to its index.

Therefore, there is only one definitive way to stop the flow of link value at the destination. Taking the page away completely (404 or 410 status) is the only way to stop it. 410 is more definitive than 404, as you can read here, but both will cause the page to be dropped out of the index eventually. There are multiple other ways to stop link flow from the origination of the link, but webmasters seldom have control over other sites, only their own.
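The differences described in this primer can be summarized as a small Python lookup table. The field names are my own shorthand, and the values simply restate the simplified model in this article rather than documented search engine behavior.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Handling:
    content_read: bool        # can the crawler read the page?
    outbound_collected: bool  # are links on the page added to the database?
    inbound_counted: bool     # do links pointing at the page keep their value?

# Simplified model from this article; nofollow on a noindex page would switch
# outbound_collected off.
HANDLING = {
    "robots.txt block": Handling(content_read=False, outbound_collected=False, inbound_counted=True),
    "noindex": Handling(content_read=True, outbound_collected=True, inbound_counted=True),
    "404/410": Handling(content_read=False, outbound_collected=False, inbound_counted=False),
}

stops_inbound_value = [name for name, h in HANDLING.items() if not h.inbound_counted]
print(stops_inbound_value)
```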

Hopefully, this primer has helped you understand how pages are accessed by search engines and the difference between robots.txt, noindex, and not found, especially as they relate to links. Please leave any questions in the comments and be sure to check out my session at SMX Advanced: The Latest in Advanced Technical SEO.
