To me, the main measure of search engine quality is whether it can find pages I know exist, because I've read them recently (they're still in my history), given the terms that make sense to me for that page.
By that metric there are no good search engines at the moment, and the older the pages, the worse this effect gets. It's really nice to see Google do lots of 'moonshots' and interesting tech demos, but I'd be far happier if they fixed search and kept their focus on that.
If a page doesn't show up in either Google or Bing for sensible queries, then that page effectively ceases to exist. These companies have a perverse incentive to keep you away from the page with the relevant results as long as you spend more time on pages carrying their advertising, and that ensures more and more content will end up missing in action.
Yup. Google has been pretty awful for the past, oh, 5-10 years or so. They used to be a high quality search engine, a way to index the depth of material on the internet. Now they are just a semantic front-end for the most popular content on the internet. Do you want to find something from wikipedia, youtube, medium, the new york times, amazon, etc? Google does great. Do you want to search for something that thousands of other people also search for routinely? Google does great. This role is also the easiest to monetize for google (through promoted links). But if you want to search for something highly technical or very specific, google is now terrible, in fact it's worse than it used to be 10 years ago.
I attribute this to the mass of mobile/social users who changed the search market. Most people in these groups search for naturally popular things like Saylor Twift [legs] or whatever is in the trends right now. I wish we had unpopular search engines that suck at pop. I also miss directories; while never complete, they gave a good overview of technology sections and many other areas that you can't just google, because you don't know they exist. Before the internet my family had a big encyclopedia collection that I, as a kid, occasionally opened, skimmed, and read about something new. That isn't possible with Wikipedia and the internet, which are now overwhelming and have no good place to start anymore. Our average attention volume is so narrow (relative to the amount of information) that it became a product. I also miss the days when you had to investigate a topic, make yourself fluent in it, and enter 'the club' of highly interested people. Now anyone can google shallow pop-info on anything and pretend to be educated in it within minutes. That has degraded many good groups as a result.
Experience, and being able to see directly the different behavior of the search engine. I watched as it happened. Google used to be optimized for producing the minimum number of results for your query, biasing towards specificity. This worked great for tech savvy folks who knew how to craft searches with a high degree of specificity. Then they switched to biasing towards a higher number of search results, biasing towards correcting your search to match some other popular search. It's so bad now that google will just completely drop words from your search as terms in order to present the results it "thinks" you want, and then you have to go out of your way to force it to actually care about those search terms (and this isn't because the more specific / restrictive search has no results, it just has results that google doesn't "like" as much).
You can craft a google search with greater specificity but it's very difficult to obtain the sort of search behavior that google used to have. Now google treats your search terms as sort of a grab bag, it mutates them into a cloud of synonyms and related words, then it picks a subset of the grab bag that it decides is valuable and gives you results that are tuned by about a zillion arcane heuristics. This works great for giving you "magically" accurate answers for the most common search queries. It works terribly for giving you highly specific answers to highly specific queries. The way google used to work was by providing results that matched all of your search terms, and being smart enough to include different variations of each word but not vaguely related words. That sometimes made it hard to find the right thing if you didn't get the right words but now we're in a state where you can't find the right thing even if you do have all the right words.
Funny, Google has some perverse incentives here: it might be nice to have a good "history search" built into a browser, but as a search engine provider they won't build it into Chrome.
Now that we live in the future I guess you need never "clear your cache" except maybe for privacy reasons. You could keep full page text for just about any site you visit (so long as the authors don't consider their site a "web app" I guess.)
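For concreteness, here's a minimal sketch of what keeping full page text could look like, assuming you can pull the text of each visited page out of the browser somehow and that your SQLite build has FTS5 (most modern ones do). All the names here are made up for illustration:

```python
# A minimal sketch of a local "history search": store the text of every page
# you visit in a SQLite full-text index and query it later. Assumes the
# sqlite3 build has FTS5 compiled in; the schema and function names are
# purely illustrative.
import sqlite3

db = sqlite3.connect("history.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, title, body)")

def remember(url: str, title: str, body: str) -> None:
    """Save the full text of a visited page."""
    db.execute("INSERT INTO pages VALUES (?, ?, ?)", (url, title, body))
    db.commit()

def search_history(query: str, limit: int = 10):
    """Return the best-matching pages you've already read, ranked by FTS5."""
    return db.execute(
        "SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
```

A browser extension or logging proxy would call remember() on every page load, and search_history() becomes the personal "I know I read this somewhere" engine described above.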
One component I would include is weighing a site down proportionally to the amount of ads it loads, since in my experience that anticorrelates with the quality and trustworthiness of the site.
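Purely as a sketch of the kind of weighting I mean; the penalty factor and functional form are arbitrary assumptions, not anything a real engine is known to use:

```python
# Hypothetical sketch: demote a page's relevance score in proportion to how
# many ad resources it loads. The 0.1 penalty weight is an arbitrary choice.
def adjusted_score(relevance: float, ad_requests: int, penalty: float = 0.1) -> float:
    """Combine a base relevance score with a per-ad-request penalty."""
    return relevance / (1.0 + penalty * ad_requests)

# A clean page keeps most of its score; an ad-heavy one drops sharply.
print(adjusted_score(0.9, ad_requests=2))   # 0.75
print(adjusted_score(0.9, ad_requests=30))  # ~0.22
```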
Yahoo hasn't had its own search engine for years. In 2010, they became essentially a frontend for Bing. In a later 2015 deal they switched the backend to using Google.
Duckduckgo is a metasearch engine, technically, but mostly it delegates to Bing.
As far as I can tell, there are only two and a half real search engines that still exist: Bing, Google, and Wolfram Alpha. (I count Alpha as a half because it's not really what most people are looking for.) I'm curious if anyone else knows of other real search engines still in existence.
Bing would be unable to associate series of queries with users.
As long as DDG are doing it properly (and I believe they are), Bing would only learn that the contents of each individual query are associated together, they would learn nothing about which other queries were performed by the same user.
I think the concern isn't necessarily that Bing would associate query X with person Y. The concern is that Bing would even know that query X exists. For example, if Bing saw a spike in searches for "Aramco IPO July 4, 2018" and were to reveal it to a human or store it, that might be a serious leak of non-public information. Many searches reveal private information, even when they aren't associated with a user.
> if Bing saw a spike in searches for "Aramco IPO July 4, 2018" and were to reveal it to a human or store it, that might be a serious leak of non-public information
Maybe I'm missing something obvious here, but how is that any different from Google or DuckDuckGo seeing the same spike?
Well, you might trust DDG as a good actor but not a third party. Learning that this information is exposed to a third party (even if unattributable) would breach their trust in DDG. Whether that's reasonable, or whether DDG are misleading people in that regard, is another matter. Personally I still use them a lot, and will continue.
I just think there is a point to be made here. Even in general it's often opaque which third parties hold which data, and I don't really think GDPR has fixed that. It's surprising to people that Bing might have the contents of their DDG search history, somewhere in the huge dataset of DDG searches that pass through.
Also they might not want to help improve Bing search but I'm guessing they do inadvertently?
Intel SGX is the only answer at the moment. The Signal messenger uses it, so address book matching is private. It requires the user to trust the server hardware vendor (Intel) instead of also the cloud provider.
That would not stop the Bing query matcher (or indeed the Signal address book matcher) from being able to look at the contents of its own secure enclave.
The trick is that every user uploads his own matcher. The server only sees encrypted matchers, feeds them data and returns the encrypted results. You as a user decrypt your results and nobody (except Intel) was able to see them.
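To make the data flow concrete, here's a rough simulation in plain Python; it is only an analogy. Real SGX involves remote attestation and hardware-enforced memory encryption, the key exchange here is hand-waved, and this is not Signal's actual protocol (the third-party cryptography package's Fernet is just a stand-in cipher):

```python
# Conceptual sketch only: the host process handles nothing but ciphertext,
# and matching happens "inside" a boundary the host cannot inspect. In real
# SGX that boundary is enforced by hardware; here it is just a comment.
from cryptography.fernet import Fernet

enclave_key = Fernet.generate_key()   # in reality, negotiated with the attested enclave
enclave = Fernet(enclave_key)

def untrusted_server(ciphertext: bytes, directory: set) -> bytes:
    """The untrusted host only sees encrypted blobs in and out."""
    # --- inside the enclave boundary ---
    contacts = enclave.decrypt(ciphertext).decode().split(",")
    matches = [c for c in contacts if c in directory]
    return enclave.encrypt(",".join(matches).encode())
    # --- end enclave boundary ---

# Client side: encrypt the address book, send it up, decrypt the reply.
query = enclave.encrypt(b"alice,bob,carol")
reply = untrusted_server(query, directory={"bob", "dave"})
print(enclave.decrypt(reply).decode())  # bob
```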
Did Yahoo ever have its own search engine, technically? In the early web it was a directory maintained by humans, which made sense at a time when the total number of pages in existence on any given subject was no more than a few hundred; I thought that when that era passed they went straight into licensing other search engines' results.
I worked at Google on search indexing at the time Yahoo switched from their own search engine to using Bing. At the time, by most of Google's own search metrics, Yahoo had a product superior to Bing. If Bing had been spun off as a separate company, or otherwise hadn't had access to Microsoft's deep pockets and default IE search status, it's likely Yahoo would have fared better.
I was at Yahoo during that time, although not in web search. From what I could tell, company leadership was frustrated with lack of growth in search market share, and didn't want to invest in it anymore.
Yahoo was running user studies where they would put Google results and Yahoo results side by side but switch the branding; while Yahoo's results were rated better than Google's for most of the tested queries, results with Google branding were rated better than those with Yahoo branding, regardless of whose results they actually were.
The plan was to just use Google, but the DOJ (or FTC?) put out guidance that that would be anti-competitive, so Bing was it. This might have worked out anyway, but the expected cost savings from outsourcing search never materialized as far as I saw. I left in late 2011 and stopped following closely after that. Web search was also linked with search ads, which Bing did poorly at too.
Google also ran similar user studies, sometimes between Google and other search engines, and sometimes between production Google and a proposed change.
One tough thing is that there isn't one search quality metric. It's one thing to have the search results page look good with its snippets, and another to have people actually look at the linked pages and compare how useful those pages are.
Common vs. uncommon searches are also important. It's not difficult to write a search engine that badly over-fits on the most common searches. However, for market share, it's important to do well enough on the common searches that users don't leave, and do well enough on tough long-tail searches that you pick up users that leave other search engines on tough queries. The idea is to be pretty good at the common searches, but the best at the kinds of searches that cause people to try other search engines. Naive frequency-weighted metrics will get this totally wrong.
It's also more important to get useful information in the first 2 or 3 links. If Google links to the second-best link at result #1 and puts the best link off the first results page, but Yahoo puts the best link down at #7 and second-best at #8, the user may lose interest before following a really good link.
I don't think Google took the union of front-page search results between two competitors and asked humans to hand-order the (up to 40) pages for how well they fit the query. But, that seems like a good way to test the actual usefulness of search results. You'd probably especially want to keep track of the percentage of the top 3 search results that were filled by top-5 (guessing at 5) useful links.
Anyway, inside Google it was well-known that Yahoo was the competitor to worry about in terms of search quality.
Yes, before they used google. It's a pretty interesting story, actually, how Yahoo felt that they should use the best underlying search engine with a "white label" approach, and how Google succeeded in eventually building a very strong brand despite being invisible.
Mostly this is just me missing the websites of the early 2000's, and trying to figure out a way to rediscover them.
And I'd probably want content on top of this. (Edit: e.g. search by topics)
Lastly, it'd be nice to restrict things to sub-genres, but I'm not sure. E.g. when I'm doing a search I'd love to reference things related to micro-controllers, and so maybe I'd put in Arduino to get into the realm. Sort of like what Google does for you without telling you (tailoring your searches by some magic context).
A man can dream...
Edit 2: Search engines these days seem to be answer engines, I want a research engine.
> Mostly this is just me missing the websites of the early 2000's, and trying to figure out a way to rediscover them.
Apparently somethingawful and ebaumsworld are still around in some form, and Slashdot of course. The thing is, these sites have largely been replaced by better versions of themselves. That’s resulted in a lot of centralization into a few sites like reddit, which is a combination aggregator and blogging platform for people who are too embarrassed to attach their real name to what they write, which is apparently a good share of the population. Then there are sites like YouTube, LiveLeak, Facebook, that just offer something that no one could or did in the 2000s. And with mobile and apps, there’s a level of engagement that doesn’t leave much room for a thousand little sites with quirky, regular, custom content.
I'm not talking about the large sites though, I'm talking about the small sites. Perhaps I should have said the feel, and not the sites, of the early 2000's.
Right now Google thinks it knows what you want and when you search for things, it returns the same few sites (mostly). You used to come across people's personal sites into which they poured their soul. And while those exist less frequently now, I bet they still exist.
I'd like to be able to search for sites that return a 406 containing a specific string so that I could find APIs that implement particular media type standards.
I don't know if this is still the case but when I worked for Lycos 10 or 11 years back they owned Hotbot so including both in the list is a bit redundant.
They also -- despite being one of the first ever search engines -- didn't do their own search in 2008. They outsourced to Yahoo. Though there was an effort at the time to become a search engine again. I don't know if anything came of it.
Edit:
It's hilarious they labeled Lycos as...
> Lycos—is still around!
Because even at the time I worked there, the number one response I got from people when I told them that was "They still exist?"
What was it like to work there? I know it was a decade ago but I am still curious what it would be like to work at these mostly forgotten companies that still manage to exist.
But the short summary: I loved it. It was a really fun company with a lot of great people, and we got to launch some really great products. Most of those people were let go during the great recession (myself included), but it was fun while it lasted.
On the flip side, every time we launched a product the news media treated it as a novelty instead of a serious thing. Which was insanely frustrating. Some of our tech was way ahead of its time.
So why aren't there more search engines these days? Google is great but we constantly talk here about how it's losing its edge for certain kinds of more specific searches like technical ones. So seems like there's room for engines that are more tuned for special use cases and the ability to index web pages has only gotten cheaper since Google started. Has the size of the web made this impractical? Or do I just not know about these options?
Utter layman's guess: While indexing has gotten easier, the web has expanded exponentially in the meantime - likely far outstripping any technical gains. I would be surprised if indexing the modern web is feasible without significant time and capital.
Also, I can think of a couple specialized engines that do exist: Google Scholar and Shodan. There are probably more I'm unfamiliar with.
Not only is it infeasible, but there's so much garbage out there that we have no simple way to filter it out from scratch to the point that anyone could actually use it. Google and Bing have feedback loops with their users that prevent crap from rising to the top of search results. A plain index (or even using something like pagerank) wouldn't have this huge benefit, and you'd never be able to get your search engine off the ground.
The problem is that Google and Bing really are duopolistic.
You'd think that a monopoly could just break instantly if all it required was typing in a different URL.
But modern search engines are reliant on machine learning on mind-bogglingly enormous troves of real human interaction data.
If you truly outsmart Google by inventing a better mousetrap, it's worth fuck-all. Your solution will probably require more usage data than you will ever be able to collect, because nobody will use your search engine while it still produces poor results.
Well, ML on 100 billion searches is probably going to more closely approximate user intent than a living brain in a tank, because that brain doesn't know what the hell "lkw attachment" means.
Looks like German-English bilingual logistics professionals are looking for truck parts vendors, while teenage Americans are looking for hidden locations of laser focusing and enhancement devices within the video game Wolfenstein: The New Order.
Well, the interface offered by blekko's Izik tablet search engine was that it would show 2 categories in the answer, one related to automotive and one related to games.
Google has done this for a while, but at the level of recognized entities, not at the level of individual queries. Is that how the engine you're referring to worked too? Eg, would the "lkw attachment" example above have partitioned results?
In Izik's tablet interface, each category was a separate row of results. So in this example, there would be 2 rows, one for automotive parts and one for games, and if you scroll horizontally in a row you get more results in that category.
I think that's what you meant by partitioned results.
Google computes this internally but I've never seen them use it for anything other than having diversity in their top 10 results.
Yea, it's explicitly separated, just not shown at the same time. For example, if you search for "kings", there will be a couple of bubbles at the top of the page with different entities: "Kings" (2017 film), "Sacramento Kings" (basketball team), etc. Clicking on one of those will show you a list of results that only pertains to that entity. This feature has been around for years, and is part of the series of "things, not strings" features they've been working on.
As I said, Google is pretty conservative about this and other entity-based features, so they definitely wouldn't do it for something like "lkw attachment". My question was whether Izik triggered this feature in such cases or not.
That Google feature is like "related queries", that kind of feature has been around for more than a decade. If you click on the "Kings (2017 Film)" link it runs a search for [Kings 2017]... which just adds 2017 as a keyword to a conventional search. No semantic search is involved.
Izik would show you film-related website results for the film category.
I think these exist, they've just become more specialized. Think of these sites you might have seen before:
- alternativeto.net and similar
- Google Scholar / Semantic Scholar
- Every sandboxed social network (Facebook, Twitter, Tumblr)
- A variety of Instagram searching sites
- Alternative App stores for Android
And there's room for plenty more sites like this.
Attempting to return a good response for anything in the search box is a bottomless problem of unclear utility. Being more focused makes the work easier and the value for the user clearer.
Google jumped into the market when it was still possible to compete. And they came to the table with a product that was faster, better, cheaper, and easier to monetize. Google-style data centers reduced costs; sharding and map-reduce plus a streamlined design improved speed; PageRank improved quality; and low-cost, fast searches meant that low-cost advertisements could still bring in a lot of RoI. From that kernel they grew to dominance, becoming synonymous with the very term "search". Now we live in a different era, one where search is integrated into everything on every platform and where replacing the default search engine is a huge uphill battle.
Let's say that someone creates a better search engine; how would people actually use it? I'll tell you how most people who bothered to use it would integrate it into their routines. Firstly, they would still use Google for everything day to day. They would still use Google Maps for directions. They would still use Gmail for mail. They would still use Google search as the semantic front-end for their browsing. Only after they performed a search using Google that produced unsatisfactory results would they then pull out the better search engine and use it for that one isolated search. And that's the problem, because that scenario is very hard for the better search engine maker to monetize, while Google would continue reaping the major monetization haul for the vast majority of that user's searches. And going from zero to being integrated into a user's experience as completely as Google is now is not a realistic prospect for most startups.
I'd love a search engine with more querying power limited to a niche. But I doubt limiting topical scope would do much to keep computational complexity from blowing up. A while back some startup was charging per search to work around that.
I wasn't really thinking a niche topic but a niche use-case. Google is bad at doing literal, verbatim string searches these days and all I want sometimes is like Google 2005ish era search, just PageRank based and no modern machine learning, etc. You can have it simpler than Google did at the time since SEO against your niche engine is unlikely, so tech wise you might be more like 1998 Google. The main bottleneck seems to be what others here are suggesting: there's just too much web to index these days unless you're Bing-scale or bigger.
> Google is bad at doing literal, verbatim string searches these days and all I want sometimes is like Google 2005ish era search, just PageRank based and no modern machine learning, etc
You can get most of the way to this with the verbatim option, but I think they make it difficult to make the default.
> By using the Verbatim tool Google will not make the following changes:
• Personalizing your search using websites you have visited before;
• Including synonyms of your search terms;
• Automatic spelling corrections;
• Searching for words with the same stem e.g. “Shopping” when searched for “shop”;
• Finding results that match similar terms to those in your query.
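As far as I know, the Verbatim tool also maps to the tbs=li:1 URL parameter, so you can bake it into a browser keyword / custom search engine and get it close to "by default". A small sketch, with that parameter being the only assumption:

```python
# Build a Google search URL with the Verbatim tool pre-enabled (tbs=li:1 has
# been the Verbatim switch for years, though Google could change it).
from urllib.parse import quote_plus

def verbatim_url(query: str) -> str:
    """Google search URL with Verbatim turned on for this query."""
    return "https://www.google.com/search?q=" + quote_plus(query) + "&tbs=li:1"

print(verbatim_url('"lkw attachment" site:example.com'))
```

Registering that URL pattern as a custom search engine in the browser is about as close to making Verbatim the default as Google currently allows.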
Well that's sort of what I'm thinking of too. I'm talking about the idea that this might be doable by limiting topical scope. So for example you'd only index some portion of tech related sites.
I think the transformation has been more subtle: App Store SEO is a thing, and voice searches are becoming more common. Document search is still Google, though.
"What is it with these nearly twenty year old sites still up?" Not sure but I believe some of the answer lies with adtech distribution needs. The search ads demand traffic, however astro-turf it be.
Search is alive and well. I'd recommend reading some of the latest textbooks and research papers on information retrieval. The industry was given new life about 5 years back with knowledge graphs and has been reborn again with recent innovations in machine learning, cloud computing, and data mining technologies.
I'm working on a project now which has indexed billions of pages and answers queries similar to a web search engine like Google:
https://www.AtSign.co/
The only difference is that it's a keyword + location based business contact information engine but operates on the same principles as a real web search engine client.
We're a small team and it would have been unthinkable even a few years back to launch something of this scale effectively ... but here we are! Amazing space to be in right now.
ugh. compare the results for tech support bridgeport, ct to google. or just look at them without comparing to google. awful! No offense but you aren't even doing the most obvious rule based filtering/ordering on cities/states in your result sets.
Hi Greg, we don't offer filtering by cities at the moment, only states/countries, so you wouldn't have been able to look up "Bridgeport" specifically, right? A lot of people punch in a city and hit "search", but what they get as a response are matches from "any country", which is the default. That's why you didn't see the basic filtering you were expecting.
Regardless, I just looked at our results for tech support in CT and I agree we need to work harder on our results, but comparing to Google, they only had tech support jobs (not even business listings)... which makes sense in their product use case.
I can see what you're saying and Google is by far, the industry benchmark, but it's also difficult to compare results sometimes ... it's like Apples and Oranges.
I don't want to get too much into the weeds, but there is a whole subset in Information Retrieval which relates to IR system evaluation, or search engine result evaluation. One simple way of doing it is simply labeling the accuracy of each result via human curator as either a 0 or a 1.
But it can get really complicated. For example, sometimes there just aren't relevant documents in the index in the first place ... so you can't really blame your ranking factors too much. The opposite can happen too, where a word occurs too frequently, in which case you might resort to other kinds of ranking factors (most notably PageRank).
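To make the 0/1 labeling idea concrete, here's a toy scoring sketch; the rater judgments in the example are invented:

```python
# Score one query's ranked results from binary (0/1) human relevance labels,
# using two standard IR evaluation metrics.
def precision_at_k(labels: list, k: int) -> float:
    """Fraction of the top-k results that a rater marked relevant."""
    top = labels[:k]
    return sum(top) / len(top) if top else 0.0

def average_precision(labels: list) -> float:
    """Mean of precision@i over the positions i that hold a relevant result."""
    hits, score = 0, 0.0
    for i, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0

ratings = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]  # rater judgments for one query's top 10
print(precision_at_k(ratings, 3))   # ~0.67
print(average_precision(ratings))   # ~0.75
```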
In our case, we're focused on broadening our state/country-level coverage for keywords right now (more listings), then we're going to focus on making sure our location accuracy is a lot better (it needs work). Over time, you should get the results you're expecting more often :)
All of those are 'meta' search engines. There are three English-language indexes of any size available: Google, Bing, and Yandex. All of the other search engines go to one of those three for most if not all of their queries. Some of the bespoke engines have local indexes of things like Stack Overflow or Wikipedia (both fairly easy to index) to save on cost, but all the others use the big three (and mostly the big two, because Yandex pulled their servers out of Nevada, which added 300-800 ms of latency to their searches).
Most of these used BOSS (aka Yahoo!'s old build your own search service API) which was served off Bing as its index, although Google has started paying more and more people to send their search traffic to Google.
Bing charges $7/thousand [1] for their "quality" searches and $3/thousand for their so-so searches (not as current, and the index doesn't go as deep; this tier is roughly what BOSS offered until they turned it off in 2016).
That $7/thousand lets you give them up to 250 queries per second. For reference, that is about 1-5M uniques per day. It looks like 21M searches a day, but for English most of the searches come during the day from Europe and the US, so you're really only going to do 10-15M searches per day at that rate. If you are clever you can cache results, so for the same search you can just re-use the cache rather than paying for another result. This is nominally frowned upon but hard to defend against. If you manage to make a deal with a phone supplier to be the 'standard' search engine, a lot of queries will just be 'facebook' or 'reddit', so you don't really need to actually query those.

You will want to find some ad networks to provide you ads. Bing will do that too, but you will quickly figure out that if you could make money reselling Bing results with Bing ads, then they could do that too, so you'll find the margins pretty thin and negative at times. You'll have to pay for a machine that is taking those queries, calling out to whatever ad networks you want, and then filling out a results page (SERP) and sending it back to the consumer. If you are just fronting Bing or Yandex, that is pretty straightforward to do with an nginx server on an AWS "large" instance.
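Rough numbers for the above, as a back-of-envelope sketch (the realistic daily volume and cache hit rate are my own assumptions, not Bing's figures):

```python
# Back-of-envelope math for the quoted pricing ($7 per thousand queries,
# 250 queries/second ceiling).
PRICE_PER_QUERY = 7 / 1000          # dollars
MAX_QPS = 250

theoretical_daily = MAX_QPS * 86_400   # 21.6M searches/day at full tilt
realistic_daily = 12_000_000           # traffic is bursty; call it 10-15M/day
cache_hit_rate = 0.30                  # assumed share of repeat queries served from cache

billable = realistic_daily * (1 - cache_hit_rate)
print(f"theoretical ceiling: {theoretical_daily:,} searches/day")
print(f"daily Bing bill:     ${billable * PRICE_PER_QUERY:,.0f}")  # ~$58,800
```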
If you negotiate well and market well you can be a dogpile or a startpage with some schtick that makes you different than just going to Google or Bing. The more privacy you afford the clients the more margin you give up (because you can't sell that information as well).
Bottom line is that it's a hard way to make a living.
I wish Google would let us use both Verbatim (use all the keywords I entered, as-is, instead of assuming you know better than I do what I meant to search for) and filter by date (to get the most recent results first), because right now you have to choose between relevant but outdated results, or irrelevant but recent ones, both of which are frustrating.
Seems like that's the wrong title though: the article is showing that search engines (other than Google, Bing, Yahoo) are still a thing. Maybe "alternative search engines are still a thing"?
Author here. Back in the mid 90s, search engines were a thing, and you had many companies trying to provide search results in the emerging web. In 1996-97, it was the in thing to run a search engine.
I still don't get the title. Search engines were a thing - yes everyone knows that so that bit doesn't say anything - and they're still a thing - well ok we all already knew that as well - and if they're a still a thing why did you say they were a thing just before? It's two useless statements, and one is redundant because of the other! And what does 'a thing' really mean? That they exist? Why say 'a thing'? It must be the title with the least information possible.
I'd be happy to pay for a search engine that:
- actually really works
- also allows you to search past page 10
- has a working API with reasonable limits