Search Engine Bots Generating Strange Queries

MikeK-2

In a general CMS app written in CakePHP I am noticing in my logs
invalid queries being generated by various search engine bots
including Google, Inktomi, and Yahoo.

What I'm wondering is WHY?

For example they are requesting

http://mysite.com/controller/view instead of the correct

http://mysite.com/controller/view/34 (ex: id 34)

Nowhere on my site do I publish any links to /controller/view without
an id param.

This is driving me slightly nuts. Why would a bot request a URI it has
never seen?

My validation code that checks for valid requests logs these
occurrences, and every day I puzzle over my logs and examine the emitted
web page source wondering where or why they are requesting these
invalid URIs. I've been dumping $_SERVER and there are no clues there
either. The referer is always '/'.


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "CakePHP" group.
To post to this group, send email to [hidden email]
To unsubscribe from this group, send email to [hidden email]
For more options, visit this group at http://groups.google.com/group/cake-php?hl=en
-~----------~----~----~----~------~----~------~--~---

Re: Search Engine Bots Generating Strange Queries

David C. Zentgraf

I'm no expert on this, but I'd guess that the bots are simply
trying to walk the tree.
If "http://mysite.com/directory/subdirectory/subsubdirectory" is
valid, then "http://mysite.com/directory/subdirectory",
"http://mysite.com/directory" and "http://mysite.com" are probably
also valid. The GOOG doesn't know that those directories don't
actually exist. In "classic" web development patterns there should be
an index.htm file in each of these directories, so it can't hurt to
look for them.
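
A minimal sketch of that tree walk, written in Python rather than CakePHP and only as an illustration of the guess above. The function name `parent_urls` is hypothetical, and no particular crawler is known to work exactly this way.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """Return the URLs obtained by dropping trailing path segments one at a time."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    # Drop one trailing segment per step, ending at the site root.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


for candidate in parent_urls("http://mysite.com/controller/view/34"):
    print(candidate)
```

Starting from a valid /controller/view/34 link, the first candidate this produces is /controller/view, which is exactly the request showing up in the logs.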

BTW: Safari (and possibly other browsers as well) lets you right-click
on the title bar and offers the same kind of "URL shortening
shortcuts" in a popup menu.


Re: Search Engine Bots Generating Strange Queries

MikeK-2

So you're saying the search bots are just walking all my actions as if
they are subdirs on a site? Not sure about this.

Maybe I should disallow those specific requests with robots.txt? Any
other cakers have an opinion on this? If I disallow

www.mydomain.com/controller/action/ won't the bots stop walking all the
actions?
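
One thing to check before trying this: robots.txt rules traditionally match by path prefix, so a rule for the bare action would also cover the URLs that carry an id. The sketch below uses Python's standard `urllib.robotparser` to show that behaviour. The rule, the paths, and the `ExampleBot` name are hypothetical, and individual crawlers may interpret rules differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that disallows the bare action path.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /controller/view",
])


def allowed(url):
    """Return True if a compliant bot may fetch url under the rules above."""
    return rules.can_fetch("ExampleBot", url)


for url in (
    "http://mysite.com/controller/view",
    "http://mysite.com/controller/view/34",
    "http://mysite.com/controller/index",
):
    print(url, allowed(url))
```

Under this parser the rule blocks /controller/view/34 along with /controller/view, so a compliant bot would stop fetching the real pages too.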

Re: Search Engine Bots Generating Strange Queries

mathew-2

Hi Mike,

Disallowing that in your robots.txt is a waste of time.

The robots.txt file was started by Google, and is not an officially
supported feature of all crawlers. So they don't have to follow it,
and I can tell you this doesn't sound like the google bot anyway,
because that bot doesn't generate phantom URIs.

Web crawlers can extract URIs from many different sources, and they
can generate URIs as they see fit. URIs can come from HTML, CSS, SWF,
JavaScript, and form post/get actions. I've even seen crawlers submit
post requests to generate more URIs to crawl.

Crawlers will also rewrite URIs by removing ids or changing queries,
fake cookies, and sometimes rotate their IP addresses.

There are no rules about crawlers, no guidelines they have to follow,
or limits on how long they will crawl or how aggressively they will
request URIs from your server.

You should modify your Routes to point to a 404 if they request paths
that you don't want them to see.
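
As a framework-agnostic sketch of that rule (Python, not CakePHP routing code), the check below answers 404 for a view URL that lacks a numeric id. The pattern `VIEW_PATH` and the function `status_for_view` are hypothetical names, and the assumption that ids are plain digits comes from the /controller/view/34 example earlier in the thread.

```python
import re

# Hypothetical URL shape: /<controller>/view[/<id>][/]
VIEW_PATH = re.compile(r"^/(?P<controller>[a-z_]+)/view(?:/(?P<id>[^/]*))?/?$")


def status_for_view(path):
    """Return 200 or 404 for a view URL, or None if path is not a view URL."""
    match = VIEW_PATH.match(path)
    if match is None:
        return None  # some other action; decided elsewhere
    record_id = match.group("id")
    # Only a non-empty, all-digit id is treated as a valid request.
    if record_id and record_id.isdigit():
        return 200
    return 404


print(status_for_view("/controller/view/34"))
print(status_for_view("/controller/view"))
```

A real application would still need to confirm that record 34 exists before answering 200.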

Re: Search Engine Bots Generating Strange Queries

Novice Programmer
Great advice, Mathew. Yes, I think this is the way to go: point every /controller/action that doesn't mean anything without an extra id to a 404. Once the crawler sees this 404 it would never try to fetch the same thing again.

Thanks.

--
Thanks & Regards,
Novice.

Re: Search Engine Bots Generating Strange Queries

MikeK-2

Thank you Mathew - I log it every time before throwing the 404 and I
figured whatever was creating these things would stop - but it
continues. I'm so dadgum anal obsessive it just kills me - hard to
ignore...

It is not coming from any 'known' bot either...

Re: Search Engine Bots Generating Strange Queries

mathew-2

Hi Mike,

If you're using Apache, it has some features in the .htaccess file that
will allow you to disable access to your server for bots causing you
trouble.

In your Cake 404 display page keep track of the number of times a 404
is generated per IP address, and if it exceeds a threshold log that IP
address to a text file.

Humans browsing a website will not generate many 404 messages, even if
they have bad bookmarks, or follow old links from search engines. So
an IP address requesting more than one hundred 404 errors is likely a
problem bot. Each time a 404 page is displayed, log the IP to a database
with a counter. When the counter reaches your limit, add that IP
address to a text file.
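
The counting scheme could look roughly like the sketch below, written in Python rather than inside a Cake error page. The class name `NotFoundTracker` and the blocklist file path are hypothetical, and the in-memory counter is only a stand-in for the database table mentioned above.

```python
from collections import Counter

THRESHOLD = 100  # the "one hundred 404 errors" figure used above


class NotFoundTracker:
    """Count 404 responses per IP and write offenders to a text file."""

    def __init__(self, blocklist_path, threshold=THRESHOLD):
        self.counts = Counter()  # stand-in for a database table
        self.blocklist_path = blocklist_path
        self.threshold = threshold
        self.blocked = set()

    def record_404(self, ip):
        """Count one 404 for ip; return True when ip is newly added to the file."""
        self.counts[ip] += 1
        if self.counts[ip] > self.threshold and ip not in self.blocked:
            self.blocked.add(ip)
            # Append so that earlier entries in the file are kept.
            with open(self.blocklist_path, "a", encoding="utf-8") as handle:
                handle.write(ip + "\n")
            return True
        return False
```

Each IP is written to the file only once, on the first request that takes it past the threshold.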

In your .htaccess you can load this text file of IP addresses and
apply rules to those addresses. It's up to you if you wish to display
a static access denied HTML page, or simply throw a connection
refused.

Sorry I don't remember the commands for the htaccess file.
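
One way to fill that gap is to regenerate the access rules from the list of IPs instead of having Apache read the text file itself. The Python sketch below renders Apache 2.4 `Require` directives; the function name `render_block_rules` is hypothetical. On the older Apache 2.2 the equivalent directives were `Order Allow,Deny`, `Allow from all` and one `Deny from <ip>` line per address.

```python
import ipaddress


def render_block_rules(ips):
    """Render an Apache 2.4 access block that denies the given IP addresses."""
    # ip_address() raises ValueError on anything that is not a valid IP,
    # which keeps stray text out of the generated configuration.
    cleaned = sorted({str(ipaddress.ip_address(ip.strip())) for ip in ips})
    lines = ["<RequireAll>", "    Require all granted"]
    for ip in cleaned:
        lines.append(f"    Require not ip {ip}")
    lines.append("</RequireAll>")
    return "\n".join(lines) + "\n"


print(render_block_rules(["203.0.113.7", "198.51.100.2"]))
```

The output would be pasted or written into the .htaccess file. Requests from the listed addresses then get a 403 response rather than a refused connection.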


Re: Search Engine Bots Generating Strange Queries

BrendonKoz

I'd actually say using a permanent redirect (301, I believe) to your
root (or that controller's index), rather than to the 404 page might
be a better solution.  If your users/visitors won't see it since
you're not linking to it, it isn't really a bad solution, and I doubt
you'd want any search engines indexing 404 errors in association with
your site/domain.  If it was a hacker, I don't think I'd send them a
404 message either, I'd just redirect them...if it was a Safari user,
I'd rather give them a graceful degradation than a 404 as well.
That's just me though.

Standard incorrect addresses should still receive a 404.  A 404 does
serve a very important purpose.


Re: Search Engine Bots Generating Strange Queries

mathew-2

> I'd actually say using a permanent redirect (301, I believe) to your
> root (or that controller's index), rather than to the 404 page might
> be a better solution.  If your users/visitors won't see it since
> you're not linking to it, it isn't really a bad solution, and I doubt
> you'd want any search engines indexing 404 errors in association with
> your site/domain.  If it was a hacker, I don't think I'd send them a
> 404 message either, I'd just redirect them...if it was a Safari user,

You should not redirect unless the content has been moved. Sending the
wrong response codes to incorrect URIs makes it difficult for web
crawl operators to correctly crawl your site. Should a web crawl
operator come to the conclusion that your site provides incorrect
response codes, then they might choose to crawl it aggressively since
the server's responses cannot be trusted.

Indexing bots will not index a 404 response code from the HTTP header.
That response code tells the bots the URI points to no content. Bots
will only index such pages when the 404 error message is sent with an
HTTP 200 response code and a text/html content-type in the header,
which is incorrect and more of an error on the server side than a
problem with the bot.

If you send a 301/302 response code you are telling the bot that this
URI is valid and has been moved, so both the source URI and the
redirected URI will continue to be processed by the bot. Whereas if
you tell the bot 404, then the bot knows this URI is invalid and that
the source page it came from is generating invalid URIs, and it can
drop other URIs from that source.
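
The distinction can be put as a small decision table. The Python sketch below is a simplified model of the behaviour described here, not the logic of any real crawler. The function name `crawler_action` and the action labels are hypothetical.

```python
def crawler_action(status, location=None):
    """Map an HTTP status code to what a simple crawler does with the URI."""
    if status == 200:
        return ("index", None)  # content exists: download and index it
    if status in (301, 302):
        # The URI is treated as valid but moved, so the target gets queued too.
        return ("follow", location)
    if status == 404:
        return ("drop", None)  # no content here: stop processing this URI
    return ("skip", None)  # anything else is outside this sketch


print(crawler_action(404))
print(crawler_action(301, "http://mysite.com/"))
```

A 404 page served with a 200 status would take the first branch and be indexed, which is the server-side error mentioned above.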

Sending a hacker a 301 or 302 does nothing to change their behavior, and
provides them no more information than a 404.

Blocking a remote computer from making too many invalid requests to
your server does change the behavior of that remote computer. It stops
it, which is about all you can do at this point. A hacker will return
with a different IP address, and attack. So, hackers are a completely
different topic :)

Re: Search Engine Bots Generating Strange Queries

BrendonKoz

It may not index a 404, but it still checks the 404.  For usability's
sake I'd still prefer to redirect than to send a 404.  Although we
were discussing bots, we have to keep the user in mind as well.  I
have personally traversed the URL path to see what may be found on
some sites, and if Safari has the feature included out of the box,
well...I'd rather present the user with something than nothing at all,
and a 404 isn't my idea of proper degradation within the path.  Either
way, it's simply a matter of personal preference.

Google was not the first search engine to incorporate robots.txt by
the way...they were the first to incorporate the rel="nofollow" and
also I think the SiteMap.xml idea.


Re: Search Engine Bots Generating Strange Queries

mathew-2

Most web crawlers won't check a 404, because of the way servers send
HTTP responses.

When a crawler requests a page that is missing, it first receives the
header response from the request, and it can read the response code,
content-type, and other information. The web crawler can then stop the
download of the content after it has checked the response code,
reducing the bandwidth placed on the server, and reducing time the web
crawler is spending on missing content. If a redirect response is
sent, then the crawler must make another request to the server and
will download the entire content of a page that does not reflect the
source url. The web crawler will see a 200 response code on the new
URI, download all the content, and increase the time and bandwidth
spent crawling that domain.
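
A client can be sketched along those lines with Python's standard `http.client`: `getresponse()` parses the status line and headers, and closing the connection afterwards means the body is never read. The function name `peek_status` and its parameters are hypothetical, and this is only an illustration, not how any specific crawler is implemented.

```python
import http.client


def peek_status(host, port, path, timeout=5.0):
    """Request path, read only the status line and headers, then hang up."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()  # parses status line and headers only
        return response.status, response.getheader("Content-Type")
    finally:
        conn.close()  # the response body is never read
```

For a small error page the server may already have sent the whole body by the time the client hangs up, so the saving matters mostly for larger pages. Sending a HEAD request instead of GET avoids the body entirely.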

But I understand what you're saying, Brendon, about it being a design
choice. I'm just not sure traversing the URL path improves the
usability of the website for the visitor. Once they step up to
an invalid URI they will be redirected somewhere else, which would
stop the traversal of the URL.

Here's CNN as an example.

http://edition.cnn.com/2008/POLITICS/11/06/middle.east.peace.deal/index.html
http://edition.cnn.com/2008/POLITICS/11/06/middle.east.peace.deal
http://edition.cnn.com/2008/POLITICS/11/06
http://edition.cnn.com/2008/POLITICS/11
http://edition.cnn.com/2008/POLITICS
http://edition.cnn.com/2008

While these links will produce a 404 response and display HTML, a web
crawler will not download the content after it has rejected the
response code in the header of the HTTP response. So the most
bandwidth load placed on the server is a few bytes per bad URI.

This makes your domain crawler friendly, but a friendly crawler would
not request phantom URIs.

Re: Search Engine Bots Generating Strange Queries

mathew-2

Another reason not to use redirects for missing URIs is that you could
mistakenly create what is called a "crawler trap".

A crawler trap is a set of URLs that keep changing but keep producing
the same content. The crawler gets stuck wasting its time downloading
the same page, because it can't tell from the URL that the content is
the same.

While good crawlers have logic to prevent this problem from happening,
your site could be flagged as poorly structured, and commercial
crawlers will avoid indexing your content.
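
One common form of that protective logic is to fingerprint the downloaded content, so that the same page is recognised even when the URL differs. The Python sketch below assumes exact byte-for-byte duplicates; the class name `SeenContent` is hypothetical, and real crawlers may use fuzzier comparisons.

```python
import hashlib


class SeenContent:
    """Remember page bodies by digest to spot the same content under new URLs."""

    def __init__(self):
        self._first_url = {}  # digest -> first URL that served this body

    def check(self, url, body):
        """Return the earlier URL that served an identical body, or None."""
        digest = hashlib.sha256(body).hexdigest()
        first = self._first_url.setdefault(digest, url)
        return None if first == url else first


seen = SeenContent()
home = b"<html>home page</html>"
print(seen.check("http://mysite.com/", home))
print(seen.check("http://mysite.com/controller/view", home))
```

If every invalid URI redirects to the home page, each of them ends up mapped to the same digest, which is how a crawler could notice the pattern.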
