Google ignoring robots.txt files
For a while now, it has looked as though Google was ignoring the robots.txt file on one of my sites.
I say this because a directory I had specifically listed in my “Disallow” lines was indexed in Google. I didn’t really think anything of it except “stupid Google”, because it wasn’t affecting anything.
Now, one of the pages I had excluded contained a link to another page, and that link pointed to a secure (https) page. Google of course followed it, and now I have pages indexed in Google under both http and https. So I was searching the internet to find out how to get rid of the https pages.
Backtracking a bit, my robots.txt file looked like this:
User-agent: *
Disallow: /downloads/
Disallow: /bplanx/
Disallow: /cgi-bin/
Disallow: /uk/
Disallow: /uk-reviewed/
Disallow: /small-business/
Disallow: /business-plan-books/
Disallow: /orders/
Disallow: /products/
User-agent: googlebot
Disallow: /business-directory/shopping/
Disallow: /business-directory/lifestyle/
Disallow: /business-directory/motoring/
So it had a separate entry for googlebot. Reading up on this, it seems that a bot only obeys the most specific User-agent entry that matches it, so the googlebot entry overrides the original entry. The items I had asked all bots to disallow were therefore not being read by googlebot, and I would have needed to repeat those lines under my specific googlebot entry.
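To see the override for yourself, here is a minimal sketch using Python's standard-library robots.txt parser. It is not Google's parser; I am assuming it picks the matching User-agent entry the same way. The helper name `allowed`, the bot name `somebot` and the path `/downloads/plan.zip` are made up for illustration, and the two files are shortened versions of mine.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the original file: general rules, then a googlebot entry.
ORIGINAL = """\
User-agent: *
Disallow: /downloads/
Disallow: /orders/
User-agent: googlebot
Disallow: /business-directory/shopping/
"""

# Possible fix: repeat the general rules inside the googlebot entry.
FIXED = """\
User-agent: *
Disallow: /downloads/
Disallow: /orders/
User-agent: googlebot
Disallow: /downloads/
Disallow: /orders/
Disallow: /business-directory/shopping/
"""


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Hypothetical path and bot names, used only to exercise the rules.
    for name, text in (("original", ORIGINAL), ("fixed", FIXED)):
        for agent in ("googlebot", "somebot"):
            print(name, agent, allowed(text, agent, "/downloads/plan.zip"))
```

With the original file, googlebot is allowed into /downloads/ while a bot without its own entry is not. With the rules repeated under googlebot, both are blocked.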
AHHHHHHHHHHHHHHH
It feels like a bug in Google, but it seems to go back some years and looks like intended behaviour. I have actually deleted the googlebot entry so all disallows ARE for all bots.
Now I just need to find out how to remove all my https:// content from Google’s index.


