python - BeautifulSoup KeyError Issue


I know KeyErrors are common with BeautifulSoup and, before you yell RTFM at me, I have done extensive reading in both the Python documentation and the BeautifulSoup documentation. That aside, I still haven't a clue what's going on with these KeyErrors.

Here's the program I'm trying to run, which consistently results in a KeyError on the last element of the urls list.

I come from a C++ background, just to let you know, but I need to use BeautifulSoup for work; doing this in C++ would be an imaginable nightmare!

The idea is to return a list of all URLs in a website whose pages contain links to a certain URL.

Here's what I've got so far:

    import urllib
    from BeautifulSoup import BeautifulSoup

    urls = []
    locations = []
    urls.append("http://www.tuftsalumni.org")

    def print_links (link):
        if (link.startswith('/') or link.startswith('http://www.tuftsalumni')):
            if (link.startswith('/')):
                link = "starting_website" + link
            print (link)
            htmlsource = urllib.urlopen(link).read(200000)
            soup = BeautifulSoup(htmlsource)
            for item in soup.fetch('a'):
                if (item['href'].startswith('/') or
                    "tuftsalumni" in item['href']):
                    urls.append(item['href'])
                length = len(urls)
                if (item['href'] == "site_on_page"):
                    if (check_list(link, locations) == "no"):
                        locations.append(link)


    def check_list (link, array):
        for x in range (0, len(array)):
            if (link == array[x]):
                return "yes"
        return "no"

    print_links(urls[0])

    for x in range (0, (len(urls))):
        print_links(urls[x])

The error occurs on the next-to-last element of urls:

    File "scraper.py", line 35, in <module>
        print_links(urls[x])
    File "scraper.py", line 16, in print_links
        if (item['href'].startswith('/') or
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/BeautifulSoup.py", line 613, in __getitem__
        return self._getAttrMap()[key]
    KeyError: 'href'

Now, I know I need to use get() to handle the KeyError with a default case. I have absolutely no idea how to actually do that, despite literally an hour of searching.

Thank you, and if I can clarify this at all, please let me know.

If you just want to handle the error, you can catch the exception:

    for item in soup.fetch('a'):
        try:
            if (item['href'].startswith('/') or "tuftsalumni" in item['href']):
                (...)
        except KeyError:
            pass # or some other fallback action

You can also specify a default using item.get('key', 'default'), but I don't think that's what you need in this case.
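For comparison, here is a minimal sketch of the get() approach the question asks about. It is written for Python 3 and, so that it runs without third-party packages, it swaps BeautifulSoup for the standard library's html.parser and stores each tag's attributes in a plain dict; BeautifulSoup tags offer a similar get(key, default) method. The names AnchorCollector, internal_links and site_marker are made up for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the attributes of every <a> tag as a plain dict."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchors.append(dict(attrs))


def internal_links(html, site_marker="tuftsalumni"):
    """Return hrefs that are site-relative or contain site_marker."""
    collector = AnchorCollector()
    collector.feed(html)
    links = []
    for attrs in collector.anchors:
        # get() returns None instead of raising KeyError when href is
        # missing; "or ''" also covers a bare <a href> with no value.
        href = attrs.get("href") or ""
        if href.startswith("/") or site_marker in href:
            links.append(href)
    return links


sample = (
    '<a name="top">no href here</a>'
    '<a href="/events">Events</a>'
    '<a href="http://example.com">Elsewhere</a>'
)
print(internal_links(sample))  # ['/events']
```

The first anchor has no href at all, which is the situation that triggers the KeyError in the question; with get() it is simply skipped.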

Edit: If everything else fails, this barebones version should be a reasonable starting point:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import urllib
    from BeautifulSoup import BeautifulSoup

    links = ["http://www.tuftsalumni.org"]

    def print_hrefs(link):
        htmlsource = urllib.urlopen(link).read()
        soup = BeautifulSoup(htmlsource)
        for item in soup.fetch('a'):
            # still raises KeyError for an <a> tag without an href
            print item['href']

    for link in links:
        print_hrefs(link)
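Going one step beyond the barebones version, the following Python 3 sketch shows one possible way to walk a site so that each page is visited once and relative links such as '/events' are resolved against the page they appear on. It is only an illustration under stated assumptions: crawl and fetch_hrefs are hypothetical names, fetch_hrefs stands in for whatever code downloads a page and extracts its hrefs, and the "site" is an in-memory dict rather than the real tuftsalumni.org.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(start_url, fetch_hrefs):
    """Visit every same-host page reachable from start_url exactly once.

    fetch_hrefs is a hypothetical callable: given a URL, it returns the
    list of href strings found on that page.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for href in fetch_hrefs(url):
            # urljoin turns '/path' into an absolute URL on the same site
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


# In-memory stand-in for a website, so the sketch runs without network access
fake_site = {
    "http://www.tuftsalumni.org": ["/events", "http://example.com/"],
    "http://www.tuftsalumni.org/events": ["/events", "/about"],
    "http://www.tuftsalumni.org/about": [],
}
visited = crawl("http://www.tuftsalumni.org", lambda url: fake_site.get(url, []))
print(visited)
```

Using a queue plus a seen set avoids two problems in the question's loop: range(0, len(urls)) is evaluated once, so URLs appended later are never visited, and the same URL can be appended many times.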

Also, check_list(item, l) can be replaced with item in l.
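A short sketch of that replacement, with the question's helper kept alongside for comparison (the two example URLs are made up):

```python
def check_list(link, array):
    """The question's helper, kept here only for comparison."""
    for x in range(0, len(array)):
        if link == array[x]:
            return "yes"
    return "no"


locations = ["http://www.tuftsalumni.org/a"]
link = "http://www.tuftsalumni.org/b"

# Same check as check_list(link, locations) == "no", using the in operator
if link not in locations:
    locations.append(link)

print(locations)
```

If the list grows large, a set gives the same membership test with faster lookups, at the cost of losing insertion order.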

