Python - BeautifulSoup KeyError Issue
I know KeyErrors are common with BeautifulSoup and, before you yell RTFM at me, I have done extensive reading in both the Python documentation and the BeautifulSoup documentation. That aside, I still haven't a clue what's going on with the KeyErrors.
Here's the program I'm trying to run, which consistently results in a KeyError on the last element of the urls list.
I come from a C++ background, just to let you know, but I need to use BeautifulSoup for work; doing this in C++ would be an imaginable nightmare!
The idea is to return a list of all URLs in a website that contain, on their pages, links to a certain URL.
Here's what I've got so far:
    import urllib
    from BeautifulSoup import BeautifulSoup

    urls = []
    locations = []
    urls.append("http://www.tuftsalumni.org")

    def print_links (link):
        if (link.startswith('/') or link.startswith('http://www.tuftsalumni')):
            if (link.startswith('/')):
                link = "starting_website" + link
            print (link)
            htmlsource = urllib.urlopen(link).read(200000)
            soup = BeautifulSoup(htmlsource)
            for item in soup.fetch('a'):
                if (item['href'].startswith('/') or "tuftsalumni" in item['href']):
                    urls.append(item['href'])
                    length = len(urls)
                if (item['href'] == "site_on_page"):
                    if (check_list(link, locations) == "no"):
                        locations.append(link)

    def check_list (link, array):
        for x in range (0, len(array)):
            if (link == array[x]):
                return "yes"
        return "no"

    print_links(urls[0])
    for x in range (0, (len(urls))):
        print_links(urls[x])
The error I get occurs on the next-to-last element of urls:
      File "scraper.py", line 35, in <module>
        print_links(urls[x])
      File "scraper.py", line 16, in print_links
        if (item['href'].startswith('/') or
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/BeautifulSoup.py", line 613, in __getitem__
        return self._getAttrMap()[key]
    KeyError: 'href'
Now, I know I need to use get() to handle the KeyError default case. I have absolutely no idea how to actually do that, despite literally an hour of searching.
Thank you. If I can clarify this at all, please let me know.
If you just want to handle the error, you can catch the exception:
    for item in soup.fetch('a'):
        try:
            if (item['href'].startswith('/') or "tuftsalumni" in item['href']):
                (...)
        except KeyError:
            pass # or some other fallback action
You can also specify a default using item.get('key', 'default'), but I don't think that's what you need in this case.
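For reference, here is roughly what the get() approach looks like. An anchor such as `<a name="top">` has no href attribute, so item['href'] raises KeyError, while item.get('href', '') returns the default instead. This is only a minimal sketch: it assumes Python 3 and uses plain dicts as stand-ins for the tags so that it runs without BeautifulSoup installed, and `collect_links` and the sample `anchors` list are hypothetical names made up for illustration.

```python
# Plain dicts stand in for BeautifulSoup tags here (an assumption, so the
# sketch runs without the library). Tag.get() has the same dict-style
# contract: it returns the default when the attribute is missing.
anchors = [
    {"href": "/about"},
    {"name": "top"},  # no href: item['href'] would raise KeyError here
    {"href": "http://www.tuftsalumni.org/events"},
    {"href": "http://example.com/"},
]


def collect_links(items):
    """Return hrefs that are site-relative or mention 'tuftsalumni'."""
    found = []
    for item in items:
        href = item.get("href", "")  # default "" instead of a KeyError
        if href.startswith("/") or "tuftsalumni" in href:
            found.append(href)
    return found


links = collect_links(anchors)
print(links)
```

With real tags from soup.fetch('a'), the loop body would stay the same; only the source of the items changes.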
Edit: If all else fails, this barebones version should be a reasonable starting point:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import urllib
    from BeautifulSoup import BeautifulSoup

    links = ["http://www.tuftsalumni.org"]

    def print_hrefs(link):
        htmlsource = urllib.urlopen(link).read()
        soup = BeautifulSoup(htmlsource)
        for item in soup.fetch('a'):
            print item['href']

    for link in links:
        print_hrefs(link)
Also, check_list(item, l) can be replaced with item in l.
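As a rough illustration of that last point, the membership test can replace the whole check_list helper and its "yes"/"no" strings. This is a sketch under the assumption of Python 3; `add_location` is a hypothetical helper name, not something from the question's code.

```python
locations = []


def add_location(link, seen):
    """Append link to seen unless it is already there; return True if added."""
    if link not in seen:  # replaces check_list(link, seen) == "no"
        seen.append(link)
        return True
    return False


first = add_location("http://www.tuftsalumni.org", locations)
second = add_location("http://www.tuftsalumni.org", locations)
print(locations)
```

For a long list of URLs, a set would make the membership test faster, at the cost of losing insertion order.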