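When I first tried to scrape Google Patents, my requests were blocked and I couldn't fetch any data. It turned out the request was missing a header (the User-Agent). The following code can scrape Google Patents and retrieve the data.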
import urllib.request
from bs4 import BeautifulSoup

# Without a User-Agent header the request is rejected, so set one explicitly.
req = urllib.request.Request('http://www.google.st/patents/US7992995')
req.add_header('User-Agent', 'Mozilla/5.0')
patent_html = urllib.request.urlopen(req)
soup = BeautifulSoup(patent_html, 'html.parser')

# The patent number is the text of <span class="patent-number">.
patentNumber = soup.find("span", {"class": "patent-number"}).text
# The assignee is the content attribute of <meta scheme="assignee">.
assigneeMetaTag = soup.find("meta", {"scheme": "assignee"})
patentAssignee = assigneeMetaTag.attrs["content"]
print(patentNumber, patentAssignee)
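The header step can also be wrapped in a small helper so that a rejected request does not crash the script. This is only a rough sketch using the standard library: the function names `build_request` and `fetch_html`, the `timeout` value, and the UTF-8 fallback are my own assumptions, not part of the original code.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries a browser-like User-Agent header."""
    # build_request is a hypothetical helper name used for illustration.
    req = urllib.request.Request(url)
    req.add_header("User-Agent", user_agent)
    return req


def fetch_html(url, timeout=10):
    """Fetch a page as text; return None if the server rejects the request."""
    # fetch_html and the timeout default are assumptions, not from the post.
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            # Fall back to UTF-8 when the response does not declare a charset.
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as err:
        # An HTTP error status (for example 403 Forbidden) ends up here.
        print("Request failed:", err.code, err.reason)
        return None
```

The returned string can be passed to `BeautifulSoup(html, 'html.parser')` in the same way as `patent_html` above, after checking that it is not `None`.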
For more information, see the following Stack Overflow question:
http://stackoverflow.com/questions/32637023/using-google-patent-api