Python Web Scraping: Scrapy's LinkExtractor Returns Link Objects
Published: 2021-11-23
LinkExtractor
from scrapy.linkextractors import LinkExtractor
Link
from scrapy.link import Link
The four attributes of Link
url text fragment nofollow
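To have the link text parsed out, the attrs parameter can be added to the LinkExtractor arguments. Note that Scrapy documents attrs as the tag attributes scanned for URLs (default ('href',)), and Link.text is normally filled from the anchor's text content even without it.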
link_extractor = LinkExtractor(attrs=('href', 'text'))
links = link_extractor.extract_links(response)
Usage example
import scrapy
from scrapy import cmdline
from scrapy.linkextractors import LinkExtractor


class DemoSpider(scrapy.Spider):
    name = 'spider'
    start_urls = [
        "https://book.douban.com/"
    ]

    def parse(self, response):
        # The allow argument is a regular expression; only matching URLs are kept
        link_extractor = LinkExtractor(allow="https://www.tianyancha.com/brand/b.*")
        links = link_extractor.extract_links(response)
        for link in links:
            print(link.text, link.url)


if __name__ == '__main__':
    # Equivalent to running "scrapy crawl spider" from the project directory
    cmdline.execute("scrapy crawl spider".split())
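Because allow is a regular expression, it can help to try the pattern on a few URLs before running the spider. The sketch below uses only the standard re module. The candidate URLs are hypothetical, and it assumes the extractor keeps a URL when the pattern is found anywhere in it (re.search semantics), which is my understanding of Scrapy's behavior rather than something the post states.

```python
import re

# Pattern copied from the spider above.
pattern = re.compile(r"https://www.tianyancha.com/brand/b.*")

# Hypothetical URLs, made up for illustration.
candidates = [
    "https://www.tianyancha.com/brand/b1234",
    "https://www.tianyancha.com/company/5678",
    "https://book.douban.com/tag/",
]

# Assumption: a URL is allowed when the pattern matches somewhere in it.
allowed = [url for url in candidates if pattern.search(url)]
print(allowed)
```

Only the first URL passes the filter. Since the spider starts at book.douban.com while the pattern targets tianyancha.com, it may well print nothing unless that page links to matching URLs. The unescaped dots in the pattern also match any character, so `\.` is the stricter choice.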