LinkExtractor
Published: 2019-06-26


Sending a request with scrapy shell

wljdeMacBook-Pro:~ wlj$ scrapy shell "http://www.bjrbj.gov.cn/mzhd/detail_29974.htm"

The response object

response.body
response.text
response.url

>>> response.url
'https://item.btime.com/m_9b62d3a9239a9473c'

Import LinkExtractor to match links across the whole HTML document

>>> from scrapy.linkextractors import LinkExtractor
>>> response.xpath('//div[@class="xx_neirong"]/h1/text()').extract()[0]
'北京社保开户流程是怎么个流程'

 

demo
wljdeMacBook-Pro:Desktop wlj$ scrapy shell "http://hr.tencent.com/position.php?"
2018-06-21 21:12:40 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-06-21 21:12:40 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.6.5 (default, Apr 25 2018, 14:23:58) - [GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.1)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.5.0-x86_64-i386-64bit
2018-06-21 21:12:40 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2018-06-21 21:12:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2018-06-21 21:12:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-06-21 21:12:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-06-21 21:12:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-06-21 21:12:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-06-21 21:12:40 [scrapy.core.engine] INFO: Spider opened
2018-06-21 21:12:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to from
2018-06-21 21:12:41 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler
[s]   item       {}
[s]   request
[s]   response   <200 https://hr.tencent.com/position.php>
[s]   settings
[s]   spider
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> response.url
'https://hr.tencent.com/position.php'
>>> from scrapy.linkextractors import LinkExtractor
>>> link_list=LinkExtractor(allow=("start=\d+"))
>>> link_list.extract_links(response)
[Link(url='https://hr.tencent.com/position.php?&start=10#a', text='2', fragment='', nofollow=False),
 Link(url='https://hr.tencent.com/position.php?&start=20#a', text='3', fragment='', nofollow=False),
 Link(url='https://hr.tencent.com/position.php?&start=30#a', text='4', fragment='', nofollow=False),
 Link(url='https://hr.tencent.com/position.php?&start=40#a', text='5', fragment='', nofollow=False),
 Link(url='https://hr.tencent.com/position.php?&start=50#a', text='6', fragment='', nofollow=False),
 Link(url='https://hr.tencent.com/position.php?&start=60#a', text='7', fragment='', nofollow=False),
 Link(url='https://hr.tencent.com/position.php?&start=70#a', text='...', fragment='', nofollow=False),
 Link(url='https://hr.tencent.com/position.php?&start=3800#a', text='381', fragment='', nofollow=False)]
>>>

 

 

 

Reposted from: https://www.cnblogs.com/wanglinjie/p/9211013.html
