s7ckTeam posted 5 days ago

[18682] 2018-05-19_Python Web Crawler Summary (Part 1)

<html>
<head>
<title>Python Web Crawler Summary (Part 1)</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width,initial-scale=1.0,maximum-scale=1.0,user-scalable=0,viewport-fit=cover">
<style>
*{margin:0;padding:0;max-width:100%;box-sizing:border-box;}html{-ms-text-size-adjust:100%;-webkit-text-size-adjust:100%;line-height:1.6}img{z-index:999;position:relative;max-width:100%;margin:10px 0;}body{-webkit-touch-callout:none;font family:-apple-system-font,BlinkMacSystemFont,"Helvetica Neue","PingFang SC","Hiragino Sans GB","Microsoft YaHei UI","Microsoft YaHei",Arial,sans-serif;color:#333;letter-spacing:.034em}h1,h2,h3,h4,h5,h6{font-weight:400;font-size:16px;line-height:36px;}a{color:#576b95;text-decoration:none;-webkit-tap-highlight-color:rgba(0,0,0,0)}td,th{word-wrap:break-word;padding:5px 10px;border:1px solid #DDD;}table{margin-bottom:10px;border-collapse:collapse;display:table;width:100%!important;}.appmsg_skin_default .rich_media_area_primary{background-color:#fff}.appmsg_skin_default .rich_media_area_primary .weui-loadmore_line .weui-loadmore__tips{background-color:#fff}.rich_media_area_primary{padding:20px 16px 12px;background-color:#fafafa}@media (max-width:375px){.rich_media_area_primary{padding:20px 60px 15px 60px}.rich_media_area_extra{padding:0 60px 21px 60px}}@media (min-width:1024px){.rich_media_area_primary_inner,.rich_media_area_extra_inner,body{max-width:677px;margin-left:auto;margin-right:auto}.rich_media_area_primary{padding-top:32px}}.rich_media{padding:20px;}.appmsg_skin_default .rich_media_area_primary{background-color:#fff}.appmsg_skin_default .rich_media_area_primary .weui-loadmore_line .weui-loadmore__tips{background-color:#fff}@media screen and (min-width:1024px){.rich_media_area_primary_inner,.rich_media_area_extra_inner{max-width:677px;margin-left:auto;margin-right:auto}.rich_media_area_primary{padding-top:32px}}.rich_media_content{overflow:hidden;color:#333;font-size:17px;line-height:37px;;word-wrap:break-word;-webkit-hyphens:auto;-ms-hyphens:auto;hyphens:auto;text-align:justify;position:relative;z-index:0}.rich_media_content 
*{max-width:100%!important;box-sizing:border-box!important;-webkit-box-sizing:border-box!important;word-wrap:break-word!important}.rich_media_content p{clear:both;min-height:1em}.rich_media_content em{font-style:italic}.rich_media_content fieldset{min-width:0}.rich_media_content .list-paddingleft-1,.rich_media_content .list-paddingleft-2,.rich_media_content .list-paddingleft-3{padding-left:2.2em}.rich_media_content .list-paddingleft-1 .list-paddingleft-2,.rich_media_content .list-paddingleft-2 .list-paddingleft-2,.rich_media_content .list-paddingleft-3 .list-paddingleft-2{padding-left:30px}.rich_media_content .list-paddingleft-1{padding-left:1.2em}.rich_media_content .list-paddingleft-3{padding-left:3.2em}.rich_media_content .code-snippet,.rich_media_content .code-snippet__fix{max-width:1000%!important}.rich_media_content .code-snippet *,.rich_media_content .code-snippet__fix *{max-width:1000%!important}.rich_media_title{font-size:22px;line-height:42px;;line-height:1.4;margin:10px 0;padding-bottom:10px;border-bottom:1px solid #e7e7eb;}@supports(-webkit-overflow-scrolling:touch){.rich_media_title{font-weight:700}}.rich_media_meta{display:inline-block;vertical-align:middle;margin:0 10px 10px 0;font-size:15px;line-height:35px;;line-height:35px;;line-height:35px;;line-height:35px;;-webkit-tap-highlight-color:rgba(0,0,0,0)}.rich_media_meta.icon_appmsg_tag{margin-right:4px}.rich_media_meta.meta_tag_text{margin-right:0}.rich_media_meta_list em{font-style:normal}.rich_media_meta_text{color:rgba(0,0,0,0.3)}p{margin:0;}.msgBox{margin-top:20px;padding-top:20px;padding-left:50px;overflow:hidden;border-top:2px dashed #09a2ff;}.msg{padding-top:7px;clear:both;}.msgBody{float:right;width:100%;margin-left:55px;padding-bottom:15px;border-bottom:1px dashed #e0e0e0;}.userHeadImg{float:left;margin-left:-50px;}.userHeadImg 
img{width:40px;height:40px;margin-right:10px;border-radius:3px;}.userName{color:#888888;line-height:24px;font-size:14px;line-height:34px;;line-height:34px;;line-height:34px;;line-height:34px;;line-height:34px;;line-height:34px;;line-height:34px;;line-height:34px;;line-height:34px;;line-height:34px;;line-height:34px;;margin:5px 0 5px 0;height:24px;}.replyBody,.autherBody{color:#565656;font-size:15px;}.replyIcon{border-left:4px solid #33ab01;margin-right:5px;}.ad{text-decoration:none;color:#d6d4d4;font-size:12px;line-height:32px;;}.msgBodyReply{padding-top:5px;}.userName span{float:right;color:#afafaf;font-size:14px;}code{text-align:left;font-size:14px;display:block;white-space:pre;display:-webkit-box;display:-webkit-flex;display:flex;position:relative;}.code-snippet__fix{font-size:14px;margin:10px 0;display:block;color:#333;position:relative;background-color:rgba(0,0,0,0.03);border:1px solid #f0f0f0;border-radius:2px;display:-webkit-box;display:-webkit-flex;display:flex;padding-left:25px;line-height:26px}.code-snippet__fix code{text-align:left;font-size:14px;display:block;white-space:pre;display:-webkit-box;display:-webkit-flex;display:flex;position:relative;font family:Consolas,"Liberation Mono",Menlo,Courier,monospace}.code-snippet__comment,.code-snippet__quote{color:#afafaf;font-style:italic}.code-snippet__keyword,.code-snippet__selector-tag,.code-snippet__subst{color:#ca7d37}.code-snippet__number,.code-snippet__literal,.code-snippet__variable,.code-snippet__template-variable,.code-snippet__tag .code-snippet__attr{color:#0e9ce5}.code-snippet__string,.code-snippet__doctag{color:#d14}.code-snippet__title,.code-snippet__section,.code-snippet__selector-id{color:#d14}.code-snippet__subst{font-weight:normal}.code-snippet__type,.code-snippet__class 
.code-snippet__title{color:#0e9ce5}.code-snippet__tag,.code-snippet__name,.code-snippet__attribute{color:#0e9ce5;font-weight:normal}.code-snippet__regexp,.code-snippet__link{color:#ca7d37}.code-snippet__symbol,.code-snippet__bullet{color:#d14}.code-snippet__built_in,.code-snippet__builtin-name{color:#ca7d37}.code-snippet__meta{color:#afafaf}.code-snippet__deletion{background:#fdd}.code-snippet__addition{background:#dfd}.code-snippet__emphasis{font-style:italic}.code-snippet__strong{font-weight:bold}.account_avatar{width:40px;height:40px;padding:0;}.account_info{display:-webkit-box;display:-webkit-flex;display:flex;-webkit-box-align:center;-webkit-align-items:center;padding:20px 0;align-items:center}.flex_bd{padding-left:14px;}.account_nickname{display:inline-block;vertical-align:middle;line-height:1.2;color:#576b95;font-size:14px}.account_desc{overflow:hidden;text-overflow:ellipsis;display:-webkit-box;-webkit-box-orient:vertical;-webkit-line-clamp:1;color:rgba(0,0,0,0.3);font-size:14px;line-height:1.2;padding-top:.4em}.msg_source_url{text-align:left;word-break:break-all;margin-top:20px;}.msg_source_url a{padding-right:10px;}.msg_source_url .url_text{color:#a8a8a8;}.video-desc{font-size:14px;margin-top:15px;color:#6c6c6c;}.msg_source_url{text-align:left;}.original_primary_card_tips{color:rgba(0,0,0,0.3);line-height:1.4;font-size:15px;}.weui-flex__item{margin-bottom:20px;padding:20px 16px;margin-top:16px;line-height:1.4;align-items:center;background-color:#f7f7f7;border-radius:8px;position:relative;}.original_primary_desc{color:rgba(0,0,0,0.5);font-size:14px;padding-top:4px;width:auto;overflow:hidden;text-overflow:ellipsis;}.msgBodyReplyList{border-top:1px solid #e1e1e1;margin-top:10px;}.msgBodyReplyListTop{border-top:0;}.reply_like_num{float:right;font-size:14px;color:#c7c7c7;}.msgData{margin-top:20px;color:#626262;}.msgData span{font-size:14px;padding-right:15px;}.msgData .likes{float:right;padding-right:0;}.js_text_content 
p{font-size:18px;line-height:38px;;}.rich_media_meta_link{font-size:15px;}blockquote {padding-left: 10px;border-left: 3px solid #dbdbdb;color: rgba(0,0,0,0.5);font-size:15px;line-height:35px;;padding-top: 4px;margin: 1em 0;}.video_iframe{width:500px;height:400px;}.blockquote_info{color:#b5b5b5;margin-top:10px;}.playVideoWx{position:relative;display:block;}.icon_mid_play{position:absolute;z-index:9999;top:50%;left:50%;display:-webkit-box;display:-webkit-flex;display:flex;-webkit-box-align:center;-webkit-align-items:center;align-items:center;-webkit-box-pack:center;-webkit-justify-content:center;justify-content:center;width:48px;height:48px;background:rgba(237,237,237,0.9);border-radius:50%}.icon_mid_play:before{content:"";text-indent:-999em;display:inline-block;width:28px;height:28px;vertical-align:middle;background-size:cover;background-image:url("data:image/svg+xml;charset=utf8,%3Csvg xmlns='http://www.w3.org/2000/svg' width='24' height='24' viewBox='0 0 24 24'%3E%3Cpath fill='%23151515' fill-rule='evenodd' d='M9.524 4.938l10.092 6.21a1 1 0 0 1 0 1.704l-10.092 6.21A1 1 0 0 1 8 18.21V5.79a1 1 0 0 1 1.524-.852z'/%3E%3C/svg%3E")}
</style>
<link href="https://www.juyifx.cn/config/css/wxArticle.css" rel="stylesheet"/>
</head>
<body>
<div class="rich_media">
               
                <h1 class="rich_media_title" id="activity-name">
                  
                  
                  
Python Web Crawler Summary (Part 1)
                </h1>
                <div id="meta_content" class="rich_media_meta_list">
                                                                <span id="copyright_logo" class="rich_media_meta icon_appmsg_tag appmsg_title_tag weui-wa-hotarea">Original</span>
                                                                                          <span class="rich_media_meta rich_media_meta_text">
                                                                  crhua
                                                            </span>
                                                               
                                        <span class="rich_media_meta rich_media_meta_nickname" id="profileBt">
                      <a href="javascript:void(0);" class=" weui-wa-hotarea" id="js_name">
                        huasec                      </a>
                      <div id="js_profile_qrcode" class="profile_container" style="display:none;">
                        <div class="profile_inner">
                              <strong class="profile_nickname">huasec</strong>
                              <img class="profile_avatar" id="js_profile_qrcode_img" >

                              <p class="profile_meta">
                              <label class="profile_meta_label">WeChat ID</label>
                              <span class="profile_meta_value">ihuahua04</span>
                              </p>

                              <p class="profile_meta">
                              <label class="profile_meta_label">About</label>
                              <span class="profile_meta_value">Sharing what I learn day to day; aspiring to become a security developer.</span>
                              </p>
                              
                        </div>
                        <span class="profile_arrow_wrp" id="js_profile_arrow_wrp">
                              <i class="profile_arrow arrow_out"></i>
                              <i class="profile_arrow arrow_in"></i>
                        </span>
                      </div>
                  </span>
                  <em id="publish_time" class="rich_media_meta rich_media_meta_text">2018-05-19</em>
                </div>

               
                                                <div id="js_tags" class="article-tag__list" style="display: none;" data-len="0">
                                          
                        <div class="article-tag-card__title">Included in topics</div>
                        <div class="article-tags">
                                                    </div>
                                    </div><div id="weixin_content"><p>I spent three days studying web crawling systematically, and this is a short summary. Python crawlers mainly rely on the requests and urllib libraries to fetch pages; re, BeautifulSoup, and PyQuery are the usual choices for parsing the data. There is also selenium for browser-automated scraping, plus the crawler frameworks pyspider and scrapy.</p>
<p><span style="font-size:18px;line-height:38px;;"><strong>Crawler development workflow</strong></span></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;1. Analyze the page structure of the target site.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;2. Decide exactly which data you want to extract.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;3. Find the request that delivers that data and check whether it is an asynchronous (AJAX) request. A common check is to save the page, open the saved copy locally, and compare it with the live page: if the two differ, the data is loaded asynchronously.</p>
<p><span style="font-size:18px;line-height:38px;;"><strong>Anti-crawler strategies</strong></span></p>
<p><span style="font-size:16px;line-height:36px;;">&nbsp;&nbsp;&nbsp;&nbsp;1. Blocking requests based on the User-Agent (UA)</span></p>
<p><span style="font-size:16px;line-height:36px;;">&nbsp;&nbsp;&nbsp;&nbsp;2. Serving data only to logged-in users</span></p>
<p><span style="font-size:16px;line-height:36px;;">&nbsp;&nbsp;&nbsp;&nbsp;3. Using CAPTCHAs to rate-limit crawlers</span></p>
<p><span style="font-size:16px;line-height:36px;;">&nbsp;&nbsp;&nbsp;&nbsp;4. Generating tokens dynamically with JavaScript to block crawlers</span></p>
<p><span style="font-size:16px;line-height:36px;;">&nbsp;&nbsp;&nbsp;&nbsp;5. Banning IP addresses according to rules</span></p>
<p><span style="font-size:16px;line-height:36px;;">Common ways to get around these strategies:</span></p>
<p><span style="font-size:16px;line-height:36px;;">&nbsp;&nbsp;&nbsp;&nbsp;1. Add request headers</span></p>
<p><span style="font-size:16px;line-height:36px;;">&nbsp;&nbsp;&nbsp;&nbsp;2. Set a proxy</span></p>
<p><span style="font-size:16px;line-height:36px;;">&nbsp;&nbsp;&nbsp;&nbsp;3. Use PhantomJS to simulate a real user's requests<br/></span></p>
<p><strong><span style="font-size:16px;line-height:36px;;">requests library</span></strong></p>
<blockquote><p><span style="font-size:12px;line-height:32px;;">import requests</span></p>
<p><span style="font-size:12px;line-height:32px;;">url = 'http://example.com'&nbsp;&nbsp;# placeholder target; replace with the page to fetch</span></p>
<p><span style="font-size:12px;line-height:32px;;">headers = {<br/>&nbsp;&nbsp;&nbsp;&nbsp;'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0'</span></p>
<p><span style="font-size:12px;line-height:32px;;">}<br/></span></p>
<p><span style="font-size:12px;line-height:32px;;">proxies = {</span></p>
<p><span style="font-size:12px;line-height:32px;;">&nbsp;&nbsp;&nbsp;&nbsp;'http':'http://113.109.162.85:808'</span></p>
<p><span style="font-size:12px;line-height:32px;;">}</span></p>
<p><span style="font-size:12px;line-height:32px;;">res = requests.get(url,headers=headers,proxies=proxies)<br/></span></p>
<p><span style="font-size:12px;line-height:32px;;">print(res.text)<br/></span></p></blockquote>
<p><strong><span style="font-size:16px;line-height:36px;;">urllib library</span></strong></p>
<blockquote><p><span style="font-size:12px;line-height:32px;;">import urllib.request</span></p>
<p><span style="font-size:12px;line-height:32px;;">url = 'http://example.com'&nbsp;&nbsp;# placeholder target; replace with the page to fetch</span></p>
<p><span style="font-size:12px;line-height:32px;;">proxy_handler = urllib.request.ProxyHandler({</span></p>
<p><span style="font-size:12px;line-height:32px;;">&nbsp;&nbsp;&nbsp;&nbsp;'http':'http://113.109.162.85:808'</span></p>
<p><span style="font-size:12px;line-height:32px;;">})<br/></span></p>
<p><span style="font-size:12px;line-height:32px;;">opener = urllib.request.build_opener(proxy_handler)</span></p>
<p><span style="font-size:12px;line-height:32px;;">res = opener.open(url)</span></p>
<p><span style="font-size:12px;line-height:32px;;">print(res.read().decode('utf-8'))</span></p></blockquote>
<p><strong><span style="font-size:16px;line-height:36px;;">re library</span></strong></p>
<p><span style="font-size:16px;line-height:36px;;">Matching data with regular expressions can be tedious. Here I mainly introduce two regex tokens.</span></p>
<p><span style="font-size:16px;line-height:36px;;">\s&nbsp;&nbsp;&nbsp;&nbsp;matches any whitespace character; equivalent to [ \t\n\r\f\v].</span></p>
<p><span style="font-size:16px;line-height:36px;;">\S&nbsp;&nbsp;&nbsp;&nbsp;matches any non-whitespace character.</span></p>
<p><span style="font-size:16px;line-height:36px;;">For example:</span></p>
<p><span style="font-size:14px;line-height:34px;;">&lt;td data="IP"&gt;1.1.1.1&lt;/td&gt;</span></p>
<p><span style="font-size:14px;line-height:34px;;">&lt;td data="PORT"&gt;80&lt;/td&gt;</span><span style="font-size:16px;line-height:36px;;"><br/></span></p>
<p><span style="font-size:16px;line-height:36px;;">There is a line break between the two td elements, so the pattern is:</span></p>
<p><span style="font-size:12px;line-height:32px;;">&lt;td\sdata="IP"&gt;(.*?)&lt;/td&gt;\s+&lt;td\sdata="PORT"&gt;(.*?)&lt;/td&gt;</span></p>
<p><strong><span style="font-size:16px;line-height:36px;;">BeautifulSoup library</span></strong></p>
<p><span style="font-size:16px;line-height:36px;;">BeautifulSoup can parse the page source with either the lxml parser or the xml parser.</span></p>
<blockquote><p><span style="font-size:14px;line-height:34px;;">from bs4 import BeautifulSoup</span></p>
<p><span style="font-size:14px;line-height:34px;;">soup = BeautifulSoup(html,'lxml')<br/></span></p>
<p><span style="font-size:14px;line-height:34px;;">soup = BeautifulSoup(html,'xml')</span></p></blockquote>
<p><span style="font-size:16px;line-height:36px;;">There are three kinds of selectors.</span></p>
<p><span style="font-size:16px;line-height:36px;;">1. Tag selectors</span></p>
<p><span style="font-size:16px;line-height:36px;;">Get an attribute:<br/></span></p>
<blockquote><p><span style="font-size:14px;line-height:34px;;">soup.p['name']</span></p></blockquote>
<p><span style="font-size:16px;line-height:36px;;">Get the text content:</span></p>
<blockquote><p><span style="font-size:14px;line-height:34px;;">soup.p.string</span></p></blockquote>
<p><span style="font-size:16px;line-height:36px;;">2. Standard selectors</span></p>
<p><span style="font-size:16px;line-height:36px;;">find_all()&nbsp; returns every matching element, as a list.</span></p>
<p><span style="font-size:16px;line-height:36px;;">find()&nbsp;&nbsp;&nbsp;&nbsp;returns a single element (the first match).</span></p>
<blockquote><p><span style="font-size:14px;line-height:34px;;">soup.find('table',{'id':'list-1'})</span></p></blockquote>
<p><span style="font-size:16px;line-height:36px;;">3. CSS selectors</span></p>
<p><span style="font-size:16px;line-height:36px;;">Pass a CSS selector straight to select() to make the selection; the return value is a list.</span></p>
<p><span style="font-size:16px;line-height:36px;;">Select by tag name:</span></p>
<blockquote><p><span style="font-size:14px;line-height:34px;;">soup.select('title')</span></p></blockquote>
<p><span style="font-size:16px;line-height:36px;;">Select by class or id:</span></p>
<blockquote><p><span style="font-size:14px;line-height:34px;;">soup.select('#list')</span></p></blockquote>
<p><span style="font-size:16px;line-height:36px;;">Combined selection:<br/></span></p>
<blockquote><p><span style="font-size:14px;line-height:34px;;">soup.select('p .list_1')</span></p></blockquote>
<p><span style="font-size:16px;line-height:36px;;">Select by attribute:<br/></span></p>
<blockquote><p><span style="font-size:14px;line-height:34px;;">soup.select('a[href]')</span></p></blockquote>
<p><span style="font-size:16px;line-height:36px;;">Get the text content (select() returns a list, so index into it first):</span></p>
<blockquote><p><span style="font-size:14px;line-height:34px;;">soup.select('p .list_1')[0].get_text()</span></p></blockquote>
<p><span style="font-size:16px;line-height:36px;;">When scraping data, I usually use CSS selectors.<br/></span></p>
<p><strong><span style="font-size:16px;line-height:36px;;">PyQuery library</span></strong></p>
<p><span style="font-size:16px;line-height:36px;;">Initialize from an HTML string:</span></p>
<blockquote><p><span style="font-size:14px;line-height:34px;;">from pyquery import PyQuery as pq</span></p>
<p><span style="font-size:14px;line-height:34px;;">doc = pq(html)</span></p>
<p><span style="font-size:14px;line-height:34px;;">print(doc('title'))</span></p></blockquote>
<p><span style="font-size:16px;line-height:36px;;">Initialize from a URL:</span></p>
<blockquote><p><span style="font-size:14px;line-height:34px;;">doc = pq(url=url)</span></p>
<p><span style="font-size:14px;line-height:34px;;">print(doc('title'))</span></p></blockquote>
<p><span style="font-size:16px;line-height:36px;;">Initialize from a file:</span></p>
<blockquote><p><span style="font-size:14px;line-height:34px;;">doc = pq(filename='demo.html')</span></p>
<p><span style="font-size:14px;line-height:34px;;">print(doc('title'))</span></p></blockquote>
<p><span style="font-size:16px;line-height:36px;;">CSS selectors<br/></span></p>
<p><span style="font-size:16px;line-height:36px;;">A selection returns a list of the matching elements:<br/></span></p>
<blockquote><p><span style="font-size:14px;line-height:34px;;">print(doc('#list li'))</span></p></blockquote>
<p><span style="font-size:16px;line-height:36px;;">Usage is much the same as BeautifulSoup.</span></p>
<p><span style="font-size:18px;line-height:38px;;"><strong>Conclusion</strong></span></p>
<p><span style="font-size:16px;line-height:36px;;">&nbsp;&nbsp;&nbsp;&nbsp;Once you are comfortable with these libraries, crawling a good number of sites should not be a big problem.<br/></span></p>
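<p>As a worked version of the regular expression from the re section, here is a minimal sketch using only the standard library. The sample markup copies the two td cells shown earlier; the variable names (html, pattern, pairs) are illustrative assumptions, not part of the original article.</p>

```python
import re

# Sample markup mirroring the two <td> cells from the re section;
# the cells are separated by a newline, as in the article's example.
html = '<td data="IP">1.1.1.1</td>\n<td data="PORT">80</td>'

# \s matches the single space inside each opening tag;
# \s+ spans the line break between the two cells.
pattern = re.compile(r'<td\sdata="IP">(.*?)</td>\s+<td\sdata="PORT">(.*?)</td>')

# findall returns one (ip, port) tuple per matched pair of cells.
pairs = pattern.findall(html)
print(pairs)  # [('1.1.1.1', '80')]
```

<p>The non-greedy groups (.*?) keep each capture inside its own cell, so a page with many IP/PORT rows would yield one tuple per row.</p>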
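<p>The first two bypass techniques listed earlier (adding request headers and setting a proxy) can also be combined with urllib alone. The following is a rough sketch under stated assumptions: the target URL and the local proxy address are hypothetical placeholders, and the helper names make_request and make_opener are introduced here only for illustration. Nothing is sent over the network until opener.open() is called.</p>

```python
import urllib.request

# Hypothetical placeholders: substitute a real target and a proxy you control.
url = 'http://example.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) '
                  'Gecko/20100101 Firefox/59.0'
}
proxy = {'http': 'http://127.0.0.1:8080'}


def make_request(url, headers):
    """Build a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers=headers)


def make_opener(proxy):
    """Build an opener that routes http traffic through the given proxy."""
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxy))


request = make_request(url, headers)
opener = make_opener(proxy)

# The actual fetch would look like this (commented out so the sketch runs offline):
# with opener.open(request, timeout=10) as res:
#     print(res.read().decode('utf-8'))
```

<p>Keeping the request and the opener separate makes it easy to rotate either the User-Agent or the proxy independently when a site starts blocking one of them.</p>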
                </div>
</div>
</body>
</html>