Python web crawling (1): common problems when installing the Scrapy framework and how to solve them

    Posted on 2021-7-8 00:20:31

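      Scrapy is a well-known and powerful application framework written for crawling website data. A framework here simply means a highly reusable project template that bundles the relevant functionality.

      On Linux and Mac, installation takes a single pip command: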

    pip install scrapy
    

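      Installing on Windows, however, has many pitfalls. Below I summarize the problems I ran into while installing and configuring Scrapy on Windows 10, together with their solutions. I hope it helps.

      Package download page: https://www.lfd.uci.edu/~gohlke/pythonlibs/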

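    Common problem 1: pip needs to be upgraded

           If your pip is old, the installation may ask you to update it, so it is best to upgrade pip first.

           The upgrade command is as follows (run it in cmd):

    python -m pip install --upgrade pip

      Once the upgrade finishes, this class of problem is solved.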

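    Common problem 2: installing wheel

    pip install wheel

            If wheel is not installed, this command installs it. If it is already installed, the command prints the message below and does not install it again.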

    Requirement already satisfied: wheel in d:\python3\lib\site-packages

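    Common problem 3: lxml is missing

              With wheel installed, download the lxml .whl file that matches your setup from the LFD page linked above, and do not rename the file (I am on 64-bit Windows with Python 3.6).

       After downloading, install the .whl file from the cmd command line:

    pip install lxml-4.1.1-cp36-cp36m-win_amd64.whl

           If lxml is not installed yet, it is installed directly. If it is already installed, the following output confirms it: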

    Requirement already satisfied: lxml==4.1.1 from file:///D:/lxml-4.1.1-cp36-cp36m-win_amd64.whl in d:\python3\lib\site-packages

     

    Common problem 4: path conflict

    Error in sitecustomize; set PYTHONVERBOSE for traceback:
    AttributeError: module 'sys' has no attribute 'setdefaultencoding'

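      This happens because sys.path contains an extra, conflicting Python 2.7 site-packages directory.

      Go to the “…/local/lib/python3.6/site-packages/” directory (the exact location differs from machine to machine) and delete the conflicting path there.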

    python -v homebrew.pth
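
      Before deleting anything, it can help to see which sys.path entries point at Python 2.7. The following is a minimal sketch; the function name find_conflicting_paths and the marker strings are my own assumptions, not part of the original post.

```python
import sys


def find_conflicting_paths(paths, markers=("python2.7", "python27")):
    """Return the entries of `paths` that mention a Python 2.7 directory.

    The marker strings are assumptions covering typical Unix and Windows
    directory names; adjust them to your own installation.
    """
    conflicts = []
    for entry in paths:
        lowered = entry.lower()
        if any(marker in lowered for marker in markers):
            conflicts.append(entry)
    return conflicts


if __name__ == "__main__":
    # Run this with the Python 3 interpreter that shows the error.
    for entry in find_conflicting_paths(sys.path):
        print("possible conflict:", entry)
```

      Any entry it prints is a candidate for the conflict described above. The script only reports the entries and does not change anything.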

     

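    Common problem 5: Twisted is missing

           Download the Twisted build that matches your machine (mine is Python 3.6 on a 64-bit operating system; the cp36 in the file name means Python 3.6, and amd64 refers to the bitness of the Python installation).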
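
           If you are unsure which file name fits your interpreter, the tags can be derived from the running Python. This is a small sketch under my own assumptions: the helper name wheel_tags is hypothetical, and it only covers the two Windows platform tags that appear in this post's file names.

```python
import struct
import sys


def wheel_tags(version_info=None, pointer_bits=None):
    """Return (python_tag, platform_tag) for a Windows CPython wheel.

    Defaults describe the interpreter running this script. Only the
    Windows tags "win_amd64" and "win32" are considered here.
    """
    if version_info is None:
        version_info = sys.version_info
    if pointer_bits is None:
        # Size of a C pointer in bits: 64 on a 64-bit Python, 32 otherwise.
        pointer_bits = struct.calcsize("P") * 8
    python_tag = "cp{}{}".format(version_info[0], version_info[1])
    platform_tag = "win_amd64" if pointer_bits == 64 else "win32"
    return python_tag, platform_tag


if __name__ == "__main__":
    print(wheel_tags())
```

           For a 64-bit Python 3.6 this gives cp36 and win_amd64, which matches the file names used in this post.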

     

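     After downloading, install it with:

     pip install Twisted-17.9.0-cp36-cp36m-win_amd64.whl

    If Twisted is not installed yet, it is installed directly, and a successful installation prints the following: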

    Successfully installed Twisted-17.9.0

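    Common problem 6: UnicodeDecodeError

    (I had already got past this pitfall when writing this, so the tracebacks below are similar ones found online. The content is roughly the same and the problem is identical.)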

    Exception:
    Traceback (most recent call last):
      File "c:\program files\python36\lib\site-packages\pip\compat\__init__.py", line 73, in console_to_str
        return s.decode(sys.__stdout__.encoding)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 34: invalid start byte
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "c:\program files\python36\lib\site-packages\pip\basecommand.py", line 215, in main
        status = self.run(options, args)
      File "c:\program files\python36\lib\site-packages\pip\commands\install.py", line 342, in run
        prefix=options.prefix_path,
      File "c:\program files\python36\lib\site-packages\pip\req\req_set.py", line 784, in install
        **kwargs
      File "c:\program files\python36\lib\site-packages\pip\req\req_install.py", line 878, in install
        spinner=spinner,
      File "c:\program files\python36\lib\site-packages\pip\utils\__init__.py", line 676, in call_subprocess
        line = console_to_str(proc.stdout.readline())
      File "c:\program files\python36\lib\site-packages\pip\compat\__init__.py", line 75, in console_to_str
        return s.decode('utf_8')
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 34: invalid start byte
    

       Or the following error:

    Exception:
    Traceback (most recent call last):
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\compat\__init__.py", line 73, in console_to_str
        return s.decode(sys.__stdout__.encoding)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 34: invalid start byte
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\commands\install.py", line 342, in run
        prefix=options.prefix_path,
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\req\req_set.py", line 784, in install
        **kwargs
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\req\req_install.py", line 878, in install
        spinner=spinner,
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\utils\__init__.py", line 676, in call_subprocess
        line = console_to_str(proc.stdout.readline())
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\compat\__init__.py", line 75, in console_to_str
        return s.decode('utf_8')
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 34: invalid start byte
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\commands\install.py", line 385, in run
        requirement_set.cleanup_files()
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\req\req_set.py", line 729, in cleanup_files
        req.remove_temporary_source()
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\req\req_install.py", line 977, in remove_temporary_sou
        rmtree(self.source_dir)
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\_vendor\retrying.py", line 49, in wrapped_f
        return Retrying(*dargs, **dkw).call(f, *args, **kw)
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\_vendor\retrying.py", line 212, in call
        raise attempt.get()
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\_vendor\retrying.py", line 247, in get
        six.reraise(self.value[0], self.value[1], self.value[2])
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\_vendor\six.py", line 686, in reraise
        raise value
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\_vendor\retrying.py", line 200, in call
        attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\utils\__init__.py", line 102, in rmtree
        onerror=rmtree_errorhandler)
      File "c:\users\59740\appdata\local\programs\python\python36\lib\shutil.py", line 488, in rmtree
        return _rmtree_unsafe(path, onerror)
      File "c:\users\59740\appdata\local\programs\python\python36\lib\shutil.py", line 387, in _rmtree_unsafe
        onerror(os.rmdir, path, sys.exc_info())
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\utils\__init__.py", line 114, in rmtree_errorhandler
        func(path)
    PermissionError: [WinError 32] 另一个程序正在使用此文件,进程无法访问。: 'C:\\Users\\59740\\AppData\\Local\\Temp\\pip-build-1djzmudb\\scrapy'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\basecommand.py", line 215, in main
        status = self.run(options, args)
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\commands\install.py", line 385, in run
        requirement_set.cleanup_files()
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\utils\build.py", line 38, in __exit__
        self.cleanup()
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\utils\build.py", line 42, in cleanup
        rmtree(self.name)
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\_vendor\retrying.py", line 49, in wrapped_f
        return Retrying(*dargs, **dkw).call(f, *args, **kw)
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\_vendor\retrying.py", line 212, in call
        raise attempt.get()
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\_vendor\retrying.py", line 247, in get
        six.reraise(self.value[0], self.value[1], self.value[2])
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\_vendor\six.py", line 686, in reraise
        raise value
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\_vendor\retrying.py", line 200, in call
        attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\utils\__init__.py", line 102, in rmtree
        onerror=rmtree_errorhandler)
      File "c:\users\59740\appdata\local\programs\python\python36\lib\shutil.py", line 488, in rmtree
        return _rmtree_unsafe(path, onerror)
      File "c:\users\59740\appdata\local\programs\python\python36\lib\shutil.py", line 378, in _rmtree_unsafe
        _rmtree_unsafe(fullname, onerror)
      File "c:\users\59740\appdata\local\programs\python\python36\lib\shutil.py", line 387, in _rmtree_unsafe
        onerror(os.rmdir, path, sys.exc_info())
      File "c:\users\59740\appdata\local\programs\python\python36\lib\site-packages\pip\utils\__init__.py", line 114, in rmtree_errorhandler
        func(path)
    PermissionError: [WinError 32] 另一个程序正在使用此文件,进程无法访问。: 'C:\\Users\\59740\\AppData\\Local\\Temp\\pip-build-1djzmudb\\scrapy
    

      

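    Solution:

      Open

    c:\program files\python36\lib\site-packages\pip\compat\__init__.py 

       Find

    return s.decode('utf_8')


    and change it to

    return s.decode('cp936')

      
    This is an encoding problem. Python 3 uses utf-8 throughout, but the terminal on Chinese-language Windows still displays text in gbk (code page cp936).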
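
    To see why the change helps, here is a minimal sketch that is not pip's actual code: the helper name decode_console_output and the fallback order are my own assumptions. It shows gbk-encoded console bytes failing under utf-8 and decoding cleanly under cp936.

```python
def decode_console_output(raw, encodings=("utf-8", "cp936")):
    """Decode console bytes, trying each encoding in turn.

    If every encoding fails, decode with the last one and replace the
    undecodable bytes instead of raising.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode(encodings[-1], errors="replace")


if __name__ == "__main__":
    # "\u4e2d\u6587" is the two-character word for "Chinese", encoded the
    # way a gbk console would emit it.
    sample = "\u4e2d\u6587".encode("cp936")
    try:
        sample.decode("utf-8")
    except UnicodeDecodeError as error:
        print("utf-8 failed:", error)
    print("with fallback:", decode_console_output(sample))
```

    Hard-coding cp936, as in the fix above, only suits consoles that really use gbk. Trying utf-8 first keeps other systems working.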

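     Common problem 7: win32 is missing

      When the module is missing, the following error is shown:

    ModuleNotFoundError: No module named 'win32api'

      Install pywin32, choosing the build that matches your machine (mine is Python 3.6 on a 64-bit operating system; the cp36 in the file name means Python 3.6, and amd64 refers to the bitness of the Python installation).

      The install command is: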

    pip install pywin32-221-cp36-cp36m-win_amd64.whl

     

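    Finally, install Scrapy

      Enter the following in cmd:

    pip install scrapy

       OK, after all that trouble the Scrapy framework is finally installed. It really felt like going through eighty-one trials.

      To summarize, the rough order for installing Scrapy is:

    Starting from a working Anaconda environment, install in the following order:
    
    1. pip install wheel
    
    2. Download the Twisted build that matches your setup, then run pip install <downloaded file>.whl
    
    3. pip install pywin32
    
    4. pip install scrapy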
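
    After working through the steps above, a quick check can confirm that every dependency is importable. This is a sketch under assumptions: the helper name missing_modules is hypothetical, and the list uses import names (pywin32 provides win32api, and Twisted is imported as twisted).

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names, not pip package names.
    required = ["wheel", "lxml", "twisted", "win32api", "scrapy"]
    print("missing:", missing_modules(required))
```

    An empty list means all five modules were found. Any name that is printed points back to the matching problem section above.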
    

      

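    A harder problem: the specified module cannot be found

      The error reports that the specified module cannot be found.

       I tried many fixes found online and none of them worked, which was frustrating.

      So I uninstalled everything I had installed, removing lxml, twisted and pywin32 in turn. With luck, reinstalling them is enough.

      If that does not work, one more thing needs updating: the OpenSSL version.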

    conda install openssl=1.0.2p
    

      That fixes it.
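
      To see which OpenSSL build the interpreter is using before and after the change, the standard ssl module can report it. This is only a sketch: the helper name linked_openssl is hypothetical, and I am assuming that the OpenSSL linked into Python's ssl module is the one conda manages in your environment, which may not be the same library that the failing module loads.

```python
import ssl


def linked_openssl():
    """Return the OpenSSL version string and its first three version numbers."""
    return ssl.OPENSSL_VERSION, ssl.OPENSSL_VERSION_INFO[:3]


if __name__ == "__main__":
    version_text, version_numbers = linked_openssl()
    print(version_text)
    print(version_numbers)
```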

     

      Reference: https://www.cnblogs.com/little-orangeaaa/p/10259973.html

     

    Common Scrapy commands

      List all commands

    scrapy -h

      Show help information

    scrapy --help
    

      Show version information

    (venv)ql@ql:~$ scrapy version
    Scrapy 1.1.2
    (venv)ql@ql:~$ 
    (venv)ql@ql:~$ scrapy version -v
    Scrapy    : 1.1.2
    lxml      : 3.6.4.0
    libxml2   : 2.9.4
    Twisted   : 16.4.0
    Python    : 2.7.12 (default, Jul  1 2016, 15:12:24) - [GCC 5.4.0 20160609]
    pyOpenSSL : 16.1.0 (OpenSSL 1.0.2g-fips  1 Mar 2016)
    Platform  : Linux-4.4.0-36-generic-x86_64-with-Ubuntu-16.04-xenial
    (venv)ql@ql:~$ 
    

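      Create a new project

    scrapy startproject spider_name
    

      Create a spider with genspider (generate spider). A project can contain several spiders, but each name must be unique.

    scrapy genspider name domain
    # for example:
    # scrapy genspider sohu sohu.org
    

      List the spiders in the current project

    scrapy list
    

      view opens a web page in the browser

    scrapy view http://www.baidu.com
    

      The shell command enters the Scrapy interactive environment

    # enter the interactive environment for this URL
    scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
    

      This drops you into the interactive environment. The main thing we use there is the response object, for example:

    response.xpath()    # put the XPath expression inside the parentheses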
    

      The runspider command runs a spider directly from its Python file, without running the whole project

    scrapy runspider spider_file.py
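
      runspider takes the path of a Python file that defines a spider. The sketch below writes such a file so there is something to run. Everything in it is a placeholder of mine: the file name sohu_spider.py, the class SohuSpider, and the sohu.org URL borrowed from the genspider example. The spider source itself needs Scrapy installed to run, but the writer script uses only the standard library.

```python
from pathlib import Path

# Source of a minimal stand-alone spider. It is kept as text so that this
# script itself does not need Scrapy; only the generated file does.
SPIDER_SOURCE = '''import scrapy


class SohuSpider(scrapy.Spider):
    name = "sohu"
    allowed_domains = ["sohu.org"]
    start_urls = ["http://sohu.org/"]

    def parse(self, response):
        # Emit one item holding the page title, selected with XPath.
        yield {"title": response.xpath("//title/text()").extract_first()}
'''


def write_spider(path="sohu_spider.py"):
    """Write the spider source to `path` and return it as a Path object."""
    target = Path(path)
    target.write_text(SPIDER_SOURCE, encoding="utf-8")
    return target
```

      After calling write_spider(), the file can be run with scrapy runspider sohu_spider.py.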
    

      
