Reading the Scrapy Source: The Startup Flow

Updated: 2019-03-01 17:39:39

The scrapy command

 

Once you have written a spider with Scrapy, you run it with scrapy crawl <spider_name>. What actually happens during that process, and where does the scrapy command come from?

After Scrapy is installed, you can locate the command like this:


$ which scrapy
/usr/local/bin/scrapy

Open it with vim or any other editor: $ vim /usr/local/bin/scrapy

It is just a Python script, and a very short one:


#!/usr/bin/python3
# -*- coding: utf-8 -*-
import re
import sys

from scrapy.cmdline import execute

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
    sys.exit(execute())

Why is this the entry point after installing Scrapy? Because Scrapy's installation file, setup.py, declares the program's entry point:


from os.path import dirname, join
from pkg_resources import parse_version
from setuptools import setup, find_packages, __version__ as setuptools_version

with open(join(dirname(__file__), 'scrapy/VERSION'), 'rb') as f:
    version = f.read().decode('ascii').strip()


def has_environment_marker_platform_impl_support():
    """Code extracted from 'pytest/setup.py'
    https://github.com/pytest-dev/pytest/blob/7538680c/setup.py#L31
    The first known release to support environment marker with range operators
    it is 18.5, see:
    https://setuptools.readthedocs.io/en/latest/history.html#id235
    """
    return parse_version(setuptools_version) >= parse_version('18.5')


extras_require = {}

if has_environment_marker_platform_impl_support():
    extras_require[':platform_python_implementation == "PyPy"'] = [
        'PyPyDispatcher>=2.1.0',
    ]


setup(
    name='Scrapy',
    version=version,
    url='https://scrapy.org',
    description='A high-level Web Crawling and Web Scraping framework',
    long_description=open('README.rst').read(),
    author='Scrapy developers',
    maintainer='Pablo Hoffman',
    maintainer_email='pablo@pablohoffman.com',
    license='BSD',
    packages=find_packages(exclude=('tests', 'tests.*')),
    include_package_data=True,
    zip_safe=False,
    entry_points={
        'console_scripts': ['scrapy = scrapy.cmdline:execute']
    },
    classifiers=[
        'Framework :: Scrapy',
        'Development Status :: 5 - Production/Stable',
        'Environment :: Console',
        'Intended Audience :: Developers',
        'License :: OSI Approved :: BSD License',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: Implementation :: CPython',
        'Programming Language :: Python :: Implementation :: PyPy',
        'Topic :: Internet :: WWW/HTTP',
        'Topic :: Software Development :: Libraries :: Application Frameworks',
        'Topic :: Software Development :: Libraries :: Python Modules',
    ],
    python_requires='>=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*',
    install_requires=[
        'Twisted>=13.1.0',
        'w3lib>=1.17.0',
        'queuelib',
        'lxml',
        'pyOpenSSL',
        'cssselect>=0.9',
        'six>=1.5.2',
        'parsel>=1.5',
        'PyDispatcher>=2.0.5',
        'service_identity',
    ],
    extras_require=extras_require,
)

entry_points declares that the entry point is the execute method in cmdline.py. During installation, the setuptools package manager generates the short script shown above and places it on the executable path.

It is also worth mentioning how to write an executable file with Python. It is very simple and takes only a few steps: put a shebang line (such as #!/usr/bin/env python) at the top of the script, write the code below it, and mark the file as executable with chmod +x; the sketch below shows the whole thing.

After that, you can run the script directly by its file name instead of invoking it as python <file.py>. Simple, isn't it?
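A minimal sketch of those steps, using a hypothetical script file named hello (the file name and message are arbitrary):

$ cat hello
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Any Python code works here; the shebang above tells the shell
# which interpreter to use when the file is executed directly.
print("hello from an executable Python script")

$ chmod +x hello    # mark the file as executable
$ ./hello           # run it by name, no "python hello" needed
hello from an executable Python script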

 

 

The entry point: execute() in cmdline.py

 

Now that we know Scrapy's entry point is the execute method in scrapy/cmdline.py, let's take a look at that method.

The main stages of the run flow are annotated in the condensed sketch below, and each stage is then walked through in the sections that follow:
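A condensed paraphrase of execute() from the scrapy/cmdline.py of this era, not the verbatim source; imports, error handling and some branches (unknown command, --profile, etc.) are omitted:

def execute(argv=None, settings=None):
    if argv is None:
        argv = sys.argv

    # 1. Initialize the project settings (environment variables + scrapy.cfg + settings.py)
    if settings is None:
        settings = get_project_settings()

    # 2. Check whether we are running inside a Scrapy project
    inproject = inside_project()

    # 3. Collect all available commands into a {cmd_name: cmd_instance} dict
    cmds = _get_commands_dict(settings, inproject)

    # 4. Pop the command name (e.g. 'crawl') off the command line
    cmdname = _pop_command_name(argv)

    # 5. Let the matching command instance parse the remaining options
    cmd = cmds[cmdname]
    parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(),
                                   conflict_handler='resolve')
    settings.setdict(cmd.default_settings, priority='command')
    cmd.settings = settings
    cmd.add_options(parser)
    opts, args = parser.parse_args(args=argv[1:])
    _run_print_help(parser, cmd.process_options, args, opts)

    # 6. Initialize CrawlerProcess and run the command
    cmd.crawler_process = CrawlerProcess(settings)
    _run_print_help(parser, _run_command, cmd, args, opts)
    sys.exit(cmd.exitcode)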

 

 

Stage-by-stage walkthrough

 

Initializing the project settings

This stage is fairly simple: it initializes the environment from environment variables and scrapy.cfg, and finally produces a Settings instance. Here is the get_project_settings method (from scrapy.utils.project import inside_project, get_project_settings):


def get_project_settings():
    # Is SCRAPY_SETTINGS_MODULE already set in the environment?
    if ENVVAR not in os.environ:
        project = os.environ.get('SCRAPY_PROJECT', 'default')
        # Initialize the environment: locate the user's settings.py and
        # export its module path as SCRAPY_SETTINGS_MODULE
        init_env(project)
    # Load the default settings file default_settings.py into a Settings instance
    settings = Settings()
    # Fetch the user's settings module
    settings_module_path = os.environ.get(ENVVAR)
    # Apply it on top: user settings override the defaults
    if settings_module_path:
        settings.setmodule(settings_module_path, priority='project')
    # XXX: remove this hack
    # Apply any pickled settings passed through the environment
    pickled_settings = os.environ.get("SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE")
    if pickled_settings:
        settings.setdict(pickle.loads(pickled_settings), priority='project')
    # XXX: deprecate and remove this functionality
    # Other SCRAPY_-prefixed environment variables also override settings
    env_overrides = {k[7:]: v for k, v in os.environ.items() if
                     k.startswith('SCRAPY_')}
    if env_overrides:
        settings.setdict(env_overrides, priority='project')
    return settings

As part of this process the Settings object is initialized (from scrapy.settings import Settings):


class Settings(BaseSettings):
    """
    This object stores Scrapy settings for the configuration of internal
    components, and can be used for any further customization.

    It is a direct subclass and supports all methods of
    :class:`~scrapy.settings.BaseSettings`. Additionally, after instantiation
    of this class, the new object will have the global default settings
    described on :ref:`topics-settings-ref` already populated.
    """

    def __init__(self, values=None, priority='project'):
        # Do not pass kwarg values here. We don't want to promote user-defined
        # dicts, and we want to update, not replace, default dicts with the
        # values given by the user
        # Call the parent class constructor
        super(Settings, self).__init__()
        # Set every option from default_settings.py on this instance
        self.setmodule(default_settings, 'default')
        # Promote default dictionaries to BaseSettings instances for per-key
        # priorities
        for name, val in six.iteritems(self):
            if isinstance(val, dict):
                self.set(name, BaseSettings(val, 'default'), 'default')
        self.update(values, priority)

The program loads every option from the default settings file, default_settings.py, into the Settings instance, and every setting carries a priority, as the short illustration below shows.
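A small illustration of how those priorities behave, using the public Settings API (CONCURRENT_REQUESTS is a real Scrapy setting; the override values are arbitrary):

from scrapy.settings import Settings

settings = Settings()                            # populated from default_settings.py
print(settings.getint('CONCURRENT_REQUESTS'))    # 16, the built-in default

# A 'project'-level value (e.g. from settings.py) overrides the default...
settings.set('CONCURRENT_REQUESTS', 32, priority='project')
print(settings.getint('CONCURRENT_REQUESTS'))    # 32

# ...and a 'cmdline'-level value (e.g. scrapy crawl -s CONCURRENT_REQUESTS=8)
# overrides the project value, because 'cmdline' has a higher priority.
settings.set('CONCURRENT_REQUESTS', 8, priority='cmdline')
print(settings.getint('CONCURRENT_REQUESTS'))    # 8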

The default settings file default_settings.py is very important and worth reading through: it contains every default component, such as the scheduler class, the spider middlewares, the downloader middlewares, the download handlers, and so on.

Here you can already sense that Scrapy's architecture is very loosely coupled: every component is replaceable. What does replaceable mean?

For example, if the default scheduler does not do what you need, you can implement your own scheduler against the interface it defines, register your module in your own settings file, and Scrapy will use your scheduler at runtime. (scrapy-redis achieves distributed crawling precisely by replacing Scrapy's built-in components.)

Any module configured in the default settings file can be replaced this way, as the sketch below shows.
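For instance, swapping out the scheduler only takes one line in your project's settings.py. Here myproject.scheduler.MyScheduler is a hypothetical class used purely for illustration; it would have to implement the same interface as scrapy.core.scheduler.Scheduler:

# myproject/settings.py

# Replace the default scheduler (scrapy.core.scheduler.Scheduler)
# with your own implementation.
SCHEDULER = 'myproject.scheduler.MyScheduler'

# Other components are swapped the same way, e.g. the duplicate filter:
# DUPEFILTER_CLASS = 'myproject.dupefilters.MyDupeFilter'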

 

Checking whether we are inside a project

 


def inside_project():
    # Check whether the environment variable exists (it was set above)
    scrapy_module = os.environ.get('SCRAPY_SETTINGS_MODULE')
    if scrapy_module is not None:
        try:
            import_module(scrapy_module)
        except ImportError as exc:
            warnings.warn("Cannot import scrapy settings module %s: %s" % (scrapy_module, exc))
        else:
            return True
    # Otherwise look for the nearest scrapy.cfg; if one is found,
    # we consider ourselves to be inside a project
    return bool(closest_scrapy_cfg())

Some Scrapy commands depend on a project, while others are global and do not. The check mainly works by searching upward for the nearest scrapy.cfg file to decide whether we are in a project environment, as in the example file below.
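For reference, a typical scrapy.cfg generated by scrapy startproject looks roughly like this (myproject is a placeholder project name):

# scrapy.cfg, sitting at the project root.
# The [settings] section tells init_env() which module to export
# as SCRAPY_SETTINGS_MODULE.

[settings]
default = myproject.settings

[deploy]
#url = http://localhost:6800/
project = myproject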

 

Collecting the available commands into a {name: instance} dict

 


def _get_commands_dict(settings, inproject):
    # Import every module under scrapy/commands and build a {cmd_name: cmd} dict
    cmds = _get_commands_from_module('scrapy.commands', inproject)
    cmds.update(_get_commands_from_entry_points(inproject))
    # If the user's settings define COMMANDS_MODULE, load those command classes too
    cmds_module = settings['COMMANDS_MODULE']
    if cmds_module:
        cmds.update(_get_commands_from_module(cmds_module, inproject))
    return cmds


def _get_commands_from_module(module, inproject):
    d = {}
    # Find every command class (ScrapyCommand subclass) in this module
    for cmd in _iter_command_classes(module):
        if inproject or not cmd.requires_project:
            # Build the {cmd_name: cmd} dict
            cmdname = cmd.__module__.split('.')[-1]
            d[cmdname] = cmd()
    return d


def _iter_command_classes(module_name):
    # TODO: add `name` attribute to commands and merge this function with
    # scrapy.utils.spider.iter_spider_classes
    # Walk every module under this package and yield the ScrapyCommand subclasses
    for module in walk_modules(module_name):
        for obj in vars(module).values():
            if inspect.isclass(obj) and \
                    issubclass(obj, ScrapyCommand) and \
                    obj.__module__ == module.__name__ and \
                    not obj == ScrapyCommand:
                yield obj

This stage imports every module under the commands package and builds a {cmd_name: cmd} dict. If the user's settings file configures custom command classes, they are appended too. In other words, you can write your own command class, register it in the settings, and then use your own custom command; a minimal sketch follows.
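A minimal sketch of such a custom command, assuming hypothetical paths myproject/commands/hello.py (with an __init__.py in the commands package) and a project named myproject. Because the command name is taken from the module name, this registers a "scrapy hello" command:

# myproject/commands/hello.py
from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
    # Allow the command to run outside a project as well
    requires_project = False

    def short_desc(self):
        return "Print a greeting (demo of a custom command)"

    def run(self, args, opts):
        print("hello from a custom scrapy command")


# myproject/settings.py
COMMANDS_MODULE = 'myproject.commands'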

 

Parsing the command name and finding the matching command instance

 


def _pop_command_name(argv):
    i = 0
    for arg in argv[1:]:
        if not arg.startswith('-'):
            del argv[i]
            return arg
        i += 1

This step parses the command line: for scrapy crawl <spider_name> it extracts crawl, and the command dict built above maps it to the Command instance defined in crawl.py under the commands package. A quick trace follows.
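Tracing the function above with a typical command line (a quick illustration, not part of the Scrapy source):

argv = ['scrapy', 'crawl', 'myspider', '-o', 'items.json']
cmdname = _pop_command_name(argv)

# cmdname == 'crawl'
# argv    == ['crawl', 'myspider', '-o', 'items.json']
# execute() then hands argv[1:] (['myspider', '-o', 'items.json'])
# to the command's option parser.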

 

The command instance parses the command-line arguments

 

Once the matching command instance is found, its process_options method is called (here, scrapy/commands/crawl.py):


class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return "[options] <spider>"

    def short_desc(self):
        return "Run a spider"

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")

    def process_options(self, args, opts):
        # First call the parent's process_options to parse the options
        # shared by all commands
        ScrapyCommand.process_options(self, args, opts)
        try:
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)
        if opts.output:
            if opts.output == '-':
                self.settings.set('FEED_URI', 'stdout:', priority='cmdline')
            else:
                self.settings.set('FEED_URI', opts.output, priority='cmdline')
            feed_exporters = without_none_values(
                self.settings.getwithbase('FEED_EXPORTERS'))
            valid_output_formats = feed_exporters.keys()
            if not opts.output_format:
                opts.output_format = os.path.splitext(opts.output)[1].replace(".", "")
            if opts.output_format not in valid_output_formats:
                raise UsageError("Unrecognized output format '%s', set one"
                                 " using the '-t' switch or as a file extension"
                                 " from the supported list %s" % (opts.output_format,
                                                                  tuple(valid_output_formats)))
            self.settings.set('FEED_FORMAT', opts.output_format, priority='cmdline')

    def run(self, args, opts):
        if len(args) < 1:
            raise UsageError()
        elif len(args) > 1:
            raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
        spname = args[0]

        self.crawler_process.crawl(spname, **opts.spargs)
        self.crawler_process.start()

        if self.crawler_process.bootstrap_failed:
            self.exitcode = 1

This step parses the rest of the command-line arguments. The fixed options shared by all commands are handled by the parent class, while the remaining options (such as where to write the output) are parsed by each specific command class, as in the usage example below.
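For example, a typical invocation and where its options end up (the spider name and argument are made up for illustration):

$ scrapy crawl myspider -a category=books -o items.json
# -a category=books -> opts.spargs == {'category': 'books'}; later passed to the
#                      spider as a constructor keyword argument
# -o items.json     -> FEED_URI = 'items.json', FEED_FORMAT = 'json'
#                      (the format is inferred from the file extension)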

 

Initializing CrawlerProcess

 

Finally, a CrawlerProcess instance is created and the run method of the matching command instance is executed:


cmd.crawler_process = CrawlerProcess(settings)
_run_print_help(parser, _run_command, cmd, args, opts)

If the command being run is scrapy crawl <spider_name>, what executes is the run method of commands/crawl.py (see the run method in the code above).

run calls crawl and start on the CrawlerProcess instance, and with that the whole crawler is up and running; the snippet below shows the same API used from a script.
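The same pair of calls is what you use when driving Scrapy from your own script instead of the CLI (a minimal sketch; 'myspider' is a placeholder spider name that must exist in the project):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Build the settings the same way the CLI does, schedule one spider,
# then block until it finishes (start() runs the Twisted reactor).
process = CrawlerProcess(get_project_settings())
process.crawl('myspider')
process.start()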

Let's first look at the CrawlerProcess initializer (scrapy/crawler.py):


class CrawlerProcess(CrawlerRunner):

    def __init__(self, settings=None, install_root_handler=True):
        # Call the parent class initializer
        super(CrawlerProcess, self).__init__(settings)
        # Install shutdown signal handlers and set up logging
        install_shutdown_handlers(self._signal_shutdown)
        configure_logging(self.settings, install_root_handler)
        log_scrapy_info(self.settings)

The constructor calls the constructor of the parent class, CrawlerRunner:


class CrawlerRunner(object):

    def __init__(self, settings=None):
        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)
        self.settings = settings
        # Create the spider loader
        self.spider_loader = _get_spider_loader(settings)
        self._crawlers = set()
        self._active = set()
        self.bootstrap_failed = False

During initialization it calls the _get_spider_loader method:


def _get_spider_loader(settings):
    """ Get SpiderLoader instance from settings """
    # Warn if the deprecated SPIDER_MANAGER_CLASS setting is still used
    if settings.get('SPIDER_MANAGER_CLASS'):
        warnings.warn(
            'SPIDER_MANAGER_CLASS option is deprecated. '
            'Please use SPIDER_LOADER_CLASS.',
            category=ScrapyDeprecationWarning, stacklevel=2
        )
    cls_path = settings.get('SPIDER_MANAGER_CLASS',
                            settings.get('SPIDER_LOADER_CLASS'))
    loader_cls = load_object(cls_path)
    try:
        verifyClass(ISpiderLoader, loader_cls)
    except DoesNotImplement:
        warnings.warn(
            'SPIDER_LOADER_CLASS (previously named SPIDER_MANAGER_CLASS) does '
            'not fully implement scrapy.interfaces.ISpiderLoader interface. '
            'Please add all missing methods to avoid unexpected runtime errors.',
            category=ScrapyDeprecationWarning, stacklevel=2
        )
    return loader_cls.from_settings(settings.frozencopy())

The spider loader configured in the default settings is spiderloader.SpiderLoader (scrapy/spiderloader.py):


@implementer(ISpiderLoader)
class SpiderLoader(object):
    """
    SpiderLoader is a class which locates and loads spiders
    in a Scrapy project.
    """

    def __init__(self, settings):
        # Read the module paths that hold the spider scripts from the settings
        self.spider_modules = settings.getlist('SPIDER_MODULES')
        self.warn_only = settings.getbool('SPIDER_LOADER_WARN_ONLY')
        self._spiders = {}
        self._found = defaultdict(list)
        # Load all spiders
        self._load_all_spiders()

    def _check_name_duplicates(self):
        dupes = ["\n".join("  {cls} named {name!r} (in {module})".format(
                               module=mod, cls=cls, name=name)
                           for (mod, cls) in locations)
                 for name, locations in self._found.items()
                 if len(locations) > 1]
        if dupes:
            msg = ("There are several spiders with the same name:\n\n"
                   "{}\n\n  This can cause unexpected behavior.".format(
                       "\n\n".join(dupes)))
            warnings.warn(msg, UserWarning)

    def _load_spiders(self, module):
        for spcls in iter_spider_classes(module):
            self._found[spcls.name].append((module.__name__, spcls.__name__))
            self._spiders[spcls.name] = spcls

    def _load_all_spiders(self):
        # Build the {spider_name: spider_cls} dict
        for name in self.spider_modules:
            try:
                for module in walk_modules(name):
                    self._load_spiders(module)
            except ImportError as e:
                if self.warn_only:
                    msg = ("\n{tb}Could not load spiders from module '{modname}'. "
                           "See above traceback for details.".format(
                               modname=name, tb=traceback.format_exc()))
                    warnings.warn(msg, RuntimeWarning)
                else:
                    raise
        self._check_name_duplicates()

    @classmethod
    def from_settings(cls, settings):
        return cls(settings)

    def load(self, spider_name):
        """
        Return the Spider class for the given spider name. If the spider
        name is not found, raise a KeyError.
        """
        try:
            return self._spiders[spider_name]
        except KeyError:
            raise KeyError("Spider not found: {}".format(spider_name))

    def find_by_request(self, request):
        """
        Return the list of spider names that can handle the given request.
        """
        return [name for name, cls in self._spiders.items()
                if cls.handles_request(request)]

    def list(self):
        """
        Return a list with the names of all spiders available in the project.
        """
        return list(self._spiders.keys())

The spider loader imports every spider module and finally produces a {spider_name: spider_cls} dict; the sketch below shows how to use it directly.
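A quick sketch of driving the loader by hand (run inside a project, so SPIDER_MODULES comes from settings.py; the spider name 'myspider' is a placeholder):

from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
loader = SpiderLoader.from_settings(settings)

print(loader.list())                     # e.g. ['myspider', ...]
spider_cls = loader.load('myspider')     # the spider *class*, not an instance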

 

Running the crawl and start methods

 

After CrawlerProcess is initialized, its crawl method is called:


class CrawlerRunner(object):

    def __init__(self, settings=None):
        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)
        self.settings = settings
        self.spider_loader = _get_spider_loader(settings)
        self._crawlers = set()
        self._active = set()
        self.bootstrap_failed = False

    @property
    def spiders(self):
        warnings.warn("CrawlerRunner.spiders attribute is renamed to "
                      "CrawlerRunner.spider_loader.",
                      category=ScrapyDeprecationWarning, stacklevel=2)
        return self.spider_loader

    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # Create a Crawler
        crawler = self.create_crawler(crawler_or_spidercls)
        return self._crawl(crawler, *args, **kwargs)

    def _crawl(self, crawler, *args, **kwargs):
        self.crawlers.add(crawler)
        # Call the Crawler's crawl method
        d = crawler.crawl(*args, **kwargs)
        self._active.add(d)

        def _done(result):
            self.crawlers.discard(crawler)
            self._active.discard(d)
            self.bootstrap_failed |= not getattr(crawler, 'spider', None)
            return result

        return d.addBoth(_done)

    def create_crawler(self, crawler_or_spidercls):
        # If a Crawler instance was passed in, use it as-is
        if isinstance(crawler_or_spidercls, Crawler):
            return crawler_or_spidercls
        # Otherwise create a Crawler from the spider class (or spider name)
        return self._create_crawler(crawler_or_spidercls)

    def _create_crawler(self, spidercls):
        # If a string was passed, load the spider class from the spider loader
        if isinstance(spidercls, six.string_types):
            spidercls = self.spider_loader.load(spidercls)
        return Crawler(spidercls, self.settings)

    def stop(self):
        """
        Stops simultaneously all the crawling jobs taking place.

        Returns a deferred that is fired when they all have ended.
        """
        return defer.DeferredList([c.stop() for c in list(self.crawlers)])

    @defer.inlineCallbacks
    def join(self):
        """
        join()

        Returns a deferred that is fired when all managed :attr:`crawlers` have
        completed their executions.
        """
        while self._active:
            yield defer.DeferredList(self._active)

This creates a Crawler instance and then calls its crawl method (class Crawler in scrapy/crawler.py):


@defer.inlineCallbacks
def crawl(self, *args, **kwargs):
    assert not self.crawling, "Crawling already taking place"
    self.crawling = True
    try:
        # Only at this point is the spider actually instantiated
        self.spider = self._create_spider(*args, **kwargs)
        # Create the execution engine
        self.engine = self._create_engine()
        # Call the spider's start_requests method to obtain the seed requests
        start_requests = iter(self.spider.start_requests())
        # Ask the engine to open the spider, passing it the spider instance
        # and the initial requests
        yield self.engine.open_spider(self.spider, start_requests)
        yield defer.maybeDeferred(self.engine.start)
    except Exception:
        # In Python 2 reraising an exception after yield discards
        # the original traceback (see https://bugs.python.org/issue7563),
        # so sys.exc_info() workaround is used.
        # This workaround also works in Python 3, but it is not needed,
        # and it is slower, so in Python 3 we use native `raise`.
        if six.PY2:
            exc_info = sys.exc_info()

        self.crawling = False
        if self.engine is not None:
            yield self.engine.close()

        if six.PY2:
            six.reraise(*exc_info)
        raise

Finally, the start method is called:


def start(self, stop_after_crawl=True):
    """
    This method starts a Twisted `reactor`_, adjusts its pool size to
    :setting:`REACTOR_THREADPOOL_MAXSIZE`, and installs a DNS cache based
    on :setting:`DNSCACHE_ENABLED` and :setting:`DNSCACHE_SIZE`.

    If `stop_after_crawl` is True, the reactor will be stopped after all
    crawlers have finished, using :meth:`join`.

    :param boolean stop_after_crawl: stop or not the reactor when all
        crawlers have finished
    """
    if stop_after_crawl:
        d = self.join()
        # Don't start the reactor if the deferreds are already fired
        if d.called:
            return
        d.addBoth(self._stop_reactor)

    reactor.installResolver(self._get_dns_resolver())
    # Size the reactor's thread pool (tune with REACTOR_THREADPOOL_MAXSIZE)
    tp = reactor.getThreadPool()
    tp.adjustPoolsize(maxthreads=self.settings.getint('REACTOR_THREADPOOL_MAXSIZE'))
    reactor.addSystemEventTrigger('before', 'shutdown', self.stop)
    # Start the event loop
    reactor.run(installSignalHandlers=False)  # blocking call

So what is the reactor? It is Twisted's event loop manager: you register the callables you want executed with the reactor, call its run method, and it executes them for you. Whenever one is blocked waiting on network I/O, it automatically switches to another runnable callable, which makes it very efficient.

You do not need to worry about how the reactor works internally; you can think of it as something like a thread pool that executes events through registered callbacks, as in the tiny example below.
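A tiny standalone Twisted example of that register-then-run pattern (unrelated to Scrapy itself):

from twisted.internet import reactor

def say_hello():
    print("hello from inside the reactor loop")
    reactor.stop()          # stop the event loop so the script exits

# Register a callback to run one second after the reactor starts...
reactor.callLater(1, say_hello)
# ...then hand control to the event loop (blocks until reactor.stop()).
reactor.run()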

From this point on, the crawler's scheduling logic is handed over to the execution engine, ExecutionEngine.

In short, every time a scrapy command runs, it goes through environment and settings initialization, loads the command classes and spider modules, and finally instantiates the execution engine and hands control to it. The next article will look at how the execution engine schedules and coordinates the various components.

