
Scrapy - Connecting to a Database

In the previous articles we learned how to write common web crawlers with the Scrapy framework. In this chapter, we will use Scrapy to store the scraped data in a database.

As with writing data to a file, writing to a database is done in the pipelines.py file.
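Note that Scrapy only runs a pipeline that has been enabled in settings.py. A minimal sketch, assuming the project is called blog_project (the dotted path and the priority value 300 are placeholders; adjust them to your project):

# settings.py -- enable the pipeline class defined in pipelines.py
ITEM_PIPELINES = {
    'blog_project.pipelines.BlogPipeline': 300,  # lower numbers run earlier
}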

Storing to MySQL

After modifying pipelines.py, the code is as follows:

import pymysql

class BlogPipeline(object):
    def __init__(self):
        # Connect to the local MySQL server and select the colin-test database.
        self.conn = pymysql.connect(host='127.0.0.1', user='root', passwd='123456',
                                    db='colin-test', charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, *args, **kwargs):
        # The item holds parallel lists; insert one row per scraped entry.
        for i in range(len(item['title'])):
            title = item['title'][i]
            page_views = item['page_views'][i]
            published_date = item['published_date'][i]
            # Parameterized query: the driver escapes the values, so quotes in
            # scraped text cannot break the statement (or inject SQL).
            sql = "insert into blog(title, page_views, published_date) values(%s, %s, %s)"
            self.cursor.execute(sql, (title, page_views, published_date))
            self.conn.commit()
        return item

    def close_spider(self, *args, **kwargs):
        self.conn.close()
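The pipeline assumes the colin-test database already contains a blog table. A one-off setup sketch with pymysql, under the assumption that the three fields are stored as plain text (the column types are assumptions; adjust them to your data):

import pymysql

# Run once before crawling: create the table the pipeline inserts into.
conn = pymysql.connect(host='127.0.0.1', user='root', passwd='123456',
                       db='colin-test', charset='utf8mb4')
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS blog (
            id INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255),
            page_views VARCHAR(32),
            published_date VARCHAR(32)
        )
    """)
conn.commit()
conn.close()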

Storing to SQLite

After modifying pipelines.py, the code is as follows:

import sqlite3

class BlogPipeline(object):
    def __init__(self):
        # SQLite stores everything in a single local file, blog.db.
        self.conn = sqlite3.connect('blog.db')
        self.cursor = self.conn.cursor()

    def process_item(self, item, *args, **kwargs):
        for i in range(len(item['title'])):
            title = item['title'][i]
            page_views = item['page_views'][i]
            published_date = item['published_date'][i]
            # sqlite3 uses ? placeholders for parameterized queries.
            sql = "insert into blog(title, page_views, published_date) values(?, ?, ?)"
            self.cursor.execute(sql, (title, page_views, published_date))
            self.conn.commit()
        return item

    def close_spider(self, *args, **kwargs):
        self.conn.close()
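Unlike MySQL, sqlite3.connect() creates blog.db automatically if it does not exist, but the blog table itself still has to be created. One option, sketched under the assumption that all three fields are text, is to do this in __init__ right after creating the cursor:

# Create the table on first run; subsequent runs leave it untouched.
self.cursor.execute("""
    CREATE TABLE IF NOT EXISTS blog (
        title TEXT,
        page_views TEXT,
        published_date TEXT
    )
""")
self.conn.commit()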

Storing to MongoDB

After modifying pipelines.py, the code is as follows:

import pymongo

class BlogPipeline(object):
    def __init__(self):
        # Keep a reference to the client so it can be closed in close_spider.
        self.client = pymongo.MongoClient("mongodb://localhost:27017/")
        database = self.client["sport"]
        self.collection = database["blog"]

    def process_item(self, item, *args, **kwargs):
        for i in range(len(item['title'])):
            title = item['title'][i]
            page_views = item['page_views'][i]
            published_date = item['published_date'][i]
            self.collection.insert_one({"title": title,
                                        "page_views": page_views,
                                        "published_date": published_date})
        return item

    def close_spider(self, *args, **kwargs):
        # Collections have no close(); close the client connection instead.
        self.client.close()
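MongoDB needs no setup of this kind: the sport database and the blog collection are created automatically on the first insert_one(). A quick sketch for checking that the scraped items actually landed (the field names match the pipeline above):

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")
collection = client["sport"]["blog"]

print(collection.count_documents({}))   # total number of stored items
for doc in collection.find().limit(5):  # peek at the first few documents
    print(doc["title"], doc["page_views"], doc["published_date"])
client.close()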