feat: Zhihu now supports crawling creator homepage data (answers, articles, videos)
parent
af9d2d8e84
commit
da8f1c62b8
102
README.md
@ -13,17 +13,8 @@
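How it works: [playwright](https://playwright.dev/) serves as a bridge. The crawler keeps the browser context from a successful login and evaluates JS expressions in it to obtain the encrypted parameters.
This approach avoids re-implementing the core encryption JS, which greatly lowers the reverse-engineering difficulty.

[MediaCrawlerPro](https://github.com/MediaCrawlerPro) has been released. Its advantages over the open-source version:
- Multi-account and IP proxy support (the key feature!)
- No Playwright dependency, simpler to use
- Linux deployment support (Docker, docker-compose)
- Refactored and optimized code that is easier to read and maintain (JS signing logic decoupled)
- A polished architecture that is easier to extend and more valuable for source-code study

MediaCrawler repository platinum sponsor:
<a href="https://dashboard.ipcola.com/register?referral_code=atxtupzfjhpbdbl">⚡️[IPCola: exclusive global overseas IP proxies]⚡️ Fresh native residential proxies, highly cost-effective, with many hard-to-find countries</a>
> [IPCola: exclusive global overseas IP proxies] Register with A Jiang's exclusive referral code atxtupzfjhpbdbl to get a 10% credit bonus.

## Feature list
| Platform | Keyword search | Crawl by post ID | Second-level comments | Creator homepage | Login state cache | IP proxy pool | Comment word cloud |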
@ -36,8 +27,80 @@ MediaCrawler仓库白金赞助商:
| Tieba | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Zhihu | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ |

## Create and activate a Python virtual environment
> To crawl Douyin or Zhihu, install Node.js in advance; version `16` or later is sufficient. <br>
```shell
# Enter the project root directory
cd MediaCrawler

# Create the virtual environment
# The author uses Python 3.9.6 and the libraries in requirements.txt are based on that version. With other Python versions some libraries may be incompatible and need to be resolved manually.
python -m venv venv

# Activate the virtual environment on macOS & Linux
source venv/bin/activate

# Activate the virtual environment on Windows
venv\Scripts\activate

```

## Install dependencies

```shell
pip install -r requirements.txt
```

## Install the Playwright browser drivers

```shell
playwright install
```

## Run the crawler

```shell
### Comment crawling is off by default; to crawl comments, change the ENABLE_GET_COMMENTS variable in config/base_config.py
### Other supported options are also listed in config/base_config.py, with comments in Chinese

# Read keywords from the config file, search for related posts, and crawl post details and comments
python main.py --platform xhs --lt qrcode --type search

# Read the post ID list from the config file and crawl the details and comments of those posts
python main.py --platform xhs --lt qrcode --type detail

# Open the corresponding app and scan the QR code to log in

# For usage examples on other platforms, run the command below
python main.py --help
```
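
The commands above pass three flags to `main.py`. As a rough illustration of how such flags could be parsed, here is a minimal `argparse` sketch. Only the flag names and the values `xhs`, `qrcode`, `search`, `detail`, and `creator` appear in this commit; `build_parser`, the defaults, and the `choices` lists are assumptions for illustration and may not match the project's real `main.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: flag names come from the README commands,
    # defaults and choices are assumptions made for this sketch.
    parser = argparse.ArgumentParser(description="Media crawler CLI (sketch)")
    parser.add_argument("--platform", default="xhs",
                        help="platform to crawl, e.g. xhs or zhihu")
    parser.add_argument("--lt", default="qrcode",
                        choices=["qrcode", "phone", "cookie"],
                        help="login type")
    parser.add_argument("--type", default="search",
                        choices=["search", "detail", "creator"],
                        help="crawler type")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(["--platform", "zhihu", "--type", "creator"])
print(args.platform, args.lt, args.type)
```

Passing an unknown value such as `--type foo` makes `argparse` exit with a usage error, which is one reason to declare `choices` up front.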
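
## Data storage
- Save to a MySQL relational database (create the database beforehand)
    - Run `python db.py` to initialize the database table structure (first run only)
- Save to CSV (under the data/ directory)
- Save to JSON (under the data/ directory)

## MediaCrawlerPro
[MediaCrawlerPro](https://github.com/MediaCrawlerPro) has been released as a refactored version. Its advantages over the open-source version:
- Multi-account and IP proxy support (the key feature!)
- No Playwright dependency, simpler to use
- Linux deployment support (Docker, docker-compose)
- Refactored and optimized code that is easier to read and maintain (JS signing logic decoupled)
- Higher code quality, better suited to building larger crawler projects
- A polished architecture that is easier to extend and more valuable for source-code study

## See the online documentation for other common questions
>
> The online documentation covers usage, FAQs, how to join the project discussion group, and more.
> [MediaCrawler online documentation](https://nanmicoder.github.io/MediaCrawler/)
>
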
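## Developer services
> Open source is hard work. Please star the MediaCrawler repository and support my course and Knowledge Planet community. Many thanks! <br>
> Open source is hard work. Please star the MediaCrawler repository. Many thanks! <br>
> If you believe in paying for knowledge, take a look at the paid services below. If you are a student, be sure to say so in advance for a discount 💰<br>

- MediaCrawler source code walkthrough course:
If you want to get started with this project quickly or understand how it is implemented, I recommend the video course I recorded. It starts from the design and walks you through usage step by step, which greatly lowers the barrier to entry.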
@ -65,12 +128,6 @@ MediaCrawler仓库白金赞助商:
- [Idempotency issues with Python coroutines under concurrency](https://articles.zsxq.com/id_wocdwsfmfcmp.html)
- [Hidden bugs caused by misusing Python mutable types](https://articles.zsxq.com/id_f7vn89l1d303.html)

## Usage documentation

> The MediaCrawler documentation is built with vitepress and covers usage, FAQs, how to join the project discussion group, and more.
>
[MediaCrawler online documentation](https://nanmicoder.github.io/MediaCrawler/)

## Thanks to the following sponsors of this repository
> [IPCola: exclusive global overseas IP proxies] Register with A Jiang's exclusive referral code atxtupzfjhpbdbl to get a 10% credit bonus.
@ -80,6 +137,19 @@ MediaCrawler仓库白金赞助商:
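Become a sponsor to showcase your product here and get substantial daily exposure. Contact the author on WeChat: yzglan or by email: relakkes@gmail.com

## MediaCrawler WeChat discussion group

👏👏👏 A place for crawler enthusiasts to learn and improve together.

❗️❗️❗️ No advertising in the group, and no posts that break the rules or are unrelated to MediaCrawler.

### How to join
> Add the note "github" and the group assistant will add you to the group automatically.
>
> If the image does not display or has expired, add my WeChat ID yzglan directly with the note "github", and the group assistant will add you to the group automatically.

![relakkes_wechat](docs/static/images/relakkes_weichat.jpg)

## Donate

If you find the project useful, feel free to leave a tip. Your support is my biggest motivation!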
@ -1,6 +1,6 @@
# Basic configuration
PLATFORM = "xhs"
KEYWORDS = "编程副业,编程兼职"
KEYWORDS = "编程副业,编程兼职"  # Search keywords, separated by English (ASCII) commas
LOGIN_TYPE = "qrcode"  # qrcode or phone or cookie
COOKIES = ""
# See the enum values under media_platform.xxx.field for valid values; only Xiaohongshu is supported for now
@ -45,8 +45,8 @@ MAX_CONCURRENCY_NUM = 1
# Whether to crawl images; off by default
ENABLE_GET_IMAGES = False

# Whether to crawl comments; off by default
ENABLE_GET_COMMENTS = False
# Whether to crawl comments; on by default
ENABLE_GET_COMMENTS = True

# Whether to crawl second-level comments; off by default
# Projects on older versions that use the db must add the table fields described at schema/tables.sql line 287
@ -130,6 +130,13 @@ KS_CREATOR_ID_LIST = [
    # ........................
]

# List of Zhihu creator homepage URLs to crawl
ZHIHU_CREATOR_URL_LIST = [
    "https://www.zhihu.com/people/yd1234567",
    # ........................
]

# Word cloud settings
# Whether to generate a comment word cloud
ENABLE_GET_WORDCLOUD = False
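
Each entry in `ZHIHU_CREATOR_URL_LIST` is a full homepage URL, and the crawler in this commit derives the creator's `url_token` from it with `user_link.split("/")[-1]`. That yields an empty or wrong token if the URL ends with a slash, carries a query string, or points at a sub-page such as `/answers`. The sketch below shows one more tolerant way to do the same step with the standard library. `extract_url_token` is a hypothetical helper, not part of the commit, and it assumes creator URLs always have the form `/people/<url_token>`.

```python
from urllib.parse import urlparse


def extract_url_token(creator_url: str) -> str:
    """Return the url_token from a URL shaped like /people/<url_token>[/...]."""
    path = urlparse(creator_url).path
    # Drop empty segments produced by leading, trailing, or doubled slashes.
    segments = [seg for seg in path.split("/") if seg]
    if len(segments) < 2 or segments[0] != "people":
        raise ValueError(f"not a creator homepage URL: {creator_url!r}")
    return segments[1]


print(extract_url_token("https://www.zhihu.com/people/yd1234567"))
print(extract_url_token("https://www.zhihu.com/people/yd1234567/answers?page=2"))
```

Raising on unexpected URLs makes a misconfigured list fail early instead of sending a request for a meaningless token.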
@ -5,18 +5,19 @@ from typing import Any, Callable, Dict, List, Optional, Union
from urllib.parse import urlencode

import httpx
from httpx import Response
from playwright.async_api import BrowserContext, Page
from tenacity import retry, stop_after_attempt, wait_fixed

import config
from base.base_crawler import AbstractApiClient
from constant import zhihu as zhihu_constant
from model.m_zhihu import ZhihuComment, ZhihuContent
from model.m_zhihu import ZhihuComment, ZhihuContent, ZhihuCreator
from tools import utils

from .exception import DataFetchError, ForbiddenError
from .field import SearchSort, SearchTime, SearchType
from .help import ZhiHuJsonExtractor, sign
from .help import ZhihuExtractor, sign


class ZhiHuClient(AbstractApiClient):
@ -33,7 +34,7 @@ class ZhiHuClient(AbstractApiClient):
|
||||
self.timeout = timeout
|
||||
self.default_headers = headers
|
||||
self.cookie_dict = cookie_dict
|
||||
self._extractor = ZhiHuJsonExtractor()
|
||||
self._extractor = ZhihuExtractor()
|
||||
|
||||
async def _pre_headers(self, url: str) -> Dict:
|
||||
"""
|
||||
@ -95,7 +96,7 @@ class ZhiHuClient(AbstractApiClient):
|
||||
raise DataFetchError(response.text)
|
||||
|
||||
|
||||
async def get(self, uri: str, params=None) -> Dict:
|
||||
async def get(self, uri: str, params=None, **kwargs) -> Union[Response, Dict, str]:
|
||||
"""
|
||||
GET请求,对请求头签名
|
||||
Args:
|
||||
@ -109,7 +110,7 @@ class ZhiHuClient(AbstractApiClient):
|
||||
if isinstance(params, dict):
|
||||
final_uri += '?' + urlencode(params)
|
||||
headers = await self._pre_headers(final_uri)
|
||||
return await self.request(method="GET", url=zhihu_constant.ZHIHU_URL + final_uri, headers=headers)
|
||||
return await self.request(method="GET", url=zhihu_constant.ZHIHU_URL + final_uri, headers=headers, **kwargs)
|
||||
|
||||
async def pong(self) -> bool:
|
||||
"""
|
||||
@ -194,7 +195,7 @@ class ZhiHuClient(AbstractApiClient):
|
||||
}
|
||||
search_res = await self.get(uri, params)
|
||||
utils.logger.info(f"[ZhiHuClient.get_note_by_keyword] Search result: {search_res}")
|
||||
return self._extractor.extract_contents(search_res)
|
||||
return self._extractor.extract_contents_from_search(search_res)
|
||||
|
||||
async def get_root_comments(self, content_id: str, content_type: str, offset: str = "", limit: int = 10,
|
||||
order_by: str = "sort") -> Dict:
|
||||
@ -317,3 +318,170 @@ class ZhiHuClient(AbstractApiClient):
|
||||
all_sub_comments.extend(sub_comments)
|
||||
await asyncio.sleep(crawl_interval)
|
||||
return all_sub_comments

    async def get_creator_info(self, url_token: str) -> Optional[ZhihuCreator]:
        """
        Get creator information
        Args:
            url_token: the creator's URL token (the last path segment of the homepage URL)

        Returns:
            the creator parsed from the homepage HTML, or None if it cannot be extracted
        """
        uri = f"/people/{url_token}"
        html_content: str = await self.get(uri, return_response=True)
        return self._extractor.extract_creator(url_token, html_content)

    async def get_creator_answers(self, url_token: str, offset: int = 0, limit: int = 20) -> Dict:
        """
        Get one page of a creator's answers
        Args:
            url_token: the creator's URL token
            offset: index of the first item to return
            limit: maximum number of items per page

        Returns:
            the JSON response of the answers API

        """
        uri = f"/api/v4/members/{url_token}/answers"
        params = {
            "include": "data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,attachment,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,excerpt,paid_info,reaction_instruction,is_labeled,label_info,relationship.is_authorized,voting,is_author,is_thanked,is_nothelp;data[*].vessay_info;data[*].author.badge[?(type=best_answerer)].topics;data[*].author.vip_info;data[*].question.has_publishing_draft,relationship",
            "offset": offset,
            "limit": limit,
            "order_by": "created"
        }
        return await self.get(uri, params)

    async def get_creator_articles(self, url_token: str, offset: int = 0, limit: int = 20) -> Dict:
        """
        Get one page of a creator's articles
        Args:
            url_token: the creator's URL token
            offset: index of the first item to return
            limit: maximum number of items per page

        Returns:
            the JSON response of the articles API
        """
        uri = f"/api/v4/members/{url_token}/articles"
        params = {
            "include": "data[*].comment_count,suggest_edit,is_normal,thumbnail_extra_info,thumbnail,can_comment,comment_permission,admin_closed_comment,content,voteup_count,created,updated,upvoted_followees,voting,review_info,reaction_instruction,is_labeled,label_info;data[*].vessay_info;data[*].author.badge[?(type=best_answerer)].topics;data[*].author.vip_info;",
            "offset": offset,
            "limit": limit,
            "order_by": "created"
        }
        return await self.get(uri, params)

    async def get_creator_videos(self, url_token: str, offset: int = 0, limit: int = 20) -> Dict:
        """
        Get one page of a creator's videos
        Args:
            url_token: the creator's URL token
            offset: index of the first item to return
            limit: maximum number of items per page

        Returns:
            the JSON response of the zvideos API
        """
        uri = f"/api/v4/members/{url_token}/zvideos"
        params = {
            "include": "similar_zvideo,creation_relationship,reaction_instruction",
            "offset": offset,
            "limit": limit,
            "similar_aggregation": "true"
        }
        return await self.get(uri, params)

    async def get_all_anwser_by_creator(self, creator: ZhihuCreator, crawl_interval: float = 1.0,
                                        callback: Optional[Callable] = None) -> List[ZhihuContent]:
        """
        Get all answers of a creator
        Args:
            creator: the creator whose answers are crawled
            crawl_interval: delay after each page request, in seconds
            callback: awaited with each page of extracted contents

        Returns:
            all extracted answers
        """
        all_contents: List[ZhihuContent] = []
        is_end: bool = False
        offset: int = 0
        limit: int = 20
        while not is_end:
            res = await self.get_creator_answers(creator.url_token, offset, limit)
            if not res:
                break
            utils.logger.info(f"[ZhiHuClient.get_all_anwser_by_creator] Get creator {creator.url_token} answers: {res}")
            paging_info = res.get("paging", {})
            is_end = paging_info.get("is_end")
            contents = self._extractor.extract_content_list_from_creator(res.get("data"))
            if callback:
                await callback(contents)
            all_contents.extend(contents)
            offset += limit
            await asyncio.sleep(crawl_interval)
        return all_contents


    async def get_all_articles_by_creator(self, creator: ZhihuCreator, crawl_interval: float = 1.0,
                                          callback: Optional[Callable] = None) -> List[ZhihuContent]:
        """
        Get all articles of a creator
        Args:
            creator: the creator whose articles are crawled
            crawl_interval: delay after each page request, in seconds
            callback: awaited with each page of extracted contents

        Returns:
            all extracted articles
        """
        all_contents: List[ZhihuContent] = []
        is_end: bool = False
        offset: int = 0
        limit: int = 20
        while not is_end:
            res = await self.get_creator_articles(creator.url_token, offset, limit)
            if not res:
                break
            paging_info = res.get("paging", {})
            is_end = paging_info.get("is_end")
            contents = self._extractor.extract_content_list_from_creator(res.get("data"))
            if callback:
                await callback(contents)
            all_contents.extend(contents)
            offset += limit
            await asyncio.sleep(crawl_interval)
        return all_contents


    async def get_all_videos_by_creator(self, creator: ZhihuCreator, crawl_interval: float = 1.0,
                                        callback: Optional[Callable] = None) -> List[ZhihuContent]:
        """
        Get all videos of a creator
        Args:
            creator: the creator whose videos are crawled
            crawl_interval: delay after each page request, in seconds
            callback: awaited with each page of extracted contents

        Returns:
            all extracted videos
        """
        all_contents: List[ZhihuContent] = []
        is_end: bool = False
        offset: int = 0
        limit: int = 20
        while not is_end:
            res = await self.get_creator_videos(creator.url_token, offset, limit)
            if not res:
                break
            paging_info = res.get("paging", {})
            is_end = paging_info.get("is_end")
            contents = self._extractor.extract_content_list_from_creator(res.get("data"))
            if callback:
                await callback(contents)
            all_contents.extend(contents)
            offset += limit
            await asyncio.sleep(crawl_interval)
        return all_contents
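
The three `get_all_*_by_creator` methods share one offset/limit loop that stops when the response reports `paging.is_end`. The sketch below factors that loop into a single generic helper to make the pattern easier to see. It is an illustration only: `fetch_all`, `FetchPage`, `max_pages`, and `fake_fetch_page` are hypothetical names, and, unlike the methods in this commit, the sketch treats a missing `is_end` flag as the last page and caps the number of pages so the loop always terminates.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

# A page fetcher takes (offset, limit) and returns a dict shaped like the API response.
FetchPage = Callable[[int, int], Awaitable[Dict[str, Any]]]


async def fetch_all(fetch_page: FetchPage, limit: int = 20,
                    crawl_interval: float = 0.0, max_pages: int = 1000) -> List[Any]:
    items: List[Any] = []
    offset = 0
    for _ in range(max_pages):  # hard cap, an assumption of this sketch
        res = await fetch_page(offset, limit)
        if not res:
            break
        items.extend(res.get("data") or [])
        # A missing "is_end" flag is treated as the last page.
        if res.get("paging", {}).get("is_end", True):
            break
        offset += limit
        await asyncio.sleep(crawl_interval)
    return items


async def fake_fetch_page(offset: int, limit: int) -> Dict[str, Any]:
    # Stand-in for an API call: 45 items served in pages of `limit`.
    data = list(range(45))[offset:offset + limit]
    return {"data": data, "paging": {"is_end": offset + limit >= 45}}


result = asyncio.run(fetch_all(fake_fetch_page))
print(len(result))
```

With such a helper, each content type would only need to supply its own page fetcher, for example a small wrapper around `get_creator_answers`.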
@ -10,7 +10,7 @@ from playwright.async_api import (BrowserContext, BrowserType, Page,
|
||||
|
||||
import config
|
||||
from base.base_crawler import AbstractCrawler
|
||||
from model.m_zhihu import ZhihuContent
|
||||
from model.m_zhihu import ZhihuContent, ZhihuCreator
|
||||
from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
|
||||
from store import zhihu as zhihu_store
|
||||
from tools import utils
|
||||
@ -18,7 +18,7 @@ from var import crawler_type_var, source_keyword_var
|
||||
|
||||
from .client import ZhiHuClient
|
||||
from .exception import DataFetchError
|
||||
from .help import ZhiHuJsonExtractor
|
||||
from .help import ZhihuExtractor
|
||||
from .login import ZhiHuLogin
|
||||
|
||||
|
||||
@ -31,7 +31,7 @@ class ZhihuCrawler(AbstractCrawler):
|
||||
self.index_url = "https://www.zhihu.com"
|
||||
# self.user_agent = utils.get_user_agent()
|
||||
self.user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
|
||||
self._extractor = ZhiHuJsonExtractor()
|
||||
self._extractor = ZhihuExtractor()
|
||||
|
||||
async def start(self) -> None:
|
||||
"""
|
||||
@ -74,7 +74,7 @@ class ZhihuCrawler(AbstractCrawler):
|
||||
await self.zhihu_client.update_cookies(browser_context=self.browser_context)
|
||||
|
||||
# 知乎的搜索接口需要打开搜索页面之后cookies才能访问API,单独的首页不行
|
||||
utils.logger.info("[ZhihuCrawler.start] Zhihu跳转到搜索页面获取搜索页面的Cookies,改过程需要5秒左右")
|
||||
utils.logger.info("[ZhihuCrawler.start] Zhihu跳转到搜索页面获取搜索页面的Cookies,该过程需要5秒左右")
|
||||
await self.context_page.goto(f"{self.index_url}/search?q=python&search_source=Guess&utm_content=search_hot&type=content")
|
||||
await asyncio.sleep(5)
|
||||
await self.zhihu_client.update_cookies(browser_context=self.browser_context)
|
||||
@ -88,7 +88,7 @@ class ZhihuCrawler(AbstractCrawler):
|
||||
raise NotImplementedError
|
||||
elif config.CRAWLER_TYPE == "creator":
|
||||
# Get creator's information and their notes and comments
|
||||
raise NotImplementedError
|
||||
await self.get_creators_and_notes()
|
||||
else:
|
||||
pass
|
||||
|
||||
@ -169,6 +169,53 @@ class ZhihuCrawler(AbstractCrawler):
|
||||
callback=zhihu_store.batch_update_zhihu_note_comments
|
||||
)

    async def get_creators_and_notes(self) -> None:
        """
        Get creator's information and their notes and comments
        Returns:

        """
        utils.logger.info("[ZhihuCrawler.get_creators_and_notes] Begin get zhihu creators")
        for user_link in config.ZHIHU_CREATOR_URL_LIST:
            utils.logger.info(f"[ZhihuCrawler.get_creators_and_notes] Begin get creator {user_link}")
            user_url_token = user_link.split("/")[-1]
            # get creator detail info from web html content
            createor_info: ZhihuCreator = await self.zhihu_client.get_creator_info(url_token=user_url_token)
            if not createor_info:
                utils.logger.info(f"[ZhihuCrawler.get_creators_and_notes] Creator {user_url_token} not found")
                continue

            utils.logger.info(f"[ZhihuCrawler.get_creators_and_notes] Creator info: {createor_info}")
            await zhihu_store.save_creator(creator=createor_info)

            # By default only answers are crawled; uncomment the blocks below to crawl articles and videos

            # Get all answer information of the creator
            all_content_list = await self.zhihu_client.get_all_anwser_by_creator(
                creator=createor_info,
                crawl_interval=random.random(),
                callback=zhihu_store.batch_update_zhihu_contents
            )


            # Get all articles of the creator's contents
            # all_content_list = await self.zhihu_client.get_all_articles_by_creator(
            #     creator=createor_info,
            #     crawl_interval=random.random(),
            #     callback=zhihu_store.batch_update_zhihu_contents
            # )

            # Get all videos of the creator's contents
            # all_content_list = await self.zhihu_client.get_all_videos_by_creator(
            #     creator=createor_info,
            #     crawl_interval=random.random(),
            #     callback=zhihu_store.batch_update_zhihu_contents
            # )

            # Get all comments of the creator's contents
            await self.batch_get_content_comments(all_content_list)


    @staticmethod
    def format_proxy_info(ip_proxy_info: IpInfoModel) -> Tuple[Optional[Dict], Optional[Dict]]:
        """format proxy info for playwright and httpx"""
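
In `get_creators_and_notes`, each optional block assigns to the same `all_content_list` name, so enabling the article or video block as written would replace the answers before `batch_get_content_comments` runs. If comments are wanted for every content type, one option is to accumulate the lists instead. The sketch below shows that idea with stand-in fetchers; `collect_contents` and the `fake_*` coroutines are hypothetical and only mimic the shape of the client calls.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

# Each fetcher is a zero-argument coroutine function returning one list of contents.
Fetcher = Callable[[], Awaitable[List[str]]]


async def collect_contents(fetchers: Sequence[Fetcher]) -> List[str]:
    all_contents: List[str] = []
    for fetch in fetchers:
        # extend() keeps earlier results instead of rebinding the list.
        all_contents.extend(await fetch())
    return all_contents


async def fake_answers() -> List[str]:
    return ["answer-1", "answer-2"]


async def fake_articles() -> List[str]:
    return ["article-1"]


async def fake_videos() -> List[str]:
    return []


combined = asyncio.run(collect_contents([fake_answers, fake_articles, fake_videos]))
print(combined)
```

The fetchers run one after another here, which keeps request pacing simple; running them concurrently would be a separate design decision.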
@ -1,8 +1,10 @@
# -*- coding: utf-8 -*-
from typing import Dict, List
import json
from typing import Dict, List, Optional
from urllib.parse import parse_qs, urlparse

import execjs
from parsel import Selector

from constant import zhihu as zhihu_constant
from model.m_zhihu import ZhihuComment, ZhihuContent, ZhihuCreator
@ -29,11 +31,11 @@ def sign(url: str, cookies: str) -> Dict:
|
||||
return ZHIHU_SGIN_JS.call("get_sign", url, cookies)
|
||||
|
||||
|
||||
class ZhiHuJsonExtractor:
|
||||
class ZhihuExtractor:
|
||||
def __init__(self):
|
||||
pass
|
||||
|
||||
def extract_contents(self, json_data: Dict) -> List[ZhihuContent]:
|
||||
def extract_contents_from_search(self, json_data: Dict) -> List[ZhihuContent]:
|
||||
"""
|
||||
extract zhihu contents
|
||||
Args:
|
||||
@ -45,21 +47,34 @@ class ZhiHuJsonExtractor:
        if not json_data:
            return []

        result: List[ZhihuContent] = []
        search_result: List[Dict] = json_data.get("data", [])
        search_result = [s_item for s_item in search_result if s_item.get("type") in ['search_result', 'zvideo']]
        for sr_item in search_result:
            sr_object: Dict = sr_item.get("object", {})
            if sr_object.get("type") == zhihu_constant.ANSWER_NAME:
                result.append(self._extract_answer_content(sr_object))
            elif sr_object.get("type") == zhihu_constant.ARTICLE_NAME:
                result.append(self._extract_article_content(sr_object))
            elif sr_object.get("type") == zhihu_constant.VIDEO_NAME:
                result.append(self._extract_zvideo_content(sr_object))
        return self._extract_content_list([sr_item.get("object") for sr_item in search_result if sr_item.get("object")])


    def _extract_content_list(self, content_list: List[Dict]) -> List[ZhihuContent]:
        """
        extract zhihu content list
        Args:
            content_list:

        Returns:

        """
        if not content_list:
            return []

        res: List[ZhihuContent] = []
        for content in content_list:
            if content.get("type") == zhihu_constant.ANSWER_NAME:
                res.append(self._extract_answer_content(content))
            elif content.get("type") == zhihu_constant.ARTICLE_NAME:
                res.append(self._extract_article_content(content))
            elif content.get("type") == zhihu_constant.VIDEO_NAME:
                res.append(self._extract_zvideo_content(content))
            else:
                continue

        return result
        return res

    def _extract_answer_content(self, answer: Dict) -> ZhihuContent:
        """
@ -72,22 +87,23 @@ class ZhiHuJsonExtractor:
|
||||
res = ZhihuContent()
|
||||
res.content_id = answer.get("id")
|
||||
res.content_type = answer.get("type")
|
||||
res.content_text = extract_text_from_html(answer.get("content"))
|
||||
res.content_text = extract_text_from_html(answer.get("content", ""))
|
||||
res.question_id = answer.get("question").get("id")
|
||||
res.content_url = f"{zhihu_constant.ZHIHU_URL}/question/{res.question_id}/answer/{res.content_id}"
|
||||
res.title = extract_text_from_html(answer.get("title"))
|
||||
res.desc = extract_text_from_html(answer.get("description"))
|
||||
res.title = extract_text_from_html(answer.get("title", ""))
|
||||
res.desc = extract_text_from_html(answer.get("description", "") or answer.get("excerpt", ""))
|
||||
res.created_time = answer.get("created_time")
|
||||
res.updated_time = answer.get("updated_time")
|
||||
res.voteup_count = answer.get("voteup_count")
|
||||
res.comment_count = answer.get("comment_count")
|
||||
res.voteup_count = answer.get("voteup_count", 0)
|
||||
res.comment_count = answer.get("comment_count", 0)
|
||||
|
||||
# extract author info
|
||||
author_info = self._extract_author(answer.get("author"))
|
||||
author_info = self._extract_content_or_comment_author(answer.get("author"))
|
||||
res.user_id = author_info.user_id
|
||||
res.user_link = author_info.user_link
|
||||
res.user_nickname = author_info.user_nickname
|
||||
res.user_avatar = author_info.user_avatar
|
||||
res.user_url_token = author_info.url_token
|
||||
return res
|
||||
|
||||
def _extract_article_content(self, article: Dict) -> ZhihuContent:
|
||||
@ -106,17 +122,18 @@ class ZhiHuJsonExtractor:
|
||||
res.content_url = f"{zhihu_constant.ZHIHU_URL}/p/{res.content_id}"
|
||||
res.title = extract_text_from_html(article.get("title"))
|
||||
res.desc = extract_text_from_html(article.get("excerpt"))
|
||||
res.created_time = article.get("created_time")
|
||||
res.updated_time = article.get("updated_time")
|
||||
res.voteup_count = article.get("voteup_count")
|
||||
res.comment_count = article.get("comment_count")
|
||||
res.created_time = article.get("created_time", 0) or article.get("created", 0)
|
||||
res.updated_time = article.get("updated_time", 0) or article.get("updated", 0)
|
||||
res.voteup_count = article.get("voteup_count", 0)
|
||||
res.comment_count = article.get("comment_count", 0)
|
||||
|
||||
# extract author info
|
||||
author_info = self._extract_author(article.get("author"))
|
||||
author_info = self._extract_content_or_comment_author(article.get("author"))
|
||||
res.user_id = author_info.user_id
|
||||
res.user_link = author_info.user_link
|
||||
res.user_nickname = author_info.user_nickname
|
||||
res.user_avatar = author_info.user_avatar
|
||||
res.user_url_token = author_info.url_token
|
||||
return res
|
||||
|
||||
def _extract_zvideo_content(self, zvideo: Dict) -> ZhihuContent:
|
||||
@ -129,25 +146,34 @@ class ZhiHuJsonExtractor:
|
||||
|
||||
"""
|
||||
res = ZhihuContent()
|
||||
res.content_id = zvideo.get("zvideo_id")
|
||||
|
||||
if "video" in zvideo and isinstance(zvideo.get("video"), dict): # 说明是从创作者主页的视频列表接口来的
|
||||
res.content_id = zvideo.get("video").get("video_id")
|
||||
res.content_url = f"{zhihu_constant.ZHIHU_URL}/zvideo/{res.content_id}"
|
||||
res.created_time = zvideo.get("published_at")
|
||||
res.updated_time = zvideo.get("updated_at")
|
||||
else:
|
||||
res.content_id = zvideo.get("zvideo_id")
|
||||
res.content_url = zvideo.get("video_url")
|
||||
res.created_time = zvideo.get("created_at")
|
||||
|
||||
res.content_type = zvideo.get("type")
|
||||
res.content_url = zvideo.get("video_url")
|
||||
res.title = extract_text_from_html(zvideo.get("title"))
|
||||
res.desc = extract_text_from_html(zvideo.get("description"))
|
||||
res.created_time = zvideo.get("created_at")
|
||||
res.voteup_count = zvideo.get("voteup_count")
|
||||
res.comment_count = zvideo.get("comment_count")
|
||||
|
||||
# extract author info
|
||||
author_info = self._extract_author(zvideo.get("author"))
|
||||
author_info = self._extract_content_or_comment_author(zvideo.get("author"))
|
||||
res.user_id = author_info.user_id
|
||||
res.user_link = author_info.user_link
|
||||
res.user_nickname = author_info.user_nickname
|
||||
res.user_avatar = author_info.user_avatar
|
||||
res.user_url_token = author_info.url_token
|
||||
return res
|
||||
|
||||
@staticmethod
|
||||
def _extract_author(author: Dict) -> ZhihuCreator:
|
||||
def _extract_content_or_comment_author(author: Dict) -> ZhihuCreator:
|
||||
"""
|
||||
extract zhihu author
|
||||
Args:
|
||||
@ -165,6 +191,7 @@ class ZhiHuJsonExtractor:
|
||||
res.user_link = f"{zhihu_constant.ZHIHU_URL}/people/{author.get('url_token')}"
|
||||
res.user_nickname = author.get("name")
|
||||
res.user_avatar = author.get("avatar_url")
|
||||
res.url_token = author.get("url_token")
|
||||
return res
|
||||
|
||||
def extract_comments(self, page_content: ZhihuContent, comments: List[Dict]) -> List[ZhihuComment]:
|
||||
@ -209,7 +236,7 @@ class ZhiHuJsonExtractor:
|
||||
res.content_type = page_content.content_type
|
||||
|
||||
# extract author info
|
||||
author_info = self._extract_author(comment.get("author"))
|
||||
author_info = self._extract_content_or_comment_author(comment.get("author"))
|
||||
res.user_id = author_info.user_id
|
||||
res.user_link = author_info.user_link
|
||||
res.user_nickname = author_info.user_nickname
|
||||
@ -254,3 +281,80 @@ class ZhiHuJsonExtractor:
|
||||
query_params = parse_qs(parsed_url.query)
|
||||
offset = query_params.get('offset', [""])[0]
|
||||
return offset

    @staticmethod
    def _foramt_gender_text(gender: int) -> str:
        """
        format gender text
        Args:
            gender: numeric gender code from the creator data (1 is male, 0 is female)

        Returns:
            "男" (male), "女" (female), or "未知" (unknown)
        """
        if gender == 1:
            return "男"
        elif gender == 0:
            return "女"
        else:
            return "未知"


    def extract_creator(self, user_url_token: str, html_content: str) -> Optional[ZhihuCreator]:
        """
        extract zhihu creator
        Args:
            user_url_token : zhihu creator url token
            html_content: zhihu creator html content

        Returns:
            the creator parsed from the js-initialData script, or None if it is missing
        """
        if not html_content:
            return None

        js_init_data = Selector(text=html_content).xpath("//script[@id='js-initialData']/text()").get(default="").strip()
        if not js_init_data:
            return None

        js_init_data_dict: Dict = json.loads(js_init_data)
        users_info: Dict = js_init_data_dict.get("initialState", {}).get("entities", {}).get("users", {})
        if not users_info:
            return None

        creator_info: Dict = users_info.get(user_url_token)
        if not creator_info:
            return None

        res = ZhihuCreator()
        res.user_id = creator_info.get("id")
        res.user_link = f"{zhihu_constant.ZHIHU_URL}/people/{user_url_token}"
        res.user_nickname = creator_info.get("name")
        res.user_avatar = creator_info.get("avatarUrl")
        res.url_token = creator_info.get("urlToken") or user_url_token
        res.gender = self._foramt_gender_text(creator_info.get("gender"))
        res.ip_location = creator_info.get("ipInfo")
        res.follows = creator_info.get("followingCount")
        res.fans = creator_info.get("followerCount")
        res.anwser_count = creator_info.get("answerCount")
        res.video_count = creator_info.get("zvideoCount")
        res.question_count = creator_info.get("questionCount")
        res.article_count = creator_info.get("articlesCount")
        res.column_count = creator_info.get("columnsCount")
        res.get_voteup_count = creator_info.get("voteupCount")
        return res


    def extract_content_list_from_creator(self, anwser_list: List[Dict]) -> List[ZhihuContent]:
        """
        extract content list from creator
        Args:
            anwser_list:

        Returns:

        """
        if not anwser_list:
            return []

        return self._extract_content_list(anwser_list)
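
`extract_creator` reads the creator record from the JSON embedded in the page's `js-initialData` script tag, under `initialState.entities.users`. The sketch below reproduces that lookup with only the standard library, using a regular expression in place of the `parsel` selector the commit uses, so it is a simplified stand-in rather than the project's code. `extract_user_entity` and the sample HTML are hypothetical, and the regex assumes the `id` attribute is written with double quotes. Unlike the original, the sketch also returns `None` when the script body is not valid JSON.

```python
import json
import re
from typing import Any, Dict, Optional

# Matches the body of <script id="js-initialData" ...> ... </script>.
_INIT_DATA_RE = re.compile(
    r'<script[^>]*\bid="js-initialData"[^>]*>(.*?)</script>', re.DOTALL
)


def extract_user_entity(html: str, url_token: str) -> Optional[Dict[str, Any]]:
    match = _INIT_DATA_RE.search(html)
    if not match:
        return None
    try:
        state = json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        return None
    users = state.get("initialState", {}).get("entities", {}).get("users", {})
    # Users are keyed by url_token; a missing key yields None.
    return users.get(url_token)


sample_html = (
    '<html><body><script id="js-initialData" type="text/json">'
    '{"initialState": {"entities": {"users": {"demo": {"name": "Demo", "followerCount": 3}}}}}'
    "</script></body></html>"
)
print(extract_user_entity(sample_html, "demo"))
```

An HTML parser such as the one used in the commit is the more robust choice for real pages; the regex only keeps this example dependency-free.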
@ -15,8 +15,8 @@ class ZhihuContent(BaseModel):
|
||||
question_id: str = Field(default="", description="问题ID, type为answer时有值")
|
||||
title: str = Field(default="", description="内容标题")
|
||||
desc: str = Field(default="", description="内容描述")
|
||||
created_time: int = Field(default="", description="创建时间")
|
||||
updated_time: int = Field(default="", description="更新时间")
|
||||
created_time: int = Field(default=0, description="创建时间")
|
||||
updated_time: int = Field(default=0, description="更新时间")
|
||||
voteup_count: int = Field(default=0, description="赞同人数")
|
||||
comment_count: int = Field(default=0, description="评论数量")
|
||||
source_keyword: str = Field(default="", description="来源关键词")
|
||||
@ -25,6 +25,7 @@ class ZhihuContent(BaseModel):
|
||||
user_link: str = Field(default="", description="用户主页链接")
|
||||
user_nickname: str = Field(default="", description="用户昵称")
|
||||
user_avatar: str = Field(default="", description="用户头像地址")
|
||||
user_url_token: str = Field(default="", description="用户url_token")
|
||||
|
||||
|
||||
class ZhihuComment(BaseModel):
|
||||
@ -57,7 +58,15 @@ class ZhihuCreator(BaseModel):
|
||||
user_link: str = Field(default="", description="用户主页链接")
|
||||
user_nickname: str = Field(default="", description="用户昵称")
|
||||
user_avatar: str = Field(default="", description="用户头像地址")
|
||||
url_token: str = Field(default="", description="用户url_token")
|
||||
gender: str = Field(default="", description="用户性别")
|
||||
ip_location: Optional[str] = Field(default="", description="IP地理位置")
|
||||
follows: int = Field(default=0, description="关注数")
|
||||
fans: int = Field(default=0, description="粉丝数")
|
||||
anwser_count: int = Field(default=0, description="回答数")
|
||||
video_count: int = Field(default=0, description="视频数")
|
||||
question_count: int = Field(default=0, description="提问数")
|
||||
article_count: int = Field(default=0, description="文章数")
|
||||
column_count: int = Field(default=0, description="专栏数")
|
||||
get_voteup_count: int = Field(default=0, description="获得的赞同数")
|
||||
|
||||
|
@ -474,6 +474,7 @@ CREATE TABLE `zhihu_content` (
|
||||
`user_link` varchar(255) NOT NULL COMMENT '用户主页链接',
|
||||
`user_nickname` varchar(64) NOT NULL COMMENT '用户昵称',
|
||||
`user_avatar` varchar(255) NOT NULL COMMENT '用户头像地址',
|
||||
`user_url_token` varchar(255) NOT NULL COMMENT '用户url_token',
|
||||
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
|
||||
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
|
||||
PRIMARY KEY (`id`),
|
||||
@ -482,6 +483,7 @@ CREATE TABLE `zhihu_content` (
|
||||
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='知乎内容(回答、文章、视频)';
|
||||
|
||||
|
||||
|
||||
CREATE TABLE `zhihu_comment` (
|
||||
`id` int NOT NULL AUTO_INCREMENT COMMENT '自增ID',
|
||||
`comment_id` varchar(64) NOT NULL COMMENT '评论ID',
|
||||
@ -513,10 +515,17 @@ CREATE TABLE `zhihu_creator` (
|
||||
`user_link` varchar(255) NOT NULL COMMENT '用户主页链接',
|
||||
`user_nickname` varchar(64) NOT NULL COMMENT '用户昵称',
|
||||
`user_avatar` varchar(255) NOT NULL COMMENT '用户头像地址',
|
||||
`url_token` varchar(64) NOT NULL COMMENT '用户URL Token',
|
||||
`gender` varchar(16) DEFAULT NULL COMMENT '用户性别',
|
||||
`ip_location` varchar(64) DEFAULT NULL COMMENT 'IP地理位置',
|
||||
`follows` int NOT NULL DEFAULT '0' COMMENT '关注数',
|
||||
`fans` int NOT NULL DEFAULT '0' COMMENT '粉丝数',
|
||||
`follows` int NOT NULL DEFAULT 0 COMMENT '关注数',
|
||||
`fans` int NOT NULL DEFAULT 0 COMMENT '粉丝数',
|
||||
`anwser_count` int NOT NULL DEFAULT 0 COMMENT '回答数',
|
||||
`video_count` int NOT NULL DEFAULT 0 COMMENT '视频数',
|
||||
`question_count` int NOT NULL DEFAULT 0 COMMENT '问题数',
|
||||
`article_count` int NOT NULL DEFAULT 0 COMMENT '文章数',
|
||||
`column_count` int NOT NULL DEFAULT 0 COMMENT '专栏数',
|
||||
`get_voteup_count` int NOT NULL DEFAULT 0 COMMENT '获得的赞同数',
|
||||
`add_ts` bigint NOT NULL COMMENT '记录添加时间戳',
|
||||
`last_modify_ts` bigint NOT NULL COMMENT '记录最后修改时间戳',
|
||||
PRIMARY KEY (`id`),
|
||||
|
@ -3,7 +3,7 @@ from typing import List
|
||||
|
||||
import config
|
||||
from base.base_crawler import AbstractStore
|
||||
from model.m_zhihu import ZhihuComment, ZhihuContent
|
||||
from model.m_zhihu import ZhihuComment, ZhihuContent, ZhihuCreator
|
||||
from store.zhihu.zhihu_store_impl import (ZhihuCsvStoreImplement,
|
||||
ZhihuDbStoreImplement,
|
||||
ZhihuJsonStoreImplement)
|
||||
@ -25,6 +25,21 @@ class ZhihuStoreFactory:
|
||||
raise ValueError("[ZhihuStoreFactory.create_store] Invalid save option only supported csv or db or json ...")
|
||||
return store_class()

async def batch_update_zhihu_contents(contents: List[ZhihuContent]):
    """
    Batch update Zhihu contents
    Args:
        contents:

    Returns:

    """
    if not contents:
        return

    for content_item in contents:
        await update_zhihu_content(content_item)

async def update_zhihu_content(content_item: ZhihuContent):
    """
    Update a Zhihu content item
@ -71,3 +86,19 @@ async def update_zhihu_content_comment(comment_item: ZhihuComment):
    local_db_item.update({"last_modify_ts": utils.get_current_timestamp()})
    utils.logger.info(f"[store.zhihu.update_zhihu_note_comment] zhihu content comment:{local_db_item}")
    await ZhihuStoreFactory.create_store().store_comment(local_db_item)


async def save_creator(creator: ZhihuCreator):
    """
    Save Zhihu creator information
    Args:
        creator:

    Returns:

    """
    if not creator:
        return
    local_db_item = creator.model_dump()
    local_db_item.update({"last_modify_ts": utils.get_current_timestamp()})
    await ZhihuStoreFactory.create_store().store_creator(local_db_item)
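
`save_creator` follows a small pattern: dump the model to a dict, stamp it with `last_modify_ts`, and hand it to whichever store the factory returns. The sketch below mirrors that flow with standard-library pieces so it can run on its own. `Creator`, `InMemoryStore`, and the millisecond timestamp are assumptions of the sketch; the project itself uses the pydantic `ZhihuCreator` model, `ZhihuStoreFactory`, and `utils.get_current_timestamp()`.

```python
import asyncio
import time
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Creator:
    # Reduced stand-in for the ZhihuCreator model.
    user_id: str
    url_token: str
    fans: int = 0


class InMemoryStore:
    """Stand-in for a csv/db/json store implementation."""

    def __init__(self) -> None:
        self.rows: List[Dict[str, Any]] = []

    async def store_creator(self, item: Dict[str, Any]) -> None:
        self.rows.append(item)


async def save_creator(creator: Optional[Creator], store: InMemoryStore) -> None:
    if creator is None:
        return
    item = asdict(creator)
    # Timestamp unit (milliseconds) is an assumption of this sketch.
    item["last_modify_ts"] = int(time.time() * 1000)
    await store.store_creator(item)


store = InMemoryStore()
asyncio.run(save_creator(Creator(user_id="1", url_token="demo", fans=3), store))
print(store.rows[0]["url_token"])
```

Passing the store in as an argument, rather than creating it inside the function, makes the save step easy to exercise without a database.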