Merge branch 'main' of https://github.com/NanmiCoder/MediaCrawler

# Conflicts:
#	README.md

commit 71c07c7d36

.github/workflows/main.yaml (vendored, new file, 17 lines)
@@ -0,0 +1,17 @@
+on:
+  push:
+    branches:
+      - main
+
+jobs:
+  contrib-readme-job:
+    runs-on: ubuntu-latest
+    name: A job to automate contrib in readme
+    permissions:
+      contents: write
+      pull-requests: write
+    steps:
+      - name: Contribute List
+        uses: akhilmhdh/contributors-readme-action@v2.3.10
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
README.md (231 lines changed)
@@ -22,12 +22,11 @@
 |-----|-------|----------|-----|--------|-------|-------|-------|
 | 小红书 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
 | 抖音 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| 快手 | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ |
+| 快手 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
 | B 站 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
 | 微博 | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ |
-
 
 
 ## 使用方法
 
 ### 创建并激活 python 虚拟环境
@@ -87,12 +86,19 @@
 ## 开发者服务
 - 知识星球:沉淀高质量常见问题、最佳实践文档、多年编程+爬虫经验分享,提供付费知识星球服务,主动提问,作者会定期回答问题
 <p>
-<img alt="星球图片" src="static/images/xingqiu.jpg" style="width: auto;height: 400px" >
-<img alt="星球图片" src="static/images/xingqiu_yh.png" style="width: auto;height: 400px" >
+<img alt="xingqiu" src="https://nm.zizhi1.com/static/img/8e1312d1f52f2e0ff436ea7196b4e27b.15555424244122T1.webp" style="width: auto;height: 400px" >
 </p>
-前20个入驻星球的小伙伴,将获得新人券50元,还剩14张。
-<br>
-- 视频课程:
+星球精选文章:
+- [【独创】使用Playwright获取某音a_bogus参数流程(包含加密参数分析)](https://articles.zsxq.com/id_u89al50jk9x0.html)
+- [【独创】使用Playwright低成本获取某书X-s参数流程分析(当年的回忆录)](https://articles.zsxq.com/id_u4lcrvqakuc7.html)
+- [ MediaCrawler-基于抽象类设计重构项目缓存](https://articles.zsxq.com/id_4ju73oxewt9j.html)
+- [ 手把手带你撸一个自己的IP代理池](https://articles.zsxq.com/id_38fza371ladm.html)
+- 每天 1 块钱订阅我的知识服务
+
+
+
+- MediaCrawler视频课程:
 > 如果你想很快入门这个项目,或者想了解具体实现原理,我推荐你看看这个视频课程,从设计出发一步步带你如何使用,门槛大大降低,同时也是对我开源的支持,如果你能支持我的课程,我将会非常开心~<br>
 > 课程售价非常非常的便宜,几杯咖啡的事儿.<br>
 > 课程介绍飞书文档链接:https://relakkes.feishu.cn/wiki/JUgBwdhIeiSbAwkFCLkciHdAnhh
@@ -115,7 +121,7 @@
 > 7天有效期,自动更新, 如果人满了可以加作者wx拉进群: yzglan,备注来自github.
 
 <div style="max-width: 200px">
-<p><img alt="10群二维码" src="static/images/10群二维码.JPG" style="width: 200px;height: 100%" ></p>
+<p><img alt="11群二维码" src="static/images/11群二维码.JPG" style="width: 200px;height: 100%" ></p>
 </div>
 
 
@@ -144,11 +150,220 @@
 ## 手机号登录说明
 ➡️➡️➡️ [手机号登录说明](docs/手机号登录说明.md)
 
+<<<<<<< HEAD
 ## 词云图相关操作说明
 
 ➡️➡️➡️ [词云图相关说明](docs/关于词云图相关操作.md)
 
 
+=======
+## 项目贡献者
+<!-- readme: contributors -start -->
+<table>
+<tbody>
+<tr>
+    <td align="center">
+        <a href="https://github.com/NanmiCoder">
+            <img src="https://avatars.githubusercontent.com/u/47178017?v=4" width="100;" alt="NanmiCoder"/>
+            <br />
+            <sub><b>程序员阿江-Relakkes</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/leantli">
+            <img src="https://avatars.githubusercontent.com/u/117699758?v=4" width="100;" alt="leantli"/>
+            <br />
+            <sub><b>leantli</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/BaoZhuhan">
+            <img src="https://avatars.githubusercontent.com/u/140676370?v=4" width="100;" alt="BaoZhuhan"/>
+            <br />
+            <sub><b>Bao Zhuhan</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/nelzomal">
+            <img src="https://avatars.githubusercontent.com/u/8512926?v=4" width="100;" alt="nelzomal"/>
+            <br />
+            <sub><b>zhounan</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/Hiro-Lin">
+            <img src="https://avatars.githubusercontent.com/u/40111864?v=4" width="100;" alt="Hiro-Lin"/>
+            <br />
+            <sub><b>HIRO</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/PeanutSplash">
+            <img src="https://avatars.githubusercontent.com/u/98582625?v=4" width="100;" alt="PeanutSplash"/>
+            <br />
+            <sub><b>PeanutSplash</b></sub>
+        </a>
+    </td>
+</tr>
+<tr>
+    <td align="center">
+        <a href="https://github.com/Ermeng98">
+            <img src="https://avatars.githubusercontent.com/u/55784769?v=4" width="100;" alt="Ermeng98"/>
+            <br />
+            <sub><b>Ermeng</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/Rosyrain">
+            <img src="https://avatars.githubusercontent.com/u/116946548?v=4" width="100;" alt="Rosyrain"/>
+            <br />
+            <sub><b>Rosyrain</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/henryhyn">
+            <img src="https://avatars.githubusercontent.com/u/5162443?v=4" width="100;" alt="henryhyn"/>
+            <br />
+            <sub><b>Henry He</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/Akiqqqqqqq">
+            <img src="https://avatars.githubusercontent.com/u/51102894?v=4" width="100;" alt="Akiqqqqqqq"/>
+            <br />
+            <sub><b>leonardoqiuyu</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/jayeeliu">
+            <img src="https://avatars.githubusercontent.com/u/77389?v=4" width="100;" alt="jayeeliu"/>
+            <br />
+            <sub><b>jayeeliu</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/ZuWard">
+            <img src="https://avatars.githubusercontent.com/u/38209256?v=4" width="100;" alt="ZuWard"/>
+            <br />
+            <sub><b>ZuWard</b></sub>
+        </a>
+    </td>
+</tr>
+<tr>
+    <td align="center">
+        <a href="https://github.com/Zzendrix">
+            <img src="https://avatars.githubusercontent.com/u/154900254?v=4" width="100;" alt="Zzendrix"/>
+            <br />
+            <sub><b>Zendrix</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/chunpat">
+            <img src="https://avatars.githubusercontent.com/u/19848304?v=4" width="100;" alt="chunpat"/>
+            <br />
+            <sub><b>zhangzhenpeng</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/tanpenggood">
+            <img src="https://avatars.githubusercontent.com/u/37927946?v=4" width="100;" alt="tanpenggood"/>
+            <br />
+            <sub><b>Sam Tan</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/xbsheng">
+            <img src="https://avatars.githubusercontent.com/u/56357338?v=4" width="100;" alt="xbsheng"/>
+            <br />
+            <sub><b>xbsheng</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/yangrq1018">
+            <img src="https://avatars.githubusercontent.com/u/25074163?v=4" width="100;" alt="yangrq1018"/>
+            <br />
+            <sub><b>Martin</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/zhihuiio">
+            <img src="https://avatars.githubusercontent.com/u/165655688?v=4" width="100;" alt="zhihuiio"/>
+            <br />
+            <sub><b>zhihuiio</b></sub>
+        </a>
+    </td>
+</tr>
+<tr>
+    <td align="center">
+        <a href="https://github.com/renaissancezyc">
+            <img src="https://avatars.githubusercontent.com/u/118403818?v=4" width="100;" alt="renaissancezyc"/>
+            <br />
+            <sub><b>Ren</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/Tianci-King">
+            <img src="https://avatars.githubusercontent.com/u/109196852?v=4" width="100;" alt="Tianci-King"/>
+            <br />
+            <sub><b>Wang Tianci</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/Styunlen">
+            <img src="https://avatars.githubusercontent.com/u/30810222?v=4" width="100;" alt="Styunlen"/>
+            <br />
+            <sub><b>Styunlen</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/Schofi">
+            <img src="https://avatars.githubusercontent.com/u/33537727?v=4" width="100;" alt="Schofi"/>
+            <br />
+            <sub><b>Schofi</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/Klu5ure">
+            <img src="https://avatars.githubusercontent.com/u/166240879?v=4" width="100;" alt="Klu5ure"/>
+            <br />
+            <sub><b>Klu5ure</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/keeper-jie">
+            <img src="https://avatars.githubusercontent.com/u/33612777?v=4" width="100;" alt="keeper-jie"/>
+            <br />
+            <sub><b>Kermit</b></sub>
+        </a>
+    </td>
+</tr>
+<tr>
+    <td align="center">
+        <a href="https://github.com/kexinoh">
+            <img src="https://avatars.githubusercontent.com/u/91727108?v=4" width="100;" alt="kexinoh"/>
+            <br />
+            <sub><b>KEXNA</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/aa65535">
+            <img src="https://avatars.githubusercontent.com/u/5417786?v=4" width="100;" alt="aa65535"/>
+            <br />
+            <sub><b>Jian Chang</b></sub>
+        </a>
+    </td>
+    <td align="center">
+        <a href="https://github.com/522109452">
+            <img src="https://avatars.githubusercontent.com/u/16929874?v=4" width="100;" alt="522109452"/>
+            <br />
+            <sub><b>tianqing</b></sub>
+        </a>
+    </td>
+</tr>
+</tbody>
+</table>
+<!-- readme: contributors -end -->
+>>>>>>> 86a88f72602fe3f692acc628427888487554b716
 
 ## star 趋势图
 - 如果该项目对你有帮助,star一下 ❤️❤️❤️
@@ -1,4 +1,5 @@
 import argparse
+
 import config
 from tools.utils import str2bool
 
@@ -3,8 +3,10 @@ PLATFORM = "xhs"
 KEYWORDS = "python,golang"
 LOGIN_TYPE = "qrcode"  # qrcode or phone or cookie
 COOKIES = ""
-# 具体值参见media_platform.xxx.field下的枚举值,展示只支持小红书
+# 具体值参见media_platform.xxx.field下的枚举值,暂时只支持小红书
 SORT_TYPE = "popularity_descending"
+# 具体值参见media_platform.xxx.field下的枚举值,暂时只支持抖音
+PUBLISH_TIME_TYPE = 0
 CRAWLER_TYPE = "search"  # 爬取类型,search(关键词搜索) | detail(帖子详情)| creator(创作者主页数据)
 
 # 是否开启 IP 代理
@@ -103,6 +105,13 @@ BILI_CREATOR_ID_LIST = [
     # ........................
 ]
 
+# 指定快手创作者ID列表
+KS_CREATOR_ID_LIST = [
+    "3x4sm73aye7jq7i",
+    # ........................
+]
+
+
 #词云相关
 #是否开启生成评论词云图
 ENABLE_GET_WORDCLOUD = False
@@ -118,5 +127,3 @@ STOP_WORDS_FILE = "./docs/hit_stopwords.txt"
 
 #中文字体文件路径
 FONT_PATH= "./docs/STZHONGS.TTF"
-
-
@@ -106,6 +106,7 @@ class BilibiliCrawler(AbstractCrawler):
                     page += 1
                     continue
 
+                utils.logger.info(f"[BilibiliCrawler.search] search bilibili keyword: {keyword}, page: {page}")
                 video_id_list: List[str] = []
                 videos_res = await self.bili_client.search_video_by_keyword(
                     keyword=keyword,
@@ -126,7 +127,6 @@ class BilibiliCrawler(AbstractCrawler):
                     if video_item:
                         video_id_list.append(video_item.get("View").get("aid"))
                         await bilibili_store.update_bilibili_video(video_item)
-
                 page += 1
                 await self.batch_get_video_comments(video_id_list)
 
@@ -12,8 +12,8 @@ from playwright.async_api import BrowserContext, Page
 from tenacity import (RetryError, retry, retry_if_result, stop_after_attempt,
                       wait_fixed)
 
-from base.base_crawler import AbstractLogin
 import config
+from base.base_crawler import AbstractLogin
 from tools import utils
 
 
@@ -1,5 +1,6 @@
 import asyncio
 import copy
+import json
 import urllib.parse
 from typing import Any, Callable, Dict, List, Optional
 
@@ -119,14 +120,19 @@ class DOUYINClient(AbstractApiClient):
         params = {
             "keyword": urllib.parse.quote(keyword),
             "search_channel": search_channel.value,
-            "sort_type": sort_type.value,
-            "publish_time": publish_time.value,
             "search_source": "normal_search",
-            "query_correct_type": "1",
-            "is_filter_search": "0",
+            "query_correct_type": 1,
+            "is_filter_search": 0,
             "offset": offset,
             "count": 10  # must be set to 10
         }
+        if sort_type != SearchSortType.GENERAL or publish_time != PublishTimeType.UNLIMITED:
+            params["filter_selected"] = urllib.parse.quote(json.dumps({
+                "sort_type": str(sort_type.value),
+                "publish_time": str(publish_time.value)
+            }))
+            params["is_filter_search"] = 1
+            params["search_source"] = "tab_search"
         referer_url = "https://www.douyin.com/search/" + keyword
         referer_url += f"?publish_time={publish_time.value}&sort_type={sort_type.value}&type=general"
         headers = copy.copy(self.headers)
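The `filter_selected` branch added in the hunk above can be exercised in isolation. The sketch below uses illustrative stand-in enums (not the project's actual `field` module) and shows, under that assumption, how non-default sort/time filters are folded into a URL-encoded JSON parameter while the search source switches to `tab_search`:

```python
import json
import urllib.parse
from enum import Enum

# Stand-ins for the project's douyin field enums, for illustration only.
class SearchSortType(Enum):
    GENERAL = 0
    MOST_LIKE = 1
    LATEST = 2

class PublishTimeType(Enum):
    UNLIMITED = 0
    ONE_DAY = 1

def build_filter_params(sort_type: SearchSortType, publish_time: PublishTimeType) -> dict:
    """Mirror the hunk: non-default filters go into a URL-encoded JSON
    blob under `filter_selected`, and the search source changes."""
    params = {"search_source": "normal_search", "is_filter_search": 0}
    if sort_type != SearchSortType.GENERAL or publish_time != PublishTimeType.UNLIMITED:
        params["filter_selected"] = urllib.parse.quote(json.dumps({
            "sort_type": str(sort_type.value),
            "publish_time": str(publish_time.value),
        }))
        params["is_filter_search"] = 1
        params["search_source"] = "tab_search"
    return params

print(build_filter_params(SearchSortType.GENERAL, PublishTimeType.UNLIMITED)["is_filter_search"])  # 0
```

Keeping the default path untouched means existing "general search" requests stay byte-identical to what the endpoint saw before the change.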
@@ -90,9 +90,10 @@ class DouYinCrawler(AbstractCrawler):
                     page += 1
                     continue
                 try:
+                    utils.logger.info(f"[DouYinCrawler.search] search douyin keyword: {keyword}, page: {page}")
                     posts_res = await self.dy_client.search_info_by_keyword(keyword=keyword,
-                                                                            offset=page * dy_limit_count,
-                                                                            publish_time=PublishTimeType.UNLIMITED
+                                                                            offset=page * dy_limit_count - dy_limit_count,
+                                                                            publish_time=PublishTimeType(config.PUBLISH_TIME_TYPE)
                                                                             )
                 except DataFetchError:
                     utils.logger.error(f"[DouYinCrawler.search] search douyin keyword: {keyword} failed")
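The offset change above shifts page numbering: assuming the loop's `page` counter is 1-based, page N should request items starting at (N-1) * limit, which `page * dy_limit_count` got wrong by one page. A minimal check of that arithmetic:

```python
def page_to_offset(page: int, limit: int = 10) -> int:
    # Mirrors the fix above: `page * limit` skipped the first page's items;
    # `page * limit - limit` maps page 1 -> offset 0, page 2 -> offset 10, ...
    return page * limit - limit

print(page_to_offset(1), page_to_offset(2))  # 0 10
```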
@@ -12,13 +12,12 @@ class SearchChannelType(Enum):
 class SearchSortType(Enum):
     """search sort type"""
     GENERAL = 0  # 综合排序
-    LATEST = 1  # 最新发布
-    MOST_LIKE = 2  # 最多点赞
+    MOST_LIKE = 1  # 最多点赞
+    LATEST = 2  # 最新发布
 
-
 class PublishTimeType(Enum):
     """publish time type"""
     UNLIMITED = 0  # 不限
     ONE_DAY = 1  # 一天内
-    ONE_WEEK = 2  # 一周内
-    SIX_MONTH = 3  # 半年内
+    ONE_WEEK = 7  # 一周内
+    SIX_MONTH = 180  # 半年内
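Constructing an enum member from a config integer, as `PublishTimeType(config.PUBLISH_TIME_TYPE)` does in the douyin hunk earlier, relies on `Enum`'s lookup-by-value call syntax, so the member values must match what the config (and the API) actually use. A stand-in mirroring the updated values:

```python
from enum import Enum

class PublishTimeType(Enum):
    """Stand-in with the updated values from the hunk above."""
    UNLIMITED = 0    # 不限 (no limit)
    ONE_DAY = 1      # within one day
    ONE_WEEK = 7     # within one week
    SIX_MONTH = 180  # within six months

# Enum(value) resolves a member by value; unknown ints raise ValueError,
# which is why stale values like 2 or 3 had to be replaced.
assert PublishTimeType(7) is PublishTimeType.ONE_WEEK
```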
@@ -1,7 +1,7 @@
 # -*- coding: utf-8 -*-
 import asyncio
 import json
-from typing import Any, Callable, Dict, Optional
+from typing import Any, Callable, Dict, List, Optional
 from urllib.parse import urlencode
 
 import httpx
@@ -67,7 +67,7 @@ class KuaiShouClient(AbstractApiClient):
             "variables": {
                 "ftype": 1,
             },
-            "query": self.graphql.get("vision_profile")
+            "query": self.graphql.get("vision_profile_user_list")
         }
         res = await self.post("", post_data)
         if res.get("visionProfileUserList", {}).get("result") == 1:
@@ -129,17 +129,60 @@ class KuaiShouClient(AbstractApiClient):
                 "pcursor": pcursor
             },
             "query": self.graphql.get("comment_list")
 
         }
         return await self.post("", post_data)
 
-    async def get_video_all_comments(self, photo_id: str, crawl_interval: float = 1.0, is_fetch_sub_comments=False,
-                                     callback: Optional[Callable] = None):
+    async def get_video_sub_comments(
+            self, photo_id: str, rootCommentId: str, pcursor: str = ""
+    ) -> Dict:
+        """get video sub comments
+        :param photo_id: photo id you want to fetch
+        :param pcursor: last you get pcursor, defaults to ""
+        :return:
+        """
+        post_data = {
+            "operationName": "visionSubCommentList",
+            "variables": {
+                "photoId": photo_id,
+                "pcursor": pcursor,
+                "rootCommentId": rootCommentId,
+            },
+            "query": self.graphql.get("vision_sub_comment_list"),
+        }
+        return await self.post("", post_data)
+
+    async def get_creator_profile(self, userId: str) -> Dict:
+        post_data = {
+            "operationName": "visionProfile",
+            "variables": {
+                "userId": userId
+            },
+            "query": self.graphql.get("vision_profile"),
+        }
+        return await self.post("", post_data)
+
+    async def get_video_by_creater(self, userId: str, pcursor: str = "") -> Dict:
+        post_data = {
+            "operationName": "visionProfilePhotoList",
+            "variables": {
+                "page": "profile",
+                "pcursor": pcursor,
+                "userId": userId
+            },
+            "query": self.graphql.get("vision_profile_photo_list"),
+        }
+        return await self.post("", post_data)
+
+    async def get_video_all_comments(
+            self,
+            photo_id: str,
+            crawl_interval: float = 1.0,
+            callback: Optional[Callable] = None,
+    ):
         """
         get video all comments include sub comments
         :param photo_id:
         :param crawl_interval:
-        :param is_fetch_sub_comments:
         :param callback:
         :return:
         """
@@ -158,7 +201,107 @@ class KuaiShouClient(AbstractApiClient):
 
             result.extend(comments)
             await asyncio.sleep(crawl_interval)
-            if not is_fetch_sub_comments:
-                continue
-            # todo handle get sub comments
+            sub_comments = await self.get_comments_all_sub_comments(
+                comments, photo_id, crawl_interval, callback
+            )
+            result.extend(sub_comments)
+        return result
+
+    async def get_comments_all_sub_comments(
+            self,
+            comments: List[Dict],
+            photo_id,
+            crawl_interval: float = 1.0,
+            callback: Optional[Callable] = None,
+    ) -> List[Dict]:
+        """
+        获取指定一级评论下的所有二级评论, 该方法会一直查找一级评论下的所有二级评论信息
+        Args:
+            comments: 评论列表
+            photo_id: 视频id
+            crawl_interval: 爬取一次评论的延迟单位(秒)
+            callback: 一次评论爬取结束后
+        Returns:
+
+        """
+        if not config.ENABLE_GET_SUB_COMMENTS:
+            utils.logger.info(
+                f"[KuaiShouClient.get_comments_all_sub_comments] Crawling sub_comment mode is not enabled"
+            )
+            return []
+
+        result = []
+        for comment in comments:
+            sub_comments = comment.get("subComments")
+            if sub_comments and callback:
+                await callback(photo_id, sub_comments)
+
+            sub_comment_pcursor = comment.get("subCommentsPcursor")
+            if sub_comment_pcursor == "no_more":
+                continue
+
+            root_comment_id = comment.get("commentId")
+            sub_comment_pcursor = ""
+
+            while sub_comment_pcursor != "no_more":
+                comments_res = await self.get_video_sub_comments(
+                    photo_id, root_comment_id, sub_comment_pcursor
+                )
+                vision_sub_comment_list = comments_res.get("visionSubCommentList", {})
+                sub_comment_pcursor = vision_sub_comment_list.get("pcursor", "no_more")
+
+                comments = vision_sub_comment_list.get("subComments", {})
+                if callback:
+                    await callback(photo_id, comments)
+                await asyncio.sleep(crawl_interval)
+                result.extend(comments)
+        return result
+
+    async def get_creator_info(self, user_id: str) -> Dict:
+        """
+        eg: https://www.kuaishou.com/profile/3x4jtnbfter525a
+        快手用户主页
+        """
+
+        visionProfile = await self.get_creator_profile(user_id)
+        return visionProfile.get("userProfile")
+
+    async def get_all_videos_by_creator(
+            self,
+            user_id: str,
+            crawl_interval: float = 1.0,
+            callback: Optional[Callable] = None,
+    ) -> List[Dict]:
+        """
+        获取指定用户下的所有发过的帖子,该方法会一直查找一个用户下的所有帖子信息
+        Args:
+            user_id: 用户ID
+            crawl_interval: 爬取一次的延迟单位(秒)
+            callback: 一次分页爬取结束后的更新回调函数
+        Returns:
+
+        """
+        result = []
+        pcursor = ""
+
+        while pcursor != "no_more":
+            videos_res = await self.get_video_by_creater(user_id, pcursor)
+            if not videos_res:
+                utils.logger.error(
+                    f"[KuaiShouClient.get_all_videos_by_creator] The current creator may have been banned by ks, so they cannot access the data."
+                )
+                break
+
+            vision_profile_photo_list = videos_res.get("visionProfilePhotoList", {})
+            pcursor = vision_profile_photo_list.get("pcursor", "")
+
+            videos = vision_profile_photo_list.get("feeds", [])
+            utils.logger.info(
+                f"[KuaiShouClient.get_all_videos_by_creator] got user_id:{user_id} videos len : {len(videos)}"
+            )
+
+            if callback:
+                await callback(videos)
+            await asyncio.sleep(crawl_interval)
+            result.extend(videos)
         return result
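The new creator and sub-comment methods above all share one pagination shape: request with the last `pcursor`, stop when the API answers `"no_more"`, and optionally hand each page to a callback. A self-contained sketch of that loop, where the `fetch` callable and the `pcursor`/`feeds` field names stand in for the client's real GraphQL responses:

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

async def paginate(
    fetch: Callable[[str], Awaitable[Dict]],
    crawl_interval: float = 0.0,
    callback: Optional[Callable[[List[Dict]], Awaitable[None]]] = None,
) -> List[Dict]:
    """Cursor loop used by the new client methods: keep requesting with
    the last `pcursor` until the API returns the sentinel "no_more"."""
    result: List[Dict] = []
    pcursor = ""  # empty cursor means "first page"
    while pcursor != "no_more":
        page = await fetch(pcursor)
        pcursor = page.get("pcursor", "no_more")  # missing cursor ends the loop
        items = page.get("feeds", [])
        if callback:
            await callback(items)          # stream each page out as it arrives
        await asyncio.sleep(crawl_interval)  # be polite between requests
        result.extend(items)
    return result
```

Streaming pages through the callback while also accumulating `result` matches how `get_all_videos_by_creator` both persists each page and returns the full list for comment fetching.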
@@ -65,11 +65,14 @@ class KuaishouCrawler(AbstractCrawler):
 
         crawler_type_var.set(config.CRAWLER_TYPE)
         if config.CRAWLER_TYPE == "search":
-            # Search for notes and retrieve their comment information.
+            # Search for videos and retrieve their comment information.
             await self.search()
         elif config.CRAWLER_TYPE == "detail":
             # Get the information and comments of the specified post
             await self.get_specified_videos()
+        elif config.CRAWLER_TYPE == "creator":
+            # Get creator's information and their videos and comments
+            await self.get_creators_and_videos()
         else:
             pass
 
@@ -89,7 +92,7 @@ class KuaishouCrawler(AbstractCrawler):
                 utils.logger.info(f"[KuaishouCrawler.search] Skip page: {page}")
                 page += 1
                 continue
-
+            utils.logger.info(f"[KuaishouCrawler.search] search kuaishou keyword: {keyword}, page: {page}")
             video_id_list: List[str] = []
             videos_res = await self.ks_client.search_info_by_keyword(
                 keyword=keyword,
@@ -135,7 +138,7 @@ class KuaishouCrawler(AbstractCrawler):
                 utils.logger.error(f"[KuaishouCrawler.get_video_info_task] Get video detail error: {ex}")
                 return None
             except KeyError as ex:
-                utils.logger.error(f"[KuaishouCrawler.get_video_info_task] have not fund note detail video_id:{video_id}, err: {ex}")
+                utils.logger.error(f"[KuaishouCrawler.get_video_info_task] have not fund video detail video_id:{video_id}, err: {ex}")
                 return None
 
     async def batch_get_video_comments(self, video_id_list: List[str]):
@@ -145,7 +148,7 @@ class KuaishouCrawler(AbstractCrawler):
         :return:
         """
         if not config.ENABLE_GET_COMMENTS:
-            utils.logger.info(f"[KuaishouCrawler.batch_get_note_comments] Crawling comment mode is not enabled")
+            utils.logger.info(f"[KuaishouCrawler.batch_get_video_comments] Crawling comment mode is not enabled")
             return
 
         utils.logger.info(f"[KuaishouCrawler.batch_get_video_comments] video ids:{video_id_list}")
@@ -200,10 +203,10 @@ class KuaishouCrawler(AbstractCrawler):
         return playwright_proxy, httpx_proxy
 
     async def create_ks_client(self, httpx_proxy: Optional[str]) -> KuaiShouClient:
-        """Create xhs client"""
+        """Create ks client"""
         utils.logger.info("[KuaishouCrawler.create_ks_client] Begin create kuaishou API client ...")
         cookie_str, cookie_dict = utils.convert_cookies(await self.browser_context.cookies())
-        xhs_client_obj = KuaiShouClient(
+        ks_client_obj = KuaiShouClient(
             proxies=httpx_proxy,
             headers={
                 "User-Agent": self.user_agent,
@@ -215,7 +218,7 @@ class KuaishouCrawler(AbstractCrawler):
             playwright_page=self.context_page,
             cookie_dict=cookie_dict,
         )
-        return xhs_client_obj
+        return ks_client_obj
 
     async def launch_browser(
             self,
|
|||||||
)
|
)
|
||||||
return browser_context
|
return browser_context
|
||||||
|
|
||||||
|
async def get_creators_and_videos(self) -> None:
|
||||||
|
"""Get creator's videos and retrieve their comment information."""
|
||||||
|
utils.logger.info("[KuaiShouCrawler.get_creators_and_videos] Begin get kuaishou creators")
|
||||||
|
for user_id in config.KS_CREATOR_ID_LIST:
|
||||||
|
# get creator detail info from web html content
|
||||||
|
createor_info: Dict = await self.ks_client.get_creator_info(user_id=user_id)
|
||||||
|
if createor_info:
|
||||||
|
await kuaishou_store.save_creator(user_id, creator=createor_info)
|
||||||
|
|
||||||
|
# Get all video information of the creator
|
||||||
|
all_video_list = await self.ks_client.get_all_videos_by_creator(
|
||||||
|
user_id = user_id,
|
||||||
|
crawl_interval = random.random(),
|
||||||
|
callback = self.fetch_creator_video_detail
|
||||||
|
)
|
||||||
|
|
||||||
|
video_ids = [video_item.get("photo", {}).get("id") for video_item in all_video_list]
|
||||||
|
await self.batch_get_video_comments(video_ids)
|
||||||
|
|
||||||
|
async def fetch_creator_video_detail(self, video_list: List[Dict]):
|
||||||
|
"""
|
||||||
|
Concurrently obtain the specified post list and save the data
|
||||||
|
"""
|
||||||
|
semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
|
||||||
|
task_list = [
|
||||||
|
self.get_video_info_task(post_item.get("photo", {}).get("id"), semaphore) for post_item in video_list
|
||||||
|
]
|
||||||
|
|
||||||
|
video_details = await asyncio.gather(*task_list)
|
||||||
|
for video_detail in video_details:
|
||||||
|
if video_detail is not None:
|
||||||
|
await kuaishou_store.update_kuaishou_video(video_detail)
|
||||||
|
|
||||||
async def close(self):
|
async def close(self):
|
||||||
"""Close browser context"""
|
"""Close browser context"""
|
||||||
await self.browser_context.close()
|
await self.browser_context.close()
|
||||||
|
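The `fetch_creator_video_detail` method above uses a common asyncio pattern: fan out one task per item with `asyncio.gather`, but cap in-flight work with a shared `asyncio.Semaphore`. A minimal, self-contained sketch of that pattern — `fetch_detail`, `MAX_CONCURRENCY_NUM`, and the video ids are hypothetical stand-ins for the real client call and config value:

```python
import asyncio
from typing import Dict, List, Optional

MAX_CONCURRENCY_NUM = 4  # assumed stand-in for config.MAX_CONCURRENCY_NUM

async def fetch_detail(video_id: str, semaphore: asyncio.Semaphore) -> Optional[Dict]:
    # At most MAX_CONCURRENCY_NUM coroutines run this body concurrently,
    # even though gather() schedules every task at once.
    async with semaphore:
        await asyncio.sleep(0)  # placeholder for the real network request
        return {"photo": {"id": video_id}}

async def fetch_all(video_ids: List[str]) -> List[Dict]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY_NUM)
    tasks = [fetch_detail(vid, semaphore) for vid in video_ids]
    details = await asyncio.gather(*tasks)
    return [d for d in details if d is not None]  # drop failed fetches

details = asyncio.run(fetch_all(["v1", "v2", "v3"]))
print(len(details))  # 3
```

The semaphore lives in the caller rather than the worker so one limit governs the whole batch, mirroring how the method above passes it into each `get_video_info_task`.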
@ -11,7 +11,7 @@ class KuaiShouGraphQL:
         self.load_graphql_queries()

     def load_graphql_queries(self):
-        graphql_files = ["search_query.graphql", "video_detail.graphql", "comment_list.graphql", "vision_profile.graphql"]
+        graphql_files = ["search_query.graphql", "video_detail.graphql", "comment_list.graphql", "vision_profile.graphql", "vision_profile_photo_list.graphql", "vision_profile_user_list.graphql", "vision_sub_comment_list.graphql"]

         for file in graphql_files:
             with open(self.graphql_dir + file, mode="r") as f:
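The loader above reads each `.graphql` file once at startup and keys the query text by name. A self-contained sketch of that pattern — the helper name and the throwaway directory are illustration only, not the project's exact API:

```python
import os
import tempfile

def load_graphql_queries(graphql_dir: str, graphql_files: list) -> dict:
    # Read each .graphql file once and key its text by the file stem,
    # so callers can look queries up by name instead of re-reading disk.
    queries = {}
    for file in graphql_files:
        with open(os.path.join(graphql_dir, file), mode="r") as f:
            queries[file.removesuffix(".graphql")] = f.read()
    return queries

# Demo with a temporary directory (the real project ships the files).
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "vision_profile.graphql"), "w") as f:
        f.write("query visionProfile($userId: String) { ... }")
    queries = load_graphql_queries(d, ["vision_profile.graphql"])
    print("vision_profile" in queries)  # True
```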
@ -1,16 +1,27 @@
-query visionProfileUserList($pcursor: String, $ftype: Int) {
-    visionProfileUserList(pcursor: $pcursor, ftype: $ftype) {
+query visionProfile($userId: String) {
+    visionProfile(userId: $userId) {
         result
-        fols {
-            user_name
-            headurl
-            user_text
+        hostName
+        userProfile {
+            ownerCount {
+                fan
+                photo
+                follow
+                photo_public
+                __typename
+            }
+            profile {
+                gender
+                user_name
+                user_id
+                headurl
+                user_text
+                user_profile_bg_url
+                __typename
+            }
             isFollowing
-            user_id
             __typename
         }
-        hostName
-        pcursor
         __typename
     }
 }
@ -0,0 +1,110 @@
+fragment photoContent on PhotoEntity {
+    __typename
+    id
+    duration
+    caption
+    originCaption
+    likeCount
+    viewCount
+    commentCount
+    realLikeCount
+    coverUrl
+    photoUrl
+    photoH265Url
+    manifest
+    manifestH265
+    videoResource
+    coverUrls {
+        url
+        __typename
+    }
+    timestamp
+    expTag
+    animatedCoverUrl
+    distance
+    videoRatio
+    liked
+    stereoType
+    profileUserTopPhoto
+    musicBlocked
+    riskTagContent
+    riskTagUrl
+}
+
+fragment recoPhotoFragment on recoPhotoEntity {
+    __typename
+    id
+    duration
+    caption
+    originCaption
+    likeCount
+    viewCount
+    commentCount
+    realLikeCount
+    coverUrl
+    photoUrl
+    photoH265Url
+    manifest
+    manifestH265
+    videoResource
+    coverUrls {
+        url
+        __typename
+    }
+    timestamp
+    expTag
+    animatedCoverUrl
+    distance
+    videoRatio
+    liked
+    stereoType
+    profileUserTopPhoto
+    musicBlocked
+    riskTagContent
+    riskTagUrl
+}
+
+fragment feedContent on Feed {
+    type
+    author {
+        id
+        name
+        headerUrl
+        following
+        headerUrls {
+            url
+            __typename
+        }
+        __typename
+    }
+    photo {
+        ...photoContent
+        ...recoPhotoFragment
+        __typename
+    }
+    canAddComment
+    llsid
+    status
+    currentPcursor
+    tags {
+        type
+        name
+        __typename
+    }
+    __typename
+}
+
+query visionProfilePhotoList($pcursor: String, $userId: String, $page: String, $webPageArea: String) {
+    visionProfilePhotoList(pcursor: $pcursor, userId: $userId, page: $page, webPageArea: $webPageArea) {
+        result
+        llsid
+        webPageArea
+        feeds {
+            ...feedContent
+            __typename
+        }
+        hostName
+        pcursor
+        __typename
+    }
+}
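`visionProfilePhotoList` returns a page of `feeds` plus a `pcursor` to request the next page, which is how `get_all_videos_by_creator` can walk a whole profile. A sketch of that cursor loop over invented in-memory pages — the `"no_more"` sentinel mirrors how these loops are typically written, but the page contents here are assumptions:

```python
from typing import Dict, List

# Fake paged responses keyed by cursor; the shape loosely mirrors the
# visionProfilePhotoList result above (feeds + pcursor).
PAGES: Dict[str, Dict] = {
    "": {"feeds": [{"photo": {"id": "a"}}, {"photo": {"id": "b"}}], "pcursor": "cur1"},
    "cur1": {"feeds": [{"photo": {"id": "c"}}], "pcursor": "no_more"},
}

def get_all_photos() -> List[Dict]:
    photos: List[Dict] = []
    pcursor = ""
    while pcursor != "no_more":
        resp = PAGES[pcursor]      # placeholder for the real GraphQL call
        photos.extend(resp["feeds"])
        pcursor = resp["pcursor"]  # server hands back the next cursor
    return photos

print([p["photo"]["id"] for p in get_all_photos()])  # ['a', 'b', 'c']
```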
@ -0,0 +1,16 @@
+query visionProfileUserList($pcursor: String, $ftype: Int) {
+    visionProfileUserList(pcursor: $pcursor, ftype: $ftype) {
+        result
+        fols {
+            user_name
+            headurl
+            user_text
+            isFollowing
+            user_id
+            __typename
+        }
+        hostName
+        pcursor
+        __typename
+    }
+}
@ -0,0 +1,22 @@
+mutation visionSubCommentList($photoId: String, $rootCommentId: String, $pcursor: String) {
+    visionSubCommentList(photoId: $photoId, rootCommentId: $rootCommentId, pcursor: $pcursor) {
+        pcursor
+        subComments {
+            commentId
+            authorId
+            authorName
+            content
+            headurl
+            timestamp
+            likedCount
+            realLikedCount
+            liked
+            status
+            authorLiked
+            replyToUserName
+            replyTo
+            __typename
+        }
+        __typename
+    }
+}
```
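On the wire, each of these `.graphql` files becomes an ordinary JSON POST body with `operationName`, `variables`, and the query text. A payload-building sketch — the id values and the truncated query string are placeholders, not the real API contract:

```python
import json

def build_graphql_payload(operation_name: str, query_text: str, variables: dict) -> str:
    # A GraphQL HTTP request body is just these three keys; query_text is
    # the contents of one of the .graphql files loaded at startup.
    return json.dumps({
        "operationName": operation_name,
        "variables": variables,
        "query": query_text,
    })

payload = build_graphql_payload(
    "visionSubCommentList",
    "mutation visionSubCommentList($photoId: String, ...) { ... }",  # truncated for the sketch
    {"photoId": "3x0001", "rootCommentId": "100", "pcursor": ""},
)
print(json.loads(payload)["operationName"])  # visionSubCommentList
```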
@ -7,6 +7,7 @@ from playwright.async_api import BrowserContext, Page
 from tenacity import (RetryError, retry, retry_if_result, stop_after_attempt,
                       wait_fixed)

+import config
 from base.base_crawler import AbstractLogin
 from tools import utils

@ -57,7 +58,7 @@ class KuaishouLogin(AbstractLogin):

         # click login button
         login_button_ele = self.context_page.locator(
-            "xpath=//p[text()=' 登录 ']"
+            "xpath=//p[text()='登录']"
         )
         await login_button_ele.click()
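The login-button fix above is a whitespace issue: an XPath text predicate compares the node's text exactly, so `' 登录 '` with padding spaces matches nothing once the page renders `登录` unpadded. The same behavior can be shown with the stdlib `ElementTree` XPath subset on a tiny hypothetical document:

```python
import xml.etree.ElementTree as ET

# XPath text predicates match the full string exactly, whitespace included,
# which is why the locator above had to drop the spaces around 登录.
html = "<div><p>登录</p></div>"
root = ET.fromstring(html)

print(len(root.findall(".//p[.=' 登录 ']")))  # 0 -- padded text matches nothing
print(len(root.findall(".//p[.='登录']")))    # 1 -- exact text matches
```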
@ -108,7 +108,7 @@ class WeiboCrawler(AbstractCrawler):
                 utils.logger.info(f"[WeiboCrawler.search] Skip page: {page}")
                 page += 1
                 continue
+            utils.logger.info(f"[WeiboCrawler.search] search weibo keyword: {keyword}, page: {page}")
             search_res = await self.wb_client.get_note_by_keyword(
                 keyword=keyword,
                 page=page,
@ -12,6 +12,7 @@ from playwright.async_api import BrowserContext, Page
 from tenacity import (RetryError, retry, retry_if_result, stop_after_attempt,
                       wait_fixed)

+import config
 from base.base_crawler import AbstractLogin
 from tools import utils

@ -102,6 +102,7 @@ class XiaoHongShuCrawler(AbstractCrawler):
                     continue

                 try:
+                    utils.logger.info(f"[XiaoHongShuCrawler.search] search xhs keyword: {keyword}, page: {page}")
                     note_id_list: List[str] = []
                     notes_res = await self.xhs_client.get_note_by_keyword(
                         keyword=keyword,
Binary file not shown. Before: 169 KiB
static/images/11群二维码.JPG (new binary file)
Binary file not shown. After: 171 KiB
Binary file not shown. Before: 175 KiB
Binary file not shown. Before: 242 KiB, After: 241 KiB
Binary file not shown. Before: 115 KiB
@ -13,9 +13,9 @@ import aiofiles

 import config
 from base.base_crawler import AbstractStore
-from tools import utils
+from tools import utils, words
 from var import crawler_type_var
-from tools import words


 def calculate_number_of_files(file_store_path: str) -> int:
     """计算数据保存文件的前部分排序数字,支持每次运行代码不写到同一个文件中
@ -11,10 +11,10 @@ from typing import Dict

 import aiofiles

-from base.base_crawler import AbstractStore
-from tools import utils,words
-from var import crawler_type_var
 import config
+from base.base_crawler import AbstractStore
+from tools import utils, words
+from var import crawler_type_var


 def calculate_number_of_files(file_store_path: str) -> int:
@ -76,3 +76,22 @@ async def update_ks_video_comment(video_id: str, comment_item: Dict):
     utils.logger.info(
         f"[store.kuaishou.update_ks_video_comment] Kuaishou video comment: {comment_id}, content: {save_comment_item.get('content')}")
     await KuaishouStoreFactory.create_store().store_comment(comment_item=save_comment_item)
+
+
+async def save_creator(user_id: str, creator: Dict):
+    ownerCount = creator.get('ownerCount', {})
+    profile = creator.get('profile', {})
+
+    local_db_item = {
+        'user_id': user_id,
+        'nickname': profile.get('user_name'),
+        'gender': '女' if profile.get('gender') == "F" else '男',
+        'avatar': profile.get('headurl'),
+        'desc': profile.get('user_text'),
+        'ip_location': "",
+        'follows': ownerCount.get("follow"),
+        'fans': ownerCount.get("fan"),
+        'interaction': ownerCount.get("photo_public"),
+        "last_modify_ts": utils.get_current_timestamp(),
+    }
+    utils.logger.info(f"[store.kuaishou.save_creator] creator:{local_db_item}")
+    await KuaishouStoreFactory.create_store().store_creator(local_db_item)
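The `ownerCount`/`profile` flattening in `save_creator` above can be factored into a pure function so the visionProfile-to-row mapping is testable without the store. This sketch keeps the same dict keys as the query result; the sample values are invented:

```python
from typing import Dict

def flatten_creator(user_id: str, creator: Dict) -> Dict:
    # Pure version of the mapping in save_creator above: pick apart the
    # nested ownerCount/profile objects into one flat record.
    owner_count = creator.get("ownerCount", {})
    profile = creator.get("profile", {})
    return {
        "user_id": user_id,
        "nickname": profile.get("user_name"),
        "gender": "女" if profile.get("gender") == "F" else "男",
        "avatar": profile.get("headurl"),
        "desc": profile.get("user_text"),
        "follows": owner_count.get("follow"),
        "fans": owner_count.get("fan"),
        "interaction": owner_count.get("photo_public"),
    }

# Sample shaped like the visionProfile query result above (invented values).
sample = {
    "ownerCount": {"fan": 1200, "follow": 30, "photo_public": 88},
    "profile": {"user_name": "demo", "gender": "F", "headurl": "", "user_text": "hi"},
}
row = flatten_creator("3xabc", sample)
print(row["fans"], row["gender"])  # 1200 女
```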
@ -11,10 +11,11 @@ from typing import Dict

 import aiofiles

-from base.base_crawler import AbstractStore
-from tools import utils,words
-from var import crawler_type_var
 import config
+from base.base_crawler import AbstractStore
+from tools import utils, words
+from var import crawler_type_var


 def calculate_number_of_files(file_store_path: str) -> int:
     """计算数据保存文件的前部分排序数字,支持每次运行代码不写到同一个文件中
@ -205,3 +206,14 @@ class KuaishouJsonStoreImplement(AbstractStore):

         """
         await self.save_data_to_json(comment_item, "comments")
+
+    async def store_creator(self, creator: Dict):
+        """
+        Kuaishou creator JSON storage implementation
+        Args:
+            creator: creator dict
+
+        Returns:
+
+        """
+        await self.save_data_to_json(creator, "creator")
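`store_creator` above delegates to `save_data_to_json`, which appends each record to a JSON array per store type. A stdlib-only model of that pattern — the real implementation uses `aiofiles` for non-blocking writes and numbered file names, so treat this class as a simplified sketch, not the project's code:

```python
import asyncio
import json
import os
import tempfile

class JsonStoreSketch:
    # Minimal stand-in for the JSON store: one array file per store_type,
    # re-read and re-written on every save (blocking I/O, unlike aiofiles).
    def __init__(self, save_dir: str):
        self.save_dir = save_dir

    async def save_data_to_json(self, item: dict, store_type: str):
        path = os.path.join(self.save_dir, f"{store_type}.json")
        data = []
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                data = json.load(f)
        data.append(item)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False)

    async def store_creator(self, creator: dict):
        await self.save_data_to_json(creator, "creator")

with tempfile.TemporaryDirectory() as d:
    store = JsonStoreSketch(d)
    asyncio.run(store.store_creator({"user_id": "3xabc"}))
    with open(os.path.join(d, "creator.json"), encoding="utf-8") as f:
        saved = json.load(f)
print(saved)  # [{'user_id': '3xabc'}]
```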
@ -11,10 +11,11 @@ from typing import Dict

 import aiofiles

-from base.base_crawler import AbstractStore
-from tools import utils,words
-from var import crawler_type_var
 import config
+from base.base_crawler import AbstractStore
+from tools import utils, words
+from var import crawler_type_var


 def calculate_number_of_files(file_store_path: str) -> int:
     """计算数据保存文件的前部分排序数字,支持每次运行代码不写到同一个文件中
@ -113,7 +113,7 @@ async def save_creator(user_id: str, creator: Dict):
         'gender': '女' if user_info.get('gender') == 1 else '男',
         'avatar': user_info.get('images'),
         'desc': user_info.get('desc'),
-        'ip_location': user_info.get('ip_location'),
+        'ip_location': user_info.get('ipLocation'),
         'follows': follows,
         'fans': fans,
         'interaction': interaction,
@ -11,10 +11,11 @@ from typing import Dict

 import aiofiles

-from base.base_crawler import AbstractStore
-from tools import utils,words
-from var import crawler_type_var
 import config
+from base.base_crawler import AbstractStore
+from tools import utils, words
+from var import crawler_type_var


 def calculate_number_of_files(file_store_path: str) -> int:
     """计算数据保存文件的前部分排序数字,支持每次运行代码不写到同一个文件中
|
|||||||
import aiofiles
|
|
||||||
import asyncio
|
import asyncio
|
||||||
import jieba
|
|
||||||
from collections import Counter
|
|
||||||
from wordcloud import WordCloud
|
|
||||||
import json
|
import json
|
||||||
|
from collections import Counter
|
||||||
|
|
||||||
|
import aiofiles
|
||||||
|
import jieba
|
||||||
import matplotlib.pyplot as plt
|
import matplotlib.pyplot as plt
|
||||||
|
from wordcloud import WordCloud
|
||||||
|
|
||||||
import config
|
import config
|
||||||
from tools import utils
|
from tools import utils
|
||||||
|
|
||||||
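The reordered imports above belong to the word-cloud helper, whose core is: segment the comment text, drop stopwords, count frequencies, and hand the counts to `WordCloud`. A stdlib sketch of the counting stage — `split()` stands in for `jieba` segmentation, and the stopword set is invented:

```python
from collections import Counter

# Tokenize -> filter stopwords -> count; the real pipeline feeds these
# counts to WordCloud.generate_from_frequencies.
STOP_WORDS = {"the", "a", "is"}  # invented; the project loads its own list

def word_frequencies(texts: list) -> Counter:
    counter = Counter()
    for text in texts:
        counter.update(w for w in text.lower().split() if w not in STOP_WORDS)
    return counter

freq = word_frequencies(["the crawler is fast", "fast crawler"])
print(freq.most_common(2))  # [('crawler', 2), ('fast', 2)]
```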