mirror of https://github.com/fish2018/pansou.git
synced 2026-05-06 21:51:31 +08:00

Commit: Add plugin kkmao (新增插件kkmao)

plugin/kkmao/html结构分析.md (new file, 159 lines)
# kkmao (Kuakemao) HTML Structure Analysis

## Site Information

- **Site name**: Kuakemao Resources (夸克猫资源)
- **Domain**: `www.kuakemao.com`
- **Type**: Quark-drive movie/TV resource sharing site (WordPress theme site)
- **Characteristics**: each article provides 1..N Quark drive links; the article body is highly uniform and contains Quark links only

## Search Page Structure

### 1. Search Entry

```
https://www.kuakemao.com/?s={keyword}

Example:
https://www.kuakemao.com/?s=物
```

- Either raw UTF-8 Chinese or a URL-encoded keyword works
- The page is a standard WordPress search results page

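The request above can be sketched in Go; `buildSearchURL` is an illustrative helper name, not part of the site or any library:

```go
package main

import (
	"fmt"
	"net/url"
)

// buildSearchURL builds the WordPress search URL described above.
// The query parameter name "s" comes from the site analysis; the
// keyword is URL-encoded here, although the site also accepts raw UTF-8.
func buildSearchURL(keyword string) string {
	return fmt.Sprintf("https://www.kuakemao.com/?s=%s", url.QueryEscape(keyword))
}

func main() {
	fmt.Println(buildSearchURL("物"))
}
```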
### 2. Result Container

- **Parent container**: `section.container > div.content-wrap > div.content`
- **Result item**: `article.excerpt` (carries ordinal class names such as `excerpt-1/2`)

### 3. Single Result Structure

#### Cover / Detail Link

```html
<a class="focus" href="https://www.kuakemao.com/653.html">
  <img data-src="https://img.kuakemao.com/.../c4ac4195bed96c7-220x150.webp" class="thumb">
</a>
```

- `href` is the detail page URL, of the form `/{number}.html`

#### Title

```html
<header>
  <h2>
    <a href="https://www.kuakemao.com/653.html"
       title="某种物质 (2024) 夸克网盘 法国 恐怖 4K 豆瓣7.5 - 夸克猫资源">
      某种物质 (2024) 夸克网盘 法国 恐怖 4K 豆瓣7.5
    </a>
  </h2>
</header>
```

- Fields to extract:
  - **Title**: text of `h2 > a`
  - **Detail page URL**: `href` of `h2 > a`

#### Summary

```html
<p class="note">
  某种物质 夸克网盘资源 https://pan.quark.cn/s/631243a6189a ...
</p>
```

- Used to populate `SearchResult.Content`
- The text occasionally contains bare Quark links, but the detail page must still be visited for the canonical link

#### Metadata

```html
<div class="meta">
  <time>2025-11-26</time>
  <a class="cat" href="https://www.kuakemao.com/dy">电影</a>
  <span class="pv">阅读(...)</span>
</div>
```

- **Publish time**: `<time>` text (`YYYY-MM-DD`)
- **Category tag**: text of `.meta a.cat`

## Detail Page Structure

### 1. URL Pattern

```
https://www.kuakemao.com/{articleID}.html
Example: https://www.kuakemao.com/653.html
```

- The article ID can be extracted from `/{id}.html` and used as the unique ID

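The ID extraction can be sketched with a regular expression over the URL pattern above:

```go
package main

import (
	"fmt"
	"regexp"
)

// articleIDRegex captures the numeric article ID from a /{id}.html URL.
var articleIDRegex = regexp.MustCompile(`/(\d+)\.html`)

// extractArticleID returns the ID, or "" when the URL does not match.
func extractArticleID(detailURL string) string {
	if m := articleIDRegex.FindStringSubmatch(detailURL); len(m) >= 2 {
		return m[1]
	}
	return ""
}

func main() {
	fmt.Println(extractArticleID("https://www.kuakemao.com/653.html"))
}
```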
### 2. Main Nodes

- **Title**: `.article-title`
- **Meta info**: `.article-meta .item` (date, category, view count, etc.)
- **Body container**: `.article-content`

### 3. Quark Link Location

```html
<div class="article-content">
  <h2>某种物质 夸克网盘资源</h2>
  <p>
    <a rel="nofollow" href="https://pan.quark.cn/s/631243a6189a" target="_blank">
      https://pan.quark.cn/s/631243a6189a
    </a>
  </p>
  ...
</div>
```

- All download links live inside `.article-content`
- Only the Quark domain (`pan.quark.cn`) appears
- The access code, when present, usually follows the link in the same paragraph; parse for the keywords `提取码/密码/pwd/code`

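The keyword-based access-code matching can be sketched as below; the patterns mirror the keywords listed above, and the helper names are illustrative:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// pwdPatterns covers the access-code keywords observed on the site:
// 提取码 / 密码 / pwd / code, each followed by an alphanumeric code.
var pwdPatterns = []*regexp.Regexp{
	regexp.MustCompile(`提取码[::]?\s*([0-9A-Za-z]+)`),
	regexp.MustCompile(`密码[::]?\s*([0-9A-Za-z]+)`),
	regexp.MustCompile(`pwd\s*[=::]\s*([0-9A-Za-z]+)`),
	regexp.MustCompile(`code\s*[=::]\s*([0-9A-Za-z]+)`),
}

// matchPassword returns the first access code found in text, or "".
func matchPassword(text string) string {
	text = strings.TrimSpace(text)
	for _, p := range pwdPatterns {
		if m := p.FindStringSubmatch(text); len(m) >= 2 {
			return strings.TrimSpace(m[1])
		}
	}
	return ""
}

func main() {
	fmt.Println(matchPassword("链接 https://pan.quark.cn/s/631243a6189a 提取码: abcd"))
}
```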
## CSS Selector Cheat Sheet

| Item | Selector / Rule | Notes |
|------|-----------------|-------|
| Result list | `article.excerpt` | iterate over search results |
| Title | `article.excerpt h2 a` | text & `href` |
| Summary | `article.excerpt p.note` | descriptive text |
| Category | `article.excerpt .meta a.cat` | 0 or 1 occurrences |
| Publish time | `article.excerpt .meta time` | `YYYY-MM-DD` |
| Detail body | `.article-content` | contains all download info |
| Quark link | `.article-content a[href*="pan.quark.cn"]` | `href` is the download URL |
| Access code | link text / parent-node text | keywords: `提取码/密码/pwd/code` |

## Implementation Notes

1. **Request strategy**
   - Search page: `GET https://www.kuakemao.com/?s={keyword}`
   - Set a regular browser UA and Referer; add retries where necessary
2. **List parsing**
   - Iterate over `article.excerpt`; extract title, summary, category, and time
   - Extract `articleID` from the detail URL as the unique suffix
3. **Detail page fetching**
   - Inside `.article-content`, collect `a[href*="pan.quark.cn"]`
   - One article may provide multiple Quark links; return them all
   - Match the access code from parent/sibling node text
4. **Link filtering**
   - The site only offers Quark drive links; ignore all other domains
5. **Result construction**
   - `UniqueID = kkmao-{articleID}`
   - Leave `Channel` empty
   - `Datetime` uses `<time>` from the search results page (layout `2006-01-02`)
   - `Links` contains only entries with `Type="quark"`

## Example Flow

```
Keyword: 物
  ↓
Search page: https://www.kuakemao.com/?s=物
  - Parse article.excerpt
  - Obtain the title "某种物质 (2024)..." and detail link https://www.kuakemao.com/653.html
  ↓
Detail page: https://www.kuakemao.com/653.html
  - Find <a href="https://pan.quark.cn/s/631243a6189a"> inside .article-content
  ↓
Result:
  UniqueID: kkmao-653
  Title: 某种物质 (2024) 夸克网盘 法国 恐怖 4K 豆瓣7.5
  Content: summary from the search results page
  Links: [{Type:"quark", URL:"https://pan.quark.cn/s/631243a6189a", Password:""}]
  Tags: ["电影"]
  Datetime: 2025-11-26
```

## Caveats

1. `<time>` on the search page may be missing; fall back to the current time
2. Bare links in `.note` can be ignored; treat the detail page as authoritative
3. The pages load quickly, but a 10-12 second timeout with 2-3 retries is still recommended
4. The site only hosts Quark drive links, so the plugin can simply filter out other domains
5. Article bodies contain many `<h2>` and `<pre>` elements; when parsing access codes, walk parent-node text to avoid misses

plugin/kkmao/kkmao.go (new file, 397 lines)

package kkmao

import (
	"context"
	"fmt"
	"net/http"
	"net/url"
	"regexp"
	"strings"
	"sync"
	"time"

	"github.com/PuerkitoBio/goquery"

	"pansou/model"
	"pansou/plugin"
)

var (
	articleIDRegex = regexp.MustCompile(`/(\d+)\.html`)
	quarkRegex     = regexp.MustCompile(`https?://pan\.quark\.cn/s/[0-9A-Za-z]+`)
	pwdPatterns    = []*regexp.Regexp{
		regexp.MustCompile(`提取码[::]?\s*([0-9A-Za-z]+)`),
		regexp.MustCompile(`密码[::]?\s*([0-9A-Za-z]+)`),
		regexp.MustCompile(`pwd\s*[=::]\s*([0-9A-Za-z]+)`),
		regexp.MustCompile(`code\s*[=::]\s*([0-9A-Za-z]+)`),
	}
	detailCache = sync.Map{}

	cacheTTL             = 1 * time.Hour
	cacheCleanupInterval = 30 * time.Minute
)

type detailCacheEntry struct {
	links     []model.Link
	expiresAt time.Time
}

const (
	pluginName      = "kkmao"
	defaultPriority = 2
	searchTimeout   = 12 * time.Second
	detailTimeout   = 10 * time.Second
	maxConcurrency  = 8
	maxIdleConns    = 64
	maxIdlePerHost  = 8
	maxConnsPerHost = 32

	idleConnLifetime      = 90 * time.Second
	tlsHandshakeTimeout   = 10 * time.Second
	expectContinueTimeout = 1 * time.Second

	searchMaxRetries = 3
	detailMaxRetries = 2
	retryBaseDelay   = 200 * time.Millisecond
)

// KkMaoPlugin is the plugin for the Kuakemao (夸克猫) resource site.
type KkMaoPlugin struct {
	*plugin.BaseAsyncPlugin
	client *http.Client
}

func init() {
	plugin.RegisterGlobalPlugin(NewKkMaoPlugin())
	go startDetailCacheCleaner()
}

// NewKkMaoPlugin constructs the plugin with its dedicated HTTP client.
func NewKkMaoPlugin() *KkMaoPlugin {
	return &KkMaoPlugin{
		BaseAsyncPlugin: plugin.NewBaseAsyncPlugin(pluginName, defaultPriority),
		client:          newHTTPClient(),
	}
}

// Search is a compatibility wrapper around SearchWithResult.
func (p *KkMaoPlugin) Search(keyword string, ext map[string]interface{}) ([]model.SearchResult, error) {
	result, err := p.SearchWithResult(keyword, ext)
	if err != nil {
		return nil, err
	}
	return result.Results, nil
}

// SearchWithResult is the main search entry point.
func (p *KkMaoPlugin) SearchWithResult(keyword string, ext map[string]interface{}) (model.PluginSearchResult, error) {
	return p.AsyncSearchWithResult(keyword, p.searchImpl, p.MainCacheKey, ext)
}

func newHTTPClient() *http.Client {
	transport := &http.Transport{
		MaxIdleConns:          maxIdleConns,
		MaxIdleConnsPerHost:   maxIdlePerHost,
		MaxConnsPerHost:       maxConnsPerHost,
		IdleConnTimeout:       idleConnLifetime,
		TLSHandshakeTimeout:   tlsHandshakeTimeout,
		ExpectContinueTimeout: expectContinueTimeout,
		ForceAttemptHTTP2:     true,
	}
	return &http.Client{
		Transport: transport,
		Timeout:   searchTimeout,
	}
}

func (p *KkMaoPlugin) searchImpl(client *http.Client, keyword string, ext map[string]interface{}) ([]model.SearchResult, error) {
	if p.client != nil {
		client = p.client
	}

	searchURL := fmt.Sprintf("https://www.kuakemao.com/?s=%s", url.QueryEscape(keyword))
	ctx, cancel := context.WithTimeout(context.Background(), searchTimeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, searchURL, nil)
	if err != nil {
		return nil, fmt.Errorf("[%s] failed to create request: %w", p.Name(), err)
	}

	setCommonHeaders(req, "https://www.kuakemao.com/")

	resp, err := p.doRequestWithRetry(req, client, searchMaxRetries, retryBaseDelay)
	if err != nil {
		return nil, fmt.Errorf("[%s] search request failed: %w", p.Name(), err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("[%s] search returned status code: %d", p.Name(), resp.StatusCode)
	}

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("[%s] failed to parse search page: %w", p.Name(), err)
	}

	var (
		results []model.SearchResult
		wg      sync.WaitGroup
		mu      sync.Mutex
		sem     = make(chan struct{}, maxConcurrency)
	)

	doc.Find("article.excerpt").Each(func(_ int, item *goquery.Selection) {
		titleSel := item.Find("header h2 a")
		title := strings.TrimSpace(titleSel.Text())
		detailURL, ok := titleSel.Attr("href")
		if !ok || title == "" || detailURL == "" {
			return
		}

		articleID := extractArticleID(detailURL)
		if articleID == "" {
			return
		}

		summary := strings.TrimSpace(item.Find("p.note").Text())

		var tags []string
		category := strings.TrimSpace(item.Find(".meta a.cat").First().Text())
		if category != "" {
			tags = append(tags, category)
		}

		rawTime := strings.TrimSpace(item.Find(".meta time").Text())
		publishTime := parsePublishTime(rawTime)

		wg.Add(1)
		sem <- struct{}{}
		go func(title, detailURL, articleID, summary string, tags []string, publishTime time.Time) {
			defer wg.Done()
			defer func() { <-sem }()

			links := p.fetchDetailLinks(client, detailURL, articleID)
			if len(links) == 0 {
				return
			}

			result := model.SearchResult{
				UniqueID: fmt.Sprintf("%s-%s", p.Name(), articleID),
				Title:    title,
				Content:  summary,
				Links:    links,
				Tags:     tags,
				Channel:  "",
				Datetime: publishTime,
			}

			mu.Lock()
			results = append(results, result)
			mu.Unlock()
		}(title, detailURL, articleID, summary, tags, publishTime)
	})

	wg.Wait()

	return plugin.FilterResultsByKeyword(results, keyword), nil
}

func extractArticleID(detailURL string) string {
	matches := articleIDRegex.FindStringSubmatch(detailURL)
	if len(matches) >= 2 {
		return matches[1]
	}
	return ""
}

func parsePublishTime(value string) time.Time {
	value = strings.TrimSpace(value)
	if value == "" {
		return time.Now()
	}

	layouts := []string{
		"2006-01-02",
		"2006-01-02 15:04:05",
		time.RFC3339,
	}

	for _, layout := range layouts {
		if t, err := time.Parse(layout, value); err == nil {
			return t
		}
	}

	return time.Now()
}

func (p *KkMaoPlugin) fetchDetailLinks(client *http.Client, detailURL, articleID string) []model.Link {
	if cached, ok := detailCache.Load(articleID); ok {
		if entry, valid := cached.(detailCacheEntry); valid {
			if time.Now().Before(entry.expiresAt) && len(entry.links) > 0 {
				return entry.links
			}
			detailCache.Delete(articleID)
		}
	}

	ctx, cancel := context.WithTimeout(context.Background(), detailTimeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, detailURL, nil)
	if err != nil {
		return nil
	}
	setCommonHeaders(req, detailURL)

	resp, err := p.doRequestWithRetry(req, client, detailMaxRetries, retryBaseDelay)
	if err != nil {
		return nil
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil
	}

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return nil
	}

	links := extractQuarkLinks(doc)
	if len(links) > 0 {
		detailCache.Store(articleID, detailCacheEntry{
			links:     links,
			expiresAt: time.Now().Add(cacheTTL),
		})
	}
	return links
}

func extractQuarkLinks(doc *goquery.Document) []model.Link {
	var (
		results []model.Link
		seen    = make(map[string]struct{})
	)

	doc.Find(".article-content a[href]").Each(func(_ int, link *goquery.Selection) {
		href, _ := link.Attr("href")
		href = strings.TrimSpace(href)
		if href == "" {
			return
		}

		loc := quarkRegex.FindString(href)
		if loc == "" {
			return
		}

		if _, exists := seen[loc]; exists {
			return
		}

		password := extractPassword(link)

		results = append(results, model.Link{
			Type:     "quark",
			URL:      loc,
			Password: password,
		})
		seen[loc] = struct{}{}
	})

	return results
}

func extractPassword(link *goquery.Selection) string {
	if pwd := matchPassword(link.Text()); pwd != "" {
		return pwd
	}

	if title, ok := link.Attr("title"); ok {
		if pwd := matchPassword(title); pwd != "" {
			return pwd
		}
	}

	if parent := link.Parent(); parent != nil && parent.Length() > 0 {
		if pwd := matchPassword(parent.Text()); pwd != "" {
			return pwd
		}
		if next := parent.Next(); next.Length() > 0 {
			if pwd := matchPassword(next.Text()); pwd != "" {
				return pwd
			}
		}
	}

	if sibling := link.Next(); sibling.Length() > 0 {
		if pwd := matchPassword(sibling.Text()); pwd != "" {
			return pwd
		}
	}

	return ""
}

func matchPassword(text string) string {
	text = strings.TrimSpace(text)
	if text == "" {
		return ""
	}

	for _, pattern := range pwdPatterns {
		if matches := pattern.FindStringSubmatch(text); len(matches) >= 2 {
			return strings.TrimSpace(matches[1])
		}
	}

	return ""
}

func setCommonHeaders(req *http.Request, referer string) {
	req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
	req.Header.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
	req.Header.Set("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8")
	req.Header.Set("Connection", "keep-alive")
	req.Header.Set("Referer", referer)
}

func (p *KkMaoPlugin) doRequestWithRetry(req *http.Request, client *http.Client, maxRetries int, baseDelay time.Duration) (*http.Response, error) {
	var lastErr error

	for attempt := 0; attempt < maxRetries; attempt++ {
		resp, err := client.Do(req.Clone(req.Context()))
		if err == nil && resp.StatusCode == http.StatusOK {
			return resp, nil
		}
		if resp != nil {
			resp.Body.Close()
		}
		if err != nil {
			lastErr = err
		} else {
			// Record a meaningful error for non-200 responses; otherwise
			// the final error would wrap a nil error.
			lastErr = fmt.Errorf("unexpected status code: %d", resp.StatusCode)
		}
		if attempt < maxRetries-1 {
			backoff := baseDelay * time.Duration(1<<attempt)
			time.Sleep(backoff)
		}
	}

	return nil, fmt.Errorf("failed after %d retries: %w", maxRetries, lastErr)
}

func startDetailCacheCleaner() {
	ticker := time.NewTicker(cacheCleanupInterval)
	defer ticker.Stop()

	for range ticker.C {
		now := time.Now()
		detailCache.Range(func(key, value interface{}) bool {
			entry, ok := value.(detailCacheEntry)
			if !ok || now.After(entry.expiresAt) {
				detailCache.Delete(key)
			}
			return true
		})
	}
}