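Update: the crawler is posted. A test run over the first 100 list pages collected 1000 resources; the results are in the attachment in this thread: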
https://summer-plus.net/u.php?action-topic-uid-1156494.html
https://www.flhk.xyz/
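I stumbled on the fact that this freebie site's download links are present in the HTML source; the page just doesn't display them. Press Ctrl + U to open the page source, and you can see the download link and the unzip password inside a <meta> tag, on a line like this:

```html
<meta name="description" content="下载地址: https://pan.baidu.com/s/1gmKSva8pgMnwrr6vlD6_gw 提取码:nj26 解压密码:4956(下载完后缀名改成zip)">
```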
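If you want the three pieces separately rather than the raw description string, something like the sketch below might work. It assumes every post uses the same Chinese labels as the sample line above (下载地址, 提取码, 解压密码), which I have only seen on that one example; `parse_description` and the dictionary keys are names I made up for illustration.

```python
import re
from typing import Dict, Optional

# Assumption: the labels below match the sample <meta> content exactly.
# Both ASCII and full-width colons/parentheses are accepted.
_PATTERNS = {
    "url": re.compile(r"下载地址[::]\s*(https?://\S+)"),
    "extract_code": re.compile(r"提取码[::]\s*([0-9A-Za-z]+)"),
    "unzip_password": re.compile(r"解压密码[::]\s*([^\s((]+)"),
}


def parse_description(content: str) -> Dict[str, Optional[str]]:
    """Split a description string into its fields; missing fields are None."""
    result: Dict[str, Optional[str]] = {}
    for key, pattern in _PATTERNS.items():
        match = pattern.search(content)
        result[key] = match.group(1) if match else None
    return result


sample = ("下载地址: https://pan.baidu.com/s/1gmKSva8pgMnwrr6vlD6_gw "
          "提取码:nj26 解压密码:4956(下载完后缀名改成zip)")
print(parse_description(sample))
```

Posts whose description is worded differently will simply come back with `None` in the affected fields, so it is worth keeping the raw string as well.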
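The site has quite a lot of resources. Anyone who wants to could write a simple crawler and pull down everything on it. I don't know how long this hole will stay open, since it is a pretty basic one; my guess is the site owner isn't very technical and set the site up with a one-click WordPress install.
Grab it while you can, everyone.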
Update: here is the crawler. Anyone interested can try crawling the resources; a test crawl of all resources on 5 pages took 14 seconds.

```python
import asyncio
import time

import aiohttp
import tqdm
from lxml import etree

base_url = 'https://www.flhk.xyz/page/{}'


async def get_dir_page(page, session):
    """Fetch one list page; return its HTML, or None if the request fails."""
    try:
        async with session.get(url=base_url.format(page)) as resp:
            return await resp.text(encoding='utf-8')
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return None


async def get_link_passwd(href, title, session):
    """Fetch one post; return (title, href, description) or None."""
    try:
        async with session.get(href) as resp:
            text = await resp.text(encoding='utf-8')
    except (aiohttp.ClientError, asyncio.TimeoutError):
        print('Request failed for {} {}'.format(title, href))
        return None
    html = etree.HTML(text)
    # The download link and passwords sit in <meta name="description">.
    meta_descrp = html.xpath('//meta[@name="description"]/@content') \
        if html is not None else []
    if meta_descrp:
        link_and_passwd = meta_descrp[0]
        print('Get link and passwd:\n{} \n {} {}'.format(
            link_and_passwd, title, href))
        return title, href, link_and_passwd
    print('No download link available for {} {}'.format(title, href))
    return None


async def Main():
    start = time.time()
    async with aiohttp.ClientSession() as session:
        # range(1, 6) covers list pages 1 to 5.
        tasks = [get_dir_page(page, session) for page in range(1, 6)]
        with open('link_passwds.txt', 'a', encoding='utf-8') as file:
            for rslt in tqdm.tqdm(asyncio.as_completed(tasks),
                                  total=len(tasks)):
                text = await rslt
                if not text:
                    continue
                html = etree.HTML(text)
                ajax_load_divs = html.xpath(
                    '//div[@class="ajax-load-con content wow fadeInUp"]')
                # Collect (href, title) for every post on this list page.
                sub_tasks_lst = []
                for div in ajax_load_divs:
                    h2 = div.xpath('.//h2')[0]
                    href = h2.xpath('./a/@href')[0]
                    title = h2.xpath('./a/@title')[0]
                    sub_tasks_lst.append((href, title, session))
                sub_tasks = [get_link_passwd(*tp) for tp in sub_tasks_lst]
                for f in asyncio.as_completed(sub_tasks):
                    rslt_tp = await f
                    if rslt_tp:
                        file.write(rslt_tp[1] + ': ' + rslt_tp[0] + '\n')
                        file.write(rslt_tp[2] + '\n')
                        file.write('\n')
    total_secs = time.time() - start
    print('total_secs:', total_secs)
    return 'done'


if __name__ == '__main__':
    print(asyncio.run(Main()))
```
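One thing to keep in mind: the crawler above fires every detail-page request for a list page at once. If you scale it up to many pages, capping the number of requests in flight is probably kinder to the server and less likely to get you blocked. Below is a minimal sketch of one way to do that with `asyncio.Semaphore`. It is not part of the crawler as posted; `run_bounded`, `fake_fetch`, and the limit of 5 are all hypothetical, and `fake_fetch` just sleeps instead of touching the network.

```python
import asyncio


async def run_bounded(factories, limit=5):
    """Run zero-argument coroutine factories, at most `limit` at a time.

    Results come back in the same order as `factories`.
    """
    sem = asyncio.Semaphore(limit)

    async def guarded(factory):
        async with sem:  # wait here while `limit` calls are already running
            return await factory()

    return await asyncio.gather(*(guarded(f) for f in factories))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_fetch(i):
        # Stand-in for a real request; records how many run concurrently.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return i

    jobs = [lambda i=i: fake_fetch(i) for i in range(20)]
    results = await run_bounded(jobs, limit=5)
    return results, peak


results, peak = asyncio.run(demo())
print(len(results), 'results, peak concurrency:', peak)
```

In the crawler, each factory would wrap a `get_link_passwd(href, title, session)` call, for example `lambda tp=tp: get_link_passwd(*tp)`.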
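Sample results:
Finally, a plug (updated 2020/11/28) for a live-stream recording tool I wrote (supports Douyu, Bilibili, and Huya); it can capture and display the danmaku (bullet comments).
https://summer-plus.net/read.php?tid-1017998.html
Everyone is welcome to test it.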