前言
主要是最近有個專案忘記紀錄以前寫的 Code ,順便把爬蟲那段拿出來紀錄。爬蟲是透過 Response 回來後的html 並從裡面竊取資料,執行動作必須要確認當前站台是否有開放能拿取資料的設定檔,如 robots.txt 。這邊就以簡單範例為例。
前置作業
撰寫爬蟲頁面
這邊使用 “https://udn.com/news/cate/2/6644“ 聯合報新聞來做示範。
response當前頁面
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
   | using HtmlAgilityPack; namespace networkReptile {     internal class Program     {         static async Task Main(string[] args)         {                          string url = "https://udn.com/news/cate/2/6644";                                       HttpClient client = new();             HttpResponseMessage response = await client.GetAsync(url);                          response.EnsureSuccessStatusCode();             string responseBody = await response.Content.ReadAsStringAsync();
                           HtmlDocument doc = new ();             doc.LoadHtml(responseBody);
          }     } }
   | 
 
取得想要的資料
1 2 3 4 5 6 7 8 9 10 11 12 13
   |  for (int i = 1; i<10; i++) {     string xpath = @$"/html/body/main/div/section[2]/section[2]/div[1]/div[{i}]/div[2]/h2/a";     HtmlNodeCollection content = doc.DocumentNode.SelectNodes(xpath);     if(content == null) { continue; }     foreach (HtmlNode node in content)     {         string href = doc.DocumentNode.SelectNodes(xpath+ @"/@href").FirstOrDefault().Attributes.FirstOrDefault().Value.ToString();         Console.WriteLine($"{i} - {node.InnerText} (https://udn.com/{href})");         break;     } }
 
  | 
 
完整程式碼
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
   | using HtmlAgilityPack;
  namespace networkReptile {     internal class Program     {         static async Task Main(string[] args)         {                          string url = "https://udn.com/news/cate/2/6644";                                       HttpClient client = new();             HttpResponseMessage response = await client.GetAsync(url);             response.EnsureSuccessStatusCode();             string responseBody = await response.Content.ReadAsStringAsync();
                           HtmlDocument doc = new ();             doc.LoadHtml(responseBody);
              Console.WriteLine($"!! ----- 即時新聞 ------ !!");
                           for (int i = 1; i<10; i++)             {                 string xpath = @$"/html/body/main/div/section[2]/section[2]/div[1]/div[{i}]/div[2]/h2/a";                 HtmlNodeCollection content = doc.DocumentNode.SelectNodes(xpath);                 if(content == null) { continue; }                 foreach (HtmlNode node in content)                 {                     string href = doc.DocumentNode.SelectNodes(xpath+ @"/@href").FirstOrDefault().Attributes.FirstOrDefault().Value.ToString();                     Console.WriteLine($"{i} - {node.InnerText} (https://udn.com/{href})");                     break;                 }             }
          }     } }
   | 
 
參考文件