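There are many ways to fetch web page data. This article covers three of them: WebClient, the WebBrowser control, and HttpWebRequest/HttpWebResponse.
Each of these returns the complete content of the page. If you only need certain pieces of data, you can write your own function to pick them out. The usual approach is to look at the structure of the page source and use a regular expression to filter out the part you need.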
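As a small illustration of that regex filtering step, here is a minimal sketch that pulls the text of the `<title>` element out of a page source. The class name `TitleExtractor`, the method `ExtractTitle`, and the sample markup are hypothetical and exist only for this example; a regular expression like this is adequate for simple, predictable markup but is not a full HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // Returns the text of the first <title> element, or an empty string if there is none.
    public static string ExtractTitle(string html)
    {
        // Named group "title" lazily captures everything between the opening and closing tags.
        Match m = Regex.Match(html, @"<title[^>]*>(?<title>[\s\S]*?)</title>", RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["title"].Value.Trim() : string.Empty;
    }

    public static void Main()
    {
        // Hypothetical markup standing in for a downloaded page.
        string html = "<html><head><TITLE> Sample page </TITLE></head><body></body></html>";
        Console.WriteLine(ExtractTitle(html)); // prints "Sample page"
    }
}
```

The same pattern applies to any other fragment: adjust the expression to the surrounding markup of the data you want and read the named group from the match.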
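1. Getting page content with WebClient
This is a very simple approach, although the other two are simple as well. One point first: if efficiency matters in a real project, consider reading the response through a fixed-size buffer allocated inside the function. A rough sketch follows.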
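- // Fragment from inside a method that returns string; reader is the response Stream.
- // Needs: using System.IO; using System.Text; using System.Text.RegularExpressions;
- string html, encoding;
- byte[] buffer = new byte[1024];
- using (MemoryStream memory = new MemoryStream())
- {
-     int index = 1, sum = 0;
-     // Read until the stream ends or about 100 KB has been collected.
-     while (index > 0 && sum < 100 * 1024)
-     {
-         index = reader.Read(buffer, 0, buffer.Length);
-         if (index > 0)
-         {
-             memory.Write(buffer, 0, index);
-             sum += index;
-         }
-     }
-
-     // Decode as gb2312 first, then look for the charset the page declares.
-     html = Encoding.GetEncoding("gb2312").GetString(memory.ToArray());
-     if (string.IsNullOrEmpty(html))
-     {
-         return html;
-     }
-     else
-     {
-         Regex re = new Regex(@"charset=(?<charset>[\s\S]*?)[ |'""]");
-         Match m = re.Match(html.ToLower());
-         encoding = m.Groups["charset"].ToString();
-     }
-     if (string.IsNullOrEmpty(encoding) || string.Equals(encoding.ToLower(), "gb2312"))
-     {
-         return html;
-     }
-     // Otherwise decode the same bytes again with the declared charset.
-     return Encoding.GetEncoding(encoding).GetString(memory.ToArray());
- }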
With that out of the way, here is the code for fetching page data with WebClient.
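- // Needs: using System; using System.IO; using System.Net; using System.Text;
- try
- {
-     using (WebClient webClient = new WebClient())
-     {
-         // Use the current user's default credentials for the request.
-         webClient.Credentials = CredentialCache.DefaultCredentials;
-         // Download the page as raw bytes.
-         byte[] pageData = webClient.DownloadData("http://www.360doc.com/content/11/0427/03/1947337_112596569.shtml");
-
-         // Decode as UTF-8; use Encoding.GetEncoding("gb2312") instead if the page is gb2312-encoded.
-         string pageHtml = Encoding.UTF8.GetString(pageData);
-         // Save the page source to a file.
-         using (StreamWriter sw = new StreamWriter("e:\\ouput.txt"))
-         {
-             sw.Write(pageHtml);
-         }
-     }
- }
- catch (WebException webEx)
- {
-     Console.WriteLine(webEx.Message);
- }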
2. Getting page content with the WebBrowser control
This is arguably the simplest approach of the three. Drag a WebBrowser control onto the form, then pair it with the following code.
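- // Needs: using System.IO; using System.Windows.Forms;
- WebBrowser web = new WebBrowser();
- // Subscribe to DocumentCompleted before starting navigation.
- web.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(web_DocumentCompleted);
- web.Navigate("http://www.163.com");
-
- // Event handler (a method of the form class); runs when the page has finished loading.
- void web_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
- {
-     WebBrowser web = (WebBrowser)sender;
-     // Collect every <table> element and append its text to a file.
-     HtmlElementCollection ElementCollection = web.Document.GetElementsByTagName("Table");
-     foreach (HtmlElement item in ElementCollection)
-     {
-         File.AppendAllText("Kaijiang_xj.txt", item.InnerText);
-     }
- }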
3. Getting page content with HttpWebRequest/HttpWebResponse
This is a more general-purpose approach.
- // Needs: using System.IO; using System.Net; using System.Text;
- public void GetHtml()
- {
-     var url = "http://www.360doc.com/content/11/0427/03/1947337_112596569.shtml";
-     StringBuilder strBuff = new StringBuilder();
-     int byteRead = 0;
-
-     HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
-     // Dispose the response, stream and reader once reading is finished.
-     using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
-     using (Stream reader = webResponse.GetResponseStream())
-     using (StreamReader respStreamReader = new StreamReader(reader, Encoding.UTF8))
-     {
-         char[] cbuffer = new char[1024];
-         // Read returns the number of characters read, or 0 at the end of the stream.
-         byteRead = respStreamReader.Read(cbuffer, 0, cbuffer.Length);
-
-         while (byteRead != 0)
-         {
-             strBuff.Append(cbuffer, 0, byteRead);
-             byteRead = respStreamReader.Read(cbuffer, 0, cbuffer.Length);
-         }
-     }
-     // Save the page source to a file.
-     using (StreamWriter sw = new StreamWriter("e:\\ouput.txt"))
-     {
-         sw.Write(strBuff.ToString());
-     }
- }