Saturday 21 January 2012


C# code for connecting to a Web Page & Obtaining its Source Code / Web Crawler Algorithm in C#

HTTP is the primary mechanism for communicating with resources over the Web. It is a stateless protocol used for simple request-response communication. A developer often needs to fetch web pages and their source code, for example to build a spider or to extract information from a particular page. For this purpose, the .NET Framework includes classes that make the job straightforward.

Requesting & Obtaining an HTTP page:



To obtain the page, we first need to establish a connection to it and then download its contents.
For this purpose, the .NET Framework provides two classes: HttpWebRequest and HttpWebResponse.
We specify the web page to get with an HttpWebRequest object, which performs the actual request, and then use an HttpWebResponse object to receive the page.

After this, the page source can be read into a string, and the normal String methods can be used to manipulate it as necessary.

The C# code snippet for connecting to a web page & obtaining its source code is:

      // Requires: using System.IO; and using System.Net;
      string url = "http://www.example.com";     // example URL; use the page you want

      // Create and send the request for the given URL.
      HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
      HttpWebResponse resp = (HttpWebResponse)req.GetResponse();

      // Read the response stream (the page's HTML source) into a string.
      Stream istrm = resp.GetResponseStream();
      StreamReader rdr = new StreamReader(istrm);
      string str = rdr.ReadToEnd();

      // Release the stream and the connection.
      rdr.Close();
      resp.Close();
The entire source code of the Web page under consideration is now stored in 'str'. 
Once we have the source code, we can write functions to manipulate it and extract the information we need, such as the URLs in the case of a crawler or spider.
Extracting the URLs from 'str' is fairly straightforward, so it is left to you. Still, if you run into trouble, here is the algorithm, followed by a short sketch of it in code:

Step 1: Find the index of ("href=\"http") starting from the current position & store it in 'i' (temp variable)
Step 2: Find the index of ' " ' starting from 'i', add 1 to it & store it in 's' (temp variable)
Step 3: Find the index of ' " ' starting from 's' & store it in 'e' (temp variable)
Step 4: Obtain the substring starting at 's' with length 'e - s' & store it in 'url' (string variable). Repeat from Step 1, continuing the search from 'e', until no more matches are found.
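
Here is a minimal sketch of that loop in C#. The method name ExtractUrls, the List<string> return type, and the loop bookkeeping are my own choices for illustration; only the IndexOf/Substring steps come from the algorithm above.

      // A minimal sketch of the extraction steps above.
      // Requires: using System.Collections.Generic;
      static List<string> ExtractUrls(string str)
      {
          List<string> urls = new List<string>();
          int i = str.IndexOf("href=\"http");            // Step 1: locate the next link
          while (i != -1)
          {
              int s = str.IndexOf('"', i) + 1;           // Step 2: just past the opening quote
              int e = str.IndexOf('"', s);               // Step 3: the closing quote
              if (e == -1) break;
              urls.Add(str.Substring(s, e - s));         // Step 4: the URL itself
              i = str.IndexOf("href=\"http", e);         // continue from the end of this link
          }
          return urls;
      }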

Now you can easily use this code & build your own simple crawler in C#!!
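
As a rough outline, the two pieces could be combined into a simple breadth-first crawler along these lines. This is only a sketch: GetSource wraps the HttpWebRequest snippet from above, ExtractUrls is the extraction sketch shown earlier, and the visited set, queue and page limit are my own additions.

      // A simple breadth-first crawler sketch built from the snippets above.
      // Requires: using System; using System.Collections.Generic;
      //           using System.IO; using System.Net;
      static string GetSource(string url)
      {
          HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
          using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
          using (StreamReader rdr = new StreamReader(resp.GetResponseStream()))
              return rdr.ReadToEnd();
      }

      static void Crawl(string startUrl, int maxPages)
      {
          Queue<string> pending = new Queue<string>();
          HashSet<string> visited = new HashSet<string>();
          pending.Enqueue(startUrl);

          while (pending.Count > 0 && visited.Count < maxPages)
          {
              string url = pending.Dequeue();
              if (!visited.Add(url))                     // skip pages already fetched
                  continue;

              string source = GetSource(url);            // download the page
              Console.WriteLine(url);                    // or process it however you like

              foreach (string link in ExtractUrls(source))
                  pending.Enqueue(link);                 // queue the outgoing links
          }
      }

Calling Crawl("http://www.example.com", 50), for example, would then print up to 50 reachable URLs.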

You can ask in the comments for further assistance.

