Saturday 21 January 2012


C# code for connecting to a Web Page & Obtaining its Source Code / Web Crawler Algorithm in C#

HTTP is the primary mechanism for communicating with resources over the Web. It is a stateless protocol used for simple request-response communication. A developer often needs to fetch web pages and their source code, for example to build a spider or to extract information from a particular page. For this purpose, the .NET Framework includes classes that make the job straightforward.

Requesting & Obtaining an HTTP page:



To obtain the page, we first need to establish a connection to it and then download its contents.
For this purpose, the .NET Framework provides two classes: HttpWebRequest and HttpWebResponse.
We specify the web page to get with an HttpWebRequest object, which performs the actual request, and then use an HttpWebResponse object to receive the page.

After this, the page source can be read into a string, and the normal String methods can be used to manipulate it as necessary.

The C# code snippet for connecting to a web page & obtaining its source code is:

      // Requires: using System.IO; and using System.Net;
      string url = "http://www.example.com";     // example URL; use the page you want

      // Create and send the request for the given URL.
      HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
      HttpWebResponse resp = (HttpWebResponse)req.GetResponse();

      // Read the response stream (the page's HTML source) into a string.
      Stream istrm = resp.GetResponseStream();
      StreamReader rdr = new StreamReader(istrm);
      string str = rdr.ReadToEnd();

      // Release the stream and the connection.
      rdr.Close();
      resp.Close();
The entire source code of the Web page under consideration is now stored in 'str'. 
Once we have the source code, we can write functions to manipulate it and extract the information we need, such as the URLs in the case of a crawler or spider.
Extracting the URLs from 'str' is fairly straightforward, so it is left to you. Still, if you run into trouble, here is the algorithm, followed by a short sketch of it in code:

Step 1: Find the index of ("href=\"http") starting from the current position & store it in 'i' (temp variable)
Step 2: Find the index of ' " ' starting from 'i', add 1 to it & store it in 's' (temp variable)
Step 3: Find the index of ' " ' starting from 's' & store it in 'e' (temp variable)
Step 4: Obtain the substring starting at 's' with length 'e - s' & store it in 'url' (string variable). Repeat from Step 1, continuing the search from 'e', until no more matches are found.
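
Here is a minimal sketch of that loop in C#. The method name ExtractUrls, the List<string> return type, and the loop bookkeeping are my own choices for illustration; only the IndexOf/Substring steps come from the algorithm above.

      // A minimal sketch of the extraction steps above.
      // Requires: using System.Collections.Generic;
      static List<string> ExtractUrls(string str)
      {
          List<string> urls = new List<string>();
          int i = str.IndexOf("href=\"http");            // Step 1: locate the next link
          while (i != -1)
          {
              int s = str.IndexOf('"', i) + 1;           // Step 2: just past the opening quote
              int e = str.IndexOf('"', s);               // Step 3: the closing quote
              if (e == -1) break;
              urls.Add(str.Substring(s, e - s));         // Step 4: the URL itself
              i = str.IndexOf("href=\"http", e);         // continue from the end of this link
          }
          return urls;
      }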

Now you can easily use this code & build your own simple crawler in C#!!
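
As a rough outline, the two pieces could be combined into a simple breadth-first crawler along these lines. This is only a sketch: GetSource wraps the HttpWebRequest snippet from above, ExtractUrls is the extraction sketch shown earlier, and the visited set, queue and page limit are my own additions.

      // A simple breadth-first crawler sketch built from the snippets above.
      // Requires: using System; using System.Collections.Generic;
      //           using System.IO; using System.Net;
      static string GetSource(string url)
      {
          HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
          using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
          using (StreamReader rdr = new StreamReader(resp.GetResponseStream()))
              return rdr.ReadToEnd();
      }

      static void Crawl(string startUrl, int maxPages)
      {
          Queue<string> pending = new Queue<string>();
          HashSet<string> visited = new HashSet<string>();
          pending.Enqueue(startUrl);

          while (pending.Count > 0 && visited.Count < maxPages)
          {
              string url = pending.Dequeue();
              if (!visited.Add(url))                     // skip pages already fetched
                  continue;

              string source = GetSource(url);            // download the page
              Console.WriteLine(url);                    // or process it however you like

              foreach (string link in ExtractUrls(source))
                  pending.Enqueue(link);                 // queue the outgoing links
          }
      }

Calling Crawl("http://www.example.com", 50), for example, would then print up to 50 reachable URLs.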

You can ask in the comments for further assistance.

