Requesting & Obtaining an HTTP page:
To obtain an HTTP page, we first need to establish a connection to it and then download its contents.
For this purpose, C# provides two classes in the System.Net namespace: HttpWebRequest and HttpWebResponse.
We specify the web page to fetch with an HttpWebRequest object, which performs the actual request, and then use an HttpWebResponse object to receive the page.
After this, the normal String methods can be used on the downloaded source code of the web page, and any manipulations can be made as necessary.
The C# code snippet for connecting to a web page (whose address is held in the string variable url) and obtaining its source code is:
// Requires: using System.Net; and using System.IO;
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
Stream istrm = resp.GetResponseStream();
StreamReader rdr = new StreamReader(istrm);
string str = rdr.ReadToEnd();
The entire source code of the Web page under consideration is now stored in 'str'.
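If you prefer to keep this plumbing in one place, here is a minimal sketch that wraps the snippet above in a helper method and disposes of the response and streams when it is done. The method name GetPageSource and the use of using blocks are my own illustrative choices, not part of the original snippet.

// A minimal sketch: fetch a page and return its source as a string.
// Requires: using System.Net; and using System.IO;
static string GetPageSource(string url)
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
    using (Stream istrm = resp.GetResponseStream())
    using (StreamReader rdr = new StreamReader(istrm))
    {
        return rdr.ReadToEnd();
    }
}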
Once we have the source code, we can define a number of functions to manipulate it and pull out the information we need, such as the URLs in the case of a crawler or spider.
Extracting the URLs from 'str' is straightforward, so it is left to you. Still, if you run into trouble, here is the algorithm (a C# sketch follows the steps below):
Step 1: Find the index of "href=\"http" starting from the current search position and store it in 'i' (temp variable)
Step 2: Find the index of '"' starting from 'i', add 1 to it, and store it in 's' (temp variable)
Step 3: Find the index of '"' starting from 's' and store it in 'e' (temp variable)
Step 4: Take the substring starting at 's' with length 'e - s' and store it in 'url' (string variable)
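Here is a minimal sketch of those four steps in C#, repeated in a loop so that every absolute link in 'str' is collected. The method name ExtractUrls and the List<string> return type are my own additions for illustration.

// A sketch of the four steps above, looped over the whole page source.
// Requires: using System.Collections.Generic;
static List<string> ExtractUrls(string str)
{
    List<string> urls = new List<string>();
    int start = 0;
    while (true)
    {
        int i = str.IndexOf("href=\"http", start);    // Step 1: locate the next link
        if (i == -1) break;                           // no more links
        int s = str.IndexOf('"', i) + 1;              // Step 2: position just after the opening quote
        int e = str.IndexOf('"', s);                  // Step 3: position of the closing quote
        if (e == -1) break;
        string url = str.Substring(s, e - s);         // Step 4: the URL itself
        urls.Add(url);
        start = e + 1;                                // continue searching after this link
    }
    return urls;
}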
Now you can easily use this code to build your own simple crawler in C#.
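For instance, a very small breadth-first crawler can be sketched by combining the two helpers above. The queue, the visited set, and the page limit are my own illustrative choices and not part of the original article.

// A minimal crawler sketch using the illustrative helpers GetPageSource and ExtractUrls above.
// Requires: using System; using System.Collections.Generic; using System.Net;
static void Crawl(string startUrl, int maxPages)
{
    Queue<string> pending = new Queue<string>();
    HashSet<string> visited = new HashSet<string>();
    pending.Enqueue(startUrl);
    while (pending.Count > 0 && visited.Count < maxPages)
    {
        string url = pending.Dequeue();
        if (!visited.Add(url)) continue;          // skip pages we have already seen
        string source;
        try { source = GetPageSource(url); }
        catch (WebException) { continue; }        // skip pages that fail to load
        foreach (string link in ExtractUrls(source))
            pending.Enqueue(link);
        Console.WriteLine(url);                   // do something useful with the page here
    }
}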
You can Contact Me for further assistance.