The .NET runtime provides a rich set of classes for programmatic access to the web. Using these classes, you can create HTTP and HTTPS programs that automate tasks performed by human users of the web. These programs are called bots. Chapters 1 and 2 introduce you to HTTP programming.
Chapter 1 of this book begins by examining the structure of HTTP requests. If you are to create programs that make use of the HTTP protocol, it is important to understand the structure of that protocol. This chapter explains what packets are exchanged between web servers and web browsers, as well as the makeup of these packets.
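As a rough illustration of the request structure mentioned above, the following sketch assembles the text of a minimal HTTP/1.1 GET request: a request line, a few headers, and a blank line that ends the header section. The class name `HttpRequestSketch`, the method `BuildGetRequest`, and the host `www.example.com` are hypothetical names chosen for this example; they are not taken from the book.

```csharp
using System;
using System.Text;

public static class HttpRequestSketch
{
    // Builds the text of a minimal HTTP/1.1 GET request.
    // Host and path are hypothetical inputs chosen for illustration.
    public static string BuildGetRequest(string host, string path)
    {
        var sb = new StringBuilder();
        sb.Append("GET " + path + " HTTP/1.1\r\n"); // request line: method, resource, version
        sb.Append("Host: " + host + "\r\n");        // Host header, required by HTTP/1.1
        sb.Append("Connection: close\r\n");         // ask the server to close after responding
        sb.Append("\r\n");                          // blank line marks the end of the headers
        return sb.ToString();
    }

    public static void Main()
    {
        Console.Write(BuildGetRequest("www.example.com", "/index.html"));
    }
}
```

A browser normally sends more headers than this, but the overall layout of the request is the same.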
Chapter 2 shows how to monitor the packets being transferred between a web server and web browser. Using a program called a Network Analyzer, you can quickly see what HTTP packets are being exchanged. To create a successful bot, your bot must exchange the same packets with the web server that a user would. A Network Analyzer can help quickly create a bot by showing you
From Chapter 3 onward, this book is structured as a set of recipes. You are provided with short, concise programming examples for many common HTTP programming tasks. Most of the chapters are organized into two parts. The first part introduces the topic of the chapter. The second part is a collection of recipes. These recipes are meant to be starting points for your own programs that require similar functionality.
Chapter 3 shows how to execute simple HTTP requests. A simple HTTP request is one that accesses only a single web page. All data that is needed will be on that page, and no additional information must be passed to the web server.
Chapter 4 goes beyond simple requests and shows how to make use of other features of the HTTP protocol. HTTP server and client headers are introduced. Additionally, you will be shown how to access data from basic HTML files.
Chapter 5 shows how to use HTTPS. HTTPS is the more secure version of HTTP. Use of HTTPS is generally automatic in C#. However, you will be shown some of the HTTPS-specific features that C# provides, and how to use them. You will also be introduced to HTTP authentication, which is a means by which the web server can prompt the user for an ID and password.
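One common form of HTTP authentication is the Basic scheme, in which the client sends the ID and password, Base64 encoded, in an `Authorization` header. The sketch below assumes that scheme; the overview above does not say which schemes the chapter covers. The names `BasicAuthSketch` and `BuildAuthorizationHeader` and the sample credentials are hypothetical.

```csharp
using System;
using System.Text;

public static class BasicAuthSketch
{
    // Builds the Authorization header line for HTTP Basic authentication.
    // Assumption: the server accepts the Basic scheme with UTF-8 credentials.
    public static string BuildAuthorizationHeader(string userId, string password)
    {
        string credentials = userId + ":" + password; // "id:password" pair
        string encoded = Convert.ToBase64String(Encoding.UTF8.GetBytes(credentials));
        return "Authorization: Basic " + encoded;
    }

    public static void Main()
    {
        // Hypothetical credentials, for illustration only.
        Console.WriteLine(BuildAuthorizationHeader("user", "secret"));
    }
}
```

Base64 is an encoding, not encryption, so a bot should send a header like this only over HTTPS.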
Chapter 6 shows how to access data from a variety of HTML sources. An HTML parser is developed that will be used with most of the remaining recipes in this book. You are shown how to use this parser to extract data from forms, lists, tables and other structures. Recipes are provided that will serve as a good starting point for any of these HTML constructs.
Chapter 7 shows how to interact with HTML forms. HTML forms are very important to web sites that need to interact with the user. This chapter will show how to construct the appropriate response to an HTML form. You are shown how each of the control types of the form interacts with the web server.
Chapter 8 shows how to handle cookies and sessions. You will see that the web server can track who is logged on and maintain a session using either cookies or a URL variable. A useful class will be developed that will handle cookie processing in C#.
Chapter 9 explains the effects that JavaScript can have on a bot. JavaScript allows programs to be executed by the web browser. This can complicate matters for bots. The bot programmer must understand how JavaScript helps to shape the content of HTTP packets being produced by the browser. The bot must provide these same packets if it is to work properly.
Chapter 10 explains the effects that AJAX can have on a bot. AJAX is based on XML and JavaScript. It has many of the same effects on a bot program as JavaScript does. However, most AJAX web sites are designed to communicate with the web server using XML. This can make creating a bot for an AJAX web site easier.
Chapter 11 introduces web services. Web services have replaced many of the functions previously performed by bots. Sites that make use of web services provide access to their data through XML. This makes it considerably easier to access their data than writing a traditional bot. Additionally, you can use web services in conjunction with regular bot programming. This produces a hybrid bot.
Chapter 12 shows how to create bots that make use of RSS feeds. RSS is an XML format that allows quick access to the newest content on a web site. Bots can be constructed to automatically access RSS information from a web site.
Chapter 13 introduces the Heaton Research Spider. The Heaton Research Spider is an open source implementation of a C# spider. There is also a Java version of the Heaton Research Spider. A spider is a program that is designed to access a large number of web pages. The spider does this by continuously visiting the links found on web pages, and then the pages found at those links. A web spider visits sites much as a biological spider crawls its web.
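The link-following behavior described above can be sketched as a breadth-first traversal. This is a simplified illustration, not the Heaton Research Spider's actual code: the `links` dictionary is a hypothetical stand-in for downloading a page and extracting its links, and the names `SpiderSketch` and `Crawl` are invented for this example.

```csharp
using System;
using System.Collections.Generic;

public static class SpiderSketch
{
    // Returns every page reachable from startUrl, in the order visited.
    // "links" maps a URL to the links found on that page (hypothetical data
    // that replaces real downloading and HTML parsing).
    public static List<string> Crawl(string startUrl, IDictionary<string, string[]> links)
    {
        var seen = new HashSet<string>();   // URLs already queued or visited
        var waiting = new Queue<string>();  // URLs still to be visited
        var order = new List<string>();     // visit order, returned to the caller

        waiting.Enqueue(startUrl);
        seen.Add(startUrl);

        while (waiting.Count > 0)
        {
            string url = waiting.Dequeue();
            order.Add(url);

            string[] found;
            if (!links.TryGetValue(url, out found))
                continue; // page has no known links

            foreach (string link in found)
            {
                // HashSet.Add returns false for duplicates, so each URL is queued once.
                if (seen.Add(link))
                    waiting.Enqueue(link);
            }
        }
        return order;
    }

    public static void Main()
    {
        var site = new Dictionary<string, string[]>
        {
            { "/", new[] { "/a", "/b" } },
            { "/a", new[] { "/b", "/c" } },
            { "/b", new[] { "/" } },
        };
        Console.WriteLine(string.Join(" ", Crawl("/", site)));
    }
}
```

Tracking the URLs already seen is what keeps the loop from revisiting pages that link back to each other.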
The remaining chapters of this book do not include recipes. Chapters 14 and 15 explain how the Heaton Research Spider works. Chapter 16 explains how to create well-behaved bots.
Chapter 14 explains the internals of the Heaton Research Spider. Because the Heaton Research Spider is open source, you can modify it to suit your needs. By default, the Heaton Research Spider uses computer memory to track the list of visited URLs. This chapter explains how this memory-based URL tracking works. The next chapter explains how to use an SQL database instead of computer memory.
Chapter 15 explains how the Heaton Research Spider makes use of databases. The Heaton Research Spider can use databases to track the URLs that it has visited. This allows the spider to access a much larger volume of URLs than when using computer memory to track the URL list.