Saturday, October 30, 2010

Screen-scraping Wikipedia

To screen-scrape a page on Wikipedia, you must take one extra step to download the page successfully: include a User-Agent header in your HTTP request. Wikipedia requires this header and returns a 403 Forbidden error without it. I found this out thanks to a user on the #mediawiki IRC channel. They suggest that you set the User-Agent to something that uniquely identifies your program or application. They strongly discourage using the User-Agent string of a browser because this signals that you might be doing something malicious.

It is easy to set the User-Agent header in PHP. You can either edit your PHP installation's php.ini file or add the following line of code to your PHP script. The cURL library also supports setting HTTP headers, but this library is not included in the standard PHP installation.

//tell it what value to use for the User-Agent header
ini_set('user_agent', 'My Cool Screen-Scraper (+http://www.mangst.com)');

//includes the above User-Agent header in this request and all subsequent requests
$page = file_get_contents('http://en.wikipedia.org/wiki/Pumpkin');

Note that this is different from the header() function. The header() function is used to set the headers of the HTTP response that the PHP script itself is generating. This has nothing to do with any HTTP requests that the script makes in the process of generating its response.
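
For comparison, here is roughly how the same request might look in Python. This is only a sketch using the standard library's urllib.request module. The function names build_request and fetch_page are my own, and the User-Agent string is the example one from the PHP snippet.

```python
import urllib.request

# Example value from the PHP snippet; use something that identifies your own program
USER_AGENT = "My Cool Screen-Scraper (+http://www.mangst.com)"


def build_request(url, user_agent=USER_AGENT):
    """Create a request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_page(url, user_agent=USER_AGENT):
    """Download the page and return its body as bytes."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        return response.read()
```

Calling fetch_page('http://en.wikipedia.org/wiki/Pumpkin') would download the page. Unlike ini_set(), which affects every later request in the script, this attaches the header to one request at a time.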

Monday, October 25, 2010

Poor Man's FTP

In the November issue of Linux Journal Magazine, Kyle Rankin wrote an article about his experience attending the DEF CON conference.  One lesson he took away was the importance of knowing how to use the basic Linux commands, such as vi and sh.  Being familiar with these commands means that you won't be dead in the water if you have to work on a computer with a minimal Linux install.

One of these commands is netcat (nc).  Netcat allows you to open TCP and UDP connections with other computers as well as listen for connections.  Kyle described many interesting ways that you can use this command.  My favorite was using it to transfer files.  I think that this technique would come in very handy if ssh or ftp is not installed.

It's very simple.  The computer receiving the file runs this command:

nc -l 31337 > output_file

And then the computer sending the file runs this command:

nc hostname 31337 < input_file

This will send the file through port 31337 and automatically close the connection when the transfer is complete.  It doesn't matter what port you use, so long as the port isn't being used by another program (on Linux, ports below 1024 also require root privileges).
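
To make it clearer what the two nc commands are doing, here is a rough Python sketch of the same transfer using plain TCP sockets. It is not how netcat is implemented, just an illustration of the idea. The function names receive_file and send_file are made up for this example.

```python
import socket


def receive_file(port, output_path, host=""):
    """Listen on a port, accept one connection, and save everything it sends."""
    with socket.create_server((host, port)) as server:
        conn, _addr = server.accept()
        with conn, open(output_path, "wb") as out:
            # recv() returns an empty bytes object once the sender closes the connection
            while chunk := conn.recv(65536):
                out.write(chunk)


def send_file(hostname, port, input_path):
    """Connect to the receiver, send the whole file, then close the connection."""
    with socket.create_connection((hostname, port)) as conn, open(input_path, "rb") as src:
        conn.sendfile(src)
```

The receiver would call receive_file(31337, 'output_file') first, and then the sender would call send_file('hostname', 31337, 'input_file'). As with nc, the receiver knows the transfer is done because the sender closes the connection.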

Friday, October 22, 2010

Starbucks Wi-Fi

I ran into a small problem at a Starbucks store the other day.  I had my netbook with me and wanted to connect to their Wi-Fi network.  The service is free, but in order to access the Internet, you must first visit a Starbucks webpage that asks you to accept their terms and conditions.  Any attempt to visit any other website will redirect you to this page.

I like to have my browser reopen all the tabs from my last browsing session when it starts up.  My problem was that, because I have to accept the terms and conditions first, all my tabs would redirect to the Starbucks page.  Clicking the back button after accepting the terms and conditions doesn't return me to my original page.  I think that this is because it's an HTTP 3xx redirect response.  So I basically lose all my tabs.

Getting around this wasn't too tricky.  The terms and conditions page is just an HTML form with a bunch of hidden parameters and a checkbox for "I agree".  I wrote a Java program to parse all the parameters out of the page and submit the form.  So if I run this before opening my browser, my browser will reload all its tabs no problem.  No annoying redirects to the Starbucks page.

You can download it here.  I put all the classes in one file to make it simpler.  I also wrote some JUnit tests to test the part that parses the HTML page.  To run it, just compile the file and run java Starbucks -v.  The -v (verbose) is optional and will cause it to print status messages as it's working.  Run this program as soon as you connect to the Starbucks Wi-Fi network (and before you open your browser).
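
The form-parsing step can be sketched in a few lines of Python as well. This is not the Java program, only an illustration of the idea using the standard library's html.parser module. The class and function names are my own, and it assumes that ticking every checkbox on the page is what you want, which is all the "I agree" box needs.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collects the first form's action URL and the name/value pairs of all <input> fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("name"):
            if (attrs.get("type") or "").lower() == "checkbox":
                # Assumption: tick every checkbox; browsers send "on" when no value is given
                self.fields[attrs["name"]] = attrs.get("value") or "on"
            else:
                self.fields[attrs["name"]] = attrs.get("value") or ""


def parse_form(html):
    """Return (action, fields) for the form found in the page."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser.action, parser.fields


def encode_form(fields):
    """Encode the fields as a URL-encoded POST body, the way a browser would."""
    return urlencode(fields).encode("ascii")
```

Submitting the form would then be a matter of POSTing encode_form(fields) to the action URL, for example with urllib.request.urlopen(action_url, data=encode_form(fields)).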

Monday, October 18, 2010

Peer-to-Peer (P2P) Systems

In the current issue of the magazine Communications of the ACM, there is an article called Peer-to-Peer Systems by Rodrigo Rodrigues and Peter Druschel. Along with discussing the pros and cons of P2P networks and including examples of how they are used, the article goes into technical detail about how they work.

The article divides P2P systems into two types. One type is partly centralized. In these systems, there exists a single controller node, which keeps a list of all nodes that are connected to the network, along with the resources that each node is sharing. A good example of this kind of P2P network would be Napster (now non-existent). When you searched for a song on Napster, a request would be sent to a centralized server owned by the Napster folks themselves. This server would then search its database for all computers in the P2P network that had the song, and return this list to you. You then downloaded the song by directly connecting to the computer hosting the file. Without this server, there would be no way to get the song that you were looking for because you wouldn't know what computers, out of all the computers in the entire Internet, are both sharing their music collection and have the song you are looking for.
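
To picture what the controller node does, here is a toy sketch in Python. It is my own illustration, not code from the article or from Napster, and the class name CentralIndex and the peer addresses are invented. The index only answers the question "who has this file?". The download itself still happens directly between peers.

```python
from collections import defaultdict


class CentralIndex:
    """The controller node: maps each shared resource to the peers that have it."""

    def __init__(self):
        self._peers_by_resource = defaultdict(set)

    def register(self, peer_address, resources):
        """Called when a peer joins and announces what it is sharing."""
        for resource in resources:
            self._peers_by_resource[resource].add(peer_address)

    def unregister(self, peer_address):
        """Called when a peer leaves the network."""
        for peers in self._peers_by_resource.values():
            peers.discard(peer_address)

    def search(self, resource):
        """Return the addresses of all peers sharing the resource."""
        return sorted(self._peers_by_resource.get(resource, ()))
```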

The other type of P2P network is decentralized. In a decentralized P2P network, there is no controller node that knows about all the computers in the network. Computers connect to the network through a bootstrap node, which is just one computer in the network that makes its IP address publicly known. This lack of centralization makes these types of P2P networks more robust, as the network is not dependent on a single server being functional. But because of this lack of centralization, there is no straightforward way of knowing what computers are connected to the network or what resources they are sharing. This makes searching for particular resources trickier. The article describes two ways in which a decentralized P2P network can be structured in order to solve this problem of search.

One way of structuring a decentralized P2P network is by using an unstructured overlay (an overlay is a graph that describes how the nodes are connected with each other). Each computer in the network only knows about a few other computers that are also connected to the network. To search for a file, the computer will query its neighbors first. If none of its neighbors have the file, then it will ask its neighbors' neighbors. If none of these computers have the file, then it will ask its neighbors' neighbors' neighbors, and so on. This kind of decentralized network is fine if the resource you are looking for is replicated across many other nodes. However, if the resource is rare, then finding that resource could take a very long time. For example, imagine having to search for a file which exists on only one computer in a network of one million computers. The odds of that computer being within a short search distance of your computer are slim.
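
The neighbor-by-neighbor search described above is essentially a breadth-first search with a hop limit. Here is a small Python sketch of it. The overlay is modeled as a plain dictionary and the function name flood_search is my own, so treat it as an illustration rather than how any real P2P client does it.

```python
from collections import deque


def flood_search(neighbors, files, start, wanted, max_hops):
    """Ask neighbors, then neighbors' neighbors, and so on, up to max_hops away.

    neighbors maps a node to the nodes it knows about.
    files maps a node to the set of files it is sharing.
    Returns (node, hops) for the first node found with the file, or None.
    """
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if wanted in files.get(node, ()):
            return node, hops
        if hops < max_hops:
            for peer in neighbors.get(node, ()):
                if peer not in visited:
                    visited.add(peer)
                    queue.append((peer, hops + 1))
    return None
```

With a small hop limit the search gives up quickly, which is exactly why a rare file sitting far away in the graph may never be found.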

The other way a decentralized P2P network can be structured is by using a structured overlay. I didn't quite understand the specifics of this technique, but it involves the use of unique keys. Each computer in the network is assigned a unique key in such a way that all keys are evenly spread out in the key space. For example, if the key space is 0-999, the first node will be assigned a random key, say 432. Then, the second node will be assigned a key around 932 (999/2+432, on the opposite side of the "circle"). The third node will be assigned a key around either 682 or 182, and so on. Each node only directly knows about its two neighbors, so the overlay graph looks like a circle. The advantage to this type of overlay is that it makes searching much faster. It's able to use these unique keys to quickly find a computer hosting the resource. This is called key-based routing (KBR). Even if the resource is rare, it will still be able to find it quickly (unlike unstructured overlays, which must spend the time to ask each node directly). However, the downside is that there is overhead in maintaining these keys. Extra work must be done every time a node enters or leaves the network, so if the number of computers that are connected to the network is constantly changing (called churn) this may not be the best solution.
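
Here is my attempt at sketching the key idea in Python, using the 0-999 key space from the example. Two things in it are my own assumptions and not something I took from the article: resources get a key by hashing their name, and a resource lives on the first node whose key is at or after the resource's key, going around the circle. All function names are invented for this sketch.

```python
import bisect
import hashlib

KEY_SPACE = 1000  # keys 0-999, as in the example above


def opposite_key(key, key_space=KEY_SPACE):
    """Key on the opposite side of the circle, e.g. 432 -> 932."""
    return (key + key_space // 2) % key_space


def resource_key(name, key_space=KEY_SPACE):
    """Assumption: a resource's key is a hash of its name, reduced to the key space."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % key_space


def responsible_node(node_keys, key):
    """Assumption: the first node key at or after `key`, wrapping around the circle."""
    ring = sorted(node_keys)
    index = bisect.bisect_left(ring, key)
    return ring[index % len(ring)]


def hops_clockwise(node_keys, start_key, target_key):
    """Neighbor-to-neighbor hops from the node start_key to the node responsible for target_key."""
    ring = sorted(node_keys)
    start = ring.index(start_key)
    end = ring.index(responsible_node(ring, target_key))
    return (end - start) % len(ring)
```

With nodes at 432, 932, 682 and 182, a resource with key 500 would land on node 682, and one with key 950 would wrap around to node 182. The point is that a node can work out where a resource should be from the key alone, instead of asking everyone.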

It was a very good article, but I do have one criticism: it considers applications like SETI@home to be P2P. How is this P2P? You do not communicate with the other peers on the network. You only communicate with the centralized SETI server in order to download new data to process. I think that a better category for SETI@home would be "distributed computing", not P2P.

Saturday, October 2, 2010

How to access your home computer over the Internet with VNC

The VNC protocol gives you remote control of another computer's screen. You can see and interact with the computer as if you were sitting right in front of it.

In this blog post, I'm going to describe how to set up your computer so that you can connect to it anywhere in the world through the Internet.  If your home computer is connected to a router (which it probably is), then the process is a little tricky, which is why I thought it would be helpful to write this blog post.

1. Install a VNC Server

First, you must install a VNC Server on the computer you want to control.

Mac OS X already comes with the necessary software, so you don't have to install anything. To enable Mac's VNC Server, do the following:

a. In System Preferences, click on "Sharing".
b. Check the "Screen Sharing" checkbox to enable it.
c. Click the "Computer Settings..." button.
d. Check the box that says "VNC viewers may control screen with password". Type in a password, then click OK. Remember that your computer will be visible to the world, so make sure the password is secure!





If you are running Windows, you can use TightVNC Server as a VNC Server.

2. Configure your router

Your home computer is probably connected to a router, either through a wired Ethernet connection or a wireless connection. A router connects all of your computers together to form a home network and acts as the gatekeeper to the Internet. But if your computer is connected to a router, then it doesn't have its own public IP address, which is what you need in order for VNC to connect to your computer.

To get around this, you must tell your router to forward VNC traffic to the computer you want to control. VNC communicates over port 5900, so you must tell your router to forward all data it receives on this port to port 5900 on your computer. Here is how I did this with my Belkin router:

First, open the router's configuration web page by typing its private IP address in a web browser. My router's private IP is 192.168.2.1. (Private IP addresses are only visible within your home network--they are not visible from the Internet.)

Then, click on the "Virtual Servers" menu option under the "Firewall" category.  It will ask you for a password. If you haven't configured the router with a password, then just click "Submit".  This page lists all the data that the router will forward to other computers on the network. Pick an empty row and enter 5900 for the Inbound and Private port fields. Then, enter the private IP address of the computer you want to control. Finally, click on the "Enable" checkbox, then click "Apply Changes". As you can see in the screenshot, you can do this for other services too like SSH, FTP, or HTTP.
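
If you aren't sure whether an address is a private one (like my router's 192.168.2.1) or one that is visible from the Internet, Python's standard ipaddress module can tell you. This is just a small helper of my own, and the function name is made up.

```python
import ipaddress


def is_private_address(address):
    """Return True if the address belongs to a private (home network) range."""
    return ipaddress.ip_address(address).is_private
```

For example, is_private_address('192.168.2.1') returns True, so that address is the kind you enter in the router's forwarding table, not the kind you connect to from outside.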



3. Test it out

You'll need a VNC Viewer in order to connect to the computer. Chicken of the VNC is a good VNC Viewer for Mac. For Windows, you can use TightVNC Viewer.

You'll also need the public IP address of your router, which is the address the rest of the Internet sees. An easy way to get it is to visit whatismyip.com from one of the computers that are connected to the router.

If the VNC Viewer asks for a display number, enter "0". Display "0" maps to port 5900, display "1" maps to port 5901, display "2" maps to port 5902, etc.
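
The display-to-port mapping is simple arithmetic, and it can be handy to check whether the port is reachable before blaming the VNC Viewer. Here is a small Python sketch of both. The function names are my own, and the check only tells you that something accepts TCP connections on that port, not that it is actually a VNC Server.

```python
import socket

VNC_BASE_PORT = 5900  # display 0


def display_to_port(display):
    """Map a VNC display number to its TCP port (0 -> 5900, 1 -> 5901, ...)."""
    if display < 0:
        raise ValueError("display number must not be negative")
    return VNC_BASE_PORT + display


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

You would call is_port_open(your_router_ip, display_to_port(0)) from a computer outside your home network. Some routers won't forward connections that come from inside the network to their own public address, so a test from home may fail even when everything is set up correctly.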

4. Get a free domain name (optional)

DynDNS is a free service that maps your IP address to a domain name like foobar.dyndns.com. Check to see if your router supports this service. My router will automatically update my DynDNS account whenever the router's IP address changes (which can happen often). Using DynDNS means you don't have to memorize your IP address or worry about it changing.