Sunday, April 22, 2012

Retrieving emails with POP3

A couple weeks ago, I showed you how to send emails without an email client. In this blog post, I'm going to show you how to do the opposite--how to retrieve emails from an email server without an email client. As before, everything will be done on the command-line.

To do this, I'll be using the POP3 protocol. POP3 stands for Post Office Protocol version 3. Its purpose is to retrieve emails from an email server (like picking up mail from the post office, hence the name). The other popular email retrieval protocol, which you may have heard of, is IMAP. IMAP offers many more features, like the ability to organize emails into folders, but as a consequence, it is more complex. POP3's feature-set is limited to just retrieving and deleting messages, so it's a lot simpler.

POP3 is similar to SMTP in that client/server communication is text-based. However, POP3 is a little simpler because the responses from the server do not have a plethora of numeric status codes. In POP3, there are only two responses: success responses (which begin with +OK) and failure responses (which begin with -ERR).
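
To make the two response types concrete, here is a minimal Java sketch of how a client might classify a status line. The class and method names (Pop3Status, isOk, text) are made up for illustration and are not part of any library.

```java
// Hypothetical helper for illustration; not part of any POP3 library.
class Pop3Status {
  /** Returns true for a success response (+OK) and false for a failure response (-ERR). */
  static boolean isOk(String line) {
    if (line.startsWith("+OK")) {
      return true;
    }
    if (line.startsWith("-ERR")) {
      return false;
    }
    throw new IllegalArgumentException("Not a POP3 status line: " + line);
  }

  /** Returns the human-readable text after the status indicator (may be empty). */
  static String text(String line) {
    int space = line.indexOf(' ');
    return (space < 0) ? "" : line.substring(space + 1).trim();
  }
}
```

For example, Pop3Status.isOk("+OK send PASS") is true, and Pop3Status.text("+OK Welcome.") gives back "Welcome.".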

Connecting to a POP3 server

So let me show you how it works. I will be opening a POP3 connection to my Gmail account. Gmail requires that POP3 transactions be encrypted, so I can't use telnet like I did in my previous SMTP demo. The openssl command will allow me to open a sort of "encrypted telnet" connection.

openssl s_client -connect pop.gmail.com:995 -crlf -ign_eof

For more information on this command, read its man page: man s_client

If the POP3 server is not encrypted, the telnet command will work:

telnet pop3.server.com 110

Note: Most webmail services support POP3. You should be able to find the POP3 URL of your webmail service in your webmail's configuration settings or help pages.

POP3 commands

First, you obviously must authenticate yourself. There are many ways to perform authentication in POP3, but the simplest way is by using the USER and PASS commands. These allow you to enter your username and password directly.

+OK Gpop ready for requests from 68.80.246.118 cn9pf9267980vdc.5
USER mike.angstadt
+OK send PASS
PASS secret
+OK Welcome.
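
The same exchange could be scripted instead of typed by hand. The following is only a sketch: Pop3Login is a hypothetical class, and it assumes the caller has already wrapped the connection's streams in a BufferedReader and a Writer (for Gmail, those streams would presumably come from a socket created with SSLSocketFactory.getDefault().createSocket("pop.gmail.com", 995)).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;

// Hypothetical helper for illustration; works on any reader/writer pair.
class Pop3Login {
  /** Reads the greeting, then sends USER and PASS, checking for +OK after each step. */
  static void login(BufferedReader in, Writer out, String user, String pass) throws IOException {
    expectOk(in.readLine()); // server greeting
    send(out, "USER " + user);
    expectOk(in.readLine());
    send(out, "PASS " + pass);
    expectOk(in.readLine());
  }

  /** POP3 commands are terminated with CRLF. */
  private static void send(Writer out, String command) throws IOException {
    out.write(command + "\r\n");
    out.flush();
  }

  private static void expectOk(String line) throws IOException {
    if (line == null || !line.startsWith("+OK")) {
      throw new IOException("POP3 server refused: " + line);
    }
  }
}
```

Because the method only depends on a reader and a writer, it can be tried out against a canned transcript (a StringReader and a StringWriter) before pointing it at a real server.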

Now that I'm authenticated, I can start retrieving emails. The LIST command returns a list of all of my emails. Each email has an ID (the first number on each line) which is used to retrieve and delete individual emails. Note that these IDs can change with every POP3 session, so do not consider these to be permanent IDs! For instance, if you deleted one or more emails in a previous session, then the IDs of all the emails below the deleted emails will change in your next POP3 session. The number to the right of the ID is the size of the email in bytes. With Gmail, only the first 300 or so messages are shown for some reason. If anyone knows how to get the rest, leave a comment!

LIST
+OK 305 messages (57774596 bytes)
1 2017
2 3751155
3 10873184
...
305 3021
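
If you wanted to process this listing in code, each line after the +OK line splits cleanly into an ID and a size. A minimal sketch, using a hypothetical Pop3List class and assuming the status line and the terminating "." line have already been stripped off:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration.
class Pop3List {
  /** Maps each message ID to its size in bytes, in the order the server listed them. */
  static Map<Integer, Long> parse(List<String> lines) {
    Map<Integer, Long> sizes = new LinkedHashMap<>();
    for (String line : lines) {
      String[] parts = line.trim().split("\\s+");
      sizes.put(Integer.parseInt(parts[0]), Long.parseLong(parts[1]));
    }
    return sizes;
  }
}
```

Remember that the keys of this map are only meaningful for the current session.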

The STAT command is handy to have. It shows the total number of emails (the first number) as well as the total size in bytes of all the emails combined (the second number).

STAT
+OK 305 57791921
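
The STAT response is a single line, so picking it apart is even simpler. Again, this is just a sketch with a made-up class name (Pop3Stat):

```java
// Hypothetical value class for illustration.
class Pop3Stat {
  final int count;       // total number of emails
  final long totalBytes; // combined size of all emails, in bytes

  Pop3Stat(int count, long totalBytes) {
    this.count = count;
    this.totalBytes = totalBytes;
  }

  /** Parses a line such as "+OK 305 57791921". */
  static Pop3Stat parse(String line) {
    String[] parts = line.trim().split("\\s+");
    if (parts.length < 3 || !parts[0].equals("+OK")) {
      throw new IllegalArgumentException("Unexpected STAT response: " + line);
    }
    return new Pop3Stat(Integer.parseInt(parts[1]), Long.parseLong(parts[2]));
  }
}
```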

The RETR command retrieves an email, including both the headers and the body. The syntax is RETR <num> where <num> is the message ID. For example, to retrieve the fifth email, I would type RETR 5.

RETR 5
+OK message follows
Date: Mon, 17 Jan 2005 17:20:17 -0500
From: Mike Angstadt <mike.angstadt@gmail.com>
To: John Doe <jdoe@yahoo.com>
Subject: Hello John
...more headers...

Dear John,

How are you doing?

-Mike
.
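
Notice that the response ends with a line containing a single period. That is how a client knows where a multi-line response stops. RFC-1939 also says that if a line of the message itself starts with a period, the server adds an extra period in front of it, which the client has to strip off again. Here is a rough sketch of reading such a response in Java (Pop3MultiLine is a hypothetical class, and it assumes the +OK line has already been read):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration.
class Pop3MultiLine {
  /** Reads lines until the terminating "." line, undoing the server's period-stuffing. */
  static List<String> readBody(BufferedReader in) throws IOException {
    List<String> lines = new ArrayList<>();
    String line;
    while ((line = in.readLine()) != null) {
      if (line.equals(".")) {
        return lines; // end of the multi-line response
      }
      if (line.startsWith(".")) {
        line = line.substring(1); // remove the extra leading period
      }
      lines.add(line);
    }
    throw new IOException("Connection closed before the terminating \".\" line");
  }
}
```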

!! Warning !!: Even though I have my Gmail POP3 access configured so that emails are not deleted when they are retrieved, it seems to be doing that anyway and I don't know why! Make sure that your email account is configured so that when you retrieve an email with POP3, it is not deleted from your inbox.

The DELE command, you guessed it, deletes an email. Like RETR, the syntax is DELE <num> where <num> is the message number. Note that the email won't actually be deleted until you terminate the POP3 session with the QUIT command. If your connection terminates unexpectedly, your emails will NOT be deleted.

DELE 1
+OK marked for deletion

If you mark an email for deletion by mistake, you can use the RSET command to undo it. This command will unmark all messages that have been marked for deletion. This means that when QUIT is sent, the emails won't be deleted.

RSET
+OK

And, as was already mentioned, the QUIT command closes the POP3 session, deleting all messages that were marked for deletion with DELE.

QUIT
+OK Farewell.
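
To summarize how DELE, RSET, and QUIT interact, here is a toy in-memory model of a mailbox. It is not how any real server is written (Pop3Mailbox and its methods are invented for this example); it only mirrors the rules described above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model for illustration only.
class Pop3Mailbox {
  private final Map<Integer, String> messages = new TreeMap<>();
  private final Set<Integer> marked = new HashSet<>();

  Pop3Mailbox(List<String> initial) {
    int num = 1; // message numbers start at 1
    for (String message : initial) {
      messages.put(num++, message);
    }
  }

  /** DELE: only marks the message; nothing is removed yet. */
  void dele(int num) {
    if (!messages.containsKey(num) || marked.contains(num)) {
      throw new IllegalArgumentException("-ERR no such message");
    }
    marked.add(num);
  }

  /** RSET: unmarks everything that was marked for deletion. */
  void rset() {
    marked.clear();
  }

  /** QUIT: the marked messages are actually removed. */
  void quit() {
    messages.keySet().removeAll(marked);
    marked.clear();
  }

  /** Connection lost without QUIT: the marks are discarded and nothing is deleted. */
  void disconnect() {
    marked.clear();
  }

  int size() {
    return messages.size();
  }
}
```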

For more information on POP3, check out its specification document: RFC-1939

Also check out the SMTP server that I wrote, which supports POP3: https://github.com/mangstadt/Sleet

Wednesday, April 11, 2012

Philly Emerging Tech Conference: Day Two

This post describes what I've learned during the second half of the Philly Emerging Technologies for the Enterprise Conference. See my previous blog post for a description of the first day. It was a great conference and I had a blast!

Keynote Address - Emerging Programming Languages

by Alex Payne

Alex started his talk by repeating the most common complaint people have about new languages--"why do we need another programming language?" His answer? Because evolution is a process that's constantly in motion--there's no way of knowing where the "jumping off point" is. As he gave this answer, a picture showing the evolution of the human skull was displayed behind him, implying that we are the result of a similar, albeit slower, kind of change (biological evolution).

When learning about new languages, which Alex does as a hobby, Alex's end goal isn't necessarily to use the language, but to learn about the language's unique features and to try to incorporate those features into his work. One language he gave as an example had a certain elegant way of working with WSDLs which compelled him to implement a similar feature in one of his projects.

Alex described around two dozen very obscure programming languages, only 2 of which I've ever heard of (Go and CoffeeScript). He divided the languages up into categories, such as "Web Development", "Dynamic Programming", and "Querying Data".

Behind the scenes of Spring Batch

by Josh Long

Spring Batch is a Spring module that makes creating batch processes more standardized and less error prone. You basically define your job in an XML file. Then, using a combination of custom Java code and classes from the Spring Batch API, you write the logic of your batch job. It streams the batch data by reading the individual entry elements from the input data one by one and then writing out the processed data in chunks (e.g. 10 entries at a time). Because of this, you don't have to worry about getting OutOfMemory errors when processing large amounts of data.
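
The read-one-at-a-time, write-in-chunks idea can be sketched in plain Java. To be clear, this is not the Spring Batch API; ChunkProcessor and its parameters are names I made up to show the general pattern.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Illustration of the chunking pattern only; not Spring Batch code.
class ChunkProcessor {
  /**
   * Reads entries one by one, processes each, and hands them to the writer
   * in chunks of at most chunkSize. Returns the number of chunks written.
   */
  static <I, O> int process(Iterator<I> reader, Function<I, O> processor,
      Consumer<List<O>> writer, int chunkSize) {
    List<O> chunk = new ArrayList<>(chunkSize);
    int chunks = 0;
    while (reader.hasNext()) {
      chunk.add(processor.apply(reader.next()));
      if (chunk.size() == chunkSize) {
        writer.accept(new ArrayList<>(chunk)); // write a full chunk
        chunk.clear();
        chunks++;
      }
    }
    if (!chunk.isEmpty()) {
      writer.accept(new ArrayList<>(chunk)); // write the final, partial chunk
      chunks++;
    }
    return chunks;
  }
}
```

Only one chunk of processed entries is held in memory at a time, which is why the size of the input doesn't matter.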

I also thought it was cool that you can schedule your job to run on a regular basis by giving it a cron expression. In addition, you can have it generate a small web app that allows you to view the status of your jobs from the browser.

One thing that took me a little by surprise is that Spring Batch requires a connection to a database. It uses this database basically for logging purposes, like keeping track of the times the job ran and recording the errors that occurred (if any) while the job was running.

Spring Batch looks like a very clean and robust way of working with batch jobs. I definitely want to look more into it.

Dependency Injection Without the Gymnastics - Functional Programming Applied

by Runar Bjarnason and Tony Morris

This presentation was pretty unique in that the speaker, Tony, gave his talk via Skype from Australia! Runar was there in person and acted as the technician and the intermediary between the audience and Tony. It was about how to do dependency injection in Scala without having to resort to confusing XML files like with Spring.

The CoffeeScript Edge

by Trevor Burnham

Trevor explained some of the benefits that CoffeeScript brings to the table in this presentation. For example, following one of Douglas Crockford's words of wisdom, the Javascript code that CoffeeScript generates will never use the "==" operator. When comparing two variables in CoffeeScript, the syntax "x is y" is used, which translates to "x === y" in Javascript.

CoffeeScript also supports string interpolation, which allows you to concatenate strings using a cleaner syntax. For example:

dog = 'Spot'
x = "See #{dog}. See #{dog} run."

Another nice perk in CoffeeScript is that you don't have to separate array elements with commas if they are on separate lines. For example:

arr = [
  'One'
  'Two'
  'Three'
]

You can also use the @ operator as shorthand for the "this" keyword. For example, "@name" translates to "this.name" in Javascript.

Trevor also made an interesting point about the increasing popularity of Javascript. Due to the increased usage of Javascript on the web, all the major browser makers (Microsoft, Google, Mozilla, Apple, and Opera) have been pouring money into making the language faster on their browsers. It's quite possible that no language in the history of computing has ever received this much financial backing.

JavaScript Testing: Completing the BDD Circle in Web Development

by Trevor Lalish-Menagh

This talk focused on how to write unit tests for Javascript code. Trevor did some live coding using some pretty impressive vim-fu, showing how to unit test Javascript code using the Jasmine framework. An important concept that he discussed was "spying" on functions. I'm not sure if this is unique to Jasmine, but it allows you to determine whether a particular function was called or not, something that's very helpful in unit testing. Trevor also showed that it's possible to integrate your Javascript unit tests into a Maven build script.

Effective Scala

by Joshua Suereth

Joshua's talk focused on providing fairly advanced tips for writing good Scala code. He stressed the importance of using the Scala REPL (an interactive interpreter) during development. The REPL should be used on a regular basis to experiment with unfamiliar libraries and test out snippets of code. He also stressed the importance of staying immutable. If your objects are immutable, then it means (1) they are thread-safe and (2) they are hash-safe. He says that you should write your interfaces in Java because the bytecode of Scala interfaces doesn't convert well back to Java.

Joshua talked in depth about what's called "implicit scope". This is a special scope that basically lets you insert whatever variables you want into it. If used properly, it can be very powerful. One example Joshua gave was using implicit scope to define a collection of "Encoder" classes which convert various objects to byte arrays. It's designed so that any object can be passed into an "Encoder.encode()" method. Then, using implicit scope, the method delegates the object to the appropriate "Encoder" implementation for further processing.

Tuesday, April 10, 2012

Philly Emerging Tech Conference: Day One

Today, I attended the first half of the Philly Emerging Technologies for the Enterprise Conference down in Center City. This was a very good conference and I look forward to attending the second half tomorrow! Here's what I took in from the talks I attended.

Keynote Address - Self Engineering

by Chad Fowler

The conference started with a surprise visit from the mayor of Philadelphia, Michael Nutter(!!). Following the mayor was Chad Fowler, who talked about applying software development principles to improving your own life. One interesting thing he discussed was what's called a "QFD" (quality function deployment) graph, which is a technique for converting non-quantifiable requirements into quantifiable requirements. The example he gave was making a good cookie. Customers might say that they want a cookie that "tastes good", "has good texture", and "is cheap". These are all valid requirements, but completely non-quantifiable! What exactly makes a cookie "taste good"? More sugar? More chocolate? How much more? A QFD helps to break these requirements down into hard numbers.

Javascript, Programming Style, and Your Brain

by Douglas Crockford

This is the guy that wrote the excellent book, "JavaScript: The Good Parts", which contains insightful techniques for writing good Javascript code. He's also the author of JSLint, an online tool that helps to improve Javascript code. His talk was about Javascript and what to avoid doing when coding in the language. For instance, you should never use the "with" statement because it acts in unpredictable ways under certain circumstances. He also suggests never using the "switch" statement, since it's easy for a programmer to forget to include the "break" keyword inside of a "case" block.

Also, he says you should always put your opening curly braces on the same line to the right instead of on the next line to the left. In most languages, this issue is simply a matter of programmer taste and does not affect the actual behavior of the program. But in Javascript, there's one situation where it does have consequences:

return {
  foo:'bar'
};

return
{
  foo:'bar'
};

These two return statements both seem to do the same thing--return an inline object. But in fact, only the top example does this! The reason is that, since semicolons are optional, Javascript auto-inserts a semicolon after the return keyword in the bottom example, causing the function to exit and return undefined. It completely ignores the object that's defined below it (it won't even throw an error message). So, if you always put your curly braces on the right, you'll never have to worry about this quirk.

Java EE in the Cloud

by Gordon Dickens

In his talk, Gordon compared and contrasted a number of cloud-based JavaEE services. These services allow you to quickly deploy JavaEE web applications to the Internet and customize what kind of back-end software you want to use. For example, one cloud service he demoed lets you choose what database and web container you want to use.

In response to hearing some buzz about JavaEE 7 being "cloud ready", Gordon did a close investigation of the current source code of JavaEE 7. He couldn't find anything substantial that was really worthy of that description. He said that Oracle intends to release JavaEE 7 during the third quarter of this year no matter what, and that anything that doesn't make it into version 7 will be pushed back to version 8.

SQL? NoSQL? NewSQL?!? What's a Java developer to do?

by Chris Richardson

This talk was after lunch, so I was a little sleepy, but I did my best to pay attention. Chris compared and contrasted three next-generation databases: MongoDB, Apache Cassandra, and VoltDB.

MongoDB is a document-oriented, NoSQL database. Every record in the database is a JSON object. Queries are pretty straight-forward--just pass the database a JSON object that has what you're looking for in it. Inserting data into MongoDB is fast because you don't have to wait for a response from the server when you send it commands. However, a downside is that it doesn't support ACID (i.e. transactions) like relational databases do. It's used by a number of large companies, such as bit.ly.

Apache Cassandra is another NoSQL database. However, it is column-oriented, instead of document-oriented like MongoDB. This means that a Cassandra database is basically one big hash map. Each record has a key and a value. The key can be anything (it doesn't have to be a number) and the value can also be anything. Chris said that this database is good for logging purposes because it can quickly ingest data. Netflix and Facebook both use this database.

VoltDB is known as a NewSQL database. From what I gathered from Chris' talk, it's basically just a relational database that resides completely in memory. It writes the database to disk like once an hour or something so it can be recovered if it crashes. A downside is that the API it uses is proprietary and still a work in progress. It has limited JDBC support.

Chris gave a good piece of advice for startup companies that are having trouble deciding what kind of database to use. You might be tempted to use one of these next generation databases because, as a startup, you're starting from scratch and don't have to do any sort of migration work that an established company running a relational database would have to do. However, the advantages that NoSQL and NewSQL databases bring to the table--namely speed and scalability--aren't things you really need as a new business. Since you're a small company, you don't have very many customers, so neither speed nor scalability is really an issue. In fact, you could probably hit the ground running much faster with a relational database because its tools, software, and support are more mature.

How GitHub Works

by Scott Chacon

This talk was given by the CIO of GitHub, Scott Chacon. He described the workplace culture at GitHub.

  • Trust your employees - Your employees want to do a good job. Define what your expectations are, and they'll likely exceed them. Don't micro-manage.
  • No work hours - The traditional 9 to 5 work day is a relic from the industrial revolution long ago. Programming is largely a creative process and you can't effectively box it into a rigid schedule. If you're not being productive, then why are you at work? At GitHub, people work when they want to.
  • Headphones - If you're "in the zone" working on a programming problem, it can be hard to return to the zone after being interrupted. The rule that they have at the GitHub office is that, if you're wearing headphones, no one can interrupt you no matter what. They can send you an IM or an email, but they can't physically approach you at your desk.
  • The chat room is the office - Not all the employees are in a single building. Many are scattered all over the world, so they have a chat room that everyone uses for much of their communication.
  • Saying "No" - Scott talked about the importance of creating a culture where it's OK to say "No". This means that when people propose new ideas, their feelings aren't hurt if the idea is turned down by the team. It's important to establish this in order to encourage people to speak their minds without the fear of rejection and also to prevent bad ideas from being put into place and harming the company.

The Evolution of CSS Layout: Through CSS 3 and Beyond

by Elika J. Etemad

Elika is a member of the W3C CSS Working Group, so it was interesting to get a "behind-the-scenes" look as to how these specifications evolve. Elika started by giving a brief history of CSS and then gave a preview as to what can be expected in the future. She says that standardizing the rules for how elements are positioned on the page is the most complicated part because of all the various layout algorithms that are involved.

As to their interaction with Micro$oft, she said that they are productive, contributing members to the standardization process. Their involvement was lackluster during the long rule of IE6, but improved with the release of IE7.

Before Elika joined as a full-time employee, she was a dedicated member of the mailing list and an avid submitter of browser bugs. After several years of involvement, they offered her a job! It just goes to show that when the W3C says they are open to input from the community, they really mean it!

Sunday, April 8, 2012

Extending the DateFormat class

I'm writing an SMTP server, and one of the things that you have to do when writing an SMTP server is understand how dates in an email message are formatted. These rules are defined in RFC-5322, a document which provides details about the contents of SMTP email messages. RFC ("Request for Comments") documents are published by a standards organization known as the Internet Engineering Task Force (IETF). RFCs help to form what is essentially the "Bible" of the Internet--they lay down the rules for how many fundamental Internet technologies work. Some of these technologies include email, TCP, HTTP, and FTP.

The rules pertaining to dates are defined in two sections of RFC-5322. Section 3.3 (page 14) contains the most up-to-date specifications. This is what should be used when creating and sending new emails. The rules in Section 4.3 (page 33), on the other hand, describe the old standards which are now obsolete. These are included because an SMTP server must support them in order to maintain backwards compatibility with older SMTP servers.

To parse these dates in Java, at first I thought I could just use a single SimpleDateFormat object. But because of the complexity of the rules, that just wasn't possible. So, I created my own implementation of the DateFormat class to handle the complexity. The advantage to extending DateFormat is that it allows my code to plug nicely into the Java Date API, so I can call the parse() and format() methods just like I would with SimpleDateFormat.

import java.text.*;
import java.util.*;
import java.util.regex.*;

public class EmailDateFormat extends DateFormat {
  /**
   * The preferred format.
   */
  private final DateFormat longForm = new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss Z");

  /**
   * Day of the week is optional.
   * @see RFC-5322 p.50
   */
  private final DateFormat withoutDotw = new SimpleDateFormat("d MMM yyyy HH:mm:ss Z");

  /**
   * Seconds and day of the week are optional.
   * @see RFC-5322 p.49,50
   */
  private final DateFormat withoutDotwSeconds = new SimpleDateFormat("d MMM yyyy HH:mm Z");

  /**
   * Seconds are optional.
   * @see RFC-5322 p.49
   */
  private final DateFormat withoutSeconds = new SimpleDateFormat("EEE, d MMM yyyy HH:mm Z");

  /**
   * Determines if a date string has the day of the week.
   */
  private final Pattern dotwRegex = Pattern.compile("^[a-z]+,", Pattern.CASE_INSENSITIVE);

  /**
   * Determines if a date string has seconds.
   */
  private final Pattern secondsRegex = Pattern.compile("\\d{1,2}:\\d{2}:\\d{2}");

  /**
   * Used for fixing obsolete two-digit years.
   * @see RFC-5322, p.50
   */
  private final Pattern twoDigitYearRegex = Pattern.compile("(\\d{1,2} [a-z]{3}) (\\d{2}) ", Pattern.CASE_INSENSITIVE);

  @Override
  public StringBuffer format(Date date, StringBuffer toAppendTo, FieldPosition fieldPosition) {
    return longForm.format(date, toAppendTo, fieldPosition);
  }

  @Override
  public Date parse(String source, ParsePosition pos) {
    //fix two-digit year
    Matcher m = twoDigitYearRegex.matcher(source);
    source = m.replaceFirst("$1 19$2 ");

    //remove extra whitespace
    //see RFC-5322, p.51
    source = source.replaceAll("\\s{2,}", " "); //remove runs of multiple whitespace chars
    source = source.replaceAll(" ,", ","); //remove any spaces before the comma that comes after the day of the week
    source = source.replaceAll("\\s*:\\s*", ":"); //remove whitespace around the colons in the time

    //is the day of the week included?
    m = dotwRegex.matcher(source);
    boolean dotw = m.find();

    //are seconds included?
    m = secondsRegex.matcher(source);
    boolean seconds = m.find();

    if (dotw && seconds) {
      return longForm.parse(source, pos);
    } else if (dotw) {
      return withoutSeconds.parse(source, pos);
    } else if (seconds) {
      return withoutDotw.parse(source, pos);
    } else {
      return withoutDotwSeconds.parse(source, pos);
    }
  }
}

Looking at the source code of my EmailDateFormat class, the parse() method is designed to handle both the most recent syntax and the obsolete syntax of date strings. It basically does two things. First, it sanitizes the date string, removing unnecessary white space and converting two-digit years (which are now obsolete) to four-digit years. Second, it determines which of the many valid formats the date adheres to and then parses the date using an appropriate SimpleDateFormat object. The reason why so many SimpleDateFormat objects need to be created is that the "day of the week" and "second" parts of the date string are optional. Four separate SimpleDateFormat objects must be created to cover all possibilities because there's no way to define specific date fields as "optional" in the SimpleDateFormat class.
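
For comparison, the java.time API that arrived later (in Java 8) does let you mark parts of a pattern as optional by wrapping them in square brackets. The sketch below is not part of my EmailDateFormat class; OptionalFieldsDemo is a made-up name, and it only covers the optional day of the week and seconds, not the obsolete forms (two-digit years, named time zones like EDT, or extra whitespace).

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Illustration only; handles the current syntax, not the obsolete one.
class OptionalFieldsDemo {
  // Square brackets mark optional sections: the day of the week and the seconds.
  static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("[EEE, ]d MMM yyyy HH:mm[:ss] Z", Locale.ENGLISH);

  static OffsetDateTime parse(String text) {
    return OffsetDateTime.parse(text, FORMAT);
  }
}
```

With this single formatter, both "Sun, 8 Apr 2012 10:25:01 -0400" and "8 Apr 2012 10:25 -0400" should parse, with the seconds defaulting to zero when they are left out.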

The format() method of the EmailDateFormat class is designed so that it will always create a date string that adheres to the most up-to-date standards.

Because of this class' complexity and its loose coupling from the rest of the application, it really lends itself to unit testing. So I wrote a unit test that feeds it date strings in various formats, and confirms that it parses them correctly. The unit test also makes sure that the format() method creates a date string that contains the most up-to-date syntax.

import static org.junit.Assert.*;
import java.util.*;
import org.junit.*;

public class EmailDateFormatTest {
  @Test
  public void parse() throws Exception {
    EmailDateFormat df = new EmailDateFormat();
    Calendar c;
    Date expected, actual;

    //+ day of the week
    //- seconds
    c = Calendar.getInstance();
    c.set(2012, 3, 8, 14, 25, 0);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("-0400"));
    expected = c.getTime();
    actual = df.parse("Sun, 8 Apr 2012 10:25 -0400");
    assertEquals(expected, actual);

    //+ day of the week
    //+ seconds
    c = Calendar.getInstance();
    c.set(2012, 3, 8, 14, 25, 1);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("-0400"));
    expected = c.getTime();
    actual = df.parse("Sun, 8 Apr 2012 10:25:01 -0400");
    assertEquals(expected, actual);

    //- day of the week
    //- seconds
    c = Calendar.getInstance();
    c.set(2012, 3, 8, 14, 25, 0);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("-0400"));
    expected = c.getTime();
    actual = df.parse("8 Apr 2012 10:25 -0400");
    assertEquals(expected, actual);

    //- day of the week
    //+ seconds
    c = Calendar.getInstance();
    c.set(2012, 3, 8, 14, 25, 1);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("-0400"));
    expected = c.getTime();
    actual = df.parse("8 Apr 2012 10:25:01 -0400");
    assertEquals(expected, actual);

    //single-digit date
    c = Calendar.getInstance();
    c.set(2012, 3, 8, 14, 25, 1);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("-0400"));
    expected = c.getTime();
    actual = df.parse("Sun, 8 Apr 2012 10:25:01 -0400");
    assertEquals(expected, actual);

    //two-digit date
    c = Calendar.getInstance();
    c.set(2012, 3, 10, 14, 25, 0);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("-0400"));
    expected = c.getTime();
    actual = df.parse("Tue, 10 Apr 2012 10:25 -0400");
    assertEquals(expected, actual);

    //obsolete timezone format (see RFC-5322, p.50)
    c = Calendar.getInstance();
    c.set(2012, 3, 8, 14, 25, 1);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("-0400"));
    expected = c.getTime();
    actual = df.parse("Sun, 8 Apr 2012 10:25:01 EDT");
    assertEquals(expected, actual);

    //obsolete year format (see RFC-5322, p.50)
    c = Calendar.getInstance();
    c.set(1999, 3, 8, 14, 25, 1);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("-0400"));
    expected = c.getTime();
    actual = df.parse("8 Apr 99 10:25:01 EDT");
    assertEquals(expected, actual);

    //with extra whitespace
    c = Calendar.getInstance();
    c.set(2012, 3, 8, 14, 25, 0);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("-0400"));
    expected = c.getTime();
    actual = df.parse("Sun , 8   Apr 2012   10 :   25  -0400");
    assertEquals(expected, actual);
  }

  @Test
  public void format() throws Exception {
    EmailDateFormat df = new EmailDateFormat();

    //the long format should always be used

    //single-digit date
    Calendar c = Calendar.getInstance();
    c.set(2012, 3, 8, 14, 25, 1);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("-0400"));
    Date input = c.getTime();
    String expected = "Sun, 8 Apr 2012 10:25:01 -0400";
    String actual = df.format(input);
    assertEquals(expected, actual);

    //two-digit date
    c = Calendar.getInstance();
    c.set(2012, 3, 10, 14, 25, 1);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("-0400"));
    input = c.getTime();
    expected = "Tue, 10 Apr 2012 10:25:01 -0400";
    actual = df.format(input);
    assertEquals(expected, actual);
  }
}
Anyway, I was just proud of this, so I thought I'd share.