Sunday, October 7, 2012

Deploying to Maven Central

I've been working on developing a vCard parsing library, called ez-vcard, and have decided to upload it to the Maven Central repository. This is the main code repository that all projects configured with Maven use by default. Uploading a library to Maven Central makes it easier for other developers to use it in their own projects.

This blog post documents the steps I had to take to do this. Official instructions can be found here, which is where I got most of this information. It's a fairly complex process, so make sure you take your time and don't rush yourself! Your project is going to be released to the world, so make sure you do it right!

1. Prepare the POM file

First, you have to make sure that your POM file is ready.

a. The groupId of your project must be under a domain that you control. If your project is hosted by a code hosting service like Sourceforge, then you can prefix the groupId with the hosting service's domain. For example:

  • Sourceforge: net.sf.projectName
  • Google Code: com.googlecode.projectName
  • Github: com.github.projectName

b. The POM must contain the following information:

  • <modelVersion>
  • <groupId>
  • <artifactId>
  • <version>
  • <packaging>
  • <name>
  • <description>
  • <url>
  • <licenses>
  • <scm><url>
  • <scm><connection>
  • <developers>

c. It also must reference the "oss-parent" parent POM if you want to use the special Maven goals to deploy your project (explained in step 4 below).

<parent>
  <groupId>org.sonatype.oss</groupId>
  <artifactId>oss-parent</artifactId>
  <version>7</version>
</parent>

d. Also note that usage of <repository>s and <pluginRepository>s inside of your POM is strongly discouraged. All of your project's dependencies should exist inside of Maven Central.

See the POM of the ez-vcard project for an example of a well-formed POM file.

2. Submit a Sonatype JIRA ticket

You will need to create an account on the Sonatype JIRA website and then submit a JIRA ticket so that your project can be reviewed. Someone will verify that your groupId is valid and that your POM has all the required information. It takes approximately 2 business days to process your request (my ticket was approved the same day I submitted it).

For instructions on how to create a JIRA account and fill out a JIRA ticket, see the Sonatype OSS Maven Repository Usage Guide.

3. Create a public key

While you're waiting for your ticket to be approved, you can generate a GPG key pair. The private key is used to sign all files that you upload to Maven Central, and the public key lets other people verify those signatures. File signatures are required in order to deploy to Maven Central. A file's signature is stored as a plain text file with the ".asc" extension. Signatures are used to verify whether or not the file was uploaded by the real author.

The key pair can be created using a tool called GPG. Most Linux distributions come with this tool pre-installed. If you're on a Windows or Mac computer, you'll have to download it separately (see this page for instructions).

Generate the key

A key can be generated using the following command:

> gpg --gen-key

The command will ask you for the following information (the supplied default is fine for many of these steps):

  1. Key type
  2. Key size - A high key size means the key will be harder to crack, but it will take longer to generate the key and longer to verify signed files. A size of 2048 is good.
  3. Expiration date - The key can be set to never expire, but for extra security, an expiration date can be set. If an expiration date is set, you'll have to regenerate your key once it has expired.
  4. Your name and email - This information will be used to label your public key in the public key database.
  5. A comment - This can be left blank. It is an optional component of the string that is created from your name and email.
  6. Key password - This is optional, but strongly recommended. You will need to enter this every time you sign a file (i.e. every time you deploy to Central).

Afterward, the tool will start collecting data from various activities that are going on inside your computer, such as keyboard and mouse activity. It uses this random data to build a random number called a seed. This seed will be used to kick start a random number generator, which is used to generate your key. It takes a little bit of time, so be patient. Open a text editor and start typing, or just do normal work on your computer. This will speed up the seed generation process.

Distribute the key

The next step is to upload your public key to the Internet. People can then download your key and use it to verify the signatures that you've uploaded with your project files.

First, get the name of your public key. You will need this to run the command that distributes the key. For example, in the console snippet below, the name of the public key is "ED69FC1F".

> gpg --list-keys
pub   2048R/ED69FC1F 2012-09-28
uid                  John Doe <jdoe@hotmail.com>
sub   2048R/E8A6EAD8 2012-09-28

Then, distribute your public key to the Internet with this command:

> gpg --keyserver hkp://pool.sks-keyservers.net/ --send-keys ED69FC1F

There are many key servers on the Internet, but the one above is what I think Maven Central requires or recommends that you use.

For more information, see How To Generate PGP Signatures With Maven.

4. Deploy to Central

Once your JIRA ticket from step 2 has been approved, you can upload your project to a staging repository, from which it will be released to Maven Central. Remember that, once you deploy a release to Central, that version of your library is set in stone. You cannot re-release your library unless you deploy a new version!

Prepare the project

Before you start, make sure that you do the following:

a. Add your Sonatype JIRA credentials to your Maven settings file (located at "~/.m2/settings.xml").

<settings>
  ...
  <servers>
    <server>
      <id>sonatype-nexus-snapshots</id>
      <username>your-jira-id</username>
      <password>your-jira-pwd</password>
    </server>
    <server>
      <id>sonatype-nexus-staging</id>
      <username>your-jira-id</username>
      <password>your-jira-pwd</password>
    </server>
  </servers>
  ...
</settings>

b. Add "-SNAPSHOT" to the end of the version in the POM. Even though you are deploying a release version, the project's version in the POM file must end in "-SNAPSHOT". Maven will automatically remove this when it builds and deploys your project.

c. Commit all changes to source control. The working copy of your project must have zero uncommitted changes.

Build the project

Next, run these commands to get your project ready for uploading:

> mvn release:clean
> mvn release:prepare -Dusername=SCM_USERNAME -Dpassword=SCM_PASSWORD

The "release:prepare" goal asks you for the following:

  1. The release version of your project. For example, if the version in your POM is set to "0.4.1-SNAPSHOT", the release version should be "0.4.1".
  2. The name of the SVN tag to create. The goal will automatically create an SVN tag (or equivalent object if using a different SCM) for the release, which is why you need to provide your SCM credentials in the Maven command.
  3. The new version to assign to the project after it has been deployed. For example, if you are deploying "0.4.1", you might want the new development version to be "0.4.2-SNAPSHOT".

It then performs the following operations:

  1. Does a clean build of the project.
  2. Asks for your GPG key password to sign the built files.
  3. Changes the version and SCM URLs in your POM to reflect the release version, then commits these changes to your source control system.
  4. Creates an SVN tag (or equivalent) for the release version.
  5. Changes the version and SCM URLs in your POM to reflect the new development version that was entered above, then commits these changes to your source control system.

Upload the project

The next step is to upload your project to a staging repository. The staging repository gives you one last chance to confirm that your project is in good shape and ready to be released to the world. It also performs automated checks on your project to make sure it meets all the requirements.

> mvn release:perform

This command will:

  1. Check out the SVN tag (or equivalent) that was created with the "release:prepare" goal.
  2. Build the checked-out files.
  3. Ask for your GPG key password to sign the built files.
  4. Upload everything to a staging repository (not Maven Central yet).

Release

Now that your project is in the staging repository, you can release it to the world!

  1. Open the Nexus UI by visiting https://oss.sonatype.org/. Log in with the JIRA credentials you created in step 2.
  2. Click on "Staging Repositories" in the menu on the left.
  3. Find the staging repository for your project. Select it, then click the "Close" button. You will be asked to enter a comment describing your action. You can enter something like "Release of version 0.4.1". Closing the repository will perform some automated checks on your project. It makes sure that all files are signed, that your POM has all the required information, and that your project has source code and Javadoc JARs.
  4. If there is a problem with your project, you can click "Drop" to delete the staging repository so you can correct the mistakes, re-build, and re-stage your project.
  5. Once you've confirmed that your project is in good shape, click "Release". If the "Release" button is disabled, it means it is still performing some automated checks on your project. Wait a few seconds, then click the "Refresh" button. The "Release" should become enabled (if it is not, wait a few more seconds, then click "Refresh" again). After clicking "Release", you'll be asked again to enter a comment.
  6. Since this is the first time you are deploying your project to Central, your project must be manually reviewed to make sure everything is OK. Add a comment to the JIRA ticket that you created in step 2, saying that you have released the project. If everything is OK, then the Maven folks will configure your project to sync with Maven Central. It will appear there within 2 hours, and it will take approximately 4 hours to appear on search.maven.org. All subsequent versions you release will be automatically synced with Central and this last step will not be necessary.

Congratulations! Your project is now part of the Maven community!

For more information, see the Sonatype OSS Maven Repository Usage Guide.

Saturday, August 18, 2012

vCards

When you meet someone for the first time, whether it be a friend or a business partner, how do you exchange contact information? Maybe you send an email to each other to share your email address. Maybe you text or call each other on your cell phones to share your cell phone number. Maybe you write your number on a piece of paper. There are many ways to do this, but each of these ways is error-prone. What if you forget to include a vital piece of information, like the spelling of your last name or the URL of your website?

The idea behind the vCard standard is to provide an easy and hassle-free way for people to share their contact information electronically. A vCard is basically an electronic business card--it contains information like your name, mailing address, email address, telephone number, website, and a picture of yourself. So, when you meet someone new, instead of sending them an email with your phone number, address, website, and "darn it what else do they need to know", you can just email them your vCard.

Pretty much all email clients including GMail, Outlook, and Mail can import vCards into their address books. And since many email clients also allow you to export your contacts as a vCard, the vCard standard can act as a sort of data transmission format if you want to switch email clients. You can export all your contacts as a vCard, and then import the vCard into the other email client.

Developer's Overview

From a software engineering perspective, a vCard is just a plain text file (there is also a less popular XML format). It consists of a number of "properties", each of which has zero or more "parameters" and exactly one "value". Each property goes on its own line. Long lines are usually "folded", which means that they are split up into multiple lines. The folded lines all start with a whitespace character to show that they're part of the same line. Binary data, like photos, are encoded in base64.
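
To make the folding rule concrete, here is a minimal sketch of how folded lines could be joined back together. The class and method names are made up for this post and are not part of EZ-vCard. It assumes that a continuation line starts with exactly one space or tab that should be dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any vCard library.
final class VCardUnfolder {
  // Joins folded lines back together: a line that starts with a space or tab
  // is treated as a continuation of the line before it.
  static List<String> unfold(String text) {
    List<String> lines = new ArrayList<>();
    for (String raw : text.split("\r?\n")) {
      boolean continuation = !raw.isEmpty()
          && (raw.charAt(0) == ' ' || raw.charAt(0) == '\t');
      if (continuation && !lines.isEmpty()) {
        int last = lines.size() - 1;
        // drop the single leading whitespace character, keep the rest
        lines.set(last, lines.get(last) + raw.substring(1));
      } else if (!raw.isEmpty()) {
        lines.add(raw);
      }
    }
    return lines;
  }
}
```

Calling VCardUnfolder.unfold() on the text of a vCard returns one string per property, which is a convenient starting point for further parsing.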

Here's what my vCard looks like.

BEGIN:VCARD
VERSION:4.0
KIND:individual
SOURCE:http://mikeangstadt.name/mike-angstadt.vcf
FN:Michael Angstadt
N:Angstadt;Michael;;Mr;
NICKNAME:Mike
PHOTO;VALUE=uri:data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEAYABgAAD/4Q
 ZgAASUkqAAgAAAAEABoBBQABAAAAPgAAABsBBQABAAAARgAAACgBAwABAAAAAgAAADEB
 AATgAAAAAAAABgAAAAAQAAAGAAAAABAAAAUGFpbnQuTkVUIHYzLjUuMTAA/9sAQwACAQ
 [more base64 data]
REV;VALUE=timestamp:20120818T155230Z
EMAIL;TYPE=home:mike.angstadt@gmail.com
TEL;TYPE=cell;VALUE=uri:tel:+1 555-555-1234
TEL;TYPE=home;VALUE=uri:tel:+1 555-555-9876
URL;TYPE=home:http://mikeangstadt.name
URL;TYPE=work:http://code.google.com/p/ez-vcard
TZ;VALUE=text:America/New_York
GEO;VALUE=uri:geo:39.95,75.1667
CATEGORIES:Java software engineer,vCard expert,Nice guy
UID:urn:uuid:dd418720-c754-4631-a869-db89d02b831b
LANG:en-US
X-GENERATOR:EZ vCard v0.2.1-SNAPSHOT http://code.google.com/p/ez-vcard
END:VCARD

As you can see, it contains my name, email address, website, and other data. The telephone information is fake, but I wanted to include it just to show what telephone data looks like. The PHOTO property contains a profile picture of myself. Because this property value is so large, it has been folded into multiple lines. The PHOTO property also contains a parameter, "VALUE", which states that the property value is a URI. All vCards must start with "BEGIN:VCARD" and end with "END:VCARD". vCards must use the \r\n newline character sequence.
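
To show how a single property breaks down into a name, parameters, and a value, here is a rough sketch of a line parser. VCardProperty is a hypothetical class written for this post; the real EZ-vCard parser is more thorough. The sketch assumes the line has already been unfolded and that parameter values contain no quoted ":" or ";" characters.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical class for illustration; not the EZ-vCard API.
final class VCardProperty {
  final String name;
  final Map<String, String> parameters = new LinkedHashMap<>();
  final String value;

  // Parses one unfolded line such as "TEL;TYPE=cell;VALUE=uri:tel:+1 555-555-1234".
  VCardProperty(String line) {
    int colon = line.indexOf(':');
    if (colon < 0) {
      throw new IllegalArgumentException("Not a property line: " + line);
    }
    // everything before the first colon is the name plus its parameters
    String[] parts = line.substring(0, colon).split(";");
    name = parts[0].toUpperCase(Locale.ROOT);
    for (int i = 1; i < parts.length; i++) {
      int eq = parts[i].indexOf('=');
      if (eq < 0) {
        // a parameter without '=' is stored with an empty value
        parameters.put(parts[i].toUpperCase(Locale.ROOT), "");
      } else {
        parameters.put(parts[i].substring(0, eq).toUpperCase(Locale.ROOT),
            parts[i].substring(eq + 1));
      }
    }
    // everything after the first colon is the value, including later colons
    value = line.substring(colon + 1);
  }
}
```

For the TEL line in the comment above, name is "TEL", the parameters are TYPE=cell and VALUE=uri, and value is "tel:+1 555-555-1234".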

There are three different vCard versions: 2.1, 3.0, and 4.0. Versions 3.0 and 4.0 are RFC standards, defined in RFC 2426 and RFC 6350 respectively. Versions 2.1 and 3.0 are very similar, but version 4.0 is significantly different from the previous versions. It adds many new properties and parameters, and redefines how the values of many existing properties should be encoded.

EZ-vCard

EZ-vCard is an open source Java library that I wrote that reads and creates vCards. It supports all versions of the vCard standard. My goal was to design an API that was as easy to use as possible. For example, here's how to create a vCard file with some basic information:

VCard vcard = new VCard();

vcard.setFormattedName(new FormattedNameType("Barack Obama"));

EmailType email = new EmailType("barack.obama@whitehouse.gov");
email.addType(EmailTypeParameter.WORK);
vcard.addEmail(email);

email = new EmailType("superdude22@hotmail.com");
email.addType(EmailTypeParameter.HOME);
vcard.addEmail(email);

TelephoneType tel = new TelephoneType("(555) 123-5672");
tel.addType(TelephoneTypeParameter.CELL);
vcard.addTelephoneNumber(tel);

File file = new File("obama.vcf");
vcard.write(file);

It's also very easy to read a vCard file using EZ-vCard:

File file = new File("obama.vcf");
VCard vcard = VCard.parse(file);

Nothing says "I'm professional" like a vCard! Create your own vCard today!

Sunday, August 12, 2012

JUnit and Temporary Files

Oftentimes, an application interacts with the file system by reading from or creating files and directories. This functionality should, of course, be unit tested to ensure that it works as expected. Also, the unit tests should be self-contained, meaning that any files they read from or create should be located within the project itself and not at some location like "C:\unit-test-files". In addition, these temporary files and directories should be cleaned up when the test is done running because, well, they're temporary. And they definitely should not be committed to version control.

You could just throw the files in a location that you know is temporary, like the "target" directory if you use Maven or your operating system's temporary file directory. The problem with this is that if the files are not deleted between test runs, then it could skew your test results. No matter where you put them, they have to be cleaned up when the test is finished running.

You could write the cleanup code yourself OR you could use JUnit's TemporaryFolder class. This class takes care of cleaning up these files and directories after each test finishes running. It will always clean up the files, whether the test passes, fails, or throws an exception. It creates the temp folder when a test starts and deletes the temp folder when the test finishes. It does this for every test method in the class.
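
For comparison, here is roughly what the do-it-yourself approach could look like. This is only a sketch: ManualTempDir and its methods are made-up names, and it uses the java.nio.file classes from Java 7 rather than anything from JUnit.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper for illustration; not part of JUnit.
final class ManualTempDir {
  // Deletes a directory tree, children before parents.
  static void deleteRecursively(Path root) throws IOException {
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
        Files.delete(file);
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult postVisitDirectory(Path dir, IOException e) throws IOException {
        if (e != null) {
          throw e;
        }
        Files.delete(dir);
        return FileVisitResult.CONTINUE;
      }
    });
  }

  // Creates a temp directory, writes a file in it, and always cleans up.
  // Returns the directory path so the caller can confirm that it is gone.
  static Path runWithTempDir() throws IOException {
    Path dir = Files.createTempDirectory("unit-test");
    try {
      Path file = Files.createFile(dir.resolve("myfile.txt"));
      Files.write(file, "the data".getBytes());
      // ... the code under test would run here ...
    } finally {
      deleteRecursively(dir);
    }
    return dir;
  }
}
```

That is a fair amount of code to repeat in every test class, which is the boilerplate that TemporaryFolder saves you from writing.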

Under the covers, TemporaryFolder uses the File.createTempFile() method to create the directory, so it's storing the directory in your operating system's temp directory. It also assigns a unique name to the directory, so if the JVM crashes and TemporaryFolder does NOT clean up your files, the results of your next test run will not be skewed by the files from the previous run.

Here's a code sample demonstrating how the TemporaryFolder class works.

import java.io.File;
import java.io.IOException;

import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TemporaryFolder;

public class FileTest {
  @Rule
  public TemporaryFolder temp = new TemporaryFolder();

  @Test
  public void basicTest() throws IOException {
    //the temporary folder is created before this test method runs

    File fileWithoutName = temp.newFile();
    File fileWithName = temp.newFile("myfile.txt");

    File dirWithoutName = temp.newFolder();
    File dirWithName = temp.newFolder("myfolder");

    File fileInsideCreatedDir = new File(dirWithName, "myfile2.txt");

    //the temporary folder is deleted when this test method finishes
  }
}

A class-level instance of TemporaryFolder is created and tagged with the @Rule annotation. This annotation instructs JUnit to create the temporary folder before a test runs and delete the temporary folder after the test finishes. This field MUST be "public".

The newFile() method creates a new file within the temporary directory. If a file name is not passed into the method, then it will generate a random file name.

The newFolder() method creates a new directory within the temporary directory. As with newFile(), if a name is not passed into the method, then it will generate a random name for the directory.

Note: The zero-argument versions of newFile() and newFolder() were added fairly recently to the API. If you get compilation errors trying to use these, update your JUnit library to the latest version (4.10 at the time of this writing).

You can, of course, create files and directories within a directory that is created with newFolder(). Just pass the File object that was returned by newFolder() into the first argument of the File constructor (as demonstrated with the fileInsideCreatedDir object).

This is a wonderful little gem that takes the pain out of unit testing file system code. I wish I had known about it sooner.

Saturday, June 23, 2012

Java 7 Changes

Java 7 was released about a year ago and is the latest version of the Java language. It includes some useful tweaks to the language as well as some improvements to the API. Some of the changes are described in detail below.

Warning: Long blog post ahead! >.<

1. Language Enhancements

Diamond Operator

When using generics in previous versions of Java, you always had to define the generic types twice--once in the variable definition and once in the class constructor call.

Map<Integer, String> map = new HashMap<Integer, String>();

In Java 7, the generic types in the constructor call no longer have to be repeated:

Map<Integer, String> map = new HashMap<>();

Strings in Switch Statements

Before, switch statements only worked with integral primitive types (such as int and char), their wrapper classes, and enums. In Java 7, strings can be used as well.

switch ("two"){
  case "one":
    ...
    break;
  case "two":
    ...
    break;
  case "three":
    ...
    break;
  default:
    ...
}

Note that the String.equals() method is used behind the scenes to do the string comparison, which means that the comparison is case-sensitive.
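
Here is a small sketch of what the case-sensitivity means in practice. StringSwitchDemo is a made-up class. Normalizing the input before the switch is one possible way to get a case-insensitive match.

```java
import java.util.Locale;

final class StringSwitchDemo {
  // Maps a word to a number; returns -1 for anything unrecognized.
  static int toNumber(String word) {
    switch (word) {
      case "one":
        return 1;
      case "two":
        return 2;
      case "three":
        return 3;
      default:
        return -1;
    }
  }

  // Lower-casing the input first makes the match case-insensitive.
  static int toNumberIgnoreCase(String word) {
    return toNumber(word.toLowerCase(Locale.ROOT));
  }
}
```

Here, toNumber("two") returns 2, but toNumber("TWO") falls through to the default case and returns -1. Also keep in mind that switching on a null string throws a NullPointerException.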

The try-with-resource statement

All Java programmers know how important it is to close resources when they are done being used (such as input and output streams). Before Java 7, this was best done using a try-catch-finally block.

InputStream in = null;
OutputStream out = null;
try{
  in = ...;
  out = ...;
  ...
} catch (IOException e){
  ...
} finally{
  if (in != null){
      try{
        in.close();
      } catch (IOException e){}
  }
  if (out != null){
      try{
        out.close();
      } catch (IOException e){}
  }
}

Java 7 introduces a special try block, called try-with-resources, that automatically closes the resources for you. This reduces boiler-plate code, making your code shorter, easier to read, and easier to maintain. It also helps to eliminate bugs, since you no longer have to remember to close the resource.

try (InputStream in = ...; OutputStream out = ...){
  ...
} catch (IOException e){
  ...
}

To be used in a try-with-resources block, the resource's class must implement the AutoCloseable interface. However, classes that implement Closeable can also be auto-closed. This is because the Closeable interface was modified to extend AutoCloseable.
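
Your own classes can take part in this too. The sketch below uses a made-up Resource class that records when it is opened and closed, which also shows the order in which multiple resources are closed.

```java
import java.util.ArrayList;
import java.util.List;

final class AutoCloseDemo {
  // A made-up resource that records when it is opened and closed.
  static final class Resource implements AutoCloseable {
    private final String name;
    private final List<String> log;

    Resource(String name, List<String> log) {
      this.name = name;
      this.log = log;
      log.add("open " + name);
    }

    @Override
    public void close() {
      // dropping the "throws Exception" clause of AutoCloseable.close() is allowed
      log.add("close " + name);
    }
  }

  static List<String> run() {
    List<String> log = new ArrayList<>();
    try (Resource a = new Resource("a", log); Resource b = new Resource("b", log)) {
      log.add("body");
    }
    return log;
  }
}
```

The list returned by run() contains "open a", "open b", "body", "close b", "close a". Resources are closed in the reverse of the order in which they were declared.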

Multi-exception catch blocks

Previously, only one exception could be caught per catch block. This would sometimes lead to duplicated code:

try {
  ...
} catch (SQLException e) {
  System.out.println(e.getMessage());
  throw e;
} catch (IOException e) {
  System.out.println(e.getMessage());
  throw e;
}

Java 7 lets you put multiple exceptions in a single catch block, thus reducing code duplication:

try {
  ...
} catch (SQLException | IOException e) {
  System.out.println(e.getMessage());
  throw e;
}

Underscores in numeric literals

Sometimes, you have to hard-code a number in your Java code. If the number is long, it can be hard to read.

int i = 1000000;

How long does it take you to read this number? All those digits make my eyes hurt. Is it ten million? One million? One hundred thousand? In Java 7, underscores can be added to the number to make it more readable.

int i = 1_000_000;

Binary literals

In Java 7, you can hard-code binary values. A binary value starts with "0b" (or "0B") and is followed by a sequence of "0"s and "1"s.

int one = 0b001;
int two = 0b010;
int six = 0b110;
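
Binary literals are handy for bit flags, and they can be combined with the underscores described above. The sketch below uses made-up permission flags to show the idea.

```java
final class BinaryLiteralDemo {
  // Flag bits written as binary literals; the underscores group the digits.
  static final int READ    = 0b0000_0100;
  static final int WRITE   = 0b0000_0010;
  static final int EXECUTE = 0b0000_0001;

  // Checks whether the WRITE bit is set in the given set of flags.
  static boolean canWrite(int permissions) {
    return (permissions & WRITE) != 0;
  }
}
```

With these definitions, canWrite(READ | WRITE) is true and canWrite(READ | EXECUTE) is false.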

2. New File System API - NIO 2.0

Java 7 adds a revamped file system API called NIO 2.0. Basically, you have a Path class that represents a path on the filesystem and a Files class that allows you to perform operations on an instance of Path, like deleting or copying a file.

Java IO and NIO 2.0 comparison

To give you a feel for the changes, here are two code samples that compare the APIs of the original IO framework and the new NIO 2.0 framework.

Java IO

import java.io.*;

public class JavaIO{
  public static void main(String args[]) throws Exception {
    //define a file
    File file = new File("/home/michael/my-file.txt");

    //create readers/writers
    BufferedReader bufReader = new BufferedReader(new FileReader(file));
    BufferedWriter bufWriter = new BufferedWriter(new FileWriter(file));

    //read file data into memory
    InputStream in = null;
    byte data[];
    try{
      in = new FileInputStream(file);
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte buffer[] = new byte[4096];
      int read;
      while ((read = in.read(buffer)) != -1){
        out.write(buffer, 0, read);
      }
      data = out.toByteArray();
    } finally{
      if (in != null){
        in.close();
      }
    }

    //write a string to a file
    PrintWriter writer = null;
    try{
      writer = new PrintWriter(file);
      writer.print("the data");
    } finally{
      if (writer != null){
        writer.close();
      }
    }

    //copy a file
    File target = new File("/home/michael/copy.txt");
    InputStream in2 = null;
    OutputStream out2 = null;
    try{
      in2 = new FileInputStream(file);
      out2 = new FileOutputStream(target);
      byte buffer[] = new byte[4096];
      int read;
      while ( (read = in2.read(buffer)) != -1){
        out2.write(buffer, 0, read);
      }
    } finally {
      if (in2 != null){
        in2.close();
      }
      if (out2 != null){
        out2.close();
      }
    }

    //delete a file
    boolean success = file.delete();
    if (!success){
      //the file couldn't be deleted...we don't know why!!
    }

    //create a directory
    File newDir = new File("/home/michael/mydir");
    success = newDir.mkdir();
    if (!success){
      //directory couldn't be created...we don't know why!!
    }
  }
}

NIO 2.0

import java.io.*;
import java.nio.file.*;
import java.nio.charset.*;

public class JavaNIO{
  public static void main(String args[]) throws Exception {
    //define a file
    Path file = Paths.get("/home/michael/my-file.txt");

    //create readers/writers
    BufferedReader reader = Files.newBufferedReader(file, Charset.defaultCharset());
    BufferedWriter writer = Files.newBufferedWriter(file, Charset.defaultCharset());

    //read file data into memory
    byte data[] = Files.readAllBytes(file);

    //write a string to a file
    Files.write(file, "the data".getBytes());

    //copy a file
    Path target = Paths.get("/home/michael/copy.txt");
    Files.copy(file, target);

    //delete a file
    //throws various exceptions depending on what went wrong
    Files.delete(file);

    //create a directory
    //throws various exceptions depending on what went wrong
    Path newDir = Paths.get("/home/michael/mydir");
    Files.createDirectory(newDir);
  }
}

As you can see, NIO 2.0 adds many convenience methods that remove the need to write a lot of boilerplate code. It also has more fine-grained error handling (for example, when deleting a file and creating a directory).
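
To illustrate the error handling, here is a sketch that reports why a delete failed. DeleteDemo is a made-up class, and the messages it returns are just examples. The exception classes are part of java.nio.file.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

final class DeleteDemo {
  // Tries to delete a path and reports why it failed, if it did.
  static String tryDelete(Path path) {
    try {
      Files.delete(path);
      return "deleted";
    } catch (NoSuchFileException e) {
      return "does not exist";
    } catch (DirectoryNotEmptyException e) {
      return "directory not empty";
    } catch (IOException e) {
      // permission problems and other I/O errors end up here
      return "I/O error: " + e.getMessage();
    }
  }
}
```

Compare this with File.delete(), which can only return false.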

Directory monitoring

Also added to NIO 2.0 is the ability to monitor directories for changes. This allows your application to immediately respond to events such as files being deleted, modified, or renamed.

WatchService watchService = FileSystems.getDefault().newWatchService();
Path dir = Paths.get("/home/michael");
dir.register(watchService,
  StandardWatchEventKinds.ENTRY_CREATE,
  StandardWatchEventKinds.ENTRY_DELETE,
  StandardWatchEventKinds.ENTRY_MODIFY
);

while(true){
  WatchKey key = watchService.take();
  for (WatchEvent<?> event : key.pollEvents()){
    ...
  }
  key.reset();
}

Because the WatchService.take() method blocks while waiting for the next event, you should consider running this code in a separate thread.
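
As a sketch of what could go inside the event loop, the helper below describes each event by its kind and the name of the affected file. WatchDemo is a made-up class. It uses poll() with a timeout instead of take(), which is one alternative when you don't want to block forever.

```java
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class WatchDemo {
  // Waits up to timeoutSeconds for the next batch of events and describes
  // each one as "KIND: file-name". Returns an empty list on timeout.
  static List<String> nextEvents(WatchService watcher, long timeoutSeconds)
      throws InterruptedException {
    List<String> seen = new ArrayList<>();
    WatchKey key = watcher.poll(timeoutSeconds, TimeUnit.SECONDS);
    if (key == null) {
      return seen;
    }
    for (WatchEvent<?> event : key.pollEvents()) {
      if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
        // events were lost or discarded; there is no file name to report
        continue;
      }
      // context() is the path of the file, relative to the watched directory
      seen.add(event.kind().name() + ": " + event.context());
    }
    // reset the key so that it can receive further events
    key.reset();
    return seen;
  }
}
```

For a directory registered with ENTRY_CREATE, creating a file named "hello.txt" in it should produce the entry "ENTRY_CREATE: hello.txt". How quickly the event arrives depends on the operating system.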

Creating new File Systems (ZIP files)

Another feature in NIO 2.0 is the ability to create new file systems. One use for this is interacting with ZIP files. NIO 2.0 basically treats a ZIP file as if it were a flash drive or another hard drive on your computer. You read, write, copy, and delete files on the ZIP file system as if they were ordinary files on a hard drive. The example below shows how to create a ZIP file and add a file to it.

Map<String, String> env = new HashMap<>(); 
env.put("create", "true");
URI uri = URI.create("jar:file:/home/michael/my-zip.zip");

try (FileSystem zip = FileSystems.newFileSystem(uri, env)) {
  Path externalFilePath = Paths.get("my-file.txt");
  Path zipFilePath = zip.getPath("/my-file.txt");          
  Files.copy(externalFilePath, zipFilePath);
}

3. Fork/Join Framework

New fork/join classes were added to Java's Concurrency framework. They make it easier to divide a single task into many smaller tasks so the task can be completed using multi-threading. It uses a work-stealing technique in which a worker thread will "steal" work from another worker thread if it finishes its work before the other. This helps to reduce the total amount of time it takes for a task to complete. Here's an example of how you would use fork/join to calculate the sum of an array of integers.

import java.util.concurrent.*;

public class ForkJoinDemo{
  public static void main(String args[]){
    ForkJoinPool pool = new ForkJoinPool();
    SumTask task = new SumTask(new int[]{0,1,2,3,4,5});
    Integer result = pool.invoke(task);
    System.out.println(result);
  }

  private static class SumTask extends RecursiveTask<Integer> {
    private final int[] numbers;
    private final int start, end;

    public SumTask(int[] numbers){
      this(numbers, 0, numbers.length - 1);
    }

    public SumTask(int[] numbers, int start, int end){
      this.numbers = numbers;
      this.start = start;
      this.end = end;
    }

    @Override
    public Integer compute(){
      if (start == end){
        return numbers[start];
      }
      if (end-start == 1){
        return numbers[start] + numbers[end];
      }
      int mid = (end+start)/2;
      SumTask t1 = new SumTask(numbers, start, mid);
      t1.fork();
      SumTask t2 = new SumTask(numbers, mid+1, end);
      return t2.compute() + t1.join();
    }
  } 
}

First, a ForkJoinPool object is created. Then, a task (SumTask) is instantiated and passed to the ForkJoinPool.invoke() method, which blocks until the task is complete. By default, ForkJoinPool will create one thread for each core in your computer.

4. Misc

Throwable.getSuppressed()

Java 7 adds a method called getSuppressed() to the Throwable class. This method allows the programmer to get any exceptions that were suppressed by the thrown exception.

An exception can be lost in a try-finally block or suppressed in the new try-with-resources block. In a try-finally block, if an exception is thrown from both the try and finally blocks, the exception thrown from the try block is discarded and the exception thrown from the finally block is the one that propagates.

BufferedReader reader = null;
try{
  reader = ...;
  reader.readLine(); //IOException thrown (**lost**)
} finally {
  reader.close(); //IOException thrown (this one propagates)
}

In Java 7's new try-with-resources block, it is the opposite--the exception thrown when the resource is closed is suppressed, while the exception thrown from the try block is propagated:

try (BufferedReader reader = ...){
  reader.readLine(); //IOException thrown
}
//IOException thrown when "reader" is closed (**suppressed**)

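In previous Java versions, there was no built-in way to keep hold of the secondary exception. In Java 7, try-with-resources attaches the exception thrown by close() to the exception thrown from the try block, and calling the new Throwable.getSuppressed() method on the latter returns an array of all exceptions attached this way. A plain try-finally block still loses the exception from the try block unless you attach it yourself with the new Throwable.addSuppressed() method.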
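
Here is a small sketch of the try-with-resources case. FailingResource is a made-up class whose close() method always throws, so that both the try block and the close fail.

```java
final class SuppressedDemo {
  // A made-up resource whose close() always fails.
  static final class FailingResource implements AutoCloseable {
    @Override
    public void close() {
      throw new IllegalStateException("close failed");
    }
  }

  // Returns "<message of the thrown exception> / <message of the suppressed one>".
  static String run() {
    String result = "no exception";
    try (FailingResource r = new FailingResource()) {
      throw new IllegalArgumentException("body failed");
    } catch (IllegalArgumentException e) {
      // the exception from close() is attached to the one from the try block
      Throwable[] suppressed = e.getSuppressed();
      result = e.getMessage() + " / " + suppressed[0].getMessage();
    }
    return result;
  }
}
```

Calling run() returns "body failed / close failed". The exception from the try block is the one that gets caught, and the exception from close() is available through getSuppressed().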

Shaped and Translucent Windows in Swing

You can now create non-rectangular, translucent windows in Swing. Examples can be found in the Java Swing tutorial.


These are just some of the changes that were added to Java 7. Out of all the changes I've mentioned, my favorite is the file monitoring functionality in the NIO 2.0 framework. This is completely new functionality that didn't exist before in previous Java versions. What's your favorite?? Leave a note in the comments below.

The complete Java 7 release notes can be found on Oracle's website.

Saturday, June 9, 2012

HTTP 304

The Bing homepage is a lot prettier looking than Google's. It has a slick, high-resolution background image that changes every week or so. But an image like this takes time to download and it increases the load time of the page. What if you're doing a research project and you have to visit Bing several times a day? Does your browser have to download this image over and over again?

The answer is no. When your browser downloads an image for the first time, it saves it to a cache on the hard drive. An If-Modified-Since header is then added to all subsequent requests for the image and it contains the time that the browser last downloaded the image. The server looks at this time and compares it with the time that the image was last modified on the server. If the image hasn't been modified since then, it returns an HTTP 304 response with an empty body (the image data is left out). The browser sees this status code and knows that it's OK to use the cached version of the image. This means that the image does not have to be downloaded again and the page loads more quickly. If the image has been modified since the browser last downloaded it, then a normal HTTP 200 response is returned containing the image data.
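
The decision the server makes can be sketched in a few lines of Java. ConditionalGet is a made-up class, not code from any real web server, and it uses the java.time classes (Java 8 or later) to parse the header's date.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

final class ConditionalGet {
  // Decides the status code for a GET request.
  // ifModifiedSince is the raw header value, or null when the header is absent.
  static int statusFor(String ifModifiedSince, Instant lastModified) {
    if (ifModifiedSince == null) {
      return 200;
    }
    try {
      Instant since = ZonedDateTime
          .parse(ifModifiedSince, DateTimeFormatter.RFC_1123_DATE_TIME)
          .toInstant();
      // 304 only when the resource has not changed after the given time
      return lastModified.isAfter(since) ? 200 : 304;
    } catch (DateTimeParseException e) {
      // an unparseable date is ignored, as if the header were absent
      return 200;
    }
  }
}
```

For an image last modified on June 8, 2012, a header value of "Sat, 09 Jun 2012 09:37:14 GMT" gives 304, while "Thu, 07 Jun 2012 09:37:14 GMT" gives 200.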


Using the Bing.com background image mentioned above as an example, let's try using curl to test this out. First, we'll send a request without the If-Modified-Since header. This should return a normal HTTP 200 response with the image in the response body.

Note: Curl sends the response body to stdout--because we're not interested in the actual image data, we'll just direct it to /dev/null to throw it away. The --verbose argument displays the request and response headers.

curl --verbose "http://www.bing.com/az/hprichbg?p=rb%2fTimothyGrassPollen_EN-US8441009544_1366x768.jpg" > /dev/null

Request headers:

GET /az/hprichbg?p=rb%2fTimothyGrassPollen_EN-US8441009544_1366x768.jpg HTTP/1.1
User-Agent: curl/7.21.6 (i686-pc-linux-gnu)
Host: www.bing.com

Response headers:

HTTP/1.1 200 OK
Content-Type: image/jpeg
Last-Modified: Fri, 08 Jun 2012 09:37:14 GMT
Content-Length: 155974

As expected, an HTTP 200 response was returned containing a JPEG image that's about 155KB in size. The Last-Modified header shows the date that the image was last modified on the server.


Now let's try sending a request that will cause an HTTP 304 response to be returned. As shown in the Last-Modified header from the response above, the image was last modified on the morning of June 8. Let's pretend that we last downloaded the image on June 9. Because the image hasn't changed since we downloaded it, we know that we have the most recent image, so an HTTP 304 response should be returned.

curl --verbose --header "If-Modified-Since: Sat, 09 Jun 2012 09:37:14 GMT" "http://www.bing.com/az/hprichbg?p=rb%2fTimothyGrassPollen_EN-US8441009544_1366x768.jpg" > /dev/null

Request headers:

GET /az/hprichbg?p=rb%2fTimothyGrassPollen_EN-US8441009544_1366x768.jpg HTTP/1.1
User-Agent: curl/7.21.6 (i686-pc-linux-gnu)
Host: www.bing.com
If-Modified-Since: Sat, 09 Jun 2012 09:37:14 GMT

Response headers:

HTTP/1.1 304 Not Modified
Content-Type: image/jpeg
Last-Modified: Fri, 08 Jun 2012 09:37:14 GMT

An HTTP 304 response was returned with an empty body (as shown by the lack of a Content-Length header) as expected.


Let's play pretend one more time and say that we last downloaded the image on June 7. The image was last updated on June 8, so this means that we have an outdated copy of the image and we need to download a fresh copy.

curl --verbose --header "If-Modified-Since: Thu, 07 Jun 2012 09:37:14 GMT" "http://www.bing.com/az/hprichbg?p=rb%2fTimothyGrassPollen_EN-US8441009544_1366x768.jpg" > /dev/null

Request headers:

GET /az/hprichbg?p=rb%2fTimothyGrassPollen_EN-US8441009544_1366x768.jpg HTTP/1.1
User-Agent: curl/7.21.6 (i686-pc-linux-gnu)
Host: www.bing.com
If-Modified-Since: Thu, 07 Jun 2012 09:37:14 GMT

Response headers:

HTTP/1.1 200 OK
Content-Type: image/jpeg
Last-Modified: Fri, 08 Jun 2012 09:37:14 GMT
Content-Length: 155974

As shown in the response, the server detected that our copy was out of date and sent us an HTTP 200 response with the image data in it.


So as you can see, without this caching mechanism, the web would be much slower. Your browser would have to download everything from scratch every time a page is loaded. But with caching, your browser can pull images from the cache without having to download them again.

Saturday, May 26, 2012

Summary of "Your Mouse is a Database" - May 2012 CACM

In a typical web application, when you make an Ajax call to a web service, you have to wait for the entire response to be received before you can act on it. You cannot process the data as it's coming in off the wire. The article "Your Mouse is a Database" in the May 2012 issue of CACM by Erik Meijer explores a programming API called Reactive Extensions that can asynchronously stream large or infinite amounts of data to an application (think of a stream of Twitter tweets--new tweets are always being created, so there's never an end to the data). Two key words here are "asynchronous" and "stream". Asynchronous means that the data is handled in a separate thread so the application's UI does not lock up. And stream means that the data is acted on as it's being received.

The term "data" is used very broadly--it doesn't have to be data from a web service over the Internet. The data can also be UI input from the user, such as mouse movements or characters typed into a text box. The data is push-based because it is sent to the consumer instead of the consumer having to explicitly request it.

To help explain the concept, Meijer proposes modifying Java's Future interface (an interface used to check on the status of threads that are running in the background) to handle such data.

interface Future<T> {
  Closeable get(Observer<T> callback);
}

He slims the Future<T> interface down to just one method. The get method allows a consumer to subscribe to the data stream. The return value for the method is an instance of Closeable, which allows the programmer to cancel that particular subscription if she wishes.

The Observer<T> parameter processes data from the stream asynchronously. Meijer bases the Observer<T> interface on GWT's AsyncCallback interface by giving it onFailure() and onSuccess() methods. onFailure() is called if there's an error retrieving data from the stream. onSuccess() is called when (or if) the stream ends. The third method, onNext(), defines how to handle each data item from the stream (like saving it to a database or displaying it on the screen).

interface Observer<T> {
  void onFailure(Throwable t);
  void onSuccess();
  void onNext(T value);
}

Meijer then describes a number of query operators in the Reactive Extensions library that can be used to filter this streaming data. For instance, the "where" operator lets the programmer specify the criteria for whether or not a data item should be processed. If the data item does not meet the criteria, then it is discarded. Another operator is "throttle", which prevents too much data from being processed too quickly. For example, if the throttle value is set to 2 seconds and 10 data items are pushed within a 2 second time span, it will only process the most recent item within that 2 second time span. The other 9 items will be ignored. These operators can be chained together to give the programmer strong control over how to filter a stream.
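To make the "where" operator a little more concrete, here is a minimal sketch of how such a filter could be layered on top of the Observer<T> interface shown above. This is my own illustration, not code from the article or from the Reactive Extensions library. The where() and keepEven() methods are hypothetical names, everything runs synchronously on one thread to keep it short, and it assumes Java 8 or later for the lambda syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class WhereOperatorSketch {
  // Same shape as the Observer<T> interface described in the article
  interface Observer<T> {
    void onFailure(Throwable t);
    void onSuccess();
    void onNext(T value);
  }

  // Wraps an observer so that only items matching the criteria reach it.
  // Items that do not match are silently discarded.
  static <T> Observer<T> where(Predicate<T> criteria, Observer<T> downstream) {
    return new Observer<T>() {
      @Override public void onFailure(Throwable t) { downstream.onFailure(t); }
      @Override public void onSuccess() { downstream.onSuccess(); }
      @Override public void onNext(T value) {
        if (criteria.test(value)) {
          downstream.onNext(value);
        }
      }
    };
  }

  // Pushes the given values through a "where" filter that keeps even numbers
  // and returns the values that made it through.
  static List<Integer> keepEven(int... values) {
    List<Integer> received = new ArrayList<>();
    Observer<Integer> collector = new Observer<Integer>() {
      @Override public void onFailure(Throwable t) { }
      @Override public void onSuccess() { }
      @Override public void onNext(Integer value) { received.add(value); }
    };

    Observer<Integer> filtered = where(v -> v % 2 == 0, collector);
    for (int v : values) {
      filtered.onNext(v); // the producer pushes each item to the consumer
    }
    filtered.onSuccess();
    return received;
  }

  public static void main(String[] args) {
    System.out.println(keepEven(1, 2, 3, 4));
  }
}
```

Because where() returns another Observer<T>, filters like this can be wrapped around each other, which is roughly what chaining operators amounts to.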

This idea of streaming push-based data can help developers design more memory-efficient applications. It can also help developers filter the data from larger streams, like Twitter for example, so as to not overwhelm the application with data it doesn't need.

Sunday, May 20, 2012

Summary of "Idempotence Is Not a Medical Condition" - CACM, May 2012

I wanted to read the article "Idempotence Is Not a Medical Condition" by Pat Helland in the May 2012 issue of CACM because it smelled like an article that was full of big words and fuzzy architecture abstractions. I learned some things from the article (and I really like the title), but it's mostly full of FUD. The gist of the article is: "Are you SURE that the messages you send over the network are delivered successfully? Are you REALLY sure? I mean, are you SUPER DUPER sure? After all, how can we be certain of anything in this crazy world?" By the way, SQL Server Broker eliminates this uncertainty for you.


In this article, Helland says that distributed systems today are largely composed of a collection of off-the-shelf software and cloud-based Internet services. This means that they lack a centrally-enforced communication policy. So, extra care must be taken to ensure that each message is delivered and that no message is lost, even if one of the sub-systems goes down or restarts.

This contrasts with the traditional notion of a distributed system: a cluster of tightly-coupled computers in a lab somewhere, which are specifically programmed to talk only to each other. Communication is simpler in this situation because each sub-system is essentially identical and also physically present in the same location.

Every sub-system has some sort of "plumbing" which sends and receives messages (e.g. a TCP network stack). If the plumbing is able to do things like handle duplicate messages and resend lost messages, then the system's application layer does not have to be programmed to ensure reliable communication. But if the plumbing cannot do these things, then the application layer is responsible.

A message should only be marked as "consumed" if the action that the receiving application performs on the message is successful. For example, if an application tries to save a message it receives to a database, but the database transaction fails, then the message should not be considered "consumed" because the operation that the message invoked (a database call) was not successful. The message should sit in a queue somewhere until the application is able to successfully process it or it should respond to the client saying that it failed the operation.

Helland also states that messages should be idempotent, meaning that if the same message is sent multiple times, the effect it has on the server should be the same as if just one message was sent. The term is often used in the context of HTTP--GET requests are defined as being idempotent, while POST requests are not.
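One common way to make a non-idempotent operation behave idempotently is to give every message a unique ID and have the receiver remember which IDs it has already processed. The sketch below is my own illustration of that idea and is not taken from Helland's article. The IdempotentReceiver class, its deposit() method, and the message IDs are all hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class IdempotentReceiver {
  // IDs of the messages that have already been applied
  private final Set<String> processedIds = new HashSet<>();
  private int balance = 0;

  // Applies the deposit only the first time a given message ID is seen.
  // Returns true if the deposit was applied, false if it was a duplicate.
  boolean deposit(String messageId, int amount) {
    if (!processedIds.add(messageId)) {
      return false; // duplicate delivery, ignore it
    }
    balance += amount;
    return true;
  }

  int getBalance() {
    return balance;
  }

  public static void main(String[] args) {
    IdempotentReceiver receiver = new IdempotentReceiver();
    receiver.deposit("msg-1", 50);
    receiver.deposit("msg-1", 50); // the same message, delivered again
    System.out.println(receiver.getBalance());
  }
}
```

A real system would have to store the processed IDs somewhere durable (and eventually expire them), which this sketch leaves out.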

Messages can have a preference for one of two types of behavior: "history" or "currency". History means that the messages must be delivered in order. If message N+1 arrives, but message N has not yet arrived, then it must wait until it receives message N in order to deliver message N+1 to the application. Downloading a file requires this type of communication because all of the file's data must be delivered in order or else you'll get a corrupt file. Currency means that getting the most recent message is what matters. If some messages are skipped or lost, that's OK. Getting the most recent price of a stock is one such example of this behavior type.

Helland says that the greatest moment of uncertainty during the request/response cycle is right after the request is sent and the application is waiting for a response. The assumption is that the message has been received and is being processed, but you don't know this for sure. For all you know, the receiving end could have decided to ignore the request and not send a response. It's for this reason that timeouts are needed. If the requester has not received a response within the timeout period, then it will assume the request was not processed and will stop waiting for a response.

However, there is one thing that you can know for certain, Helland says. "Each message is guaranteed to be delivered zero or more times!". Finally, something I can rely on!

Helland says that despite TCP's robustness and wide-spread use, it cannot fully be trusted to deliver messages reliably. He says that TCP "offers no guarantees once the connection is terminated or one of the processes completes or fails." I don't understand what he's trying to say. Of course it offers no guarantees once the connection is terminated. You can't send any messages over a terminated connection. And of course it offers no guarantees once one of the processes completes or fails. At that point, the TCP conversation is over.

He then goes on to say that "challenges arise when longer-lived participants [such as HTTP requests] are involved." He says that when a persistent HTTP connection is needed, the TCP connection is usually kept alive, but there's no guarantee that this will happen. Because of this, the HTTP request may have to be sent multiple times.

Helland then goes into an in-depth discussion about idempotency. He says that, technically, no idempotent request is truly idempotent because every request has some lasting effect on the server. For instance, most servers keep an access log of every request that was received. Making five identical requests, even if they are idempotent, will add five entries to the log file. Also, the performance of the server is impacted every time it receives a request. The more requests it receives, the more its performance will degrade. However, these side-effects are not related to how the actual application logic behaves, which is the true context in which the term "idempotent" should be used.

Helland states that state-ful communication is more difficult than state-less communication because all previously sent messages must be taken into consideration when processing the current message.

He also points out that there's no way of knowing whether the server that receives a request is doing the actual work to fulfill that request. The server could be forwarding the request to another server to do the actual work.

Helland says that if a dialog between two servers breaks apart in the middle of the conversation, both ends must be able to cleanly recover from this failure.

If a service is load-balanced across multiple servers, then state-ful information must be stored in such a way so that a client's state is not lost. One way to do this is to store the state information on a designated server that the other servers have access to. That way, requests from the same client can be handled by any server in the cluster.

Alternatively, the client could be assigned to a designated server in the cluster by a load-balancer when it makes its first request. This server will then be responsible for maintaining the state information for this client and handling all of the client's requests. The first request that the client makes must be idempotent because if the request is received successfully, but the response from the server is lost, then the client will assume that the request was lost and try making the request again. When the request is sent for the second time, the load-balancer may assign the request to a different server. If the request is not idempotent, then the request will be applied twice, thus tainting the server's data. Once the client receives a server response to its first request, it now knows which server in the cluster it should communicate with, which means that all subsequent messages don't have to be idempotent because there's no risk of sending the same request to multiple servers.

Helland says there are three ways to make this first client request idempotent. (1) You can send basically an empty message to the server, such as a TCP SYN message, (2) you can perform a read-only operation, or (3) you have the server queue a non-idempotent operation which will only be executed by the server once the connection has been confirmed. Approach (1) is the simplest, but approaches (2) and (3) can be seen as more efficient because they are performing a useful operation. Or, in Helland's words: "allowing the application to do useful work with the round-trip is cool."

The last point Helland makes is that the last message of a conversation cannot be guaranteed. This is because, if you were to send a response to the last message stating that you have received it, then it wouldn't be the last message! Therefore, applications must be designed so that it is not important whether the last message is received or not.

Friday, May 18, 2012

Database Migration Scripts

I recently added database migration functionality to my Sleet SMTP project. This means that, if I release a new version of the application that includes a change to the database schema, the existing databases of deployed applications will be migrated to the new schema automatically. Before, you would have had to wipe the database completely or apply the schema changes manually, so this is a big improvement.

The way it works is as follows. I created a table in the database whose sole purpose is to store the schema version of the database. This is just an integer that starts at "1" and increments every time the schema changes. The source code also contains a version number, which is the schema version that the source code is programmed to use. When Sleet starts up, it compares the version number in the database with the version number in the source code to determine if the schema is out of date.

If the schema is out of date, it runs a series of migration scripts. Each migration script contains the SQL code necessary to migrate the database from one version to the next. For example, if the latest database schema version is "4", then the application will contain three migration scripts: 1-to-2, 2-to-3, and 3-to-4. By chaining these scripts together, the database schema can be updated no matter what version it currently is. For example, if the schema version of my database is "2", it will first execute the 2-to-3 migration script and then execute the 3-to-4 migration script. If it's "3", then it will just execute the 3-to-4 script. If it's "1", then it will execute all of them. All of this is done within a database transaction, so if something goes wrong during the migration process, the database will be restored to its previous state.

The pseudo-code below shows how this is done in code.

//connect to the database
Connection db = ...
db.setAutoCommit(false);

int schemaVersion = 4;
int curSchemaVersion = //"SELECT db_schema_version FROM sleet"
if (curSchemaVersion < schemaVersion) {
  //schema is outdated, run the migration script(s)
  Statement statement = db.createStatement();
  while (curSchemaVersion < schemaVersion) {
    String script = "migrate-" + curSchemaVersion + "-" + (curSchemaVersion + 1) + ".sql";
    SQLStatementReader in = new SQLStatementReader(new InputStreamReader(getClass().getResourceAsStream(script)));
    String sql;
    while ((sql = in.readStatement()) != null) {
      statement.execute(sql);
    }
    curSchemaVersion++;
  }

  //update the version number in the database
  //"UPDATE sleet SET db_schema_version = [schemaVersion]"

  //commit the transaction
  db.commit();
}
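The script-chaining step can also be looked at on its own, without a database. The sketch below only works out which scripts would run for a given pair of version numbers. The MigrationPlan class and scriptsToRun() method are names made up for this example, and the file naming simply follows the "migrate-N-M.sql" pattern from the pseudo-code above.

```java
import java.util.ArrayList;
import java.util.List;

class MigrationPlan {
  // Lists, in order, the migration scripts needed to bring a database
  // from currentVersion up to targetVersion.
  static List<String> scriptsToRun(int currentVersion, int targetVersion) {
    List<String> scripts = new ArrayList<>();
    for (int v = currentVersion; v < targetVersion; v++) {
      scripts.add("migrate-" + v + "-" + (v + 1) + ".sql");
    }
    return scripts;
  }

  public static void main(String[] args) {
    // A database at version 2 with source code that expects version 4
    System.out.println(scriptsToRun(2, 4));
  }
}
```

If the database is already up to date, the list is empty and nothing is executed.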

Sunday, April 22, 2012

Retrieving emails with POP3

A couple weeks ago, I showed you how to send emails without an email client. In this blog post, I'm going to show you how to do the opposite--how to retrieve emails from an email server without an email client. As before, everything will be done on the command-line.

To do this, I'll be using the POP3 protocol. POP3 stands for Post Office Protocol version 3. Its purpose is to retrieve emails from an email server (like picking up mail from the post office, hence the name). The other popular email retrieval protocol, which you may have heard of, is IMAP. IMAP offers many more features, like the ability to organize emails into folders, but as a consequence, it is more complex. POP3's feature-set is limited to just retrieving and deleting messages, so it's a lot simpler.

POP3 is similar to SMTP in that client/server communication is text-based. However, POP3 is a little simpler because the responses from the server do not have a plethora of numeric status codes. In POP3, there are only two responses: success responses (which begin with +OK) and failure responses (which begin with -ERR).
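Having only two kinds of responses makes the protocol easy to handle in code. Here is a minimal sketch of how a client might classify a status line. The Pop3Response class and its method names are made up for this example and are not part of any library.

```java
class Pop3Response {
  // True if the status line is a success response
  static boolean isSuccess(String statusLine) {
    return statusLine.startsWith("+OK");
  }

  // True if the status line is a failure response
  static boolean isFailure(String statusLine) {
    return statusLine.startsWith("-ERR");
  }

  // Returns the human-readable text that follows the status indicator,
  // or an empty string if there is none.
  static String message(String statusLine) {
    int space = statusLine.indexOf(' ');
    return (space < 0) ? "" : statusLine.substring(space + 1).trim();
  }

  public static void main(String[] args) {
    String line = "+OK send PASS";
    System.out.println(isSuccess(line) + ": " + message(line));
  }
}
```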

Connecting to a POP3 server

So let me show you how it works. I will be opening a POP3 connection to my Gmail account. Gmail requires that POP3 transactions be encrypted, so I can't use telnet like I did in my previous SMTP demo. The openssl command will allow me to open a sort of "encrypted telnet" connection.

openssl s_client -connect pop.gmail.com:995 -crlf -ign_eof

For more information on this command, read its man page: man s_client

If the POP3 server is not encrypted, the telnet command will work:

telnet pop3.server.com 110

Note: Most webmail services support POP3. You should be able to find the POP3 URL of your webmail service in your webmail's configuration settings or help pages.

POP3 commands

First, you obviously must authenticate yourself. There are many ways to perform authentication in POP3, but the simplest way is by using the USER and PASS commands. These allow you to enter your username and password directly.

+OK Gpop ready for requests from 68.80.246.118 cn9pf9267980vdc.5
USER mike.angstadt
+OK send PASS
PASS secret
+OK Welcome.

Now that I'm authenticated, I can start retrieving emails. The LIST command returns a list of all of my emails. Each email has an ID (the first number on each line) which is used to retrieve and delete individual emails. Note that these IDs can change with every POP3 session, so do not consider these to be permanent IDs! For instance, if you deleted one or more emails in a previous session, then the IDs of all the emails below the deleted emails will change in your next POP3 session. The number to the right of the ID is the size of the email in bytes. With Gmail, only the first 300 or so messages are shown for some reason. If anyone knows how to get the rest, leave a comment!

LIST
+OK 305 messages (57774596 bytes)
1 2017
2 3751155
3 10873184
...
305 3021

The STAT command is handy to have. It shows the total number of emails (the first number) as well as the total size in bytes of all the emails combined (the second number).

STAT
+OK 305 57791921
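If you were writing your own client, the STAT response is simple to parse because it has a fixed layout. The sketch below is just an illustration. The StatResponse class is a made-up name and it only handles the "+OK <count> <size>" form shown above.

```java
class StatResponse {
  final int messageCount; // total number of emails
  final long totalBytes;  // combined size of all the emails

  private StatResponse(int messageCount, long totalBytes) {
    this.messageCount = messageCount;
    this.totalBytes = totalBytes;
  }

  // Parses a line such as "+OK 305 57791921"
  static StatResponse parse(String line) {
    String[] parts = line.trim().split("\\s+");
    if (parts.length < 3 || !parts[0].equals("+OK")) {
      throw new IllegalArgumentException("Not a successful STAT response: " + line);
    }
    return new StatResponse(Integer.parseInt(parts[1]), Long.parseLong(parts[2]));
  }

  public static void main(String[] args) {
    StatResponse stat = parse("+OK 305 57791921");
    System.out.println(stat.messageCount + " messages, " + stat.totalBytes + " bytes");
  }
}
```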

The RETR command retrieves an email, including both the headers and the body. The syntax is RETR <num> where <num> is the message ID. For example, to retrieve the fifth email, I would type RETR 5.

RETR 5
+OK message follows
Date: Mon, 17 Jan 2005 17:20:17 -0500
From: Mike Angstadt <mike.angstadt@gmail.com>
To: John Doe <jdoe@yahoo.com>
Subject: Hello John
...more headers...

Dear John,

How are you doing?

-Mike
.

!! Warning !!: Even though I have my Gmail POP3 access configured so that emails are not deleted when they are retrieved, it seems to be doing that anyway and I don't know why! Make sure that your email account is configured so that when you retrieve an email with POP3, it is not deleted from your inbox.

The DELE command, you guessed it, deletes an email. Like RETR, the syntax is DELE <num> where <num> is the message number. Note that the email won't actually be deleted until you terminate the POP3 session with the QUIT command. If your connection terminates unexpectedly, your emails will NOT be deleted.

DELE 1
+OK marked for deletion

If you mark an email for deletion by mistake, you can use the RSET command to undo it. This command will unmark all messages that have been marked for deletion. This means that when QUIT is sent, the emails won't be deleted.

RSET
+OK

And, as was already mentioned, the QUIT command closes the POP3 session, deleting all messages that were marked for deletion with DELE.

QUIT
+OK Farewell.

For more information on POP3, check out its specification document: RFC-1939

Also check out the SMTP server that I wrote, which supports POP3: https://github.com/mangstadt/Sleet

Wednesday, April 11, 2012

Philly Emerging Tech Conference: Day Two

This post describes what I've learned during the second half of the Philly Emerging Technologies for the Enterprise Conference. See my previous blog post for a description of the first day. It was a great conference and I had a blast!

Keynote Address - Emerging Programming Languages

by Alex Payne

Alex started his talk by repeating the most common complaint people have about new languages--"why do we need another programming language?" His answer? Because evolution is a process that's constantly in motion--there's no way of knowing where the "jumping off point" is. As he gave this answer, a picture showing the evolution of the human skull was displayed behind him, implying that we are the result of a similar, albeit slower, kind of change (biological evolution).

When learning about new languages, which Alex does as a hobby, his end goal isn't necessarily to use the language, but to learn about the language's unique features and to try to incorporate those features into his work. One language he gave as an example had a certain elegant way of working with WSDLs which compelled him to implement a similar feature in one of his projects.

Alex described around two dozen very obscure programming languages, only 2 of which I've ever heard of (Go and CoffeeScript). He divided the languages up into categories, such as "Web Development", "Dynamic Programming", and "Querying Data".

Behind the scenes of Spring Batch

by Josh Long

Spring Batch is a Spring module that makes creating batch processes more standardized and less error prone. You basically define your job in an XML file. Then, using a combination of custom Java code and classes from the Spring Batch API, you write the logic of your batch job. It streams the batch data by reading the individual entry elements from the input data one by one and then writing out the processed data in chunks (e.g. 10 entries at a time). Because of this, you don't have to worry about getting OutOfMemory errors when processing large amounts of data.
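The read-one-at-a-time, write-in-chunks idea can be shown in plain Java. To be clear, the sketch below does not use the Spring Batch API at all. The ChunkSketch class and processInChunks() method are made up for this example, and converting each item to upper case is just a stand-in for whatever processing a real job would do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ChunkSketch {
  // Reads items one at a time, processes each one, and "writes" them out
  // in chunks of chunkSize. Returns the chunks in the order they were written.
  static List<List<String>> processInChunks(Iterator<String> reader, int chunkSize) {
    List<List<String>> written = new ArrayList<>();
    List<String> chunk = new ArrayList<>();

    while (reader.hasNext()) {
      String item = reader.next();
      chunk.add(item.toUpperCase()); // stand-in for the real processing step

      if (chunk.size() == chunkSize) {
        written.add(chunk); // a real job would write to a file or database here
        chunk = new ArrayList<>();
      }
    }

    // write out whatever is left over at the end of the input
    if (!chunk.isEmpty()) {
      written.add(chunk);
    }
    return written;
  }

  public static void main(String[] args) {
    Iterator<String> reader = Arrays.asList("a", "b", "c").iterator();
    System.out.println(processInChunks(reader, 2));
  }
}
```

Only one chunk of items is held at a time while reading, which is why memory usage stays flat no matter how large the input is (the returned list exists only so the sketch has something to print).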

I also thought it was cool that you can schedule your job to run on a regular basis by giving it a cron expression. In addition, you can have it generate a small web app that allows you to view the status of your jobs from the browser.

One thing that took me a little by surprise is that Spring Batch requires a connection to a database. It uses this database basically for logging purposes, like keeping track of the times the job ran and recording the errors that occurred (if any) while the job was running.

Spring Batch looks like a very clean and robust way of working with batch jobs. I definitely want to look more into it.

Dependency Injection Without the Gymnastics - Functional Programming Applied

by Runar Bjarnason and Tony Morris

This presentation was pretty unique in that the speaker, Tony, gave his talk via Skype from Australia! Runar was there in person and acted as the technician and the intermediary between the audience and Tony. It was about how to do dependency injection in Scala without having to resort to confusing XML files like with Spring.

The CoffeeScript Edge

by Trevor Burnham

Trevor explained some of the benefits that CoffeeScript brings to the table in this presentation. For example, following one of Douglas Crockford's words of wisdom, the Javascript code that CoffeeScript generates will never use the "==" operator. When comparing two variables in CoffeeScript, the syntax "x is y" is used, which translates to "x === y" in Javascript.

CoffeeScript also supports string interpolation, which allows you to concatenate strings using a cleaner syntax. For example:

dog = 'Spot'
x = "See #{dog}. See #{dog} run."

Another nice perk in CoffeeScript is that you don't have to separate array elements with commas if they are on separate lines. For example:

arr = [
  'One'
  'Two'
  'Three'
]

You can also use the @ operator as shorthand for this.

Trevor also made an interesting point about the increasing popularity of Javascript. Due to the increased usage of Javascript on the web, all the major browser makers (Microsoft, Google, Mozilla, Apple, and Opera) have been pouring money into making the language faster on their browsers. It's quite possible that no language in the history of computing has ever received this much financial backing.

JavaScript Testing: Completing the BDD Circle in Web Development

by Trevor Lalish-Menagh

This talk focused on how to write unit tests for Javascript code. Trevor did some live coding using some pretty impressive vim-fu, showing how to unit test Javascript code using the Jasmine framework. An important concept that he discussed was "spying" on functions. I'm not sure if this is unique to Jasmine, but it allows you to determine whether a particular function was called or not, something that's very helpful in unit testing. Trevor also showed that it's possible to integrate your Javascript unit tests into a Maven build script.

Effective Scala

by Joshua Suereth

Joshua's talk focused on providing fairly advanced tips for writing good Scala code. He stressed the importance of using the Scala REPL (an interactive interpreter) during development. The REPL should be used on a regular basis to experiment with unfamiliar libraries and test out snippets of code. He also stressed the importance of staying immutable. If your objects are immutable, then it means (1) they are thread-safe and (2) they are hash-safe. He says that you should write your interfaces in Java because the bytecode of Scala interfaces doesn't convert well back to Java.

Joshua talked in depth about what's called "implicit scope". This is a special scope that basically lets you insert whatever variables you want into it. If used properly, it can be very powerful. One example Joshua gave was using implicit scope to define a collection of "Encoder" classes which convert various objects to byte arrays. It's designed so that any object can be passed into an "Encoder.encode()" method. Then, using implicit scope, the method delegates the object to the appropriate "Encoder" implementation for further processing.

Tuesday, April 10, 2012

Philly Emerging Tech Conference: Day One

Today, I attended the first half of the Philly Emerging Technologies for the Enterprise Conference down in Center City. This was a very good conference and I look forward to attending the second half tomorrow! Here's what I took in from the talks I attended.

Keynote Address - Self Engineering

by Chad Fowler

The conference started with a surprise visit from the mayor of Philadelphia, Michael Nutter(!!). Following the mayor was Chad Fowler, who talked about applying software development principles to improving your own life. One interesting thing he discussed was what's called a "QFD" (quality function deployment) graph, which is a technique for converting non-quantifiable requirements into quantifiable requirements. The example he gave was making a good cookie. Customers might say that they want a cookie that "tastes good", "has good texture", and "is cheap". These are all valid requirements, but completely non-quantifiable! What exactly makes a cookie "taste good"? More sugar? More chocolate? How much more? A QFD helps to break these requirements down into hard numbers.

Javascript, Programming Style, and Your Brain

by Douglas Crockford

This is the guy that wrote the excellent book, "JavaScript: The Good Parts", which contains insightful techniques for writing good Javascript code. He's also the author of JSLint, an online tool that helps to improve Javascript code. His talk was about Javascript and what to avoid doing when coding in the language. For instance, you should never use the "with" statement because it acts in unpredictable ways under certain circumstances. He also suggests never using the "switch" statement, since it's easy for a programmer to forget to include the "break" keyword inside of a "case" block.

Also, he says you should always put your opening curly braces on the same line to the right instead of on the next line to the left. In most languages, this issue is simply a matter of programmer taste and does not affect the actual behavior of the program. But in Javascript, there's one situation where it does have consequences:

return {
  foo:'bar'
};

return
{
  foo:'bar'
};

These two return statements both seem to do the same thing--return an inline object. But in fact, only the top example does this! The reason is that, since semi-colons are optional, Javascript auto-inserts a semicolon after the return keyword in the bottom example, causing it to exit the function and return nothing. It completely ignores the object that's defined below it (it won't even throw an error message). So, if you always put your curly braces on the right, you'll never have to worry about this quirk.

Java EE in the Cloud

by Gordon Dickens

In his talk, Gordon compared and contrasted a number of cloud-based JavaEE services. These services allow you to quickly deploy JavaEE web applications to the Internet and customize what kind of back-end software you want to use. For example, one cloud service he demoed lets you choose what database and web container you want to use.

In response to hearing some buzz about JavaEE 7 being "cloud ready", Gordon did a close investigation of the current source code of JavaEE 7. He couldn't find anything substantial that was really worthy of that description. He said that Oracle intends to release JavaEE 7 during the third quarter of this year no matter what, and that anything that doesn't make it into version 7 will be pushed back to version 8.

SQL? NoSQL? NewSQL?!? What's a Java developer to do?

by Chris Richardson

This talk was after lunch, so I was a little sleepy, but I did my best to pay attention. Chris compared and contrasted three next-generation databases: MongoDB, Apache Cassandra, and VoltDB.

MongoDB is a document-oriented, NoSQL database. Every record in the database is a JSON object. Queries are pretty straightforward--just pass the database a JSON object that has what you're looking for in it. Inserting data into MongoDB is fast because you don't have to wait for a response from the server when you send it commands. However, a downside is that it doesn't support ACID transactions like relational databases do. It's used by a number of large companies, such as bit.ly.
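
To make the "the query is just another document" idea concrete, here's a toy sketch in plain Java. It does not use the real MongoDB driver or its query language. The class name, the find() method, and the doc() helper are all made up for illustration, and documents are modeled as simple maps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class QueryByExampleSketch {
  /**
   * Returns every document that contains all of the query's key/value pairs.
   * (Hypothetical method, only meant to mimic the query-by-example idea.)
   */
  static List<Map<String, Object>> find(List<Map<String, Object>> collection, Map<String, Object> query) {
    List<Map<String, Object>> matches = new ArrayList<Map<String, Object>>();
    for (Map<String, Object> doc : collection) {
      //a document matches if it has every field/value pair the query has
      if (doc.entrySet().containsAll(query.entrySet())) {
        matches.add(doc);
      }
    }
    return matches;
  }

  /**
   * Hypothetical helper that builds a small document from alternating keys and values.
   */
  static Map<String, Object> doc(Object... keysAndValues) {
    Map<String, Object> map = new HashMap<String, Object>();
    for (int i = 0; i < keysAndValues.length; i += 2) {
      map.put((String) keysAndValues[i], keysAndValues[i + 1]);
    }
    return map;
  }

  public static void main(String[] args) {
    List<Map<String, Object>> users = new ArrayList<Map<String, Object>>();
    users.add(doc("name", "alice", "city", "Boston"));
    users.add(doc("name", "bob", "city", "Philadelphia"));

    //the "query" is itself a document that holds the fields we're looking for
    System.out.println(find(users, doc("city", "Boston")));
  }
}
```

A real document database adds indexes, operators, and nested fields on top of this, but the basic shape of a query is the same: a partial document.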

Apache Cassandra is another NoSQL database. However, it is column-oriented, instead of document-oriented like MongoDB. This means that a Cassandra database is basically one big hash map. Each record has a key and a value. The key can be anything (it doesn't have to be a number) and the value can also be anything. Chris said that this database is good for logging purposes because it can quickly ingest data. Netflix and Facebook both use this database.
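
The "one big hash map" description can be sketched roughly like this. This is a simplified in-memory model, not Cassandra's API: the class and method names are invented, and it assumes the common way of picturing a column-oriented store, where each row key points to its own sorted set of named columns.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class ColumnStoreSketch {
  //row key -> (column name -> value), with the columns kept sorted by name
  private final Map<String, SortedMap<String, String>> rows = new HashMap<String, SortedMap<String, String>>();

  /** Writes a single column value, creating the row if it doesn't exist yet. */
  void put(String rowKey, String column, String value) {
    SortedMap<String, String> columns = rows.get(rowKey);
    if (columns == null) {
      columns = new TreeMap<String, String>();
      rows.put(rowKey, columns);
    }
    columns.put(column, value);
  }

  /** Returns the value of a column, or null if the row or column is missing. */
  String get(String rowKey, String column) {
    SortedMap<String, String> columns = rows.get(rowKey);
    return (columns == null) ? null : columns.get(column);
  }

  /** Returns the name of the first (lowest-sorted) column in a row, or null if the row is missing. */
  String firstColumn(String rowKey) {
    SortedMap<String, String> columns = rows.get(rowKey);
    return (columns == null || columns.isEmpty()) ? null : columns.firstKey();
  }

  public static void main(String[] args) {
    //logging example: one row per server, one column per timestamp (made-up data)
    ColumnStoreSketch logs = new ColumnStoreSketch();
    logs.put("server1", "2012-04-08T10:25:01", "started");
    logs.put("server1", "2012-04-08T10:24:00", "booting");
    System.out.println(logs.firstColumn("server1"));
    System.out.println(logs.get("server1", "2012-04-08T10:25:01"));
  }
}
```

Writes only ever add an entry to a map, which gives a feel for why this kind of store can ingest log data quickly.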

VoltDB is known as a NewSQL database. From what I gathered from Chris' talk, it's basically just a relational database that resides completely in memory. It writes the database to disk like once an hour or something so it can be recovered if it crashes. A downside is that the API it uses is proprietary and still a work in progress. It has limited JDBC support.

Chris gave a good piece of advice for startup companies that are having trouble deciding what kind of database to use. You might be tempted to use one of these next generation databases because, as a startup, you're starting from scratch and don't have to do any sort of migration work that an established company running a relational database would have to do. However, the advantages that NoSQL and NewSQL databases bring to the table--namely speed and scalability--aren't things you really need as a new business. Since you're a small company, you don't have very many customers, so neither speed nor scalability is really an issue. In fact, you could probably hit the ground running much faster with a relational database because its tools, software, and support are more mature.

How GitHub Works

by Scott Chacon

This talk was given by the CIO of GitHub, Scott Chacon. He described the workplace culture at GitHub.

  • Trust your employees - Your employees want to do a good job. Define what your expectations are, and they'll likely exceed them. Don't micro-manage.
  • No work hours - The traditional 9 to 5 work day is a relic from the industrial revolution long ago. Programming is largely a creative process and you can't effectively box it into a rigid schedule. If you're not being productive, then why are you at work? At GitHub, people work when they want to.
  • Headphones - If you're "in the zone" working on a programming problem, it can be hard to return to the zone after being interrupted. The rule that they have at the GitHub office is that, if you're wearing headphones, no one can interrupt you no matter what. They can send you an IM or an email, but they can't physically approach you at your desk.
  • The chat room is the office - Not all the employees are in a single building. Many are scattered all over the world, so they have a chat room that everyone uses for much of their communication.
  • Saying "No" - Scott talked about the importance of creating a culture where it's OK to say "No". This means that when people propose new ideas, their feelings aren't hurt if the idea is turned down by the team. It's important to establish this in order to encourage people to speak their minds without the fear of rejection and also to prevent bad ideas from being put into place and harming the company.

The Evolution of CSS Layout: Through CSS 3 and Beyond

by Elika J. Etemad

Elika is a member of the W3C CSS Working Group, so it was interesting to get a "behind-the-scenes" look at how these specifications evolve. Elika started by giving a brief history of CSS and then gave a preview of what can be expected in the future. She says that standardizing the rules for how elements are positioned on the page is the most complicated part because of all the various layout algorithms that are involved.

As to their interaction with Micro$oft, she said that they are productive, contributing members to the standardization process. Their involvement was lackluster during the long rule of IE6, but improved with the release of IE7.

Before Elika joined as a full-time employee, she was a dedicated member of the mailing list and an avid submitter of browser bugs. After several years of involvement, they offered her a job! It just goes to show that when the W3C says they are open to input from the community, they really mean it!

Sunday, April 8, 2012

Extending the DateFormat class

I'm writing an SMTP server and one of the things that you have to do when writing an SMTP server is understand how dates in an email message are formatted. These rules are defined in RFC-5322, a document which provides details about the contents of SMTP email messages. RFC ("Request for Comments") documents are published by a standards organization known as the Internet Engineering Task Force (IETF). RFCs help to form what is essentially the "Bible" of the Internet--they lay down the rules for how many fundamental Internet technologies work. Some of these technologies include email, TCP, HTTP, and FTP.

The rules pertaining to dates are defined in two sections of RFC-5322. Section 3.3 (page 14) contains the most up-to-date specifications. This is what should be used when creating and sending new emails. The rules in Section 4.3 (page 33), on the other hand, describe the old standards which are now obsolete. These are included because an SMTP server must support them in order to maintain backwards compatibility with older SMTP servers.

To parse these dates in Java, at first I thought I could just use a single SimpleDateFormat object. But because of the complexity of the rules, that just wasn't possible. So, I created my own implementation of the DateFormat class to handle the complexity. The advantage to extending DateFormat is that it allows my code to plug nicely into the Java Date API, so I can call the parse() and format() methods just like I would with SimpleDateFormat.

import java.text.*;
import java.util.*;
import java.util.regex.*;

public class EmailDateFormat extends DateFormat {
  /**
   * The preferred format.
   */
  private final DateFormat longForm = new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss Z");

  /**
   * Day of the week is optional.
   * @see RFC-5322 p.50
   */
  private final DateFormat withoutDotw = new SimpleDateFormat("d MMM yyyy HH:mm:ss Z");

  /**
   * Seconds and day of the week are optional.
   * @see RFC-5322 p.49,50
   */
  private final DateFormat withoutDotwSeconds = new SimpleDateFormat("d MMM yyyy HH:mm Z");

  /**
   * Seconds are optional.
   * @see RFC-5322 p.49
   */
  private final DateFormat withoutSeconds = new SimpleDateFormat("EEE, d MMM yyyy HH:mm Z");

  /**
   * Determines if a date string has the day of the week.
   */
  private final Pattern dotwRegex = Pattern.compile("^[a-z]+,", Pattern.CASE_INSENSITIVE);

  /**
   * Determines if a date string has seconds.
   */
  private final Pattern secondsRegex = Pattern.compile("\\d{1,2}:\\d{2}:\\d{2}");

  /**
   * Used for fixing obsolete two-digit years.
   * @see RFC-5322, p.50
   */
  private final Pattern twoDigitYearRegex = Pattern.compile("(\\d{1,2} [a-z]{3}) (\\d{2}) ", Pattern.CASE_INSENSITIVE);

  @Override
  public StringBuffer format(Date date, StringBuffer toAppendTo, FieldPosition fieldPosition) {
    return longForm.format(date, toAppendTo, fieldPosition);
  }

  @Override
  public Date parse(String source, ParsePosition pos) {
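    //fix two-digit year (00-49 means 20xx, 50-99 means 19xx)
    //see RFC-5322, section 4.3
    Matcher m = twoDigitYearRegex.matcher(source);
    if (m.find()) {
      String century = (Integer.parseInt(m.group(2)) < 50) ? "20" : "19";
      source = m.replaceFirst("$1 " + century + "$2 ");
    }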

    //remove extra whitespace
    //see RFC-5322, p.51
    source = source.replaceAll("\\s{2,}", " "); //collapse runs of multiple whitespace chars into a single space
    source = source.replaceAll(" ,", ","); //remove any spaces before the comma that comes after the day of the week
    source = source.replaceAll("\\s*:\\s*", ":"); //remove whitespace around the colons in the time

    //is the day of the week included?
    m = dotwRegex.matcher(source);
    boolean dotw = m.find();

    //are seconds included?
    m = secondsRegex.matcher(source);
    boolean seconds = m.find();

    if (dotw && seconds) {
      return longForm.parse(source, pos);
    } else if (dotw) {
      return withoutSeconds.parse(source, pos);
    } else if (seconds) {
      return withoutDotw.parse(source, pos);
    } else {
      return withoutDotwSeconds.parse(source, pos);
    }
  }
}

Looking at the source code of my EmailDateFormat class, the parse() method is designed to handle both the most recent syntax and the obsolete syntax of date strings. It basically does two things. First, it sanitizes the date string, removing unnecessary white space and converting two-digit years (which are now obsolete) to four-digit years. Second, it determines which of the many valid formats the date adheres to and then parses the date using an appropriate SimpleDateFormat object. The reason why so many SimpleDateFormat objects need to be created is that the "day of the week" and "second" parts of the date string are optional. Four separate SimpleDateFormat objects must be created to cover all possibilities because there's no way to define specific date fields as "optional" in the SimpleDateFormat class.
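
For comparison, the newer java.time API (Java 8 and later, so not something the class above relies on) does let you mark parts of a pattern as optional by wrapping them in square brackets. The sketch below is only an illustration of that feature, with a made-up class name. It covers the optional day of the week and seconds, but not the obsolete syntax (two-digit years, named time zones, extra whitespace) that parse() cleans up.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class OptionalFieldsSketch {
  //square brackets mark optional sections: the day of the week and the seconds
  static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("[EEE, ]d MMM yyyy HH:mm[:ss] Z", Locale.ENGLISH);

  /** Parses a date string in any of the four current (non-obsolete) layouts. */
  static OffsetDateTime parse(String text) {
    return OffsetDateTime.parse(text, FORMAT);
  }

  public static void main(String[] args) {
    //all four combinations go through the same formatter
    System.out.println(parse("Sun, 8 Apr 2012 10:25:01 -0400"));
    System.out.println(parse("Sun, 8 Apr 2012 10:25 -0400"));
    System.out.println(parse("8 Apr 2012 10:25:01 -0400"));
    System.out.println(parse("8 Apr 2012 10:25 -0400"));
  }
}
```

When the seconds are left out, they default to zero, so the second and fourth strings parse to the same instant.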

The format() method of the EmailDateFormat class is designed so that it will always create a date string that adheres to the most up-to-date standards.
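
One thing to keep in mind is that a SimpleDateFormat created without any extra arguments uses the JVM's default locale and time zone, so the same Date can come out looking different on different machines. Here's a small sketch of how the long format could be pinned to a specific locale and time zone. The class and method names are made up, and this is not how the class above is written.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

class PinnedFormatSketch {
  /** Formats a date in the long form, using an explicit locale and time zone. */
  static String format(Date date, TimeZone zone) {
    //Locale.US keeps the day and month names in English
    SimpleDateFormat df = new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss Z", Locale.US);
    df.setTimeZone(zone);
    return df.format(date);
  }

  public static void main(String[] args) {
    //April 8, 2012 at 14:25:01 GMT
    Calendar c = new GregorianCalendar(TimeZone.getTimeZone("GMT"));
    c.clear();
    c.set(2012, Calendar.APRIL, 8, 14, 25, 1);

    //prints "Sun, 8 Apr 2012 10:25:01 -0400"
    System.out.println(format(c.getTime(), TimeZone.getTimeZone("GMT-04:00")));
  }
}
```

Note that custom time zone IDs have to look like "GMT-04:00". An ID that TimeZone.getTimeZone() doesn't understand quietly falls back to GMT.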

Because of this class' complexity and its loose coupling from the rest of the application, it really lends itself to unit testing. So I wrote a unit test that feeds it date strings in various formats, and confirms that it parses them correctly. The unit test also makes sure that the format() method creates a date string that contains the most up-to-date syntax.

import static org.junit.Assert.*;
import java.util.*;
import org.junit.*;

public class EmailDateFormatTest {
  @Test
  public void parse() throws Exception {
    EmailDateFormat df = new EmailDateFormat();
    Calendar c;
    Date expected, actual;

    //+ day of the week
    //- seconds
    c = Calendar.getInstance();
    c.set(2012, 3, 8, 14, 25, 0);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("GMT")); //14:25 GMT is 10:25 at -0400
    expected = c.getTime();
    actual = df.parse("Sun, 8 Apr 2012 10:25 -0400");
    assertEquals(expected, actual);

    //+ day of the week
    //+ seconds
    c = Calendar.getInstance();
    c.set(2012, 3, 8, 14, 25, 1);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("GMT"));
    expected = c.getTime();
    actual = df.parse("Sun, 8 Apr 2012 10:25:01 -0400");
    assertEquals(expected, actual);

    //- day of the week
    //- seconds
    c = Calendar.getInstance();
    c.set(2012, 3, 8, 14, 25, 0);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("GMT"));
    expected = c.getTime();
    actual = df.parse("8 Apr 2012 10:25 -0400");
    assertEquals(expected, actual);

    //- day of the week
    //+ seconds
    c = Calendar.getInstance();
    c.set(2012, 3, 8, 14, 25, 1);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("GMT"));
    expected = c.getTime();
    actual = df.parse("8 Apr 2012 10:25:01 -0400");
    assertEquals(expected, actual);

    //single-digit date
    c = Calendar.getInstance();
    c.set(2012, 3, 8, 14, 25, 1);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("GMT"));
    expected = c.getTime();
    actual = df.parse("Sun, 8 Apr 2012 10:25:01 -0400");
    assertEquals(expected, actual);

    //two-digit date
    c = Calendar.getInstance();
    c.set(2012, 3, 10, 14, 25, 0);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("GMT"));
    expected = c.getTime();
    actual = df.parse("Tue, 10 Apr 2012 10:25 -0400");
    assertEquals(expected, actual);

    //obsolete timezone format (see RFC-5322, p.50)
    c = Calendar.getInstance();
    c.set(2012, 3, 8, 14, 25, 1);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("GMT"));
    expected = c.getTime();
    actual = df.parse("Sun, 8 Apr 2012 10:25:01 EDT");
    assertEquals(expected, actual);

    //obsolete year format (see RFC-5322, p.50)
    c = Calendar.getInstance();
    c.set(1999, 3, 8, 14, 25, 1);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("GMT"));
    expected = c.getTime();
    actual = df.parse("8 Apr 99 10:25:01 EDT");
    assertEquals(expected, actual);

    //with extra whitespace
    c = Calendar.getInstance();
    c.set(2012, 3, 8, 14, 25, 0);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("GMT"));
    expected = c.getTime();
    actual = df.parse("Sun , 8   Apr 2012   10 :   25  -0400");
    assertEquals(expected, actual);
  }

  @Test
  public void format() throws Exception {
    EmailDateFormat df = new EmailDateFormat();

    //the long format should always be used

    //single-digit date
    Calendar c = Calendar.getInstance();
    c.set(2012, 3, 8, 14, 25, 1);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("GMT")); //14:25 GMT is 10:25 at -0400
    Date input = c.getTime();
    String expected = "Sun, 8 Apr 2012 10:25:01 -0400";
    String actual = df.format(input);
    assertEquals(expected, actual);

    //two-digit date
    c = Calendar.getInstance();
    c.set(2012, 3, 10, 14, 25, 1);
    c.set(Calendar.MILLISECOND, 0);
    c.setTimeZone(TimeZone.getTimeZone("GMT"));
    input = c.getTime();
    expected = "Tue, 10 Apr 2012 10:25:01 -0400";
    actual = df.format(input);
    assertEquals(expected, actual);
  }
}

Anyway, I was just proud of this, so I thought I'd share.