Thursday, January 26, 2012

The Java Collections Framework - Sets

The Collections Framework is a group of classes that allow you to group objects together in various ways and perform operations against those groupings. It is contained within the core Java API, so you don't have to install anything extra to use it. Its features are so basic and powerful that pretty much every piece of software written in Java makes use of this framework in one way or another. The framework has been around since Java 1.2 (1998) and has been improved upon with every subsequent Java release. There are four major collection types: Lists, Sets, Queues, and Maps.

The four basic collection types (List, Set, and Queue all inherit from the Collection interface).

Another type of collection in the Java Collections Framework is the set. A set is a collection of elements where duplicates are not allowed. If the programmer tries to add a duplicate element, the set will see that the element already exists and nothing will be added. Sets also do not allow random access like lists do. The only way to retrieve elements from a set is to iterate over all the elements using an Iterator or a foreach loop (so, for example, you cannot directly get the "fourth" element of a set).

Buckets

Sets are really fast when it comes to determining whether an element exists in the set or not. The way sets do this is by making heavy use of the Object.equals() and Object.hashCode() methods. When an object is added to the set, the set uses the object's hash to find a bucket to store the object in. Later, when the set is asked if it contains a given object, the set uses the hash of the given object to find a bucket, and then iterates over each element in that bucket. It calls the equals method on each element in the bucket to determine if the given object is the same as any of the elements in the bucket. This is a lot faster than a list because a list has to iterate over every single element in order to determine if it contains an object. Sets just have to iterate over the elements in a single bucket, which contains only a fraction of the total number of elements in the set.

Importance of hashCode()

However, the efficiency of the set's search operation relies heavily on the quality of the object's hashCode() implementation. If the implementation is good, then the set will spread all of its elements evenly amongst all buckets. However, if the implementation is not good, then the elements will not be spread evenly amongst all buckets. This will cause some buckets to be larger than others, which means they will take longer to search over and decrease the overall performance of the set.

Probably the worst possible hashCode() implementation you could create is one that returns a hard-coded value. It's a completely valid implementation, but it will cause all elements to be stored in a single bucket, making the set's search performance no better than that of a list.

public class BlogPost {
  private String content;
  private Date created;
  private List<String> comments;

  @Override
  public int hashCode(){
    //the worst-possible implementation
    //DO NOT USE!!
    return 1;
  }
}

What you want to do is create a hashCode() implementation that will generate a wide range of values. Luckily, if you use Eclipse (or probably any other IDE), you can have one generated for you. In Eclipse, right-click on the code and go to "Source > Generate hashCode() and equals()". It will prompt you for what class fields you want to include in the hash calculation and then generate the proper code automatically.

public class BlogPost {
  private String content;
  private Date created;
  private List<String> comments;

  @Override
  public int hashCode() {
    //the implementation generated by Eclipse (much better)
    final int prime = 31;
    int result = 1;
    result = prime * result + ((comments == null) ? 0 : comments.hashCode());
    result = prime * result + ((content == null) ? 0 : content.hashCode());
    result = prime * result + ((created == null) ? 0 : created.hashCode());
    return result;
  }
}

Relationship between hashCode() and equals()

Note that Java expects the equals() and hashCode() methods to follow certain rules (read this three times to let it sink in):

  1. If two objects are equal, then they must have the same hash.
  2. But if two objects have the same hash, they are not necessarily equal.

It is your responsibility as a programmer to follow these rules. If you do not, then sets (as well as other collection types from the Collections Framework) will not function correctly. The code that Eclipse generates complies with these rules.

Example

Here's a short code example of how to use a set in Java.

Set<String> users = new HashSet<String>();
users.add("Mark");
users.add("Steve");
users.add("Mark");
users.add("Kelly");
System.out.println(users.contains("mark"));
System.out.println(users.contains("Mark"));
for (String user : users){
  System.out.println(user);
}

On line 1, I create the Set object. I'm assigning the object to a variable of type Set instead of type HashSet so that I can more easily change implementations if I ever need to in the future.

On lines 2-5, I add elements to the set. On line 4, I try adding a duplicate element, but nothing will be added because sets do not allow duplicate elements. No exception will be thrown or anything--the add() method will just return "false" instead of "true".

Line 6 will print "false" because the String "mark" is not the same as the string "Mark". Line 7 will print "true".

On lines 8-10, I iterate over the entire set, printing each user name. This is the only way to retrieve elements from a set.

Set Implementations

There are three implementations of the Set interface. They differ in the way in which they iterate over their elements.

  • HashSet - The most basic set implementation, iterates over elements in no particular order.
  • LinkedHashSet - Iterates over elements in the order in which they were added to the set.
  • TreeSet - Iterates over elements in their sorted order. The way in which the elements are sorted can be defined by having the elements implement the Comparable interface, or by passing in an implementation of the Comparator interface to the TreeSet constructor.

No comments: