Serialization

1. Basic concepts

1.1 Definition

Serialization is the process of encoding an object, including the objects it refers to, as a stream of byte data such that an equal object can be reconstructed by reading from the stream (which is referred to as "deserialization").  Serialization allows saving objects in files and transmitting objects over a network.  In particular, technologies that support invoking the methods of an object on another host such as Java RMI and CORBA use a form of serialization to implement parameter passing across the network.  Serialization is also used in technologies such as Enterprise Java Beans that automatically passivate and activate server objects.

Serialization does not write class variables because they are not part of the state of the object.  It also does not transmit the object's class object (e.g., its method dictionary) because the program deserializing the stream must load that class.  We will see that Java serialization provides the ability to serialize any object without writing methods that do so (we will see what is required as we proceed).

1.2 The interface Serializable

The interface java.io.Serializable defines no methods (such interfaces are called “marker” or “tag” interfaces).  Implementing Serializable, or extending a class that implements Serializable, identifies a class as one that participates in serialization.  Its instances can be used as the argument of ObjectOutputStream.writeObject and as the result of ObjectInputStream.readObject.  If an object is encountered that is not serializable (e.g., a collection element), these methods throw NotSerializableException.
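As a minimal sketch of the marker interface in use, the following writes an object to an in-memory stream and reads an equal copy back.  The class Appointment and the helper roundTrip are hypothetical names introduced only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class MarkerInterfaceDemo {
    // Hypothetical class: declaring the marker interface is all that is required.
    static class Appointment implements Serializable {
        private static final long serialVersionUID = 1L;
        final String title;
        final int hour;

        Appointment(String title, int hour) {
            this.title = title;
            this.hour = hour;
        }
    }

    // Writes obj to an in-memory stream, then reads a copy back from the bytes.
    static Object roundTrip(Object obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Appointment copy = (Appointment) roundTrip(new Appointment("review", 14));
        System.out.println(copy.title + " at " + copy.hour);
    }
}
```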

Many library classes are serializable, including String, collection classes, wrapper classes, GUI component classes, Date, Color, Point, and URL.  Library classes that are not serializable include Thread, reflection classes (Method, etc.), stream classes, Socket, Graphics, and Image.  Generally, these are the classes that have implementations or "peers" that are system-dependent.

The object streams apply the "default serialization" mechanism described in the next section to implementors of Serializable.  (We will see how to customize serialization below.)  The mechanism stores all non-static, non-transient instance variables whose values are primitive or refer to serializable objects, including such variables inherited from serializable ancestors.  The default implementation handles shared and circular object references and class identity.  However, if an object has variables of class type that refer to objects whose classes are not serializable, the object stream methods signal NotSerializableException when attempting to write or read an instance.  Similarly, if a collection is serializable but contains objects that are not serializable, an exception is signaled.  Note that this failure is detected at run time rather than by the compiler, because serializability depends on the class of the object a variable refers to, not on the variable's declared type.  For example, an object (like all collections) may have a field of type Object, and Object itself is not serializable.  If that field refers to an instance of a serializable class, no exception occurs upon serialization.  We will see below that variables marked as transient are not serialized.  If the default mechanism is adequate (i.e., all fields are serializable and no special processing is needed), a class need only declare that it implements Serializable to be serializable.
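The run-time nature of this check can be illustrated with a small sketch.  The class Holder and the method trySerialize are hypothetical; the point is only that the same declared field type (Object) succeeds or fails depending on the object it refers to.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

final class RuntimeCheckDemo {
    // Hypothetical serializable class with a field whose declared type is Object.
    static class Holder implements Serializable {
        private static final long serialVersionUID = 1L;
        Object payload;

        Holder(Object payload) {
            this.payload = payload;
        }
    }

    // Attempts to serialize obj and reports what happened.
    static String trySerialize(Object obj) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(obj);
            return "ok";
        } catch (NotSerializableException ex) {
            return "not serializable: " + ex.getMessage();
        } catch (IOException ex) {
            return "I/O failure: " + ex.getMessage();
        }
    }

    public static void main(String[] args) {
        // The field refers to a String, which is serializable.
        System.out.println(trySerialize(new Holder("text")));
        // The field refers to a plain Object, which is not.
        System.out.println(trySerialize(new Holder(new Object())));
        // A serializable collection containing a non-serializable element.
        List<Object> items = new ArrayList<>();
        items.add(new Object());
        System.out.println(trySerialize(items));
    }
}
```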

1.3 Implementing serialization

Serializing an object must deal with three issues: 1) representing built-in types, 2) encoding references to other objects, and 3) maintaining type identity.  Even for an object containing built-in type fields only, this process is complex in C++ because it must deal with non-standardized sizes for built-in types, big/little-endian issues, data alignment, etc.  (For example, Sun's eXternal Data Representation (XDR) for RPC, and CORBA, handle these issues.)  In Java, both the serializer and the deserializer are Java Virtual Machines and the language fixes the sizes of the built-in types, so these complications do not arise.  In an object-oriented language, it is also necessary to serialize inherited fields.

Clearly, storing pointer values to implement references would be meaningless.  To understand serialization with references, view an object as the root of a directed graph of references to other objects.  In particular, when some object is referred to along multiple paths in that graph, deserialization must not result in multiple copies of that object.  On output, the first reference to an object stores the object's fields to the stream and creates an identifier that will be used for subsequent references to the object.  (These identifiers are assigned sequentially and are referred to here as serial numbers.)  This procedure avoids duplicating a “subobject” that has multiple references to it upon deserialization, and maintains the objects’ identities.  The inverse process occurs when deserializing an object from a stream: the first occurrence of the object defines its fields and causes a copy to be created, and subsequent references use a serial number to locate that copy.  Clearly, serialization must also detect cycles in the object graph to avoid infinite recursion.  Note also that Java supports non-static inner classes, whose instances have an implicit reference to their "enclosing object", which must be serialized.
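The following sketch checks the two properties described above: a shared object stays shared, and a cycle survives the round trip.  The class Node and the helper roundTrip are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class ObjectGraphDemo {
    // Hypothetical linked node used to build shared and circular references.
    static class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        final String label;
        Node next;

        Node(String label) {
            this.label = label;
        }
    }

    // Serializes obj to memory and deserializes a copy of the whole graph.
    static Object roundTrip(Object obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Two array slots refer to the same node.
        Node shared = new Node("shared");
        Node[] copy = (Node[]) roundTrip(new Node[] { shared, shared });
        System.out.println("still shared: " + (copy[0] == copy[1]));

        // A two-node cycle: a -> b -> a.
        Node a = new Node("a");
        a.next = new Node("b");
        a.next.next = a;
        Node aCopy = (Node) roundTrip(a);
        System.out.println("cycle kept: " + (aCopy.next.next == aCopy));
    }
}
```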

An object is more than just the field values it contains: it has a type identity.  The serialization process uses an instance of ObjectStreamClass (discussed further below) to identify an object’s class, rather than just its class name.

Java uses the facilities of the "reflection" API to perform serialization.  An object's class object, an instance of the class Class, is available via getClass.  This class object includes methods for accessing the class's ancestors and its members and their types.  In particular, the method Class.getDeclaredFields returns an array of instances of Field for all the fields the class declares, including private ones (Class.getFields returns only public fields).  Field defines the accessors getName and getType, as well as methods to obtain the value of that field in a particular object.
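As a rough illustration of the reflective calls involved (not the library's actual implementation), the sketch below lists the fields of a class that are neither static nor transient and reads one field's value.  The class Account and both helper methods are hypothetical.

```java
import java.io.Serializable;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ReflectionDemo {
    // Hypothetical class with one transient and one static field.
    static class Account implements Serializable {
        private static final long serialVersionUID = 1L;
        private String owner = "pat";
        private int balance = 42;
        private transient int cachedHash;
    }

    // Returns "name:type" for each declared field that is neither static nor transient.
    static List<String> serializableFieldNames(Class<?> cls) {
        List<String> names = new ArrayList<>();
        for (Field f : cls.getDeclaredFields()) {
            int mods = f.getModifiers();
            if (Modifier.isStatic(mods) || Modifier.isTransient(mods) || f.isSynthetic()) {
                continue;
            }
            names.add(f.getName() + ":" + f.getType().getSimpleName());
        }
        Collections.sort(names);  // getDeclaredFields guarantees no particular order
        return names;
    }

    // Reads the value of a (possibly private) declared field from a particular object.
    static Object readField(Object target, String name) throws ReflectiveOperationException {
        Field f = target.getClass().getDeclaredField(name);
        f.setAccessible(true);
        return f.get(target);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serializableFieldNames(Account.class));
        System.out.println(readField(new Account(), "owner"));
    }
}
```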

2. Object input/output

2.1 Object stream classes

The classes ObjectInputStream and ObjectOutputStream support reading and writing serializable objects and primitive types, and are defined in the package java.io.  Like filter streams, their constructors take the source or destination for the bytes that encode the object.  Although ObjectInputStream and ObjectOutputStream are used like filter streams, they are not subclasses of FilterInputStream and FilterOutputStream, respectively.

ObjectInputStream is a subclass of InputStream that implements the interface ObjectInput, a subinterface of DataInput (which defines methods such as readBoolean and readDouble) that adds readObject.  That is, we can also use readInt and so on with object input streams.  Like any stream method, readObject can signal IOException if the stream fails.  It can also signal several subclasses of ObjectStreamException (a subclass of IOException) such as InvalidClassException, NotSerializableException and OptionalDataException (an attempt is made to read an object when the next item in the stream is a primitive type).  Deserializing an object may require loading its class, so readObject can signal ClassNotFoundException.  Similarly, ObjectOutputStream is a subclass of OutputStream that implements the interface ObjectOutput, a subinterface of DataOutput that adds writeObject.  writeObject can signal IOException or several of its descendants such as InvalidClassException and NotSerializableException.  Instances of any serializable class can be used as the argument of ObjectOutputStream.writeObject and as the result of ObjectInputStream.readObject.
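The sketch below mixes a primitive and an object in one stream.  Reading in the order written succeeds, while calling readObject when primitive data comes next raises OptionalDataException.  The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OptionalDataException;

final class MixedStreamDemo {
    // Writes an int followed by a String to an in-memory object stream.
    static byte[] write() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeInt(3);            // DataOutput method
            out.writeObject("three");   // ObjectOutput method
        }
        return bytes.toByteArray();
    }

    // Reads the items back in the same order they were written.
    static String readInOrder(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            String word = (String) in.readObject();
            return count + ":" + word;
        }
    }

    // Tries to read an object first, although primitive data is next in the stream.
    static boolean readObjectFirstFails(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            in.readObject();
            return false;
        } catch (OptionalDataException ex) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = write();
        System.out.println(readInOrder(data));
        System.out.println(readObjectFirstFails(data));
    }
}
```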

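ObjectOutputStream defines the method reset, which discards the stream’s record of the objects already written, i.e., the stored serial numbers.  In particular, if an object output stream is reset and the client then writes an object (possibly indirectly) that has already been written, another copy is written, and that copy is used for subsequent references to the object.  The reset is recorded in the stream, so the ObjectInputStream reading it discards its own table of objects at the same point.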
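A small sketch of the effect of ObjectOutputStream.reset, assuming a hypothetical mutable class Counter: the same object is written twice, with a change in between, once without and once with a reset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class ResetDemo {
    // Hypothetical mutable class.
    static class Counter implements Serializable {
        private static final long serialVersionUID = 1L;
        int value;
    }

    // Writes the same Counter twice (value 1, then value 2) and returns the two objects read back.
    static Counter[] writeTwice(boolean resetBetweenWrites)
            throws IOException, ClassNotFoundException {
        Counter c = new Counter();
        c.value = 1;
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(c);
            c.value = 2;
            if (resetBetweenWrites) {
                out.reset();  // forget that c was already written
            }
            out.writeObject(c);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            Counter first = (Counter) in.readObject();
            Counter second = (Counter) in.readObject();
            return new Counter[] { first, second };
        }
    }

    public static void main(String[] args) throws Exception {
        Counter[] plain = writeTwice(false);
        System.out.println((plain[0] == plain[1]) + " " + plain[1].value);
        Counter[] reset = writeTwice(true);
        System.out.println((reset[0] == reset[1]) + " " + reset[1].value);
    }
}
```

Without the reset, the second write is only a back reference, so the reader sees one object holding the value from the first write; with the reset, a second copy carrying the updated value is written.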

2.2 Object output and input

Suppose that the variable appts refers to a hash map in which the keys are dates and the values are strings.  We can write the map to a file as follows:
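   // writing an object to a file (requires: import java.io.*;)
   try (ObjectOutputStream outStr =
           new ObjectOutputStream(new FileOutputStream("appointments.ser"))) {
      outStr.writeObject(appts);   // writes the map and every object it refers to
      outStr.flush();
   }  // the stream is closed automatically, even if an exception occurs
   catch (IOException ex) {
      System.out.println(ex.getMessage());
   }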
By convention, the filename extension ser is used for serialized object files.  This simple technique works because the classes HashMap, Date, and String are serializable.  Like a filter output stream, the ObjectOutputStream constructor takes the destination for the bytes.  We can “wrap” an ObjectOutputStream around a stream attached to a socket or any other destination.

To read the hash map from the file back into memory is just as simple:
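   // reading an object from a file (requires: import java.io.*; import java.util.*;)
   Map<Date, String> appts = null;
   try (ObjectInputStream inStr =
           new ObjectInputStream(new FileInputStream("appointments.ser"))) {
      // unchecked cast: the element types cannot be verified at run time
      appts = (HashMap<Date, String>) inStr.readObject();
   }
   catch (IOException | ClassNotFoundException ex) {
      // readObject also declares ClassNotFoundException, which must be handled
      System.out.println(ex.getMessage());
   }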
The cast to HashMap is necessary because the return type of readObject must be Object to accommodate all classes.  Note that if the same string object had been associated with more than one key in the original hash map written to the file, that relationship would be preserved in the object read from the file.  Like a filter input stream, the ObjectInputStream constructor takes the source of the bytes.  We can “wrap” an ObjectInputStream around a stream attached to a socket or any other source.  Note also that the constructors of a serializable class are not used for deserialization: if there are validations or calculations in a class's constructor that must be done when creating an instance, you can define a readObject method in the class, as described in the next section.

3. Writing Serializable classes

3.1 Using default serialization

To use default serialization, a class implements Serializable or extends a serializable class.  If a class's superclass is not serializable, the class can still implement Serializable if the superclass has an accessible no-argument constructor; that constructor is run during deserialization to initialize the superclass's fields, which are not written to the stream.  We will see that a class must be serializable for it to be used as the parameter or return type of a remote method.
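The sketch below illustrates the case of a non-serializable superclass, using hypothetical classes Base and Derived.  After a round trip, the field inherited from Base holds whatever Base's no-argument constructor assigns, and the constructor of Derived is not run again.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class NonSerializableParentDemo {
    // Counts how often the Derived constructor runs.
    static int derivedConstructorCalls = 0;

    // Hypothetical superclass that is NOT serializable.
    static class Base {
        String origin;

        Base() {                       // run during deserialization of a Derived
            this.origin = "default";
        }

        Base(String origin) {
            this.origin = origin;
        }
    }

    // Hypothetical serializable subclass.
    static class Derived extends Base implements Serializable {
        private static final long serialVersionUID = 1L;
        int count;

        Derived(String origin, int count) {
            super(origin);
            this.count = count;
            derivedConstructorCalls++;
        }
    }

    static Object roundTrip(Object obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Derived copy = (Derived) roundTrip(new Derived("custom", 5));
        System.out.println(copy.count + " " + copy.origin + " " + derivedConstructorCalls);
    }
}
```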

If an instance variable should not be serialized, mark it as transient.  For example, we would declare an instance variable transient if its type is not serializable, or its value depends on run-time conditions or can be computed from other information in the object.  After deserialization, a transient variable has the default value for its type.  The class library also provides more control over which fields are serialized via the "Serializable Fields API" (the serialPersistentFields declaration).  We will not discuss this facility here.
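A minimal sketch of transient fields, using a hypothetical class Session: the non-transient field survives the round trip, while the transient ones come back with the default values for their types.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class TransientDemo {
    // Hypothetical class with run-time-only state marked transient.
    static class Session implements Serializable {
        private static final long serialVersionUID = 1L;
        String user;              // serialized
        transient String token;   // skipped by default serialization
        transient int hits;       // skipped by default serialization

        Session(String user, String token, int hits) {
            this.user = user;
            this.token = token;
            this.hits = hits;
        }
    }

    static Object roundTrip(Object obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Session copy = (Session) roundTrip(new Session("pat", "abc123", 9));
        System.out.println(copy.user + " " + copy.token + " " + copy.hits);
    }
}
```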

3.2 Customizing serialization

If a class has non-serializable superclasses or instance variables, or requires more efficient serialization methods or other special processing, the class can implement its own serialization by defining the following methods:
   private void readObject(ObjectInputStream) throws IOException, ClassNotFoundException
   private void writeObject(ObjectOutputStream) throws IOException
(In fact, HashMap defines these methods so that the empty buckets are not serialized.)  The methods readObject and writeObject are invoked by ObjectInputStream and ObjectOutputStream methods, respectively.  Their implementations typically begin by calling defaultReadObject or defaultWriteObject, which take no arguments and are sent to the stream argument, to apply default serialization to the class's non-static, non-transient fields.  The methods can then transfer additional data using the DataInput and DataOutput methods (as well as read and write).  The readObject and writeObject methods for a class must read and write the additional data in the same order.

Since readObject and writeObject are private, a class cannot override or directly call its superclass's methods.  Instead, the object streams handle each serializable class in the inheritance chain separately, from the topmost serializable ancestor down to the object's own class, invoking that class's readObject or writeObject if it defines one and applying default serialization otherwise.  (The fact that these methods are private also prevents them from being declared in the interface Serializable.)  With serialization, you should use readObject and writeObject rather than DataInput.readUTF and DataOutput.writeUTF for strings, so that shared string references are preserved and strings longer than writeUTF's 65535-byte limit can be written.

Suppose we have a class User that maintains the user's password in an instance variable (as well as other information about a user), and we do not want to store the password or send it over a network without encoding it. The following example demonstrates how to customize the serialization mechanism to achieve this:
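   import java.io.IOException;
   import java.io.ObjectInputStream;
   import java.io.ObjectOutputStream;
   import java.io.Serializable;

   public class User implements Serializable {
      protected String name;
      protected transient String password;   // excluded from default serialization
      // ... other serializable instance variables ...

      private void readObject(ObjectInputStream inStr) throws IOException, ClassNotFoundException {
         inStr.defaultReadObject();                        // restore the non-transient fields
         password = decode((String) inStr.readObject());   // then read the encoded password
      }

      private void writeObject(ObjectOutputStream outStr) throws IOException {
         outStr.defaultWriteObject();                      // write the non-transient fields
         outStr.writeObject(encode(password));             // then write the encoded password
      }

      // ... other methods (including encode and decode) ...
   }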
The variable password is marked transient so that the default mechanism does not serialize its value.  The class defines readObject to call defaultReadObject to deserialize the values for all other instance variables and handle the object's class identity, and then to use its private decode method when deserializing the value for the password variable.  The writeObject method performs the corresponding operations in the same order.  Note that readObject and writeObject do not handle the exceptions that can occur, but propagate them to the caller.
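To see the pattern run end to end, here is a self-contained sketch.  The class Account and the Base64-based encode and decode methods are assumptions made for this illustration only; Base64 is a placeholder transformation, not encryption, and a real class would substitute a proper cipher.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class EncodedFieldDemo {
    // Hypothetical class following the same pattern as the User example.
    static class Account implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        transient String password;

        Account(String name, String password) {
            this.name = name;
            this.password = password;
        }

        // Placeholder encoding (assumption): Base64 only obscures the text, it is NOT encryption.
        private static String encode(String plain) {
            return Base64.getEncoder().encodeToString(plain.getBytes(StandardCharsets.UTF_8));
        }

        private static String decode(String encoded) {
            return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
        }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();            // writes name, skips the transient password
            out.writeObject(encode(password));   // extra data, written after the default fields
        }

        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();                        // reads name
            password = decode((String) in.readObject());   // extra data, read in the same order
        }
    }

    static byte[] toBytes(Object obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    static Object fromBytes(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = toBytes(new Account("pat", "s3cret"));
        String raw = new String(data, StandardCharsets.ISO_8859_1);
        System.out.println("plain password in stream: " + raw.contains("s3cret"));
        Account copy = (Account) fromBytes(data);
        System.out.println(copy.name + " " + copy.password);
    }
}
```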

As another example, suppose a class has an instance variable of type Image, which is not serializable.  The class's writeObject method can use an instance of PixelGrabber to convert the image to an int[], which can then be written to the stream with a writeInt loop (the width and height are also written using writeInt).  The readObject method reads the int[] and passes it to a MemoryImageSource constructor, and then passes that object to Component.createImage to create the image.
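The same idea can be sketched without a GUI component by substituting BufferedImage for the PixelGrabber and MemoryImageSource approach described above; this substitution, the class Picture, and the helper roundTrip are assumptions made so that the example is self-contained.  The image field is transient, and its width, height, and pixels are written and read with writeInt and readInt.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class ImageFieldDemo {
    // Hypothetical class holding an image, which default serialization cannot handle.
    static class Picture implements Serializable {
        private static final long serialVersionUID = 1L;
        String caption;
        transient BufferedImage image;

        Picture(String caption, BufferedImage image) {
            this.caption = caption;
            this.image = image;
        }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            int width = image.getWidth();
            int height = image.getHeight();
            out.writeInt(width);
            out.writeInt(height);
            for (int y = 0; y < height; y++) {
                for (int x = 0; x < width; x++) {
                    out.writeInt(image.getRGB(x, y));   // one ARGB value per pixel
                }
            }
        }

        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            int width = in.readInt();
            int height = in.readInt();
            image = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
            for (int y = 0; y < height; y++) {
                for (int x = 0; x < width; x++) {
                    image.setRGB(x, y, in.readInt());   // same order as written
                }
            }
        }
    }

    static Object roundTrip(Object obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        BufferedImage img = new BufferedImage(2, 2, BufferedImage.TYPE_INT_ARGB);
        img.setRGB(1, 0, 0xFF336699);
        Picture copy = (Picture) roundTrip(new Picture("sample", img));
        System.out.println(copy.caption + " " + Integer.toHexString(copy.image.getRGB(1, 0)));
    }
}
```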

3.3 The interface Externalizable

The designer of a class can take complete control of serializing its instances by implementing Externalizable, a sub-interface of Serializable. An externalizable class defines the following methods:
   public void readExternal(ObjectInput) throws IOException, ClassNotFoundException
   public void writeExternal(ObjectOutput) throws IOException
When an object whose class implements Externalizable is serialized, these methods are called rather than the default serialization or readObject and writeObject.  The readExternal method can use readObject and the DataInput methods, and similarly for writeExternal.  The methods must handle all the details of encoding instances in bytes and decoding them from bytes, including state information inherited from ancestors.  Unlike a class that is only Serializable, an externalizable class must have a public no-argument constructor, which is used to create the instance before readExternal is called.  If class versioning is necessary (see the next section), these methods must implement it.
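A minimal sketch of an externalizable class, assuming a hypothetical class Temperature.  It writes and reads its own fields explicitly and provides the public no-argument constructor that deserialization of an Externalizable class requires.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

final class ExternalizableDemo {
    // Hypothetical class that controls its own stream format.
    public static class Temperature implements Externalizable {
        private double celsius;
        private String label;

        public Temperature() {
            // public no-argument constructor, used when deserializing
        }

        public Temperature(double celsius, String label) {
            this.celsius = celsius;
            this.label = label;
        }

        double celsius() { return celsius; }
        String label() { return label; }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeDouble(celsius);   // DataOutput method
            out.writeObject(label);     // ObjectOutput method
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
            celsius = in.readDouble();           // same order as written
            label = (String) in.readObject();
        }
    }

    static Object roundTrip(Object obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Temperature copy = (Temperature) roundTrip(new Temperature(21.5, "office"));
        System.out.println(copy.label() + " " + copy.celsius());
    }
}
```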

3.4 Class versioning

To establish an object’s class identity, the default serialization mechanism writes a “class descriptor” that identifies the class and its version.  This descriptor is an instance of ObjectStreamClass that includes the qualified class name, the “serial version unique identifier” or “serial version UID” (by default, a 64-bit value taken from an SHA-1 hash of the class’s name, its modifiers, the interfaces it implements, and the signatures of most of its fields and of its non-private methods and constructors), the names and types of instance variables serialized by the default mechanism, and whether the class defines readObject and writeObject methods.  (That is, an ObjectStreamClass is used rather than the class name or the serialized class object.)  If, during deserialization, the serial version UID in the stream differs from that of the version of the class loaded in the recipient Virtual Machine, an InvalidClassException is signaled.  This ensures that the class used in the deserializing Virtual Machine is compatible with that used in the serializing Virtual Machine.
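The contents of a class descriptor can be inspected with ObjectStreamClass.lookup.  The sketch below, with a hypothetical class Employee and helper describe, lists the class name, the serial version UID, and the fields the default mechanism would serialize.

```java
import java.io.ObjectStreamClass;
import java.io.ObjectStreamField;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

final class DescriptorDemo {
    // Hypothetical class with one transient and one static field.
    static class Employee implements Serializable {
        private static final long serialVersionUID = 7L;
        String name;
        int level;
        transient Object cache;   // not part of the descriptor's field list
        static int created;       // class variable, not part of the object's state
    }

    // Collects a readable summary of the descriptor for a serializable class.
    static List<String> describe(Class<?> cls) {
        ObjectStreamClass desc = ObjectStreamClass.lookup(cls);
        List<String> lines = new ArrayList<>();
        lines.add("class " + desc.getName());
        lines.add("uid " + desc.getSerialVersionUID());
        for (ObjectStreamField f : desc.getFields()) {
            lines.add("field " + f.getName() + " " + f.getType().getSimpleName());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describe(Employee.class)) {
            System.out.println(line);
        }
    }
}
```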

Class versioning can result in problems when a class is under development.  If an object is written to a file and then the class is modified, the serial version UID of the class in the deserializing Virtual Machine can differ from that of the written object, preventing the object from being read.  Even if the developer has only added non-private methods or changed method names since writing an instance, that object's serial version UID will not match that of the class in the reading Virtual Machine, though the object's fields are the same.  That is, the default mechanism errs on the safe side in that essentially any change in the class definition results in a different serial version UID.  To avoid a serialization exception while a class is under development, the class defines a static final class variable of type long called serialVersionUID (declaring it private is recommended but not required).  In this case, the Virtual Machine will use that serial version UID value rather than generating one for the class as described above.  To obtain a value for the variable, use the JDK program serialver or call the method ObjectStreamClass.lookup(YourClass.class).getSerialVersionUID().
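The declaration and the lookup call mentioned above can be sketched as follows.  The classes Pinned and Computed and the helper uidOf are hypothetical; Pinned declares its own serial version UID, while Computed leaves it to be generated.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

final class SerialVersionDemo {
    // Declares its own serial version UID, so edits to methods do not change it.
    static class Pinned implements Serializable {
        private static final long serialVersionUID = 20240101L;
        int x;
    }

    // Declares no UID, so one is computed from the class definition.
    static class Computed implements Serializable {
        int x;
    }

    // Returns the serial version UID that the object streams would use for cls.
    static long uidOf(Class<?> cls) {
        ObjectStreamClass desc = ObjectStreamClass.lookup(cls);
        if (desc == null) {  // lookup returns null for classes that are not serializable
            throw new IllegalArgumentException(cls.getName() + " is not serializable");
        }
        return desc.getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println(uidOf(Pinned.class));
        System.out.println(uidOf(Computed.class));
    }
}
```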

If the serial version UID of the class in the deserializing Virtual Machine is the same as that in the stream, no version-mismatch exception is thrown.  If the object in the stream has values for instance variables that do not exist in the recipient's class, they are ignored.  If the recipient has variables (including inherited fields) that do not exist in the object written to the stream, they are initialized to the default values for their types (null for class-type variables).

In fact, it is considered good practice for classes to define serialVersionUID so that the developer can control versioning, and because the SHA computation is time-consuming.  When a new version of the class is developed that should not be compatible with earlier versions, it is given a new serial version UID.