Designing Scalable Code Generators for very large Java Projects

Model-driven development is increasingly applied in large industrial projects. When java is used as the target language, some language details have to be considered that do not make any difference in conventional implementation. It is also necessary to take them into account when deciding between code generation and runtime interpretation. In order to make the generators scalable it is important to keep their execution times low. Furthermore, they need to be designed in such a way that changes to the source models result in localized and traceable effects in the generated code. This can be achieved by respecting some basic design rules during development of the code generators.



Introduction

Model-driven development is an established approach for developing the software of embedded control units in cars. Compared to manual implementation such an approach is also beneficial for developing the software of in-car infotainment systems because it allows for more flexibility even in late project phases. Usually code generation rather than runtime interpretation is applied to execute the models because of the limited resources of the hardware and the higher that can be obtained this way. Because of its platform-independence Java is a suitable programming language that can be targeted by the code generation [1].

Because modern infotainment systems offer more and more function, the extent of generated code grows, too. However, it is not sufficient to apply code generators working in a smaller scale to larger models. Instead of that, some particularities of the Java programming language have to be considered in the design of the generated code and in the code generators in order to achieve scalability – of course, this is not limited to code generation for embedded systems.



Limitation of the .class file format

The Java .class file comes with a couple of size limit [2, Section 4.10], that does not make any difference in manual implementation but have to be considered in large scale code generation. This, for instance, limits the number of field, method parameters, and local variables. These restrictions also apply for other languages like JSP or JavaScript when they are compiled to bytecode in the .class format and executed in a JVM.

However, the most relevant restriction is the limitation of the bytecode in a single method to 64kB. The consequences of this limitation are particularly dramatic in other languages that are compiled to JVM bytecode, e.g. Jython and JRuby. If such languages use different call methodologies it will be necessary to generate a single method containing all bytecode. In such cases, the size limitation of the method is a limitation to the overall size of the application. Since 1999 a RFE and a bug [3] exist to overcome the size limit but are not solved so far.

To exemplify this issue let’s take a look at the software of infotainment systems. In this domain, event oriented architectures are very common [4]. They require methods to dispatch incoming events to handler methods by their ID (Listing 1).

int dispatch(int eventId, Object data) {
	switch (eventId)  {
		case 0: return handler.next(data);
		case 1: return handler.previous(data);
		...
		case 1422: return handler.exit(data);
		default: throw new RuntimeException("Unknown eventId " + eventId);
	}
}
Listing 1 – Example of a typical dispatcher method


Because these dispatcher methods have a similar structure they are ideally suited to generate them automatically. If there are large numbers of different events to process such methods can exceed the bytecode limit. To overcome this problem one can refactor the code (Listing 2). The refactored code can also be generated by using a specific pattern in the code generator (Listing 3)

int dispatch(int eventId, Object data) {
	switch (eventId)  {
		case 0: return handler.next(data);
		case 1: return handler.previous(data);
		...
		case 999: return handler.pause(data);
		default: return dispatch1000(eventId, data);
	}
}

private int dispatch1000(int eventId, Object data) {
	switch (eventId)  {
		case 1000: return handler.resume(data);
		...
		case 1422: return handler.exit(data);
		default: throw new RuntimeException("Unknown eventId " + eventId);
	}
}
Listing 2 – Refactored dispatcher method


«DEFINE dispatcher FOR EventList»
	«EXPAND dispatcher(0, events.size, 1000) FOR this»
«ENDDEFINE»

«DEFINE dispatcher(int from, int totalNo, int length) FOR EventList»
	«IF from==0»
		int dispatch(int eventId, Object data) {
	«ELSE»
		private int dispatch«from»(int eventId, Object data) {
	«ENDIF»

		switch (eventId) {
			«FOREACH from.upTo(from+length-1) AS i»
				«IF i < totalNo»
					case «i»: return «EXPAND handler FOR events.get(i)»;
				«ENDIF»
			«ENDFOREACH»
			
			«IF from+length<=totalNo»
				default: return dispatch«from+length»(eventId, data);
			«ELSE»
				default: throw new RuntimeException("Unknown eventId " + eventId);
			«ENDIF»
		}
	}
	
	«IF from+length<=totalNo»
		«EXPAND dispatcher(from+length, totalNo, length) FOR this»
	«ENDIF»
«ENDDEFINE»
Listing 3 – Pattern in code generator (Xpand) to generate the refactored dispatcher method


However, using this pattern extensively makes the understanding of the code generator become more difficult and therefore has to be used with caution. Furthermore, it is difficult to predict where this pattern has to be used. Instead of that, the user is informed about that the bytecode limit is exceeded after the code generator run was completed by a failure message of the java compiler. In such cases one has to adapt the code generators and restart code generation which leads in large projects to long turnaround times. Furthermore, it is hard to estimate the amount of entries after which to split the method bodies. If this number is too large the bytecode limit can still be exceeded. On the other hand, the lower the number the worse is the runtime performance of the generated code.
For the future automated tools are required that apply these refactoring pattern automatically wherever necessary. This can be done by postprocessors that are executed right after code generator to adapt the generated code. An alternative solution is to use preprocessors that modify the code before passing it to the java compiler.



Finding the right balance between interpretation and code generation

The most effective method to avoid such problems is a suitable architecture of the generated code. Rather than generating pure code it may therefore be necessary to generate data that is interpreted at runtime. By pre-processing the data in the code generator and write them to an optimized, compact binary format the performance drawbacks of interpretation can be minimized.

In Java the classloading-mechanism determines which code is kept in memory and when it is loaded and unloaded. While this also applies for generated code one, interpreted data can use more efficient memory management strategies that holds only the data that is actually needed in the current system state. Moreover, in order to achieve faster start-up time lazy loading concept can be implemented that loads the data to be interpreted not until it is needed.

In some cases, it makes sense to include the interpreted data as constant values in the code instead of writing it to external files. Although this data has an array-like structure, no array constants should be used. This is because during initialization in the static constructor their contained elements are loaded from the constant pool and stored to the array one by one (Listing 4). This requires much bytecode [5]. Together with the 64kB limit this puts a ceiling on the number of entries in such constant arrays, e.g. byte arrays are limited to 9300 entries.

.java
static final byte B[] = {... 99, ... };	// Field descriptor #6 [B
.class
static final byte[] B;
  
// Method descriptor 
static {};
...
dup	
sipush 4711	
bipush 99	
bastore	
...

Size of the bytecode per element: 7 Bytes

Listing 4 – Initialization of an array in the bytecode


Instead of arrays one should use strings for such constant values because their value is written to the constant pool of the .class files – just like constant values of atomic datatypes [2, chapter 4.4.7] (Listing 5).

.java
// Field descriptor #6 Ljava/lang/String;
static final String s = ... + "\u0009" + ...;	
.class
static final java.lang.String s = "...\99..."; 

Size of the bytecode per element: 1 Byte

Listing 5 – Initialization of a string in the bytecode



Optimizations to reduce the time of code generation and compilation

In large projects a run of the code generator and the following compilation process can take hours. Because this complicates iterative development processes one should try to minimize the execution time.

Therefore, when source models are changed only those generated parts should need to be generated again, that is actually influenced by the change. Instead of a single generator this calls for multiple generators that can be executed separately and the code or interpreted data formats need to be decoupled from each other.

When deciding between code generation and interpretation one has to consider that data in external files that is interpreted at runtime does not need to be recompiled after it was changed. If code generation is applied it makes sense to recompile only those source code files that were semantically changed. This requires blocking to rewrite source code files that are not changed so that its file date is not changed. Whether a file is rewritten or not can be decided automatically by post processors that compare the hash value of the existing file with the one of the new version. However, this approach requires implementing a reproducible code generation.



Reproducible code generation

A reproducible code generation creates the exact same code whenever executed for the same source models.
In order to achieve this goal the generation time and date mustn’t be written to the generated code because it cannot be found in the source models. If the same models are used for another code generation a different date is written to the files. Thus their hash value is modified although there is no semantic change between both versions. In order to determine their equality a more detailed and time consuming analysis would be necessary that ignores whitespace and comments.

Besides for not writing the generation time to the generated files a couple more basic rules are to follow in order to achieve reproducible code generation:

The names of all identifiers need to be formed in the same way. On the one hand they have to be unique. On the other hand they should be meaningful to foster understanding the generated sources. A possible strategy to build names is to concatenate the names of all model elements they are semantically related to . However, such a strategy can lead to name clashes with existing elements. In such a case a counter suffix can be appended to the name. However, when rerunning the generator it has to assign the same counter suffixes to the same elements. As this example shows there need to be a specific order how the source model’s elements are processed.

When key values are required the generator mustn’t create new random values in each run. Instead of that, strategies have to be used that assign the same values in repeated runs.

Reproducible code generation means that only those parts of generated files are changed that result directly from changes to the source models. This enables to identify and analyze such changes in the generated files easily, e.g. by using Diff. Sometimes it is unavoidable to manage generated code in a version control system. In such cases a reproducible code generation prevents checking in new versions of the generated files after each generator run although there has been no semantic change.



Conclusion

When applying code generation in a larger scale some particularities of the Java programming language have to be taken into account. The decision between applying code generation, interpretation or mixtures of both of them must be well thought through for each individual case. The decision must be based on detailed knowledge about the Java, bytecode and the JVM. In particular the 64kB bytecode limit per method is important. There are different workaround possible that are based on adaption of the code generators and the architecture of the generated sources. In order to keep turnaround times low after changing the source files one should use separate generators and decouple the generated files. Furthermore, the code generators should be developed in such a way that they create completely reproducible results.



References


  1. Gerlach, Simon: Neue Kooperationsformen zwischen Fahrzeugherstellern & Zulieferern. JavaSpektrum. (05) 2011. SIGS Datacom. Troisdorf.
  2. http://java.sun.com/docs/books/jvms/second_edition/html/ClassFile.doc.html
  3. http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4262078
  4. Wietzke, J.; Tran, M. T.: Automotive Embedded Systeme. Effizientes Framework – Vom Design zur Implementierung; Springer, Berlin, Heidelberg, 2005.
  5. Preda, Mihai: Synergetic Stupidity in Java. Mihai Mobile Blog. 2009. http://blog.javia.org/synergetic-stupidity-in-java/

Advertisements
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: