Hackystat Developer Documentation
Version Six Development Plan

Philip Johnson
Collaborative Software Development Laboratory
University of Hawaii 

$Id: VersionSix.html,v 1.4 2004/01/08 03:13:58 johnson Exp $

1.0 Overview

The purpose of this document is to provide a roadmap for development of Hackystat over the next one to two months, and a log of the status of each task.  At the conclusion of these development activities, we will increment the major release number from five to six.  This will represent the major advances we have made since the beginning of version five in understanding performance characteristics of Hackystat and how to better support cached abstractions of raw sensor data.

2.0 The "Develop-Review-Enhance-Disperse" (DRED) process model

Many of the current development tasks have a similar nature in that they consist of initial development of a new or improved "service", which must then be "dispersed" throughout the remainder of the system.  For example, the new XML-based approach to user data persistence will begin by extending the User class with new facilities.  Once implemented, these facilities must be adopted by other components that until now have implemented their own user-level persistence mechanisms due to inadequacies in the current Properties-based User persistence.

In this situation, there are a couple of potential risk factors:

  1. The service is dispersed into other components before it is ready.  This could result in a certain amount of "thrashing", as the service is redesigned and then the changes have to be re-dispersed into the components again.
  2. The service is idiosyncratic and understood only by one developer. This could result in a less elegantly designed, harder to understand implementation.

The Develop-Review-Enhance-Disperse (DRED) process is my proposal for a way to reduce these risks during the development of these tasks.  Note that this risk reduction should translate directly into a reduction of time spent in development.

2.1 DRED Stage One: Develop

The "Develop" stage of the DRED process involves the initial development of the service. Each task has a principal developer (the "Master Chef", or MC) as well as someone they have official permission to bug regarding problems and bugs (the "Sous Chef", or SC).  I recommend, but do not enforce, that agile practices like Pair Programming and Test-Driven Development be followed during this stage if the MC and SC deem them useful.

2.2 DRED Stage Two: Review

Once the Develop stage has reached a stable point in implementation, the MC and SC should call for a Review of their design and implementation.  The review process will basically follow the CSDL review procedure, except that we will use Jupiter rather than creating simple text files. We will use Jupiter for all reviews, because (a) using Jupiter will provide Takuya with useful feedback on improving the system; (b) Jupiter should be more efficient and effective than the previous CSDL process for code review, since it is integrated with Eclipse; (c) using Jupiter should help "remote" developers like Cam, Joy, and Shenyan participate even if they cannot physically attend the meeting; and (d) using Jupiter will create an XML repository of defects which can later be sent to Hackystat as defect data.

2.3 DRED Stage Three: Enhance

After review, the service goes back to the MC and SC for enhancement and improvement.  Once these changes have been completed, the service moves on to the next stage, dispersion. (In rare cases, the service may require an additional round of review before dispersion.)

2.4 DRED Stage Four: Disperse

Once the service is ready for dispersion across the Hackystat code base, this task will be assigned to developers other than the original MC (and perhaps SC).  The point of this phase is to ensure that developers other than the original implementers gain experience as "clients" of the service early on in its life.  This spreads knowledge and also may uncover additional opportunities for enhancement and improvement of the service.

3.0 Task Requirements Specifications

This section presents the major tasks and their requirements in two sections. The first set of tasks focuses on improvements to our performance analysis, thread-safety, and caching capabilities. The second set of tasks focuses on improvements to our module-level structure.

3.1 hackyPerf module

There are two goals for the hackyPerf module:

(1) Enable us to evaluate the correctness of various hackystat configurations under concurrent loads. In the simplest case, if a ConcurrentModificationException is thrown, then the system is not correct. (If it's not thrown, we don't know for sure.)

(2) Evaluate the responsiveness of various hackystat configurations under concurrent loads. In the simplest case, determine how long a given analysis takes to complete under a given load condition. Load is measured by the amount of data required to be looked at in order to compute the answer, plus (potentially) the number of other processes interested in that data at the same time, plus (potentially) the number of processes updating the dataset at the same time.

An approach: Define an integer N to be the number of users to simulate, and D to be the (simulated) number of days for the test run.

- Register all N of these users with the [test] server, call them test-user-<N>. The first test user creates a project Test-Project and subscribes the others to it.

- Spawn 0 to N "Sensor" threads, up to one for each user.
    - Each Sensor thread starts sending data (Activity, FileMetric, JUnit, and Coverage) at a rate of once per second. The timestamp on the data advances by 60 minutes with each send. Thus, every 24 seconds of real time corresponds to a day's worth of data. This goes on for 24 * D seconds.

- Spawn 0 to N "Analysis" threads, up to one for each user.
    - Each Analysis thread starts invoking project-related analyses for today's date + D days. It stores the amount of time required for each request/response and makes it available as the test run result.

We need the ability to vary the amount of data sent, and the ability to vary the number of sensor and analysis threads, in order to understand how the mixture affects performance and correctness.
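The approach above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class PerfRunSketch, its Result holder, and the sendData and analysis callbacks are hypothetical names, not part of Hackystat. The sketch also assumes that each Analysis thread issues one request per simulated hour, which the plan does not specify, and it adds a pauseMillis parameter (1000 in the plan) so that a trial run can be shortened.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongConsumer;

/** Hypothetical load driver; none of these names come from Hackystat. */
class PerfRunSketch {
  static final long HOUR_MILLIS = 60L * 60L * 1000L;

  /** Outcome of one run: per-request analysis times plus a failure count. */
  static class Result {
    final Queue<Long> analysisNanos = new ConcurrentLinkedQueue<>();
    final AtomicInteger concurrentModifications = new AtomicInteger();
  }

  /** Timestamps one sensor sends: 24 per simulated day, 60 minutes apart. */
  static List<Long> simulatedTimestamps(long startMillis, int days) {
    List<Long> stamps = new ArrayList<>();
    for (int tick = 0; tick < 24 * days; tick++) {
      stamps.add(startMillis + tick * HOUR_MILLIS);
    }
    return stamps;
  }

  /** Runs all sensor and analysis threads to completion and reports what happened. */
  static Result run(int sensorThreads, int analysisThreads, int days, long pauseMillis,
      LongConsumer sendData, Runnable analysis) {
    Result result = new Result();
    ExecutorService pool =
        Executors.newFixedThreadPool(Math.max(1, sensorThreads + analysisThreads));
    for (int i = 0; i < sensorThreads; i++) {
      pool.execute(() -> {
        for (long stamp : simulatedTimestamps(0L, days)) {
          guarded(result, () -> sendData.accept(stamp));
          pause(pauseMillis);
        }
      });
    }
    for (int i = 0; i < analysisThreads; i++) {
      pool.execute(() -> {
        for (int tick = 0; tick < 24 * days; tick++) {
          long begin = System.nanoTime();
          guarded(result, analysis);
          result.analysisNanos.add(System.nanoTime() - begin);
          pause(pauseMillis);
        }
      });
    }
    pool.shutdown();
    try {
      pool.awaitTermination(1, TimeUnit.HOURS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return result;
  }

  /** Counts a ConcurrentModificationException as a correctness failure. */
  private static void guarded(Result result, Runnable action) {
    try {
      action.run();
    } catch (ConcurrentModificationException e) {
      result.concurrentModifications.incrementAndGet();
    }
  }

  private static void pause(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

A nonzero concurrentModifications count means the configuration is not correct. A zero count, as noted above, does not prove that it is.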

Let's use Ant as the user interface so that we can run it on any platform. So, hackyPerf will allow us to do things like:

ant -Dsensor.threads=10 -Danalysis.threads=20 -Ddays=20 analyzePerformance

There will probably need to be a few different output settings so that we can monitor changes in responsiveness over the course of a run.

Finally, hackyPerf needs to be configurable to support performance evaluation in all of our configurations. This means it must be possible to vary the kinds of data that are sent and the kinds of analyses that are invoked.  For starters, we want to focus on the hackystat-UH configuration.

3.2 User XML-based persistent storage

There is a diverse set of data regarding user configuration and preference settings that must be stored on the server side.  The User class implements persistency based upon Properties, storing the results in a file called User.txt in their top-level directory. This approach has proven insufficient for clients that need to persist lists of data, for example.  These clients (such as Workspace, Alerts, Course, Projects, DailyDiary, etc.) have implemented their own custom persistency mechanisms which store data in such files as ByteCodePerLine.configuration, ComplexityThresholdAlert.configuration, courses.xml, WorkspaceRootConfig.xml, projects.xml, HiddenWorkspace.xml, and DailyDiaryColumn.xml.

The goal of this task is to enhance the User class to support XML based persistency in a manner that allows clients that currently implement their own persistency mechanism to delete it and use the simpler and more robust facilities provided by the User class. The features include:

3.3 IntervalSelector

IntervalSelector is a simplified, consistent approach to time interval specification for analyses. IntervalSelector, after Dispersion, will be the _only_ selector used to specify time intervals for any analyses involving time intervals. With the IntervalSelector, you have a choice of exactly three grain sizes: days, weeks, and months.

  1. Days are uniformly defined as the following interval: 12:00:00.0000am - 11:59:59.9999pm
  2. Weeks are uniformly defined as the following interval: Sunday at 12:00:00.0000am to Saturday at 11:59:59.9999pm
  3. Months are uniformly defined as the following interval: The first day of the month at 12:00:00.0000am to the last day of the month at 11:59:59.9999pm

These definitions result in the following very useful property: any given time point belongs to exactly one day interval, exactly one week interval, and exactly one month interval. In other words, when doing an analysis over a three week interval, you are constrained to selecting three adjacent weeks, each of which starts on Sunday and ends on the next Saturday. You can't, as we allow in the current system, select a 7 day interval that starts on a Wednesday and ends the following Tuesday. Similarly, if you want to select an analysis that aggregates values together into a monthly interval, you can't pick a 30 day interval that starts on the 10th of one month and ends on the 10th of the next month. You have to pick January, or February, or March, etc.
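To make the "exactly one interval" property concrete, here is a minimal sketch that maps any date to its day, week, and month interval. The class IntervalSketch and its methods are hypothetical, since the real IntervalSelector API is not shown in this plan. The sketch also assumes the java.time classes are available and works at whole-day granularity.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

/** Hypothetical helper; not the actual IntervalSelector implementation. */
class IntervalSketch {
  /** Inclusive first and last day of an interval. */
  static final class Interval {
    final LocalDate first;
    final LocalDate last;

    Interval(LocalDate first, LocalDate last) {
      this.first = first;
      this.last = last;
    }
  }

  /** A day interval is just the date itself. */
  static Interval dayOf(LocalDate date) {
    return new Interval(date, date);
  }

  /** The Sunday-to-Saturday week containing the date. */
  static Interval weekOf(LocalDate date) {
    LocalDate sunday = date.with(TemporalAdjusters.previousOrSame(DayOfWeek.SUNDAY));
    return new Interval(sunday, sunday.plusDays(6));
  }

  /** The calendar month containing the date. */
  static Interval monthOf(LocalDate date) {
    return new Interval(date.withDayOfMonth(1),
        date.with(TemporalAdjusters.lastDayOfMonth()));
  }
}
```

Every date maps to one and only one result from each method, so the interval's first day could serve as a cache key.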

It is my belief that imposing these restrictions on time interval specification will have essentially zero negative impact on a user's ability to gain meaningful insight from their analyses. Part of my belief comes from fiddling with the interval specification on my own data, and never finding that "shifting the window" revealed anything of interest.

On the other hand, I believe that imposing these restrictions has a HUGE positive impact on our ability to build effective high-level caches for project-level information. This is because we will be able to build a cache of project level information for a given user (or set of users) at the level of weeks and months, just like we can now at the level of days. (You can look at it another way: we never allowed the user to specify a "start hour" or an "end hour" for a day--we forced them to view each day as having a fixed start and end time. That restriction was what enabled us to build the DailyAnalysis caches. By eliminating the ability of the user to specify a "start day" and "end day" for weeks (and months), we will analogously be able to build WeeklyAnalysis and MonthlyAnalysis caches.) Because new data is usually sent to the server within minutes of its generation, new data is not likely to invalidate any DailyCache but the current one, any WeeklyCache but the current one, or any MonthlyCache but the current one.

Here's my idea of how the IntervalSelector would look:

Some things to note about this selector:

  1. What you pick for the interval with the radio button determines which interval specification on the following three lines is actually used by the analysis.
  2. As I noted before, this selector would be the only way to specify intervals for all of our analyses. I think this consistency in interval specification will be an improvement over our current random approach. In fact, I think this one is much more obvious than "period size" and "number periods".
  3. There is an issue of how many <week> values to provide in the pull down list. I propose that we start by providing the current week (because data can't be generated in the future) and going back 26 weeks into the past. I am guessing that after six months in the past, people don't really care about weekly data any more and would start analyzing in terms of months. But, just in case I'm guessing wrong, we can provide an interval.selector.weeks property for hackystat.properties that an admin can use to make the pull-down list as long or as short as their users need.
  4. The last choices made by the user can be persisted using the new User XML class with the "intervalselector" data partition.
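As a rough illustration of item 3, the week pull-down could be populated along these lines. The class WeekChoicesSketch is hypothetical. The sketch assumes that interval.selector.weeks holds the number of weeks to go back from the current week, which is only one possible reading of the proposal.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper for filling the week pull-down list. */
class WeekChoicesSketch {
  static final int DEFAULT_WEEKS_BACK = 26;

  /** Sundays that start the current week and each preceding week, newest first. */
  static List<LocalDate> weekStarts(LocalDate today, Properties props) {
    // Assumption: the property value is the number of past weeks to offer.
    int weeksBack = Integer.parseInt(
        props.getProperty("interval.selector.weeks", String.valueOf(DEFAULT_WEEKS_BACK)));
    LocalDate sunday = today.with(TemporalAdjusters.previousOrSame(DayOfWeek.SUNDAY));
    List<LocalDate> starts = new ArrayList<>();
    for (int i = 0; i <= weeksBack; i++) {
      starts.add(sunday.minusWeeks(i));
    }
    return starts;
  }
}
```

With no property set, this yields 27 entries: the current week plus the 26 before it.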

Once we can effectively build Weekly and/or Monthly project-level caches, only the first user to run the analysis takes the hit of iterating through the raw data, and everyone else will just use the cached values provided by the first user. Since these caches will be stored using our new and improved ThreeKeyCache with thread safe iterators, there will be no locking and thus no resource contention among threads. Most importantly, the project-level analyses will become extremely fast because they will typically only need to refer to a few dozen objects, as opposed to a few thousand like they do now.

3.4 Thread-safe ThreeKeyCache

The goal of this task is to create a new version of ThreeKeyCache with thread-safe iterators.  We will investigate the use of the ConcurrentHashMap class from the util.concurrent package to implement this. (See http://www-106.ibm.com/developerworks/java/library/j-jtp07233.html for details.)
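One possible shape for this, sketched with the java.util.concurrent version of ConcurrentHashMap: the class ThreeKeyCacheSketch and its methods are hypothetical and do not reflect the existing ThreeKeyCache API, which this plan does not show. The point of the sketch is that ConcurrentHashMap iterators are weakly consistent, so iteration does not throw ConcurrentModificationException when the map is updated at the same time.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical stand-in for a three-level cache built from concurrent maps. */
class ThreeKeyCacheSketch<K1, K2, K3, V> {
  private final Map<K1, Map<K2, Map<K3, V>>> root = new ConcurrentHashMap<>();

  /** Stores a value, creating the intermediate maps on demand. */
  void put(K1 k1, K2 k2, K3 k3, V value) {
    root.computeIfAbsent(k1, k -> new ConcurrentHashMap<>())
        .computeIfAbsent(k2, k -> new ConcurrentHashMap<>())
        .put(k3, value);
  }

  /** Returns the cached value, or null if any level is missing. */
  V get(K1 k1, K2 k2, K3 k3) {
    Map<K2, Map<K3, V>> second = root.get(k1);
    if (second == null) {
      return null;
    }
    Map<K3, V> third = second.get(k2);
    return third == null ? null : third.get(k3);
  }

  /** Visits cached values; put() may be called concurrently without an exception. */
  void forEachValue(Consumer<V> visitor) {
    for (Map<K2, Map<K3, V>> second : root.values()) {
      for (Map<K3, V> third : second.values()) {
        for (V value : third.values()) {
          visitor.accept(value);
        }
      }
    }
  }
}
```

The trade-off is that an iteration may or may not see entries added while it is in progress, which seems acceptable for a cache.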

We will also investigate whether we can provide improved control over garbage collection by augmenting ThreeKeyCache to support weak and phantom references in addition to the hard and soft references supported now. (See http://java.sun.com/developer/technicalArticles/ALT/RefObj/ for details.)
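For reference while investigating, here is a small sketch of how the three java.lang.ref reference types are created. The class CacheReferenceSketch and its Strength enum are hypothetical, and how they would plug into ThreeKeyCache is left open.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;

/** Hypothetical factory showing the reference strengths a cache could offer. */
class CacheReferenceSketch {
  enum Strength { SOFT, WEAK, PHANTOM }

  /** Wraps a value in the requested reference type, registered with the queue. */
  static <V> Reference<V> wrap(V value, Strength strength, ReferenceQueue<V> queue) {
    switch (strength) {
      case SOFT:
        // Cleared at the collector's discretion, typically under memory pressure.
        return new SoftReference<>(value, queue);
      case WEAK:
        // Eligible for clearing once no strong or soft references remain.
        return new WeakReference<>(value, queue);
      default:
        // get() always returns null; useful only for post-collection cleanup.
        return new PhantomReference<>(value, queue);
    }
  }
}
```

One thing to keep in mind: since a PhantomReference never returns its referent, it can tell us that a cached value was collected, but it cannot be used to retrieve the value.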

3.5 Project week/month caching

The dispersion of the IntervalSelector and thread-safe ThreeKeyCache will enable us to build efficient week and month-level caches for project-level information.

Once we can effectively build Weekly and/or Monthly project-level caches, only the first user to run the analysis takes the hit of iterating through the raw data, and everyone else will just use the cached values provided by the first user. Since these caches will be stored using our thread-safe ThreeKeyCache, there will be no locking and thus no resource contention among threads. In addition, the project-level analyses will become extremely fast because they will typically only need to refer to a few dozen objects, as opposed to a few thousand like they do now.
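The "first user takes the hit" behavior can be sketched with ConcurrentHashMap.computeIfAbsent. The class WeeklyCacheSketch, the "project/week-start" key format, and the rawDataScan callback are all hypothetical, and the cached value is simplified to a single integer.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical project-level cache keyed by "project/week-start". */
class WeeklyCacheSketch {
  private final Map<String, Integer> cache = new ConcurrentHashMap<>();
  private final Function<String, Integer> rawDataScan;

  /** rawDataScan stands in for the expensive pass over raw sensor data. */
  WeeklyCacheSketch(Function<String, Integer> rawDataScan) {
    this.rawDataScan = rawDataScan;
  }

  /** The first caller for a week runs the scan; later callers reuse its result. */
  int weeklyValue(String project, String weekStart) {
    return cache.computeIfAbsent(project + "/" + weekStart, rawDataScan);
  }

  /** Drops a week's entry, for example when new sensor data arrives for it. */
  void invalidate(String project, String weekStart) {
    cache.remove(project + "/" + weekStart);
  }
}
```

Note that computeIfAbsent can briefly hold up other updates that hash to the same part of the map while the scan runs, so "no locking" is an approximation here. Plain reads of already cached weeks are not blocked.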

3.6 SimpleDateFormat thread safety

There are several alternatives I can come up with for dealing with the thread-unsafe nature of SimpleDateFormat. Recall, for starters, that our DateInfo class, which uses SimpleDateFormat, looks like this:

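import java.text.SimpleDateFormat;
import java.util.Date;

public class DateInfo {
  /** yyyy-MM formatter */
  private static SimpleDateFormat monthFormat = new SimpleDateFormat("yyyy-MM");
  /** yyyy-MM-dd formatter */
  private static SimpleDateFormat dayFormat = new SimpleDateFormat("yyyy-MM-dd");
  // ... (the remaining formatters, such as MM/dd/yyyy, are omitted here) ...

  /** Returns the month file name (yyyy-MM.xml) for the passed date. */
  public static String getMonthFileName(Date date) {
    // Not thread-safe: all threads share the single monthFormat instance.
    return DateInfo.monthFormat.format(date) + ".xml";
  }
}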

In other words, we create about a dozen SimpleDateFormat instances as private variables, and we later invoke their format() method from within a Hackystat utility method like getMonthFileName() in order to produce a string from a Date instance. These methods have the potential to be used a LOT in hackystat, and because they all access the same 12 instances of SimpleDateFormat, there are definitely thread-safety issues.

BTW, it makes perfect sense to me that Java wouldn't make these thread-safe. That would incur a performance hit for all applications, and this is the kind of low-level class that you want to be fast.

With that little introduction, on to the choices:

1. Synchronize access to all of the DateInfo methods. This is a simple solution. It has the potential for a significant performance hit, of course, if it turns out that these dozen or so SimpleDateFormat instances form a kind of bottleneck for system processing.

2. Don't share SimpleDateFormat instances, just create a new one each time. This is also a simple solution, and eliminates the need for synchronization. However, constructing a new instance on every call is comparatively expensive, and we would prefer not to incur the cost of creating and GC'ing all of those instances.

3. Use the ThreadLocal class! Here's a cool approach: The ThreadLocal class allows you to maintain a separate copy of a variable value in each thread! We can easily alter the DateInfo class to maintain ThreadLocal copies of each of the SimpleDateFormat instances. That way we don't need to synchronize access to the DateInfo classes, since each thread will access its own copy of the SimpleDateFormat instance. It eliminates the bottleneck possibility, while only costing us a dozen instances per thread.

Here's some more info on using ThreadLocal:

4. A fourth approach, suggested by Hongbing at the meeting on Thursday, would be to replace the simple date formats involving YYYY, MM, and DD by doing arithmetic on the UTC value passed in. This eliminates the need for synchronization, and also allows some simple optimization. (For example, one could use nested if-then's to get the YYYY value for years between 2000 and 2005 and provide exactly 5 string values!)

Which one to choose? For the simple date formats that involve YYYY, MM, and DD, we should go with the arithmetic version.  For others, we should simply synchronize, unless profiling finds that the ThreadLocal version would be better.
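To compare the choices side by side, here is a minimal sketch of options 1, 3, and 4 for the yyyy-MM-dd format. The class DateInfoSketch and its method names are hypothetical. The sketch assumes all formatting is done in UTC so that the three variants agree, and the arithmetic variant uses a standard days-to-civil-date conversion rather than the nested if-then idea mentioned above.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical comparison of three thread-safe ways to format yyyy-MM-dd. */
class DateInfoSketch {
  // Option 1: one shared formatter, every use guarded by a lock.
  private static final SimpleDateFormat SHARED_DAY = utcFormat("yyyy-MM-dd");

  // Option 3: each thread lazily creates and keeps its own formatter.
  private static final ThreadLocal<SimpleDateFormat> LOCAL_DAY =
      ThreadLocal.withInitial(() -> utcFormat("yyyy-MM-dd"));

  private static SimpleDateFormat utcFormat(String pattern) {
    SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
    format.setTimeZone(TimeZone.getTimeZone("UTC"));
    return format;
  }

  static String daySynchronized(Date date) {
    synchronized (SHARED_DAY) {
      return SHARED_DAY.format(date);
    }
  }

  static String dayThreadLocal(Date date) {
    return LOCAL_DAY.get().format(date);
  }

  // Option 4: pure arithmetic on the UTC millisecond value; no shared state.
  // Assumes a four-digit year.
  static String dayArithmetic(Date date) {
    long days = Math.floorDiv(date.getTime(), 86_400_000L);
    long z = days + 719_468L;                    // shift epoch to 0000-03-01
    long era = Math.floorDiv(z, 146_097L);       // 400-year cycles
    long dayOfEra = z - era * 146_097L;
    long yearOfEra =
        (dayOfEra - dayOfEra / 1_460 + dayOfEra / 36_524 - dayOfEra / 146_096) / 365;
    long dayOfYear = dayOfEra - (365 * yearOfEra + yearOfEra / 4 - yearOfEra / 100);
    long marchMonth = (5 * dayOfYear + 2) / 153; // 0 = March ... 11 = February
    long day = dayOfYear - (153 * marchMonth + 2) / 5 + 1;
    long month = marchMonth < 10 ? marchMonth + 3 : marchMonth - 9;
    long year = yearOfEra + era * 400 + (month <= 2 ? 1 : 0);
    StringBuilder text = new StringBuilder(10);
    text.append(year).append('-');
    if (month < 10) {
      text.append('0');
    }
    text.append(month).append('-');
    if (day < 10) {
      text.append('0');
    }
    return text.append(day).toString();
  }
}
```

All three return the same string for the same Date. Which one is fastest under load is exactly the kind of question profiling would need to answer.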

3.7 Uniform Active Time Units

The goal of this task is to start uniformly reporting Active Time in tenths of an hour.

For example:

My rationale for this is as follows:

1. For projects, reporting in minutes is less usable. One sitewatch project reports 10,040 minutes. It would be way more usable to report this active time as 167.3 hours. (That also makes it much easier to eyeball relative contributions---it's easy to see that each developer put in about 55 hours on the project.)

2. Reporting in tenths of an hour more properly reflects the true precision of our measurement. When we report in minutes, like 10,040, we are reporting with a precision of 1 minute, which does not reflect the actual precision of our measurement. When we report in tenths of an hour, we are reporting with a precision of 6 minutes, which is actually quite close to our actual precision (5 minute grain sizes).

Would it be helpful to create a utility function that takes an integer and returns a string representing the active time in tenths of an hour? For example,

In org.hackystat.stdext.activity.AbstractEffort:

public static String getActiveTimeHoursString(int minutes) {}
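One way the body of that utility might look is sketched below. It lives in a hypothetical ActiveTimeSketch class rather than AbstractEffort, and it assumes that minutes is non-negative and that rounding to the nearest tenth of an hour (half up) is what we want.

```java
/** Hypothetical home for the proposed utility; the plan places it in AbstractEffort. */
class ActiveTimeSketch {
  /** Converts whole minutes to a string in tenths of an hour, e.g. 10040 -> "167.3". */
  static String getActiveTimeHoursString(int minutes) {
    if (minutes < 0) {
      throw new IllegalArgumentException("minutes must be non-negative");
    }
    // One tenth of an hour is 6 minutes; adding 3 rounds to the nearest tenth.
    int tenths = (minutes + 3) / 6;
    return (tenths / 10) + "." + (tenths % 10);
  }
}
```

Using integer arithmetic avoids any floating point formatting surprises and keeps the output at exactly one decimal place.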

In addition, all of us, especially me, have been sloppy thus far about using "Effort" as a synonym for "Active Time" in the code and the user interface. Based upon this semester, I think we need to start cleaning that up as well. One thing that is really clear from the classroom setting is that I needed to make sure the students understand that we are not measuring their total "Effort" on the project, we are measuring their total "Active Time"---which is defined as the time they spend physically editing files. This is an important distinction, because I had several discussions with students where they'd begin by saying the system simply doesn't work, since they devote effort to meetings and so forth. By getting everyone to think in terms of "Active Time" instead of "Effort", the discussion changes to: (1) are the sensors actually collecting Active Time correctly? and (2) is a measurement of Active Time actually useful in software development? These are more interesting and appropriate conversations.

3.8 Configuration performance analysis

Once we have completed these improvements to the system, we will subject our configurations to performance analysis with the goal of ensuring that they do not fail with thread-related issues under higher loads, and also to confirm that they exhibit improved memory management due to the new caching and garbage collection facilities.

4.0 Task Assignments and Status

Task Name                            Developers (MC, SC)   DRED Status
hackyPerf                            Cedric, Hongbing      12/4/2003: Entered Develop
User XML                             Philip, Aaron         12/19/2003: Entered Develop
IntervalSelector                     Hongbing, Philip      ?
SimpleDateFormat thread safety       Aaron, Takuya         ?
Uniform Active Time Units            Takuya, Cedric        ?
Thread-safe ThreeKeyCache            Philip?
Project week/month caching           Hongbing?
Configuration performance analysis

5.0 That's not all, folks!

While the completion of the above tasks will probably be enough to justify a 6.0 release, I wanted to briefly list some of the other tasks that are on my Hackystat To-Do list.

  1. Sensor Module Decomposition:  This involves extracting the sensor implementations from the hackyStdExt module and creating hackyJBuilder, hackyAnt, hackyEclipse, and hackyEmacs packages.
  2. Migrate to Ant 1.6: This version has new features that we can use to simplify our build.xml.
  3. hackyTemplate and hackyVim modules: Provide a template that Dan can use to package up his Vim sensor.
  4. Relational Database version of Hackystat:  Once we have 4.0 complete and those changes implemented, we could experiment with replacing our XML back-end with an RDBMS like MySQL and see what performance improvements result.
  5. Project and Estimation re-design:  We should revisit the Project and Estimation modules and see how they can work better together.
  6. Internationalization: Support a Japanese Hackystat, for example.
  7. Hackystat installation at JPL.
  8. LOCC sensor implementation: This includes both the sensor implementation and LOCC extension to support languages like C++ and Fortran.
  9. Configuration definitions of University of Maryland and Sun-Russia development groups.