LAPS Software Integration
(test, build, and release strategy)

Updated 3/11/2008

Introduction

This is a draft strategy for software testing and integration with our LAPS runs. It is a living document and ideas are welcome. Some of these ideas describe what we are already doing; others are proposed additions. There are two main areas for running LAPS: the parallel side and the operational side. The root directories for these are as follows:
Parallel side...
$LAPS_SRC_ROOT = /usr/nfs/common/lapb/parallel/laps
$LAPSINSTALLROOT = /usr/nfs/lapb/parallel/laps/bin (only on the IBM machines)
$LAPS_DATA_ROOT = /data/lapb/parallel/laps/data

Operational side...
$LAPS_SRC_ROOT = /usr/nfs/common/lapb/operational/laps
$LAPSINSTALLROOT = /usr/nfs/lapb/operational/laps/bin (only on the IBM machines)
$LAPS_DATA_ROOT = /data/lapb/operational/laps/data
Note that '/usr/nfs/common' is a common directory accessible from all the platforms. '/usr/nfs/lapb' (without the common) appears on all the platforms, but points to a different directory depending on what type of machine you are logged in on (IBM, HP, etc.). '/usr/nfs/lapb/builds*' is where any platform-dependent software or executables go. The parallel side has some additional soft links present that tend to blur the distinction between the three ROOT directories.
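
As an illustration, the three roots for either side could be set from one small helper. This is only a sketch: 'laps_roots' is a hypothetical name, and a Bourne-style shell is assumed (csh users would use 'setenv' instead). The paths are the ones listed above.

```shell
#!/bin/sh
# laps_roots SIDE: print export lines for the "parallel" or "operational" side.
# Hypothetical helper; the paths are copied from the list in this document.
laps_roots() {
    side=$1
    case "$side" in
        parallel|operational) ;;
        *) echo "usage: laps_roots parallel|operational" >&2; return 1 ;;
    esac
    echo "export LAPS_SRC_ROOT=/usr/nfs/common/lapb/$side/laps"
    # LAPSINSTALLROOT applies only on the IBM machines.
    echo "export LAPSINSTALLROOT=/usr/nfs/lapb/$side/laps/bin"
    echo "export LAPS_DATA_ROOT=/data/lapb/$side/laps/data"
}
```

For example, 'eval "$(laps_roots parallel)"' would set the parallel side variables in the current shell.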

Disk Space

Here is a brief summary about how we like to store our files on disk. We usually put our "personal" software and "home" related files on '/home/fab/username' and this is usually kept to a limit of about 1 GB. Other larger datasets are usually stored in '/data/lapb/users/username' where we have more space. Temporary items could be kept in '/scratch/lapb/username' and this is purged as often as every 2 weeks.

You can also see how we're doing as a group by running the 'df' command and examining the various disks. The main point is that we want to avoid filling up the disks, which would cause system disruptions.
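
A quick way to check your own usage against a limit is to wrap 'du'. The sketch below is an assumption about how one might do this, not an existing LAPS tool; 'over_limit' is a hypothetical name.

```shell
#!/bin/sh
# over_limit DIR LIMIT_KB: succeed (exit 0) if DIR uses more than LIMIT_KB kilobytes.
# Hypothetical helper built on 'du -sk', which reports usage in kilobytes.
over_limit() {
    used_kb=$(du -sk "$1" | awk '{print $1}')
    [ "$used_kb" -gt "$2" ]
}

# Example: warn when the home area exceeds roughly 1 GB (1048576 KB).
# over_limit "/home/fab/$USER" 1048576 && echo "Warning: home area is over 1 GB"
```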

Parallel Side Software Testing

The parallel side is where all software changes that are desired to be integrated into LAPS are tested for compilation and correct execution. This is a working superset of the software repository contents; extra files may be present that are used to assist development and testing. A cron on 'speedy' (an IBM) runs using the parallel side software while compilation is done on 'toro' using GNU make located in '/usr/opt/freeware/bin/make'.

Compilation of software is usually done in individual 'src' subdirectories and libraries by running 'make' (possibly including 'make clean', 'make install' and 'make debug' as appropriate). This is in contrast to the README, where running 'make' in the top level $LAPS_SRC_ROOT directory is described. After a software change is tested, the updated parallel code should be checked into the repository with CVS, making sure that all relevant modified components are checked in (using 'cvs commit [filename]'). CVS is a widely used software management system that is nicely suited to the size of our group. Your CVSROOT environment variable should be set to '/usr/nfs/common/lapb/cvsroot'. CVS background documentation should be accessible by trying 'man cvs'.

Although some volatility can be expected on the parallel side, a combination of good communication and working with CVS can help minimize the potential for interference between software developers. It is therefore important that everyone coordinate their work when more than one person is working on the same program. The library is a common software area that multiple programs/routines may be utilizing. A summary of library routines is in a README file. Advance notice of library mods should be given if there is a significant possibility of the mods not being transparent to all users.

In general it is prudent to check for diffs relative to the repository to determine if other work is in progress, before editing or committing a file. To check how a file differs from the repository, type 'cvs diff [filename]'. To summarize how all the files in the current directory differ from the repository, type 'cvs status -l | grep locally'. Putting a temporary lock on files (i.e. doing a 'chmod g-w' after setting yourself as the file owner) while editing and compile/runtime testing is also a good way to signal others that mods are in progress. The 'cvs log' command also tells the who, what, and when for repository file updates.
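
The temporary lock described above could be wrapped in a few small helpers. This is a minimal sketch with hypothetical function names; it assumes you already own the file, since 'chmod' only works for the owner.

```shell
#!/bin/sh
# lock_for_edit FILE: remove group write permission to signal that mods are in progress.
lock_for_edit() {
    chmod g-w "$1"
}

# unlock_after_edit FILE: restore group write permission once the change is committed.
unlock_after_edit() {
    chmod g+w "$1"
}

# is_locked FILE: succeed if FILE is not group-writable.
# The 6th character of 'ls -l' output is the group write bit.
is_locked() {
    [ "$(ls -ld "$1" | cut -c6)" = "-" ]
}
```

Before editing, 'is_locked [filename]' gives a quick check on whether someone else has already signaled work in progress on that file.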

In the 'src/*' directories, there is usually a "primary" author who should be notified in advance about changes. The most important thing is to be aware of what you are committing to the repository, ensuring that other developers' mods are not unknowingly mixed in with yours. Otherwise, common sense and good communication are what is suggested. Software updates to the parallel side and repository should generally be coordinated in a sequential fashion. Simultaneous parallel code updates, where several people are testing separate modified versions of source code, can be tricky; these can be merged later on if it is agreeable to all involved.

LAPS changes will often be made in common areas and in some cases will require a full recompilation. Examples include library, parameter, configure, and Makefile changes. These should be tested in such a way that the ramifications to all affected programs throughout the LAPS tree are carefully considered. For example, if a parameter is added to 'nest7grid.parms', a number of library-related mods and a full rebuild of the parallel side executables will be necessary. The risk of such a rebuild is minimized if it is done relatively early in the day, when the largest number of people are here. An advance e-mail should be sent to the group mentioning the rebuild. Once the build is done, the LAPS monitor (for example) should be checked to see that we are still getting all the products correctly. If a problem is detected, the first step would be to try to debug and fix the problem. The second step would then be an effort to return to the original version of 'nest7grid.parms', perhaps failing over to the original parallel executables (that were hopefully saved beforehand).
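
Saving the executables before a full rebuild makes the second step straightforward. The following is a sketch only: the function names and the '.saved' suffix are hypothetical, not an existing LAPS convention.

```shell
#!/bin/sh
# save_bin BIN_DIR: copy BIN_DIR to BIN_DIR.saved before a full rebuild.
# The '.saved' suffix is a hypothetical naming choice.
save_bin() {
    rm -rf "$1.saved" && cp -pR "$1" "$1.saved"
}

# restore_bin BIN_DIR: replace a failed build with the saved copy.
restore_bin() {
    [ -d "$1.saved" ] || { echo "Error: no saved copy of $1" >&2; return 1; }
    rm -rf "$1" && mv "$1.saved" "$1"
}
```

One would run 'save_bin' on the parallel bin directory just before the rebuild, and 'restore_bin' only if debugging the new build does not succeed.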

Localization on the parallel side should generally be done without doing an automated parameter merge (e.g. in 'laps_localization.pl'). This is so that any test parameter settings in the $LAPS_DATA_ROOT on the parallel side are left intact.

Operational Side Weekly Build

Weekly builds are done automatically to keep the operational side, and its attendant tar file, reasonably in sync with the repository. Keeping the operational side in sync with the repository facilitates operational/parallel output comparisons during software development. The operational side is also the primary vehicle for testing the viability of the latest repository version prior to Web release. The tar file that attends the operational build has a version number and forms the basis of the multi-platform build as discussed in the next section.

Operational side reliability is important. Toward that end, a number of tools exist in CVS and on these Web pages for diagnosing the builds and runtime performance. The build scripts are designed to fail over to the previous build if an error is detected during 'make'. We also try to have someone knowledgeable about the build on hand to monitor its progress/results.

The following outline highlights the main steps of the weekly builds.

toro cron (user oplapb)
    update_operational.pl (output: /home/fab/oplapb/update_operational.log)

        export code from the repository to update the operational software in '/usr/nfs/common/lapb/operational/laps' [soft link]

        create tagged tar file from the updated operational software  (~3MB stored in '/w3/lapb/software/restrct')

        run configure on updated operational code

        build_laps.pl (output: /usr/nfs/common/lapb/operational/laps/build.log.)
            run make  (output: /usr/nfs/common/lapb/operational/laps/make.out)

            if (error detected in make) then
                email statement, abort without install [to laps-bugs] 
                exit (from both build_laps.pl and update_operational.pl)
            endif

            run make lapsplot (independent of error checking functionality)

            move last week's dataroot, including lapsprd and other subdirectories 
                move /data/lapb/operational/laps/data to /data/lapb/operational/laps/old_data

            move last week's etc directory
                move /data/lapb/operational/laps/etc to /data/lapb/operational/laps/old_etc

            move last week's util directory
                move /data/lapb/operational/laps/util to /data/lapb/operational/laps/old_util

            move last week's bin directory
                move /data/lapb/operational/laps/bin to /data/lapb/operational/laps/old_bin

            run make install to update the executables in /bin (output: /usr/nfs/common/lapb/operational/laps/make_install.out)
                generates empty DATAROOT tree (should it?)

            if (error detected in make install) then
                email statement, abort with partial install [to laps-bugs]
                failover to last week's build 
                    restore last week's lapsprd, and other dataroot subdirectories
                        move laps/old_data to laps/data
                    restore last week's etc directory
                        move laps/old_etc to laps/etc
                    restore last week's util directory
                        move laps/old_util to laps/util
                    restore last week's bin directory
                        move laps/old_bin to laps/bin
                exit (from both build_laps.pl and update_operational.pl)

            else
                keep this week's executables 

            endif

            run make install_lapsplot (independent of error checking functionality)

        continue - based on presumed success in build_laps.pl

        merge in last week's lapsprd, and certain other dataroot subdirs (Question: where did last week's lapsprd really go this last time? They were not with ./old_data.)

        laps_localization.pl
            create '/data/lapb/operational/laps/data/static.nest7grid' file

        submit cron to 'ren' (may be commented out)

        append 'sched.append' to 'sched.pl'

        update_multiplatform.sh (submit multi-platform builds)
            brain (solaris)
            ibm (toro)
            ejet (linux-64-bit)
            ijet (linux-32-bit)
            OTHERS AS WELL (see http://laps.noaa.gov/builds/laps_builds.html)

    check_release_driver
        email summary of main 'toro' build [to laps-bugs]
Links to the main log files are provided above, as well as on the LAPS Issues Status Page (click on Results of Latest LAPS Builds). The main log files with paths are listed under the "Overview" section for each build as posted on the Results page, together with any errors flagged via grep commands. Some of this information is also distributed via an email list each time a build is attempted. For comparison, here are sample log files from a previous normal build: Overview, Make output, update_operational.log, build.log. The latest software revisions may be tracked in our REVISIONS file.
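
The set-aside and failover steps in the build outline can be sketched as follows. This is not the actual 'build_laps.pl' logic: the function names are hypothetical, the error check is assumed to be the exit status of the install command, and removing any existing 'old_*' directory before the move is also an assumption.

```shell
#!/bin/sh
# Directories that the outline moves aside before 'make install'.
SUBDIRS="data etc util bin"

# set_aside ROOT: move last week's directories to old_<name>.
set_aside() {
    for d in $SUBDIRS; do
        rm -rf "$1/old_$d"                      # assumption: older copies are discarded
        [ -d "$1/$d" ] && mv "$1/$d" "$1/old_$d"
    done
    return 0
}

# fail_over ROOT: discard the partial install and restore last week's directories.
fail_over() {
    for d in $SUBDIRS; do
        if [ -d "$1/old_$d" ]; then
            rm -rf "$1/$d"
            mv "$1/old_$d" "$1/$d"
        fi
    done
}

# install_with_failover ROOT CMD...: set aside last week's build, run CMD,
# and fail over if CMD returns a non-zero status.
install_with_failover() {
    root=$1; shift
    set_aside "$root"
    if "$@"; then
        echo "install succeeded; keeping this week's executables"
    else
        echo "Error: install failed; failing over to last week's build" >&2
        fail_over "$root"
        return 1
    fi
}
```

A call might look like 'install_with_failover /data/lapb/operational/laps make install', run from the source directory.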

The weekly build on the operational side is automatically constructed from the repository each Monday prior to 12z, in a cron running on 'toro'. With this timing, many of us will be able to double check on Monday that the weekly build was successful and that we are still getting output on the operational side. Note that the automatic build should normally be the only method by which code changes are migrated to the operational side. Manual porting of the changes might compromise the testing process.

The operational source code tree is located in '/usr/nfs/common/lapb/operational/laps'. This in turn is a soft link to '/usr/nfs/common/lapb/operational/laps-m-n-o'. Note that these weekly versions will accumulate and should be purged occasionally. As part of the build, a tar file is constructed from this software and is located in '/w3/lapb/software/restrct'. The tar file, operational side, and the repository are thus all brought into sync. The build procedure tests various components such as the configure and localization scripts. The operational dataroot tree is '/data/lapb/operational/laps/data'. For the most part, we ensure that previously existing operational files aren't compromising the evaluation of the build. One exception to this is that new 'lapsprd' subdirectories must be created by hand on the operational side; the weekly build does not yet automatically do this.
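
When purging, it helps to list the versioned trees that the 'laps' soft link no longer points to. The sketch below only prints candidates and removes nothing; 'stale_versions' is a hypothetical name, and it assumes the 'readlink' command is available on the platform.

```shell
#!/bin/sh
# stale_versions PARENT: print the versioned 'laps-*' directories under PARENT
# that the 'laps' soft link does not currently point to.
stale_versions() {
    current=$(basename "$(readlink "$1/laps")")
    for dir in "$1"/laps-*; do
        [ -d "$dir" ] || continue
        [ "$(basename "$dir")" = "$current" ] || echo "$dir"
    done
}

# Example: stale_versions /usr/nfs/common/lapb/operational
```

Reviewing the printed list by hand before deleting anything keeps a recent version available in case a failover is needed.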

Multiple Platform Builds, Compilation/Runtime Testing

As alluded to above, an experimental tarfile is created from the repository simultaneously with each weekly operational build. This is then automatically ported and compiled on various platforms as well as being localized. Users are free to execute individual programs to see how particular processes run. As these builds are mainly for experimental use, there is no automatic failover like the one we have on the operational side. The directory on each platform for source (LAPS_SRC_ROOT), binaries (LAPSINSTALLROOT), and data (LAPS_DATA_ROOT) is '/usr/nfs/lapb/builds/laps-m-n-o', depending on the latest version number. The path '/usr/nfs/lapb/builds/laps' is also linked in.

Based on the results of the multi-platform tests, posted on the Web, a list of needed software changes can be constructed. These changes can then be inserted back into the parallel side for additional testing, thus closing the loop.

Generic compiler flags can be set for each platform in the 'configure.in' script in the repository. There are constructs there that set the FFLAGS and CFLAGS and OPTIMIZE for each architecture and/or compiler. After editing 'configure.in' on the parallel side, run the '$LAPS_SRC_ROOT/autoconf' script to update 'configure'. At that stage both can be committed to the repository.

Here's an outline of the various build scripts. Everything that 'install_laps' calls is located in the repository. A possible goal is to move more functionality from 'install_laps' into 'install_laps.csh' so that more of the steps are residing in repository scripts.

update_operational.pl
    update_multiplatform.sh
        install_laps_builds
            install_laps
                configure
                install_laps.csh
                    make
                    window_domain_rt.pl
                        localize_domain.pl
                    cronfile.pl
                check_release.csh
            install_onsite (localizes additional domains for a platform)

Web Release

Approximately once per month, we might want to more thoroughly evaluate the operational side (along with the simultaneous updates of the multi-platform port). The operational side is evaluated by qualitative output examination, and perhaps comparison with the parallel side. The multi-platform builds could be evaluated in terms of compilation and execution (at least by hand) as outlined above. Once these are all working smoothly over about a week's period, we can create a new export Web release from the most recent weekly tarfile. Note that the tarfile's contents should match the code placed on the operational side during the testing period. If the LAPS runs are not working, we can punt and try again on the next weekly build.

The monthly builds will give us more up-to-date software on the Web. This will help us to be comfortable with the Web posting at any given time, thus allowing short-notice porting to as wide a range of users as possible. We will also have good assurance that ongoing software development and bug fixes will propagate through the testing system in a reasonably timely manner. There's always "the next release" if your change doesn't make it into the current one. At the same time, the needs of various projects should be anticipated enough in advance to allow for a reliable and smooth testing process. In the event a bug is noticed in the Web release after the fact, we can document this in the "Release Notes" link on the software web page, or in dire situations we can revert to a previous build that is stored on our system. The only downside here is that the test dataset may become somewhat out of sync with the software that is posted.

A 'csh' script that posts the new tarfile can be run as follows:
$ su - oplapb (a linux desktop or tom should work)
$ cd ~oplapb/bin
$ update_www_release m-n-o dd/mm/yyyy y

As arguments on the command line, m-n-o denotes the version number of LAPS, dd/mm/yyyy denotes the date the tarfile was built (usually a Monday), and y denotes that we want to update the testdata tar files as part of the Web posting.
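
Since a mistyped argument could post the wrong tarfile, a pre-check of the command line may be useful. The sketch below is hypothetical and is not part of 'update_www_release'; it assumes that m, n, and o are all numeric and that the date uses two-digit day and month.

```shell
#!/bin/sh
# check_release_args VERSION DATE FLAG: validate the argument formats described above.
# Hypothetical helper; assumes a numeric m-n-o version and a dd/mm/yyyy date.
check_release_args() {
    [ $# -eq 3 ] || { echo "Error: expected 3 arguments" >&2; return 1; }
    echo "$1" | grep -Eq '^[0-9]+-[0-9]+-[0-9]+$' \
        || { echo "Error: version '$1' is not of the form m-n-o" >&2; return 1; }
    echo "$2" | grep -Eq '^[0-9]{2}/[0-9]{2}/[0-9]{4}$' \
        || { echo "Error: date '$2' is not of the form dd/mm/yyyy" >&2; return 1; }
    if [ "$3" = "y" ]; then
        echo "version $1, built $2, test data will be updated"
    else
        echo "version $1, built $2, test data left as is"
    fi
}
```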

This 'update_www_release' script posts the tar file, README, and the test data. The test data consists of raw data from /public and /data/fxa, as well as LAPS outputs from the 'lapsprd' directory. The raw data is tarred from file lists. These file lists in turn are generated from the 'followup_fsl' script that is part of the operational LAPS cron.

The access list for the LAPS Software Page is contained internally within '/w3/lapb/cgi/LAPS_SOFTWARE.cgi' and is easily modified by editing that script. Users can also be registered in the 'laps-users' list by adding their e-mail address to '/w3/lapb/laps-users/addresses.txt.full'.

Log Files

To assist in problem detection, it is suggested that we use the key words "Error" and "Warning" in log file output. Generally, "Warning" would be something important to know, though not something really out of the ordinary. "Error" would indicate a more significant problem, probably requiring corrective action. Either uppercase or lowercase may be used.
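
With these key words in place, a log file can be scanned with a case-insensitive 'grep'. The helper below is a minimal sketch with a hypothetical name, not one of the existing build tools.

```shell
#!/bin/sh
# count_flagged LOGFILE: print how many lines mention "error" and "warning",
# in any mix of uppercase and lowercase.
count_flagged() {
    # 'grep -c' exits non-zero when the count is 0, so '|| true' keeps going.
    errors=$(grep -ic 'error' "$1" || true)
    warnings=$(grep -ic 'warning' "$1" || true)
    echo "errors=$errors warnings=$warnings"
}

# Example: count_flagged /home/fab/oplapb/update_operational.log
```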

More info...

Additional related topics are on Jim Edwards' home page.