







|

|
This section provides access to publications
associated with my research projects.
A list of paper titles organized by subject follows.
To see the abstract
(and citation) for a paper, click on its title.
With most abstracts there will be an icon. If you click on that icon,
the PostScript/PDF/DOC form
of the paper will be retrieved for viewing and optional printing.
|
|
Books & Book Chapters
|
|
The Energy Gap: How to Solve the World Energy Crisis, Preserve the Environment & Save Civilization, Doug L. Hoffman and Allen Simmons,
The Resilient Earth Press, July 2010.
Performance Modeling of Enterprise Grid Systems,
chapter in Data Engineering: Mining, Information, and Intelligence,
Springer, Series: International Series in Operations Research & Management Science, Vol. 132,
Chan, Yupo; Talburt, John; Talley, Terry M. (Eds.), 2009.
The Resilient Earth: Science, Global Warming, and the Future of Humanity, Doug L. Hoffman and Allen Simmons,
Booksurge Publishing, October 2008.
|
|
Published Papers
|
|
A Forecasting Capability Study of Empirical Mode Decomposition for the Arrival Time of a Parallel Batch System,
April, 2010.
|
Modeling and Simulation of HPC Systems
Through Job Scheduling Analysis,
April, 2010.
|
Fairshare Scheduling – A Case Study,
March, 2010.
|
Application of Empirical Mode Decomposition to the Arrival Time Characterization of a Parallel Batch System Using System Logs,
September, 2009.
Capacity Planning of a Commodity Cluster in an Academic Environment: A Case Study,
April, 2008.
A Case Study on Grid Performance Modeling,
November, 2006.
Initial Starting Point Analysis for K-Means
Clustering: A Case Study,
March, 2006.
Adaptive Automatic Grid Reconfiguration
Using Workload Phase Identification,
December, 2005.
Comparison of Protein Structures
by Transformation into Dihedral Angle Sequences,
August, 1996.
BioSCAN: A Dynamically Reconfigurable
Systolic Array for Biosequence Analysis, June, 1996.
BioSCAN: A Network Sharable Computational
Resource for Searching Biosequence Databases, March, 1996.
Rapid Protein Structure Classification
using One-dimensional Structure Profiles on the BioSCAN Parallel
Computer, October, 1995.
Pseudotorsional OCCO backbone angle as a single
descriptor of protein secondary structure, May, 1995.
A Scalable Systolic Multiprocessor System for
Analysis of Biological Sequences, March, 1993.
|
|
Technical Notes
|
|
The Energy Gap: How to Solve the World Energy Crisis, Preserve the Environment & Save Civilization,
Doug L. Hoffman and Allen Simmons,
The Resilient Earth Press, July 2010.
ABSTRACT:
Humans have a trait that distinguishes us from all other species: the ability to use fire. We turn on a switch and light comes into our homes. With the turn of a key, vehicles take us where we want to go. We adjust a thermostat in our homes to make us warm or cool. These are everyday events we hardly think about. It took centuries of vision, science and engineering to achieve this comfort-point in our long evolutionary journey. Today, an average person lives better than kings lived several centuries ago. As we revealed the facts behind global warming in our last book, The Resilient Earth, we take the same tack in our latest work, The Energy Gap. In its pages, we present the hard science and engineering that will close a looming energy gap for our country and the world. There is also a warning. If we choose the political route, the activist route, the human race will slide backwards for the first time since the Industrial Revolution. If we choose the correct path, as revealed in The Energy Gap, our species will continue its forward march towards a brighter future for all on Earth.
|
A Forecasting Capability Study of Empirical Mode Decomposition for the Arrival
Time of a Parallel Batch System,
Linh Ngo, Amy Apon, and Doug Hoffman,
In Proceedings 7th International Conference on Information Technology: New Generations, April, 2010.
ABSTRACT:
This paper demonstrates the feasibility and potential of
applying empirical mode decomposition (EMD) to
forecast the arrival time behaviors in a parallel batch
system. An analysis of the workload records shows the
existence of daily and weekly patterns within the
workload. Results show that the intrinsic mode functions
(IMF), products of the sifting/decomposition process of
EMD, produce a better prediction than the original
arrival histogram when used in a simple weight-matching
prediction technique. Promising applications include the
implementation of an EMD/neural network combination.
|
Modeling and Simulation of HPC Systems Through Job Scheduling Analysis,
W. B. Hurst, S. Ramaswamy, R. B. Lenin and D. Hoffman,
In Proceedings of ALAR 2010 Conference on Applied Research in Information
Technology, April, 2010.
ABSTRACT:
A key component needed for researching High Performance
Cluster (HPC) Systems can be found through simulation of
the HPC system. This paper presents a comparative analysis of
performance characteristics found from the operations of an
“active” HPC system and a “simulated” HPC system.
|
Fairshare Scheduling – A Case Study,
Hung Bui, Wesley Emeneker, Amy Apon, Doug Hoffman and Larry Dowdy,
In Proceedings of The 11th LCI International Conference on
High-Performance Clustered Computing, March, 2010.
ABSTRACT:
Scheduling and resource management are important in
optimizing multiprocessor cluster resource allocation.
Resources must be multiplexed to service requests of varied
importance, and the policy chosen to manage this
multiplexing can have enormous impact on throughput and
response time. Fairshare scheduling is a way to manage
application performance by dynamically allocating shares
of system resources among competing users. The primary
objective of this paper is to present an in-depth case study
of fairshare scheduling. In this case study, an in-depth
sensitivity analysis of the various tunable parameters in
fair-share scheduling techniques will be provided. The
starting points for the study are scheduler log files
collected from two production systems, one a production
industry cluster and the second a university cluster. The
approach to the case study is in two parts. First, using
well-known techniques in the field, workload models for the
two different environments are built and analyzed.
Secondly, after the models are developed, they are
presented to a fairshare scheduler under what-if scenarios.
The experimental results are examined to evaluate the
performance of fairshare scheduling.
|
Performance Modeling of Enterprise Grid Systems,
D. L. Hoffman, A. Apon, L. Dowdy, B. Lu, et al., in Data Engineering: Mining, Information, and Intelligence, Series: International Series in Operations Research & Management Science, Vol. 132,
Chan, Yupo; Talburt, John; Talley, Terry M. (Eds.), Springer, 2009.
ABSTRACT:
Modeling has long been recognized as an invaluable tool for predicting the performance behavior of computer systems. Modeling software, both commercial and open source, is widely used as a guide for the development of new systems and the upgrading of existing ones. Unfortunately, no set of comprehensive tools exists for modeling complex distributed computing environments such as the ones found in emerging grid deployments. This chapter addresses concepts, methodologies, and tools that are useful when designing, implementing, and tuning the performance in grid and cluster environments.
|
Application of Empirical Mode Decomposition to the Arrival Time Characterization of a Parallel Batch System Using System Logs,
Linh Ngo, Baochuan Lu, Hung Bui, Amy Apon, Nathan Hamm, Larry Dowdy, Doug Hoffman and Denny Brewer,
In Proceedings of the 2009 International Conference on Modeling, Simulation, and Visualization Methods, July, 2009.
ABSTRACT:
Traditionally, workload models of large-scale production computer clusters are created from system logs for the purpose of analyzing and predicting the performance of these systems. Such logs are often large, complex, and unwieldy. For conciseness, the system log can be approximated by finding a hyperexponential distribution that captures the workload dynamics as closely as possible. Using this technique, the workload model is able to match closely the global statistical measurements of the original system log. However, using a hyperexponential distribution to synthetically regenerate job arrival times in a simulation model does not capture the realistic randomness of bursts of arrivals in the original log. In this paper, a new workload modeling method based on Empirical Mode Decomposition (EMD) is described. EMD provides a compromise between the full complexity of the original log data and a simple hyperexponential representation. Likewise, the EMD approach provides a compromise between the accuracy associated with the log data and the coarse approximation using the hyperexponential representation. The tradeoff of using an EMD approach can be effective in certain performance modeling studies.
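The hyperexponential approach mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's fitted model: the branch probabilities and rates below are hypothetical values chosen only to show how synthetic job arrival times might be regenerated from a two-phase hyperexponential distribution, and the function names are illustrative.

```python
import random

def sample_hyperexponential(probs, rates, rng):
    """Draw one interarrival gap: with probability probs[i],
    the gap is exponential with rate rates[i]."""
    branch = rng.choices(range(len(rates)), weights=probs, k=1)[0]
    return rng.expovariate(rates[branch])

def hyperexponential_mean(probs, rates):
    """Theoretical mean of the distribution: sum of p_i / rate_i."""
    return sum(p / r for p, r in zip(probs, rates))

def generate_arrivals(n, probs, rates, seed=0):
    """Build n synthetic arrival times by accumulating interarrival gaps."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(n):
        t += sample_hyperexponential(probs, rates, rng)
        arrivals.append(t)
    return arrivals

# Hypothetical parameters (not taken from the paper):
# 90% of gaps are short (rate 1.0), 10% are long (rate 0.05).
PROBS = [0.9, 0.1]
RATES = [1.0, 0.05]

arrivals = generate_arrivals(20000, PROBS, RATES)
empirical_mean = arrivals[-1] / len(arrivals)
print("theoretical mean gap:", hyperexponential_mean(PROBS, RATES))
print("empirical mean gap:  ", empirical_mean)
```

A generator of this kind reproduces global statistics such as the mean gap, but each gap is drawn independently, which is why it cannot reproduce the correlated bursts of arrivals that the abstract describes.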
|
The Resilient Earth: Science, Global Warming, and the Future of Humanity,
Doug L. Hoffman and Allen Simmons,
Booksurge Publishing, October 2008.
ABSTRACT:
A million years after the birth of our sun, the violent explosion of a nearby supernova nearly ended life on Earth before it began. Over the next four and a half billion years, forces of nature shaped our planet and the life it harbored. Barely surviving the traumatic birth of the Moon, buffeted by supernovae, and bombarded by asteroids, the resilient Earth endured. And despite planet-freezing ice ages, devastating mass extinctions, and ever changing climate, life not only survived, it thrived. Today, we are told all life on Earth is threatened by a new peril--human-caused global warming. The Resilient Earth presents the science behind global warming for a general audience, separating fact from fiction and truth from exaggeration.
|
Capacity Planning of a Commodity Cluster in an Academic Environment: A Case Study,
Baochuan Lu, Linh Ngo, Hung Bui, Amy Apon, Nathan Hamm,
Larry Dowdy, Doug Hoffman and Denny Brewer,
9th LCI International Conference on
High-Performance Clustered Computing, April 2008.
ABSTRACT:
In this paper, the design of a simulation model for evaluating
two alternative supercomputer configurations in an academic environment
is presented. The workload is analyzed and modeled, and its
effect on the relative performance of both systems is studied. The
Integrated Capacity Planning Environment (ICPE) toolkit, developed for
commodity cluster capacity planning, is successfully applied to the target
environment. The ICPE is a tool for workload modeling, simulation
modeling, and what-if analysis. A new characterization strategy is applied
to the workload to more accurately model commodity cluster workloads.
Through “what-if” analysis, the sensitivity of the baseline system
performance to workload change, and also the relative performance of
the two proposed alternative systems are compared and evaluated. This
case study demonstrates the usefulness of the methodology and the
applicability of the tools in gauging system capacity and making design
decisions.
|
A Case Study on Grid Performance Modeling,
B. Lu, A. Apon, L. Dowdy, F. Robinson, D. Hoffman, and D. Brewer,
International Conference on Parallel and Distributed Computing Systems,
November 13, 2006.
ABSTRACT:
The purpose of this case study is to develop a performance
model for an enterprise grid for performance management
and capacity planning. The target environment includes
grid applications such as health-care and financial services
where the data is located primarily within the resources of a
worldwide corporation. The approach is to build a discrete
event simulation model for a representative work-flow grid.
Five work-flow classes, found using a customized k-means
clustering algorithm, characterize the workload of the grid.
Analyzing the gap between the simulation and measurement
data validates the model. The case study demonstrates
that the simulation model can be used to predict the
grid system performance given a workload forecast. The
model is also used to evaluate alternative scheduling strategies.
The simulation model is flexible and easily incorporates
several system details.
|
Initial Starting Point Analysis for K-Means Clustering: A Case Study,
F. Robinson, A. Apon, D. Brewer, L. Dowdy, D. Hoffman, B. Lu,
Proceedings of ALAR 2006 Conference on Applied Research in Information
Technology, March, 2006.
ABSTRACT:
Workload characterization is an important part of systems performance modeling. Clustering is a method used to find classes of jobs within workloads. K-Means is one of the most popular clustering algorithms. Initial starting point values are needed as input parameters when performing k-means clustering. This paper shows that the results of running the k-means algorithm on the same workload will vary depending on the values chosen as initial starting points. Fourteen methods of composing initial starting point values are compared in a case study. The results indicate that a synthetic method, scrambled midpoints, is an effective starting point method for k-means clustering.
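The sensitivity to starting points described in this abstract can be seen in a small sketch. It uses a plain one-dimensional k-means (Lloyd's iteration) on a made-up data set; the data, the two sets of starting points, and the function names are assumptions for illustration, and none of the fourteen methods from the paper (including scrambled midpoints) is reproduced here.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Plain Lloyd's k-means on 1-D data, starting from the given centers."""
    centers = list(centers)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def sum_squared_error(points, centers):
    """Total squared distance from each point to its nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)

# Hypothetical workload feature with three obvious groups.
data = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0, 21.0, 22.0, 23.0]

spread_start = kmeans_1d(data, [2, 12, 22])   # one start per group
crowded_start = kmeans_1d(data, [1, 2, 3])    # all starts in the first group

print(spread_start, sum_squared_error(data, spread_start))
print(crowded_start, sum_squared_error(data, crowded_start))
```

Both runs converge, but the crowded start settles on a much worse local optimum, which is the kind of variation that motivates comparing starting point methods.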
|
Adaptive Automatic Grid Reconfiguration Using Workload Phase
Identification,
B. Lu, M. Tinker, A. Apon, D. Hoffman, and L. Dowdy,
Proceedings of EScience 2005, December, 2005.
ABSTRACT:
The purpose of this study is to develop an adaptive model
of a very large scale data processing and storage environment.
The target environment includes grid applications
such as health-care and finance in which the data may be
located primarily within the resources of a worldwide corporation.
The approach is to use phase identification techniques
that can detect over-utilized grid resources, and then
to make dynamic decisions to reassign additional resources
to that portion of the application processing. Two phase
identification techniques are proposed, a variation technique
and a real-time threshold-based technique. The techniques
are validated with a simulation model and a case
study using measured data from a production grid environment.
The case study demonstrates that phase identification
techniques can be used as the intelligent component of
a reactive mechanism for a grid to adapt to changing environmental
conditions by dynamic automatic reconfiguration.
Results show that threshold based phase identifying
techniques combined with dynamic resource allocation capabilities
are effective in alleviating performance hot spots
and improving response time in a large scale data grid.
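As a rough illustration of the threshold-based idea in this abstract, the sketch below scans a series of utilization samples and reports the intervals in which a resource stays above a threshold. The threshold value, the minimum phase length, and the sample data are all assumptions made for the example; the paper's actual technique and parameters are not given here.

```python
def find_overutilized_phases(utilization, threshold=0.8, min_length=3):
    """Return (start, end) index pairs, end exclusive, where utilization
    stays above threshold for at least min_length consecutive samples."""
    phases = []
    start = None
    for i, u in enumerate(utilization):
        if u > threshold:
            if start is None:
                start = i          # a candidate phase begins
        else:
            if start is not None and i - start >= min_length:
                phases.append((start, i))
            start = None           # short spikes are discarded
    # Close a phase that runs to the end of the series.
    if start is not None and len(utilization) - start >= min_length:
        phases.append((start, len(utilization)))
    return phases

# Hypothetical utilization samples for one grid resource (fraction busy).
samples = [0.3, 0.5, 0.9, 0.95, 0.85, 0.4, 0.9, 0.3, 0.82, 0.88, 0.91, 0.97]
phases = find_overutilized_phases(samples)
print(phases)
```

In a reactive setup, each reported phase would be the trigger for assigning additional resources; requiring a minimum length is one simple way to avoid reacting to isolated spikes.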
|
Comparison of Protein Structures by Transformation
into Dihedral Angle Sequences,
D. L. Hoffman, PhD dissertation, University of North
Carolina at Chapel Hill, 1996.
ABSTRACT:
Proteins are large complex organic molecules that are essential to the
existence of life. Decades of study have revealed that proteins having
different sequences of amino acids can possess very similar
three-dimensional structures. To date, protein structure comparison
methods have been accurate but costly in terms of computer time. This
dissertation presents a new method for comparing protein structures using
dihedral transformations. Atomic XYZ coordinates are transformed into a
sequence of dihedral angles, which is then transformed into a sequence of
dihedral sectors. Alignment of two sequences of dihedral
sectors reveals similarities between the original protein
structures. Experiments have shown that this method detects structural
similarities between sequences with less than 20% amino acid sequence
identity, finding structural similarities that would not have been
detected using amino acid alignment techniques. Comparisons can be
performed in seconds that had previously taken minutes or hours.
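The transformation described in this abstract, from XYZ coordinates to dihedral angles to dihedral sectors, can be sketched as follows. The dihedral formula is the standard atan2 formulation for four points; the 30-degree sector width, the choice of consecutive points along the chain, and the function names are assumptions for illustration and are not taken from the dissertation.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle, in degrees, defined by four points."""
    b0 = sub(p0, p1)
    b1 = sub(p2, p1)
    b2 = sub(p3, p2)
    norm = math.sqrt(dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = sub(b0, tuple(dot(b0, b1) * x for x in b1))
    w = sub(b2, tuple(dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(dot(cross(b1, v), w), dot(v, w)))

def sector_sequence(points, sector_width=30):
    """Map each run of four consecutive points to a sector index.
    The sector width is a hypothetical choice."""
    n_sectors = 360 // sector_width
    sectors = []
    for i in range(len(points) - 3):
        angle = dihedral_degrees(*points[i:i + 4])
        sectors.append(int((angle + 180) // sector_width) % n_sectors)
    return sectors

# Hypothetical five-point chain, giving two dihedral angles.
chain = [(1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 1, 1), (2, 1, 2)]
print(sector_sequence(chain))
```

Once two structures are reduced to sequences of small integers like these, they can be compared with ordinary sequence alignment methods.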
|
BioSCAN: A Dynamically Reconfigurable Systolic Array for
Biosequence Analysis,
Raj K. Singh, W. D. Dettloff, V. L. Chi, D. L. Hoffman, S. G. Tell,
C. T. White, S. F. Altschul, and B. W. Erickson, Proc. of CERCS96,
National Science Foundation, Arlington, VA, June 22-24, 1996.
ABSTRACT:
We describe the design, implementation, and deployment via the
Internet of BioSCAN, an application-specific computer system for
the rapid determination of statistically significant alignments
of biopolymer (DNA, RNA, protein) sequences. BioSCAN continues
to outperform other systems designed to perform this basic task
of molecular biology, which continues to grow in magnitude and
importance. The BioSCAN system is hosted by a general-purpose
workstation containing a special-purpose hardware engine that
accelerates the core algorithm for comparing two biosequences.
Careful partitioning of the computational tasks between hardware
and software provides not only high performance but also
programmability. The BioSCAN system can compare a sequence of
up to 12,992 characters with an arbitrarily large database
containing arbitrarily long sequences at a rate of 2 million
database characters per second. This rate is nearly 1,000 times
greater than the rate achieved by a state-of-the-art workstation
using software alone. This network-sharable computational
resource is accessible interactively via the World Wide Web
using Mosaic, Netscape or other client software.
|
BioSCAN: A Network-Sharable Computational Resource for Searching
Biosequence Databases,
Raj K. Singh, D. L. Hoffman, S. G. Tell, and C. T. White,
Computer Applications in the Biosciences, Vol. 12, No. 3, 1996,
pp. 191-196.
ABSTRACT:
We describe a network sharable, interactive computational tool for
rapid and sensitive search and analysis of biomolecular sequence
databases such as GenBank, GenPept, Protein Identification Resource,
and SWISS-PROT. The resource is accessible via the World Wide Web
using popular client software such as Mosaic and Netscape. The client
software is freely available on a number of computing platforms including
Macintosh, IBM-PC, and Unix workstations.
|
Rapid Protein Structure Classification using One-dimensional Structure
Profiles on the BioSCAN Parallel Computer,
D. L. Hoffman, S. Laiter, Raj K. Singh, I. I. Vaisman, and A. Tropsha,
Computer Applications in the Biosciences, Vol. 11, No. 6, 1995, pp. 675-679.
ABSTRACT:
Rapid growth of protein structure databases in recent years requires an
effective approach for objective comparison and classification of
deposited protein structures. We describe a novel method for structure
comparison and classification based on the alignment of one-dimensional
structure profiles. These profiles are obtained by calculating
the OCCO pseudodihedral angles (formed by O-C-C-O atoms of carbonyl
groups of consecutive amino acid residues) from protein three-dimensional
coordinates. These angle measurements are then converted into a 24-letter
alphabet, and the protein structures are represented by sequences of letters
from this alphabet. The BioSCAN parallel computer, designed for primary
sequence alignment, is used to rapidly align and classify these
one-dimensional structure profiles. We have developed and implemented
a weighted scoring matrix to identify structural classes based on commonly
found structural motifs. The results of our experiments are in good
agreement with the traditional protein structure classification schemes.
One-dimensional structure profiles significantly improve efficiency of
structure comparison and classification.
|
Pseudotorsional OCCO backbone angle as a single descriptor of protein secondary structure,
Sergei Laiter, Doug L. Hoffman, Raj K. Singh, Iosif I. Vaisman, and
Alexander Tropsha, Protein Science, Volume 4, Issue 8, 1995, pp. 1633-1643.
ABSTRACT:
Protein secondary structure is conventionally identified using characteristic ranges of two backbone torsional angles φ and ψ. We suggest that the secondary structure can be adequately characterized by a single descriptor, the O(i-1)-C(i-1)-C(i)-O(i) (where i is the residue number) pseudotorsional backbone angle. A set of 102 structurally distinct protein chains from the Protein Data Bank was used to evaluate the adequacy of this descriptor. We find that a specific range of OCCO angles corresponds to each major secondary structure. The complete range of OCCO angles (-180° to 179°) was broken into 18 consecutive subranges of 20° each, and each subrange was assigned a letter. Thus, the OCCO profiles for each protein in the database were translated into a sequence of letters. The Needleman-Wunsch primary sequence alignment algorithm was then used for secondary/tertiary structure comparison and alignment. Preliminary results indicate that this new approach has a significant potential for rapid identification of fold families in the Protein Data Bank.
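The two steps in this abstract, binning angles into 18 subranges of 20° and aligning the resulting letter sequences, can be sketched as below. The letters A through R, the scoring values (match +1, mismatch -1, gap -1), and the example angle profiles are assumptions for illustration; the paper's actual letter assignment and scoring scheme are not stated here.

```python
def angles_to_letters(angles):
    """Translate angles in degrees (-180 to 179) into letters,
    one letter per 20-degree subrange. The A..R labels are hypothetical."""
    letters = []
    for angle in angles:
        index = int((angle + 180) // 20) % 18
        letters.append(chr(ord("A") + index))
    return "".join(letters)

def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score by Needleman-Wunsch dynamic programming.
    The scoring values are hypothetical defaults."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

# Hypothetical OCCO angle profiles for two short chains.
profile_a = angles_to_letters([-50, -50, -50, 130, 130])
profile_b = angles_to_letters([-50, -50, 130, 130])

print(profile_a, profile_b)
print(needleman_wunsch_score(profile_a, profile_b))
```

Only the alignment score is computed here; recovering the alignment itself would require keeping the full table and tracing back through it.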
|
A Scalable Systolic Multiprocessor for Analysis of Biological
Sequences,
Raj K. Singh, S. G. Tell, C. T. White, D. L. Hoffman, V. L. Chi, and
B. W. Erickson, Proc. of the Symposium on Integrated Systems,
Seattle, WA, March 3-5, 1993, MIT Press, Cambridge, MA, pp. 168-182.
ABSTRACT:
The design and implementation of an application-specific, fault-tolerant,
and scalable multiprocessor system called BioSCAN (Biological Sequence
Comparative Analysis Node) are described. Discussed are system
partitioning and integration, functional decomposition between hardware
and software, the algorithm and its implementation in VLSI, the early
results of using the system, and comparison with other hardware and
software solutions for biological sequence analysis.
|
Design of the BioSCAN Server Software,
D. L. Hoffman, Department of Computer Science, University of North
Carolina, Chapel Hill, NC. TR93-049, 1993.
ABSTRACT:
This paper is an exploration of the design goals for the Biological
Sequence Comparative Analysis Node (BioSCAN) network server software and of
the impact that these goals had on the overall structure and implementation
of that software. The primary audience for this paper consists of computer
scientists and computational biologists involved in developing similar
server software. Biologists who are users of the BioSCAN computational
node and have a desire for deeper understanding of how the server functions
will also find this paper useful. It is assumed that the reader is
familiar with UNIX and with basic networking concepts. The
peculiarities of implementing a network server for a batch resource will be
identified and the solutions chosen by the BioSCAN design team explained.
Research for the BioSCAN project, including the design of the server
software, was supported in part by NSF grant MIP-9024585.
|
A Comparison of the BioSCAN Algorithm on Multiple
Architectures,
D. L. Hoffman, Department of Computer Science, University of North
Carolina, Chapel Hill, NC. TR93-050, 1993.
ABSTRACT:
This paper compares the performance characteristics of the BioSCAN
biological sequence matching algorithm on several different computer
architectures. The architectures examined are a conventional RISC
general purpose uni-processor, a vector oriented "supercomputer", and a
Single Instruction, Multiple Data (SIMD) massively parallel computer.
These architectures are represented by the following hardware platforms: a
Sun 490 RISC, a Convex 240, and a MasPar MP-1. The performance of these
three platforms is compared with that of the custom built BioSCAN hardware.
|
|