







This section provides access to publications
associated with my research projects.
A list of paper titles organized by subject follows.
To see the abstract
(and citation) for a paper, click on its title.
With most abstracts there will be an icon. If you click on that icon,
the PostScript/PDF/DOC version
of the paper will be retrieved for viewing and optional printing.
|
|
Books & Book Chapters
|
|
Performance Modeling of Enterprise Grid Systems
chapter in Data Engineering: Mining, Information, and Intelligence,
Springer, Series: International Series in Operations Research & Management Science, Vol. 132,
Chan, Yupo; Talburt, John; Talley, Terry M. (Eds.), 2009.
The Resilient Earth: Science, Global Warming, and the Future of Humanity, Doug L. Hoffman and Allen Simmons,
Booksurge Publishing, October 2008.
|
|
Published Papers
|
|
Application of Empirical Mode Decomposition to the Arrival Time Characterization of a Parallel Batch System Using System Logs,
September, 2009.
Capacity Planning of a Commodity Cluster in an Academic Environment: A Case Study,
April, 2008.
A Case Study on Grid Performance Modeling,
November, 2006.
Initial Starting Point Analysis for K-Means
Clustering: A Case Study,
March, 2006.
Adaptive Automatic Grid Reconfiguration
Using Workload Phase Identification,
December, 2005.
Comparison of Protein Structures
by Transformation into Dihedral Angle Sequences,
August, 1996.
BioSCAN: A Dynamically Reconfigurable
Systolic Array for Biosequence Analysis,
June, 1996.
BioSCAN: A Network Sharable Computational
Resource for Searching Biosequence Databases,
March, 1996.
Rapid Protein Structure Classification
using One-dimensional Structure Profiles on the BioSCAN Parallel
Computer,
October, 1995.
Pseudotorsional OCCO backbone angle as a single
descriptor of protein secondary structure,
May, 1995.
A Scalable Systolic Multiprocessor System for
Analysis of Biological Sequences,
March, 1993.
|
|
Technical Notes
|
|
Performance Modeling of Enterprise Grid Systems,
D. L. Hoffman, A. Apon, L. Dowdy, B. Lu, et al., in Data Engineering: Mining, Information, and Intelligence, Series: International Series in Operations Research & Management Science, Vol. 132,
Chan, Yupo; Talburt, John; Talley, Terry M. (Eds.), Springer, 2009.
ABSTRACT:
Modeling has long been recognized as an invaluable tool for predicting the performance behavior of computer systems. Modeling software, both commercial and open source, is widely used as a guide for the development of new systems and the upgrading of existing ones. Unfortunately, no set of comprehensive tools exists for modeling complex distributed computing environments such as the ones found in emerging grid deployments. This chapter addresses concepts, methodologies, and tools that are useful when designing, implementing, and tuning the performance in grid and cluster environments.
|
Application of Empirical Mode Decomposition to the Arrival Time Characterization of a Parallel Batch System Using System Logs,
Linh Ngo, Baochuan Lu, Hung Bui, Amy Apon, Nathan Hamm, Larry Dowdy, Doug Hoffman and Denny Brewer,
In Proceedings of the 2009 International Conference on Modeling, Simulation, and Visualization Methods, July, 2009.
ABSTRACT:
Traditionally, workload models of large-scale production computer clusters are created from system logs for the purpose of analyzing and predicting the performance of these systems. Such logs are often large, complex, and unwieldy. For conciseness, the system log can be approximated by finding a hyperexponential distribution that captures the workload dynamics as closely as possible. Using this technique, the workload model is able to match closely the global statistical measurements of the original system log. However, using a hyperexponential distribution to synthetically regenerate job arrival times in a simulation model does not capture the realistic randomness of bursts of arrivals in the original log. In this paper, a new workload modeling method based on Empirical Mode Decomposition (EMD) is described. EMD provides a compromise between the full complexity of the original log data and a simple hyperexponential representation. Likewise, the EMD approach provides a compromise between the accuracy associated with the log data and the coarse approximation using the hyperexponential representation. The tradeoff of using an EMD approach can be effective in certain performance modeling studies.
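As an illustration of the hyperexponential baseline mentioned in this abstract, here is a minimal Python sketch that regenerates synthetic job arrival times from a two-branch hyperexponential distribution. It is not the code used in the paper: the function names, branch probabilities, and rates are all hypothetical, chosen only to show the idea. With probability probs[i], an interarrival time is drawn from an exponential distribution with rate rates[i].

```python
import random


def sample_hyperexponential(probs, rates, rng):
    """Draw one interarrival time: pick branch i with probability probs[i],
    then sample an exponential with rate rates[i]."""
    u = rng.random()
    cumulative = 0.0
    for p, rate in zip(probs, rates):
        cumulative += p
        if u < cumulative:
            return rng.expovariate(rate)
    # Guard against floating-point rounding when probs sum to just under 1.
    return rng.expovariate(rates[-1])


def hyperexponential_mean(probs, rates):
    """Theoretical mean interarrival time: sum of p_i / rate_i."""
    return sum(p / r for p, r in zip(probs, rates))


def synthetic_arrivals(n, probs, rates, seed=0):
    """Generate n arrival timestamps by accumulating interarrival times."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(n):
        t += sample_hyperexponential(probs, rates, rng)
        arrivals.append(t)
    return arrivals


# Hypothetical parameters: half the gaps are "fast" (rate 2.0), half "slow" (rate 1.0).
arrivals = synthetic_arrivals(20000, probs=[0.5, 0.5], rates=[1.0, 2.0])
print(arrivals[-1] / len(arrivals), hyperexponential_mean([0.5, 0.5], [1.0, 2.0]))
```

A generator like this matches global statistics such as the mean, but each gap is drawn independently, which is why it cannot reproduce the bursts of arrivals seen in a real log.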
|
The Resilient Earth: Science, Global Warming, and the Future of Humanity,
Doug L. Hoffman and Allen Simmons,
Booksurge Publishing, October 2008.
ABSTRACT:
A million years after the birth of our sun, the violent explosion of a nearby supernova nearly ended life on Earth before it began. Over the next four and a half billion years, forces of nature shaped our planet and the life it harbored. Barely surviving the traumatic birth of the Moon, buffeted by supernovae, and bombarded by asteroids, the resilient Earth endured. And despite planet-freezing ice ages, devastating mass extinctions, and ever changing climate, life not only survived, it thrived. Today, we are told all life on Earth is threatened by a new peril--human-caused global warming. The Resilient Earth presents the science behind global warming for a general audience, separating fact from fiction and truth from exaggeration.
|
Capacity Planning of a Commodity Cluster in an Academic Environment: A Case Study,
Baochuan Lu, Linh Ngo, Hung Bui, Amy Apon, Nathan Hamm,
Larry Dowdy, Doug Hoffman and Denny Brewer,
9th LCI International Conference on
High-Performance Clustered Computing, April 2008.
ABSTRACT:
In this paper, the design of a simulation model for evaluating
two alternative supercomputer configurations in an academic
environment is presented. The workload is analyzed and modeled, and its
effect on the relative performance of both systems is studied. The
Integrated Capacity Planning Environment (ICPE) toolkit, developed for
commodity cluster capacity planning, is successfully applied to the
target environment. The ICPE is a tool for workload modeling, simulation
modeling, and what-if analysis. A new characterization strategy is
applied to the workload to more accurately model commodity cluster
workloads. Through “what-if” analysis, the sensitivity of the baseline system
performance to workload change, and also the relative performance of
the two proposed alternative systems are compared and evaluated. This
case study demonstrates the usefulness of the methodology and the
applicability of the tools in gauging system capacity and making design
decisions.
|
A Case Study on Grid Performance Modeling,
B. Lu, A. Apon, L. Dowdy, F. Robinson, D. Hoffman, and D. Brewer,
International Conference on Parallel and Distributed Computing Systems,
November 13, 2006.
ABSTRACT:
The purpose of this case study is to develop a performance
model for an enterprise grid for performance management
and capacity planning. The target environment includes
grid applications such as health-care and financial services
where the data is located primarily within the resources of a
worldwide corporation. The approach is to build a discrete
event simulation model for a representative work-flow grid.
Five work-flow classes, found using a customized k-means
clustering algorithm, characterize the workload of the grid.
Analyzing the gap between the simulation and measurement
data validates the model. The case study demonstrates
that the simulation model can be used to predict the
grid system performance given a workload forecast. The
model is also used to evaluate alternative scheduling strategies.
The simulation model is flexible and easily incorporates
several system details.
|
Initial Starting Point Analysis for K-Means Clustering: A Case Study,
F. Robinson, A. Apon, D. Brewer, L. Dowdy, D. Hoffman, B. Lu,
Proceedings of ALAR 2006 Conference on Applied Research in Information
Technology, March, 2006.
ABSTRACT:
Workload characterization is an important part of systems performance modeling. Clustering is a method used to find classes of jobs within workloads. K-Means is one of the most popular clustering algorithms. Initial starting point values are needed as input parameters when performing k-means clustering. This paper shows that the results of running the k-means algorithm on the same workload will vary depending on the values chosen as initial starting points. Fourteen methods of composing initial starting point values are compared in a case study. The results indicate that a synthetic method, scrambled midpoints, is an effective starting point method for k-means clustering.
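The sensitivity to starting points described in this abstract can be seen on a toy example. The following Python sketch runs a plain one-dimensional k-means (Lloyd's algorithm) on the same data from two different sets of initial centers. The data and starting values are made up for illustration, and the sketch does not implement any of the fourteen methods compared in the paper, including scrambled midpoints.

```python
def kmeans_1d(points, initial_centers, max_iter=100):
    """Plain Lloyd's k-means on 1-D data.

    Returns the final centers and the sum of squared errors (SSE).
    """
    centers = list(initial_centers)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min((x - c) ** 2 for c in centers) for x in points)
    return centers, sse


# Hypothetical workload feature with three obvious groups.
data = [0, 1, 10, 11, 20, 21]

good_centers, good_sse = kmeans_1d(data, initial_centers=[0, 10, 20])
poor_centers, poor_sse = kmeans_1d(data, initial_centers=[0, 1, 15])

print(good_centers, good_sse)  # [0.5, 10.5, 20.5] 1.5
print(poor_centers, poor_sse)  # [0.0, 1.0, 15.5] 101.0
```

Both runs converge, but the second one settles into a much worse local optimum because two of its starting points fall inside the same natural group.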
|
Adaptive Automatic Grid Reconfiguration Using Workload Phase
Identification,
B. Lu, M. Tinker, A. Apon, D. Hoffman, and L. Dowdy,
Proceedings of EScience 2005, December, 2005.
ABSTRACT:
The purpose of this study is to develop an adaptive model
of a very large scale data processing and storage environment.
The target environment includes grid applications
such as health-care and finance in which the data may be
located primarily within the resources of a worldwide corporation.
The approach is to use phase identification techniques
that can detect over-utilized grid resources, and then
to make dynamic decisions to reassign additional resources
to that portion of the application processing. Two phase
identification techniques are proposed, a variation technique
and a real-time threshold-based technique. The techniques
are validated with a simulation model and a case
study using measured data from a production grid environment.
The case study demonstrates that phase identification
techniques can be used as the intelligent component of
a reactive mechanism for a grid to adapt to changing environmental
conditions by dynamic automatic reconfiguration.
Results show that threshold based phase identifying
techniques combined with dynamic resource allocation capabilities
are effective in alleviating performance hot spots
and improving response time in a large scale data grid.
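To give a rough feel for what a threshold-based phase identifier might look like, here is a small Python sketch. It is only an assumption-laden illustration, not the technique from the paper: the function name, the 0.8 utilization threshold, and the minimum phase length of three samples are all hypothetical. It reports the index ranges where a resource's utilization stays above the threshold long enough to count as an over-utilized phase.

```python
def overloaded_phases(utilization, threshold=0.8, min_length=3):
    """Return (start, end) index pairs, end exclusive, for runs of at least
    min_length consecutive samples whose utilization exceeds threshold."""
    phases = []
    start = None
    for i, u in enumerate(utilization):
        if u > threshold:
            if start is None:
                start = i  # a candidate phase begins here
        else:
            if start is not None and i - start >= min_length:
                phases.append((start, i))
            start = None
    # Close a phase that is still open at the end of the trace.
    if start is not None and len(utilization) - start >= min_length:
        phases.append((start, len(utilization)))
    return phases


# Hypothetical utilization trace for one grid resource.
trace = [0.5, 0.9, 0.95, 0.92, 0.4, 0.85, 0.3, 0.9, 0.9, 0.9]
print(overloaded_phases(trace))  # [(1, 4), (7, 10)]
```

In a reactive setup, each reported phase would be the trigger for assigning additional resources; the single spike at index 5 is ignored because it is shorter than the minimum length.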
|
Comparison of Protein Structures by Transformation
into Dihedral Angle Sequences,
D. L. Hoffman, PhD dissertation, University of North
Carolina at Chapel Hill, 1996.
ABSTRACT:
Proteins are large complex organic molecules that are essential to the
existence of life. Decades of study have revealed that proteins having
different sequences of amino acids can possess very similar
three-dimensional structures. To date, protein structure comparison
methods have been accurate but costly in terms of computer time. This
dissertation presents a new method for comparing protein structures using
dihedral transformations. Atomic XYZ coordinates are transformed into a
sequence of dihedral angles, which is then transformed into a sequence of
dihedral sectors. Alignment of two sequences of dihedral
sectors reveals similarities between the original protein
structures. Experiments have shown that this method detects structural
similarities between sequences with less than 20% amino acid sequence
identity, finding structural similarities that would not have been
detected using amino acid alignment techniques. Comparisons can be
performed in seconds that had previously taken minutes or hours.
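The first two steps described in this abstract, turning XYZ coordinates into dihedral angles and then into sectors, can be sketched in a few lines of Python. This is a generic illustration using the standard atan2 formula for the dihedral angle of four points, not the dissertation's code. The sign convention, the function names, and the choice of 12 equal sectors are assumptions made for the example.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle, in degrees, defined by four points in 3-D space."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def dihedral_sector(angle_deg, sector_count=12):
    """Map an angle to a sector index; sector_count is a hypothetical choice."""
    width = 360.0 / sector_count
    return int((angle_deg + 180.0) // width) % sector_count


# Four made-up atom positions forming a right-angle twist.
angle = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(angle, dihedral_sector(angle))
```

Applying this along a protein backbone yields a one-dimensional sequence of sector indices, which is what makes fast sequence-alignment methods applicable to structure comparison.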
|
BioSCAN: A Dynamically Reconfigurable Systolic Array for
Biosequence Analysis,
Raj K. Singh, W. D. Dettloff, V. L. Chi, D. L. Hoffman, S. G. Tell,
C. T. White, S. F. Altschul, and B. W. Erickson, Proc. of CERCS96,
National Science Foundation, Arlington, VA, June 22-24, 1996.
ABSTRACT:
We describe the design, implementation, and deployment via the
Internet of BioSCAN, an application-specific computer system for
the rapid determination of statistically significant alignments
of biopolymer (DNA, RNA, protein) sequences. BioSCAN continues
to outperform other systems designed to perform this basic task
of molecular biology, which continues to grow in magnitude and
importance. The BioSCAN system is hosted by a general-purpose
workstation containing a special-purpose hardware engine that
accelerates the core algorithm for comparing two biosequences.
Careful partitioning of the computational tasks between hardware
and software provides not only high performance but also
programmability. The BioSCAN system can compare a sequence of
up to 12,992 characters with an arbitrarily large database
containing arbitrarily long sequences at a rate of 2 million
database characters per second. This rate is nearly 1,000 times
greater than the rate achieved by a state-of-the-art workstation
using software alone. This network-sharable computational
resource is accessible interactively via the World Wide Web
using Mosaic, Netscape or other client software.
|
BioSCAN: A Network-Sharable Computational Resource for Searching
Biosequence Databases,
Raj K. Singh, D. L. Hoffman, S. G. Tell, and C. T. White,
Computer Applications in the Biosciences, Vol. 12, No. 3, 1996,
pp. 191-196.
ABSTRACT:
We describe a network sharable, interactive computational tool for
rapid and sensitive search and analysis of biomolecular sequence
databases such as GenBank, GenPept, Protein Identification Resource,
and SWISS-PROT. The resource is accessible via the World Wide Web
using popular client software such as Mosaic and Netscape. The client
software is freely available on a number of computing platforms including
Macintosh, IBM-PC, and Unix workstations.
|
Rapid Protein Structure Classification using One-dimensional Structure
Profiles on the BioSCAN Parallel Computer,
D. L. Hoffman, S. Laiter, Raj K. Singh, I. I. Vaisman, and A. Tropsha,
Computer Applications in the Biosciences, Vol. 11, No. 6, 1995, pp. 675-679.
ABSTRACT:
Rapid growth of protein structures database in recent years requires an
effective approach for objective comparison and classification of
deposited protein structures. We describe a novel method for structure
comparison and classification based on the alignment of one-dimensional
structure profiles. These profiles are obtained by calculating
the OCCO pseudodihedral angles (formed by O-C-C-O atoms of carbonyl
groups of consecutive amino acid residues) from protein three-dimensional
coordinates. These angle measurements are then converted into a 24 letter
alphabet, and the protein structures are represented by sequences of letters
from this alphabet. The BioSCAN parallel computer, designed for primary
sequence alignment, is used to rapidly align and classify these
one-dimensional structure profiles. We have developed and implemented
a weighted scoring matrix to identify structural classes based on commonly
found structural motifs. The results of our experiments are in good
agreement with the traditional protein structure classification schemes.
One-dimensional structure profiles significantly improve efficiency of
structure comparison and classification.
|
Pseudotorsional OCCO backbone angle as a single descriptor of protein secondary structure,
Sergei Laiter, Doug L. Hoffman, Raj K. Singh, Iosif I. Vaisman, and
Alexander Tropsha, Protein Science, Volume 4, Issue 8, 1995, pp. 1633-1643.
ABSTRACT:
Protein secondary structure is conventionally identified using characteristic ranges of two backbone torsional angles φ and ψ. We suggest that the secondary structure can be adequately characterized by a single descriptor, the O(i-1)C(i-1)C(i)O(i) (where i is the residue number) pseudotorsional backbone angle. A set of 102 structurally distinct protein chains from the Protein Data Bank was used to evaluate the adequacy of this descriptor. We find that a specific range of OCCO angles corresponds to each major secondary structure. The complete range of OCCO angles (-180° to 179°) was broken into 18 consecutive subranges of 20° each, and each subrange was assigned a letter. Thus, the OCCO profiles for each protein in the database were translated into a sequence of letters. The Needleman-Wunsch primary sequence alignment algorithm was then used for secondary/tertiary structure comparison and alignment. Preliminary results indicate that this new approach has a significant potential for rapid identification of fold families in the Protein Data Bank.
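The pipeline in this abstract, binning angles into 18 subranges of 20° and then aligning the resulting letter sequences, can be sketched as follows. The binning follows the numbers given above, but the specific letters (A through R), the match/mismatch/gap scores, and the function names are assumptions for illustration. The paper's actual letter assignment and scoring scheme are not given here. Only the global alignment score is computed, using the standard Needleman-Wunsch recurrence.

```python
import string


def occo_to_letter(angle_deg):
    """Map an angle in [-180, 180) to one of 18 letters, 20 degrees per bin.

    The A..R lettering is a hypothetical choice.
    """
    index = int((angle_deg + 180.0) // 20.0) % 18
    return string.ascii_uppercase[index]


def profile(angles):
    """Translate a list of OCCO angles into a one-dimensional letter profile."""
    return "".join(occo_to_letter(a) for a in angles)


def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of strings a and b (scoring values are assumed)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Made-up angle lists standing in for two short protein fragments.
p1 = profile([-180.0, -160.0, 179.0])
p2 = profile([-175.0, 170.0])
print(p1, p2, needleman_wunsch_score(p1, p2))  # ABR AR 1
```

Because the structures are reduced to strings, any existing sequence-alignment tool can be reused for structure comparison, which is the main appeal of a single-descriptor representation.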
|
A Scalable Systolic Multiprocessor for Analysis of Biological
Sequences,
Raj K. Singh, S. G. Tell, C. T. White, D. L. Hoffman, V. L. Chi, and
B. W. Erickson, Proc. of the Symposium on Integrated Systems,
Seattle, WA, March 3-5, 1993, MIT Press, Cambridge, MA, pp. 168-182.
ABSTRACT:
The design and implementation of an application-specific, fault-tolerant,
and scalable multiprocessor system called BioSCAN (Biological Sequence
Comparative Analysis Node) are described. Discussed are system
partitioning and integration, functional decomposition between hardware
and software, the algorithm and its implementation in VLSI, the early
results of using the system, and comparison with other hardware and
software solutions for biological sequence analysis.
|
Design of the BioSCAN Server Software,
D. L. Hoffman, Department of Computer Science, University of North
Carolina, Chapel Hill, NC. TR93-049, 1993.
ABSTRACT:
This paper is an exploration of the design goals for the Biological
Sequence Comparative Analysis Node (BioSCAN) network server software and of
the impact that these goals had on the overall structure and implementation
of that software. The primary audience for this paper consists of computer
scientists and computational biologists involved in developing similar
server software. Biologists who are users of the BioSCAN computational
node and have a desire for deeper understanding of how the server functions
will also find this paper useful. It is assumed that the reader is
familiar with UNIX and with basic networking concepts. The
peculiarities of implementing a network server for a batch resource will be
identified and the solutions chosen by the BioSCAN design team explained.
Research for the BioSCAN project, including the design of the server
software, was supported in part by NSF grant MIP-9024585.
|
A Comparison of the BioSCAN Algorithm on Multiple
Architectures,
D. L. Hoffman, Department of Computer Science, University of North
Carolina, Chapel Hill, NC. TR93-050, 1993.
ABSTRACT:
This paper compares the performance characteristics of the BioSCAN
biological sequence matching algorithm on several different computer
architectures. The architectures examined are a conventional RISC
general purpose uni-processor, a vector oriented "supercomputer", and a
Single Instruction, Multiple Data (SIMD) massively parallel computer.
These architectures are represented by the following hardware platforms: a
Sun 490 RISC, a Convex 240, and a MasPar MP-1. The performance of these
three platforms is compared with that of the custom built BioSCAN hardware.
|
|