Pattern Discovery/Statistical Services

Ronald Christensen invented Entropy Minimax, an information theoretic method of pattern recognition, in 1963. Originally conceived as an answer to the age-old philosophical problem of the justification of inductive reasoning and as an explanation of the evolution of languages, Entropy Minimax has been applied to numerous problems involving searches for patterns in large databases and development of statistical forecasting models.

Examples of some of the problems addressed by Entropy Minimax include:

**Bio-Medical Research and Development**

- Developing early detection algorithms for cancers of the breast, cervix, and prostate
- Estimating five-year survival likelihoods for coronary artery disease patients
- Estimating response to chemotherapy in acute myelocytic leukemia
- Estimating response to chemotherapy in non-Hodgkin's lymphoma
- Differential diagnosis of diseases of the cervical spine
- Screening drugs for potential antineoplastic activity
- Differential diagnosis of heart disease using ECG waveforms
- Assessing symptoms of pelvic surgery patients
- Analyzing out-patient clinic use/overuse
- Developing screens for radiology
- Analyzing protein structure/function patterns in DNA coding of adenyl kinase
- Developing an understanding of molecular evolution

**Engineering Modeling**

- Detecting underground storage systems leakage
- Estimating useful lifetimes of underground tanks
- Modeling irreversible processes with incomplete information via hazard axes
- Modeling fracture of UO2 pellets subjected to large thermal gradients
- Modeling fission gas release in nuclear fuel rods under commercial operating conditions
- Modeling failure (rupture) of nuclear fuel rods under commercial operating conditions
- Modeling longitudinal distortion (bowing) of rods in nuclear fuel assemblies during commercial reactor operations
- Modeling radial distortion (swelling) of nuclear fuel rods under simulated LOCA (Loss of Coolant Accident) conditions

**Weather Forecasting**

- Forecasting annual precipitation in Northern California at 2-month and 6-month lead times
- Forecasting winter precipitation in Western Oregon at a 7-month lead time
- Forecasting spring and winter precipitation in Eastern Washington at 6- and 7-month lead times
- Forecasting winter precipitation in Central Arizona at 2-month lead times
- Forecasting summer precipitation in Central Arizona at 8-month lead times

**Statistical Analysis**

- Estimating probabilities using data reported in published literature
- Step sizing in numerical solution of differential equations
- Performing non-linear curve fitting
- Feature selection in complex situations
- Analyzing incomplete data sets
- Informationally optimal expansions for waveform analysis
- Computer vision/recognition system development

**Examples of Entropy Research Using Entropy Minimax**

**Long-Term Weather Forecasting.** It is well known that mechanistic "general circulation" models, regardless of the completeness and accuracy of their global input, reach fundamental limits in predictive capabilities somewhere between two and three weeks into the future. A chaos theory example commonly given in explanation is that the earth-ocean-atmosphere system is so sensitive to slight changes over lead times of 2-3 weeks that the weather in New York today can be significantly affected by a butterfly flapping its wings in Beijing three weeks ago.

Despite this mechanistic modeling limitation, it has been understood that statistical averages for extended periods, say a season, can be predicted with something better than purely random success. However, extensive research by numerous universities and government agencies prior to 1980 concluded that, for example, annual precipitation in California could not be predicted to be above or below median at 2-6 month lead times with an accuracy more than 5 percentage points better than random guessing.

Under contract to the U.S. Dept. of Interior, Entropy Limited used Entropy Minimax in 1980 to develop a means of making such predictions with an accuracy 13 percentage points better than random. A key finding of the Entropy Minimax pattern discovery was the importance of sea surface temperatures in the equatorial Pacific (El Niño) to mid- and long-range weather forecasting. Subsequently, Entropy Minimax has been applied to forecasting precipitation (at 18 percentage points better than random), temperature, and a temperature/humidity discomfort index for Oregon, Washington, and Arizona, to serve as input for water and energy resource management.

**Nuclear Fuel Performance Modeling.** Along with global weather modeling, nuclear reactor fuel performance modeling is another problem requiring massive computer codes. A typical reactor has 30,000 or more fuel rods, each a sealed metallic canister about 12 feet long containing hundreds of uranium dioxide fuel pellets. There can be temperature differences of more than a thousand degrees across the short distance (e.g. 1 cm or less) from the center to the surface of a fuel pellet. The computer models track the varying temperatures and pressures throughout the reactor core as control rod configurations are changed during power generation operation.

With respect to thermo-mechanical properties, these models worked quite well. Historically, however, they ran into difficulty in the final stage: modeling the cracking failure of the cladding, which would permit leakage of radioactivity into the primary coolant.

Under contract with the Electric Power Research Institute, and working in cooperation with Argonne National Laboratory, Stanford University Dept. of Materials Science, Univ. of Manchester Dept. of Materials, Science Applications Inc., and Failure Analysis Associates, EL applied Entropy Minimax to successfully model the fuel cracking. This modeling was subsequently embodied in a system for use in failure avoidance operation, fuel cycle management and fuel design evaluation. A model of stainless steel fuel rods was built by researchers at the Univ. of Michigan using Entropy Minimax, and their paper describing this work won the American Nuclear Society Mark Mills Award as the best paper of the year in its category in Nuclear Engineering.

**Diagnostic Classification of Non-Hodgkin's Lymphoma.** Treatment decisions for patients with non-Hodgkin's lymphoma are based, in part, on diagnostic classification of the patient's disease. For many years, this classification was based primarily on histology (favorable or unfavorable) determined by microscopic analysis of tissue.

In 1984, patterns of survival for classifying patients with advanced non-Hodgkin's lymphoma were discovered by an information theoretic Entropy Minimax analysis of a sample of 334 patients (224 for model development and 110 for model testing). The findings were presented at the Second International Conference on Malignant Lymphoma in Lugano, Switzerland, and published the next year in the conference proceedings book. Patterns of good prognosis for survival were found to be defined by a high Karnofsky status (a simple measure of level of daily activities) and either a normal serum transaminase (SGOT) level or a normal spleen (by palpation and/or scan). Patterns of poor prognosis were identified as having either low Karnofsky status (<70%) or night sweats. The surprising finding was that these survival prognosis patterns apply both to patients with favorable histology and to patients with unfavorable histology.

Ten years later a Harvard-led consortium of 20 university medical schools, using pooled data on 5000 patients, confirmed the finding of these factors as proper for classifying non-Hodgkin's lymphoma. This is an example of the utility of information theoretic analyses even with modest sample data.

**Analysis Methodologies Employed by Entropy Limited**

In addition to Entropy Minimax, key team members of Entropy Limited have, during the past 30 years, incorporated a wide range of important procedures and developed a number of new computer-based tools that efficiently extract reliable indicators from real-world databases, which are often of less than high quality. Included are tools which use information theory to efficiently detect patterns; compensate for background noise in estimating critical system attributes such as durations of a behavioral trend or lifetimes; and combine data from disparate sources and models to yield a synergetic result more reliable than any single source or model. These tools permit EL to find patterns, and to project their future predictive capability, from databases that are smaller and more varied than is considered suitable for older procedures. Listed below are some of the methodologies used by EL in carrying out its work, where starred (*) items relate to or include special Entropy methodologies.

**Predictive Modeling**

- Cluster Analysis
- Principal Components (Karhunen-Loeve) Analysis
- Linear Regression
- Logistic Regression
- Kaplan-Meier Survival Analysis
- Correlation & Autocorrelation
- Statistical Significance Testing

**Advanced Statistically Validated Predictive Modeling**

- Small Sample High Dimensionality Statistics
- Automatic Phase Alignment for Karhunen-Loeve Analysis*
- Information Theoretic Multidimensional Feature Extraction*
- Integration of Mechanistic and Statistical Models*
- Entropy Minimax Multivariate Predictive Modeling*
- Information Theoretic Expert Resolution*
- Self-Consistent Decensoring Survival Analysis*
- Cross-validation

**Special Analysis Methodologies**

- Error Propagation Analysis*
- Statistical Distribution Parameter Fitting
- Statistics of Rare Events
- Automatic Data Distribution Determination*
- Multidimensional Data Displays (M-Views)*
- Automatic Missing Data Estimation Procedures*
- Sequential Outlier Flagging Procedures*
- Eigenvector Analysis
- Waveform Analysis
- Finite Element Analysis
- Noise Reduction

**Examples of Entropy Research Using Various Mathematical/Statistical Methodologies**

**Surprising Findings About the Flu.** It is generally understood that heart disease and cancer are statistically the big killers in the U.S. It is also a fact that significantly more people die in winter months than in the summer, from about 10% more to over 20% more, depending on the year. A surprising result of research by Entropy Limited and its collaborators is that there is a single cause for almost all of the differential of winter over summer death rates in the U.S., in fact in all developed countries, and that cause is influenza.

Death rates due to influenza, both the "ordinary" annual flu strains and the less frequent but more virulent pandemic strains, are higher among older people than among those who are younger. Recognizing this, most countries have implemented policies with emphasis on vaccinating "populations at risk" such as the elderly. Another surprising result of research by EL and its collaborators is that these vaccination programs have failed to reduce death rates in any age group. In fact, one of the few programs that did significantly reduce death rates for the elderly was vaccination in Japan, not of the "at risk" elderly, but rather of schoolchildren. (Influenza research supported, in part, by Entropy Limited and, in part, by the National Institutes of Health.)

**Garbage In, Gold Out.** Suppose you are a physician examining a patient with clinically observable features corresponding to an 80% likelihood of having a particular disease. You await the results of a lab test before giving your final diagnosis. Suppose that when the test results arrive they indicate a lower probability, only a 70% chance of your patient having the disease.

Virtually all conventional analytic procedures would produce a final diagnosis that is a compromise somewhere between the two estimates, 70% and 80%, depending on various measures of reliability of each estimate. However, a computer simulation at MIT and an information theoretic analysis at Entropy Limited have revealed that, although such a compromise is often the right answer, there are identifiable circumstances when the correct combination of these two results is a final diagnosis higher than 80% and under some conditions as high as 95%.

Computer algorithms embodying the theory underlying this analysis have been developed at Entropy Limited. Under specifiable conditions, they achieve high reliability output from moderate reliability input, with applicability to multivariate detection problems.
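
One textbook setting in which two estimates combine to a value above both is when each estimate rests on independent evidence and both exceed the underlying base rate of the disease. The sketch below illustrates only that general idea; it is not the Entropy Limited algorithm or the MIT simulation. The function names, the conditional-independence assumption, and the base-rate (`prior`) values are all assumptions introduced here for illustration.

```python
def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1.0 - p)


def combine_independent(p1: float, p2: float, prior: float) -> float:
    """Combine two probability estimates of the same event.

    Assumption (not from the source): both estimates started from the same
    prior, and the two pieces of evidence are conditionally independent
    given the true disease state. The likelihood ratios then multiply:
    combined odds = odds(p1) * odds(p2) / odds(prior).
    """
    combined = odds(p1) * odds(p2) / odds(prior)
    return combined / (1.0 + combined)


if __name__ == "__main__":
    # Hypothetical base rates; 0.80 and 0.70 are the figures from the text.
    for prior in (0.30, 0.50, 0.75):
        p = combine_independent(0.80, 0.70, prior)
        print(f"prior={prior:.2f} -> combined={p:.3f}")
```

Under these assumptions, a base rate below 70% pushes the combined estimate above 80% (roughly 0.90 at a base rate of 0.50 and roughly 0.96 at 0.30), while a base rate between 70% and 80% yields a compromise between the two inputs.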

**Mathematical Espionage.** It is generally thought that if one has data on the individuals in a group but publishes only aggregates such as totals or averages, one is not disclosing information about any specific individual in the group. However, in precisely such a case Entropy Limited showed that if Line-of-Business (LB) data were released by the Federal Trade Commission, then competitors could mathematically discover proprietary information about individual companies despite the aggregated nature of the release. (This work was performed for a major New York law firm representing over 100 corporations and was presented in testimony before the U.S. Federal Trade Commission.)
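
The simplest way an aggregate can leak an individual value is by subtraction: a contributor who knows its own figure, and perhaps a few others, can back out what remains. The sketch below shows only that elementary case, using made-up firms and figures; it is not the analysis Entropy Limited presented, and the function name `recover_unknown` is hypothetical.

```python
from typing import Sequence


def recover_unknown(published_total: float, known_parts: Sequence[float]) -> float:
    """Return what is left of a published total after removing known parts."""
    return published_total - sum(known_parts)


# Hypothetical sales figures for one line of business (not real data).
true_sales = {"Firm A": 120.0, "Firm B": 75.0, "Firm C": 40.0}
published_total = sum(true_sales.values())  # the only figure made public

# Firm A knows its own sales, so it learns the combined sales of B and C.
b_plus_c = recover_unknown(published_total, [true_sales["Firm A"]])

# If Firm A also knows Firm B's figure from some other source,
# Firm C's supposedly confidential figure follows exactly.
firm_c = recover_unknown(
    published_total, [true_sales["Firm A"], true_sales["Firm B"]]
)

print(b_plus_c, firm_c)
```

With more firms and several overlapping published aggregates, the same reasoning becomes a system of linear equations rather than a single subtraction.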

**Statistical Distribution of Humpback Whale Data.** For the Ocean Alliance (formerly the Whale and Dolphin Conservation Institute), Entropy Limited conducted a statistical analysis of data on whales in the Atlantic offshore of South America and discovered a bimodal rather than a unimodal distribution, which showed that what was previously thought to be a single whale population was actually two populations.
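
The source does not say how bimodality was established. One common way to test for it, sketched here purely as an illustration, is to fit both a single Gaussian and a two-component Gaussian mixture and compare them with the Bayesian information criterion (BIC). The data below are synthetic, and the function names, the EM fitting procedure, and the choice of BIC are all assumptions rather than a description of the Entropy Limited analysis.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_one_gaussian(data):
    """Maximum-likelihood single Gaussian; returns the log-likelihood."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return sum(math.log(normal_pdf(x, mu, sigma)) for x in data)


def fit_two_gaussians(data, iterations=200):
    """Two-component Gaussian mixture fitted by EM; returns the log-likelihood."""
    n = len(data)
    ordered = sorted(data)
    half = n // 2
    # Start the two means at the averages of the lower and upper halves.
    mu1 = sum(ordered[:half]) / half
    mu2 = sum(ordered[half:]) / (n - half)
    mean = sum(data) / n
    sigma1 = sigma2 = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    w = 0.5  # mixing weight of component 1
    for _ in range(iterations):
        # E-step: responsibility of component 1 for each point.
        resp = []
        for x in data:
            a = w * normal_pdf(x, mu1, sigma1)
            b = (1.0 - w) * normal_pdf(x, mu2, sigma2)
            resp.append(a / (a + b))
        # M-step: re-estimate weight, means, and standard deviations.
        n1 = sum(resp)
        n2 = n - n1
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n2
        sigma1 = math.sqrt(sum(r * (x - mu1) ** 2 for r, x in zip(resp, data)) / n1)
        sigma2 = math.sqrt(
            sum((1.0 - r) * (x - mu2) ** 2 for r, x in zip(resp, data)) / n2
        )
        w = n1 / n
    return sum(
        math.log(w * normal_pdf(x, mu1, sigma1) + (1.0 - w) * normal_pdf(x, mu2, sigma2))
        for x in data
    )


def bic(log_likelihood, n_params, n_points):
    """Bayesian information criterion; lower values indicate a preferred model."""
    return n_params * math.log(n_points) - 2.0 * log_likelihood


# Synthetic measurements drawn from two hypothetical populations.
random.seed(0)
sample = [random.gauss(10.0, 1.0) for _ in range(200)]
sample += [random.gauss(16.0, 1.0) for _ in range(200)]

bic_one = bic(fit_one_gaussian(sample), 2, len(sample))   # mean, sd
bic_two = bic(fit_two_gaussians(sample), 5, len(sample))  # 2 means, 2 sds, weight
print("one population:", round(bic_one, 1), "two populations:", round(bic_two, 1))
```

A clearly lower BIC for the two-component model is evidence for two populations. This sketch has no safeguards against a collapsing component, which a production fit would need.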

Entropy Limited has developed distribution fitting software for over 100 distributions, rank ordering them for any input dataset by goodness-of-fit and providing color plots of data histograms and fitted distributions, annotated with error measures, on PC-based monitors and printers.
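
Rank ordering candidate distributions by goodness-of-fit can be sketched in miniature as follows. This is not Entropy Limited's software: it covers three candidates instead of over 100, and the Kolmogorov-Smirnov distance used as the error measure, the simple parameter estimates, and all function names are assumptions made for illustration.

```python
import math
import random


def ks_statistic(sorted_data, cdf):
    """Largest gap between the empirical CDF and a fitted CDF."""
    n = len(sorted_data)
    d = 0.0
    for i, x in enumerate(sorted_data, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def fit_candidates(data):
    """Fit each candidate by simple estimates and return its CDF.

    Assumes the data are not all identical; the exponential candidate is
    only meaningful for positive data.
    """
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    lo, hi = min(data), max(data)
    return {
        "normal": lambda x: 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0)))),
        "exponential": lambda x: (1.0 - math.exp(-x / mean)) if x > 0 else 0.0,
        "uniform": lambda x: min(1.0, max(0.0, (x - lo) / (hi - lo))),
    }


def rank_by_fit(data):
    """Return (name, KS distance) pairs, best fit first."""
    ordered = sorted(data)
    scores = [
        (name, ks_statistic(ordered, cdf))
        for name, cdf in fit_candidates(data).items()
    ]
    return sorted(scores, key=lambda item: item[1])


# Synthetic positive-valued data drawn from an exponential distribution.
random.seed(1)
sample = [random.expovariate(0.5) for _ in range(500)]
for name, distance in rank_by_fit(sample):
    print(f"{name:12s} KS distance = {distance:.3f}")
```

The smallest distance marks the best-fitting candidate; other error measures, such as chi-square or log-likelihood, could be substituted in `rank_by_fit` without changing the structure.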