The Artificial Intelligence of Diagnosing Diseases

March 7th, 2012

This post is a framework for diagnosing diseases with artificial intelligence.  It draws its inspiration heavily from a transcript of Henry Cohen’s excellent lecture in 1943, “The Nature, Method and Purpose of Diagnosis.”  I liked reading Cohen’s lecture because it is clear and concise and seemed to fit well with an artificial intelligence approach to diagnosing diseases.  My interest in this subject is in developing algorithms to make diagnoses.

There are no diseases, only disease.

[1]

This kind of sums the whole thing up and is a great place to start.  This quote reinforces the idea that we’re not looking to exhaustively search innumerable avenues.  We’re looking to find what, at its root, is bothering the patient.

Why artificial intelligence to diagnose diseases?  The reason is to provide a consistently high quality of patient care, a quality of care that is repeatable and reliable.  Cohen argues that one of the main problems with patient care is consistency.  The same patient could get different diagnoses from different doctors.  Doctors are human, after all, and clearly differ.  How do doctors differ?

  • Observational prowess
  • Knowledge of symptoms, signs, and syndromes of disease
  • Interpretive ability
  • Choice of labels

[1]

Creating and deploying an artificial-intelligence-based system with the same observational and interpretive abilities and a consistent taxonomy would relieve the confusion caused by conflicting medical advice.

The literature on using artificial intelligence to algorithmically make medical diagnoses is surprisingly timid.  Usually, attempts at automatically diagnosing diseases are couched in words like “assist” and “consult” and rarely, if ever, take full responsibility for making the diagnosis.  One author suggested a reason for this: trepidation about encroaching on the doctor-patient relationship.  Algorithmic diagnostic systems are also usually tailored to diagnosing one specific disease.  The task of this post is to outline a general framework for diagnosing disease.

 

Observation

The first stage in the diagnosing disease is the recognition of simple quantitative deviations from normal.

[1]

This sentence from Cohen provides a great start to our diagnostic engine.  “First stage” implies that a simple state machine structure for the main diagnostic engine is appropriate, with the first stage seeking to gather information and compare it to what is considered normal for that patient.

 

The inputs to the observation stage are the patient’s medical history, their testimony about why they are seeking a diagnosis, and a physical examination.  Processing the patient’s testimony can exist on a spectrum: at one end, natural language processing (NLP) parses and comprehends the patient’s free-form testimony; at the other, the patient chooses from a drop-down menu.  The former is more sophisticated and more involved, but it fits the task better because it lets the patient consult the device when they don’t know what is wrong, which makes the device far more usable.  The latter essentially boils down to a choose-your-own-adventure approach.  It is trivial to implement but does not significantly move the needle beyond simple internet searches, and it leaves the responsibility for diagnosis, or for deciding whether to consult a doctor, in the hands of the patient.  The NLP approach, or something close to it, is preferred.

The NLP lies outside of the main diagnostic engine so that different algorithms can be swapped in and out seamlessly.  The diagnostic flow should not depend on the specifics of the NLP used to parse the patient’s testimony.  The NLP outputs keywords gleaned from the patient’s testimony, ordered and weighted in terms of “importance.”  Clearly, the NLP will need to know what the observation stage of the diagnostic engine considers important, but it does not need to be embedded in the diagnostic engine.
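One way to picture this separation is a small interface that the diagnostic engine depends on, with the NLP behind it free to change.  The sketch below is only an illustration: the names WeightedKeyword, TestimonyParser, WatchwordParser and observe, and the watchword weights, are all made up for this example, and the toy parser is a stand-in for real NLP.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class WeightedKeyword:
    term: str
    weight: float  # larger means more important to the observation stage


class TestimonyParser(Protocol):
    """Anything that turns free-text testimony into weighted keywords."""

    def parse(self, testimony: str) -> list[WeightedKeyword]: ...


class WatchwordParser:
    """Toy stand-in for real NLP: scores testimony against a fixed table."""

    def __init__(self, watchwords: dict[str, float]) -> None:
        self.watchwords = watchwords  # hypothetical term -> importance table

    def parse(self, testimony: str) -> list[WeightedKeyword]:
        tokens = {t.strip(".,!?;:").lower() for t in testimony.split()}
        found = [WeightedKeyword(t, w) for t, w in self.watchwords.items() if t in tokens]
        # Most important keywords first, as the observation stage expects.
        return sorted(found, key=lambda k: k.weight, reverse=True)


def observe(testimony: str, parser: TestimonyParser) -> list[WeightedKeyword]:
    # The engine relies only on the parse() contract, not on the NLP behind it.
    return parser.parse(testimony)


keywords = observe(
    "My chest hurts and I feel dizzy.",
    WatchwordParser({"chest": 0.9, "dizzy": 0.6, "rash": 0.4}),
)
```

Swapping in a more capable parser would then mean passing a different object to observe, with no change to the diagnostic flow.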

The patient’s testimony, their medical record and any vital signs recorded are collected in the observation stage and passed to the next stage.  The observation stage may and probably will order or perform certain biometric tests on the patient and wait for those results before proceeding to the next stage, Interpretation.  For example, the observation stage may perform and pass on the results of an electrocardiogram (ECG) test based on certain watchwords figuring prominently in the patient’s testimony.

The observation stage looks at physical exam measurements and compares them to expected values given the patient’s data from their medical records (age, height, weight, gender…).  It also performs additional tests based on simple keyword matching from the patient’s testimony.  The observation stage writes back to the patient’s file and passes everything onto the interpretation stage.  The observation stage is like the nurse taking your vital signs and the interpretation stage is like the doctor who comes to make a medical diagnosis.

The output of the observation stage is an up-to-date medical record of the patient.  The ability to index the patient’s record temporally is important as this allows the Interpretation stage to analyze how a particular condition has changed over time.
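A minimal sketch of those two ideas, a time-indexed record and a comparison against expected values, might look like the following.  The class and function names are hypothetical, and the expected ranges are placeholder numbers for illustration only, not clinical guidance.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class PatientRecord:
    # Each vital sign maps to a list of (date, value) pairs, oldest first.
    history: dict[str, list[tuple[date, float]]] = field(default_factory=dict)

    def add(self, name: str, when: date, value: float) -> None:
        self.history.setdefault(name, []).append((when, value))
        self.history[name].sort(key=lambda pair: pair[0])

    def series(self, name: str) -> list[float]:
        """Values in time order, so a later stage can look at trends."""
        return [value for _, value in self.history.get(name, [])]


def deviations(latest: dict[str, float],
               expected: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return how far each measurement falls outside its expected range."""
    out = {}
    for name, value in latest.items():
        low, high = expected[name]
        if value < low:
            out[name] = value - low    # negative: below the range
        elif value > high:
            out[name] = value - high   # positive: above the range
    return out


# Placeholder ranges for illustration only, not clinical guidance.
EXPECTED = {"heart_rate_bpm": (50.0, 100.0), "temperature_c": (36.0, 37.5)}

record = PatientRecord()
record.add("heart_rate_bpm", date(2012, 3, 1), 88.0)
record.add("heart_rate_bpm", date(2012, 2, 1), 72.0)
flags = deviations({"heart_rate_bpm": 112.0, "temperature_c": 36.8}, EXPECTED)
```

In a real system the expected ranges would depend on the patient’s profile rather than being a fixed table.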

 

Interpretation

The interpretation stage can ask the patient questions directly to obtain additional information.  It is better to keep this ability directly in the interpretation stage instead of going back to the observation stage because this new information will probably be directly linked to branching in this stage of the diagnosis.  To simplify things, the query/response format in the interpretation stage does not need to be as open-ended, from a linguistic standpoint, as in the observation stage.  In the observation stage, natural language processing is needed because we may not be sure what we are trying to figure out yet, so we have to derive meaning from a very complex set of possibilities.  However, in the interpretation stage, we are seeking specific, targeted information, so the response options should be limited.  This fits well into a drop-down box; an example is below.

Have you felt chest pain?

 

The last three, “I don’t know”, “I don’t understand the question,” and “It’s complicated” allow the response/query routine to improve its chances for getting a useful answer out of the patient.
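As a rough sketch, the limited set of responses can be modeled as an enumeration, with each answer mapped to what the query routine does next.  The “Yes” and “No” options and the action names (branch, rephrase, follow_up, skip) are assumptions made for this example; only the last three answers come from the text above.

```python
from enum import Enum


class Answer(Enum):
    YES = "Yes"                                      # assumed option
    NO = "No"                                        # assumed option
    DONT_KNOW = "I don't know"
    DONT_UNDERSTAND = "I don't understand the question"
    COMPLICATED = "It's complicated"


def next_action(answer: Answer) -> str:
    """Map a limited response onto a hypothetical next step for the engine."""
    if answer in (Answer.YES, Answer.NO):
        return "branch"        # enough to pick a branch of the diagnosis
    if answer is Answer.DONT_UNDERSTAND:
        return "rephrase"      # ask the same thing in simpler words
    if answer is Answer.COMPLICATED:
        return "follow_up"     # ask narrower follow-up questions
    return "skip"              # "I don't know": move on to another question
```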

The AI for medical diagnosis will need to reason anatomically, that is, it will have to move from one part of the body to another in search of interpretations that fit the existing data.  Cohen considered the “fundamental tripod of medicine” to be anatomy, physiology and pathology.  Of these, anatomy lends itself well to being described as a connectivity graph.  The AI could have different graphs for different systems in the body, such as circulatory, respiratory and endocrine, each describing how different parts of the body are connected together.  A simple 1 or 0 (connected or not connected) would probably do, as the AI is simply looking for what to try next.  Once it traverses the graph from, say, the heart to the liver, it uses “liver” as a keyword to look up potential next steps.
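A connectivity graph of this kind can be sketched as a 1/0 adjacency matrix plus a traversal that yields the order in which to try organs as lookup keywords.  The organ list and the links below are hypothetical and heavily simplified; breadth-first search is just one reasonable choice of traversal, not something the framework prescribes.

```python
from collections import deque

# Hypothetical, heavily simplified connectivity for one body system.
# 1 = connected, 0 = not connected; the links are illustrative only.
ORGANS = ["heart", "lungs", "liver", "kidneys"]
CONNECTED = [
    [0, 1, 1, 1],  # heart
    [1, 0, 0, 0],  # lungs
    [1, 0, 0, 0],  # liver
    [1, 0, 0, 0],  # kidneys
]


def neighbors(organ: str) -> list[str]:
    row = CONNECTED[ORGANS.index(organ)]
    return [ORGANS[j] for j, bit in enumerate(row) if bit]


def search_order(start: str) -> list[str]:
    """Breadth-first order in which to try organs as lookup keywords."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        organ = queue.popleft()
        order.append(organ)
        for nxt in neighbors(organ):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order
```

Starting from the liver, for example, the traversal visits the heart next and only then the organs reachable through it.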

What about a Bayesian approach to interpretation?  I would stay away from it because it relies on “models that are subjective, and the resulting inference depends greatly on the model selected.” [2]  We are seeking a framework that can be used for diagnosing a wide range of diseases, not tuned to specific diseases.  The framework must be general and its reasoning mathematical.  The reasoning itself cannot have a subjective foundation.

The output of the interpretation stage is a provisional medical decision about which steps to take next.  If the algorithm does not have enough information to make a decision, it does not need to do so.  It can order more tests or suggest a therapy to alleviate the condition and have the patient report back.  When the patient reports back, they start again in the observation phase.
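The flow between stages can be sketched as a small state machine.  This is a minimal illustration assuming three stages named after the section headings of this post; the enough_information flag is a hypothetical stand-in for whatever the interpretation stage actually decides.

```python
from enum import Enum, auto


class Stage(Enum):
    OBSERVATION = auto()
    INTERPRETATION = auto()
    SYMBOLIZATION = auto()
    DONE = auto()


def next_stage(stage: Stage, enough_information: bool) -> Stage:
    if stage is Stage.OBSERVATION:
        return Stage.INTERPRETATION
    if stage is Stage.INTERPRETATION:
        # Without enough information, order tests or suggest a therapy, and
        # start again at observation when the patient reports back.
        return Stage.SYMBOLIZATION if enough_information else Stage.OBSERVATION
    return Stage.DONE


def run(decisions: list[bool]) -> list[Stage]:
    """Trace the stages visited; decisions[i] is the i-th interpretation outcome."""
    stage, trace, answers = Stage.OBSERVATION, [], iter(decisions)
    while stage is not Stage.DONE:
        trace.append(stage)
        enough = next(answers, True) if stage is Stage.INTERPRETATION else True
        stage = next_stage(stage, enough)
    return trace
```

Calling run([False, True]) traces one loop back to observation before the engine moves on.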

 

Symbolization, Corrective Action and Evaluating an Algorithm

Even after the diagnostic engine reaches the heart of the matter, what’s wrong with the patient, there is still much more work to do.  First it must encode the diagnosis in a manner that will allow it to treat the same disease the same way, every time, from patient to patient.  The Oxford American Dictionary defines syndrome as

syndrome n. 1) a group of concurrent symptoms of a disease

If the list of syndromes for a disease is complete enough, it will uniquely identify a disease.  Cohen assesses syndromes as the site, functional disturbances and cause of disease [1].  This should be enough information to universally encode the disease.  Notice we have not included any prescriptive remedy in the encoding.  The remedy will vary from patient to patient, since patients with the same disease at the same site may need different courses of action based on age, gender and so on.
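One simple way to get a “same disease, same code” property is an immutable record of the three parts Cohen names.  The class name, field names and example values below are hypothetical; the point is only that two diagnoses with the same site, functional disturbance and cause compare equal, while treatment is kept out of the encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiseaseCode:
    site: str
    functional_disturbance: str
    cause: str


# Illustrative values only; two patients, one encoding.
patient_a = DiseaseCode("lung", "impaired gas exchange", "bacterial infection")
patient_b = DiseaseCode("lung", "impaired gas exchange", "bacterial infection")
other = DiseaseCode("lung", "impaired gas exchange", "viral infection")
```

Because the record is frozen and hashable, it can also serve directly as a key in a lookup table.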

Second, we must figure out the cause of the disease and its implications.

Too frequently we have been content with a diagnostic label without investigating its implications.

[1]

Causation implies a search for antecedents, and not for the ultimate — the final — cause of all things.  This means not a single antecedent or even a chain of antecedents, but a whole interlacing network of them.

[1]

This points directly to graph theory for reasoning through the causes and implications of the disease.  Somehow we’d need to map corporal function to a manifold and be able to traverse it.  This is significantly more complicated than the simple graph traversal in the Interpretation stage, where we are simply seeking clues to help us along our decision-making tree.  There, we’ve already mapped the several systems of the body (circulatory, skeletal, respiratory) to graphs and are simply looking them up.  In this stage we’d likely need to do the mapping on the fly based on what we figured out from the previous stage.

In addition to affixing a label to the diagnosis, the output of this stage is to recommend a corrective action.

The main aim of diagnosis, that of providing the rational basis for treatment and prognosis…

[1]

The main implementation decision to make here is whether to spend more time and energy investigating causes and implications and keep the treatment recommendation and prognosis estimate simple, or vice versa.  For instance, if the algorithm is good at figuring out causation and implication, maybe the treatment and prognosis can be a simple lookup table.  If causation/implication is simple, then we’ll want to do something more complicated for treatment/prognosis.  I prefer the former.  Because causation/implication and prognosis/treatment are so tightly coupled, I consider them part of the same stage of diagnosis, even though they may have separate artificial intelligence approaches.

 

Evaluating a Medical Diagnosis Algorithm

‘…common sense pressed for time accepts and acts on acceptance.’  We physicians are often confronted by a situation in which we have to give a provisional verdict on the admittedly inadequate available evidence.

[1]

For any algorithm, execution time will be crucial.  The algorithm will have to provide feedback quickly to the patient, even if it does not have a final diagnosis.  Keeping the patient informed as the algorithm works through its reasoning will help the patient become comfortable with seeking a diagnosis from a machine, as opposed to a human doctor.  Time is of the essence; in fact, the algorithm should not get bogged down spending clock cycles on getting every corner case right at the expense of reaching the most common conclusions quickly.

The New England Journal of Medicine published the results of case records presented to clinicians and discussant groups and whether or not they were able to correctly diagnose the disease in the case studies.  Of the 43 case studies, individual clinicians were correct 65% of the time and the discussant groups were correct 80% of the time.  The study asked the participants to assess the confidence level of their diagnosis as well.

 

                      Clinicians   Discussants
Correct, definite             23            29
Correct, tentative             5             6
Total cases                   43            43

[3]
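The percentages quoted above follow directly from the table.  A quick check, using only the counts shown (the variable names are mine):

```python
CASES = 43
correct = {
    "clinicians": {"definite": 23, "tentative": 5},
    "discussants": {"definite": 29, "tentative": 6},
}

# Fraction of the 43 cases diagnosed correctly, definite plus tentative.
accuracy = {group: sum(counts.values()) / CASES for group, counts in correct.items()}
```

That works out to 28 of 43 for individual clinicians (about 65%) and 35 of 43 for the discussant groups (about 81%, which rounds to the 80% figure).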

It seems that being right 80% of the time is good enough; at least it was the state of the art in 1969.  That’s another important thing to remember when testing medical diagnostic AI: what are we comparing it against?  If medical diagnostic AI can approach an 80% success rate then it will be a viable alternative to seeing a doctor.  Even at a roughly 50% success rate, it’s a reasonable alternative.  For some purposes, such as the Tricorder X-Prize, this should be good enough, since the potential use of the Tricorder X-Prize device would be to help people decide if they should go see a doctor.

 

References

[1] Cohen, Henry. “The Nature, Method and Purpose of Diagnosis,” The Skinner Lecture, 1943. Cambridge, UK: University Press, 1943.

[2] Hogg, Robert and Allen Craig.  Introduction to Mathematical Statistics.  5th ed, NJ: Prentice-Hall, 1995.

[3] Case Records of the Massachusetts General Hospital (Case 30-1969).  New England Journal of Medicine.  1969; 281: 206-213.

 

 


Which Diseases to Diagnose for Tricorder X-Prize?

February 7th, 2012

The Tricorder X-Prize is a $10M competition to foster innovation in medical diagnostics.  The goal of the competition is to create a medical device that can diagnose 15 diseases.  The competition guidelines do not state which 15 diseases the tricorder will need to diagnose; however, it seems the guidelines will be refined in September.  Maybe the diseases will be announced at that point.  Maybe they won’t be announced before the competition.  For now, it’s fun to speculate about which diseases should be included.

In 2008, the Centers for Disease Control and Prevention did a survey of ambulatory care in the US.  They summarized the most prevalent diagnoses at office visits for nearly a million participants.  The most common of all diagnoses was essential hypertension.  The fifth most common diagnosis was diabetes mellitus.  Each of these medical conditions has a fairly well-understood decision tree for diagnosis.

 

Primary Diagnosis       Number of Visits     Percentage
Essential hypertension 45,969 4.81%
Routine infant or child health check 43,178 4.52%
Acute upper respiratory infections, excluding pharyngitis 29,296 3.06%
Arthropathies and related disorders 28,404 2.97%
Diabetes mellitus 25,365 2.65%
Spinal disorders 24,376 2.55%
Normal pregnancy 22,140 2.32%
General medical examination 20,913 2.19%
Malignant neoplasms 19,770 2.07%
Rheumatism, excluding back 18,757 1.96%
Specific procedures and aftercare 18,372 1.92%
Follow up examination 17,652 1.85%
Heart disease, excluding ischemic 17,017 1.78%
Gynecological examination 16,140 1.69%
Otitis media and eustachian tube disorders 15,812 1.65%
Disorders of lipoid metabolism 15,274 1.60%
Ischemic heart disease 14,448 1.51%
Chronic sinusitis 12,506 1.31%
Acute pharyngitis 11,729 1.23%
Allergic rhinitis 9,966 1.04%
All other diagnoses 528,885 55.32%
TOTAL 955,969 100.00%

Table 1: Primary Diagnosis Groups from NAMCS 2008 Survey [1]

 

My understanding (and I am not a doctor) is that hypertension is diagnosed primarily with a high blood pressure reading.  You do have to make sure that the reading is repeatable and not primarily influenced by external factors, such as the presence of a doctor.  Overall, it sounds like diagnosing hypertension boils down to getting blood pressure readings that are consistently high for the patient’s profile (gender, age, etc.).  Blood pressure is not difficult to measure non-invasively; you see blood pressure monitoring machines in grocery stores.  The main design consideration for the Tricorder competition would be whether there is an even less intrusive way to do it, one that does not require the patient to strap a band around themselves.  Even using a traditional approach, for the price of a blood pressure monitor, a device could cover the diagnosis behind nearly 5% of all office visits in the US.
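The “consistently high” rule can be sketched in a few lines.  Everything specific here is an assumption for illustration: the function name, the minimum of three readings, and the 140/90 limits, which would in practice depend on the patient’s profile.  This is not clinical guidance.

```python
def consistently_high(readings: list[tuple[int, int]],
                      systolic_limit: int = 140,   # placeholder threshold
                      diastolic_limit: int = 90,   # placeholder threshold
                      min_readings: int = 3) -> bool:
    """True if every (systolic, diastolic) reading is at or above a limit."""
    if len(readings) < min_readings:
        return False  # not enough evidence that the result is repeatable
    return all(s >= systolic_limit or d >= diastolic_limit for s, d in readings)
```

Requiring several readings is one simple way to guard against a single reading inflated by outside factors, such as the presence of a doctor.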

Diabetes mellitus is #5 with 2.7% of office visit diagnoses.  Again, my understanding is that the decision tree is pretty simple: blood glucose readings outside of the norm for a patient’s profile.  However, blood glucose is traditionally measured invasively, by taking a small blood sample.  While the Tricorder X-Prize guidelines do not rule out devices that use invasive techniques, they strongly encourage noninvasive techniques.  In fact, a medical doctor on our board at Chesney Research described noninvasive blood glucose monitoring to me as one of the “holy grails” of medical device technology.  Since one of the stated goals of the competition is to drive sensor technology, I think diabetes has to be one of the diseases in the competition.

Another holy grail is distinguishing bacterial from viral upper respiratory tract infections.  This disease is the third most prevalent diagnosis in office visits, according to the NAMCS survey.  Right now, there’s no real way to tell the difference other than waiting; bacterial infections tend to last 7-10 days and viral only 2.  However, the course of treatment is very different for each: antibiotics for the bacteria, but not for the virus, since viruses do not respond to antibiotics.

Further down the list is heart disease, the non-ischemic variety, that is, not due to restricted blood supply to the heart.  Heart disease is a pretty broad category.  However, there are analog integrated circuits on the market aimed at measuring electrocardiogram (ECG) signals.  For the price of this chip (typically around $20) and the appropriate interface with the patient, a medical device could take a big step towards diagnosing heart disease.  There is also a wealth of information on the links between heart disease and hypertension and between heart disease and diabetes.  With an ECG, a blood pressure monitor, a glucose meter and some fancy AI, a team may be well on its way to gobbling up a significant portion of heart disease diagnoses.  In fact, those three, hypertension, diabetes and heart disease, would get you nearly one out of every ten (9.24%) of all office visit diagnoses.

If we look at the NAMCS top 20 again, and take out routine follow-ups, checkups and pregnancy, we are left with 14 diseases.  They are given in Table 2.  They accounted for nearly one third of all office visits in 2008.

 

Rank Primary Diagnosis Number of Visits Percentage
1 Essential hypertension 45,969 4.81%
3 Acute upper respiratory infections, excluding pharyngitis 29,296 3.06%
4 Arthropathies and related disorders 28,404 2.97%
5 Diabetes mellitus 25,365 2.65%
6 Spinal disorders 24,376 2.55%
9 Malignant neoplasms 19,770 2.07%
10 Rheumatism, excluding back 18,757 1.96%
13 Heart disease, excluding ischemic 17,017 1.78%
15 Otitis media and eustachian tube disorders 15,812 1.65%
16 Disorders of lipoid metabolism 15,274 1.60%
17 Ischemic heart disease 14,448 1.51%
18 Chronic sinusitis 12,506 1.31%
19 Acute pharyngitis 11,729 1.23%
20 Allergic rhinitis 9,966 1.04%

TABLE TOTAL 288,689 30.20%
TOTAL DIAGNOSES 955,969 100.00%

 Table 2: Top 14 Diseases, Including Chronic Conditions from NAMCS 2008 Survey Data
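The totals in Table 2, and the 9.24% figure for hypertension, diabetes and non-ischemic heart disease, can be re-derived from the visit counts.  The snippet below only restates numbers from the tables; the variable names are mine.

```python
TOTAL_VISITS = 955_969

TABLE_2 = {
    "Essential hypertension": 45_969,
    "Acute upper respiratory infections, excluding pharyngitis": 29_296,
    "Arthropathies and related disorders": 28_404,
    "Diabetes mellitus": 25_365,
    "Spinal disorders": 24_376,
    "Malignant neoplasms": 19_770,
    "Rheumatism, excluding back": 18_757,
    "Heart disease, excluding ischemic": 17_017,
    "Otitis media and eustachian tube disorders": 15_812,
    "Disorders of lipoid metabolism": 15_274,
    "Ischemic heart disease": 14_448,
    "Chronic sinusitis": 12_506,
    "Acute pharyngitis": 11_729,
    "Allergic rhinitis": 9_966,
}

table_total = sum(TABLE_2.values())
table_share = table_total / TOTAL_VISITS

big_three = ("Essential hypertension", "Diabetes mellitus",
             "Heart disease, excluding ischemic")
big_three_share = sum(TABLE_2[name] for name in big_three) / TOTAL_VISITS
```

The 14 diseases sum to 288,689 visits, about 30.2% of the total, and the three conditions together account for about 9.24%.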

 

The competition’s 15 diseases will need to be diagnosed on 30 different patients and the Tricorder will be evaluated for its effectiveness and ease of use by a panel of judges.  The devices should be able to tell the patient if they need to go see a doctor or not.  These 14 diseases are a good place to start.

 

References

[1] National Ambulatory Medical Care Survey: 2008 Summary Tables.  Centers for Disease Control and Prevention.  http://www.cdc.gov/nchs/ahcd.htm

 


Tricorder X-prize: An Opportunity for Machine Learning

January 29th, 2012

A couple of weeks ago, Qualcomm announced they are sponsoring an X-prize to come up with a health care device.  The prize is $10M and the device must diagnose 15 diseases; it doesn’t say which ones.  The goal is to make a consumer device, dubbed a tricorder, that people can use in their homes to provide medical care, which requires advances in sensor technology, medical diagnostics and artificial intelligence, among other things, according to the site.

In reading the overview, there is a section about the potential trade-offs a design team in the competition will have to make.  One of those mentioned is the placement of the AI engine.  I think a more important concern is what kind of artificial intelligence would be in the device and how it would interact with the sensors and the user.  From that the best placement of the AI engine would probably become pretty obvious.

The authors of the overview are right to draw attention to the AI in the tricorder, as diagnosis seems to be the true intent of the device.  The emphasis is not so much on the accuracy and resolution of the sensors themselves; their resolution will be determined by what is just good enough to make an accurate diagnosis.  Instead, the real focus of the competition is on the diagnoses, the intelligence, and this is ripe for machine learning.

One of the most important machine learning functions in the device will be its natural language processing.  Telling the doctor your symptoms is still a major aspect of any patient’s experience with any medical condition.  We’re not yet at the stage where people can just submit to a series of measurements and get a diagnosis.  From the standpoint of both the patient’s comfort level with the device and our own understanding of medicine, any effective diagnosis machine will have to be able to understand a person’s description of why they are consulting the device for medical help in the first place.  A major aspect of any medical diagnosis is how the patient feels, how much pain they are experiencing and how the condition for which they are seeking help is affecting them.  Without a sensor that can accurately measure pain, we have to rely on the patient’s words.

An effective tricorder will have to navigate the language as well as incorporate information from sensors to arrive at an effective diagnosis.  I think this is the first consideration when designing the tricorder: how will it parse the patient’s description of the problem?  Then it can blend in the information from the sensors to complete the story.

 


Introduction to Compressive Sensing

January 24th, 2012

Over at the Chesney Research website, I posted an introduction to compressive sensing.  There’s already lots of good information out there on what compressive sensing is.  This introduction is a little different because it gives an overview of compressive sensing without getting too far into the math and stays focused on why you would want to use it.

 


Clustering

January 10th, 2012

I’ve been working on clustering as it pertains to graph partitioning.  A brief introduction is up at http://www.chesneyresearch.org/cluster.html.

 

 


Projections and Eigenvectors

September 13th, 2011

I was thinking of the immutability of eigenvectors and the immutability of certain vectors when projected and realized these two qualities are one and the same.  Eigenvectors can be viewed and explained in terms of a projection matrix, which may be a more intuitive and easier way to understand eigenvectors than is commonly taught.  Certainly it relies on much less math: only the concept of rows or columns as vectors and a basic understanding of the fundamental and canonical equation that defines eigenvectors and eigenvalues.

Projecting one vector onto another gives a scalar length of the first vector on the second.  It is the amount that the second vector “contributes” to the first.  This contribution is easily calculated by taking the dot product.

In the above example, the vectors are given as column vectors, and since their dot product must result in a scalar quantity, it is calculated as a^T b.

The projection of b onto a creates another vector, let’s call it p, that is in the same direction as a, but whose length is determined by the length and direction of b.

Strang’s Introduction to Linear Algebra book gives a great explanation of projection.  In it, he rearranges the terms of p to get

 

P_a is a rank 1 projection matrix, which makes sense because we are projecting onto the one-dimensional space spanned by a [1].  The projection matrix, P_a, is a transform that projects a vector onto a.  The result of this transform is a scaled version of the vector a.

The lambda parameter gives the amount of scaling on a.  The answer gives how much of a is in b, sometimes referred to as the component of a in b.
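Written out in standard notation, the projection described above takes the following form.  This is the usual textbook derivation, given here as a sketch rather than a copy of the post’s original equations; a, b, p, P_a and lambda are the symbols used in the text.

```latex
p = \lambda a, \qquad \lambda = \frac{a^T b}{a^T a}

p = a \, \frac{a^T b}{a^T a} = \frac{a a^T}{a^T a} \, b = P_a b,
\qquad P_a = \frac{a a^T}{a^T a}
```

Note that applying P_a to a itself gives a back unchanged, since a (a^T a) / (a^T a) = a.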

Projection is a way of moving one vector onto another, that is, the projection changes the direction of a vector to line up with another.  More generally, an n-by-n matrix is a transform that changes an n-element vector into another n-element vector, possibly pointing in a different direction.

The vectors that are able to survive this transform unmoved are called eigenvectors.  They may be scaled differently as a result of the transform, but they are pointing in their same original direction before the transform was applied.  The amount that the transform scales the vector is the eigenvalue.

This is the familiar linear algebra definition for eigenvalues and eigenvectors.  The vector x is an eigenvector of A, if the above equation holds for some nonzero x.  Lambda gives the eigenvalue for that eigenvector, x.

Numerically eigenvectors and eigenvalues look like this,

Where [1;2] is an eigenvector of [1,3;4,5] with eigenvalue 7.  The vector [1;2] survives the transform pointed in the same direction, but with a different scale value (7).
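In symbols, the defining equation and the numerical example just described are:

```latex
A x = \lambda x

\begin{bmatrix} 1 & 3 \\ 4 & 5 \end{bmatrix}
\begin{bmatrix} 1 \\ 2 \end{bmatrix}
= \begin{bmatrix} 7 \\ 14 \end{bmatrix}
= 7 \begin{bmatrix} 1 \\ 2 \end{bmatrix}
```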

Some vectors do not survive the transform without a change in direction.  For instance,

The vectors that are able to survive this transform unmoved, i.e. pointed in the same direction, are the eigenvectors of that transform.  The eigenvalues are the scale values resulting from the transform.
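These claims are easy to check numerically.  The sketch below uses plain Python lists; the helper names are made up for this example, and the vector [1;1] is my own choice of a vector that does not survive, not necessarily the one from the original example.

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def is_parallel(x, y):
    """True if two 2-D vectors lie along the same line."""
    return x[0] * y[1] - x[1] * y[0] == 0


def projection_matrix(a):
    """P_a = a a^T / (a^T a), the matrix that projects onto a."""
    aa = sum(v * v for v in a)
    return [[ai * aj / aa for aj in a] for ai in a]


A = [[1, 3], [4, 5]]
survivor = matvec(A, [1, 2])   # scaled by 7, same direction
moved = matvec(A, [1, 1])      # direction changes, so not an eigenvector

# a is an eigenvector of its own projection matrix, with eigenvalue 1.
P = projection_matrix([1, 2])
projected = matvec(P, [1, 2])
```

The last two lines tie the post’s two ideas together: projecting a onto itself leaves it unmoved, which is exactly the eigenvector property.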

 

 

[1] Strang, Gilbert.  Introduction to Linear Algebra.  Wellesley, MA: Cambridge-Wellesley Press, 2009.


Finding the Chorus of a Song Using Auto-correlation

March 9th, 2011

Correlation is useful for finding commonality between two signals.  This commonality can also be considered redundancy.

The auto-correlation is defined as,
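One common discrete form, consistent with the x[n] * x[n+i] term and the barrel shifts described later in this post, is sketched here.  The 1/N normalization is an assumption, inferred from the fractional peak value reported for “Beat It”; without it the sequence simply counts raw matches.

```latex
R_{xx}[i] = \frac{1}{N} \sum_{n=0}^{N-1} x[n] \, x[(n + i) \bmod N],
\qquad i = 0, 1, \dots, N-1
```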

For a message, the auto-correlation sequence describes how well a message matches up with a shifted version of itself.  When the message is not shifted at all, it matches up very well with itself.  For non-repetitive, random or otherwise unpredictable messages, there is little in common with a shifted version of itself.  The auto-correlation for a random message of 15 symbols is shown in the following plot.

[Plot: auto-correlation of a random message of 15 symbols]

The auto-correlation can be generalized to the cross-correlation between two messages.

Both are represented by a sequence of values of the number of matches at each shift position.  For the purpose of calculation, shifts are barrel shifts, meaning the part that is shifted off the end is prepended back to the beginning.
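For two messages x and y, each of length N, the cross-correlation can be written in the same style, with the barrel shift expressed as an index taken modulo N.  As before, this is a standard form offered as a sketch, and the 1/N normalization is an assumption.

```latex
R_{xy}[i] = \frac{1}{N} \sum_{n=0}^{N-1} x[n] \, y[(n + i) \bmod N],
\qquad i = 0, 1, \dots, N-1
```

Setting y = x recovers the auto-correlation.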

A plot of cross-correlation between two random messages follows.  Notice that even if the two messages each don’t match up very well when compared with a shifted version of themselves, they may still correlate when compared with each other at the appropriate shift amount.

[Plot: cross-correlation between two random messages]

Under an appropriate signaling alphabet (i.e., a closed, finite set of unique symbols), the atomic portion of the autocorrelation calculation, x[n] * x[n+i], is simply a compare between two symbols at different points in the message.

Some example Python code for the correlation function is given below.  Rxy is the calculated cross-correlation sequence between messages a and b, each length N.
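A minimal sketch of such a function follows.  It is a reconstruction under stated assumptions rather than the original listing: the function name correlate, the division by N (so each value is the fraction of positions that match), and the direction of the barrel shift are all my choices.

```python
def correlate(a, b):
    """Circular cross-correlation of two equal-length symbol sequences."""
    N = len(a)
    if len(b) != N:
        raise ValueError("both messages must have the same length")
    Rxy = [0.0] * N
    for i in range(N):            # every barrel-shift amount
        matches = 0
        for n in range(N):        # every position in the message
            # The "multiply" is just a symbol compare; the modulo wraps
            # the shifted message around, which is the barrel shift.
            if a[n] == b[(n + i) % N]:
                matches += 1
        Rxy[i] = matches / N      # assumed normalization: fraction matching
    return Rxy
```

Passing the same message for both arguments gives the auto-correlation; for example, correlate("abcabc", "abcabc") peaks at shifts 0 and 3.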

Notice the nested loops.  There may be a better way of doing this in Python, whether by coding it differently or by using lower-level routines as NumPy does, but this illustrates what makes correlation computationally expensive.  For two sequences, each N long, it takes N^2 operations to calculate the cross-correlation between the two.  This is the drawback of using a serial machine, which for all intents and purposes is how most processors approach it.  You would probably enjoy a significant speed-up by using a highly parallel machine like a graphics processor, at the expense of some very low-level coding to trick it into calculating correlation for you, something it was probably not designed to do.

Even though it can take a while to calculate for long sequences, the auto-correlation function can be very useful, especially when analyzing a message of symbols.  In a graph of the correlation function, the peaks are instances of repetition.  Repetition occurs, for instance, in the chorus of a song.  Consider an entire song a message, that is, a sequence of symbols.  Consider only the lyrics to the song, stripping it entirely of its musical content.  Each word of the song is a symbol; the specific order and collection of these symbols uniquely identifies the lyrics of the song.

Take, for example, the lyrics to Michael Jackson’s 1982 smash hit “Beat It.”  The start of the chorus is recognizable by the oft-repeated title of the song.  However, this is not the entire chorus.  Can we find the chorus of the song using auto-correlation?  In fact we can; the autocorrelation function does a good job of finding the chorus in this example.  The compare part of the auto-correlation calculation is a simple string compare: do the two words being compared match exactly or not?

Here’s the algorithm:
  1. Flatten the song into a sequence of words.
  2. Calculate the auto-correlation for the sequence.
  3. Find the index (not 0) of the auto-correlation that has the biggest value.
  4. Find the longest run of consecutive matches for the index found in step 3.
  5. Therein lies your chorus!
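The five steps above can be sketched as a single function.  This is an illustration under a few assumptions: the name find_chorus is made up, words are split with a simple regular expression, the auto-correlation counts raw matches, and a run of matches is not followed across the wrap-around at the end of the song.  The test lyrics are a made-up toy, not a real song.

```python
import re


def find_chorus(lyrics: str) -> tuple[int, list[str]]:
    # Step 1: flatten the song into a sequence of lowercase words.
    words = re.findall(r"[a-z']+", lyrics.lower())
    N = len(words)
    # Step 2: auto-correlation, counting exact word matches at each shift.
    Rxx = [sum(words[n] == words[(n + i) % N] for n in range(N)) for i in range(N)]
    # Step 3: the nonzero shift with the biggest value.
    shift = max(range(1, N), key=lambda i: Rxx[i])
    # Step 4: the longest run of consecutive matches at that shift.
    best_start = best_len = run_start = run_len = 0
    for n in range(N):
        if words[n] == words[(n + shift) % N]:
            if run_len == 0:
                run_start = n
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    # Step 5: that run is the candidate chorus.
    return shift, words[best_start:best_start + best_len]


toy = "one two three sing it loud tonight four five six sing it loud tonight"
shift, chorus = find_chorus(toy)
```

On the toy lyrics the repeated four-word phrase sits seven words apart, so the peak is at a shift of 7 and the longest run recovers the phrase.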

For “Beat It,” I plotted the auto-correlation.

[Plot: auto-correlation of the lyrics to “Beat It”]

The auto-correlation peaks at a shift of 28 (Rxx[28] = 0.184343434343).  There are several words that match when you take the lyrics of “Beat It,” shift them by 28 words and compare it to an unshifted version of itself.  Several of them are stray individual words matching.  However, when I looked at position 278, there is a long string of consecutive matches

In fact, it keeps going to index 333.  The auto-correlation method finds two consecutive instances of the chorus, which is:

Just beat it! Beat it!
No one wants to be defeated
Show them how funky and strong is your fight
It doesn’t matter who’s wrong or right

Notice that it finds the whole chorus, not just the oft-repeated “Beat” or “Beat It.”  Other methods might focus on the most frequently appearing word or digram (two words together) and miss the rest of the chorus, which actually contains much more thematic content.  In this example, the most oft-repeated phrase, “Beat It,” is just part of the reason for the high degree of correlation.  The rest of the chorus, probably the less easily remembered (but equally important) part, “No one wants to be defeated…. who’s wrong or right,” matches as well when the message is shifted by the appropriate amount (28 words) and compared with itself.  With other methods it can be difficult to tell that “No one wants to be defeated…” is part of the chorus and not the next verse.


Collapsing Content

February 25th, 2011

I’ve learned a few new terms this week.  One is online astroturfing, which is the practice of creating a lot of fake online personas and using them to support a certain cause.  I learned of it by way of George Monbiot’s blog.  George points to the discovery by the Daily Kos of the recent solicitation by the US Air Force for persona management software.  These personas are created to build support for/against certain things online by parroting essentially the same point of view in comment threads, discussion groups, etc…, and generating entire online identities to lend credence to those “opinions.”

Another term I learned, by way of the Media Standards Trust, is churnalism, which is the practice of posting a press release as a news article, masquerading as original journalism.  The churnalism web site lets a user paste an article into a window and check whether it matches press releases in its database.  Their site compresses the text and compares it to a database of other compressed texts to check for similarity, or more accurately, instances of exact matches.

What these two new terms have in common is the desire to recognize original content in the presence of artificially repeated content.  We humans are easily persuaded when there appear to be a lot of people saying essentially the same thing, an impression that is fairly straightforward to create online.  What would help us negotiate a world of churnalism and online astroturfing is the ability to collapse content.  In other words, we need to be able to view a body of texts as composed of several themes and, if we choose, view those themes fairly and individually, regardless of how often they are repeated.

Natural language processing (NLP) can help.  By using machine learning to programmatically make sense of human text and speech we could develop algorithms to collapse similar content together, leaving only original ideas — no matter how unpopular — standing out.  If we perform this kind of filtering on our content before we see it, we could avoid our innately human tendency to lend more credence to the idea being supported by more people.

The idea of using people vs. machines to sift through content has certainly come up previously.  The two methods were probably most directly pitted against one another as news consumers evaluated Yahoo’s and Google’s news aggregators in the mid- to late 2000s.  Yahoo used (still uses?) human readers to decide which news stories most readers would like to see.  Google News uses an algorithm to determine which stories are relevant and also to cluster similar stories together.

So how does Google do it?  How do they decide which stories are similar?  The speculation is that they may be using stopwords, which are very common words such as “the” and “a”.  As Greg Linden points out, the seminal paper on using stopwords comes out of Stanford, and it develops an algorithm for finding matching signatures in a large corpus of data, such as a web crawl.  Greg gives an example of it.  Say you find “a weeklong campaign” in a page: the “a” is the stopword and a-weeklong-campaign is the stopword signature.  Texts that share many such stopword signatures are likely to be talking about the same thing.

Unfortunately, this type of approach may not help much when trying to spot a generated online persona from blog posts, comments and tweets, for a couple of reasons.  For a content-collapsing application, it’s not good enough to know that multiple texts are discussing the same subject; instead we need to determine whether they are espousing essentially the same point of view.  Another reason this technique might not work for content collapsing is that it appears to be vulnerable to trivial substitutions, such as minor spelling differences (“week-long” vs. “weeklong”) or synonym substitution.
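To make the stopword-signature idea and its weakness concrete, here is a deliberately simplified sketch.  It is not the algorithm from the Stanford paper or anything Google is known to run: the tiny stopword list, the choice of two following words per signature, and the use of Jaccard overlap to compare signature sets are all assumptions made for illustration.

```python
import re

# A tiny illustrative stopword list, not the one used in the paper.
STOPWORDS = {"a", "an", "the", "is", "of", "to"}

def tokens(text):
    """Lowercase and split into words, keeping hyphens and apostrophes."""
    return re.findall(r"[a-z0-9'-]+", text.lower())

def stopword_signatures(text, chain=2):
    """Collect (stopword, next word, next word, ...) tuples from the text.

    Each signature is a stopword followed by the next `chain`
    non-stopword words (an assumption for this sketch).
    """
    words = tokens(text)
    signatures = set()
    for i, word in enumerate(words):
        if word in STOPWORDS:
            following = [w for w in words[i + 1:] if w not in STOPWORDS][:chain]
            if len(following) == chain:
                signatures.add((word, *following))
    return signatures

def jaccard(a, b):
    """Share of signatures two texts have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

print(stopword_signatures("a weeklong campaign"))
# → {('a', 'weeklong', 'campaign')}

original = stopword_signatures("The senator began a weeklong campaign")
respelled = stopword_signatures("The senator began a week-long campaign")
print(jaccard(original, original))   # → 1.0
print(jaccard(original, respelled))  # one hyphen lowers the overlap to 1/3
```

In this toy case a single hyphen changes one of the two signatures, so two sentences a human would call identical share only one signature out of three distinct ones.  Normalizing spelling or mapping synonyms before extracting signatures would be one possible mitigation, though that is speculation on my part rather than something the paper prescribes.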

The churnalism algorithm appears to have a similar vulnerability — reliance on exact matches to determine similarity.  When I say vulnerability, I’m talking about repurposing these algorithms for spotting generated identities online or trivially-rephrased news articles — I’m not talking about the intended purpose of these algorithms, which they are probably pretty good at achieving.  I do believe these algorithms are a first step towards collapsing content of large corpora together, something that would be immensely useful to me personally in dealing with information deluge.  However, to collapse content in a way that removes or mitigates the bias formed from artificially-multiplied consensus, we’ll need to dive deeper into natural language processing and develop more sophisticated algorithms.  I believe the promise is there — to stay one step ahead of the automatically-generated content — but not realized yet.

 

Posted in Uncategorized | Comments Off

Clean Rooms

February 2nd, 2011

The other day I was talking to a friend who works at a large general contractor — these are the companies that manage major construction projects like ports, schools, and hospitals.  He had just finished managing the construction of a new wing of a hospital.  We were talking about that and talking about his business, in general, and he reminded me that they also manage construction projects for semiconductor companies, the kinds of projects that would be involved in opening new fabrication facilities or expanding existing ones.

I commented that I thought that was a strange mix of business: hospitals and semiconductor fabs.  My friend explained to me that it really isn’t — a lot of the skills needed for each are actually complementary, such as the need to be able to build facilities that are exceptionally clean, to keep them that way and to show that they are clean.

When I say clean, I mean free of particles and dust.  Wikipedia provides a good definition of a clean room as,

an environment … that has a low-level of pollutants such as dust, airborne microbes, aerosol particles and chemical vapors.

A room free of airborne microbes can significantly reduce the risk of hospital-borne infections.  A room free of dust and chemical vapors is a must when etching microscopic features into silicon, usually with the help of a “wet” acid, to form electrical circuits on a semiconductor substrate — microchips are made this way.

You can never keep a room of any appreciable size and utility completely free of dust, particles and vapors.  Fortunately, there are ways of measuring a room’s cleanliness by counting the number of particles in a given volume of air.  The ISO standard (14644-1) defines a series of clean room classes.  For instance, ISO-9 is normal room air, which is considered to contain 35 million particles per cubic meter.  What is considered a particle?  Anything bigger than half a micron, which is 5/10,000th of a millimeter.  For particles of this size, the classes step down to ISO-2, which allows 4 particles (half a micron or bigger) in a volume of 1 cubic meter.  There is also an ISO-1 classification, which allows no particles of half a micron and is defined by counts of smaller particles.
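The class limits quoted above follow from the formula in ISO 14644-1, where the maximum concentration for class N and particle size D (in microns) is 10^N × (0.1 / D)^2.08 particles per cubic meter.  The short sketch below simply evaluates that formula; the function name is my own, and the published class table rounds these values, so treat the output as approximate.

```python
def max_particles_per_m3(iso_class, particle_size_um=0.5):
    """Maximum particles per cubic meter at or above the given size.

    Evaluates the ISO 14644-1 class formula:
        C = 10**N * (0.1 / D)**2.08
    with N the ISO class number and D the particle size in microns.
    """
    return 10 ** iso_class * (0.1 / particle_size_um) ** 2.08

for iso_class in (9, 4, 2):
    limit = max_particles_per_m3(iso_class)
    print(f"ISO-{iso_class}: about {limit:,.0f} particles of 0.5 micron or larger per cubic meter")
```

For ISO-9 this gives roughly 35 million particles, and for ISO-2 it rounds to 4, matching the figures above.  Each step down in class number is a factor of ten fewer particles.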

What level of cleanliness is appropriate for which activities?  That probably depends on what you’re trying to do.  There’s a semiconductor fab at BYU that boasts ISO-4 compliance.  A company called Nemotek that makes “wafer-level cameras” announced nearly 2 years ago the opening of an ISO-4 clean room in Morocco.  You can imagine how dust particles would affect building camera lenses en masse.  Nemotek claims they can upgrade to ISO-1, but for now ISO-4 is good enough.

So how does one go about constructing a clean room?  The Global Society for Contamination Control maintains a wiki that has a good section on designing a clean room.  The main point of design is how air is handled in the room.  Air coming into the room needs to be filtered free of dust.  Air inside the room is recirculated through HEPA (High Efficiency Particulate Air) or ULPA (Ultra Low Particulate Air) filters.  A person entering or exiting the room must first go through an air lock and sometimes through an air shower.  Sometimes the room is kept slightly pressurized so that, if there is a leak, air flows out of the room rather than in.  Also, the entire air flow system must be considered, whether laminar or turbulent, and designed to drive particles to filters installed near the floor.

With so much care and attention to detail, it’s no wonder that clean rooms tend to be built by major microelectronics corporations or large hospitals.  Here in Ann Arbor, the new and massive Mott Children’s Hospital, a state-of-the-art complex in not only medicine but also green building design, has installed HEPA filters throughout the hospital so that patients with compromised immune systems can roam freely.  It is quite an undertaking to build a clean room, but one that is becoming more and more necessary across different industries, from semiconductor fabs to hospitals.  Is this necessity only available to large complexes and multimillion- and billion-dollar corporations?  Is there a way to make this technology more readily available?

Certainly making clean rooms more readily available has its hurdles.  These include the work of constructing the room to exacting standards and testing it against those standards.  They also include the materials used to construct the room, which must be unlikely to flake or shed over time.  But most important of all, they include teaching the people who do their jobs in the clean room how to maintain a clean environment.  No matter how clean you build the room, you can easily destroy that hard work with common habits.  This means that one of the biggest challenges is not just constructing the clean room, but making sure it stays clean.

Posted in Uncategorized | Comments Off

Humanity’s Most Versatile Servant

December 16th, 2010

Via metafilter, a blog called Abnormal Use is looking back through the New York Times archive at predictions made in 1931 about 2011.  My favorite of the bunch is from William F. Ogburn, a sociologist, who predicted,

Humanity’s most versatile servant will be the electron tube.

According to Wikipedia, Mr. Ogburn was known for his idea of cultural lag in technology, which means that social strife will result from technological advances to which humans haven’t yet learned to adapt.  He was also an advocate of technological determinism, which argues that a society’s technology drives its social development and structure.

While Mr. Ogburn did not know about the microchip in 1931, as Abnormal Use points out, I believe the gist of this prediction is right on.  Electrical circuits are our most versatile servant.  Today, we are still exploring that versatility with electrical circuits implemented in semiconductors.  With the benefit of hindsight, Mr. Ogburn’s prediction is probably best restated as the transistor, as opposed to the electron tube, since this is the key innovation that was miniaturized (and made more manufacturable) with the advent of semiconductor circuits.  However, it’s hard to imagine that 80 years from now we will not have migrated to a more advanced platform.  Hopefully, in 2091, we will still be making useful electrical circuits, but with a more accessible technology than semiconductors.  Although we can make and distribute millions and millions of semiconductors each day, designing and producing them is expensive and accessible to only a few people.

Posted in Uncategorized | Comments Off
