Suppose you have 3 sensors: A, B, C. Each has a set of readings that goes into a column of a matrix, b:
A sensing matrix A is constructed such that b_A, b_B, and b_C represent compressed measurements in some domain. A is Gaussian, since this is a universal sensing matrix for compressive sensing (CS): a Gaussian matrix is incoherent with (poorly correlated to) essentially any basis other than itself.
The solution for each x_i (i ∈ {A, B, C}) represents the data recovered from these compressive measurements and can be found with min ||x_i||_1, subject to the measurement constraints, or some other sparsity-promoting technique.
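As an illustrative sketch (not the post's actual data), the recovery for one sensor can be posed as a linear program via the standard split x = u - v with u, v >= 0, using scipy. The Gaussian A, sparse x, and measurements b below are made-up stand-ins:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 20, 10                      # signal length, number of measurements
A = rng.standard_normal((m, n))    # Gaussian sensing matrix
x_true = np.zeros(n)
x_true[[3, 11]] = [10.0, 2.0]      # hypothetical sparse ground truth
b = A @ x_true                     # compressed measurements

# min ||x||_1  s.t.  Ax = b, via x = u - v with u, v >= 0
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
x_rec = res.x[:n] - res.x[n:]      # recovered signal for this sensor
```

With enough Gaussian measurements relative to the sparsity, this kind of program typically recovers the sparse signal exactly.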
I am interested in the condition when all sensors (A, B, C) agree. Consensus, that is, agreement among multiple sensors, is important for many applications. One example is safing functions. Safing is checking for agreement among sensors, which are often redundant, before engaging a drastic action, for example, deploying the airbags in the event of an automobile crash. When all sensors agree (or agree closely enough), the dangerous function of deploying an airbag toward the driver and passengers can be performed.
In this example, I am interested in the case when each sensor finds some nonzero sparse solution. Previously, I developed the notion of the diverse solution to a set of linear equations that require a positive integer solution. The most diverse solution is the one whose elements are of roughly equal value. In another post, I showed that the diverse solution can be used to mitigate risk.
For the safing function, we want to mitigate the risk that our sensor data is in error, indicating that we should erroneously employ our drastic action. We use multiple redundant sensors to mitigate this risk. The agreement among sensors is maximized when diversity is maximized across all sensors. Even though we are seeking a sparse solution for each individual sensor (for example, min ||x_i||_1), we seek to diversify the number of sensors that have witnessed a significant event, that is, have found a nonzero sparse solution. So the algorithm is a sparsity-based linear program wrapped inside a diversity-based linear program. When all sensors agree, the peak-to-sum ratio (PSR) is minimized, provided the sensitivity and units of each sensor are normalized. Since we are nesting two linear programs together, it is important that the innermost program, recovering the compressively-sensed (CS) data, seeks a sparse solution, since this can be computationally efficient.
The l1-norm minimization routine is used as an example of a CS recovery routine and exists inside the maximum diversity routine.
Where x(:,i) indicates the ith column of x. The objective function value of the l1-norm minimization of each column of x is stored in a vector, u. This is the innermost routine. The outermost routine seeks to maximize the diversity of u by minimizing its peak-to-sum ratio.
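As a rough sketch of this bookkeeping (the x below is hypothetical, not the output of a real recovery): each column of x is one sensor's recovered signal, u collects the l1 objective values, and the outer routine scores consensus with the PSR of u.

```python
import numpy as np

# Hypothetical recoveries, one column per sensor; each sensor saw roughly
# the same magnitude-10 event at a slightly shifted index.
x = np.array([[10.0, 0.0, 0.0],
              [ 0.0, 9.5, 0.0],
              [ 0.0, 0.0, 10.2]])

u = np.sum(np.abs(x), axis=0)   # u[i] = ||x(:,i)||_1, the inner objectives
psr = u.max() / u.sum()         # 1/N <= psr <= 1; low psr signals consensus
```

Here psr comes out near its 1/N lower bound, which is the "all sensors agree" case discussed below.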
Let’s look at some example solutions. We’ll just look at types of solutions that we can reach with this algorithm, independent of the b and A that actually generated them.
In the first example, above, all sensors seem to be in agreement that something happened (of magnitude 10). This is indicated by the low PSR, which has a lower bound of 1/N (= 1/3, in this case). There must be a little bit of noise in the readings, since the agreement isn't perfect, but this could be considered close enough to declare consensus. Notice, also, that the sensors don't get exactly the same reading; the readings are basically shifted by one index. If the columns of x represent samples in time, then each sensor is phase-shifted with respect to the others. In this algorithm, we don't have to explicitly time-align the data, which can be a computationally intensive step. We just accept the data as is, even with small differences between sensors.
In the second example, above, 2 of the 3 sensors seem to agree. One sensor appears to be broken; however, this has not completely destroyed our PSR. We could still limp along with 2 functional sensors for a little while, until we are in a position to address the potentially defective sensor. The algorithm has mitigated the risk of one sensor dying by using a diversity-promoting strategy.
In the third example, above, something is clearly busted. One sensor has detected a significant event, but the majority have not. A simple majority voting scheme would not be suspicious of this data set, but our nested diversity/sparsity linear program is. Notice that the PSR is getting closer to its upper bound, which is 1. At this level of PSR, our algorithm diagnoses our system of sensors as broken and can take action accordingly, instead of making a bad decision.
The strategy of reaching a consensus among sensors by using both sparsity- and diversity-promoting techniques has made this system more robust. However, the way the actual computation is performed hasn't been made clear yet. Sparsity-based recovery techniques have been well covered, and recently I posted about how to solve a diversity-based integer program. Next I'll look at how to nest these linear programs.
Baraniuk, Richard. "Compressive Sensing," IEEE Signal Processing Magazine, July 2007.
In an earlier post, we saw the benefits of finding the solution to a set of equations with the most diversity, which was found by minimizing the peak-to-sum ratio (PSR). This post discusses ways to formulate the linear program to arrive at that solution.
Consider the following system of equations.
The A and b matrices for this system of equations are
We focus on finding an initial solution for the linear program using two methods.
Method 1: Elimination and Back-substitution
The first method for solving for the minimum-PSR solution starts with Gaussian elimination and back-substitution (Gauss-Jordan elimination) to find the initial solution. From there, the linear program iterates to find a feasible solution with the minimum PSR that is also an integer solution with nonnegative entries. With elimination and back-substitution, [A b] turns into:
This means that
is a solution to the system of equations, if non-integer solutions are allowed. However, in this problem, they are not. The first method takes this solution and then finds the nearest integer vector — not necessarily a solution — and then proceeds to the minimum PSR solution in the feasible set, considering only integer vectors.
For an underdetermined system of equations, elimination and substitution will not yield a solution at all. Instead, it will describe the system of equations in terms of certain variables whose values need to be chosen, called "free variables." From the choice of free variables, the others are determined, and the linear program can iteratively make better and better choices of free variables. One common initial choice for the free variables is to set them all to 0. This, unfortunately, yields the sparsest solution, which we saw earlier is the opposite of the most diverse solution. Starting with the sparsest possible solution means the linear program will have to work for as long as possible to get to the most diverse solution.
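One way to see the problem with zeroed free variables, sketched on a toy one-equation system (not the post's example): compare that start against the minimum-l2-norm particular solution from lstsq, which spreads the values out and lands much closer to the diverse solution.

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0, 1.0]])  # one equation, four unknowns: sum(x) = 12
b = np.array([12.0])

# Start 1: set the free variables (x2..x4) to zero -- the sparsest start
x_free_zero = np.zeros(4)
x_free_zero[0] = 12.0

# Start 2: the minimum-l2-norm particular solution, which spreads values out
x_min_norm, *_ = np.linalg.lstsq(A, b, rcond=None)  # -> (3, 3, 3, 3)

def psr(v):
    return np.max(v) / np.sum(v)
# psr(x_free_zero) = 1.0  (as far from diverse as possible)
# psr(x_min_norm) = 0.25  (already at the 1/N bound for this toy system)
```

The lstsq start is just one plausible alternative; the point is that any start denser than the zeroed-free-variable solution shortens the march to the diverse answer.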
Method 2: The Intersection of a Boundary Condition and the Minimum PSR Line
Let’s modify the system of equations slightly to look at them from a different angle. We’ll turn the “greater-than” to “less-than” and add in another constraint. This looks more like the investing example from earlier.
Note also that PSR is bounded for nonnegative integers.
The second method for finding the initial solution for the maximum diversity integer program exploits this fact: the PSR of an N-element, nonnegative, integer vector can never be below 1/N. So if one of our equations has the form:
The minimum PSR integer program can reduce to:
So you can use a sparsity-promoting integer program to find the minimum-PSR, maximum-diversity solution once you subtract off the evenly distributed vector. Note that this vector does not need to be in the feasible solution space; it can be used as an initial value. Now the minimum-PSR integer program can take advantage of the computational efficiency of, and the extensive analysis devoted to, sparsity-promoting integer programs. We are looking for a vector that is not very different from the evenly distributed vector, just as with sparsity-promoting programs we look for a vector that is not very different from 0.
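Here is a toy sketch of that change of variables (the constraints sum(x) = 12 and x1 - x2 = 1 are invented for illustration): write x as the evenly distributed vector plus a small correction y, so x stays near the PSR bound.

```python
import numpy as np

S, N = 12, 4
x_even = np.full(N, S / N)     # evenly distributed start: (3, 3, 3, 3)

# A small integer correction satisfying sum(y) = 0 and y1 - y2 = 1
# (so that x still satisfies sum(x) = 12 and x1 - x2 = 1):
y = np.array([1, 0, 0, -1])
x = x_even + y                 # -> (4, 3, 3, 2)

psr = np.max(x) / np.sum(x)    # 4/12 = 1/3, near the 1/N = 1/4 bound
```

Keeping y small (in the l1 sense) is exactly the sparsity-promoting subproblem described above.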
Making Sure All the Units are the Same is a Big Deal
If N >> S, then subtracting off the evenly distributed vector doesn't do much, and we're pretty much back to trying to solve a system of equations by minimizing the first norm. This only works if x is sparse; otherwise, it's a terrible approximation to the solution of Ax <= b. However, N >> S is telling us that there isn't much to distribute, compared to the number of bins we are distributing to. So we may not get a good, diverse answer anyway.
Another way of looking at the constraint
is that it relates all of the elements of x to one another. This equation requires that all of the units of x are the same for the equation to make any sense. Referring back to the investing example again, the A that gives the system of equations for this example is, overall, actually pretty sparse. Without the two row vectors of A that are not mostly made up of zeroes, 1^T and p^T, the elements of x could take on wildly different values if the values of b varied wildly. For instance, x2 could be on the order of 1,000 and x3 could be on the order of 1,000,000. This would make it very difficult for an integer program that minimizes the first norm, as given in method 2, to arrive at a good answer.
So the choice of starting with a dense row of A and finding its most diverse solution as an initial vector is not arbitrary. For well-formed problems that seek a most diverse solution, it can actually be a great starting point. If your system of equations does not have such a row, then you may not really need the most diverse solution, or maybe you have not fully captured the problem. Usually, a system of equations has many, if not all, rows dense. Then it becomes a question of finding the best one for the initial solution to the maximum diversity integer program, which I'm sure is a whole investigation unto itself.
Minimize risk by maximizing diversity. Whether you’re a trader deciding which stocks to invest in, an investor deciding which companies to invest in, or a manager trying to decide which projects to support, you can minimize your risk by investing in a portfolio of products. This approach fits very nicely with the notion of diversity developed earlier.
Suppose you have 5 companies seeking investment. An investor can purchase up to 30% of a company. Each company must project its value at the end of the investment period and the investor has this information, along with the price per share for each company. Different companies have different prices per share and different projected valuations at the end of the investment period.
The investor cashes out and earns a share of the value of the company when the investment period is complete, proportional to the amount of the company the investor owns. For example, if the investor owns 3000 of the company’s 10000 shares, 30% of the valuation of the company at the end of the investment period goes to the investor.
The investor can only purchase whole shares of the company (no fractional shares; this is like a minimum investment). The investor may not 'short' a company, that is, sell shares of a company that the investor does not own. So the number of shares an investor chooses to purchase must either be 0 or some positive integer less than or equal to the number of shares for sale.
Companies may or may not meet their valuation projections. Companies may go under, resulting in a valuation of 0 at the end of the investment period. The investor does not know how likely a company is to meet its goal or fold altogether. The goal is to avoid modeling this unknown and to develop a robust strategy based on the only two things the investor knows for sure: price per share and the projected valuation. The investor seeks a 10% return and has $100,000 to invest.
Let’s write some equations to figure out what the investor should do. We’ll use x to represent the amount of money invested in each company.
The restriction on how much of each company the investor can own amounts to a restriction on the amount of money the investor can invest in each company. Let's say that for company 1, the maximum investment is 50000; for company 2, it's 40000; and for company 3, it's 60000.
Each company expects to turn the investment they receive into a return to the investor. We'll use p to represent the percentage return on investment each company claims they will return to the investor at the end of the investment period. If a company claims a 10% return on their investment, then their value for p is 1.1, since they will return the initial investment, plus another 10% on top of that. The investor has a goal of 10% return on total investment, so the weighted sum of p, weighted by the amount invested, must be greater than 110000.
That last equation can be rewritten as:
So now we have our equations in the form Ax<=b, where
To find the most diverse solution, the optimization problem is
Contrast this with the most common way of solving this problem, which is to focus on maximizing return. One way of maximizing return is to ignore the constraint of getting at least 10% on the investment and just get as much as possible. That linear program is
Another way of maximizing the return is to leave the constraint as a minimum bound on the return. That just tells the solver of the linear program to quit once 10% return is reached. That kind of violates the spirit of maximizing your return and is not very illustrative when comparing it to maximizing diversity. We’ll treat max return as truly getting as much return as possible, without regard for how difficult the problem is for the solver, and drop the minimum return requirement for max return.
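To make the max-return program concrete, here is a sketch with scipy's linprog. The caps for companies 4 and 5 and all of the p values below are hypothetical placeholders (the post's own figures are not reproduced here), chosen so that higher caps go with higher promised returns:

```python
import numpy as np
from scipy.optimize import linprog

budget = 100_000
caps = [50_000, 40_000, 60_000, 70_000, 30_000]  # caps for companies 4, 5 hypothetical
p = np.array([1.10, 1.08, 1.12, 1.14, 1.06])     # hypothetical gross returns,
                                                 # ordered so higher cap -> higher p

# maximize p.x  s.t.  sum(x) <= budget, 0 <= x_i <= cap_i
# (linprog minimizes, so negate the objective)
res = linprog(-p, A_ub=[np.ones(5)], b_ub=[budget],
              bounds=[(0, cap) for cap in caps])
x = res.x  # fills the highest-p company to its cap, then the next,
           # until the budget runs out
```

With these placeholder numbers, the solver pours 70,000 into company 4 and the remaining 30,000 into company 3, exactly the concentration behavior discussed next.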
Let's look at the two strategies: max diversity and max return. Let's say that as a company offers more return, its price per share is higher and, even accounting for differences in the number of shares available, this translates into a higher maximum investment. So the companies with a higher maximum investment are promising a bigger return. Let's say p is:
The max return solution is:
This solution has a projected return of 14.1%. This solution is 3-sparse for N=5, or 60% sparse. It concentrates as much investment as possible into the highest returning company. Once that investment is maxed out, then the rest goes into the second highest returning company until the total investment limit is reached.
The most diverse solution is:
It has a PSR of 0.20834 and a return of 10.00016%.
What happens if one of the companies fails? If it's the high-return company, 4, the max return solution, since it is sparse, only returns 33.6% of the investor's money ($33,600). However, the diverse solution is much more robust and still returns over 86% of the investor's money ($86,041) if company 4 fails and the others meet their targets. For a 4% gain on return, the risk that you would lose over 50% of your money is clearly not worth it.
Maximizing return could give a sparse solution, even when you don't specifically seek it. The sparse solution is not robust to the risk of total failure, in this case, zero return. The minimum-PSR solution sacrifices return to deliver a diverse solution which is more robust to failure, that is, it is much more likely to have an acceptable return. Notice that I am comparing max return vs. the diverse solution as strategies. Allowing the sparse solution is what makes max return less robust. When considering how to invest, the sparse solution is focusing your efforts, which could promise more return, but is less robust. The diverse solution is hedging your bets. Once the minimum return goal is met, the diverse solution mitigates risk by spreading it out across a portfolio of investments.
We know about sparsity for vectors, which is the property of having a lot of 0’s (or elements close enough to 0). The 0’s in a sparse matrix or vector, generally, make computation easier. This has been a boon to sampling systems in the form of compressive sampling, which allows recovery of sampled data by using computationally-efficient algorithms that exploit sparsity.
The opposite of sparsity is density. A vector whose entries are mostly nonzero is said to be "dense," both in the sense that it is not sparse and in the sense that, if it is really large, calculations on it are slow and difficult to muddle through. In general, dense vectors are not very useful. Or are they? In this post, I introduce the notion of diverse vectors, a subset of dense vectors, as a more interesting foil to sparsity than dense vectors alone.
This post explores a measure of diversity called the peak-to-sum ratio (PSR). We can use PSR to find the most diverse solution in a linear program, but what can it be used for? We'll find that this linear program optimally distributes quantized, positive units, such as currency or genes. Maximizing diversity, by minimizing PSR, is desirable in several applications:
Sensor Data Fusion and Machine Learning
In contrast to some systems which allow dense vectors, these systems actually desire the most diverse solution.
Consider the following system of equations:
x1 + x2 + x3 + x4 = 12
x2 - x3 = 0
x4 = 0
The following table gives a partial list of possible solutions (for x1, x2, and x3, since x4 is already given).
The sparsest solution is highlighted in green, (12, 0, 0, 0). It has the most zeroes in it. It has everything concentrated in one element of the vector, the first element. All other elements are zero, making several computations on that vector faster and easier.
Notice the solution highlighted in yellow, (4,4,4,0). It has very few zeros in it; in fact, none except the one variable that was explicitly set to 0. But there are other solutions that have only one zero, or are, as they say, 1-sparse. For instance, (10,1,1,0) is also 1-sparse. The vector (4,4,4,0) is highlighted and (10,1,1,0) is not because (4,4,4,0) is more diverse. What makes it more diverse? The (4,4,4,0) vector has the most equitable distribution among its elements, which is, conceptually, the opposite of being sparse. Whereas the sparsest solution has its energy concentrated in the fewest elements, the most diverse solution has its energy distributed to the most elements. In fact, (10,1,1,0) could be considered more sparse than (4,4,4,0), as it is closer to having its energy concentrated in one element (first posited in Zonoobi et al.).
So if we’re talking about more or less diverse/sparse, this implies that we can quantify it. One way to quantify it is to look at the peak-to-sum ratio (PSR). The peak to sum ratio is defined as the ratio of the largest element in x to the sum of all elements of x.
PSR can also be expressed as the ratio of two norms, the infinity norm and the first norm. Let’s see what this ratio looks like for the solutions we considered above, plus a few more.
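As a concrete check of the definition, here is a minimal sketch of PSR (numpy assumed), evaluated on the candidate solutions above:

```python
import numpy as np

def psr(x):
    """Peak-to-sum ratio: the largest element of x over the sum of all
    elements (equal to ||x||_inf / ||x||_1 when x is nonnegative)."""
    x = np.asarray(x, dtype=float)
    return x.max() / x.sum()

psr([12, 0, 0, 0])   # 1.0   (the sparsest solution)
psr([10, 1, 1, 0])   # 10/12 (1-sparse, but still concentrated)
psr([4, 4, 4, 0])    # 1/3   (the most diverse solution)
```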
The (4,4,4,0) solution appears to have the lowest PSR until we proceed to solution (13,13,-14,0). As we proceed past (13,13,-14,0), PSR becomes vanishingly small. In fact, we see that (4,4,4,0) was just a local minimum for PSR.
This is unsatisfying for a couple of reasons. The first is linear programming. We can't use a linear program to minimize PSR and arrive at the most diverse solution because PSR is not convex; that is, we can't be sure we're on a path to smaller and smaller PSR, because we could encounter a local minimum, as we did at (4,4,4,0).
The second way this is unsatisfying is that (13,13,-14,0) does not seem more diverse than (4,4,4,0). If we’re considering the elements of x to be bins of energy, or some other thing that is to be distributed, then having an element that is -14 does not make any sense. Instead of distributing, or deciding how things are to be allocated, we are actually taking them away, which defeats the whole purpose of what we are trying to do. So let’s add in a constraint that x has to be greater than 0.
For this particular problem, we can proceed in a linear fashion from the first candidate solution (12,0,0,0) to the most diverse one (4,4,4,0). However, in general, is this problem convex? That is, could we choose another set of equations and be able to march towards the most diverse answer?
Consider the definition of convexity for a feasible set of solutions. A set is convex if, when proceeding in a straight line from one point in the set to another, you stay in that set. It helps to rewrite our problem in terms of linear algebra. We seek the most diverse solution to a set of linear equations,
where Ax=b represents our equations we’re trying to solve. Since x cannot be negative, we know that,
Is this convex? The shaded region in the following graph shows the x's for which the quantity to be minimized (PSR) is less than or equal to 1.
In the example in the graph above, there are only two elements in x. The possible values of x outside of the shaded region, where either x1 or x2 or both are negative, are not in the set of feasible solutions. So do the answers in this shaded region form a convex set? Yes: pick any two points in the shaded region and draw a straight line between them; the line has to stay in the shaded region.
Notice that if we restrict the values of the elements of x to the integers, the feasible points are exactly the integer points of this same convex region, so an integer program can still search it reliably. This result is satisfying because we'll want to distribute whole units of energy, currency, etc., anyway.
In the language of linear algebra,
where Z represents the set of integers. So the two constraints we’ve added to our convex optimization problem to find the most diverse solution:
Must not contain a negative distribution (x greater than or equal to 0)
Must have a base unit of distribution (x must contain integers)
We can now use an integer program to perform the optimization. Restricted to nonnegative integer values, when we minimize PSR, our program will march algorithmically, inexorably to the most diverse solution to a system of linear equations.
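Putting the pieces together, here is a sketch of the minimum-PSR integer program for the example system, using scipy's milp (one way to do it, not necessarily how the post's own solver works). Because the first equation pins sum(x) to 12, minimizing PSR reduces to minimizing the peak, handled with an auxiliary variable t:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# variables: x1..x4, t
c = np.array([0, 0, 0, 0, 1])                  # objective: minimize the peak t
A_eq = np.array([[1, 1, 1, 1, 0],              # x1 + x2 + x3 + x4 = 12
                 [0, 1, -1, 0, 0],             # x2 - x3 = 0
                 [0, 0, 0, 1, 0]])             # x4 = 0
eq = LinearConstraint(A_eq, [12, 0, 0], [12, 0, 0])

A_peak = np.hstack([np.eye(4), -np.ones((4, 1))])
peak = LinearConstraint(A_peak, -np.inf, 0)    # x_i - t <= 0, i.e. x_i <= t

res = milp(c, constraints=[eq, peak],
           integrality=[1, 1, 1, 1, 0],        # x integer, t continuous
           bounds=Bounds(lb=np.zeros(5)))      # x >= 0, t >= 0
x = res.x[:4]                                  # -> (4, 4, 4, 0), PSR = 1/3
```

Since the sum is fixed, the minimum-peak solution is exactly the minimum-PSR, most diverse solution (4,4,4,0).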
Zonoobi et al. "Gini Index as Sparsity Measure for Signal Reconstruction from Compressive Samples," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 5, Sept. 2011.
This post is a framework for diagnosing diseases with artificial intelligence. It draws its inspiration heavily from a transcript of Henry Cohen's excellent lecture in 1943, "The Nature, Method and Purpose of Diagnosis." I liked reading Cohen's lecture because it is clear and concise and seemed to fit well with an artificial intelligence approach to diagnosing diseases. My interest in this subject is in developing algorithms to make diagnoses.
There are no diseases, only disease.
This kind of sums the whole thing up and is a great place to start. This quote reinforces the idea that we’re not looking to exhaustively search innumerable avenues. We’re looking to find what, at its root, is bothering the patient.
Why artificial intelligence to diagnose diseases? The reason is to provide a consistently high quality of patient care, a quality of care that is repeatable and reliable. Cohen argues that one of the main problems with patient care is consistency. The same patient could get different diagnoses from different doctors. Doctors are human, after all, and clearly differ. How do doctors differ?
Knowledge of symptoms, signs, and syndromes of disease
Use of different labels
Creating and deploying an artificial-intelligence-based system with the same observational and interpretive abilities and a consistent taxonomy would relieve the confusion of conflicting medical advice.
The literature on using artificial intelligence to algorithmically make medical diagnoses is surprisingly timid. Usually, attempts at automatically diagnosing diseases are couched in words like "assist" and "consult" and rarely, if ever, take full responsibility for making the diagnosis. One author suggested a reason for this: trepidation about encroaching on the doctor-patient relationship. Usually, algorithmic diagnostic systems are tailored to diagnosing specific diseases. The task of this post is to outline a framework for diagnosing disease.
The first stage in the diagnosing of disease is the recognition of simple quantitative deviations from normal.
This sentence from Cohen provides a great start to our diagnostic engine. “First stage” implies that a simple state machine structure for the main diagnostic engine is appropriate, with the first stage seeking to gather information and compare it to what is considered normal for that patient.
The inputs to the observation stage are the patient's medical history, their testimony about why they are seeking diagnosis, and a physical examination. Processing the patient's testimony can exist on a spectrum, with one end anchored by using natural language processing (NLP) to parse and comprehend the patient's testimony and the other by letting the patient choose from a drop-down menu. The former is much more sophisticated and involved, but it more accurately answers the task: it allows the patient to consult the device when they don't know what is wrong, which makes the device infinitely more usable. The latter essentially boils down to a choose-your-own-adventure approach; it is trivial to implement but does not significantly move the needle beyond simple internet searches, leaving the responsibility of diagnosis, or of deciding whether or not to consult a doctor, in the hands of the patient. The NLP approach, or something close to it, is preferred.
The NLP lies outside of the main diagnostic engine so that different algorithms can be swapped in and out seamlessly. The diagnostic flow should not depend on the specifics of the NLP used to parse the patient’s testimony. The NLP outputs certain key words gleaned from the patient’s testimony ordered and weighted in terms of “importance.” Clearly, the NLP will need to know what the observation stage of the diagnostic engine considers important, but does not need to be embedded in the diagnostic engine.
The patient’s testimony, their medical record and any vital signs recorded are collected in the observation stage and passed to the next stage. The observation stage may and probably will order or perform certain biometric tests on the patient and wait for those results before proceeding to the next stage, Interpretation. For example, the observation stage may perform and pass on the results of an electrocardiogram (ECG) test based on certain watchwords figuring prominently in the patient’s testimony.
The observation stage looks at physical exam measurements and compares them to expected values given the patient's data from their medical records (age, height, weight, gender...). It also performs additional tests based on simple keyword matching from the patient's testimony. The observation stage writes back to the patient's file and passes everything onto the interpretation stage. The observation stage is like the nurse taking your vital signs, and the interpretation stage is like the doctor who comes to make a medical diagnosis.
The output of the observation stage is an up-to-date medical record of the patient. The ability to index the patient’s record temporally is important as this allows the Interpretation stage to analyze how a particular condition has changed over time.
The interpretation stage can ask the patient questions directly to obtain additional information. It is better to keep this ability directly in the interpretation stage instead of going back to the observation stage because this new information will probably be directly linked to branching in this stage of the diagnosis. To simplify things, the query/response format in the interpretation stage does not need to be as open-ended, from a linguistic standpoint, as in the observation stage. In the observation stage, natural language processing is needed because we may not be sure what we are trying to figure out yet, so we have to derive meaning from a very complex set of possibilities. However, in the interpretation stage, we are seeking specific, targeted information, so the response options should be limited. This fits well into a dropdown box, an example is below.
Have you felt chest pain?
The last three, “I don’t know”, “I don’t understand the question,” and “It’s complicated” allow the response/query routine to improve its chances for getting a useful answer out of the patient.
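A minimal sketch of such a closed-form query (the option names besides the three escape answers are hypothetical):

```python
# Every interpretation-stage question carries a fixed option list that
# always ends with the three escape answers, so the query/response routine
# can recover from an unhelpful reply.
ESCAPES = ["I don't know", "I don't understand the question", "It's complicated"]

def ask(question, options):
    """Build a closed-form (drop-down) query for the interpretation stage."""
    return {"question": question, "options": options + ESCAPES}

q = ask("Have you felt chest pain?", ["Yes", "No", "Sometimes"])
```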
The AI for medical diagnosis will need to reason anatomically, that is, it will have to move from one part of the body to the other in search for interpretations that fit the existing data. Cohen considered the “fundamental tripod of medicine” to be anatomy, physiology and pathology. Of these, anatomy lends itself well to being described as a connectivity graph. The AI could have different graphs for different systems in the body such as circulatory, respiratory, endocrine, …. each describing how different parts of the body are connected together. A simple 1, 0 (connected, not connected) would probably do as the AI is simply looking for what to try next, that is, once it traverses the graph from, say, the heart to the liver, it is using “liver” as a keyword to lookup potential next steps.
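As a sketch of this lookup (the organ names and edges below are illustrative placeholders, not a medical model), the per-system connectivity graph can be as simple as an adjacency dictionary:

```python
# One 0/1-style connectivity graph per body system; an edge just means
# "directly connected," used only to pick the next keyword to look up.
circulatory = {
    "heart": ["lungs", "aorta"],
    "aorta": ["liver", "kidneys"],
    "lungs": ["heart"],
    "liver": [],
    "kidneys": [],
}

def next_candidates(graph, organ):
    """Return the organs directly connected to `organ`: the next keywords
    the interpretation stage would look up."""
    return graph.get(organ, [])
```

Traversing from, say, the heart to the liver then hands "liver" back to the keyword lookup, as described above.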
What about a Bayesian approach to interpretation? I would stay away from it because it relies on "models that are subjective, and the resulting inference depends greatly on the model selected." We are seeking a framework that can be used for diagnosing a wide range of diseases, not tuned to specific diseases. The framework must be general and its reasoning mathematical. The reasoning itself cannot have a subjective foundation.
The output of the interpretation stage is a provisional medical decision about which steps to take next. If the algorithm does not have enough information to make a decision, it does not need to do so. It can order more tests or suggest a therapy to alleviate the condition and have the patient report back. When the patient reports back, they start again in the observation phase.
Symbolization, Corrective Action and Evaluating an Algorithm
Even after the diagnostic engine reaches the heart of the matter, what’s wrong with the patient, there is still much more work to do. First it must encode the diagnosis in a manner that will allow it to treat the same disease the same way, every time, from patient to patient. The Oxford American Dictionary defines syndrome as
syndrome n. 1) a group of concurrent symptoms of a disease
If the list of syndromes for a disease is complete enough, it will uniquely identify the disease. Cohen assesses syndromes by the site, functional disturbances, and cause of disease. This should be enough information to universally encode the disease. Notice we have not included any prescriptive remedy in the encoding, as this will vary from patient to patient; patients with the same disease at the same site may need different courses of action based on age, gender, and so on.
Second, we must figure out the cause of the disease and its implications.
Too frequently we have been content with a diagnostic label without investigating its implications.
Causation implies a search for antecedents, and not for the ultimate — the final — cause of all things. This means not a single antecedent or even a chain of antecedents, but a whole interlacing network of them.
This points directly to graph theory for reasoning through the causes and implications of the disease. Somehow we'd need to map corporal function to a manifold and be able to traverse it. This is significantly more complicated than the simple graph traversal in the interpretation stage; there, we are simply seeking clues to help us along our decision-making tree. We've already mapped the several systems of the body (circulatory, skeletal, respiratory) to graphs and are simply looking them up. In this stage, we'd likely need to do the mapping on the fly, based on what we figured out from the previous stage.
In addition to affixing a label to the diagnosis, the output of this stage is to recommend a corrective action.
The main aim of diagnosis, that of providing the rational basis for treatment and prognosis…
The main implementation decision to make here is whether to spend more time and energy investigating causes and implications, keeping the treatment recommendation and prognosis estimate simpler, or vice versa. For instance, if the algorithm is good at figuring out causation and implication, maybe the treatment and prognosis can be a simple lookup table. If causation/implication is simple, then we'll want to do something more sophisticated for treatment/prognosis. I prefer the former. Because causation/implication and prognosis/treatment are so tightly coupled, I consider them part of the same stage of diagnosis, even though they may use separate artificial intelligence approaches.
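The "simple lookup table" end of that trade-off might look like the sketch below: once causation and implication have been worked out upstream, treatment and prognosis reduce to a table lookup with a safe fallback. The entries are placeholders, not medical advice.

```python
# Hypothetical treatment/prognosis lookup, keyed on the upstream
# diagnosis. All entries are illustrative placeholders.
treatment_table = {
    ("essential hypertension", "mild"):   ("lifestyle changes", "good"),
    ("essential hypertension", "severe"): ("antihypertensives", "guarded"),
    ("diabetes mellitus", "type 2"):      ("metformin, diet", "good"),
}

def recommend(diagnosis, qualifier):
    """Return (treatment, prognosis), falling back to a referral."""
    return treatment_table.get((diagnosis, qualifier),
                               ("refer to physician", "unknown"))

print(recommend("diabetes mellitus", "type 2"))  # ('metformin, diet', 'good')
print(recommend("rare condition", "unknown"))    # ('refer to physician', 'unknown')
```

The fallback entry matters: when the causation stage produces something outside the table, the safe default is to hand the case to a human.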
Evaluating a Medical Diagnosis Algorithm
‘…common sense pressed for time accepts and acts on acceptance.’ We physicians are often confronted by a situation in which we have to give a provisional verdict on the admittedly inadequate available evidence.
For any algorithm, execution time will be crucial. The algorithm will have to provide feedback quickly to the patient, even if it does not have a final diagnosis. The user experience aspect of keeping the patient informed as the algorithm works through its reasoning will help the patient become comfortable with seeking diagnosis from a machine, as opposed to a human doctor. Time is of the essence; in fact, the algorithm should not get bogged down spending clock cycles on getting every corner case right when it could reach the most common conclusions quickly.
The New England Journal of Medicine published the results of case records presented to clinicians and discussant groups and whether or not they were able to correctly diagnose the disease in the case studies. Of the 43 case studies, individual clinicians were correct 65% of the time and the discussant groups were correct 80% of the time. The study asked the participants to assess the confidence level of their diagnosis as well.
It seems that being right 80% of the time is good enough; at least it was the state of the art in 1969. That's another important thing to remember when testing medical diagnostic AI: what are we comparing it against? If medical diagnostic AI can approach an 80% success rate, then it will be a viable alternative to seeing a doctor. Even at roughly a 50% success rate, it's a reasonable alternative. For some purposes, such as the Tricorder X-Prize, this should be good enough, since the potential use of the Tricorder X-Prize device would be to help people decide whether they should go see a doctor.
 Cohen, Henry. “The Nature, Method and Purpose of Diagnosis,” The Skinner Lecture, 1943. Cambridge, UK: University Press, 1943.
 Hogg, Robert and Allen Craig. Introduction to Mathematical Statistics. 5th ed, NJ: Prentice-Hall, 1995.
 Case Records of the Massachusetts General Hospital (Case 30-1969). New England Journal of Medicine. 1969; 281: 206-213.
The Tricorder X-Prize is a $10M competition to foster innovation in medical diagnostics. The goal of the competition is to create a medical device that can diagnose 15 diseases. The competition guidelines do not state which 15 diseases the tricorder will need to diagnose; however, it seems the guidelines will be refined in September. Maybe the diseases will be announced at that point. Maybe they won't be announced before the competition at all. For now, it's fun to speculate which diseases should be included.
In 2008, the Centers for Disease Control and Prevention did a survey of ambulatory care in the US. They summarized the most prevalent diagnoses at office visits for nearly a million participants. The most common of all diagnoses was essential hypertension. The fourth most common diagnosis was diabetes mellitus. Each of these medical conditions has a fairly well-understood decision tree for diagnosis.
Table 1: Primary Diagnosis Groups from NAMCS 2008 Survey 
My understanding, and I am not a doctor, is that hypertension is diagnosed primarily with a high blood pressure reading. You do have to make sure that the reading is repeatable and not primarily influenced by external factors, such as the presence of a doctor. Overall, it sounds like diagnosing hypertension boils down to getting consistently high blood pressure readings for the patient's profile (gender, age, etc.). Blood pressure is not difficult to measure non-invasively; you see blood pressure monitoring machines in grocery stores. The main design consideration for the Tricorder competition would be: is there an even less intrusive way to do it, one that does not require the patient to strap a band around themselves? Even using a traditional approach, for the price of a blood pressure monitor, a device could diagnose nearly 5% of all office visits in the US.
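The "consistently high readings" rule could be sketched as below. The 140/90 mmHg cutoff is the commonly cited adult threshold for hypertension; the requirement of three repeat readings is my assumption, standing in for whatever repeatability rule a real device would use.

```python
# Minimal sketch of a repeatable-high-blood-pressure check.
# 140/90 mmHg is the commonly cited adult cutoff; the three-reading
# repeatability rule is an assumption for illustration.
def consistently_high(readings, systolic_limit=140, diastolic_limit=90,
                      required_repeats=3):
    """readings: list of (systolic, diastolic) pairs from separate visits."""
    high = [s >= systolic_limit or d >= diastolic_limit
            for s, d in readings]
    return len(readings) >= required_repeats and all(high)

print(consistently_high([(150, 95), (148, 92), (152, 96)]))  # True
print(consistently_high([(150, 95), (118, 76), (152, 96)]))  # False: one normal reading
```

A real decision tree would also adjust the thresholds for the patient's profile (age, gender and so on), which here would just mean passing different limit parameters.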
Diabetes mellitus is #5, with 2.7% of office visit diagnoses. Again, my understanding is that the decision tree is pretty simple: blood glucose readings outside of the norm for a patient's profile. However, blood glucose is traditionally measured very invasively, by taking a small blood sample. While the Tricorder X-Prize guidelines do not rule out devices that use invasive techniques, they strongly encourage noninvasive ones. In fact, a medical doctor on our board at Chesney Research described noninvasive blood glucose monitoring to me as one of the "holy grails" of medical device technology. Since one of the stated goals of the competition is to drive sensor technology, I think diagnosing diabetes has to be one of the diseases in the competition.
Another holy grail is characterizing bacterial versus viral upper respiratory tract infection. This disease is the third most prevalent diagnosis in office visits, according to the NAMCS survey. Right now, there's no real way to tell the difference other than waiting; bacterial infections tend to last 7-10 days and viral only 2. However, the course of treatment is very different for each: antibiotics for the bacterial infection, but not for the viral one, since viruses do not respond to antibiotics.
Further down the list is heart disease of the non-ischemic variety, that is, not caused by restricted blood supply. Heart disease is a pretty broad category. However, there are analog integrated circuits on the market aimed at measuring electrocardiogram (ECG) signals. For the price of this chip (typically around $20) and the appropriate interface with the patient, a medical device could take a big step towards diagnosing heart disease. There is also a wealth of information on the links between heart disease and hypertension, and between heart disease and diabetes. With an ECG, a blood pressure monitor, a glucose meter and some fancy AI, a team may be well on its way to gobbling up a significant portion of heart disease diagnoses. In fact, those three (hypertension, diabetes and heart disease) would get you nearly one out of every ten (9.24%) of all office visit diagnoses.
If we look at the NAMCS top 20 again and take out routine follow-ups, checkups and pregnancy, we are left with 14 diseases. They are given in Table 2. They accounted for nearly one third of all office visits in 2008.
Table 2: Top 14 Diseases, Including Chronic Conditions from NAMCS 2008 Survey Data
The competition's 15 diseases will need to be diagnosed on 30 different patients, and the Tricorder will be evaluated for its effectiveness and ease of use by a panel of judges. The device should be able to tell patients whether or not they need to go see a doctor. These 14 diseases are a good place to start.
A couple of weeks ago, Qualcomm announced they are sponsoring an X-Prize to come up with a health care device. The prize is $10M and the device must diagnose 15 diseases, though it doesn't say which ones. The goal is to make a consumer device, dubbed a tricorder, that people can use in their homes to provide medical care, which requires advances in sensor technology, medical diagnostics and artificial intelligence, among other things, according to the site.
In reading the overview, there is a section about the potential trade-offs a design team in the competition will have to make. One of those mentioned is the placement of the AI engine. I think a more important concern is what kind of artificial intelligence would be in the device and how it would interact with the sensors and the user. From that, the best placement of the AI engine would probably become pretty obvious.
The authors of the overview are right to draw attention to the AI in the tricorder, as diagnosis seems to be the true intent of the device. The emphasis is not so much on the accuracy and resolution of the sensors themselves; their resolution will be determined by what is just good enough to make an accurate diagnosis. Instead, the real focus of the competition is the diagnosis, the intelligence, and this is ripe for machine learning.
One of the most important machine learning functions in the device will be its natural language processing. Telling the doctor your symptoms is still a major aspect of any patient's experience with any medical condition. We're not yet at the stage where people can just submit to a series of measurements and get a diagnosis. From the standpoints of both the patient's comfort level with the device and our own understanding of medicine, any effective diagnosis machine will have to be able to understand a person's description of why they are consulting the device for medical help in the first place. A major aspect of any medical diagnosis is how the patient feels, how much pain they are experiencing and how the condition for which they are seeking help is affecting them. Without a sensor that can accurately measure pain, we have to rely on the patient's words.
The effective tricorder will have to navigate the language as well as incorporate information from sensors to arrive at an effective diagnosis. I think this is the first consideration when designing the tricorder: how will it parse the patient's description of the problem, then blend in the information from the sensors to complete the story?
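A deliberately simple sketch of that language step: map free-text complaints onto a controlled symptom vocabulary, and let sensor data fill in the rest. The vocabulary below is a toy stand-in for real medical NLP, and the phrases are invented for illustration.

```python
# Toy symptom extraction: match free text against a small controlled
# vocabulary. A real system would use proper medical NLP.
SYMPTOM_VOCAB = {
    "chest pain": ["chest pain", "chest hurts", "pain in my chest"],
    "shortness of breath": ["short of breath", "can't breathe",
                            "shortness of breath"],
    "fatigue": ["tired", "fatigue", "exhausted"],
}

def extract_symptoms(text):
    """Return the set of vocabulary symptoms mentioned in `text`."""
    text = text.lower()
    return {symptom
            for symptom, phrases in SYMPTOM_VOCAB.items()
            if any(p in text for p in phrases)}

found = extract_symptoms("I've been tired all week and my chest hurts.")
print(found)  # contains 'fatigue' and 'chest pain'
```

The extracted set would then be merged with sensor readings (ECG, blood pressure, glucose) as inputs to the diagnostic reasoning described above.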
Over at the Chesney Research website, I posted an introduction to compressive sensing. There’s already lots of good information out there on what is compressive sensing. This introduction is a little different because it gives an overview of compressive sensing without getting too far into the math and stays focused on why you would want to use it.
I was thinking of the immutability of eigenvectors and the immutability of certain vectors when projected and realized these two qualities are one and the same. Eigenvectors can be viewed and explained in terms of a projection matrix, which may be a more intuitive and easier way to understand eigenvectors than the way they are commonly taught. Certainly it relies on much less math: only the concept of rows or columns as vectors and a basic understanding of the fundamental and canonical equation that defines eigenvectors and eigenvalues.
Projecting one vector onto another gives a scalar length of the first vector on the second. It is the amount that the second vector “contributes” to the first. This contribution is easily calculated by taking the dot product.
In the above example, the vectors are given as column vectors, and since their dot product must result in a scalar quantity, it is calculated as aᵀb.
The projection of b onto a creates another vector, let’s call it p, that is in the same direction as a, but whose length is determined by the length and direction of b.
Pa is a rank-1 projection matrix, which makes sense because we are projecting onto the one-dimensional subspace spanned by a. The projection matrix, Pa, is a transform that projects a vector onto a. The result of this transform is a scaled version of the vector a.
The lambda parameter gives the amount of scaling on a. The answer gives how much of a is in b, sometimes referred to as the component of b in the direction of a.
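These formulas are easy to check numerically. Assuming the standard expressions lambda = (aᵀb)/(aᵀa) and Pa = aaᵀ/(aᵀa), the sketch below verifies that the projection matrix, applied to b, reproduces lambda times a. The vectors are arbitrary examples.

```python
import numpy as np

# Projection of b onto a, using the standard rank-1 projection formulas.
a = np.array([3.0, 4.0])
b = np.array([5.0, 1.0])

lam = (a @ b) / (a @ a)          # scalar: how much of a is in b
p = lam * a                      # the projection vector, parallel to a
P_a = np.outer(a, a) / (a @ a)   # the rank-1 projection matrix

print(np.allclose(P_a @ b, p))         # True: P_a b equals lambda * a
print(np.linalg.matrix_rank(P_a))      # 1
print(np.allclose(P_a @ P_a, P_a))     # True: projecting twice changes nothing
```

The last check, Pa² = Pa, is the defining idempotence of a projection: once a vector has been moved onto a, projecting again leaves it where it is.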
Projection is a way of moving one vector onto another, that is, the projection changes the direction of a vector to line up with another. More generally, an n-by-n matrix is a transform that changes an n-element vector into another n-element vector, possibly pointing in a different direction.
The vectors that are able to survive this transform unmoved are called eigenvectors. They may be scaled differently as a result of the transform, but they are pointing in their same original direction before the transform was applied. The amount that the transform scales the vector is the eigenvalue.
This is the familiar linear algebra definition of eigenvalues and eigenvectors: Ax = λx. The vector x is an eigenvector of A if this equation holds for some nonzero x. Lambda gives the eigenvalue for that eigenvector, x.
Numerically, eigenvectors and eigenvalues look like this: [1,3;4,5][1;2] = [7;14] = 7[1;2], where [1;2] is an eigenvector of [1,3;4,5] with eigenvalue 7. The vector [1;2] survives the transform pointed in the same direction, but with a different scale value (7).
Some vectors do not survive the transform without a change in direction. For instance, [1,3;4,5][1;0] = [1;4], which is not a scalar multiple of [1;0].
The vectors that are able to survive this transform unmoved, i.e. pointed in the same direction, are the eigenvectors of that transform. The eigenvalues are the scale values resulting from the transform.
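The worked example above can be verified in a few lines, along with a direct eigendecomposition of the same matrix.

```python
import numpy as np

# Verify the worked example: [1; 2] is an eigenvector of
# [1, 3; 4, 5] with eigenvalue 7, while [1; 0] is not.
A = np.array([[1.0, 3.0],
              [4.0, 5.0]])
x = np.array([1.0, 2.0])

print(A @ x)                        # [ 7. 14.], i.e. 7 * x: x survives unmoved
print(np.allclose(A @ x, 7 * x))    # True

y = np.array([1.0, 0.0])
print(A @ y)                        # [1. 4.], not a scalar multiple of y

# numpy can also find all eigenpairs directly:
vals, vecs = np.linalg.eig(A)
print(np.isclose(vals, 7).any())    # True: one eigenvalue is 7
```

The eigendecomposition also turns up a second eigenvalue, -1 (the trace of A is 6 and its determinant is -7, so the eigenvalues solve λ² - 6λ - 7 = 0), showing the transform has exactly two directions it leaves unmoved.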
 Strang, Gilbert. Introduction to Linear Algebra. Wellesley, MA: Cambridge-Wellesley Press, 2009.