Tag: machine learning
January 29th, 2012
A couple of weeks ago, Qualcomm announced that it is sponsoring an X-prize to come up with a health care device. The prize is $10 million and the device must diagnose 15 diseases, though it doesn’t say which ones. According to the site, the goal is to make a consumer device, dubbed a tricorder, that people can use in their homes to provide medical care, which requires advances in sensor technology, medical diagnostics and artificial intelligence, among other things.
The overview has a section about the potential trade-offs a design team in the competition will have to make. One of those mentioned is the placement of the AI engine. I think a more important concern is what kind of artificial intelligence would be in the device and how it would interact with the sensors and the user. From that, the best placement of the AI engine would probably become pretty obvious.
The authors of the overview are right to draw attention to the AI in the tricorder, as diagnosis seems to be the true intent of the device. The emphasis is not so much on the accuracy and resolution of the sensors themselves; their resolution will be determined by what is just good enough to make an accurate diagnosis. Instead, the real focus of the competition is on the diagnosis, the intelligence, and this is ripe for machine learning.
One of the most important machine learning functions in the device will be its natural language processing. Telling the doctor your symptoms is still a major aspect of any patient’s experience with any medical condition. We’re not yet at the stage where people can just submit to a series of measurements and get a diagnosis. From the standpoint of both the patient’s comfort level with the device and our own understanding of medicine, any effective diagnosis machine will have to be able to understand a person’s description of why they are consulting the device for medical help in the first place. A major aspect of any medical diagnosis is how the patient feels, how much pain they are experiencing and how the condition for which they are seeking help is affecting them. Without a sensor that can accurately measure pain, we have to rely on the patient’s words.
An effective tricorder will have to navigate language as well as incorporate information from sensors to arrive at a diagnosis. I think this is the first consideration when designing the tricorder: how it will parse the patient’s description of the problem, and then how it will blend in the information from the sensors to complete the story.
Tags: AI, diseases, machine learning, natural language processing, sensors
Posted in Uncategorized | Comments Off
February 25th, 2011
I’ve learned a few new terms this week. One is online astroturfing, which is the practice of creating a lot of fake online personas and using them to support a certain cause. I learned of it by way of George Monbiot’s blog. George points to the discovery by the Daily Kos of the recent solicitation by the US Air Force for persona management software. These personas are created to build support for or against certain things online by parroting essentially the same point of view in comment threads, discussion groups and so on, and by generating entire online identities to lend credence to those “opinions.”
Another term I learned is churnalism, which is the practice of posting a press release as a news article, masquerading as original journalism. I learned of it by way of the Media Standards Trust. The churnalism web site lets a user paste an article into a window and check whether it matches up with press releases in its database. Their site compresses the text and compares it to a database of other compressed texts to check for similarity, or more accurately, for instances of exact matches.
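To make the exact-match idea concrete, here is a minimal Python sketch. It is not the churnalism site's actual algorithm, which I haven't seen. I'm assuming a simple scheme in which each text is reduced to a set of hashes of overlapping five-word runs, and the function names (`shingle_hashes`, `overlap`) and the window width are my own illustrative choices.

```python
import hashlib
import re


def shingle_hashes(text, width=5):
    """Return the set of hashes of every run of `width` consecutive words.

    Assumption: hashing fixed-width word windows stands in for whatever
    compression the real site uses.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {
        hashlib.md5(" ".join(words[i:i + width]).encode("utf-8")).hexdigest()
        for i in range(len(words) - width + 1)
    }


def overlap(article, press_release, width=5):
    """Fraction of the article's word windows that appear verbatim in the press release."""
    article_hashes = shingle_hashes(article, width)
    release_hashes = shingle_hashes(press_release, width)
    if not article_hashes:
        return 0.0
    return len(article_hashes & release_hashes) / len(article_hashes)


press_release = ("Acme Corp today announced a revolutionary new widget "
                 "that will change the world")
article = ("Acme Corp today announced a revolutionary new widget "
           "according to a statement")
print(overlap(article, press_release))
```

Because only verbatim five-word runs count, changing a single word in the middle of a copied passage removes up to five matching windows at once, which is why this kind of check is best at catching straight copy-and-paste.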
What these two new terms have in common is the desire to recognize original content in the presence of artificially repeated content. We humans are easily persuaded when there appear to be a lot of people saying essentially the same thing, an appearance that is fairly straightforward to create online. What would help us negotiate a world of churnalism and online astroturfing is the ability to collapse content. In other words, we need to be able to view a body of texts as composed of several themes and, if we choose, view the themes fairly and individually, regardless of how often they are repeated.
Natural language processing (NLP) can help. By using machine learning to programmatically make sense of human text and speech, we could develop algorithms to collapse similar content together, so that each original idea, no matter how unpopular, stands out on its own. If we perform this kind of filtering on our content before we see it, we could avoid our innately human tendency to lend more credence to the idea being supported by more people.
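As a rough illustration of what collapsing could look like, here is a small Python sketch. It is deliberately naive: I'm assuming that word overlap (Jaccard similarity on bags of words) is a good enough stand-in for "saying the same thing," which it is not in general, and the names `collapse`, `jaccard` and the 0.6 threshold are arbitrary choices of mine.

```python
import re


def tokens(text):
    """Lowercased set of words in the text (punctuation ignored)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def collapse(texts, threshold=0.6):
    """Greedy single-pass grouping.

    Each text joins the first group whose representative (its first member)
    is at least `threshold` similar; otherwise it starts a new group.
    """
    groups = []  # list of (representative_tokens, member_texts)
    for text in texts:
        toks = tokens(text)
        for representative, members in groups:
            if jaccard(representative, toks) >= threshold:
                members.append(text)
                break
        else:
            groups.append((toks, [text]))
    return [members for _, members in groups]


comments = [
    "This policy is a disaster for our town",
    "this policy is a disaster for our town!!",
    "I think the new bike lanes are great",
]
for group in collapse(comments):
    print(len(group), "x", group[0])
```

A reader would then see one line per group, with the repeat count available but not dominating the page.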
The idea of using people vs. machines to sift through content has certainly come up previously. The two methods were probably most directly pitted against one another as news consumers evaluated Yahoo’s and Google’s news aggregators in the mid- to late 2000s. Yahoo used (still uses?) human readers to decide which news stories most readers would like to see. Google News uses an algorithm to determine which stories are relevant and also to cluster similar stories together.
So how does Google do it? How do they decide which stories are similar? The speculation is that they may be using stopwords, which are very common words such as “the” and “a”. As Greg Linden points out, the seminal paper on using stopwords comes out of Stanford, and it develops an algorithm for finding matching signatures in a large corpus of data, such as a web crawl. Greg gives an example of it. Say you find “a weeklong campaign” in a page: the “a” is the stopword and a-weeklong-campaign is the stopword signature. Texts that share this stopword signature (and others) are likely to be talking about the same thing.
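Here is a small Python sketch of the stopword-signature idea as described in the example above. It is my simplified reading, not the Stanford paper's exact algorithm and certainly not whatever Google actually runs. The tiny stopword list, the choice of two following words per signature, and the function names are all assumptions made for illustration.

```python
import re

# Tiny illustrative stopword list; a real system would use a much longer one.
STOPWORDS = {"a", "an", "the", "is", "of", "to", "in", "for"}


def stopword_signatures(text, chain=2):
    """Build one signature per stopword: the stopword plus the next
    `chain` non-stopword words, joined with hyphens."""
    words = re.findall(r"[a-z0-9-]+", text.lower())
    signatures = set()
    for i, word in enumerate(words):
        if word not in STOPWORDS:
            continue
        following = [w for w in words[i + 1:] if w not in STOPWORDS][:chain]
        if len(following) == chain:
            signatures.add("-".join([word] + following))
    return signatures


def signature_overlap(text_a, text_b):
    """Jaccard similarity of the two texts' signature sets."""
    sigs_a = stopword_signatures(text_a)
    sigs_b = stopword_signatures(text_b)
    union = sigs_a | sigs_b
    return len(sigs_a & sigs_b) / len(union) if union else 0.0


print(stopword_signatures("a weeklong campaign"))
print(signature_overlap(
    "The senator began a weeklong campaign in the state capital",
    "Reporters followed a weeklong campaign in the state capital",
))
```

In this sketch, "a week-long campaign" produces the signature a-week-long-campaign rather than a-weeklong-campaign, so even a one-character spelling difference keeps two otherwise identical phrases from matching.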
Unfortunately, this type of approach may not help much when trying to spot a generated online persona from blog posts, comments and tweets, for a couple of reasons. For a content-collapsing application, it’s not good enough to know that multiple texts are discussing the same subject; instead we need to determine if they are espousing essentially the same point of view. Another reason this technique might not work for content collapsing is that it appears to be vulnerable to trivial substitutions, such as minor spelling differences (“week-long” vs. “weeklong”) or synonym substitution.
The churnalism algorithm appears to have a similar vulnerability: reliance on exact matches to determine similarity. When I say vulnerability, I’m talking about repurposing these algorithms for spotting generated identities online or trivially rephrased news articles. I’m not talking about the intended purpose of these algorithms, which they are probably pretty good at achieving. I do believe these algorithms are a first step towards collapsing the content of large corpora together, something that would be immensely useful to me personally in dealing with the information deluge. However, to collapse content in a way that removes or mitigates the bias formed from artificially multiplied consensus, we’ll need to dive deeper into natural language processing and develop more sophisticated algorithms. I believe the promise is there, to stay one step ahead of the automatically generated content, but it is not realized yet.
Tags: algorithm, astroturf, churnalism, clustering, content collapsing, corpus, machine learning, natural language processing, persona, stopwords