Finding the Chorus of a Song Using Auto-correlation

March 9th, 2011

Correlation is useful for finding commonality between two signals.  This commonality can also be considered redundancy.

The auto-correlation is defined as,

For a message, the auto-correlation sequence describes how well a message matches up with a shifted version of itself.  When the message is not shifted at all, it matches up very well with itself.  For non-repetitive, random or otherwise unpredictable messages, there is little in common with a shifted version of itself.  The auto-correlation for a random message of 15 symbols is shown in the following plot.

 

 

 

 

 

 

 

 

 

The auto-correlation can be generalized to the cross-correlation between two messages.

Both are represented by a sequence of values of the number of matches at each shift position.  For the purpose of calculation, shifts are barrel shifts, meaning the part that is shifted off the end is prepended back to the beginning.

A plot of cross-correlation between two random messages follows.  Notice that even if the two messages each don’t match up very well when compared with a shifted version of  themselves, they may still correlate when compared with each other at the appropriate shift amount.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Under an appropriate signaling alphabet (i.e. closed, finite set of unique symbols), the atomic portion of the autocorrelation calculation, x[n] * x[n+i] is simply a compare between two symbols at different points in the message.

Some example Python code for the correlation function is given below.  Rxy is the calculated cross-correlation sequence between messages a and b, each length N.

Notice the nested loops.  There may be a better way of doing this in Python, a better way to code it or a better solution, such as using lower-level routines like Numpy does, but this illustrates what makes correlation computationally expensive.  For two sequences, each N long, it takes N2 operations to calculate the cross-correlation between the two.  This is the drawback of using a serial machine, which for all intents and purposes is how most processors approach it.  You would probably enjoy a significant speed up by using a highly-parallelized machine like a graphics processor, at the expense of some very low-level coding to trick it into calculating correlation for you, something it was probably not designed to do.

Even though it can take a while to calculate for long sequences, the auto-correlation function can be very useful, especially when analyzing a message of symbols.  In a graph of the correlation function, the peaks are instances of repetition.  Repetition occurs, for instance in the chorus of a song.  Consider an entire song a message — a sequences of symbols.  Consider only the lyrics to the song, stripping it entirely of its musical content.  Each word of the song is a symbol; the specific order and collection of these symbols uniquely identifies the lyrics of the song.

Take, for example, the lyrics to Michael Jackson’s 1982 smash hit “Beat It.”  The start of the chorus is recognizable by the oft-repeated title of the song.  However, this is not the entire chorus.  Can we find the chorus of the using auto-correlation?  In fact we can; the autocorrelation function does a good job of finding the chorus in this example.  The compare part of the auto-correlation calculation is a simple string compare: do the two words being compared match exactly or not?

Here’s the algorithm:
  1. Flatten the song into a sequence of words.
  2. Calculate the auto-correlation for the sequence
  3. Find the index (not 0) of the auto-correlation that has the biggest value.
  4. Find the longest run of consecutive matches for the index found in step 3.
  5. Therein lies your chorus!

For “Beat It,” I plotted the auto-correlation.

 

The auto-correlation peaks at a shift of 28 (Rxx[28] = 0.184343434343).  There are several words that match when you take the lyrics of “Beat It,” shift them by 28 words and compare it to an unshifted version of itself.  Several of them are stray individual words matching.  However, when I looked at position 278, there is a long string of consecutive matches

In fact, it keeps going to index 333.  The auto-correlation method finds two consecutive instances of the chorus, which is:

Just beat it! Beat it!
No one wants to be defeated
Show them how funky and strong is your fight
It doesn’t matter who’s wrong or right

Notice that it finds the whole chorus, not just the oft-repeated “Beat” or “Beat It.”  Other methods might focus on the most frequently appearing word or diagram (two words together) and miss the rest of the chorus, which actually contains much more thematic content.  In this example, the most oft-repeated phrase, “Beat It,” is just part of the reason for the high-degree of correlation.  The rest of the chorus, probably the less easily remembered (but equally important) part, “No one wants to be defeated…. who’s wrong or right,” matches as well when the message is shifted by the appropriate amount (28 words) and compared with itself.  With other methods it can be difficult to tell that “No one wants to be defeated…” is part of the chorus and not the next verse.

Tags: , , , , ,
Posted in Uncategorized | 6 Comments »

Collapsing Content

February 25th, 2011

I’ve learned a few new terms this week.  One is online astroturfing, which is the practice of creating a lot of fake online personas and using them to support a certain cause.  I learned of it by way of George Monbiot’s blog.  George points to the discovery by the Daily Kos of the recent solicitation by the US Air Force for persona management software.  These personas are created to build support for/against certain things online by parroting essentially the same point of view in comment threads, discussion groups, etc…, and generating entire online identities to lend credence to those “opinions.”

Another term I learned is churnalism, which is the practice of posting a press release as a news article, masquerading as original journalism — by way of Media Standard Trust.  The churnalism web site lets a user paste an article into a window and check to see if it matches up with press releases in its database.  Their site compresses the text and compares it to a database of other compressed texts to check for similarity, or more accurately, instances of exact matches.

What’s common among these two new terms is the desire to recognize original content in the presence of artificially repeated content.  We humans are easily persuaded when there appears to be a lot of people saying essentially the same thing, which is fairly straightforward to do online.  What would help us negotiate a world of churnalism and online astroturfing is the ability to collapse content.  In other words, we need to be able to view a body of texts as comprised of several themes and view the themes fairly and individually, regardless of how often they are repeated, if we choose.

Natural language processing (NLP) can help.  By using machine learning to programmatically make sense of human text and speech we could develop algorithms to collapse similar content together, leaving only original ideas — no matter how unpopular — standing out.  If we perform this kind of filtering on our content before we see it, we could avoid our innately human tendency to lend more credence to the idea being supported by more people.

The idea of using people vs machines to sift through content has certainly come up previously.  The two methods were probably most directly pitted against one another as news consumers evaluated Yahoo’s and Google’s news aggregators in the mid- to late-2000′s.  Yahoo used (still uses?) human readers to decide which news stories most readers would like to see.  Google News uses an algorithm to determine which stories are relevant and also to cluster similar stories together.

So how does Google do it?  How do they decide which stories are similar?  The speculation is that they may be using stopwords — these are very common words, such as “the” and “a”.  As Greg Linden points out, the seminal paper on using stopwords comes out of Stanford and it develops an algorithm for finding matching signatures in a large corpus of data, such as a web crawl.  Greg gives an example of it.  Say you find “a weeklong campaign” in a page, the “a” is the stopword and a-weeklong-campaign is the stopword signature.  Texts that have both of these stopword signatures (and others) are likely to be talking about the same thing.

Unfortunately, this type of approach may not help much when trying to spot a generated online persona from blog posts, comments and tweets, for a couple of reasons.  For content collapsing application, its not good enough to know that multiple texts are discussing the same subject; instead we need to determine if they are espousing the essentially same point of view.  Another reason this technique might not work for content collapsing is that it appears to be vulnerable to trivial substitutions, such as minor spelling differences (“week-long” vs. “weeklong”) or synonym substitution.

The churnalism algorithm appears to have a similar vulnerability — reliance on exact matches to determine similarity.  When I say vulnerability, I’m talking about repurposing these algorithms for spotting generated identities online or trivially-rephrased news articles — I’m not talking about the intended purpose of these algorithms, which they are probably pretty good at achieving.  I do believe these algorithms are a first step towards collapsing content of large corpora together, something that would be immensely useful to me personally in dealing with information deluge.  However, to collapse content in a way that removes or mitigates the bias formed from artificially-multiplied consensus, we’ll need to dive deeper into natural language processing and develop more sophisticated algorithms.  I believe the promise is there — to stay one step ahead of the automatically-generated content — but not realized yet.

 

Tags: , , , , , , , , ,
Posted in Uncategorized | Comments Off

Clean Rooms

February 2nd, 2011

The other day I was talking to a friend who works at a large general contractor — these are the companies that manage major construction projects like ports, schools, and hospitals.  He had just finished managing the construction of a new wing of a hospital.  We were talking about that and talking about his business, in general, and he reminded me that they also manage construction projects for semiconductor companies, the kinds of projects that would be involved in opening new fabrication facilities or expanding existing ones.

I commented that I thought that was a strange mix of business: hospitals and semiconductor fabs.  My friend explained to me that it really isn’t — a lot of the skills needed for each are actually complementary, such as the need to be able to build facilities that are exceptionally clean, to keep them that way and to show that they are clean.

When I say clean, I mean free of particles and dust.  Wikipedia provides a good definition of a clean room as,

an environment … that has a low-level of pollutants such as dust, airborne microbes, aerosol particles and chemical vapors.

A room free of airborne microbes can significantly reduce the risk of hospital-borne infections.  A room free of dust and chemical vapors is a must when etching microscopic features into silicon, usually with the help of a “wet” acid, to form electrical circuits on a semiconductor substrate — microchips are made this way.

You can never keep a room of any appreciable size and utility completely free of dust, particles and vapors, so fortunately, there are ways of measuring a room’s cleanliness by counting the number of particles in a given volume of air.  The ISO standard (14644-1) has different strata of clean room classification.  For instance, ISO-9 is normal room air, which is considered to be 35 million particles per cubic meter.  What is considered a particle?  Anything bigger than half a micron, which is 5/10,000th of a millimeter.  For particles of this size, the ISO standard decrease in number to ISO-2, which is 4 particles (half a micron or bigger) in a volume of 1 cubic meter.  There is an ISO-1 classification, which has no particles of half a micron, but counts smaller particles.

What level of cleanliness is appropriate for which activities?  That probably depends on what you’re trying to do.  There’s a semiconductor fab at BYU that boasts ISO-4 compliance.  A company called Nemotek that makes “wafer-level cameras” announced nearly 2 years ago the opening of an ISO-4 clean room in Morocco.  You can imagine how dust particles would affect building camera lenses en masse.  Nemotek claims they can upgrade to ISO-1, but for now ISO-4 is good enough.

So how does one go about constructing a clean room?  The Global Society for Contamination Control maintains a wiki that has a good section on designing a clean room.  The main point of design is how air is handled in the room. Air coming into the room needs to be filtered free of dust.  Air inside the room is recirculated through HEPA (High Efficiency Particulate Air) or ULPA (Ultra Low Particulate Air) filters.  If a person enters or exits a room, they must first go through an air lock and sometimes through an air shower.  Sometimes the room is kept slightly pressurized, in case there is an air leak in the room, the air goes out.  Also, the entire air flow system must be considered, whether laminar or turbulent, and designed to drive particles to filters installed near the floor.

With so much care and attention to detail, it’s no wonder that clean rooms tend to be built by major microelectronics corporations or large hospitals.  Here in Ann Arbor, the new and massive Mott’s Children’s Hospital, a state of the art complex in not only medicine but green building design, has installed HEPA filters throughout the hospital so that patients with compromised immune systems can roam freely.  It is quite an undertaking to build a clean room, but one that is becoming more and more necessary across different industries.  From semiconductor fabs to hospitals, it appears that an environment free of airborne particles is becoming a necessity.  Is this necessity only available to large complexes and multi-million and billion dollar corporations?  Is there a way to make this technology more readily available?

Certainly making clean rooms more readily available has its hurdles.  This includes the work to construct the room to exacting standards and testing to those standards.  It also includes the materials to construct the room, which must be unlikely to flake or shed over time.  But most importantly of all, it includes teaching the people that do their jobs in the clean room how to maintain a clean environment.  No matter how clean you build the room, you can easily destroy that hard work with common habits.  This means that one of the biggest challenges is not just constructing the clean room, but in making sure it can stay clean.

Tags: , ,
Posted in Uncategorized | Comments Off

Humanity’s Most Versatile Servant

December 16th, 2010

Via metafilter, a blog called Abnormal Use is looking back through the New York Times archive at predictions made in 1931 about 2011.  My favorite of the bunch is from William F. Ogburn, a sociologist, who predicted,

Humanity’s most versatile servant will be the electron tube.

According to Wikipedia, Mr. Ogburn was known for his idea of cultural lag in technology, which means that social strife will result from technological advances to which humans haven’t yet learned to adapt.  He was also an advocate of technological determinism, which argues that a society’s technology drives its social development and structure.

While Mr. Ogburn did not know about the microchip, as Abnormal Use points out, in 1931, I believe the gist of this prediction is right on.  Electrical circuits are our most versatile servant.  Today, we are still exploring that versatility with electrical circuits implemented in semiconductors.    With the benefit of hindsight, Mr. Ogburn’s prediction is probably best restated as the transistor, as opposed to the electron tube, since this is the key innovation that was miniaturized (and made more manufacturable) with the advent of semiconductor circuits.  However, it’s hard to imagine that 80 years from now we will not have moved migrated to a more advanced platform.  Hopefully, in 2091, we will still be making useful electrical circuits, but with a more accessible technology than semiconductors.  Although we can make and distribute millions and millions of semiconductors each day, the cost of designing and producing them is expensive and only accessible to few people.

Tags: , , , ,
Posted in Uncategorized | Comments Off

Glitch

November 23rd, 2010

I haven’t been posting much because I’ve been working on my application for grad school, but this weekend, I did get sucked into alpha testing Tiny Speck‘s new Glitch game, to be released this coming Spring.  I’m not much of a gamer but when I saw the invitation come through, I decided to give it a whirl.

I got sucked in and enjoyed playing it, but I am not totally sure why.  It has some pretty fundamental gaming concepts that don’t seem that new: time-based skill learning (like Civilization), hunting for resources, buying tools, etc…, linear 2-D player movement like all the old Nintendo games….  I don’t think it was any of these things that kept me playing.  Nor was it the graphics, which were interesting, but no reason to stay logged in for hours.

I think the unique aspect of it was the impression that your character was building the game.  There was a large map to explore, but it seemed pretty bare bones, as if something went there, but nobody had built it yet.  I would stumble upon houses at certain points, where it looked like characters lived and wonder, “how do I get one of those?”  The game even has real estate listings for these houses!

Although after a while a house in the game seemed like a bad investment.  I had elected for my character to pursue what was basically a chemistry-based skill set: refining, alchemy, mining… and I could carry all of my tools around with me.  I’m not sure what I would use the house for, except if I had decided to pursue a more agrarian approach, like raising animals and crops?  But even then, the houses didn’t seem like they were dedicated to that.

What I liked most about the game was the opportunity to participate in projects.  For instance, if a new street was being built, you could contribute materials, or food or currency or other things.  This felt like I was contributing to the game, however, I wish that it somehow tied in more to other players playing the game.  Not sure how this would be done.  In general I didn’t understand the social networking aspect of the game — how it would help and how to initiate it.

The part of the game that I felt needed the most work was the mood portion.  What’s this for?  Nothing ever seemed to detract from my mood points.  I guess this is because there were no real “bad guys” in the alpha version.  From the trailer (single-link youtube), it sounds like this will change for the general release.  Overall, it was entertaining, but hard to put my finger on why.

Tags: , ,
Posted in Uncategorized | Comments Off

QR Decomposition and Numerical Accuracy in SciPy

October 21st, 2010

I was solving a least squares problem using A = QR decomposition with Python and SciPy and come across an interesting anomaly.  I found that the numerical accuracy of the least squares solution changed for different QR decompositions.

To recap, one way to find the best fit solution to Ax = b is to use the least squares approach, that is the solution that minimizes the square of the error at all of the given data points.  To do this, you can factor A into Q, a set of orthonormal column vectors, and R, an upper triangular matrix.  Q and R can be used to find the least squares fit by solving R xhat = QT * b, where xhat is the least squares fit to the data given in b.  Since R is upper triangular, this equation can be solved quickly with back-substitution.  So the algorithm is,

  1. Factor A = QR
  2. Multiply QTb
  3. Back-substitution on [R QTb]

and in the last column, after the back-substitution will appear your answer.  I followed the example in the Strang book, which starts with

and factors A = QR using the Gram-Schmidt process.  Factoring by hand gives the following decomposition,

if you accept the first column of A as the first basis vector.  For b = [1; 0; 1], this leads to the solution,

I tried doing this in Python with the SciPy package and got an interesting result.  I calculated the error vector e = A * xhatb.  There’s a chance this could be tricky for python: there are a lot of square roots in the decomposition and one of the entries in xhat is 2/3 — both of which can be tough for python in terms of numerical accuracy.  In fact, the error vector in Python for b = [1; 0; 1] is [2e-16; 0; -2e-16].  To quantify the error vector, I take it’s l1-norm, which is 4e-16.

I started taking a closer look for the source of the numerical inaccuracy, and as it turns out, it wasn’t the precision of the square roots, at least not directly.  It turns out SciPy had factored A into Q and R differently than the Strang example by hand.  SciPy came up with

This isn’t necessarily wrong, an A can have different QR decompositions.  I guess SciPy started with a different choice of the first basis vector.  This is still valid, as long as the columns in the resulting Q are still orthogonal to each other and normalized.  But what was strange was how this choice of the first basis vector affected the accuracy of the solution to the least squares problem.  When I used the Q and R derived by hand, the numerical inaccuracy went away in Python!  The l1-norm of the error vector was exactly zero with a different A = QR decomposition!

I’m using SciPy version 0.7.  It’s qr() function relies on functions from the LAPACK library.  I don’t know why there’s a difference between the two decompositions.  I only checked the accuracy of the two decompositions to about 7 or 8 decimal places, so it could still be that one decomposition is less accurate, instead of one decomposition used to solve for this particular b being less accurate.

Tags: , , , , , , ,
Posted in Linear Algebra, QR decomposition | 1 Comment »

Visualizing 5-dimensional Space

October 16th, 2010

Visualizing higher dimension spaces is tough because once we get beyond 3 dimensions, visualizing space is not very intuitive for us anymore.  This can be a major stumbling block early in learning linear algebra.  Often matrices are describing spaces that have more dimensions than three.

In reading an image processing paper this week, I thought of an intuitive way to visualize 5 dimensions.  Imagine what is physically a 2 dimensional space, but actually has more than 2-d worth of information encoded in it.  A  2-dimensional picture is a great illustration of this.  Although this is not how images are traditionally interpreted, imagine a space that is described this way: a collection of pixels, and each pixel contains 5 pieces of information — an x-coordinate, a y-coordinate, an R-value, a G-value, and a B-value; from the RGB color model.

In terms of linear algebra, those pieces of information would make up our x vector,

A single pixel can be described  as a linear combination of these components.  An image can be described as a collection of pixels, so a 5 by m matrix A can encode a picture described by m pixels by multiplying x on the left (Ax).  For instance, consider, in this example, the vector A = [1, 1, 1, 0, 0], where RGB values are considered to be between 0 and 1 (0 means no intensity and 1 full intensity).  The image would be a single red pixel at (1,1).

The greyed out lines are just drawn for reference and are not part of the image to be rendered; they indicate where the x- and y-axes are.  I’ve shown the background as black because that’s consistent with the RGB vector of 000, which is black.

We could add a green dot to the image of a red dot by appending the vector [1, -1, 0, 1, 0] to A.

and finally we could add a blue dot and a white dot to the image.

The A for the image above would be

and the collection of pixels that describe that image, b, would be

Different images would have different A‘s, each resulting in a different b, but the meaning of the x basis vector would have to stay the same for the representation to be the same.  Each colored pixel is just a mixture of information from three different palates: red, blue and green.  Each can be added into the picture with an intensity regardless of the others.  For instance, adding as much blue as I like doesn’t mean I can’t add as much red as I like to an individual pixel.  Those three variables (R, G, B) are independent.  They are also independent of the x and y coordinate system used to describe the location of the pixels in the picture.  That is, I can put a purple pixel anywhere on the picture I like, not just in one section.  The x and y axes are themselves orthogonal to each other.  In this way, these five components form a five dimensional space, each independent of the other and comprehending a picture described by them constitutes being able to visualize 5-dimensional space.

Tags: , , , , ,
Posted in Image Processing | Comments Off

Inverting A’CA

October 14th, 2010

Suppose instead of solving ATCA * x = b once, as for the resistor network in the previous post, you wanted to solve it over and over again.  For instance, suppose the b vector is a set of measurements and you want to solve for x and y based on the new b. In the resistor network example, this would be a new set of external voltages applied to the circuit, but it could also be many other things.

In this case, it behooves you to invert ATCA * x = b up front instead of using elimination and back-substitution with b because the inversion only needs to be done once. Once you have (ATCA)-1 you can do a matrix vector multiply to find x and another to find y.

To find (ATCA)-1, you can use the same techniques as finding the row-reduced echelon form to invert ATCA. This is done by performing elimination and back-substitution on the augmented matrix [ATCA I], where I is the identity matrix.  The result is [I (ATCA)-1].  This, of course, assumes that A and C are invertible. They also haven’t changed in this example. We’re finding new x and y on a new b only; A and C have stayed the same.

So the method is:

  1. Multiply A on the left by C; call this CA
  2. Multiply CA on the left by AT
  3. Remove the 4th row and 4th column from the result of step 2. Append the identity matrix (I) to the right of what’s left.
  4. Put the result of step 3 into row reduced echelon form. This can be done by performing Gaussian elimination and back-substitution. Normalize the entries on the right to get the identity matrix on the left. The result on the right is (ATCA)-1
  5. Multiply b by on the left by (ATCA)-1 to get x
  6. Multiply x on the left by CA and scale all entries of the result by -1. The final result is y.

And there’s an illustration for that too…

This method looks slightly longer than the last, but the advantage is that steps 5 and 6 can be repeated for a new b.  They only require matrix-vector multiply, instead of the more costly elimination and back-substitution.  This is prefect for situations where the A and C stay the same, and b is being updated with new data.

Tags: , , , , ,
Posted in Uncategorized | Comments Off

Solving Resistor Networks Using Gaussian Elimination — An Illustration

October 12th, 2010

A great example of an application of Gaussian elimination comes from Dr. Gil Strang’s Introduction to Linear Algebra. In it, he discusses solving (AtCA) for graphs and networks. Let’s use the example he formulates with Ohm’s Law and Kirchoff’s Voltage and Current Laws (KVL and KCL, respectively) to illustrate how Gaussian elimination is used to solve problems. It gives a network of resistors, every node a voltage, every pair of nodes connected by an edge, representing current flow. The matrix representation of this problem is

Find x for a given b. Once you know x, which represents the node voltages, you can find y, the currents between the nodes. The matrix A is determined by the connectivity diagram in the example in the book. It’s given below.  The matrix C represents the conductances between each node and is given below as well.

One method for finding the node voltages and currents given the connectivity and conductance matrices is to use Gaussian elimination.  This page shows one way to visualize the solution to this problem, using HTML5′s canvas element.  Dr. Strang does an excellent job of explaining the solution to this problem in his book.  The method can be summarized as follows:

  1. Multiply A on the left by C; call this CA
  2. Multiply the result of step 1 on the left by A’ (A transpose)
  3. Remove the 4th row and 4th column from the result of step2.  Append a column of [1, 0, 0]‘ appended on what’s left.
  4. Perform Gaussian elimination on the result of step 3. The node voltages, x, will appear as the 4th column of the 3×4 matrix on which elimination was performed.
  5. Multiply x on the left by CA and scale all entries of the result by -1.  The final result is y, the edge currents.

The purpose of step 3 is to set a common potential, ground, as a reference for all the other nodes.  In doing so, the 4th row becomes a redundant combination of the remaining three unknown node voltages and can be eliminated.  The 4th column can be eliminated as well since the common voltage has been set to ground, which is 0 potential.

In step 4, b is introduced into the solution.  This is done by setting the node voltage x1, to some potential, in this case 1, relative to ground (x4).  If x4 is where ground is connected, then x1 is where the external power supply is connected.  The rest of b (x2=0, x3=0) means that x2 and x3 do not have any external voltage source connected to them.

In Python, this solution can be found using scipy and this script.  Using scipy, Gaussian elimination can be performed by breaking down LU factorization into two steps: the forward elimination step and the back substitution step.  To do this, the user must use the “lu_factor” and “lu_solve” functions, which don’t return P, L, U (the permutation matrix, the lower triangular portion and the upper triangular portion).  Instead, it returns L and U combined into one matrix to save space and the pivots.  The pivots aren’t used in this example, but the LU form is, along with b to find x.

In the case where b is a measurement vector, we will want to calculate (AtCA)^-1 and simply do a matrix-vector multiply every time b is updated to find a new x and y.  Performing elimination every time b is updated is costly and unnecessary.  A matrix vector multiply can turn out to be faster.

Tags: , , , , , , , ,
Posted in Uncategorized | 3 Comments »

Update on Benchmarking LU Factoring

September 27th, 2010

An update on the benchmark from the previous post on factoring a random matrix to upper and lower triangular form, LU:

A quick experiment with MATLAB shows how long it takes (in 2009) to factor a random matrix of size n to LU form.

n A=LU using MATLAB (s)
1000 0.29
2000 1.83
3000 5.77
4000 13.27
5000 25.36
6000 42.9
7000 67
8000 98.7
9000 140
10000 270

The table above is from the average of 3 passes factoring the same matrix. MATLAB version 7.9.0 was used running on Red Hat Enterprise Linux (3.4.6) on a Dell machine with a 2.2 GHz Intel processor and only 345 MB of RAM.

A similar comparison performed with Python and the scientific computing package SciPy running on a Macbook shows similar, even better performance up to n=6000. For larger matrices, MATLAB takes over and at n=7000, the Python + Macbook combination actually runs out of memory.

n MATLAB (s) Python (s)
1000 0.29 0.23
2000 1.83 1.11
3000 5.77 3.02
4000 13.27 6.7
5000 25.36 11.7
6000 42.9 39.59
7000 67 102.7
8000 98.7
9000 140
10000 270

Python version 2.6.3 was used, with Scipy version 0.7.1. The OS used was Mac OSX 10.5.8 running on a 2.1 GHz Intel Core 2 Duo with 2 GB of 667 MHz DDR2 SDRAM. This version of OSX comes with vecLib which means that the Scipy code can take advantage of the LAPACK and BLAS libraries. These were not included in the MATLAB run above, which would improve things dramatically. The Python benchmark shown is the average of 10 runs of factoring a random a into L and U.

The Python code stays competitive with MATLAB until n=7000 and is a good free alternative for matrix sizes less than 7000.

Tags: , , , ,
Posted in Uncategorized | Comments Off

Next page Previous page