Yahoo has opened its M45 supercomputer to its partners at Carnegie Mellon University, an initiative that should help boost research benefiting the broader Internet community.
For those of you firing up the old faithful laptop for a morning of surfing, blogging, maybe a little development work, get a load of what some of the lucky geeks at Carnegie Mellon University got to play with this morning:
The M45, Yahoo’s supercomputing cluster, has approximately 4,000 processors, three terabytes of memory, 1.5 petabytes of disks, and a peak performance of more than 27 trillion calculations per second (27 teraflops), placing it among the top 50 fastest supercomputers in the world.
That ranking claim won’t be confirmed until the next Top500 Supercomputer list comes out on Tuesday at this week’s SC07 conference in Reno, so it will be interesting to see how M45 measures up against the best in the world. By our reading, Yahoo’s figures should put M45 in the top 30.
We chatted with Ron Brachman, Yahoo’s VP for worldwide research operations, who also serves as head of academic relationships. Jay Kistler, VP for engineering system tools & services, also talked with us ahead of this morning’s announcement.
Brachman said the M45 supercomputer came about from the opportunity for Yahoo and the university community to advance science and technology on an Internet scale. They have opted to focus on open source, developing solutions for large-scale distributed computing.
Yahoo and Carnegie Mellon understand grid computing well. The M45 setup has been geared toward that understanding. It’s capable of partitioning large data sets thanks to the installation of Hadoop.
Hadoop accomplishes this by implementing MapReduce. Alongside it comes Pig, which may be familiar to those who follow Yahoo’s research projects closely.
Kistler said they have been working on layering Pig over a Hadoop core. Pig’s runtime extensions for parallel computing are similar to SQL, but they are procedural rather than declarative.
In the M45 environment, the runtime maps statements down to where MapReduce can divide them into little blocks of work and run them across the supercomputing platform.
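To make the "little blocks of work" idea concrete, here is a minimal single-machine sketch of the MapReduce pattern in Python. It is not Hadoop's or Pig's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample text are made up for illustration, and a real cluster would run the map step for each block on a different node.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key into one result."""
    return key, sum(values)


def run_job(blocks):
    # Hypothetical driver: the blocks are processed one after another here,
    # whereas a cluster would hand each block to a separate worker.
    emitted = []
    for block in blocks:
        emitted.extend(map_phase(block))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


# Made-up input, already split into three blocks of work.
blocks = ["yahoo opens cluster", "cluster runs hadoop", "hadoop runs on cluster"]
counts = run_job(blocks)
```

Because each block is mapped independently, adding more blocks (and more machines to take them) does not change the logic, only how much work runs at once.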
We wanted to understand better what the distributed development effort being enabled by M45 might be able to do for this level of supercomputing. Kistler rattled off a couple of achievements he would like to see happen, if developers can pull them off.
One is better job scheduling across clusters; another is improved monitoring and instrumentation of heterogeneous jobs, which would make it easier to find bottlenecks and faults and correct them for better performance.
Compelling stuff for the folks who will really get into the tasty innards of supercomputing. The potential gains from the M45 go beyond the items on Kistler’s wish list.
Carnegie Mellon’s Randy Bryant, dean of the School of Computer Science, told us in a phone interview about such possibilities. Top of the list: generating statistics for language translation. It’s a demanding task due to the number of documents needed for mapping words from multiple languages.
Another potential gain would be in digital image editing. Bryant illustrated this with the example of getting an ex-brother-in-law out of photos. By drawing on a massive digital image database, supercomputing could allow the editor to find the content of a photo minus the person to be edited out, and replace that person with the background that would normally be visible.
Semantics and language search support would benefit, and we think Yahoo will be interested in that. Bryant noted such a project would look at resolving linguistic ambiguity, where the system would work out whether a speaker means “bare” or “bear” from the context of the rest of a conversation.
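As a rough illustration of picking between sound-alike words from context, the toy Python sketch below counts which words appear around each candidate in a handful of labeled sentences, then scores a new sentence against those counts. Everything in it is an assumption made for the example: the training sentences are invented, the names `train` and `guess` are hypothetical, and nothing here reflects how the Carnegie Mellon researchers would actually approach the problem. A serious system would need vastly more text, which is where a cluster like M45 comes in.

```python
from collections import Counter

# Invented toy training data: (context sentence, intended word).
EXAMPLES = [
    ("a grizzly in the woods ate the salmon", "bear"),
    ("the cub followed the mother through the forest", "bear"),
    ("the walls of the empty room had no paint", "bare"),
    ("he walked on the cold floor with no shoes on his feet", "bare"),
]


def train(examples):
    """Count how often each context word appears alongside each candidate."""
    counts = {}
    for sentence, word in examples:
        counts.setdefault(word, Counter()).update(sentence.split())
    return counts


def guess(counts, context):
    """Pick the candidate whose training contexts overlap most with this one."""
    tokens = context.split()
    scores = {word: sum(c[t] for t in tokens) for word, c in counts.items()}
    return max(scores, key=scores.get)


model = train(EXAMPLES)
choice = guess(model, "we saw a cub in the forest")
other = guess(model, "an empty room with a cold floor")
```

With so little data, common words such as "the" carry far too much weight; larger document collections and proper weighting are what make this kind of statistic useful in practice.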
Research takes time, but the M45 platform should substantially reduce the total time needed for these projects to bear productive results. Some very lucky geek types started researching on this platform today.