Friday, September 20, 2024

Interview: Thunderstone VP Doran Howitt

A few weeks ago I put up a post about the Google Small Business Search Appliance. I listed its pros and cons and thought Google had come up with a pretty decent product.

After that post I was contacted by Doran Howitt, the VP of Marketing at Thunderstone. Doran told me about the Thunderstone Search Appliance SBE his company had recently launched. He pointed to an eWeek head-to-head comparison of the Thunderstone search appliance and Google’s search appliance. The review is pretty good, but I thought you deserved a bit more of that “personal touch”, so I conducted an interview with Doran. Enjoy!

Thunderstone's Small Business Edition Search Appliance

Jason: What does the “SBE” stand for in the Thunderstone Search Appliance?
Doran: “Small Business Edition”. That’s to differentiate it from our “enterprise” editions.

Jason: Do you allow your customers to expose the search results of your search appliance to the internet or is it strictly for use on internal corporate intranets?
Doran: Yes, either. In addition, we allow its use for *indexing* other web sites out on the internet. You can serve those search results either to the public or just for your internal use.

Jason: What Thunderstone software is embedded in the Search Appliance?
Doran: The appliance runs on top of our Texis software. Texis is our flagship product. It combines the features of a search engine and a relational database. Texis is actually an entire application development suite for text-intensive or search-intensive applications.

Jason: What adjustments, if any, can users make to the algorithm(s) that determine the importance of a particular document for a particular query?
Doran: Users can set the rank knobs selectively for each search. They can also turn the thesaurus, pattern matching, proximity, and stemming on or off. That’s if the administrator has turned those things on; those are settings! And of course there are the + and – logic operators, phrases, and wildcards. For the geeks, you can search with a regular expression.

Jason: Is linking part of the built in ranking algorithm?
Doran: That information is captured for tracking and reporting. As of today we’re not using it in ranking. The reason is that link weighting is not useful in most intranet situations or within a single web site. It only helps in the context of a very broad web index, where links created by a huge number of people provide a kind of popularity measure. We would add linkages as a ranking feature if customers requested it, but so far they haven’t.

Jason: Your FAQ page says the appliance can index data stored in relational databases like SQL Server, MySQL and Oracle but do you point it at a specific table(s) or can you tell it to only index the results of a particular query?
Doran: In the underlying Texis software, you actually point it at a table. That’s not yet enabled in the appliance, mainly because in most situations we’ve seen, the appliance can get at all the dynamic content by HTTP. It can submit queries as needed. But we’ll probably add direct database indexing in the next major release, because certainly there are situations where it would be useful.

Jason: How long does it take to set up the appliance from opening the box to having it online and indexing documents?
Doran: Setup and configuration should be 20 minutes or so. It’s mainly a matter of pointing the crawler at the desired data. Although you might want to spend a little more time prettying up the results page HTML! If you have a somewhat complicated web structure, where you only want some things indexed and not others, you would spend some more time defining the exclusion and inclusion rules.
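
To make the inclusion and exclusion rules a little more concrete, here is a minimal Python sketch of how a crawler filter of that kind could work. The wildcard syntax, the `should_index` function, and the example URLs are my own assumptions for illustration; this is not the appliance's actual rule format.

```python
from fnmatch import fnmatchcase


def should_index(url, include, exclude):
    """Decide whether a crawler should index a URL.

    Hypothetical rule model: exclusion patterns win over inclusion
    patterns, and a URL matching no inclusion pattern is skipped.
    Patterns use shell-style wildcards, where * matches any characters.
    """
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    return any(fnmatchcase(url, pattern) for pattern in include)


# Example rules (made up): index the whole intranet except private areas.
include_rules = ["http://intranet.example.com/*"]
exclude_rules = ["*/private/*", "*.tmp"]

for candidate in [
    "http://intranet.example.com/docs/handbook.html",
    "http://intranet.example.com/private/salaries.html",
    "http://other.example.com/index.html",
]:
    print(candidate, should_index(candidate, include_rules, exclude_rules))
```

Letting exclusions win keeps the rules easy to reason about: you can include a whole site with one broad pattern and then carve out the parts you do not want.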

Jason: What kind of support is involved / necessary from Thunderstone in order to get the box up and running?
Doran: We usually ask for the IP settings before shipping it, so that you can just plug in the Ethernet and start going through the admin menus from a browser. In case of any problem we can remotely diagnose it. Of course some customers like to be talked through the initial configurations, and we’re happy to do that. If you’ll be indexing a public web site, usually we’ll actually crawl it before we ship the box, so that when you plug it in, it’s all ready to serve search results, and the updating will proceed in the background.

Jason: Architecture: What’s the operating system, amount of ram, processor speed / type, and hard disk size / arrangement on each of the appliances?
Doran: It’s Linux and Intel on the inside. But the OS isn’t exposed to the Appliance customer. The whole idea of an appliance is that you don’t have to worry about software. Anyone who wants to get at the product at that level can license the software without our box! As to Appliance memory, etc., those will vary depending on the capacity that you buy. The low-end box will have 1GB of memory and a 40GB disk. That’s all you need to index 50,000 documents or web pages.

Jason: What language is the search engine written in?
Doran: The core Texis software is written in C. The crawler application is written in our Texis Web Script, which is a compiled high-level language similar to PHP.

Jason: What are the 5 ranking knobs the user can adjust?
Doran: Closeness of query words to the beginning of a document; order of occurrence of the query words; proximity (closeness) of query words to each other within a record; rarity of query words in each document; and rarity of query words in the whole index.
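
For readers who like to see this kind of thing in code, here is a rough Python sketch of how five adjustable knobs could be blended into one score. Thunderstone's actual formulas are not given in the interview, so every factor definition below, along with the `Knobs` and `rank_score` names, is my own assumption for illustration.

```python
from dataclasses import dataclass


@dataclass
class Knobs:
    """Hypothetical weights, one per ranking factor named above."""
    lead_bias: float = 1.0      # closeness to the start of the document
    word_order: float = 1.0     # query words appear in query order
    proximity: float = 1.0      # query words close to each other
    doc_rarity: float = 1.0     # rarity of query words in this document
    index_rarity: float = 1.0   # rarity of query words in the whole index


def rank_score(query, doc, doc_freq, knobs):
    """Weighted average of five toy factors; 0.0 if nothing matches."""
    words = doc.lower().split()
    terms = list(dict.fromkeys(query.lower().split()))  # dedupe, keep order
    matched = [t for t in terms if t in words]
    if not matched:
        return 0.0
    positions = [words.index(t) for t in matched]  # first occurrence of each

    lead = 1.0 - min(positions) / len(words)
    pairs = list(zip(positions, positions[1:]))
    order = sum(a < b for a, b in pairs) / len(pairs) if pairs else 1.0
    prox = len(matched) / (max(positions) - min(positions) + 1)
    rare_doc = sum(1.0 / words.count(t) for t in matched) / len(matched)
    rare_idx = sum(1.0 / max(doc_freq.get(t, 1), 1) for t in matched) / len(matched)

    weights = [knobs.lead_bias, knobs.word_order, knobs.proximity,
               knobs.doc_rarity, knobs.index_rarity]
    factors = [lead, order, prox, rare_doc, rare_idx]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * f for w, f in zip(weights, factors)) / total


# Two made-up documents: the query word leads one and trails the other.
early = "merger announced today by the two largest banks in the region"
late = "it was a quiet morning in the region when news of the merger arrived"
freq = {"merger": 2}  # number of indexed documents containing each word

for label, knobs in [("default", Knobs()), ("lead-heavy", Knobs(lead_bias=5.0))]:
    print(label, rank_score("merger", early, freq, knobs),
          rank_score("merger", late, freq, knobs))
```

Turning up `lead_bias` widens the gap between the two documents, and setting it to zero makes position in the document irrelevant, which is the sort of per-search adjustment a knob like this allows.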

Jason: What file type has been the most difficult to index from a programmatic standpoint? Flash files, video files, applets?
Doran: Not any of those. We index text and links found in any Flash file. Also text such as captions within video or graphics. Applets we see as just another frame with some JavaScript, so no problem there — the Appliance executes the JavaScript and indexes everything found in the file. The thing that’s occasionally troublesome is Lotus Notes with Domino. That tends to have a lot of different views of the same data, and web pages that are near duplicates, but different enough to confuse our duplicate detector. In the end, it typically takes some trial and error to get the exclusion settings right for Notes data.
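
To show why near duplicates are tricky, here is a small Python sketch of one common general-purpose technique: comparing pages by their overlapping word sequences ("shingles") and measuring Jaccard similarity. Doran does not describe how Thunderstone's duplicate detector works, so this is a swapped-in textbook approach, and the function names, threshold, and sample text are assumptions of mine.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences found in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Shared shingles divided by all distinct shingles (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold=0.8):
    """Treat two pages as duplicates when similarity meets the threshold."""
    return jaccard(a, b) >= threshold


# Two made-up "views" of the same data, differing only in a trailing label.
view_a = "quarterly sales report for the northeast region by product line"
view_b = view_a + " sorted by date"

print(jaccard(view_a, view_b))
print(is_near_duplicate(view_a, view_b))        # default threshold of 0.8
print(is_near_duplicate(view_a, view_b, 0.7))   # looser threshold
```

The two views are obviously the same data to a human, yet whether the code treats them as duplicates depends entirely on the threshold, which hints at why exclusion settings for this kind of content can take some trial and error.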

Jason: What vertical market has required the most tuning / tweaking to your ranking algorithm on the software side?
Doran: An interesting example that we’ve seen is news publishing. Newspaper articles tend to have the most important material close to the beginning, so in a newspaper search application, you would give the “lead bias” factor more weight. Magazine articles tend to start with an anecdote, which can actually be misleading as to what the rest of the article is about. So in a magazine archive, you’d crank that factor down.

Jason: Why is this application better for small businesses and organizations than the Google small business appliance?
Doran: Our product differs from Google’s in some key respects. One is that we allow our customers to index third-party information, that is, material you don’t own, which may be on any web sites out on the internet. Google prohibits that, I guess because they don’t want customers competing with their core business. Another key difference is that Thunderstone optionally licenses the underlying software. We even give out the crawler application source code. You can hack it up, create extensions, or tie it in with a larger set of applications. We’ll even take back an appliance as trade-in on a software license! The software is available on all major Unix platforms, and Windows.

Jason Dowdell is a technology entrepreneur and operates the Marketing Shift blog.
