DinoSearch Architecture

24th January 2001
The architecture message read as follows:
From: Mike Taylor 
To: William.BONNET@tcc.thomson-csf.com
Cc: dinosearch@egroups.com
Subject: Re: Info Request : Dino place in paris

> Date: Thu, 18 Jan 2001 11:36:04 +0100
> From: William.BONNET@tcc.thomson-csf.com
>
> I'm really interested by the message you posted about dinosearch. I
> had answered to Ekaterina Amalitzkaya about the finding of reprints
> in this way.  I was thinking about the idea to set up such an online
> database.
> 
> If I could be any help let me know. I'm a beginner here about
> paleontology (amongst others fields...).  It's one of my
> passions. So I saying many stupid things :) but I'm better at
> computer science (at least I hope :) ).

Well, because it's a European-funded thing, and they like to deal with
companies and other institutions rather than individuals, I think
that the most helpful thing you could do at this stage would be to
stir up interest in your employer (or any other organisation that
you're affiliated with).  Do you work for someone relevant?  And what
country are they in?

Beyond that, at some stage we'll be looking to launch a volunteer
program to get papers into the system: this could be a matter of
scanning and OCRing (whole papers or just abstracts), or re-typing
abstracts, or choosing keywords, or any number of other activities.
At some stage, we'll need someone to co-ordinate that work.
Interested?  And of course, we'll need people to actually _do_ the
work!

> I actually don't know how your project will be set up. I was
> expecting to build an online database with MySql, Apache and a Linux
> Server I can access at will (as long as what I put online is
> non-commercial, legal, etc.).

We're planning bigger than that! :-)

The architecture is a federated hierarchy of brokers.  Here's what I
mean:

Any number of institutions -- individual publishers, reprint agencies
and other organisations -- can maintain their own archive servers.
Each server presents a standardised Z39.50 interface, which can be
searched uniformly by a single, standard client (accessed via the
web).

Rather than connect a client directly to a server, what you'll more
often do is connect it to a broker that forwards the query to a bunch
of servers that it knows about, and synthesises all the responses into
a result set that it forwards back to the client.  From the point of
view of the client, the broker is Just Another Server; and from the
point of view of the server, the broker is Just Another Client.
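
The "Just Another Server / Just Another Client" idea can be sketched in
a few lines of Python.  This is only an illustrative model, not the
DinoSearch implementation and not real Z39.50: the names Searchable,
ArchiveServer, Broker and search() are invented for the example, a
"paper" is just a placeholder title string, and matching is a naive
substring test.

```python
from typing import Iterable, List, Protocol


class Searchable(Protocol):
    """Anything that can answer a query: an archive server or a broker."""

    def search(self, term: str) -> List[str]: ...


class ArchiveServer:
    """Hypothetical leaf node holding a list of paper titles."""

    def __init__(self, name: str, papers: Iterable[str]) -> None:
        self.name = name
        self.papers = list(papers)

    def search(self, term: str) -> List[str]:
        # Case-insensitive substring match, standing in for a real query.
        return [p for p in self.papers if term.lower() in p.lower()]


class Broker:
    """Hypothetical broker: forwards the query and merges the responses."""

    def __init__(self, name: str, targets: Iterable[Searchable]) -> None:
        self.name = name
        self.targets = list(targets)

    def search(self, term: str) -> List[str]:
        merged: List[str] = []
        for target in self.targets:
            for hit in target.search(term):
                if hit not in merged:  # drop duplicates across targets
                    merged.append(hit)
        return merged


# Placeholder holdings, invented purely for the illustration.
oxford = ArchiveServer("Oxford", ["Paper about Megalosaurus"])
glasgow = ArchiveServer("NHM Glasgow", ["Paper about Baryonyx"])
england = Broker("England", [oxford, glasgow])
print(england.search("baryonyx"))
```

Because Broker and ArchiveServer expose the same search() method, a
Broker can be handed other Brokers as targets without any change to the
code, which is all that the nesting relies on.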

Because this system is "plug-compatible", a broker may in fact delegate
a search not directly to a server, but via another broker: it
neither knows nor cares whether any given server is "real" or a
broker.  So one can build arbitrary hierarchies, allowing autonomy to
departments, institutions, provinces, countries, etc.  Consider the
following simplified example topology (hope you're reading this with a
fixed-width font):

			Client
			  |
			  |
			global
			broker
	   ______________/\______________
	  /			         \
	USA				European
	broker				broker
   ______/\______		   ______/|\______
  /	         \		  /	  |       \
AMNH		NMNH		France	Germany	England	...
server		broker		broker	broker	broker
	   ______/\______	 /|\	   ______/\______	
	  /	         \	.....	  /	         \
	NMNH		NMNH		Oxford		NHM
	fossils		cladistics	server		broker
	server		server			   ______/\______
						  /	         \
						NHM		NHM
						Kensington	Glasgow
						broker		server
					   ______/\______
					  /	         \
					NHM		NHM
					Saurischia	Ornithischia
					broker		server
				   ______/\______
				  /	         \
				NHM		NHM
				Sauropod	Theropod
				server		server

So if the client issues a search for Baryonyx, then that search will
get propagated down through the hierarchy, and if it gets a hit on a
paper held by (say) the English Natural History Museum's conjectural
sub-department of Theropoda, then it will be passed up through the NHM
Saurischia broker, then the NHM Kensington broker, the NHM broker, the
England broker, the European broker and the global broker, which
returns it to the client _exactly_ as though the global broker had
been an archive server containing the paper directly.  In effect, it
behaves as a union catalogue for the entire world; or the client could
instead speak directly to the USA broker, or indeed (if you know
exactly what you want already) the NHM Ornithischia server.
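
The route that such a hit takes back up can be traced with a small
Python sketch.  It rests on several assumptions: the nested dict below
covers only the English branch of the example topology, the holdings
are invented placeholders, and route() is a hypothetical helper that
stops at the first hit rather than collecting a full result set.

```python
from typing import List, Optional, Union

# A broker is a dict of named children; a server is a list of held titles.
Node = Union[dict, list]

# Part of the example topology; the holdings are invented placeholders.
TOPOLOGY: Node = {
    "European broker": {
        "England broker": {
            "Oxford server": [],
            "NHM broker": {
                "NHM Glasgow server": [],
                "NHM Kensington broker": {
                    "NHM Ornithischia server": ["Iguanodon paper"],
                    "NHM Saurischia broker": {
                        "NHM Sauropod server": ["Diplodocus paper"],
                        "NHM Theropod server": ["Baryonyx paper"],
                    },
                },
            },
        },
    },
}


def route(node: Node, term: str, name: str = "global broker") -> Optional[List[str]]:
    """Return the nodes a first hit passes through, from the server upwards."""
    if isinstance(node, list):  # an archive server: check its own holdings
        found = any(term.lower() in title.lower() for title in node)
        return [name] if found else None
    for child_name, child in node.items():  # a broker: ask each child in turn
        path = route(child, term, child_name)
        if path is not None:
            return path + [name]  # append this broker on the way back up
    return None


print(" -> ".join(route(TOPOLOGY, "Baryonyx")))
```

Passing a sub-tree instead of TOPOLOGY models a client that connects
lower down the hierarchy, for instance straight to the NHM broker.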

You can bet, though, that Linux, Apache and MySQL (and Perl) will be
right in the heart of what gets built.  Even if you have money to spend
on tools, you still want to use what's best, right?  Even if it's free
:-)

The original architecture message is stored in the DinoSearch mailing list's archive at www.egroups.com/message/dinosearch/2
Feedback to mike@tecc.co.uk is welcome!