[IMC-Tech] Re: mirroring issues

Michael deBeer madebeer at igc.org
Wed, 3 Jan 2001 04:35:12 -0800 (PST)


On Tue, 2 Jan 2001, Matthew Arnison wrote:
> i looked into altaway, and it's roughly $650 a year for something that i
> feel is much less than we currently have with loudeye.

I agree, loudeye is a hell of a good partner.
 
> * i think about 100 GB and growing of storage

If indymedia is already at 100 GB, when will the current 200 GB raid array
fill up? Do you have a breakdown of size per media-type (gifs would be
easier to distribute elsewhere than realaudio)? How much 'expires' and how
much of it is permanent? Is there a way to generate an average 'profile'
of a city.indymedia.org disk-usage over time, to forecast the disk usage
for the next year?
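
A rough sketch of the kind of forecast I mean, assuming someone records
total disk usage once a month.  It fits a straight line to the samples and
estimates how many months remain before the array is full.  The function
name and the sample numbers are made up for illustration; they are not
real indymedia figures, and real growth may not be linear.

```python
def months_until_full(samples_gb, capacity_gb):
    """Estimate months from the last sample until usage reaches capacity.

    samples_gb: total disk usage in GB, one sample per month, oldest first.
    Returns None if usage is flat or shrinking (no fill date to project).
    """
    n = len(samples_gb)
    if n < 2:
        raise ValueError("need at least two monthly samples")

    # Ordinary least-squares fit of usage = intercept + slope * month_index.
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples_gb) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples_gb))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x

    # Month index where the fitted line hits capacity, relative to the
    # most recent sample (index n - 1).
    return (capacity_gb - intercept) / slope - (n - 1)


if __name__ == "__main__":
    # Hypothetical samples: growing 10 GB a month, currently at 100 GB.
    print(months_until_full([70, 80, 90, 100], capacity_gb=200))
```

Running the same fit per media type, or only on the content that never
expires, would answer the breakdown questions above as well.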

I can think of four possibilities:

* enough stuff expires so that you won't go above 200 GB till after 2001
* loudeye will give you RAID arrays as you need them, 400-600 GB, whatever
* find another big group to give you another server to put overflow
  material on.  Maybe approach Exodus and Sun, and ask them to host a free 
  300 GB RAID array.
* figure out distributed storage for older stuff - building on the mysql
  table you have planned which has the name of each content item.
  Have the index server store a list of URLs where each content item can
  be found.  Maybe freenet or maybe a homebrew of distributed mirroring.
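
To make the last possibility a bit more concrete, here is a minimal
sketch of an index that maps each content item name to the list of URLs
where a copy lives.  It uses an in-memory sqlite3 database purely as a
stand-in for the planned mysql table; the table name, column names,
function names and example paths are all assumptions of mine, not the
actual schema.

```python
import sqlite3


def make_index():
    """Create an empty item -> mirror URL index (in memory, for the sketch)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE locations ("
        " item_name TEXT NOT NULL,"
        " url TEXT NOT NULL,"
        " UNIQUE (item_name, url))"
    )
    return db


def add_location(db, item_name, url):
    """Record that a copy of item_name can be fetched from url."""
    # INSERT OR IGNORE keeps a mirror from being listed twice for one item.
    db.execute(
        "INSERT OR IGNORE INTO locations (item_name, url) VALUES (?, ?)",
        (item_name, url),
    )


def find_locations(db, item_name):
    """Return every known URL for item_name, sorted for stable output."""
    rows = db.execute(
        "SELECT url FROM locations WHERE item_name = ? ORDER BY url",
        (item_name,),
    )
    return [url for (url,) in rows]


if __name__ == "__main__":
    index = make_index()
    # Hypothetical item name and paths, for illustration only.
    add_location(index, "march.ram", "http://archives.indymedia.org/march.ram")
    add_location(index, "march.ram", "http://rabble.indymedia.org/march.ram")
    print(find_locations(index, "march.ram"))
```

The index server would then only need to hand out one of the listed URLs,
and a mirror that drops out just gets its rows deleted.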

> so u can see maybe why i am willing to do some work on our mirroring
> software to suit loudeye if necessary.

Yes.

If for storage space reasons we do need to split some of the work between
loudeye and a distributed network of other servers, we'd want to make sure
it is clear which section of the network is loudeye and which is a
consortium of rabble-mirrors ;)  Maybe older stories could have a URL like:
  rabble.indymedia.org or
  archives.indymedia.org or
  archives.prague.indymedia.org

> i think rsync would still not solve certain problems, due to the way
> we want to have media mirrored as soon as possible after it is
> published. while i agree rsync is excellent, and much better than ftp,
> i think rsync is designed for slower mirroring, such as daily.

That is true.  rsync is better for sync-ing directories, for mirror sites
grabbing all the latest files at their leisure, not for file-upload on
demand.
 
> > Comments on the current scheme: As a way of dealing with thousands of
> > files per directory, perhaps do an MD5 hash of the filename, and use that
> > as the directory.  This would only have to be computed once, and the
> 
> sounds interesting, but i'm not sure i understand this. wouldn't the md5
> hash be different for each filename? 

The hash would be different for most filenames.  Some files might have the
same hash.
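
One way to read the MD5 idea, sketched below: take only the first few hex
characters of the hash as the directory name, so that many files share a
directory but the files spread fairly evenly across a fixed number of
directories.  Using a prefix (and the prefix length of 2) is my assumption
for this sketch, and the function name is made up.

```python
import hashlib


def hashed_dir(filename, prefix_len=2):
    """Return a directory name derived from the MD5 hash of the filename.

    With prefix_len=2 there are at most 16**2 = 256 directories, and the
    result depends only on the filename, so it never has to be stored.
    """
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return digest[:prefix_len]


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    for name in ["seattle-march.ram", "prague-banner.gif"]:
        print(name, "->", "/" + hashed_dir(name) + "/" + name)
```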

> also i think it's better to have URLs
> that are short and easy to type, from a usability standpoint (i know
> people shouldn't have to type them in, but sometimes people end up needing
> to for some reason or other).

True.  

I think the directory system should not break, no matter how many files
are put into it.  MD5-hash directory names would be expandable, but would
create ugly directory names.  I think someone else suggested directories
for each month, which would work.  Also, if each item has a numeric id in
the database, the directories could be based on the numeric ids, so item
number 2002 would go into directory /2/ and item 8003 would go to
directory /8/.  If there is a reason for storing different media types in
different directories, the /2/ proposal could be joined with the
media-type proposal, so that a gif file numbered 3003 would go in /gif/3/
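
A small sketch of the numeric-id scheme, assuming the directory is the id
divided by 1000 (that matches 2002 -> /2/ and 8003 -> /8/, though "first
digit of the id" would match those two examples too, so treat the bucket
size as my guess).  The function and parameter names are made up.

```python
def item_dir(item_id, media_type=None, bucket_size=1000):
    """Return the directory for a content item based on its numeric id.

    Items are grouped bucket_size to a directory, so no directory ever
    holds more than bucket_size items however large the ids grow.
    """
    if item_id < 0:
        raise ValueError("item_id must be non-negative")
    bucket = item_id // bucket_size
    if media_type:
        # Optional media-type prefix, e.g. /gif/2/
        return "/%s/%d/" % (media_type, bucket)
    return "/%d/" % bucket


if __name__ == "__main__":
    print(item_dir(2002))         # /2/
    print(item_dir(8003))         # /8/
    print(item_dir(2002, "gif"))  # /gif/2/
```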

Michael