The high availability wiki project (aka the distributed wiki project) (aka ARGSS! the Automatic Regenerative Grocking Storage System!)
stream-of-consciousness thoughts on the distributed wiki project. Please help organize. Previous thoughts at Community:DistributedEditing , WikiFeatures:FailSafeWiki, 2007-02-19 – should I move them here?
I would like to suggest that we should move them here – SamRose
My primary goal is a “fault-tolerant” wiki: a wiki where, if any one thing goes wrong (for example, if any one server is suddenly unplugged), none of the regular wiki users even notice.
Secondary goal: testability.
keywords: “twin”, “backup”, “failsafe”, “wiki”, “fault tolerant”, “data”, “data store”, “file system”, “data storage”, “continuous data protection”, “vault”, “high availability”, “brittle vs. resilient”, …
goal: fault tolerant.
Therefore, distribute across at least 2 separate servers (preferably in different cities).
Therefore, avoid encryption (at least in version 1.0), because that makes things more difficult to test.
Therefore, distribute simple copies of wiki pages (at least in version 1.0), rather than complicated erasure codes (ErasureCode).
SteelEye Technology Inc. mentions “transparent failover”. How do they do that? Why don’t they list how much their products cost?
In any case, when a new server comes online, somehow tell it the URL of some node in the network. Then it automatically downloads from that node some (or all?) URLs of other nodes in the network. Then it automatically downloads the latest version of some (or every) page in the wiki, in the normal way (i.e., each page is independently downloaded from the geographically closest node that has that page, with a brief query to a few more distant nodes to see if that is really the latest version). Then it edits its local copy of the “list of URLs of every node in the network” (is this a normal wiki page?) to add its own URL, pings some (all?) of the other URLs, and tags the ones that are currently not online with a frowny face. After that file is saved, the new version of the file is distributed normally to the rest of the network of nodes. Eventually the URLs with the most frowny faces are pruned from the list.
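The join procedure above can be sketched as a small in-memory simulation. This is only a rough sketch under stated assumptions: the names Node, reachable, join and prune are made up for illustration, the "network" is a plain dict standing in for real HTTP requests, a page version is a simple integer, and the "geographically closest node" refinement is ignored entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    url: str
    online: bool = True
    peers: dict = field(default_factory=dict)  # peer URL -> frowny-face count
    pages: dict = field(default_factory=dict)  # page name -> (version, text)


def reachable(network, url):
    """Stand-in for a real ping: return the node if it answers, else None."""
    node = network.get(url)
    return node if node is not None and node.online else None


def join(new, seed_url, network):
    """Bring `new` online, given only the URL of one existing node."""
    seed = reachable(network, seed_url)
    if seed is None:
        raise ConnectionError("seed node is not reachable")
    # 1. Download the seed's list of node URLs.
    new.peers = dict(seed.peers)
    new.peers.setdefault(seed_url, 0)
    # 2. Ping each listed node; tag the silent ones with a frowny face,
    #    and pull the newest version of every page from the ones that answer.
    for url in list(new.peers):
        peer = reachable(network, url)
        if peer is None:
            new.peers[url] += 1
            continue
        for name, (version, text) in peer.pages.items():
            if name not in new.pages or version > new.pages[name][0]:
                new.pages[name] = (version, text)
    # 3. Add our own URL, then share the updated list with reachable peers.
    new.peers[new.url] = 0
    network[new.url] = new
    for url in new.peers:
        peer = reachable(network, url)
        if peer is not None and peer is not new:
            for u, frowns in new.peers.items():
                peer.peers[u] = max(peer.peers.get(u, 0), frowns)


def prune(node, limit=3):
    """Drop URLs that have collected `limit` or more frowny faces."""
    node.peers = {u: f for u, f in node.peers.items() if f < limit}
```

In a real system the node list would presumably travel as an ordinary wiki page, as the paragraph above suggests; here it is merged directly into each peer to keep the sketch short.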
Specialized hardware. We don’t need any of this stuff for version 1.0 . But it might help reduce hardware costs for very large systems.
At first I thought I would need some sort of custom, or at least specialized, fault-tolerant hardware – perhaps the NSLU2 or something like it: NSLU2 wiki: How to back up your Linux box with NSLU2
“If you are a sysadmin contemplating the use of RAID, I strongly encourage you to use EVMS instead. Its a more flexible tool that uses RAID under-the-covers, and provides a better and more comprehensive storage solution than stand-alone RAID.” http://linas.org/linux/raid.html links to http://evms.sourceforge.net/
advanced mathematical techniques. We don’t need any of this stuff for version 1.0 . But it might help reduce hardware costs for very large systems.
For example, say we want to guarantee no files are lost when any one disk fails. And we want to be able to store N disks full of files.
Other erasure codes (such as the one used by RAID 6) are even more robust against drive failure.
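One well-known way to survive any single disk failure while storing N disks' worth of files is to add one parity disk (N+1 disks in total), as RAID 4 and RAID 5 do. A minimal sketch, assuming every disk is modelled as an equal-length byte string; the function names are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def add_parity(data_disks):
    """Return the N data disks plus one parity disk (N+1 disks in total)."""
    return list(data_disks) + [xor_blocks(data_disks)]


def rebuild(disks, failed):
    """Recompute the contents of disk number `failed` from the survivors."""
    survivors = [d for i, d in enumerate(disks) if i != failed]
    return xor_blocks(survivors)
```

Because XOR is its own inverse, the same rebuild function recovers a lost data disk or a lost parity disk. Compare this with simple copies, which need 2N disks to give the same single-failure guarantee.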
WikiWikiSync: Leslie Michael Orchard writes: “I have a wiki on my laptop, my PDA, and on my webserver. I’d like a way in which I can keep all three of them synched up in part or in whole. Because, I use my laptop for personal notes in places where there is no internet access and I’d like certain categories of those notes to be public on my webserver. … I also want to be sure not to clobber changes which might have been done while I’ve been away twiddling. …”
DAV: The one component most likely to fail in a data storage array is a hard drive. RAID does a great job preserving the data when any single hard drive fails. I want something similar to that, except even more fault tolerant. I want something that not only preserves the data, but also keeps on allowing people to read files and write files, even when one of the power cords is unplugged.
MAID is a technique that, given a certain collection of disks, tries to make them last as long as possible (at the cost of a small reduction in performance).
See wear leveling for details.
[this describes something like what I want … but much, much larger. Can I actually buy a small version of this ?] “The technology used a proprietary web server from company X. The system was large: it consisted of hundreds of Unix-based SMP servers and a handful of mainframes located in two cities, mirrored in real time so that a major catastrophe in one city would not disrupt the operation of the system. It cost tens of millions of dollars, … The powerful, personal lesson that I take away from this is never, ever use proprietary software for something important. Even if your life depends on it. Especially if your life depends on it. If a bug needs fixing, I can hire a programmer, or two or three or a dozen, to fix it. The cost will be less than waiting for the vendor to fix it for me.” http://linas.org/theory/freetrade.html
Perhaps I could take 2 (or more) of these file servers, and with a little bit of additional software convert them into a single fault-tolerant file server system.
Would such a distributed fault-tolerant file server system be a good base for my fault-tolerant wiki?
In a democracy, information access requires an information base secure from intrusion, distortion, and destruction; one protected from both physical and technological deterioration. -- Patricia Wilson Berger, 1989 President of the American Library Association http://www.clir.org/pubs/reports/graham/intpres.html
lvm2 “the Linux Logical Volume Manager … groups arbitrary disks into volume groups.” mdadm “a program that can be used to create, manage, and monitor MD devices (software RAID) …” (Can this be used to group one hard drive in Oregon and another hard drive in Tulsa into a RAID system?)
“storage space is no longer an issue. A couple of weeks ago, Amazon announced S3 (Simple Storage Service). For 15-cents/month you get a gigabyte of online storage.” http://www.wikisquared.com/2006/05/storage_space_i.html (yet another online file server)
Should I put this distributed wiki project on Sourceforge?
“The Linux Virtual Server is a highly scalable and highly available server built on a cluster of real servers, with the load balancer running on the Linux operating system. The architecture of the server cluster is fully transparent to end users, and the users interact as if it were a single high-performance virtual server.” http://www.linuxvirtualserver.org/
would any of the items in the “Distributed Computing” section of https://help.ubuntu.com/community/UbuntuScience be useful in assembling a fault-tolerant data store?
“The aim is that if you unexpectedly switch off the computer, or if it crashes, then on re-booting the computer is as similar as possible to its state just before the disaster, and certainly hasn’t become inconsistent or corrupt.” http://rrt.sc3d.org/Software/Tau/Doing%20it%20better%201999-06-10%20c
“I’m working on a user friendly mirroring/monitoring system, so I (and other users) can keep multiple copies of their site in sync at multiple Memebot (or other) servers.” http://www.memebot.com/faq.html?faqindex=4#faqanswer
decentralization, in general – perhaps a topic for my fault-tolerant wiki
Pavatars are something like Gravatars, but they are independent of a central server.
<blockquote> Our new goal is to have no single point of hardware failure be able to bring down any of our sites, to be able to upgrade OSes without bringing everything down, and to be able to recover from very major problems in 15 minutes. This goal is fairly absurd. There will always be single points of failure. The ozone layer, for example. A more accurate goal would be “No single hard drive, power supply, or fan failure takes us out of commission for even 1 second. Any other server failure can be recovered from in 15 minutes. Routine Windows Updates will no longer bring our site down. If something happens to the colo facility, its power supply, or the City of New York, well, we go off the air for a while.” … In choosing servers, there’s a lot of benefit to using totally standard, identical parts. It makes it easy to swap things around when you need to scale or when something fails, and things are cheaper by the dozen. In order to avoid going totally broke, we still want reasonably affordable servers, but my requirement of “no single point of failure” means we need servers with dual power supplies, hot-swap RAID, and everything engineered for longetivity. … </blockquote> http://www.joelonsoftware.com/articles/ColoExpansionPart1.html
The Plastic File System is an LD_PRELOAD module for manipulating executing programs’ ideas of what the file system looks like. This allows virtual file systems to exist in user space, without kernel hacks or kernel modules. http://members.canb.auug.org.au/~millerp/README.html
DAV: perhaps promote my distributed wiki as “the distributed wiki”, with initial discussions on distributed computing, … what else would be related ? … anti-censorship? Well, no, I think starting out without paranoid secrecy. … Ah – networks of trust, trust metric, classification. … hardware fault tolerance … hardware failure analysis, software failure analysis … CommunityWiki:OpenTheory ? …
fault-tolerant data store: competitors
“Western Digital Puts Storage on VPN: Western Digital has introduced a new dual-drive shared network storage product for small business.” by Chris Mellor, Techworld, Friday, February 23, 2007 http://www.pcworld.com/article/id,129325-c,storage/article.html <q> The My Book World Edition II can have [up to] 1TB of data … The device can run independently of a host PC and has a browser-based link to the Internet. Part of its software makes the device a target in Windows ‘Save As’ and ‘Open’ operations whenever the user is connected to the Internet. That means users can have the device attached to their base office PC and access it over the Internet when hundreds or thousands of miles away. They can also backup new data on their laptop back to their office using the same software.
Small workgroups can share documents and other files over the Internet or within a local network without the need for separate FTP server or network-attached storage (NAS) functionality.
… quasi-VPN tunnels are set up between the device and valid accessing PC users across Internet links. It uses Mionet software. Accessing users are validated via a Mionet facility and can then read and write files on the device securely. Transmitted data is encrypted. (Read the Mionet FAQ here.)
The My Book World Edition II comes with wired or wireless access to a host PC and in 500GB or 1TB capacities. It is user-serviceable, allowing users to open the enclosure to replace one or both of the hard drives by following the detailed instructions, without voiding the system’s three-year limited warranty.
A base version – Edition I – comes with 500 Gbyte of capacity and retails for US$279. The twin drive Edition II costs $499 (about £270 at today’s exchange rates). </q>
online file storage / internet file storage Drive Headquarters: online storage service providers http://drivehq.com/
[fault-tolerant data store: online file storage / internet file storage] http://datadepositbox.com/
man mdadm mentions “… mirroring over a slow link” – does it already do “twin data backup” ?
This sounds like a good property for my fault-tolerant data store -- which underlying filesystems have this property?
“While no vessel can ever be regarded as unsinkable, it should be capable of absorbing a number of errors and misfortunes before there is danger of sinking.”
Statement of the Royal Institution of Naval Architects. Seatrader Review, December 1997, page 7 http://www.uscg.mil/hq/gm/moa/docs/fishco.htm
“A place to share about technology, business continuity, and crisis management.” http://stevewatson.net/
“Setting up with a computer with recovery and security in mind” http://www.stevewatson.net/index.php/2005/10/24/setting-up-with-a-computer-with-recovery-and-security-in-mind/
(would “anycast” be a good addition? Wikipedia:Anycast )
The High Availability Linux Project wiki http://wiki.linux-ha.org/HomePage Does this already do what I want (synchronize between a Linux box in Wichita and a Linux box in Tulsa), or does it only work for clusters connected to the same Ethernet network?
DAV: I imagined something like the http://synology.com/ products, except twin boxes …
Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
The intent is to scale Hadoop up to handling thousands of computers.
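The "executed or reexecuted on any node" idea can be illustrated with a toy word count. This is not the Hadoop API, just a sketch under stated assumptions: "nodes" are plain Python functions, a failed node is simulated by raising RuntimeError, and all names are made up for illustration.

```python
from collections import Counter


def run_with_retry(task, fragment, workers):
    """Run `task` on the first node that succeeds; re-execute on failure."""
    for worker in workers:
        try:
            return worker(task, fragment)
        except RuntimeError:
            continue  # node failed; the same fragment is re-run on the next node
    raise RuntimeError("every node failed")


def map_reduce(fragments, map_task, workers):
    """Map each fragment independently, then reduce by summing the counts."""
    partials = [run_with_retry(map_task, f, workers) for f in fragments]
    total = Counter()
    for partial in partials:
        total.update(partial)  # reduce step: add the per-fragment counts
    return total


def count_words(fragment):
    """Map task: count the words in one fragment of text."""
    return Counter(fragment.split())


def broken_node(task, fragment):
    """Simulated node that is always down."""
    raise RuntimeError("node is down")


def healthy_node(task, fragment):
    """Simulated node that runs the task normally."""
    return task(fragment)
```

Because each fragment is independent, losing a node only costs the re-execution of the fragments it was working on. That is the property a distributed wiki would want for its pages.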
(todo: play with this software) (todo: make some comments here on “distributed wiki”, and link to Software bazaar: distributed wiki)
a little java program which remotely accesses every page of kayakwiki, and creates a backup of all of the wiki text. http://www.kayakforum.com/cgi-sys/cgiwrap/guille/wiki.pl?Backup_Strategy
Syncing Versus Backups
Some people confuse syncing with backing up, which is not surprising because each employs similar techniques and strategies. Both syncing and backing up take the information you have stored in one place and copy it to another place -- http://www.oreillynet.com/pub/a/mac/2006/09/05/synching.html?page=3
Is there a better way to make sure this sort of thing never happens again?
Is there a better way to test the backup system before it is too late?
Writing a distributed filesystem in 24 hours http://wsobczuk.blogspot.com/2006/08/writing-distributed-filesystem-in-24.html
Amazon Simple Storage Service (Amazon S3)
Amazon S3 is storage for the Internet. It is designed to make web-scale computing easier for developers.
Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers.
http://blog.dreamhost.com/2006/08/25/ask-dreamhost-customers/ http://blog.dreamhost.com/2006/10/13/sales-are-slow/ <q> We’ve got a good system of RELIABILITY and PERFORMANCE already.. but the cost per usable GB is $10. The main problem is the 300GB Fiber Channel drives we use, which are $800 each. Is there anything out there that can do the same but with SATA drives that cost more like $100? Even if we needed twice or four times as many drives for the same performance and reliability, it seems possible!
There are also some REALLY WANT TO HAVE features, though possibly could be passed up if the top three are satisfied. </q>
Wikipedia:Zfs <q> ZFS, is a free, open-source file system produced by Sun Microsystems … Dynamic striping across all devices to maximize throughput means that as additional devices are added to the zpool, the stripe width automatically expands to include them, thus all disks in a pool are used, which balances the write load across them. … ZFS … presently only n+1 redundancy is possible. n+2 redundancy (RAID level 6) is only in the development branch – via the OpenSolaris distribution. … Matt Dillon started porting ZFS to DragonFly BSD … </q>
DAV: the list of over a dozen distributed version control systems listed at Wikipedia: List of revision control software#Distributed_model .
It makes no sense to me why we need a dozen different free and open source systems – why don’t people, you know, merge the best features of each into one really good system? You know, the way people merged the best features of each Linux distribution so now there are only 300 different distributions?
add this link to the “fault-tolerant file store” documentation … and email/comment at this link to point to SoftwareBazaar: distributed wiki. http://rambleon.org/2006/03/08/i-knew-this-was-coming/
Building a Self-Healing Network by Greg Retkowski 05/25/2006 http://www.onlamp.com/pub/a/onlamp/2006/05/25/self-healing-networks.html “Greg Retkowski is a network engineering consultant with over 10 years of experience in UNIX/Linux network environments.”
Perhaps I should pitch my fault-tolerant data store ideas at Greg?
“the nascent home NAS market” http://www.engadget.com/2006/09/18/hp-media-vault-nas-we-go-again/
DAV: perhaps “vault” is a nice marketing term?
google for “distributed server”
“a wiki file system” Wikipedia:Wikifs
“Hadoop is a collection of Free Java software previously developed by the Nutch project but now maintained by Lucene. The system includes a distributed filesystem reminiscent of GoogleFS named the “Hadoop Distributed File System” (or just DFS).” Wikipedia:Hadoop … “Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework.” http://lucene.apache.org/hadoop/about.html
How To Share State Among Multiple Computers http://c2.com/cgi/wiki?HowToShareStateAmongMultipleComputers
“Svk is a decentralized version control system. It uses the Subversion filesystem but provides additional, powerful features.” http://www.wikiindex.com/Svk_Wiki
keywords: “virtual desktop”
“The tape checkpoint is independent of machine configuration and can be restored and restarted on any system with sufficient disk space. In fact, this is the way KeyKOS is distributed.” … “Key Logic developed a prototype UNIX-compatible system implemented on top of KeyKOS. At UNIFORUM ‘90, we demonstrated this system by literally pulling the plug on the computer at random. Within 30 seconds of power restoration, the system had resumed processing, complete with all windows and state that had been on the display. We are aware of no other UNIX implementation with this feature today.” -- http://www.agorics.com/Library/KeyKos/checkpoint.html
“DynaOS - Dynamic Operating System: An OS that can boot from remote server.” by Niyaz PK
"The Next Few Decades of Computing" by Linas Vepstas 2000-2004 describes the “Eternity Service”. It makes sure that I can access “my” files from any computer in the world, even if my computer crashes. Accessing previous versions would also be nice. Also lists many other nice features such a service should have (and a few features that conflict with each other).
The distributed wiki looks like it might fulfill a big fraction (but not all) of these features.
"I Want a Pony: Snapshots of a Dream Productivity App" (2005) mentions “IMAP-like syncing - For me, the Hegelian truth between the “OS <whatever>” vs. “Web OS” rhubarb is that’s it’s both and neither. Personally, I want my important information stored on a secure server, but I want the data and its structure seamlessly sync-able to applications on the web, via wireless devices, and yeah, in my most important desktop apps. Why is there not “IMAP” for my address book and task outlines? Why the heck isn’t there a standard calendar format that lets me collaborate with colleagues and use whatever program I want? Why do I have to learn CVS to have smart versioning on plain text? I don’t know either. (insert image of Merlin angrily kicking his slippers at the television)”
Yes, I want this distributed wiki to do both syncing and easy-to-use versioning on plain text. I also am astonished that someone hasn’t already done it. – DavidCary
“Publius: Censorship Resistant Publishing System” http://www.cs.nyu.edu/~waldman/publius/
“Submitting documents to Eternity” http://cypherspace.org/adam/eternity/
“the Peer Distributed Transfer Protocol” http://pdtp.org/ … Ruby programming language …
“Real-time Mantra highlights the key issues in real-time system design. It also cover real-time design patterns and issues in complex software design.” http://eventhelix.com/RealtimeMantra/ links to Fault Handling and Fault Tolerance http://www.eventhelix.com/RealtimeMantra/FaultHandling/
Would Koopman be interested in my “twin backup” ideas? Dependable Embedded Systems Research:
DAV: can I hijack the IMAP protocol to produce something more like a fault-tolerant wiki? Mark Crispin’s Web Page http://staff.washington.edu/mrc/ “My main programming project is the IMAP toolkit, which is the internal engine used by the popular Pine mailer. The IMAP toolkit is the reference implementation of the Internet Message Access Protocol (IMAP), which I invented in 1985. IMAP, often called the best-kept secret in electronic messaging, is a protocol which enables an advanced distributed client/server electronic mail paradigm.” … “A good rule of thumb to keep in mind is that any field of study which has “science” in its name is not a science.”
(would this be a good person to discuss the “twin backup” idea with?) http://www.ludism.org/mentat/MarkSchnitzius
“OpenBinder is a complete open-source solution supporting new kinds of component-based system-level design. A variety of production-quality operating system services (not included with OpenBinder) have already been implemented using it, including a distributed user interface framework, display server and window manager, and media framework.” http://openbinder.org/
fault tolerance load balancing
alternative to distributed wiki? Or one way to implement it?
Keeping Your Life in Subversion by Joey Hess 01/06/2005 http://www.onlamp.com/pub/a/onlamp/2005/01/06/svn_homedir.html
another alternative to distributed wiki http://www.datadepositbox.com/partners/?pid=08422f400a7c200547f40838027632e9
(subversion and SVN and SVK) “SVK, the SVN without the .svn” http://weblogs.goshaky.com/weblogs/page/lars?entry=svk_the_svn_without_the “I switched to SVK, a distributed version control system built on top of SVN and written in Perl.”
“ATA over ethernet”
“Virtual Linux Server Project” Linux Virtual Server
“If it’s true that the web is becoming the world’s memory, regardless of its problems, then I feel that we, the creators and owners of this immense wealth of information, ought to have a free (free as in freedom) way of accessing it. A way that doesn’t depend on the health or ethical decisions of a particular company (Google, Microsoft, or whatever). A way that doesn’t require a rack of 400 servers (and growing) or immense amounts of bandwidth in order to index what’s there and process it.
Unfortunately, nobody has come up with a solution yet. However, any ideas are most definitely welcome.”
the comments respond
http://freesoftwaremagazine.com/free_issues/issue_08/intro_tla/index_p2.html (discusses TLA, apparently a distributed version control system – exactly what a distributed wiki needs) (includes a “type this” tutorial)
“If the Empire tears one [host] up, the data can move to another one. This is probably part of Distributed Consumerium, a later phase of the project.”
mention “Extremely Reliable Systems”
(Is this joey a wiki programmer? Is there any way I can persuade him to help me build a fault-tolerant wiki?) http://kitenet.net/~joey/blog/entry/ikiwiki_split.html
(does the “replicated database” mentioned here … would this analysis also apply to distributed file server? distributed fault-tolerant wiki?) Secure Property Titles with Owner Authority Copyright (c) 1998,1999,2002,2005 by Nick Szabo permission to redistribute without alteration hereby granted http://szabo.best.vwh.net/securetitle.html
(todo: check out “wackamole”, available on Ubuntu systems via synaptic.) “Wackamole is an application that helps with making a cluster highly available. … ”
(todo: check out “ucarp” available via synaptic.) “UCARP allows a pair of hosts to share common virtual IP addresses in order to provide automatic fail-over. … homepage: http://ucarp.org ”
wikifeatures : failsafe wiki
“I am considering making it possible to use some form of mirroring or distrubuted storage of wiki pages to inconvience censorship. I welcome suggestions and advice on the matter.” http://www.sacredelectron.org/why.shtml
Todo: Build a distributed wiki.
[electronic failure modes] “Re: How many hours will an average TV work?” by Madhu Siddalingaiah http://madsci.org/posts/archives/aug98/899004396.Eg.r.html discusses the various failure modes of typical television sets. The picture tube will eventually wear out, but typically it lasts (on average) at least 10 years. “The semiconductors, passive components, and connections can theoretically last forever.” but the article lists situations that can cause them to fail.
Would an account here at this non-profit help me set up my “fail-safe wiki” ? If so, set up that account. http://www.obscure.org/info/faq.shtml#why (I found this web site listed in the “mirror sites” section of the Physics FAQ http://www.xs4all.nl/~johanw/PhysFAQ/index.html )
fail safe wiki? http://wikidev.net/Talk:MediaWiki_hosting_and_development
“Why do so many chips fail?” article by Nicolas Mokhoff 2006-02-09 says “There is very little that can be done about chip failures -- other than to design chips correctly in the first place -- concluded a panel Wednesday (Feb. 8) at DesignCon 2006” http://eet.com/news/design/showArticle.jhtml?articleID=179102593
yet another single-point failure
“Although we have triple redundant pressure sensors and dual redundant pressurization valves, when this component shorted, it caused the valve controller board to reboot, effectively eliminating the redundancy.” http://www.spacefellowship.com/News/?p=1414
the difference between Failsafe vs. Redundant (also mentions “fail soft”: “Example: when your car computer cannot manage closed loop but allows the engine to continue in open loop rather than leaving you stranded.” … aka “limp mode”).
Notes from a conference talk about a distributed web service in python: http://wiki.sheep.art.pl/Notes_from_RuPy_2007
Interesting. Is this merely talk about what someone wants, or is there an actual implementation already? – DavidCary
It’s an actual, working, deployed implementation (see http://grono.net). They started small but now have tens of servers in use. They refused to give more technical details, being afraid of competition. – RadomirDopieralski
Maybe monotone.ca would fit better than CVS?
About the “plain text”: are you talking about markup that is optional and linking that is automatic?
Yes, the “Better SCM Initiative” ( http://better-scm.berlios.de/ ) mentions that Monotone is better than CVS. (But it also mentions 8 other version control systems that are also better than CVS). So, have you seen the Revision Control wiki ( http://revctrl.org/ ), which appears to be a better place for comparisons?
Perhaps some of the file synchronization utilities mentioned on the SynchronizeTip page would be helpful.
“Unison http://www.cis.upenn.edu/~bcpierce/unison/ is a file-synchronization tool for Unix and Windows. It allows two replicas of a collection of files and directories to be stored on different hosts (or different disks on the same host), modified separately, and then brought up to date by propagating the changes in each replica to the other.”
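The "modified separately, then brought up to date" idea can be sketched as a three-way comparison against the state both replicas shared after the previous sync. This is not Unison's actual algorithm, only a minimal sketch under stated assumptions: each replica is a dict mapping page name to text, and the function name sync is made up for illustration.

```python
def sync(a, b, base):
    """Two-way sync of replicas `a` and `b` (dicts: page name -> text).

    `base` is the state both replicas shared after the previous sync.
    Non-conflicting changes are propagated in both directions; pages
    changed differently on both sides are left untouched and reported.
    """
    conflicts = []
    for name in sorted(set(a) | set(b) | set(base)):
        old, in_a, in_b = base.get(name), a.get(name), b.get(name)
        if in_a == in_b:
            new = in_a              # already identical
        elif in_a == old:
            new = in_b              # only replica b changed
        elif in_b == old:
            new = in_a              # only replica a changed
        else:
            conflicts.append(name)  # both changed: leave for a human
            continue
        for replica in (a, b, base):
            if new is None:
                replica.pop(name, None)  # a deletion propagates too
            else:
                replica[name] = new
    return conflicts
```

Reporting conflicts instead of picking a winner is what keeps the sync from clobbering changes made while one replica was offline. A wiki could present the conflicting page to the next editor, the way it already handles edit conflicts.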
the Circle is an open source, scalable, decentralized, peer to peer application. “The Circle is written in Python. It runs on Linux and Windows.” The Circle is based on the Chord lookup protocol. “no single point of failure.” “As long as there is one Circle peer running, anywhere in the world, there’s still a network.” infoAnarchy wiki: The Circle.
which architecture is best?
The “three level” system seems more complicated, but it helps partition our special “distributed wiki” software into the middle – it uses standard web browsers on one side, and some standard data storage system on the back end. The “back end” nodes might be
Joel gives some very good advice in this article. However, I hope things are not as bad as this particular quote implies. (Or perhaps what Spolsky thinks of as “really high uptime” is much higher and far more difficult than my hoped-for “pretty high uptime”). For a distributed wiki, I tried to think up every possible thing that could possibly go wrong, and I only came up with 4: “One or more things wrong in Wichita”, “One or more things wrong in Tulsa”, “Something wrong elsewhere”, and “Simultaneous failures in multiple geographic locations”. That seems to cover all possible things that could possibly go wrong. Or is there something else I missed?
For the first 2 types of failures, I don’t wait for a human to switch out failed parts – I can let the software running in areas that haven’t failed (yet) automatically bypass the city with the failure, then continue to run (“limp mode”) until a human figures out what went wrong and fixes it, and then the software discovers that city is now working fine and starts using it again.
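The bypass-and-resume behaviour described above can be sketched roughly as follows. Assumptions: the class names Site and Router are made up for illustration, the `up` flag stands in for a real health check, and writes that a failed city missed are simply queued and replayed when it comes back.

```python
class Site:
    """One city's server; `up` stands in for a real health check."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.pages = {}


class Router:
    """Sends each write to every working site and reads from any of them."""

    def __init__(self, sites):
        self.sites = sites
        self.backlog = {site.name: {} for site in sites}  # writes a site missed

    def write(self, page, text):
        healthy = [s for s in self.sites if s.up]
        if not healthy:
            raise RuntimeError("simultaneous failure in every location")
        for site in self.sites:
            if site.up:
                site.pages[page] = text
            else:
                self.backlog[site.name][page] = text  # replay after recovery
        return len(healthy) < len(self.sites)  # True means "limp mode"

    def read(self, page):
        for site in self.sites:
            if site.up and page in site.pages:
                return site.pages[page]
        raise KeyError(page)

    def recover(self, site):
        """Called once a health check finds the site working again."""
        site.up = True
        site.pages.update(self.backlog[site.name])
        self.backlog[site.name].clear()
```

Note that the Router itself is a single point of failure in this sketch. A real design would need at least one router per city, which brings back the consensus questions raised elsewhere on this page.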
Tudor Marian, Ken Birman, and Robbert van Renesse. “A Scalable Services Architecture”. Proceedings of the IEEE Symposium on Reliable Distributed Systems (SRDS 2006), October, 2006. Available online: http://www.truststc.org/pubs/163.html ; http://www.cs.cornell.edu/projects/quicksilver/public_pdfs/marian-Scalable-Services-srds06.pdf ; and mirrored elsewhere.
Is this similar to the “cloud computing” described at http://www.cloudcomputing.gopc.net/ ?
"The Tiger Video Fileserver" by William J. Bolosky, Joseph S. Barrera, III, Richard P. Draves, Robert P. Fitzgerald, Garth A. Gibson, Michael B. Jones, Steven P. Levi, Nathan P. Myhrvold, Richard F. Rashid. “Tiger is a distributed, fault-tolerant real-time fileserver.”
"A Conversation with Jim Gray" interview by Dave Patterson has some interesting ideas on data storage and how to move bits from one place to another.
"Direct-to-NFS" by Michael Alyn Miller discusses the implementation of “Target Revocable E-Mail”, involving clustering with live fail-over, plenty of redundancy, fault-tolerance, … the same qualities I want the distributed wiki to have.
“Offline access to a wiki” http://www.pycs.net/users/0000177/2004/07/29.html#P279 is a step towards a distributed wiki …
“all supported configurations of the 7410 have multiple paths to all JBODs across multiple HBAs. So even without NSPF, we have the ability to survive HBA, cable, and JBOD controller failure.” -- Eric Schrock's Weblog
"So how much work was it?" has some comments about “Enthusiasts have been doing homebrew NAS” and comparing it to “a complete, polished solution that stands up under the stress of an enterprise environment”.
"The Foundation for P2P Alternatives" wiki – can they help me build a high-reliability wiki?
the Fedora Directory Server has “multi-master replication … gives administrators the ability to provide no single point of read failure and no single point of write failure, so the directory is always available. … automatically recovers at startup in the event of power failure or other outage. …” the Fedora Directory Server wiki: FAQ
So can I make a high-availability wiki out of it?
Aegis is a transaction-based software configuration management system. It provides a framework within which a team of developers may work on many changes to a program independently, and Aegis coordinates integrating these changes back into the master source of the program, with as little disruption as possible.
Is this useful for building a high availability wiki?
"The Risks of Distributed Version Control" by Ben Collins-Sussman. Do any of these risks apply to high-availability wiki, and if so, what can/should we do to mitigate those risks?
In certain conditions it is impossible to guarantee that distributed processes will come to consensus. See the “FLP impossibility proof” at Wikipedia: consensus (computer science). Do those conditions apply to a high-availability wiki? If we cannot guarantee consensus, can we design it so that when some parts fail (a) the remaining functional parts are likely to come to a correct consensus, and (b) any problems caused by a lack of consistency are small and easy to fix?
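One simple approach to (a) is to accept a page version only when a strict majority of all nodes report the same thing. This does not get around the FLP result; it merely refuses to answer when no majority exists. A minimal sketch, assuming each responding node reports a (version, text) pair; the function name is made up for illustration.

```python
from collections import Counter


def majority_read(replies, total_nodes):
    """Return the value reported by a strict majority of all nodes, or None.

    `replies` holds one (version, text) answer per node that responded;
    failed nodes are simply absent from the list.
    """
    if not replies:
        return None
    value, votes = Counter(replies).most_common(1)[0]
    return value if votes > total_nodes // 2 else None
```

When the result is None, the wiki could fall back to showing the newest version it can find, marked as unconfirmed. That keeps any inconsistency visible and easy to fix later, which is point (b).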
Coda Hale. "You Can't Sacrifice Partition Tolerance". 2010.
Is this something that can be used by a fault-tolerant wiki?
“Rackspace is open-sourcing the specs for its Cloud Servers and Cloud Files APIs under the Creative Commons 3.0 Attribution license, enabling third-party developers to copy, implement and rehash them as they see fit.” – http://www.techcrunch.com/2009/07/23/rackspace-opens-the-cloud/
“… the … sort of fault-tolerant distributed system that makes Gmail trustworthy today. … Whenever ubiquitous reliable non-local data storage arrives, it won’t be a minute too soon. The world now buys hundreds of millions of hard drives every year, so I bet a lot more than a million drives now die every year.” -- "Won't someone PLEASE think of the hard drives?!" by Daniel Rutter 2009
“Distributed Storage Failure Recovery” describes “the distributed storage project” -- how can the “distributed wiki project” work with them?
This is an excellent page on a very important issue which is a prerequisite for democracy, true. Well done.