Chris Dew

writes about software development

Holographic Views of Data

An algorithm for creating unlimited, redundant, orthogonal views of data – and regenerating original data from any available views (whose aggregate size exceed that of the original data).

Update

Thanks to everyone for your comments by email – chris@chrisdew.com

I can summarise: I am completely out of my depth here – there is extensive work already done on this area:

I had not encountered any obvious explicit uses of this type of work, though it may be in use at lower levels of communication protocols.

Where is my HyperZFS filesystem? (Where my data is accessible so long as I have working disks whose aggregate size exceeds the size of my data.)

TL;DR;

I’ve just invented a fountain code. As far as I can tell, it’s not the LT Code, the Raptor Code or an Online Code. (I don’t use XOR or polynomials, which LT and Raptor do, and it’s a single phase algorithm, which Online Codes aren’t).

I don’t know whether what I’ve invented is faster than the existing fountain codes, at ~100MB/s/core on an i5. The single core en/decoding rate is similar to the rate at which you can read from spinning rust. You need several cores to saturate an SSD.

If you have any interest in this work, please contact me: chris@chrisdew.com.

Transcript

Hi, I’d like to ask whether an algorithm that I’ve developed, is new or useful.

I’m a Software Developer by trade but my background is in physics, not computer science.

Therefore I may have just reinvented the wheel here and have been too ignorant to see it.

The algorithm is inspired by holography, where every part of a hologram contains some of the information about all of the scene – I’ve named the binary ‘holo’.

Anyway, on to the demonstration…

First I’ll need some chunky data, to demonstrate that the algorithm isn’t stupidly slow.

For this I’ll use an Ubuntu iso image, though obviously the algorithm works on any type of binary data.

I’ll give the iso image a shorter name to save typing.

chris@tux.io:~/screencast$ cp ~/Downloads/ubuntu-12.04.2-desktop-amd64.iso iso
chris@tux.io:~/screencast$ ll
total 711688
drwxrwxr-x  2 chris chris      4096 Aug 15 21:36 ./
drwxr-xr-x 42 chris chris      4096 Aug 15 21:09 ../
-rw-r--r--  1 chris chris 728760320 Aug 15 21:36 iso

700MB of test data.

Let’s get a checksum of the file, so that we can guard against any file corruption.

chris@tux.io:~/screencast$ sha1sum iso
f94d8591dad3043634a35e23884306f16b6b5fc3  iso
chris@tux.io:~/screencast$ time holo --generate --size=1/4 --views=8 iso

I’ve asked for eight views, each a quarter the size of the original, but you could ask for any amount of any size.

This will take a moment, as we’re dealing with a few hundred megs of data.

generating views from iso...

real    0m12.594s
user    0m11.749s
sys     0m0.660s

That’s finished.

If I list the directory we can see the original iso, and the eight generated views.

chris@tux.io:~/screencast$ ll
total 2146316
drwxrwxr-x  2 chris chris      4096 Aug 15 21:37 ./
drwxr-xr-x 42 chris chris      4096 Aug 15 21:09 ../
-rw-r--r--  1 chris chris 728760320 Aug 15 21:36 iso
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view0
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view1
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view2
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view3
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view4
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view5
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view6
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view7

Let’s do some maths, to check the sizes.

chris@tux.io:~/screencast$ ghci
GHCi, version 7.4.1: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Prelude> 183624680 * 4.0 / 728760320
1.0078741938090152
Prelude>
Leaving GHCi.

We can see that four pieces together are less than one percent larger than the original, so this is pretty space efficient.

Let’s delete the original file.

chris@tux.io:~/screencast$ rm iso
chris@tux.io:~/screencast$ ll
total 1434632
drwxrwxr-x  2 chris chris      4096 Aug 15 21:38 ./
drwxr-xr-x 42 chris chris      4096 Aug 15 21:09 ../
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view0
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view1
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view2
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view3
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view4
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view5
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view6
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view7

Now for some magic. From any four pieces we can recreate the original file.

chris@tux.io:~/screencast$ time holo --regenerate iso.view0 iso.view7 iso.view3 iso.view5
regenerating iso from iso.view0 iso.view7 iso.view3 iso.view5...

The requirements for recreating the original are just that the total size of the views used are larger than the size of the original by at least one bit.

There is nothing in the algorithm preventing the parts from being of different sizes, it was just easier for this implementation to make them all the same.

The algorithm itself is embarrassingly parallel, but my implementation is unoptimised code running single threaded on a CPU.

A GPU implementation may be much faster.

real    0m8.631s
user    0m8.157s
sys     0m0.460s

That’s finished, lets check it’s worked.

chris@tux.io:~/screencast$ ll
total 2146312
drwxrwxr-x  2 chris chris      4096 Aug 15 21:38 ./
drwxr-xr-x 42 chris chris      4096 Aug 15 21:09 ../
-rw-rw-r--  1 chris chris 728760320 Aug 15 21:38 iso
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view0
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view1
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view2
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view3
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view4
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view5
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view6
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view7

And its checksum.

chris@tux.io:~/screencast$ sha1sum iso
f94d8591dad3043634a35e23884306f16b6b5fc3  iso

Let’s repeat that, choosing a different set of parts… any selection of four different parts will work.

chris@tux.io:~/screencast$ rm iso
chris@tux.io:~/screencast$ ll
total 1434632
drwxrwxr-x  2 chris chris      4096 Aug 15 21:39 ./
drwxr-xr-x 42 chris chris      4096 Aug 15 21:09 ../
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view0
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view1
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view2
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view3
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view4
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view5
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view6
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view7
chris@tux.io:~/screencast$ time holo --regenerate iso.view1 iso.view2 iso.view3 iso.view6
regenerating iso from iso.view1 iso.view2 iso.view3 iso.view6...

real    0m8.743s
user    0m8.165s
sys     0m0.536s
chris@tux.io:~/screencast$ ll
total 2146312
drwxrwxr-x  2 chris chris      4096 Aug 15 21:39 ./
drwxr-xr-x 42 chris chris      4096 Aug 15 21:09 ../
-rw-rw-r--  1 chris chris 728760320 Aug 15 21:39 iso
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view0
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view1
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view2
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view3
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view4
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view5
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view6
-rw-rw-r--  1 chris chris 183624680 Aug 15 21:37 iso.view7
chris@tux.io:~/screencast$ sha1sum iso
f94d8591dad3043634a35e23884306f16b6b5fc3  iso

The same original file, created from a different set of views.

I’ve tried to think of some practical uses for this, but I’d love to hear your ideas, on Hacker News, or Reddit.

Thanks for taking the time to watch this. Please let me know if this algorithm already exists and if so, a wikipedia link would be great.