I am a PhD student in geophysics and work with large amounts of image data (hundreds of GB, tens of thousands of files). I know SVN and Git fairly well and have come to value having a project history, combined with the ability to work together easily and protection against disk corruption. I also find Git extremely helpful for keeping consistent backups, but I know that Git cannot handle large amounts of binary data efficiently.
In my master's studies I worked on data sets of similar size (also images) and had a lot of problems keeping track of different versions on different servers/devices. Diffing 100 GB over the network really isn't fun, and it cost me a lot of time and effort.
I know that others in science have similar problems, yet I couldn't find a good solution.
I want to use the storage facilities of my institute, so I need something that can use a "dumb" server. I would also like to have an additional backup on a portable hard disk, because I want to avoid transferring hundreds of GB over the network wherever possible. So I need a tool that can handle more than one remote location.
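For illustration, plain Git already supports this kind of multi-remote layout: a bare repository reachable over plain SSH counts as a "dumb" remote, and a path on a mounted disk works the same way. The remote names and paths below are made up, and this only shows the remote topology, not the large-file handling that is the actual problem:

```sh
# Hypothetical remote names and paths, just to sketch the multi-remote setup.
# A bare repo on the institute's file server, reached over plain SSH:
git remote add institute ssh://user@storage.example.edu/data/thesis.git

# A second remote on a portable hard disk mounted locally:
git remote add usbdisk /media/usbdisk/thesis.git

# Push the same history to both locations:
git push institute master
git push usbdisk master
```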
Lastly, I really need something that other researchers can use. It does not need to be super simple, but it should be learnable in a few hours.
I've evaluated a lot of different solutions, but none seems to fit the bill:
- svn is somewhat inefficient and needs a smart server
- hg bigfile/largefile can only use one remote
- git bigfile/media can also only use one remote, and is not very efficient either
- attic doesn't seem to have a log or diffing capabilities
- bup looks really good, but needs a "smart" server to work
I've tried git-annex, which does everything I need it to do (and much, much more), but it is very difficult to use and not well documented. I've used it for several days and couldn't get my head around it, so I doubt any coworker would be interested.
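For reference, the basic git-annex workflow I attempted looks roughly like this. The repository names, paths, and file names are placeholders, and this is a sketch of the idea rather than a complete setup:

```sh
# Minimal git-annex workflow sketch; all paths/names here are hypothetical.
git init thesis && cd thesis
git annex init "laptop"

# Large files go into the annex instead of regular Git history:
git annex add images/survey_001.tif
git commit -m "add survey image"

# A second clone (e.g. on a portable disk) becomes another location
# that git-annex tracks; content can be copied there explicitly:
git remote add usbdisk /media/usbdisk/thesis
git annex sync
git annex copy --to=usbdisk images/survey_001.tif

# Later, retrieve the content from whichever remote has it:
git annex get images/survey_001.tif
```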
How do researchers deal with large datasets, and what are other research groups using?
To be clear, I am primarily interested in how other researchers deal with this situation, not just this specific dataset. It seems to me almost everyone should have this problem, yet I don't know anyone who has solved it. Should I just keep a backup of the original data and forget all this version control stuff? Is that what everyone else is doing?