s3git: git for Cloud Storage (or Version Control for Data)

s3git applies the git philosophy to Cloud Storage. If you know git, you will know how to use s3git!

s3git is a simple CLI tool that allows you to create a distributed, decentralized and versioned repository. It scales limitlessly to 100s of millions of files and PBs of storage and stores your data safely in S3. Yet huge repos can be cloned on the SSD of your laptop for making local changes, committing and pushing back.

Exactly like git, s3git does not require any server-side components: just download and run the executable. It is built on the golang package s3git-go, which can be used from other applications as well. Or see the Python module or Ruby gem.

Use cases for s3git

  • Build and Release Management (see example with all Kubernetes releases).
  • DevOps Scenarios
  • Data Consolidation
  • Analytics
  • Photo and Video storage

See use cases for a detailed description of each of these.

Download binaries

DISCLAIMER: These are PRE-RELEASE binaries -- use at your own peril for now

OSX

Download s3git from https://github.com/s3git/s3git/releases/download/v0.9.2/s3git-darwin-amd64

$ mkdir s3git && cd s3git
$ wget -q -O s3git https://github.com/s3git/s3git/releases/download/v0.9.2/s3git-darwin-amd64
$ chmod +x s3git
$ export PATH=$PATH:${PWD}   # Add current dir where s3git has been downloaded to
$ s3git

Linux

Download s3git from https://github.com/s3git/s3git/releases/download/v0.9.2/s3git-linux-amd64

$ mkdir s3git && cd s3git
$ wget -q -O s3git https://github.com/s3git/s3git/releases/download/v0.9.2/s3git-linux-amd64
$ chmod +x s3git
$ export PATH=$PATH:${PWD}   # Add current dir where s3git has been downloaded to
$ s3git

Windows

Download s3git.exe from https://github.com/s3git/s3git/releases/download/v0.9.1/s3git.exe

C:\Users\Username\Downloads> s3git.exe

Building from source

Build instructions are as follows (see install golang for setting up a working golang environment):

$ go get -d github.com/s3git/s3git
$ cd $GOPATH/src/github.com/s3git/s3git 
$ go install
$ s3git

BLAKE2 Tree Hashing and Storage Format

Read here how s3git uses the BLAKE2 Tree hashing mode for both deduplicated and hydrated storage (and here for info for BLAKE2 at scale).
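
The keys that s3git prints for every object (see the examples below) are 512-bit BLAKE2 hashes, written as 128 hexadecimal characters. As a rough illustration of the key format only, assuming the b2sum tool from GNU coreutils is installed, you can compute a plain sequential BLAKE2b-512 digest of the same content; note that this is not the tree mode s3git actually uses, so the value will differ from the key s3git assigns:

$ echo "hello s3git" | b2sum   # prints a 128-hex-character digest (sequential mode, so it differs from s3git's key)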

Example workflow

Here is a simple workflow to create a new repository and populate it with some data:

$ mkdir s3git-repo && cd s3git-repo
$ s3git init
Initialized empty s3git repository in ...
$ # Just stream in some text
$ echo "hello s3git" | s3git add
Added: 18e622875a89cede0d7019b2c8afecf8928c21eac18ec51e38a8e6b829b82c3ef306dec34227929fa77b1c7c329b3d4e50ed9e72dc4dc885be0932d3f28d7053
$ # Add some more files
$ s3git add "*.mp4"
$ # Commit and log
$ s3git commit -m "My first commit"
$ s3git log --pretty

Push to cloud storage

$ # Add remote back end and push to it
$ s3git remote add "primary" -r s3://s3git-playground -a "AKIAJYNT4FCBFWDQPERQ" -s "OVcWH7ZREUGhZJJAqMq4GVaKDKGW6XyKl80qYvkW"
$ s3git push
$ # Read back content
$ s3git cat 18e6
hello s3git

Note: Do not store any important info in the s3git-playground bucket. It will be auto-deleted within 24-hours.

Directory versioning

You can also use s3git for directory versioning. This allows you to 'capture' changes coherently for a whole directory tree and subsequently go back to previous versions of the full state of the directory (and not just of individual files). Think of it as a Time Machine for directories instead of individual files.

So instead of 'saving' a directory by making a full copy into 'MyFolder-v2' (and 'MyFolder-v3', etc.), you capture the state of the directory and give the version a meaningful message ("Changed color to red"), so it is always easy to go back to the version you are looking for.

In addition you can discard any uncommitted changes and go back to the last version that you captured, which basically means you can (after committing) mess around in a directory and rest assured that you can always go back to its original state.

If you push your repository to the cloud, you get an automatic backup, and you can also easily collaborate with other people.

Lastly, it of course works with huge binary data too, not just with small text files like those in the following 'demo' example:

$ mkdir dir-versioning && cd dir-versioning
$ s3git init .
$ # Just create a single file
$ echo "First line" > text.txt && ls -l
-rw-rw-r-- 1 ec2-user ec2-user 11 May 25 09:06 text.txt
$ #
$ # Create initial snapshot
$ s3git snapshot create -m "Initial snapshot" .
$ # Add new line to initial file and create another file
$ echo "Second line" >> text.txt && echo "Another file" > text2.txt && ls -l
-rw-rw-r-- 1 ec2-user ec2-user 23 May 25 09:08 text.txt
-rw-rw-r-- 1 ec2-user ec2-user 13 May 25 09:08 text2.txt
$ s3git snapshot status .
     New: /home/ec2-user/dir-versioning/text2.txt
Modified: /home/ec2-user/dir-versioning/text.txt
$ #
$ # Create second snapshot
$ s3git snapshot create -m "Second snapshot" .
$ s3git log --pretty
3a4c3466264904fed3d52a1744fb1865b21beae1a79e374660aa231e889de41191009afb4795b61fdba9c156 Second snapshot
77a8e169853a7480c9a738c293478c9923532f56fcd02e3276142a1a29ac7f0006b5dff65d5ca245255f09fa Initial snapshot
$ more text.txt
First line
Second line
$ more text2.txt
Another file
$ #
$ # Go back one version in time
$ s3git snapshot checkout . HEAD^
$ more text.txt
First line
$ more text2.txt
text2.txt: No such file or directory
$ #
$ # Switch back to latest revision
$ s3git snapshot checkout .
$ more text2.txt
Another file

Note that snapshotting works for all files in the directory, including any subdirectories, as the sketch below shows. Click the following link for a more elaborate repository that includes all releases of the Kubernetes project.
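
For instance, continuing the dir-versioning session above, a file created inside a subdirectory is captured by the next snapshot exactly like a top-level file (a minimal sketch reusing only the commands already shown; the file and directory names are made up for illustration):

$ mkdir subdir && echo "nested file" > subdir/nested.txt
$ s3git snapshot status .                          # the new file under subdir is reported as New
$ s3git snapshot create -m "Add a nested file" .
$ s3git snapshot checkout . HEAD^                  # subdir/nested.txt disappears again
$ s3git snapshot checkout .                        # and comes back at the latest snapshot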

Clone the YFCC100M dataset

Clone a large repo with 100 million files totaling 11.5 TB in size (Multimedia Commons), yet requiring only 7 GB of local disk space.

(Note that this takes about 7 minutes on an SSD-equipped MacBook Pro with a 500 Mbit/s download connection, so on less powerful hardware you may want to skip to the next section (or, if you are short on the 7 GB of local disk space, try a df -h . first). Then again, it is quite a few files...)

$ s3git clone s3://s3git-100m -a "AKIAI26TSIF6JIMMDSPQ" -s "5NvshAhI0KMz5Gbqkp7WNqXYlnjBjkf9IaJD75x7"
Cloning into ...
Done. Totaling 97,974,749 objects.
$ cd s3git-100m
$ # List all files starting with '123456'
$ s3git ls 123456
12345649755b9f489df2470838a76c9df1d4ee85e864b15cf328441bd12fdfc23d5b95f8abffb9406f4cdf05306b082d3773f0f05090766272e2e8c8b8df5997
123456629a711c83c28dc63f0bc77ca597c695a19e498334a68e4236db18df84a2cdd964180ab2fcf04cbacd0f26eb345e09e6f9c6957a8fb069d558cadf287e
123456675eaecb4a2984f2849d3b8c53e55dd76102a2093cbca3e61668a3dd4e8f148a32c41235ab01e70003d4262ead484d9158803a1f8d74e6acad37a7a296
123456e6c21c054744742d482960353f586e16d33384f7c42373b908f7a7bd08b18768d429e01a0070fadc2c037ef83eef27453fc96d1625e704dd62931be2d1
$ s3git cat cafebad > olympic.jpg
$ # List and count total nr of files
$ s3git ls | wc -l
97974749

Fork that repo

Below is an example of alice and bob working together on a repository.

$ mkdir alice && cd alice
alice $ s3git clone s3://s3git-spoon-knife -a "AKIAJYNT4FCBFWDQPERQ" -s "OVcWH7ZREUGhZJJAqMq4GVaKDKGW6XyKl80qYvkW"
Cloning into .../alice/s3git-spoon-knife
Done. Totaling 0 objects.
alice $ cd s3git-spoon-knife
alice $ # add a file filled with zeros
alice $ dd if=/dev/zero count=1 | s3git add
Added: 3ad6df690177a56092cb1ac7e9690dcabcac23cf10fee594030c7075ccd9c5e38adbaf58103cf573b156d114452b94aa79b980d9413331e22a8c95aa6fb60f4e
alice $ # add 9 more files (with random content)
alice $ for n in {1..9}; do dd if=/dev/urandom count=1 | s3git add; done
alice $ # commit
alice $ s3git commit -m "Commit from alice"
alice $ # and push
alice $ s3git push

Clone it again as bob on a different computer/different directory/different universe:

$ mkdir bob && cd bob
bob $ s3git clone s3://s3git-spoon-knife -a "AKIAJYNT4FCBFWDQPERQ" -s "OVcWH7ZREUGhZJJAqMq4GVaKDKGW6XyKl80qYvkW"
Cloning into .../bob/s3git-spoon-knife
Done. Totaling 10 objects.
bob $ cd s3git-spoon-knife
bob $ # Check if we can access our empty file
bob $ s3git cat 3ad6 | hexdump
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
*
00000200
bob $ # add another 10 files
bob $ for n in {1..10}; do dd if=/dev/urandom count=1 | s3git add; done
bob $ # commit
bob $ s3git commit -m "Commit from bob"
bob $ # and push back
bob $ s3git push

Switch back to alice again to pull the new content:

alice $ s3git pull
Done. Totaling 20 objects.
alice $ s3git log --pretty
3f67a4789e2a820546745c6fa40307aa490b7167f7de770f118900a28e6afe8d3c3ec8d170a19977cf415d6b6c5acb78d7595c825b39f7c8b20b471a84cfbee0 Commit from bob
a48cf36af2211e350ec2b05c98e9e3e63439acd1e9e01a8cb2b46e0e0d65f1625239bd1f89ab33771c485f3e6f1d67f119566523a1034e06adc89408a74c4bb3 Commit from alice

Note: Do not store any important info in the s3git-spoon-knife bucket. It will be auto-deleted within 24-hours.

Here is a nice screen recording:

asciicast

Happy forking!

You may be wondering about concurrent behaviour from multiple clients.

Integration with Minio

Instead of S3 you can happily use the Minio server, for example the public server at https://play.minio.io:9000. Just make sure you have a bucket created using mc (example below uses s3git-test):
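
If the bucket does not exist yet, it can be created with mc first. This is a minimal sketch that assumes the preconfigured play alias that mc ships with for the public playground server (adjust the alias if your setup differs):

$ mc mb play/s3git-test

With the bucket in place, initialize a repository, point it at the bucket and push: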

$ mkdir minio-test && cd minio-test
$ s3git init 
$ s3git remote add "primary" -r s3://s3git-test -a "Q3AM3UQ867SPQQA43P2F" -s "zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG" -e "https://play.minio.io:9000"
$ echo "hello minio" | s3git add
Added: c7bb516db796df8dcc824aec05db911031ab3ac1e5ff847838065eeeb52d4410b4d57f8df2e55d14af0b7b1d28362de1176cd51892d7cbcaaefb2cd3f616342f
$ s3git commit -m "Commit for minio test"
$ s3git push
Pushing 1 / 1 [==============================================================================================================================] 100.00 % 0

and clone it

$ s3git clone s3://s3git-test -a "Q3AM3UQ867SPQQA43P2F" -s "zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG" -e "https://play.minio.io:9000"
Cloning into .../s3git-test
Done. Totaling 1 object.
$ cd s3git-test/
$ s3git ls
c7bb516db796df8dcc824aec05db911031ab3ac1e5ff847838065eeeb52d4410b4d57f8df2e55d14af0b7b1d28362de1176cd51892d7cbcaaefb2cd3f616342f
$ s3git cat c7bb
hello minio
$ s3git log --pretty
6eb708ec7dfd75d9d6a063e2febf16bab3c7a163e203fc677c8a9178889bac012d6b3fcda56b1eb160b1be7fa56eb08985422ed879f220d42a0e6ec80c5735ea Commit for minio test

Contributions

Contributions are welcome! Please see CONTRIBUTING.md.

Key features

Easy: Use a workflow and syntax that you already know and love

Fast: Lightning fast operation, especially on large files and huge repositories

Infinite scalability: Stop worrying about maximum repository sizes and have the ability to grow indefinitely

Work from local SSD: Make a huge cloud disk appear like a local drive

Instant sync: Push local changes and pull down instantly on other clones

Versioning: Keep previous versions safe and have the ability to undo or go back in time

Forking: Ability to make many variants by forking

Verifiable: Be sure that you have everything and be tamper-proof (“data has not been messed with”)

Deduplication: Do not store the same data twice

Simplicity: Simple by design and provide one way to accomplish tasks

Command Line Help

$ s3git help
s3git applies the git philosophy to Cloud Storage. If you know git, you will know how to use s3git.

s3git is a simple CLI tool that allows you to create a distributed, decentralized and versioned repository.
It scales limitlessly to 100s of millions of files and PBs of storage and stores your data safely in S3.
Yet huge repos can be cloned on the SSD of your laptop for making local changes, committing and pushing back.

Usage:
  s3git [command]

Available Commands:
  add         Add stream or file(s) to the repository
  cat         Read a file from the repository
  clone       Clone a repository into a new directory
  commit      Commit the changes in the repository
  init        Create an empty repository
  log         Show commit log
  ls          List files in the repository
  pull        Update local repository
  push        Update remote repositories
  remote      Manage remote repositories
  snapshot    Manage snapshots
  status      Show changes in repository

Flags:
  -h, --help[=false]: help for s3git

Use "s3git [command] --help" for more information about a command.

FAQ

Q: Is s3git compatible with git at the binary level?
A: No. git is optimized for text content, with very nice and powerful diffing and compressed storage, whereas s3git is more focused on large repos with primarily non-text blobs, backed by cloud storage like S3.
Q: Do you support encryption?
A: No. However it is trivial to encrypt data before streaming it into s3git add, e.g. by piping it through openssl enc or similar (see the sketch after this FAQ).
Q: Do you support zipping?
A: No. Again it is trivial to zip the data before streaming it into s3git add, e.g. by piping it through zip -r - . or similar (see the same sketch below).
Q: Why don't you provide a FUSE interface?
A: Supporting FUSE would mean introducing a lot of complexity related to POSIX, which we would rather avoid.
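
As a rough sketch of the last two answers (the cipher, passphrase handling and file names here are purely illustrative; adapt them to your own setup), compression and encryption can simply be applied in the pipeline before the data reaches s3git add, and reversed when reading it back:

$ # zip the current directory, encrypt the archive, and stream it into the repository
$ zip -r - . | openssl enc -aes-256-cbc -salt -pass pass:my-secret | s3git add
$ # to restore: read the blob back by (a prefix of) its key and reverse the pipeline
$ s3git cat <key> | openssl enc -d -aes-256-cbc -pass pass:my-secret > archive.zip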

Download Details:

Author: s3git
Source Code: https://github.com/s3git/s3git 
License: Apache-2.0 license

#go #golang #git #decentralized 
