SeaweedFS is an independent Apache-licensed open source project with its ongoing development made possible entirely thanks to the support of these awesome backers. If you'd like to grow SeaweedFS even stronger, please consider joining our sponsors on Patreon.
docker run -p 8333:8333 chrislusf/seaweedfs server -s3
weed
or weed.exe
weed server -dir=/some/data/dir -s3
to start one master, one volume server, one filer, and one S3 gateway. Also, to increase capacity, just add more volume servers by running weed volume -dir="/some/data/dir2" -mserver="<master_host>:9333" -port=8081 locally, on a different machine, or on thousands of machines. That is it!
SeaweedFS is a simple and highly scalable distributed file system. There are two objectives:
SeaweedFS started as an Object Store to handle small files efficiently. Instead of managing all file metadata in a central master, the central master only manages volumes on volume servers, and these volume servers manage files and their metadata. This relieves concurrency pressure from the central master and spreads file metadata into volume servers, allowing faster file access (O(1), usually just one disk read operation).
There are only 40 bytes of disk storage overhead for each file's metadata. It is so simple with O(1) disk reads that you are welcome to challenge the performance with your actual use cases.
SeaweedFS started by implementing Facebook's Haystack design paper. Also, SeaweedFS implements erasure coding with ideas from f4: Facebook's Warm BLOB Storage System, and has a lot of similarities with Facebook's Tectonic Filesystem.
On top of the object store, the optional Filer can support directories and POSIX attributes. Filer is a separate linearly-scalable stateless server with customizable metadata stores, e.g., MySql, Postgres, Redis, Cassandra, HBase, Mongodb, Elastic Search, LevelDB, RocksDB, Sqlite, MemSql, TiDB, Etcd, CockroachDB, YDB, etc.
For any distributed key-value store, large values can be offloaded to SeaweedFS. With its fast access speed and linearly scalable capacity, SeaweedFS can work as a distributed Key-Large-Value store.
SeaweedFS can transparently integrate with the cloud. With hot data on local cluster, and warm data on the cloud with O(1) access time, SeaweedFS can achieve both fast local access time and elastic cloud storage capacity. What's more, the cloud storage access API cost is minimized. Faster and Cheaper than direct cloud storage!
By default, the master node runs on port 9333, and the volume nodes run on port 8080. Let's start one master node, and two volume nodes on port 8080 and 8081. Ideally, they should be started from different machines. We'll use localhost as an example.
SeaweedFS uses HTTP REST operations to read, write, and delete. The responses are in JSON or JSONP format.
> ./weed master
> weed volume -dir="/tmp/data1" -max=5 -mserver="localhost:9333" -port=8080 &
> weed volume -dir="/tmp/data2" -max=10 -mserver="localhost:9333" -port=8081 &
To upload a file: first, send an HTTP POST, PUT, or GET request to /dir/assign to get an fid and a volume server URL:
> curl http://localhost:9333/dir/assign
{"count":1,"fid":"3,01637037d6","url":"127.0.0.1:8080","publicUrl":"localhost:8080"}
Second, to store the file content, send an HTTP multipart POST request to url + '/' + fid from the response:
> curl -F file=@/home/chris/myphoto.jpg http://127.0.0.1:8080/3,01637037d6
{"name":"myphoto.jpg","size":43234,"eTag":"1cc0118e"}
To update, send another POST request with updated file content.
For deletion, send an HTTP DELETE request to the same url + '/' + fid URL:
> curl -X DELETE http://127.0.0.1:8080/3,01637037d6
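Putting the two steps together, here is a minimal Go sketch of the same write (and delete) flow, using only the /dir/assign and volume-server endpoints shown above; the master address and file path are just the placeholders from the curl examples, not anything mandated by SeaweedFS.

// upload.go - a sketch of assign, upload, and delete against the documented REST API.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
)

// Fields match the JSON returned by /dir/assign in the example above.
type assignResult struct {
	Fid       string `json:"fid"`
	Url       string `json:"url"`
	PublicUrl string `json:"publicUrl"`
	Count     int    `json:"count"`
}

func main() {
	// Step 1: ask the master for a file id and a volume server URL.
	resp, err := http.Get("http://localhost:9333/dir/assign")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	var a assignResult
	if err := json.NewDecoder(resp.Body).Decode(&a); err != nil {
		panic(err)
	}

	// Step 2: send the file content as a multipart POST to url + "/" + fid.
	f, err := os.Open("/home/chris/myphoto.jpg") // placeholder path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, _ := w.CreateFormFile("file", "myphoto.jpg")
	io.Copy(part, f)
	w.Close()

	uploadURL := fmt.Sprintf("http://%s/%s", a.Url, a.Fid)
	resp2, err := http.Post(uploadURL, w.FormDataContentType(), &body)
	if err != nil {
		panic(err)
	}
	defer resp2.Body.Close()
	fmt.Println("uploaded as", a.Fid)

	// To delete later: issue an HTTP DELETE to the same URL.
	req, _ := http.NewRequest(http.MethodDelete, uploadURL, nil)
	http.DefaultClient.Do(req)
}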
Now, you can save the fid, 3,01637037d6 in this case, to a database field.
The number 3 at the start represents a volume id. After the comma, it's one file key, 01, and a file cookie, 637037d6.
The volume id is an unsigned 32-bit integer. The file key is an unsigned 64-bit integer. The file cookie is an unsigned 32-bit integer, used to prevent URL guessing.
The file key and file cookie are both coded in hex. You can store the <volume id, file key, file cookie> tuple in your own format, or simply store the fid as a string.
If stored as a string, in theory, you would need 8+1+16+8=33 bytes. A char(33) would be enough, if not more than enough, since most uses will not need 2^32 volumes.
If space is really a concern, you can store the file id in your own format. You would need one 4-byte integer for volume id, 8-byte long number for file key, and a 4-byte integer for the file cookie. So 16 bytes are more than enough.
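As an illustration of that packing, here is a hedged Go sketch that parses an fid string such as 3,01637037d6 into the <volume id, file key, file cookie> tuple and stores it in 16 bytes. The function name and the big-endian byte order are arbitrary choices for the example, not SeaweedFS's own encoding; it only assumes the layout described above (decimal volume id before the comma, hex file key followed by an 8-hex-char cookie after it).

// fid.go - pack an fid string into a 16-byte <volume id, file key, cookie> tuple.
package main

import (
	"encoding/binary"
	"fmt"
	"strconv"
	"strings"
)

func packFid(fid string) ([16]byte, error) {
	var out [16]byte
	parts := strings.SplitN(fid, ",", 2)
	if len(parts) != 2 || len(parts[1]) < 9 {
		return out, fmt.Errorf("malformed fid: %q", fid)
	}
	// Volume id: unsigned 32-bit integer, decimal.
	volumeID, err := strconv.ParseUint(parts[0], 10, 32)
	if err != nil {
		return out, err
	}
	// The last 8 hex characters are the 32-bit cookie; the rest is the 64-bit file key.
	keyCookie := parts[1]
	fileKey, err := strconv.ParseUint(keyCookie[:len(keyCookie)-8], 16, 64)
	if err != nil {
		return out, err
	}
	cookie, err := strconv.ParseUint(keyCookie[len(keyCookie)-8:], 16, 32)
	if err != nil {
		return out, err
	}
	binary.BigEndian.PutUint32(out[0:4], uint32(volumeID))
	binary.BigEndian.PutUint64(out[4:12], fileKey)
	binary.BigEndian.PutUint32(out[12:16], uint32(cookie))
	return out, nil
}

func main() {
	packed, err := packFid("3,01637037d6")
	fmt.Printf("%x %v\n", packed, err)
}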
Here is an example of how to render the URL.
First look up the volume server's URLs by the file's volumeId:
> curl http://localhost:9333/dir/lookup?volumeId=3
{"volumeId":"3","locations":[{"publicUrl":"localhost:8080","url":"localhost:8080"}]}
Since (usually) there are not too many volume servers, and volumes don't move often, you can cache the results most of the time. Depending on the replication type, one volume can have multiple replica locations. Just randomly pick one location to read.
Now you can take the public URL, render the URL or directly read from the volume server via URL:
http://localhost:8080/3,01637037d6.jpg
Notice we add a file extension ".jpg" here. It's optional and just one way for the client to specify the file content type.
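For completeness, a small Go sketch of this read path, assuming the same master address and fid as above; the JSON field names come from the /dir/lookup response shown earlier. In a real client you would cache the lookup result, as suggested above.

// read.go - look up the volume locations by volume id, then build the read URL.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// Fields match the JSON returned by /dir/lookup in the example above.
type lookupResult struct {
	VolumeId  string `json:"volumeId"`
	Locations []struct {
		Url       string `json:"url"`
		PublicUrl string `json:"publicUrl"`
	} `json:"locations"`
}

func main() {
	fid := "3,01637037d6"
	volumeID := strings.SplitN(fid, ",", 2)[0]

	resp, err := http.Get("http://localhost:9333/dir/lookup?volumeId=" + volumeID)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var l lookupResult
	if err := json.NewDecoder(resp.Body).Decode(&l); err != nil {
		panic(err)
	}
	if len(l.Locations) == 0 {
		panic("no locations for volume " + volumeID)
	}

	// Any replica location works for reads; the optional ".jpg" extension
	// is just a hint for the content type.
	readURL := fmt.Sprintf("http://%s/%s.jpg", l.Locations[0].PublicUrl, fid)
	fmt.Println(readURL)
}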
If you want a nicer URL, you can use one of these alternative URL formats:
http://localhost:8080/3/01637037d6/my_preferred_name.jpg
http://localhost:8080/3/01637037d6.jpg
http://localhost:8080/3,01637037d6.jpg
http://localhost:8080/3/01637037d6
http://localhost:8080/3,01637037d6
If you want to get a scaled version of an image, you can add some params:
http://localhost:8080/3/01637037d6.jpg?height=200&width=200
http://localhost:8080/3/01637037d6.jpg?height=200&width=200&mode=fit
http://localhost:8080/3/01637037d6.jpg?height=200&width=200&mode=fill
SeaweedFS applies the replication strategy at a volume level. So, when you are getting a file id, you can specify the replication strategy. For example:
curl http://localhost:9333/dir/assign?replication=001
The replication parameter options are:
000: no replication
001: replicate once on the same rack
010: replicate once on a different rack, but same data center
100: replicate once on a different data center
200: replicate twice on two different data centers
110: replicate once on a different rack, and once on a different data center
More details about replication can be found on the wiki.
You can also set the default replication strategy when starting the master server.
Volume servers can be started with a specific data center name:
weed volume -dir=/tmp/1 -port=8080 -dataCenter=dc1
weed volume -dir=/tmp/2 -port=8081 -dataCenter=dc2
When requesting a file key, an optional "dataCenter" parameter can limit the assigned volume to the specific data center. For example, this specifies that the assigned volume should be limited to 'dc1':
http://localhost:9333/dir/assign?dataCenter=dc1
Most distributed file systems split each file into chunks; a central master keeps a mapping from filenames and chunk indices to chunk handles, and also tracks which chunks each chunk server has.
The main drawback is that the central master can't handle many small files efficiently, and since all read requests need to go through the chunk master, it might not scale well for many concurrent users.
Instead of managing chunks, SeaweedFS manages data volumes in the master server. Each data volume is 32GB in size, and can hold a lot of files. And each storage node can have many data volumes. So the master node only needs to store the metadata about the volumes, which is a fairly small amount of data and is generally stable.
The actual file metadata is stored in each volume on volume servers. Since each volume server only manages metadata of files on its own disk, with only 16 bytes for each file, all file access can read file metadata just from memory and only needs one disk operation to actually read file data.
For comparison, consider that an xfs inode structure in Linux is 536 bytes.
The architecture is fairly simple. The actual data is stored in volumes on storage nodes. One volume server can have multiple volumes, and can both support read and write access with basic authentication.
All volumes are managed by a master server. The master server contains the volume id to volume server mapping. This is fairly static information, and can be easily cached.
On each write request, the master server also generates a file key, which is a growing 64-bit unsigned integer. Since write requests are not generally as frequent as read requests, one master server should be able to handle the concurrency well.
When a client sends a write request, the master server returns (volume id, file key, file cookie, volume node URL) for the file. The client then contacts the volume node and POSTs the file content.
When a client needs to read a file based on (volume id, file key, file cookie), it asks the master server by the volume id for the (volume node URL, volume node public URL), or retrieves this from a cache. Then the client can GET the content, or just render the URL on web pages and let browsers fetch the content.
Please see the example for details on the write-read process.
In the current implementation, each volume can hold 32 gibibytes (32GiB or 8x2^32 bytes). This is because we align content to 8 bytes. We can easily increase this to 64GiB, or 128GiB, or more, by changing 2 lines of code, at the cost of some wasted padding space due to alignment.
There can be 2^32 (about 4 billion) volumes. So the total system size is 8 bytes x 2^32 x 2^32, which is 128 exbibytes (128EiB or 2^67 bytes).
Each individual file size is limited to the volume size.
All file meta information stored on a volume server is readable from memory without disk access. Each file takes just a 16-byte map entry of <64bit key, 32bit offset, 32bit size>. Of course, each map entry has its own space cost for the map. But usually the disk space runs out before the memory does.
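A short Go sketch of this bookkeeping, with illustrative type and field names (not SeaweedFS's actual source), which also works out the capacity arithmetic from the preceding paragraphs:

// needle.go - the 16-byte per-file map entry and the volume/system size math.
package main

import "fmt"

// One map entry per file held in volume-server memory: <64bit key, 32bit offset, 32bit size>.
type needleEntry struct {
	Key    uint64 // 8 bytes: file key
	Offset uint32 // 4 bytes: offset within the volume, in 8-byte units
	Size   uint32 // 4 bytes: file size
}

func main() {
	// One entry per file: 8 + 4 + 4 = 16 bytes, as stated above.
	entry := needleEntry{Key: 0x01, Offset: 0, Size: 43234}
	fmt.Println("bytes per entry:", 8+4+4, "example:", entry)

	// Capacity arithmetic: content is aligned to 8 bytes and offsets are 32-bit,
	// so one volume holds 8 * 2^32 bytes = 32 GiB; with 2^32 volume ids the total
	// is 8 * 2^32 * 2^32 = 2^67 bytes = 128 EiB.
	const volumeSize uint64 = 8 << 32
	fmt.Println("volume size (GiB):", volumeSize>>30)
	fmt.Printf("total capacity: %.0f EiB\n", float64(volumeSize)*(1<<32)/(1<<60))
}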
The local volume servers are much faster, while cloud storages have elastic capacity and are actually more cost-efficient if not accessed often (usually free to upload, but relatively costly to access). With the append-only structure and O(1) access time, SeaweedFS can take advantage of both local and cloud storage by offloading the warm data to the cloud.
Usually hot data are fresh and warm data are old. SeaweedFS puts the newly created volumes on local servers, and optionally uploads the older volumes to the cloud. If the older data are accessed less often, this literally gives you unlimited capacity with limited local servers, while staying fast for new data.
With the O(1) access time, the network latency cost is kept at minimum.
If the hot/warm data is split as 20/80, with 20 servers, you can achieve storage capacity of 100 servers. That's a cost saving of 80%! Or you can repurpose the 80 servers to store new data also, and get 5X storage throughput.
Most other distributed file systems seem more complicated than necessary.
SeaweedFS is meant to be fast and simple, in both setup and operation. If you do not understand how it works when you reach here, we've failed! Please raise an issue with any questions or update this file with clarifications.
SeaweedFS is constantly moving forward. Same with other systems. These comparisons can be outdated quickly. Please help to keep them updated.
HDFS uses the chunk approach for each file, and is ideal for storing large files.
SeaweedFS is ideal for serving relatively smaller files quickly and concurrently.
SeaweedFS can also store extra large files by splitting them into manageable data chunks and storing the file ids of the data chunks in a meta chunk. This is managed by the "weed upload/download" tool, and the weed master and volume servers are agnostic about it.
The architectures are mostly the same. SeaweedFS aims to store and read files fast, with a simple and flat architecture. The main differences are:
System | File Metadata | File Content Read | POSIX | REST API | Optimized for large number of small files |
---|---|---|---|---|---|
SeaweedFS | lookup volume id, cacheable | O(1) disk seek | | Yes | Yes |
SeaweedFS Filer | Linearly Scalable, Customizable | O(1) disk seek | FUSE | Yes | Yes |
GlusterFS | hashing | | FUSE, NFS | | |
Ceph | hashing + rules | | FUSE | Yes | |
MooseFS | in memory | | FUSE | | No |
MinIO | separate meta file for each file | | | Yes | No |
GlusterFS stores files, both directories and content, in configurable volumes called "bricks".
GlusterFS hashes the path and filename into ids, assigns them to virtual volumes, and then maps them to "bricks".
MooseFS chooses to neglect the small file issue. From the moosefs 3.0 manual, "even a small file will occupy 64KiB plus additionally 4KiB of checksums and 1KiB for the header", because it "was initially designed for keeping large amounts (like several thousands) of very big files".
MooseFS Master Server keeps all metadata in memory. Same issue as the HDFS namenode.
Ceph can be set up similarly to SeaweedFS as a key->blob store. It is much more complicated, with the need to support layers on top of it. Here is a more detailed comparison.
SeaweedFS has a centralized master group to look up free volumes, while Ceph uses hashing and metadata servers to locate its objects. Having a centralized master makes it easy to code and manage.
Ceph, like SeaweedFS, is based on the object store RADOS. Ceph is rather complicated with mixed reviews.
Ceph uses CRUSH hashing to automatically manage data placement, which is efficient for locating the data. But the data has to be placed according to the CRUSH algorithm, and any wrong configuration can cause data loss. Topology changes, such as adding new servers to increase capacity, will cause data migration with high IO cost to fit the CRUSH algorithm. SeaweedFS places data by assigning it to any writable volume. If writing to one volume fails, just pick another volume to write to. Adding more volumes is also as simple as it can be.
SeaweedFS is optimized for small files. Small files are stored as one continuous block of content, with at most 8 unused bytes between files. Small file access is O(1) disk read.
SeaweedFS Filer uses off-the-shelf stores, such as MySql, Postgres, Sqlite, Mongodb, Redis, Elastic Search, Cassandra, HBase, MemSql, TiDB, CockroachDB, Etcd, YDB, to manage file directories. These stores are proven, scalable, and easier to manage.
SeaweedFS | comparable to Ceph | advantage |
---|---|---|
Master | MDS | simpler |
Volume | OSD | optimized for small files |
Filer | Ceph FS | linearly scalable, Customizable, O(1) or O(logN) |
MinIO follows AWS S3 closely and is ideal for testing the S3 API. It has good UI, policies, versioning, etc. SeaweedFS is trying to catch up here. It is also possible to put MinIO as a gateway in front of SeaweedFS later.
MinIO metadata are in simple files. Each file write will incur extra writes to the corresponding meta file.
MinIO does not have optimization for lots of small files. The files are simply stored as-is on local disks. With the extra meta file and shards for erasure coding, this only amplifies the LOSF problem.
MinIO has multiple disk IO to read one file. SeaweedFS has O(1) disk reads, even for erasure coded files.
MinIO has full-time erasure coding. SeaweedFS uses replication on hot data for faster speed and optionally applies erasure coding on warm data.
MinIO does not have POSIX-like API support.
MinIO has specific requirements on storage layout. It is not flexible to adjust capacity. In SeaweedFS, just start one volume server pointing to the master. That's all.
This is a super exciting project! And we need helpers and support!
Installation guide for users who are not familiar with golang
Step 1: install go on your machine and set up the environment by following the instructions at:
https://golang.org/doc/install
make sure to define your $GOPATH
Step 2: checkout this repo:
git clone https://github.com/seaweedfs/seaweedfs.git
Step 3: download, compile, and install the project by executing the following command
cd seaweedfs/weed && make install
Once this is done, you will find the executable "weed" in your $GOPATH/bin directory.
When testing read performance on SeaweedFS, it basically becomes a performance test of your hard drive's random read speed. Hard drives usually get 100MB/s~200MB/s.
To modify or delete small files, an SSD must erase a whole block at a time and move content from existing blocks to a new block. SSDs are fast when brand new, but get fragmented over time and need garbage collection to compact blocks. SeaweedFS is SSD-friendly since it is append-only. Deletion and compaction are done at the volume level in the background, neither slowing reads nor causing fragmentation.
My Own Unscientific Single Machine Results on a MacBook with Solid State Disk, CPU: 1 Intel Core i7 2.6GHz.
Write 1 million 1KB files:
Concurrency Level: 16
Time taken for tests: 66.753 seconds
Completed requests: 1048576
Failed requests: 0
Total transferred: 1106789009 bytes
Requests per second: 15708.23 [#/sec]
Transfer rate: 16191.69 [Kbytes/sec]
Connection Times (ms)
min avg max std
Total: 0.3 1.0 84.3 0.9
Percentage of the requests served within a certain time (ms)
50% 0.8 ms
66% 1.0 ms
75% 1.1 ms
80% 1.2 ms
90% 1.4 ms
95% 1.7 ms
98% 2.1 ms
99% 2.6 ms
100% 84.3 ms
Randomly read 1 million files:
Concurrency Level: 16
Time taken for tests: 22.301 seconds
Completed requests: 1048576
Failed requests: 0
Total transferred: 1106812873 bytes
Requests per second: 47019.38 [#/sec]
Transfer rate: 48467.57 [Kbytes/sec]
Connection Times (ms)
min avg max std
Total: 0.0 0.3 54.1 0.2
Percentage of the requests served within a certain time (ms)
50% 0.3 ms
90% 0.4 ms
98% 0.6 ms
99% 0.7 ms
100% 54.1 ms
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
The text of this page is available for modification and reuse under the terms of the Creative Commons Attribution-Sharealike 3.0 Unported License and the GNU Free Documentation License (unversioned, with no invariant sections, front-cover texts, or back-cover texts).
Author: Seaweedfs
Source Code: https://github.com/seaweedfs/seaweedfs
License: Apache-2.0 license
Goofys is a high-performance, POSIX-ish Amazon S3 file system written in Go
Overview
Goofys allows you to mount an S3 bucket as a filey system.
It's a Filey System instead of a File System because goofys strives for performance first and POSIX second. In particular, operations that are difficult to support on S3 or that would translate into more than one round-trip either fail (random writes) or are faked (no per-file permissions). Goofys does not have an on-disk data cache (check out catfs), and the consistency model is close-to-open.
Installation
On Linux, install via pre-built binaries. You may also need to install fuse-utils first.
On macOS, install via Homebrew:
$ brew cask install osxfuse
$ brew install goofys
$ export GOPATH=$HOME/work
$ go get github.com/kahing/goofys
$ go install github.com/kahing/goofys
Usage
$ cat ~/.aws/credentials
[default]
aws_access_key_id = AKID1234567890
aws_secret_access_key = MY-SECRET-KEY
$ $GOPATH/bin/goofys <bucket> <mountpoint>
$ $GOPATH/bin/goofys <bucket:prefix> <mountpoint> # if you only want to mount objects under a prefix
Users can also configure credentials via the AWS CLI or the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
To mount an S3 bucket on startup, make sure the credential is configured for root, and add this to /etc/fstab:
goofys#bucket /mnt/mountpoint fuse _netdev,allow_other,--file-mode=0666,--dir-mode=0777 0 0
See also: Instructions for Azure Blob Storage, Azure Data Lake Gen1, and Azure Data Lake Gen2.
Got more questions? Check out questions other people asked
Benchmark
Using --stat-cache-ttl 1s --type-cache-ttl 1s for goofys and -ostat_cache_expire=1 for s3fs to simulate cold runs. Details of the benchmark can be found in bench.sh. Raw data is available as well. The test was run on an EC2 m5.4xlarge in us-west-2a connected to a bucket in us-west-2. Units are seconds.
To run the benchmark, configure EC2's instance role to be able to write to $TESTBUCKET, and then do:
$ sudo docker run -e BUCKET=$TESTBUCKET -e CACHE=false --rm --privileged --net=host -v /tmp/cache:/tmp/cache kahing/goofys-bench
# result will be written to $TESTBUCKET
See also: cached benchmark result and result on Azure.
goofys has been tested under Linux and macOS.
List of non-POSIX behaviors/limitations:
- file mode, owner, and group are not persisted; use the --(dir|file)-mode or --(uid|gid) options to set them
- ctime, atime is always the same as mtime
- cannot rename directories with more than 1000 children
- unlink returns success even if the file is not present
- fsync is ignored, files are only flushed on close
In addition to the items above, the following are supportable but not yet implemented:
goofys has been tested with the following non-AWS S3 providers:
Additionally, goofys also works with the following non-S3 object stores:
References
go test
gcsfuse
Author: Kahing
Source Code: https://github.com/kahing/goofys
License: Apache-2.0 license
JuiceFS is a high-performance POSIX file system released under Apache License 2.0, particularly designed for the cloud-native environment. The data, stored via JuiceFS, will be persisted in object storage (e.g. Amazon S3), and the corresponding metadata can be persisted in various database engines such as Redis, MySQL, and TiKV based on the scenarios and requirements.
With JuiceFS, massive cloud storage can be directly connected to big data, machine learning, artificial intelligence, and various application platforms in production environments. Without modifying code, the massive cloud storage can be used as efficiently as local storage.
JuiceFS consists of three parts: the JuiceFS client, the data storage (object storage), and the metadata engine.
JuiceFS can store the metadata of file system on Redis, which is a fast, open-source, in-memory key-value data storage, particularly suitable for storing metadata; meanwhile, all the data will be stored in object storage through JuiceFS client. Learn more
Each file stored in JuiceFS is split into "Chunks" at a fixed size with a default upper limit of 64 MiB. Each Chunk is composed of one or more "Slices", and the length of a Slice varies depending on how the file is written. Each Slice is composed of fixed-size "Blocks", which are 4 MiB by default. These Blocks are eventually stored in object storage; at the same time, the metadata of the file and its Chunks, Slices, and Blocks is stored in metadata engines via JuiceFS. Learn more
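To make the hierarchy concrete, here is an illustrative Go sketch of the Chunk/Slice/Block relationship described above. The type and field names are assumptions made for the example, not JuiceFS's actual data structures; only the 64 MiB and 4 MiB defaults come from the text.

// layout.go - illustrative sketch of the Chunk -> Slice -> Block hierarchy.
package main

import "fmt"

const (
	maxChunkSize = 64 << 20 // default upper limit of a Chunk: 64 MiB
	blockSize    = 4 << 20  // default Block size: 4 MiB
)

// A Block is the unit actually written to object storage.
type block struct {
	ObjectKey string // key of this block in the object store
	Size      int
}

// A Slice covers one contiguous write inside a Chunk; its length depends on
// how the file was written.
type slice struct {
	Offset int
	Length int
	Blocks []block
}

// A Chunk is a fixed-size region of the file, made of one or more Slices.
type chunk struct {
	Index  int
	Slices []slice
}

// blocksNeeded shows the arithmetic: a slice of n bytes maps to ceil(n/blockSize) blocks.
func blocksNeeded(n int) int {
	return (n + blockSize - 1) / blockSize
}

func main() {
	c := chunk{Index: 0, Slices: []slice{{Offset: 0, Length: maxChunkSize}}}
	fmt.Println("a full 64 MiB slice needs", blocksNeeded(c.Slices[0].Length), "blocks")
}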
When using JuiceFS, files will eventually be split into Chunks, Slices and Blocks and stored in object storage. Therefore, the source files stored in JuiceFS cannot be found in the file browser of the object storage platform; instead, there are only a chunks directory and a bunch of digitally numbered directories and files in the bucket. Don't panic! This is just the secret of the high-performance operation of JuiceFS!
Before you begin, make sure you have:
Please refer to Quick Start Guide to start using JuiceFS right away!
Check out all the command line options in command reference.
It is also very easy to use JuiceFS on Kubernetes. Please find more information here.
If you wanna use JuiceFS in Hadoop, check Hadoop Java SDK.
Please refer to JuiceFS Document Center for more information.
JuiceFS has passed all of the compatibility tests (8813 in total) in the latest pjdfstest.
All tests successful.
Test Summary Report
-------------------
/root/soft/pjdfstest/tests/chown/00.t (Wstat: 0 Tests: 1323 Failed: 0)
TODO passed: 693, 697, 708-709, 714-715, 729, 733
Files=235, Tests=8813, 233 wallclock secs ( 2.77 usr 0.38 sys + 2.57 cusr 3.93 csys = 9.65 CPU)
Result: PASS
Aside from the POSIX features covered by pjdfstest, JuiceFS also provides:
JuiceFS provides a subcommand that can run a few basic benchmarks to help you understand how it works in your environment:
A sequential read/write benchmark has also been performed on JuiceFS, EFS and S3FS by fio.
The results show that JuiceFS can provide 10X more throughput than the other two (see more details).
A simple mdtest benchmark has been performed on JuiceFS, EFS and S3FS by mdtest.
The result shows that JuiceFS can provide significantly more metadata IOPS than the other two (see more details).
There is a virtual file called .accesslog in the root of JuiceFS to show all the details of file system operations and the time they take, for example:
$ cat /jfs/.accesslog
2021.01.15 08:26:11.003330 [uid:0,gid:0,pid:4403] write (17669,8666,4993160): OK <0.000010>
2021.01.15 08:26:11.003473 [uid:0,gid:0,pid:4403] write (17675,198,997439): OK <0.000014>
2021.01.15 08:26:11.003616 [uid:0,gid:0,pid:4403] write (17666,390,951582): OK <0.000006>
The last number on each line is the time (in seconds) that the current operation takes. You can directly use this to debug and analyze performance issues, or try ./juicefs profile /jfs to monitor real time statistics. Please run ./juicefs profile -h or refer to here to learn more about this subcommand.
JuiceFS supports almost all object storage services. Learn more.
JuiceFS is production ready and used on thousands of machines in production. A list of users has been assembled and documented here. In addition, JuiceFS has several collaborative projects that integrate with other open source projects, which we have documented here. If you are also using JuiceFS, please feel free to let us know, and you are welcome to share your specific experience with everyone.
The storage format is stable and will be supported by all future releases.
We use GitHub Issues to track community reported issues. You can also contact the community for any questions.
Thank you for your contribution! Please refer to the CONTRIBUTING.md for more information.
Welcome to join the Discussions and the Slack channel to connect with JuiceFS team members and other users.
JuiceFS collects anonymous usage data by default to help us better understand how the community is using JuiceFS. Only core metrics (e.g. version number) will be reported, and user data and any other sensitive data will not be included. The related code can be viewed here.
You could also disable reporting easily with the command line option --no-usage-report:
juicefs mount --no-usage-report
The design of JuiceFS was inspired by Google File System, HDFS and MooseFS. Thanks for their great work!
JuiceFS supports many object storage services. Please check out this list first. If the object storage you want to use is compatible with S3, you could treat it as S3. Otherwise, try reporting an issue.
Yes. Since v1.0.0 Beta3 JuiceFS supports the use of Redis Cluster as the metadata engine, but it should be noted that Redis Cluster requires that the keys of all operations in a transaction must be in the same hash slot, so a JuiceFS file system can only use one hash slot.
See "Redis Best Practices" for more information.
See "Comparison with Others" for more information.
For more FAQs, please see the full list.
📺 Video: What is JuiceFS?
📖 Document: Quick Start Guide
Author: juicedata
Source Code: https://github.com/juicedata/juicefs
License: Apache-2.0 license
S3LevelDOWN
An abstract-leveldown compliant implementation of LevelDOWN that uses Amazon S3 as a backing store. S3 is actually a giant key-value store on the cloud, even though it is marketed as a file store. Use this database with the LevelUP API.
To use this optimally, please read "Performance considerations" and "Warning about concurrency" sections below.
You could also use this as an alternative API to read/write S3. The API is simpler to use when compared to the AWS SDK!
Install s3leveldown and peer dependencies levelup and aws-sdk with yarn or npm.
$ npm install s3leveldown aws-sdk levelup
See the LevelUP API for high level usage.
s3leveldown(location [, s3])
Constructor of the s3leveldown backing store. Use with levelup.
Arguments:
- location: name of the S3 bucket with optional sub-folder. Example: mybucket or mybucket/folder.
- s3: optional S3 client from aws-sdk. A default client will be used if not specified.
Please refer to the AWS SDK docs to set up your API credentials before using.
// require levelup and s3leveldown so the example is runnable on its own
const levelup = require('levelup');
const s3leveldown = require('s3leveldown');

(async () => {
// create DB
const db = levelup(s3leveldown('mybucket'));
// put items
await db.batch()
.put('name', 'Pikachu')
.put('dob', 'February 27, 1996')
.put('occupation', 'Pokemon')
.write();
// read items
await db.createReadStream()
.on('data', data => { console.log('data', `${data.key.toString()}=${data.value.toString()}`); })
.on('close', () => { console.log('done!') });
})();
const levelup = require('levelup');
const s3leveldown = require('s3leveldown');
const db = levelup(s3leveldown('my_bucket'));
db.batch()
.put('name', 'Pikachu')
.put('dob', 'February 27, 1996')
.put('occupation', 'Pokemon')
.write(function () {
db.readStream()
.on('data', console.log)
.on('close', function () { console.log('Pika pi!') })
});
You could also use s3leveldown with S3 compatible servers such as MinIO.
const levelup = require('levelup');
const s3leveldown = require('s3leveldown');
const AWS = require('aws-sdk');
const s3 = new AWS.S3({
apiVersion: '2006-03-01',
accessKeyId: 'YOUR-ACCESSKEYID',
secretAccessKey: 'YOUR-SECRETACCESSKEY',
endpoint: 'http://127.0.0.1:9000',
s3ForcePathStyle: true,
signatureVersion: 'v4'
});
const db = levelup(s3leveldown('my_bucket', s3));
You can create your Level DB in a sub-folder in your S3 bucket, just use my_bucket/sub_folder when passing the location.
There are a few performance caveats due to the limited API provided by the AWS S3 API:
When iterating, getting values is expensive. A separate S3 API call is made to get the value of each key. If you don't need the value, pass { values: false } in the options. Each S3 API call can return 1000 keys, so if there are 3000 results, 3 calls are made to list the keys, and if getting values as well, another 3000 API calls are made.
Avoid iterating large datasets when passing { reverse: true }. Since the S3 API does not allow retrieving keys in reverse order, the entire result set needs to be stored in memory and reversed. If your database is large ( >5k keys ), be sure to provide start (gt, gte) and end (lt, lte) bounds, or the entire database will need to be fetched.
By default when iterating, 1000 keys will be returned. If you only want 10 keys for example, set { limit: 10 } and the S3 API call will only request 10 keys. Note that if you have { reverse: true }, this optimisation does not apply as we need to fetch everything from start to end and reverse it in memory. To override the default number of keys to return in a single API call, you can set the s3ListObjectMaxKeys option when creating the iterator. The maximum accepted by the S3 API is 1000.
Specify the AWS region of the bucket to improve performance, by calling AWS.config.update({ region: 'ap-southeast-2' }); replace ap-southeast-2 with your region.
Individual operations (put, get, del) are atomic as guaranteed by S3, but the implementation of batch is not atomic. Two concurrent batch calls will have their operations interwoven. Don't use any plugins which require this to be atomic or you will end up with your database corrupted! However, if you can guarantee that only one process will write to the S3 bucket at a time, then this should not be an issue. Ideally, you want to avoid race conditions where two processes are writing to the same key at the same time. In those cases the last write wins.
Iterator snapshots are not supported. When iterating through a list of keys and values, you may get the changes, similar to dirty reads.
S3LevelDOWN uses debug. To see debug messages, set the environment variable DEBUG=S3LevelDOWN.
To run the test suite, you need to set an S3 bucket in the environment variable S3_TEST_BUCKET. Also be sure to set your AWS credentials.
$ S3_TEST_BUCKET=my-test-bucket npm run test
Author: loune
Source Code: https://github.com/loune/s3leveldown
License: MIT license
STHREE ENV PLUGIN
This plugin is used to get config from a JSON-formatted file in S3 and copy it to environment variables.
For now it is quite simple; the bucket and config key name are predefined based on:
Bucket: (service name)-config-(stage) Key: config.json
Everything in that key will be copied over to your environment variables.
E.g. the service name is my-apps and I am using the dev stage, so create a bucket in the same region with the name my-apps-config-dev and a config.json file inside it like below:
{
"KEY": "VALUE"
}
Author: StyleTributeIT
Source Code: https://github.com/StyleTributeIT/serverless-sthree-env
License:
Deploy functionality is in active development; it will soon be available.
First, add Serverless Static to your project. Be sure that you already have the serverless-offline plugin installed:
$ npm install serverless-static --save-dev
or, if serverless-offline is not already installed
$ npm install serverless-static serverless-offline --save-dev
Then inside your project's serverless.yml file add the following entry to the plugins section: serverless-static. If there is no plugins section, you will need to add it to the file.
It should look something like this:
plugins:
- serverless-offline
- serverless-static
custom:
static:
path: ./public # select the folder you want to serve
port: 8000 # select a specific port
# this will override default behavior
# it will serve the folder ./public
# it will serve it through localhost:8000
Author: iliasbhal
Source Code: https://github.com/iliasbhal/serverless-static
License:
With this plugin for serverless, you can sync local folders to S3 buckets after your service is deployed.
Add the NPM package to your project:
# Via yarn
$ yarn add serverless-s3bucket-sync
# Via npm
$ npm install serverless-s3bucket-sync
Add the plugin to your serverless.yml:
plugins:
- serverless-s3bucket-sync
Configure S3 bucket syncing in serverless.yml with references to your local folder and the name of the S3 bucket.
custom:
s3-sync:
- folder: relative/folder
bucket: bucket-name
That's it! With the next deployment, serverless will sync your local folder relative/folder with the S3 bucket named bucket-name.
You can use sls sync to synchronize all buckets without deploying your serverless stack.
You are welcome to contribute to this project! 😘
To make sure you have a pleasant experience, please read the code of conduct. It outlines core values and beliefs and will make working together a happier experience.
Author: sbstjn
Source Code: https://github.com/sbstjn/serverless-s3bucket-sync
License: MIT license
A plugin to sync local directories and S3 prefixes for Serverless Framework ⚡ .
Example use cases:
- Static website ( serverless-s3-sync ) & contact form backend ( serverless ).
- Single page application ( serverless ) & assets ( serverless-s3-sync ).
Run npm install in your Serverless project.
$ npm install --save serverless-s3-sync
Add the plugin to your serverless.yml file
plugins:
- serverless-s3-sync
Version 2.0.0 is compatible with Serverless Framework v3, but it uses the legacy logging interface. Version 3.0.0 and later uses the new logging interface.
serverless-s3-sync | Serverless Framework |
---|---|
v1.x | v1.x, v2.x |
v2.0.0 | v1.x, v2.x, v3.x |
≥ v3.0.0 | v3.x |
custom:
s3Sync:
# A simple configuration for copying static assets
- bucketName: my-static-site-assets # required
bucketPrefix: assets/ # optional
localDir: dist/assets # required
# An example of possible configuration options
- bucketName: my-other-site
localDir: path/to/other-site
deleteRemoved: true # optional, indicates whether sync deletes files no longer present in localDir. Defaults to 'true'
acl: public-read # optional
followSymlinks: true # optional
defaultContentType: text/html # optional
params: # optional
- index.html:
CacheControl: 'no-cache'
- "*.js":
CacheControl: 'public, max-age=31536000'
bucketTags: # optional, these are appended to existing S3 bucket tags (overwriting tags with the same key)
tagKey1: tagValue1
tagKey2: tagValue2
# This references bucket name from the output of the current stack
- bucketNameKey: AnotherBucketNameOutputKey
localDir: path/to/another
# ... but can also reference it from the output of another stack,
# see https://www.serverless.com/framework/docs/providers/aws/guide/variables#reference-cloudformation-outputs
- bucketName: ${cf:another-cf-stack-name.ExternalBucketOutputKey}
localDir: path
resources:
Resources:
AssetsBucket:
Type: AWS::S3::Bucket
Properties:
BucketName: my-static-site-assets
OtherSiteBucket:
Type: AWS::S3::Bucket
Properties:
BucketName: my-other-site
AccessControl: PublicRead
WebsiteConfiguration:
IndexDocument: index.html
ErrorDocument: error.html
AnotherBucket:
Type: AWS::S3::Bucket
Outputs:
AnotherBucketNameOutputKey:
Value: !Ref AnotherBucket
Run sls deploy, and local directories and S3 prefixes are synced.
Run sls remove, and S3 objects in S3 prefixes are removed.
Run sls deploy --nos3sync to deploy your serverless stack without syncing local directories and S3 prefixes.
Run sls remove --nos3sync to remove your serverless stack without removing S3 objects from the target S3 buckets.
sls s3sync
Sync local directories and S3 prefixes.
If also using the plugins serverless-offline and serverless-s3-local, sync can be supported during development by placing the bucket configuration(s) into the buckets object and specifying the alternate endpoint (see below).
custom:
s3Sync:
# an alternate s3 endpoint
endpoint: http://localhost:4569
buckets:
# A simple configuration for copying static assets
- bucketName: my-static-site-assets # required
bucketPrefix: assets/ # optional
localDir: dist/assets # required
# ...
As per serverless-s3-local's instructions, once a local credentials profile is configured, run sls offline start --aws-profile s3local to sync to the local s3 bucket instead of Amazon AWS S3.
bucketNameKey will not work in offline mode and can only be used in conjunction with valid AWS credentials; use bucketName instead.
Run sls deploy for normal deployment.
custom:
s3Sync:
# Disable sync when sls deploy and sls remove
noSync: true
buckets:
# A simple configuration for copying static assets
- bucketName: my-static-site-assets # required
bucketPrefix: assets/ # optional
localDir: dist/assets # required
# ...
Author: k1LoW
Source Code: https://github.com/k1LoW/serverless-s3-sync
License:
A plugin for serverless to make buckets empty before removal.
Usage
Run the following command:
$ npm install serverless-s3-remover
Add to your serverless.yml
plugins:
- serverless-s3-remover
custom:
remover:
buckets:
- my-bucket-1
- my-bucket-2
You can specify any number of buckets that you want.
Now you can make all buckets empty by running:
$ sls s3remove
When removing
When removing the serverless stack, this plugin automatically makes the buckets empty before removing the stack.
$ sls remove
Using Prompt
You can use prompt before deleting bucket.
custom:
remover:
prompt: true # default value is `false`
buckets:
- remover-bucket-a
- remover-bucket-b
Populating the configuration object before using it
custom:
boolean:
true: true
false: false
remover:
prompt: ${self:custom.boolean.${opt:s3-remover-prompt, 'true'}}
I can use the command line argument --s3-remover-prompt false to disable the prompt feature.
Author: Sinofseven
Source Code: https://github.com/sinofseven/serverless-s3-remover
License: MIT license
serverless-s3-local is a Serverless plugin that runs an S3 clone locally. It is aimed at accelerating development of AWS Lambda functions by local testing. It works well together with serverless-offline.
Installation
Use npm
npm install serverless-s3-local --save-dev
Use serverless plugin install
sls plugin install --name serverless-s3-local
Example
serverless.yaml
service: serverless-s3-local-example
provider:
name: aws
runtime: nodejs12.x
plugins:
- serverless-s3-local
- serverless-offline
custom:
# Uncomment only if you want to collaborate with serverless-plugin-additional-stacks
# additionalStacks:
# permanent:
# Resources:
# S3BucketData:
# Type: AWS::S3::Bucket
# Properties:
# BucketName: ${self:service}-data
s3:
host: localhost
directory: /tmp
resources:
Resources:
NewResource:
Type: AWS::S3::Bucket
Properties:
BucketName: local-bucket
functions:
webhook:
handler: handler.webhook
events:
- http:
method: GET
path: /
s3hook:
handler: handler.s3hook
events:
- s3: local-bucket
event: s3:*
handler.js (AWS SDK v2)
const AWS = require("aws-sdk");
module.exports.webhook = (event, context, callback) => {
const S3 = new AWS.S3({
s3ForcePathStyle: true,
accessKeyId: "S3RVER", // This specific key is required when working offline
secretAccessKey: "S3RVER",
endpoint: new AWS.Endpoint("http://localhost:4569"),
});
S3.putObject({
Bucket: "local-bucket",
Key: "1234",
Body: new Buffer("abcd")
}, () => callback(null, "ok"));
};
module.exports.s3hook = (event, context) => {
console.log(JSON.stringify(event));
console.log(JSON.stringify(context));
console.log(JSON.stringify(process.env));
};
handler.js (AWS SDK v3)
const { S3Client, PutObjectCommand } = require("@aws-sdk/client-s3");
module.exports.webhook = (event, context, callback) => {
const client = new S3Client({
forcePathStyle: true,
credentials: {
accessKeyId: "S3RVER", // This specific key is required when working offline
secretAccessKey: "S3RVER",
},
endpoint: "http://localhost:4569",
});
client
.send(
new PutObjectCommand({
Bucket: "local-bucket",
Key: "1234",
Body: Buffer.from("abcd"),
})
)
.then(() => callback(null, "ok"));
};
module.exports.s3hook = (event, context) => {
console.log(JSON.stringify(event));
console.log(JSON.stringify(context));
console.log(JSON.stringify(process.env));
};
Configuration options
Configuration options can be defined in multiple ways. They will be parsed with the following priority:
- custom.s3 in serverless.yml
- custom.serverless-offline in serverless.yml

Option | Description | Type | Default value |
---|---|---|---|
address | The host/IP to bind the S3 server to | string | 'localhost' |
host | The host where internal S3 calls are made. Should be the same as address | string | |
port | The port that S3 server will listen to | number | 4569 |
directory | The location where the S3 files will be created. The directory must exist, it won't be created | string | './buckets' |
accessKeyId | The Access Key Id to authenticate requests | string | 'S3RVER' |
secretAccessKey | The Secret Access Key to authenticate requests | string | 'S3RVER' |
cors | The S3 CORS configuration XML. See AWS docs | string | Buffer | |
website | The S3 Website configuration XML. See AWS docs | string | Buffer | |
noStart | Set to true if you already have an S3rver instance running | boolean | false |
allowMismatchedSignatures | Prevent SignatureDoesNotMatch errors for all well-formed signatures | boolean | false |
silent | Suppress S3rver log messages | boolean | false |
serviceEndpoint | Override the AWS service root for subdomain-style access | string | amazonaws.com |
httpsProtocol | To enable HTTPS, specify directory (relative to your cwd, typically your project dir) for both cert.pem and key.pem files. | string | |
vhostBuckets | Disable vhost-style access for all buckets | boolean | true |
buckets | Extra bucket names will be created after starting S3 local | string |
Feature
Working with IaC tools
If you want to work with IaC tools such as Terraform, you have to manage the bucket creation process yourself. In this case, please follow the steps below.
#resources:
# Resources:
# NewResource:
# Type: AWS::S3::Bucket
# Properties:
# BucketName: local-bucket
$ mkdir /tmp/local-bucket
Triggering AWS Events offline
This plugin will create a temporary directory to store mock S3 info. You must use the AWS cli to trigger events locally. First, using aws configure, set up a new profile, i.e. aws configure --profile s3local. The default creds are:
aws_access_key_id = S3RVER
aws_secret_access_key = S3RVER
You can now use this profile to trigger events. e.g. to trigger a put-object on a file at ~/tmp/userdata.csv in a local bucket run: aws --endpoint http://localhost:4569 s3 cp ~/tmp/data.csv s3://local-bucket/userdata.csv --profile s3local
You should see the event trigger in the serverless offline console: info: PUT /local-bucket/user-data.csv 200 16ms 0b and a new object with metadata will appear in your local bucket.
See also
Author: ar90n
Source Code: https://github.com/ar90n/serverless-s3-local
License: MIT license
Serverless-s3-encryption
Set or remove the encryption settings on the S3 buckets in your serverless stack.
This plugin runs on the after:deploy hook, but you can also run it manually with: sls s3-encryption update
npm install --save-dev serverless-s3-encryption
See the example below for how to modify your serverless.yml
# serverless.yml
plugins:
# ...
- serverless-s3-encryption
custom:
# ...
s3-encryption:
buckets:
MyEncryptedBucket:
# see: http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#putBucketEncryption-property
# accepted values: none, AES256, aws:kms
SSEAlgorithm: AES256
# only if SSEAlgorithm is aws:kms
KMSMasterKeyID: STRING_VALUE
resources:
Resources:
MyEncryptedBucket:
Type: "AWS::S3::Bucket"
Description: my encrypted bucket
DeletionPolicy: Retain
Author: Tradle
Source Code: https://github.com/tradle/serverless-s3-encryption
License:
Plugin for serverless to deploy files to a variety of S3 Buckets
Note: This project is currently not maintained.
Installation
npm install --save-dev serverless-s3-deploy
Usage
Add to your serverless.yml:
plugins:
- serverless-s3-deploy
custom:
assets:
targets:
- bucket: my-bucket
files:
- source: ../assets/
globs: '**/*.css'
- source: ../app/
globs:
- '**/*.js'
- '**/*.map'
- bucket: my-other-bucket
empty: true
prefix: subdir
files:
- source: ../email-templates/
globs: '**/*.html'
You can specify any number of targets that you want. Each target has a bucket and a prefix.
bucket is either the name of your S3 bucket or a reference to a CloudFormation resource created in the same serverless configuration file. See below for additional details.
You can specify source relative to the current directory.
Each source has its own list of globs, which can be either a single glob, or a list of globs.
Setting empty to true will delete all files inside the bucket before uploading the new content to the S3 bucket. The prefix value is respected and files outside of it will not be deleted.
Now you can upload all of these assets to your bucket by running:
$ sls s3deploy
If you have defined multiple buckets, you can limit your deployment to a single bucket with the --bucket option:
$ sls s3deploy --bucket my-bucket
You can optionally specify an ACL for the files uploaded on a per target basis:
custom:
assets:
targets:
- bucket: my-bucket
acl: private
files:
The default value is private. Options are defined here.
The appropriate Content Type for each file will be determined using mime-types. If one can't be determined, a default fallback of 'application/octet-stream' will be used.
You can override this fallback per-source by setting defaultContentType.
custom:
assets:
targets:
- bucket: my-bucket
files:
- source: html/
defaultContentType: text/html
...
Additional headers can be included per target by providing a headers object.
See http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html for more details.
custom:
assets:
targets:
- bucket: my-bucket
files:
- source: html/
headers:
CacheControl: max-age=31104000 # 1 year
A common use case is to create the S3 buckets in the resources section of your serverless configuration and then reference them in your S3 plugin settings:
custom:
assets:
targets:
- bucket:
Ref: MyBucket
files:
- source: html/
resources:
# AWS CloudFormation Template
Resources:
MyBucket:
Type: AWS::S3::Bucket
Properties:
AccessControl: PublicRead
WebsiteConfiguration:
IndexDocument: index.html
ErrorDocument: index.html
You can disable the resolving with the following flag:
custom:
assets:
resolveReferences: false
If you want s3deploy to run automatically after a deploy, set the auto flag:
custom:
assets:
auto: true
You're going to need an IAM policy that supports this deployment. This might be a good starting point:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::${bucket}"
]
},
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:PutObjectAcl",
"s3:GetObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::${bucket}/*"
]
}
]
}
If you want to tweak the upload concurrency, change the uploadConcurrency config:
config:
assets:
# defaults to 3
uploadConcurrency: 1
Verbosity can be enabled using either of these methods:
Configuration:
custom:
assets:
verbose: true
Cli:
sls s3deploy -v
Author: Funkybob
Source Code: https://github.com/funkybob/serverless-s3-deploy
License: MIT license
serverless-offline-s3
This Serverless-offline plugin emulates AWS λ and S3 on your local machine. To do so, it listens to S3 bucket events and invokes your handlers.
Features:
First, add serverless-offline-s3 to your project:
npm install serverless-offline-s3
Then inside your project's serverless.yml file, add the following entry to the plugins section before serverless-offline (and after serverless-webpack if present): serverless-offline-s3.
plugins:
- serverless-webpack
- serverless-offline-s3
- serverless-offline
To be able to emulate an AWS S3 bucket on a local machine, there should be some bucket system actually running. One of the existing implementations suitable for the task is Minio.
Minio is a High Performance Object Storage released under Apache License v2.0. It is API compatible with Amazon S3 cloud storage service. Use MinIO to build high performance infrastructure for machine learning, analytics and application data workloads. See example s3 service setup.
We also need to set up actual buckets in the Minio server; we can use AWS CLI tools for that. In the example, we spawn up another container with aws-cli pre-installed and run an initialization script against the Minio server in a separate container.
Once Minio is running and initialized, we can proceed with the configuration of the plugin.
Note that starting from version v3.1 of the plugin.
The configuration of the plugin's functions follows the serverless documentation.
functions:
myS3Handler:
handler: handler.compute
events:
- s3:
bucket: myBucket
event: s3:ObjectCreated:Put
The configuration of the plugin's aws.S3 client is done by defining a custom: serverless-offline-s3 object in your serverless.yml with your specific configuration.
Minio with the following configuration:
custom:
serverless-offline-s3:
endpoint: http://0.0.0.0:9000
region: eu-west-1
accessKey: minioadmin
secretKey: minioadmin
Author: CoorpAcademy
Source Code: https://github.com/CoorpAcademy/serverless-plugins
License:
In this video we will go through Web API and AWS S3 integration. We will learn about S3, how to use it with our .NET Web API, and how to upload files to it.
So what we will cover today:
00:00 intro
00:51 Agenda
01:26 What is AWS S3
02:54 What is an S3 Bucket
06:22 S3 Characteristics - Benefits
09:26 Securing S3 Bucket
10:22 S3 Encryption
14:22 S3 Class Types
21:37 Ingredients (dev requirements)
22:13 Code time
22:37 Create an IAM User
23:29 Create an S3 bucket
27:03 Create Web API and Classlib
31:37 Setup the S3 Classlib
32:02 Create S3 Models (DTOs)
37:17 Create the interface and service
52:04 Creating the controller
59:53 Injecting the Services
01:01:10 Testing the application
Source code:
https://github.com/mohamadlawand087/NET6-S3
DotNet SDK:
https://dotnet.microsoft.com/download
Visual Studio Code:
https://code.visualstudio.com/
In this AWS Storage video, we will understand the differences between object, block, file, and distributed file system storage. Then we compare S3, EBS, HDFS, and EFS. We will look into some use cases and finally share some interview tips.
#aws #s3 #ebs #hdfs #efs