MinIO: A Bare Metal Drop-In for AWS S3 (marksblogg.com)
260 points by todsacerdoti on Aug 10, 2021 | 120 comments


Is HDFS nice? I did a lot of research before settling on Ceph for our in-house storage cluster, and I don't remember even considering HDFS; I don't really know why. Ceph is also a drop-in for S3 on bare-metal clusters.

I've been running Ceph for about a year now, and the start-up was a bit rough. We're actually on second-hand hard drives that had a lot of bad apples, and the failures weren't very transparent to deal with, which was a bit of a disappointment. Maybe my expectations were too high, but I was hoping it would just sort of fix itself (i.e. mark the relevant drive down, send me a notification, and ensure continuity). I feel I had to learn way too much about Ceph to be able to operate it properly. Besides that, the performance is also not stellar; it apparently scales with CPU frequency, which is a bit astonishing to me, but I've never designed a distributed filesystem, so who am I to judge.

I was looking for something that would scale with the company. Right now we've got 70 drives; maybe next year 100, and the year after 200. All our drives are 4 TB now, but I'd like to switch them out for 14 TB or 18 TB drives as we go along. We're not in a position to just drop 100k on a batch of shiny state-of-the-art machines at once. Many filesystems assume the number of drives in your cluster never changes, which is crazy.


Curious -- any reason you didn't just go with a single-machine export + expansion disk shelves on something like ZFS? Installing a MinIO gateway would also act as a bare drop-in for S3.

Asking since we're in the same position as yourself w/ high double-digit disks trying to figure out our plan moving forward. Right now we're just using a very large beefy node w/ shelves. ZFS (via TrueNAS) does give us pretty good guarantees on failed disks + automated notifications when stuff goes wrong.

Obviously a single system won't scale past a few hundred disks so we are looking at alternatives including Ceph, GlusterFS, and BeeGFS. From the outside looking in, Ceph seems like it might be more complexity than it's worth until you hit the 10s of PB range with completely standardized hardware?


Some of our rendering processes take multiple days to complete, and the black-box software we use doesn't have a pause button. So it's not that we need 99.99999% uptime, but there's actually never a moment where rebooting a machine would be convenient (or indeed wouldn't cost us money). Being distributed over nodes means I can reboot them and the processes are not disrupted.


For k8s there is also Kadalu, btw, which is based on GlusterFS but simplified.


HDFS doesn't really work as a normal filesystem. I think some other commenters pointed out the challenges with FUSE.

If I recall correctly there isn't really a way to modify an existing file via HDFS, so you'd have to copy/edit/replace. Append used to be an issue, but that got sorted out a few years back.

Erasure coding is available in the latest versions, which helps with replication costs.

I think HDFS may just be a simpler setup than other solutions (which is to say it's not all that simple, but easier than some other choices). And I wouldn't use HDFS as a replacement for block storage, which is something I've seen done with Ceph.


Thanks, we actually use Ceph as a straight-up filesystem that gets mounted on Linux machines and then exposed to our Windows-based processing nodes (they are human operated) over SMB. I think that explains why HDFS is not a good fit for us.


What about S3 didn't meet your use case? I don't work for AWS, and I don't care if they lose business; I'm interested in how different companies parse their requirements into manage vs. rent.


One aspect is that we have a lot of data with PII in it, and we feel safer anonymising it locally before sending it into the cloud (once the data is cleaned up it's actually sent to GCS for consumption in our product). Another aspect is that this data has to be accessible as Windows file shares (i.e. SMB) to our data processing team. The datasets range from several hundred GB to several TB, and each team member works on several of those datasets per day. This would strain our uplink too, and maybe the bandwidth would be costly as well.


If you are writing a ton of small files (we have billions of audit blobs we write), the API PUT costs can quickly creep up on you. We pay much more for those than for the actual storage. If you want to use tags on your objects, they charge you per tag per object per month - again, another huge cost. We missed that when pricing S3 out, and needed to do a project to pull out all of the tags we had; we're currently working on batching up multiple blobs into one larger blob to hopefully reduce our API costs by an order of magnitude. This is purely a cost decision for us, adding complexity to our application and its operation. S3 seems better suited for fewer, larger files. Our backups and other use cases like that work perfectly.
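Roughly the shape of the batching we're moving to, as a sketch (the bucket name, key scheme, batch size and event source are all made up here):

    import gzip, json, uuid
    import boto3

    s3 = boto3.client("s3")

    def flush_batch(records, bucket="audit-archive"):
        # One PUT per batch instead of one PUT per record.
        body = gzip.compress(b"\n".join(json.dumps(r).encode() for r in records))
        key = "audit/batch-%s.jsonl.gz" % uuid.uuid4()
        s3.put_object(Bucket=bucket, Key=key, Body=body)

    batch = []
    for record in incoming_audit_events():  # hypothetical event source
        batch.append(record)
        if len(batch) >= 1000:               # 1000 blobs -> 1 PUT request
            flush_batch(batch)
            batch = []
    if batch:
        flush_batch(batch)

The trade-off is that reads now need to know which batch a record landed in, which is exactly the extra application complexity mentioned above.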


HDFS has pretty much all of Ceph's flaws plus it has a non-scalable metadata server, the "NameNode". If you're already up and running with Ceph I can think of no reason to abuse yourself with HDFS.


We're spinning up a medium-sized Proxmox cluster (~50 nodes in total) to replace our aging Xen clusters. I saw Ceph is available on the Proxmox platform, but was hesitant to make all the VM storage backed by Ceph (throwing all the eggs into a single basket).

What were some of the other hurdles you faced in your Ceph deployment?


We've been playing around with migrating our bare metals to Proxmox as well. Though one main argument, being able to reboot/manage crashed GPU-accelerated nodes, was invalidated by Proxmox (KVM?) itself crashing whenever the GPU would crash, so it didn't solve our core problem. This is of course also due to the fact that we're not using industrial components, but it is what it is.

I found Ceph's error messages very hard to debug. Just google around a bit for the manuals on how to deal with buggy or fully defective drives. There's a lot of SSH'ing in, running vague commands, looking up IDs of drives and matching them to Linux device mount points, and reading vague error logs.

To me as a high-level operator it feels like it should be simple. If a drive supplies a block of data and that data fails its checksum, it's gone. The drive already does its very best internally to cope with physical issues; if the drive couldn't come up with valid data, it's toast, or as close to toast as anyone should be comfortable with. So it's simple: fail a checksum, out of the cluster, send me an e-mail. I don't get why Ceph has to be so much more complicated than that.


I found Proxmox not to be very user friendly when growing to such cluster sizes. Proxmox itself has been very stable and supports pretty much anything, but the GUI is not that great if you have many nodes and VMs, and the API can be lacking. However, using Ceph as a backing store for VM images is pretty easy in Proxmox. I have not used the CephFS stuff. I used Ceph in a separate cluster, both physically and standalone (not using the Proxmox integration).

So RBD is easy, S3 is somewhat more complicated as you need to run multiple gateways, but still very doable. The FS stuff also needs extra daemons, but I have not yet tested it.


You have to use something like FUSE to mount HDFS, if that is your intention. It's not really like Ceph. Unless your app is written to use the HDFS API directly it's going to be a bigger rigmarole to store stuff.


Did you not evaluate linstor?


Thanks, I didn't but it looks interesting, I'll research it later.


Be warned that their code quality is pretty bad. There was a bug I was dealing with last year where it did not delete objects but returned the correct HTTP response code indicating it did. This was widespread, not just some edge case I encountered. Their broken test suite doesn't actually verify the object on disk changed. I tried to engage them but they blew me off.


Minio isn't durable. Any S3 operation might not be on disk after it is completed successfully. They had an environment variable MINIO_DRIVE_SYNC for a while that fixed some cases. Looking at the current code this setting is called MINIO_FS_OSYNC now (for some reason) https://github.com/minio/minio/pull/9581/commits/ce63c75575a... (but I wouldn't trust that... are they fsyncing directories correctly? Making sure object metadata gets deleted with the data in one transaction etc.). Totally undocumented, too.

I guess this makes minio "fast". But it might eat your data. Please use something like Ceph+RadosGW instead. It might be okay for running tests where durability isn't a requirement.


That had me curious, so I searched a bit in their issues.

Their attitude about it isn't great: https://github.com/minio/minio/issues/3536

That's too bad, as it seems well thought out in other areas, like clustering.


The MinIO team cares about an issue if you are a paying customer, not if you're just using the open source version. Indeed, MinIO is not even fully S3 compatible (there are many edge cases), and they close the related issues by saying it's not a priority.

You might want to look at other options as well, like SeaweedFS [0], a POSIX-compliant, S3-compatible distributed file system.

[0] https://github.com/chrislusf/seaweedfs


I haven't used SeaweedFS yet, but it looks better (and small file/object performance should be miles better). W.r.t. fsync/durability: with the SeaweedFS API to a volume server you have to turn fsync on via a parameter, and it is disabled by default. With S3 it is probably also off by default, and you can turn it on per bucket: https://github.com/chrislusf/seaweedfs/wiki/Path-Specific-Co... .

Both should default to fsync on, with the option to turn it off, so not a great choice of defaults. Again, it probably looks good in benchmarks when people naively compare S3 stores, but it just shouldn't eat your data by default.


We tried to use it a year ago or so, because of the performance promise. We were getting random glitches every few thousand files during operations. There was no obvious pattern, so it was difficult to reproduce, and as far as I remember there were mentions of it on GitHub. Hopefully they acknowledge and get over this hump, as it seems like a promising project altogether.


Yep, same experience unfortunately


I also hit a very frustrating issue in minio where CORS headers weren't being set properly and there were many similar cases in their issues history. Their response was basically "works for me, sorry".

I'm pretty sure there was something weird going on with how minio was reading the config state, as I definitely was not the only one hitting it. Luckily I only had to use it for local testing in the project, but the whole thing didn't leave me feeling good.

[1] https://github.com/minio/minio/issues/11111


GitHub issue link? They seem to have a solid CI setup, and I know several large enterprises using it. But "I found a bug for my usage" != "bad code quality".


My usage was "setup a basic single node for testing, upload a file with mc client, delete a file with mc client". They failed that test. It was responding with 200s but the file was never deleted.

There are loads of issues like this on their github: https://github.com/minio/minio/issues/8873


That's an interesting issue. Boiled down to the object name having '//' in it, which drove a certain direction for the shard location that wasn't the same shard location that the delete function looked in.

Sounds like the shard hashing happens before or after object name normalization depending on the operation. Ouch.


Based on the sibling comment, it looks like it was related to a poor name sanitization/matching function (which says, to me, that it's risky if you have untrusted names), but this could also be caused by a lazy or delayed deletion strategy.


I had issues with frequent crashes due to various panics a while ago. It eventually went away after a version upgrade. But now, reading this, I don't feel terribly confident in using minio long term.


Is Ceph with Rados Gateway a better alternative to this?


CERN runs at least in part on Ceph and it's well documented:

https://www.youtube.com/watch?v=OopRMUYiY5E


I have a 500PB Ceph setup @ work, but I don't maintain it. It's been solid.


I would say no in production. I was recently testing Ceph + RGW as an on-prem S3 solution, but high-throughput PUTs + LISTs caused an index corruption that "lost" files according to subsequent LISTs; the file was still there if you GET it directly. When this was reported, it turned out it had already been found multiple years ago and never fixed.


Could you reference a bug URL? I tried to find it via tracker.ceph.com but failed to do so (I don't claim that the problem doesn't exist). That said, referencing a bug URL would be nice if you want to increase the credibility of your claim.


Could be: https://tracker.ceph.com/issues/24744

I know this bug has hampered our use of ceph at singlestore. Note that this is not an eventual consistency issue. When it happens the list command will permanently miss files.


100% agree. I pin the version we use because you never know if it will come with even more bugs.


What would be a better way to export NFS storage to S3, then? Swift, like it does for GlusterFS?


I don't know when this was written, but MinIO does not have a great story (or really any story) around horizontal scalability. Yes, you can set it up in "distributed mode", but that is completely fixed at setup time and requires a certain number of nodes right from the beginning.

For anyone who wants HA and horizontal elastic scalability, check out SeaweedFS instead; it is based on the Facebook "Haystack" paper: https://github.com/chrislusf/seaweedfs


Thanks! SeaweedFS has a linearly scalable architecture and performs much better.

It is also very easy to run. Just run this: "docker run -p 8333:8333 chrislusf/seaweedfs server -s3"
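Any standard S3 client can then talk to it on port 8333; a quick boto3 sketch (the credentials are placeholders, adjust them to however you configure identities):

    import boto3

    # Point a stock S3 client at the local SeaweedFS S3 gateway started above.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:8333",
        aws_access_key_id="placeholder",
        aws_secret_access_key="placeholder",
    )

    s3.create_bucket(Bucket="test-bucket")
    s3.put_object(Bucket="test-bucket", Key="hello.txt", Body=b"hello")
    print(s3.get_object(Bucket="test-bucket", Key="hello.txt")["Body"].read())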


SeaweedFS is an amazing project, thanks so much for making it.

I know I'm asking quite a biased source but are there any shortcomings of SeaweedFS that are well known? Any hangups/weird corners that you can think of just off the top of your head?


I spent some hours looking at SeaweedFS, and walked away with the impression that most of the code outside of the happy paths wasn't exercised much.

For example, if you batch upload data, and the default 8 volumes happen to fill at the same time, you get transient errors until it has managed to create new volumes.


I haven't looked at it as deeply -- thanks for pointing this out. Up until now I actually wanted to use MinIO for running an S3 service (Ceph + RadosGW is also an option for me, and this thread makes me consider it over MinIO, though it was always a strong contender). In the process of researching I earmarked SeaweedFS and cortx [0], but SeaweedFS attracted me way more -- it looks like it would fit me "just right".

It'd be nice if there were an issue explaining this shortcoming. Does the project know it happens, and is it a "hopefully we'll have auto-scaling/adjusting volumes/online configuration updates in the future" kind of thing? How would one mitigate it?

[0]: https://github.com/Seagate/cortx


Well, the author has to be aware of it: https://github.com/chrislusf/seaweedfs/issues/2216


For this particular issue, it was fixed in a PR that checks whether a volume is "crowded" and creates new volumes before they fill up. Not really an issue any more.


It really depends on the use case.

The project is still growing and there are different edge cases for each feature, especially new ones. However, in general, I feel the project is structured layer by layer, and it should be easy to fix the problems.

Some parts are complicated, e.g., FUSE mount. It's hard or impossible to be fully POSIX compliant. SeaweedFS has come a long way and has improved quite a lot, but maybe do not run your database on it just yet, until SeaweedFS supports block storage later.


I agree it's great (and mostly used) for integration testing, but does a significant number of users actually use it for storage?


Is it ready for production use? I can't find docs on how to run multiple masters anywhere.


Seaweedfs has made some questionable security decisions (https://github.com/chrislusf/seaweedfs/issues/1937).


You can still pass the -ip flag right? This mostly means you probably need to read some 'Seaweedfs in production' sort of guide.


> you probably need to read some 'Seaweedfs in production' sort of guide.

This comes across as slightly condescending.

As I'm sure you'd agree, secure by default is very important, and it's what most responsible distributions aim for (i.e., Debian/Ubuntu). Starting up a daemon should not launch it in the most open way possible, but instead the most restricted way possible.

A reasonable expectation is that you should not have to pass the -ip flag; daemons should default to a secure configuration (which probably means defaulting to -ip 127.0.0.1, which you should be able to easily override, if that is your intention, and achieve the default behavior by simply passing -ip 0.0.0.0/0).


> This comes across as slightly condescending.

As does:

> As I'm sure you'd agree, secure by default is very important

I just meant it in a practical sense: you (as in people) need to read a guide in order to make it production ready, instead of Seaweed being production ready by default. I checked; there even is a guide in the repo, so I guess people need to read it.


wow that's got everything and almost the kitchen sink.


Actually SeaweedFS already supports asynchronous backup to data sinks such as S3, GCP, Azure, etc.

I had not heard of this "kitchen" sink before. :)

But adding one more sink should be trivial.


It's amazing. I use it everywhere.

The only limitation is that you don't have all the IAM access rules that you get with AWS.

Oh wait, that's exactly why I love it.


I feel like generating my own "S3 signed URLs" from Minio (as a node script) is a much better way to layer security than that IAM mess.

And, the mc command line client is awesome.

And, it all runs inside dokku which is incredible.
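The signed-URL part is a node script in my case, but for illustration the same idea looks roughly like this with boto3 (endpoint, keys, bucket and expiry are placeholders):

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://minio.example.com",   # your MinIO server
        aws_access_key_id="MINIO_ACCESS_KEY",
        aws_secret_access_key="MINIO_SECRET_KEY",
    )

    # Hand out a time-limited download link instead of wiring up IAM policies.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "reports", "Key": "2021/q2.pdf"},
        ExpiresIn=3600,  # seconds
    )
    print(url)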


What specifically do you have issues with when it comes to IAM?

It’s a complicated tool for sure, but it comes from the natural complication of dealing with auth in a very flexible way.


Not the parent, but IMO it is an awkward way of thinking of permissions in an automated environment. It fits a human model much better, where you have a long lived entity which is self contained and expected to be trustworthy. Alice should have access to all finance data in this bucket because she works in finance, or Bob should be able to access these EC2 instances because he admins them.

It usually causes weird and overly broad privileges, though, because you need to give permission to do any possible thing the job or user of the credentials COULD need to do, all the time.

This happens because any action to limit the scope usually causes more human friction than it is worth.

Ideally, when it is requested they do something, they get handed a token for the scope they are doing it in, which only gives them access to do the specific things they will need to do on the specific thing they need to do it, and only for the time they plausibly will need to do it for. This is a huge hassle for humans, and adds a lot of time and friction. For machines, it can be as simple as signing and passing along some simple data structures.

So for example, Alice would get a token allowing access to Q4 ‘20 only if that was plausibly correct and necessary, and then only for how long it took to do whatever she was doing. Bob would only get a token to access the specific EC2 instance that needs him to log into it because of a failure of the management tools that otherwise would fix things - and only after telling the token issuing tool/authority that, where it can be logged.

It makes a huge difference in limiting the scope of compromises, catching security breaches and security bugs early on, identifying the true scope and accessibility of data, etc.

Also, since no commonly issued token should probably ever provide access to get everything - where the IAM model pretty much requires that a job that gets any random one thing has to be able to get ALL things - you also end up with the potential for runtime performance optimizations, since you can prune in advance the set of possible values to search/return.


You can model this with a combination of explicit conditions and principal/resource tags. You also can apply a specific custom policy with every role assumption that can be both time bound and more restrictive than the role policies themselves. All IAM stuff is also very heavily logged.

But overall I’m not sure constantly reaching out to IAM to retrieve scoped permissions for every single action makes much sense. Aside from the obvious latency issues the master set of credentials needs to have permissions to be able to request these scoped time-bound keys, and so them being leaked is just as bad as they can be used to just re-request access to “Q2 data”. Ok, so we need some logic to say “Alice should only be able to request these keys once a day” or some such, and these arbitrary requirements are much more complex to implement and a lot more fragile.

So it only makes sense if you’re expecting it to be materially more common for a service to somehow leak these time-bound single access keys but not leak any other credentials. Which isn’t an assumption that would hold up I think.

So what’s the point?


> It causes weird and overly broad privileges though usually, because you need to give permission to do any possible thing the job or user of the credentials COULD need to do, all the time.

Not really, unless you mean it needs permission to assume all the roles it could need in order to have the permissions it requires.


That is exactly what I mean. A web server that makes database requests needs permission to do any query a web request would need to be able to trigger - not just permission for the specific query that makes sense for the specific request it is serving at the time.

It’s the difference between ‘can query the database’ and ‘can retrieve user Clarice’s profile information because she just made a request to the profile edit page’.

Does that make sense?


Yes, I understand, but the point I'm making is that it does support 'roles', 'assuming' them for a period of time, then dropping those privileges again, or 'assuming' a different one, etc.

The 'because' isn't there, but I'm not really sure what that would look like, at least in a meaningful (not trivially spoofable) way.


But no one is creating a role for read/modify/write for every distinct bit of user data, no? Or at least I hope not, and I doubt the system would function if they tried.

Tokens can do that.


But don't you just move the problem to the token-granting authority?

Don't get me wrong, I do see the hypothetical benefit, I'm just having trouble envisaging a practical solution. Is there something else not on AWS (or third-party for it) that works as you'd like IAM to?


I don’t think you understand my comment? (And the top level comment?)

Your token grantor is just taking in whatever request state you have (session, permissions granted, whatever) and stuffing it into the token that gets passed around. Then the various back ends and client calls also do that, and where a permissions check (or conversion) is necessary, say on a backend API call to access something, it checks whether the caller's token has the right permission.

I don’t want IAM to work that way. I don’t want to use IAM for this? It’s the wrong tool.

There are a ton of various signed token frameworks, all with various trade offs. JWT (ugh), Gaia Mint (internal Google), etc.

It tends to work best where there are multiple layers of services, as you have an abstraction layer you can do checking/audits, etc. at.
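To make the shape concrete, here is a toy sketch of the idea (not JWT, not Gaia Mint; the signing scheme and claim names are made up purely for illustration):

    import base64, hashlib, hmac, json, time

    SECRET = b"shared-signing-key"  # placeholder; real systems use proper key management

    def issue_token(subject, scope, ttl=300):
        # The scope is the specific thing the caller may touch right now,
        # not "everything the job could ever need".
        claims = {"sub": subject, "scope": scope, "exp": time.time() + ttl}
        payload = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode()
        sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
        return payload + "." + sig

    def check_token(token, required_scope):
        payload, sig = token.rsplit(".", 1)
        expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected):
            return False
        claims = json.loads(base64.urlsafe_b64decode(payload))
        return claims["exp"] > time.time() and claims["scope"] == required_scope

    t = issue_token("alice", "read:finance/q4-2020")
    print(check_token(t, "read:finance/q4-2020"))  # True, until it expires
    print(check_token(t, "read:finance/q3-2020"))  # False: outside the granted scope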


> a backend API call to access something, it checks whether the caller's token has the right permission.

Fine, but then that backend handler has permission itself to do whatever it is, regardless of whether or not the token does, since as you said, it COULD need it to service that request?

> I don’t want IAM to work that way. I don’t want to use IAM for this? It’s the wrong tool.

Yes.. I.. agree, I'm confused now why we're even talking about IAM, it's solving a slightly similar problem at a different level; isn't particularly useful here.


Token checking can happen literally at the DB layer, or storage layer if you have such a system (like Google or AWS or whatever).

And I don't know why; you're the only one who responded to my reply to 'What specifically do you have issues with when it comes to IAM?'

Haha


Well the use case wasn't clear to me 'up there', I thought you were arguing against IAM in general, or for cases where others do use it.

Essentially I suppose I disagree that it's only good for human users with long-lived roles, but I'm not saying it's the right tool for per-request granular authn, and I'd be surprised to learn that anyone is saying or (trying to be) using it like that. IAM's not even for end human users, (as in of your application) nevermind breaking further down into different types of request from them or on their behalf.


I wasn’t referring to end users? I was referring to admins, employees of the company, etc, which is what IAM is a good fit for.

Using that as the sole way to scope process/machine access, though, IS a weird fit in an automated environment for the reasons I laid out. You either come up with a broad scope that covers everything the job/process/machine could ever need to do or access (and then hope there is no exploit or bug that results in it accessing more), or build something like a token system that lets you get/scope access or permission in the context of the work it is doing on behalf of someone else. Which requires investment, but fits what should really be happening better. That is more of the ‘zero trust’ model, but certainly not all of it.


You answered your own question. "Very flexible" is a plus for Amazon because they can cover everybody's use-cases with a single concept. "Very flexible" is a minus for end users because they only need to take care of their own use-case.

So you can say it's a "natural" complication, and you'd be right, but that says nothing about usability, which is where "issues" tends to come in.


Probably learning curve


The gateway feature, where Minio works as a local cache for actual AWS S3 buckets, looks pretty nice.

https://docs.min.io/docs/minio-gateway-for-s3.html


I've used this, years ago, at a company for exactly this, and it's really solid, I've also used it in a developer environment as a more expansive "fake S3" than the simpler ones I'd run across at the time. Good stuff.


One will wish to be cautious, as they recently changed their license to AGPL-3.0: https://github.com/minio/minio/blob/master/LICENSE because they're afraid of AWS offering Minio as a hosted service, I guess


That seems okay, since you can use any S3 client library. So, good advice, but probably very few folks would have a need to touch the server side source.

Minio's client side libraries appear to be packaged separately, and Apache licensed: https://github.com/minio/minio-go

https://github.com/minio/minio-js

(Etc)


And if you do touch the server-side code..do it in an open source fork?


They're more likely afraid of smaller and upcoming cloud providers offering it as an S3 drop-in.


This plus CDNs, I'd imagine. The S3 protocol is the new FTP, and minio ticks that box quickly; they want their share of that (and deserve it imo).


Why would AWS offer Minio, a clone of an AWS Service, as a service?

That seems very confusing


Amazon wouldn't, but another cloud service might decide to run this rather than implementing their own S3-compatible object storage from scratch. Or they might use part of Minio's code to make their existing object storage solution compatible with S3.


IIRC from this podcast with Anand Babu Periasamy [0], they already do.

https://www.dataengineeringpodcast.com/minio-object-storage-...


Who does what? Amazon runs Minio?

The notes don't mention this and the audio is over one hour, would you mind clarifying?


another cloud service [Azure] decided to run this rather than implementing their own S3-compatible object storage from scratch


So that AWS can still get people to pay subscription fees to them, instead of using their own hardware with a FOSS solution, if MinIO becomes too popular.


I don't understand. Aren't you describing S3? Why would Amazon offer a second version of S3?


I understood it as a joke referencing Amazon's tendency to take just about any open source product that happens to gain enough popularity, rename it, and offer it as a shiny new feature of AWS.


Self hosted as a feature. Akin to managed vs colo hosting.


They pretty much already offer this with Outposts (it's technically their hardware, but it's on your premises).


By "self-hosted" you still mean still running on AWS hardware? I don't understand. Why would anybody pay EBS rates instead of S3 rates, to get data stored in the same place by the same people?


Unless you're running a locally patched version AGPL is indistinguishable from GPL.


While this is true, a word of warning: if you ever end up in a due diligence situation or a source code audit, AGPL can really freak people out and hang up/derail the process until you get them to understand this point. If you can at all.


I'm pretty sure that - the prospect of AWS hosting and rebranding minio - was a joke.


Pretty sure AWS would like to have something that at least looks and feels sorta like S3.

And, being S3-compatible at an API level would be a big bonus for a company the size of AWS, especially if it had nearly native compatibility with the aws-cli tool.


I know Minio can be used for production workloads, but Minio as a localhost substitute for an S3-like store is underrated. S3 dependent projects I work on spin up a Minio instance via docker-compose to get a fully local development experience.
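Usually the only thing the app needs for that is a configurable endpoint; a minimal sketch with boto3 (the minioadmin credentials are MinIO's usual dev defaults, everything else here is a placeholder):

    import os
    import boto3
    from botocore.client import Config

    # In dev, S3_ENDPOINT points at the local MinIO container (e.g. http://localhost:9000);
    # in production, leave it unset and supply real credentials so the client talks to real S3.
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ.get("S3_ENDPOINT"),
        aws_access_key_id=os.environ.get("S3_ACCESS_KEY", "minioadmin"),
        aws_secret_access_key=os.environ.get("S3_SECRET_KEY", "minioadmin"),
        config=Config(s3={"addressing_style": "path"}),  # path-style URLs for MinIO
    )
    print(s3.list_buckets()["Buckets"])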


In other words, I'm not sure what other use case there is for running Minio on top of more expensive block storage rather than using any native S3 storage service.


MinIO is something I'll look into. And, as another example to the article's, it might also come in handy for some data needs for factories with imperfect Internet reliability (e.g., when the main submarine cable between the factory and AWS Singapore gets severed :).

This first example from the article sounds very valid, but is still personally funny to me, because it's related to the first use I made of S3, but in the opposite direction (due to different technical needs than the article's):

> If an Airline has a fleet of 100 aircraft that produce 200 TB of telemetry each week and has poor network connectivity at its hub.

Years ago, I helped move a tricky Linux-filesystem-based storage scheme for flight data recorder captures to S3. I ended up making a bespoke layer for local-caching, encryption (integrating a proven method and implementation, not rolling own, of course), compression, and legacy backward-compatibility.

That was a pretty interesting challenge of architecture, operations, and systems-y software development. And the occasional non-mainstream technical requirements we encounter are why projects like this MinIO are interesting.


MinIO is great. I use MinIO together with FolderSync (Android only) to automatically backup the photos from my phone to my local NAS. It runs a scheduled job every night and they're saved in the original HEIC format.

I've also used MinIO to mock an S3 service for integration tests, complete with auth and whatnot.


Were you already using MinIO? As somebody who wants to eventually backup photos on my phone, I'm curious why not just use Syncthing for that?


Tbh I wasn't aware of Syncthing. In this use case it would work just as well I suppose.

One of the advantages of MinIO would be the wide compatibility with other S3 storage services. If my NAS had downtime while on holiday I could spin up a new bucket on S3/Backblaze/Wasabi and backup everything in a few minutes.


At work, we use MinIO as a replacement for the S3 API on our CI servers, since we don't want to call production APIs for integration testing.

One of the challenges we had was "pre-filling" the MinIO server with test data. Some tests require reading a test file from our mocked S3 API. We wanted to have those test files instantly available at MinIO startup, but could not get that to work with Docker volume mounts; MinIO simply would not recognize the files and served a 404.

Has anyone got that working, or a proposal for an alternative solution? We resorted to uploading the files via the application on startup (if it is in "testing mode"), but that does feel like a dirty hack.
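For reference, the hack boils down to a small seeding step along these lines, run before the test suite (endpoint, credentials and paths are placeholders):

    import pathlib
    import boto3

    # One-shot CI step: push local fixture files into the MinIO test bucket.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://minio:9000",
        aws_access_key_id="minioadmin",
        aws_secret_access_key="minioadmin",
    )

    bucket = "test-fixtures"
    s3.create_bucket(Bucket=bucket)

    for path in pathlib.Path("testdata").rglob("*"):
        if path.is_file():
            s3.upload_file(str(path), bucket, path.as_posix())

(An `mc mirror` run from a one-shot init container could probably do the same thing without any custom code.)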


Guys, don't store your data in minio. It's a sandbox, not an actual object store. Companies use minio to store their temporary data, not their actual critical data.

For example, if you have a project that stores objects to S3: in the CI pipeline you don't want to store temp files in S3 for cost reasons, so instead you store them in minio. A company must be crazy to use minio as their real data storage.


Anyone compared MinIO vs Ceph? I like MinIO because it seems exponentially simpler to setup but I don't know about its distributed and scalability stories.


While I can't say much about its handling when using it distributed, I have had some negative experiences with MinIO/ceph when handling files > 10G.

One example: missing error handling for interrupted uploads leading to files that looked as if they had been uploaded, but had not.

Both Ceph's and MinIO's implementations differ from AWS's original S3 server implementation in subtle ways. Ceph worked more reliably, but IIRC, for both MinIO and Ceph there is no guarantee that a file you upload is readable directly after upload. You have to poll until it is there, which might take a long time for bigger files (I guess because of the hash generation). AWS's original behavior is to keep the socket open until you can actually retrieve the file, which isn't necessarily better, as it can lead to other errors like network timeouts.

I got it working halfway reliably by splitting uploads into multiple smaller files, and adding retry with exponential backoff. Then I figured out that using local node storage and handling distribution manually was much more efficient for my use case.
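Not my actual code, but the retry part was roughly this shape with boto3 (thresholds and endpoint are placeholders):

    import time
    import boto3
    from boto3.s3.transfer import TransferConfig
    from botocore.exceptions import ClientError, EndpointConnectionError

    s3 = boto3.client("s3", endpoint_url="http://storage.internal:9000")  # placeholder endpoint

    # Split big objects into parts instead of sending one huge PUT.
    cfg = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                         multipart_chunksize=64 * 1024 * 1024)

    def upload_with_backoff(filename, bucket, key, attempts=5):
        for i in range(attempts):
            try:
                s3.upload_file(filename, bucket, key, Config=cfg)
                return
            except (ClientError, EndpointConnectionError):
                if i == attempts - 1:
                    raise
                time.sleep(2 ** i)  # 1s, 2s, 4s, ... between retries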

So for larger use cases, I'd take the 'drop in' claim with a grain of salt. YMMV :)


Try SeaweedFS. It should be much more scalable than MinIO or CEPH. Large files are well supported.


Do you have a Helm chart for SeaweedFS? I have been watching it for a long time too.


I wonder if this kind of article pushes business into the consulting funnel.

Can the author let us know if he is happy with the business results of these articles?


I use minio a lot for development.

Some of my applications rely on S3 in production, but I don't want that dependency when running the application on my machine - I use minio as a drop-in replacement for development.

Since I use docker compose to handle my app's services (postgres, rabbitmq, etc), adding minio into it is a perfect fix.


> Unfortunately, none of the above are available to the public, let alone something outsiders can run on their own hardware.

This is misleading. While there are no bare-metal projects I'm aware of, there are 10+ S3-API-compatible alternatives, such as Wasabi and Digital Ocean Objects, to name a few.


Something not really discussed a lot, FoundationDB works well as a blob / object store.


Is the project still alive? It sounds great if so, but if it's good at that, why doesn't Apple run it for iCloud instead of GCP? (Aside from the obvious massive scale issues, but it seems like Apple would be able to afford to engineer that.)


yeah, apple recently open sourced it. https://apple.github.io/foundationdb/index.html


Given all the config and services, why not just write to HDFS?


Don't use minio for object storage. Use it because you need an S3 interface (and the object store you want to use doesn't provide one). It's actually pretty straightforward to build an integration if minio doesn't provide it. Implementation tip: make the minio side stateless. Have fun.


https://www.storj.io uses minio underneath, but they treat minio with respect, as a partner and pay their cut


Amazon S3 on Outposts (more info at https://aws.amazon.com/s3/outposts/ ) runs on-premises, offers durable storage, high throughput, the S3 API, currently scales to 380 TB, and doesn't require you to watch for and deal with failing disks.

I believe that it addresses many of the OP's reasons for deciding against S3.


At least disclose your conflicts of interest when writing spam like this.


And only costs $169,000 to start


Hey, let's be fair. The storage-optimized instances start at only $425,931. ;)


It's not like buying hardware, support personnel hours and write-off administration is that much cheaper, unless you're willing to discard some features, but at that point you're no longer comparing things equally.


Have Outposts fixed their "always-on, lose connectivity, lose your outpost" problem that they had when I first asked about them?

Can they scale down to "I need to spin up an S3 thing for local testing" for the cost of the storage and CPU?

Am I locked into a multi-year agreement, or can I just go and throw it away in a month and stop paying?


I'm not going to contact AWS sales when I can easily use minio on Docker or Kubernetes.



