We had the same issue many years ago with the Reactiflux community. We ended up moving to Discord and that was the best decision ever. Discord has been an extremely welcoming place for all these kinds of communities.
we've been using Discord for years, it's great. Its model for making money is different: its primary market is gamers. Servers are content for their users to consume, and they charge the users directly.
From my experience, the vast majority of reliability issues at Meta come from 3 areas:
- Code changes
- Configuration changes (this includes the equivalent of server-topology changes like CloudFormation, quota changes)
- Experimentation rollout changes
There have been issues that are external (like user behavior changing for New Year / a World Cup final, physical connections between datacenters being severed…) but they tend to be a lot less frequent.
All three big buckets are tied to a single trackable change with an ID, so this leads to the ability to do that kind of automated root cause analysis at scale.
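The core idea above (every code, config, and experiment change carries a single trackable ID, so incidents can be automatically correlated with recent changes) can be sketched roughly like this. This is a minimal illustration under my own assumptions, not how Meta's actual tooling works; the `Change` record and `suspect_changes` ranking heuristic are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical change record: every code push, config change, and
# experiment rollout lands in one ledger with a single trackable id.
@dataclass
class Change:
    change_id: str
    kind: str          # "code" | "config" | "experiment"
    deployed_at: datetime
    service: str

def suspect_changes(changes, incident_start, affected_service,
                    window=timedelta(minutes=30)):
    """Rank recent changes to the affected service as root-cause suspects."""
    candidates = [
        c for c in changes
        if c.service == affected_service
        and incident_start - window <= c.deployed_at <= incident_start
    ]
    # Simple heuristic: the closer a change landed to incident onset,
    # the more suspect it is.
    return sorted(candidates, key=lambda c: c.deployed_at, reverse=True)
```

A real system would do much more (dependency graphs, metric correlation, blast-radius scoping), but the key enabler is the same: one ledger, one ID per change.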
Now, Meta is mostly a closed loop where all the infra and product are controlled as one entity, so those results may not be applicable outside.
Interesting. It sounds like “all” service state management (admin config, infra, topology) is discoverable/legible for Meta. I think that contrasts with AWS, where there is a strong DevTools org, but many services and integrations are more of an API-centric, service-to-service model with distributed state, which is much harder to observe. Every cloud provider I know of also has an (externally opaque) division between “native” cloud-services-built-on-cloud-infra and (typically older) “foundational” services that are much closer to “bare metal” with their own bespoke provisioning and management. E.g., EC2 has great visibility inside its placement and launch flows, but it'll never look like/interop with cfn & CloudTrail the way the ~280 other “native” services do.
Definitely agree that the bulk of “impact” traces back to changes introduced in the SDLC. Even for major incidents, infrastructure is probably down to 10-20% of causes in a good org. My view in the GP comment is probably skewed towards major incidents impairing multiple services/regions as well. While I worked on a handful of services, it was mostly on the edge/infra side, and I focused the last few years specifically on major incident management.
I'd still be curious about internal system state and faults due to issues like deadlocked workflows, incoherent state machines, and invalid state values. But maybe it's simply not that prevalent.
> this leads to the ability to do those kind of automated root cause analysis at scale.
I'm curious how well that works in the situation where your config change or experiment rollout results in a time bomb (e.g. triggered by task restart after a software rollout), speaking as someone who just came off an oncall shift where that was one of our more notable outages.
Google also has a ledger of production events which _most_ common infra will write to, but there are so many distinct systems that I would be worried about identifying spurious correlations with completely unrelated products.
> There has been issues that are external (like ... physical connection between datacenters being severed…) but they tend to be a lot less frequent.
That's interesting to hear, because my experience at Google is that we'll see a peering metro fully isolated from our network at least once a year; smaller fiber cuts that temporarily leave us with a SPOF or a capacity shortfall happen much, much more frequently.
(For a concrete example: a couple months ago, Hurricane Beryl temporarily took a bunch of peering infrastructure in Texas offline.)
excalidraw.com is a PWA, but I'm not sure how helpful it's going to be as a comparison. As mentioned in the comments, success is likely due more to the usefulness of the app than to the tech used. vjeuxx@gmail.com if you wanna chat.
Thank you so much for working on this! I strongly believe that we as a community need to invest in an open source video editor based on the web using WebCodecs. I did a talk last year to beg people to work on it! https://youtu.be/0cb0Bq4gLPo?si=7sPcAuH_9CDzM4xg
Let me know if I can be of any help. vjeuxx@gmail.com
The preview is using a different, faster model, so you're not going to get the exact same styles of responses as from the larger, slower one. If you have ideas on how to make the user experience better given those constraints, please let us know!
Well, my feedback would be that your larger, slower model doesn't seem to be capable of generating cartoon-style images, while the preview model does seem to be able to.
“Derivative Impact Reports. AI2 seeks to encourage transparency around Derivatives through the use of Derivative Impact Reports, available here. Before releasing a Model Derivative or Data Derivative, You will share with AI2 the intended use(s) of Your Derivative by completing a Derivative Impact Report or otherwise providing AI2 with substantially similar information in writing. You agree that AI2 may publish, post, or make available such information about Your Derivative for review by the general public.
You will use good faith efforts to be transparent about the intended use(s) of Your Derivatives by making the information freely available to others who may access or use Your Derivatives.
You acknowledge that Derivative Impact Reports are not intended to penalize any good faith disclosures about Derivatives. Accordingly, if You initiate or participate in any lawsuit or other legal action against a Third Party based on information in such Third Party’s Derivative Impact Report, then this MR Agreement will terminate immediately as of the date such lawsuit or legal action is filed or commenced.”