Good day to all! Over the last 30 minutes or so, I’ve been having issues loading beehaw.org. Sometimes the CSS is missing and the page layout is broken, and other times there is a server-side NGINX error.

Just wanted to make the admins aware this is happening. There are some NGINX settings that can be adjusted to make more threads available to NGINX if it is hitting a worker limit.
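
For reference, a minimal sketch of the kind of settings meant here, assuming a stock nginx.conf. NGINX handles load through worker processes and connections per worker rather than threads, and the values below are illustrative only, not Beehaw's actual configuration:

```nginx
# Main context: start one worker process per CPU core.
worker_processes auto;

# Raise the per-worker open file limit so it does not cap connections
# (illustrative value, tune to the host).
worker_rlimit_nofile 8192;

events {
    # Maximum simultaneous connections each worker may hold
    # (illustrative value; the built-in default is 512).
    worker_connections 4096;
}
```

Raising these only helps if NGINX itself is the bottleneck; if the backend behind it is slow or crashing, the 50x errors will persist.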

[–] ffmike 4 points 2 years ago (1 children)

Thanks. The admins are aware and are looking into the root cause.

[–] BitOneZero 2 points 2 years ago* (last edited 2 years ago) (1 children)

It's been 8 days and this is still ongoing on multiple instances, yet I do not see any open issue about 'nginx 50x' errors on the GitHub project for Lemmy. See the public outcry: https://lemmy.ml/post/1453121

[–] ffmike 4 points 2 years ago (1 children)

Yes, Beehaw is struggling with uptime. From talking with the admins, this really isn't an nginx issue. It's more that the Lemmy code itself is immature, with memory leaks and SQL performance issues, and those issues are becoming more disruptive as the usage explodes.

If you've got development skills, helping out the Lemmy project on GitHub is probably the best way to help. If not, then just press F5 with the rest of us when the site goes down for a bit.

[–] BitOneZero 1 points 2 years ago* (last edited 2 years ago) (1 children)

If you’ve got development skills, helping out the Lemmy project on GitHub is probably the best way to help.

I have been; I'm RocketDerp on GitHub. I've been watching for weeks how none of the people running the major sites have opened an issue on observable problems, so I have done so myself:

Major data integrity issues, ignored since the issue was opened on June 14: https://github.com/LemmyNet/lemmy/issues/3101

Obvious user-interface signs of the same problem, reported June 19: https://github.com/LemmyNet/lemmy/issues/3203

The problems were going on for weeks before I created these issues, and they are still being ignored. They weren't mentioned in the 0.18 announcement today (June 23), etc.

[–] ffmike 4 points 2 years ago (1 children)

I'm not an official spokescritter, but I can assure you the Beehaw admins aren't ignoring the issues. But ultimately it's going to come down to someone getting PRs in to the code. I hope someone gets some performance-focused PRs in soon.

[–] BitOneZero 1 points 2 years ago* (last edited 2 years ago) (2 children)

I’m not an official spokescritter, but I can assure you the Beehaw admins aren’t ignoring the issues.

They are not informing the end-users (and the flocking new server installers) of the problem; they are leaving people like me wasting their time calling out the problem. Denial isn't just a river in Egypt. Lemmy isn't scaling, it's falling flat on its face, and the federation protocol of doing one single like per HTTPS transaction is causing servers to overload peer servers. There isn't even anything built into lemmy_server to detect that posts and comments are missing, nor any tools to 'heal' missing data.

Where are the server logs? Why are the crashes not being shared to developers? Do I really have to build up an instance with 5000 users to get access to the data that Beehaw's servers are logging each hour?

[–] Gaywallet 3 points 2 years ago* (last edited 2 years ago) (1 children)

What are you asking for? I'm not smart enough to know what is going on here, but I can relay the request to someone who is if you're willing to dumb it down for me and ask nicely.

[–] BitOneZero 2 points 2 years ago (2 children)

What are you asking for?

Right out of the Lemmy documentation for servers:

journalctl -u lemmy

Log them to a file and dump them somewhere public, like a GitHub repository. What is going on in these logs when 500 errors are happening?
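
A rough sketch of what that could look like, assuming Lemmy runs as a systemd unit named lemmy (as in the command above) and that log lines carry a plain-text level such as ERROR. The function names and the file path are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: write the last hour of the lemmy unit's journal to a file.
# Assumes a systemd host where the service is named "lemmy".
dump_lemmy_logs() {
  journalctl -u lemmy --since "1 hour ago" --no-pager > "$1"
}

# Hypothetical helper: count the lines in a saved log that contain a given string.
# grep -c exits non-zero when the count is 0, so "|| true" keeps
# scripts that run under "set -e" from aborting.
count_matches() {
  grep -c -- "$1" "$2" || true
}

# Example usage (not run automatically):
#   dump_lemmy_logs /tmp/lemmy.log
#   count_matches ERROR /tmp/lemmy.log
```

Server logs can contain user data, so anything dumped publicly would need to be reviewed first.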

[–] Penguincoder 4 points 2 years ago* (last edited 2 years ago) (1 children)

Thanks for the suggestions. We are aware of how to review system logs and work to solve the issues. Right now there are a lot of moving parts, some of which we control and are responsible for, but a lot that we cannot.

As you know, an NGINX 500 issue is due to the server instance and not the client (you). For our stack, that could be an issue anywhere along the path with varnish, nginx, firewall rules, security/HIDS, host networking, docker networking, one or multiple services of the six containers, the docker service/daemon itself.

The issues are being addressed as we are able to troubleshoot, prove it, and verify a solution.

[–] BitOneZero 1 points 2 years ago (1 children)

Do you consider 0.17.4 a "stable" release of Lemmy that is proven and production ready, or more like an experimental project under active development?

I do not grasp why no GitHub issues are being opened to openly discuss these problems with the Lemmy platform, which I have seen on many instances.

[–] Penguincoder 3 points 2 years ago* (last edited 2 years ago) (1 children)

Every version of Lemmy is experimental and not really production ready. But it is in use and serving our needs, with a few pain points that are being worked on. I don't have the time to run down every single bug or issue we experience with Lemmy, in order to make a good, useful, bug report; certainly not enough time to do that and fix them.

So why should I post a GitHub issue for the devs to see, that is just another "hey this isn't working, fix please" complaint? They have hundreds of open issues already. When I find one I can prove and give sufficient details for, I make an issue. To do otherwise is a pretty entitled take.

I'm getting what I pay for and happy to contribute how I can. You're saying that's not enough and I need to do more. No thanks.

[–] BitOneZero 1 points 2 years ago* (last edited 2 years ago) (1 children)

You’re saying that’s not enough and I need to do more.

I see you run a Neurodiverse community here; maybe you are misinterpreting my Asperger syndrome. I posted here 8 days ago, and I'm revisiting it.

They have hundreds of open issues already.

I posted here 8 days ago and linked back to lemmy.ml having the same problem. I am the one doing the labor here of screaming out loud how serious this problem is, and it isn't like the other issues being posted on GitHub, which are mostly end-user wishlists for new features.

[–] Gaywallet 3 points 2 years ago (1 children)

I posted here 8 days ago and linked back to lemmy.ml having the same problem. I am the one doing the labor here of screaming out loud how serious this problem is, and it isn’t like the other issues being posted on GitHub, which are mostly end-user wishlists for new features.

Really sucks they aren't listening to you, they don't appear to be listening to us either. Best of luck screaming at them, hopefully they'll listen as they haven't fixed any of the bugs I opened either.

[–] BitOneZero 2 points 2 years ago

I have created new communities, /c/lemmyperformance and /c/lemmyfederation, to try not to clutter up GitHub with all this, but so far the problem of two servers reliably talking to each other (which are also running into the nginx 500/504/404 errors) seems to be a problem nobody has taken ownership of.

[–] Gaywallet 2 points 2 years ago (1 children)

sent this along

[–] BitOneZero 2 points 2 years ago* (last edited 2 years ago)

Thank you. It's what's going on inside of lemmy.ml that concerns me the most, and I just don't grasp why the people running that server aren't opening issues about the precise logged errors on their server so that newcomers to the project have an idea what is happening.

[–] Penguincoder 1 points 2 years ago (1 children)

Why are the crashes not being shared to developers?

Because not every issue we're experiencing, even the 500s, is a result of Lemmy or its code. There is no reason to share that with them.

[–] BitOneZero 1 points 2 years ago* (last edited 2 years ago) (1 children)

Because not every issue we’re experiencing, even the 500s, is a result of Lemmy or its code.

Then what are they, when Nginx is failing to talk to the NodeJS app? I also consider this more than code, as they are also giving recommendations for performance tuning various components, etc.

I have a lot of suspicion so far that federation activity is causing 500 and other errors due to how it queues (swarms) other peers. It isn't just the lemmy-ui webapp, smartphone users, and other end-users.

[–] Penguincoder 3 points 2 years ago (1 children)

[–] BitOneZero 1 points 2 years ago

If you aren't aware, Lemmy.ml has been down for the past 45 minutes, and could likely be causing your lemmy_server code to back up with all kinds of problems.

I'm actually working on these issues 10+ hours a day, for the past two weeks.

[–] BitOneZero 2 points 2 years ago (1 children)

Same thing. Another heavily used Lemmy instance has reports of the same problem: https://lemmy.ml/post/1271936

[–] slashzero 3 points 2 years ago

Interesting. I noticed similar behavior on https://startrek.website as well. I wonder if something else more global is going on?