Sysadmins for sysadmins

11 readers
1 users here now

Kažkas turi tai padaryti / Somebody has to do it

Related communities:

Fotkė / Photo camilo jimenez Unsplash

founded 2 years ago
MODERATORS
1
2
 
 

Tai užpylė serverinę priešgaisriniai čiaupai ar ne? :}

3
 
 

Will be interesting to see how it works out

The Indian nonprofit People+ai wants to fix this by creating an open and interoperable marketplace of cloud providers of all sizes. The Open Cloud Compute (OCC) project plans to use open protocols and standards to allow cloud providers of all sizes to offer their services on the network. It also plans to make it easy for customers to shift between offerings depending on their needs. People+ai held a hackathon on 20 September at People’s Education Society University (PES University) in Bengaluru to test out an early prototype of the platform.

4
 
 

Highlights

Failure is an expected state in production systems, and no predictable failure of either software or hardware components should result in a negative experience for users. The exact failure mode may vary, but certain remediation steps must be taken after detection. A common example is when an error occurs on a server, rendering it unfit for production workloads, and requiring action to recover.

It can be tempting to rely on the expertise of world-class engineers to remediate these faults, but this would be manual, repetitive, unlikely to produce enduring value, and not scaling.

The commonality of lower-priority failures makes it obvious when the response required, as defined in runbooks, is “toilsome”. To reduce this toil, we had previously implemented a plethora of solutions to automate runbook actions such as manually-invoked shell scripts, cron jobs, and ad-hoc software services. These had grown organically over time and provided solutions on a case-by-case basis, which led to duplication of work, tight coupling, and lack of context awareness across the solutions.

A good solution would not allow only the SRE team to auto-remediate, it would empower the entire company. The key to adding self-healing capability was a generic interface for all teams to self-service and quickly remediate failures at various levels: machine, service, network, or dependencies.

Temporal is a durable execution platform which is useful to gracefully manage infrastructure failures such as network outages and transient failures in external service endpoints. This capability meant we only needed to build a way to schedule “workflow” tasks and have Temporal provide reliability guarantees.

After a workflow is validated in the staging environment, we can then do a full release to production. It seems obvious, but catching simple configuration errors before releasing has saved us many hours in development/change-related-task time.

Building a system that is maintained by several SRE teams has allowed us to iterate faster, and rapidly tackle long-standing problems. We have set ambitious goals regarding toil elimination and are on course to achieve them, which will allow us to scale faster by eliminating the human bottleneck.

5
 
 

We, humanz, are very good in creating SPOFs

6
 
 

Thousands of machines running Linux have been infected by a malware strain that’s notable for its stealth, the number of misconfigurations it can exploit, and the breadth of malicious activities it can perform, researchers reported Thursday.

The malware has been circulating since at least 2021. It gets installed by exploiting more than 20,000 common misconfigurations, a capability that may make millions of machines connected to the internet potential targets, researchers from Aqua Security said. It can also exploit CVE-2023-33426, a vulnerability with a severity rating of 10 out of 10 that was patched last year in Apache RocketMQ, a messaging and streaming platform that’s found on many Linux machines.

7
 
 

Valtonen’s goal is to put CPUs back in their rightful, ‘central’ role. In order to do that, he and his team are proposing a new paradigm. Instead of trying to speed up computation by putting 16 identical CPU cores into, say, a laptop, a manufacturer could put 4 standard CPU cores and 64 of Flow Computing’s so-called parallel processing unit (PPU) cores into the same footprint, and achieve up to 100 times better performance. Valtonen and his collaborators laid out their case at the IEEE Hot Chips conference in August.

8
 
 

This is interesting and potentially useful for anyone, who works in the corp which does not allow Linux laptops, but you can get your hands on Macs.

9
 
 

OG

10
 
 

How We Built the Internet

Metadata

Highlights

The internet is a universe of its own.

The infrastructure that makes this scale possible is similarly astounding—a massive, global web of physical hardware, consisting of more than 5 billion kilometers of fiber-optic cable, more than 574 active and planned submarine cables that span a over 1 million kilometers in length, and a constellation of more than 5,400 satellites offering connectivity from low earth orbit (LEO).

“The Internet is no longer tracking the population of humans and the level of human use. The growth of the Internet is no longer bounded by human population growth, nor the number of hours in the day when humans are awake,” writes Geoff Huston, chief scientist at the nonprofit Asia Pacific Network Information Center.

As Shannon studied the structures of messages and language systems, he realized that there was a mathematical structure that underlied information. This meant that information could, in fact, be quantified.

Shannon noted that all information traveling from a sender to a recipient must pass through a channel, whether that channel be a wire or the atmosphere.

Shannon’s transformative insight was that every channel has a threshold—a maximum amount of information that can be delivered reliably to a sender.

Kleinrock approached AT&T and asked if the company would be interested in implementing such a system. AT&T rejected his proposal—most demand was still in analog communications. Instead, they told him to use the regular phone lines to send his digital communications—but that made no economic sense.

What was exceedingly clever about this suite of protocols was its generality. TCP and IP did not care which carrier technology transmitted its packets, whether it be copper wire, fiber-optic cable, or radio. And they imposed no constraints on what the bits could be formatted into—video text, simple messages, or even web pages formatted in a browser.

David Clark, one of the architects of the original internet, wrote in 1978 that “we should … prepare for the day when there are more than 256 networks in the Internet.”

Fiber was initially laid down by telecom companies offering high-quality cable television service to homes. The same lines would be used to provide internet access to these households. However, these service speeds were so fast that a whole new category of behavior became possible online. Information moved fast enough to make applications like video calling or video streaming a reality.

And while it may have been the government and small research groups that kickstarted the birth of the internet, its evolution henceforth was dictated by market forces, including service providers that offered cheaper-than-ever communication channels and users that primarily wanted to use those channels for entertainment.

In 2022, video streaming comprised nearly 58 percent of all Internet traffic. Netflix and YouTube alone accounted for 15 and 11 percent, respectively.

At the time, Facebook users in Asia or Africa had a completely different experience to their counterparts in the U.S. Their connection to a Facebook server had to travel halfway around the world, while users in the U.S. or Canada could enjoy nearly instantaneous service. To combat this, larger companies like Google, Facebook, Netflix, and others began storing their content physically closer to users through CDNs, or “content delivery networks.”

Instead of simply owning the CDNs that host your data, why not own the literal fiber cable that connects servers from the United States to the rest of the world?

Most of the world’s submarine cable capacity is now either partially or entirely owned by a FAANG company—meaning Facebook (Meta), Amazon, Apple, Netflix, or Google (Alphabet).

Google, which owns a number of sub-sea cables across the Atlantic and Pacific, can deliver hundreds of terabits per second through its infrastructure.

In other words, these applications have become so popular that they have had to leave traditional internet infrastructure and operate their services within their own private networks. These networks not only handle the physical layer, but also create new transfer protocols —totally disconnected from IP or TCP. Data is transferred on their own private protocols, essentially creating digital fiefdoms.

SpaceX’s Starlink is already unlocking a completely new way of providing service to millions. Its data packets, which travel to users via radio waves from low earth orbit, may soon be one of the fastest and most economical ways of delivering internet access to a majority of users on Earth. After all, the distance from LEO to the surface of the Earth is just a fraction of the length of subsea cables across the Atlantic and Pacific oceans.

What is next?

11
5
Incantations (josvisser.substack.com)
submitted 4 months ago by saint@group.lt to c/bofh@group.lt
 
 

Incantations

Metadata

Highlights

The problem with incantations is that you don’t understand in what exact circumstances they work. Change the circumstances, and your incantations might work, might not work anymore, might do something else, or maybe worse, might do lots of damage. It is not safe to rely on incantations, you need to move to understanding.

12
 
 

How much are your 9's worth?

Metadata

Highlights

All nines are not created equal. Most of the time I hear an extraordinarily high availability claim (anything above 99.9%) I immediately start thinking about how that number is calculated and wondering how realistic it is.

Human beings are funny, though. It turns out we respond pretty well to simplicity and order.

Having a single number to measure service health is a great way for humans to look at a table of historical availability and understand if service availability is getting better or worse. It’s also the best way to create accountability and measure behavior over time…

… as long as your measurement is reasonably accurate and not a vanity metric.

Cheat #1 - Measure the narrowest path possible.

This is the easiest way to cheat a 9’s metric. Many nines numbers I have seen are various version of this cheat code. How can we create a narrow measurement path?

Cheat #2 - Lump everything into a single bucket.

Not all requests are created equal.

Cheat #3 - Don’t measure latency.

This is an availability metric we’re talking about here, why would we care about how long things take, as long as they are successful?!

Cheat #4 - Measure total volume, not minutes.

Let’s get a little controversial.

In order to cheat the metric we want to choose the calculation that looks the best, since even though we might have been having a bad time for 3 hours (1 out of every 10 requests was failing), not every customer was impacted so it wouldn’t be “fair” to count that time against us.

Building more specific models of customer paths is manual. It requires more manual effort and customization to build a model of customer behavior (read: engineering time). Sometimes we just don’t have people with the time or specialization to do this, or it will cost to much to maintain it in the future.

We don’t have data on all of the customer scenarios. In this case we just can’t measure enough to be sure what our availability is.

Sometimes we really don’t care (and neither do our customers). Some of the pages we build for our websites are… not very useful. Sometimes spending the time to measure (or fix) these scenarios just isn’t worth the effort. It’s important to focus on important scenarios for your customers and not waste engineering effort on things that aren’t very important (this is a very good way to create an ineffective availability effort at a company).

Mental shortcuts matter. No matter how much education we try, it’s hard to change perceptions of executives, engineers, etc. Sometimes it is better to pick the abstraction that helps people understand than pick the most accurate one.

Data volume and data quality are important to measurement. If we don’t have a good idea of which errors are “okay” and which are not, or we just don’t have that much traffic, some of these measurements become almost useless (what is the SLO of a website with 3 requests? does it matter?).

What is your way of cheating nines? ;)

13
3
Composite SLO (blog.alexewerlof.com)
submitted 6 months ago by saint@group.lt to c/bofh@group.lt
 
 

How to calculate SLO

14
 
 

cross-posted from: https://feddit.it/post/7752642

A week of downtime and all the servers were recovered only because the customer had a proper disaster recovery protocol and held backups somewhere else, otherwise Google deleted the backups too

Google cloud ceo says "it won't happen anymore", it's insane that there's the possibility of "instant delete everything"

15
 
 

(again)

16
 
 

by @geerlingguy@mastodon.social

17
 
 

We run a bit of software called tabelau, I have had to restart it over night and the server hit 113 on the load average. on a 16 core box.

please tell me thats mad for any software

18
 
 

Good overview on how it works and why being compliant does not mean being secure.

19
 
 

Great article

20
 
 

What the title says - pros/cons

21
 
 

Interesting take - RIP Redis: How Garantia Data pulled off the biggest heist in open source history https://lnkd.in/ezme7dbw #redis #opensource

22
23
 
 

Seni krienai tauzyja apie IT

24
2
5 SRE Predictions For 2024 (www.codereliant.io)
submitted 9 months ago by saint@group.lt to c/bofh@group.lt
 
 

What about yours? What do you predict?

25
view more: next ›