Softpanorama

May the source be with you, but remember the KISS principle ;-)
Home Switchboard Unix Administration Red Hat TCP/IP Networks Neoliberalism Toxic Managers
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and  bastardization of classic Unix

Pure stupidity

A collection of educational and horrifying tales of SGS (Spontaneous Gross Stupidity)

News Sysadmin Horror Stories Recommended Links An observation about corporate security departments Mistakes made because of the differences between various Unix/Linux flavors Missing backup horror stories Lack of testing complex, potentially destructive, commands before execution of production box Creative uses of rm
Locking yourself out Premature or misguided optimization Reboot Blunders  Performing the operation on a wrong server Executing command in a wrong directory Side effects of performing operations on home or application directories Typos in the commands with disastrous consequences Side effects of patching
Multiple sysadmin working on the same box Side effects of patching of the customized server Ownership changing blunders Dot-star-errors and regular expressions blunders Excessive zeal in improving security of the system Unintended consequences of automatic system maintenance scripts LVM mishaps Abuse of privileges
Humor Chronicle Top 10 Classic Unix Humor Softpanorama Humor Archive Perl Shell Linus Torvalds  Networking Debugging
Safe-rm Workaholism and Burnout Coping with the toxic stress in IT environment The Unix Hater’s Handbook Tips Horror stories History Humor Etc
  Unix: Modern operating system carefully crafted to prevent administrators from shooting themselves in the foot[1].

[1] Interestingly, most utilities have a command line option which will cause the system to rip the user's legs off and beat them to death with the soggy ends. This is often the default behaviour.

Bruce Murphy

The consequences of any action are never be fully understood until after it's too late to do anything about it.

Eric The Read

If all you have is an axe, every problem looks like hours of fun.

Frossie

"Unix for Dummies" is surely a single page pull out with "don't" printed on it.

Unknown

First of all, rephrasing Oscar Wilde, stupidity is rarely pure and never simple. Reasons for disastrous blunders vary, but in many cases they are more like a result of unique confluence of circumstances then purely voluntary blunder of sysadmin. Often the most stupid actions are done as a reaction to the problem and desire to fix it quickly without analyzing all the available information (for example if NFS server hangs and you can do nothing with it)

For example one sysadmin operating with Dell blade enclosure mistaken the operation of power down of the whole enclosure for the power down of the current blade server (and it really was not very clear in version of CM he used; in this respect Dell sucks) and powered down 16 production servers with one click.  Stupidity? Yes. Pure? No (the person was misled by the deficiency of Dell interface which suggested that the operation will be performed on the current blade). Simple? Yes (in such operation checking twice whether you do a right thing and giving yourself a pause before hitting ENTER is a must) .  After the power was restored, the switch in the enclosure failed to operate properly and lost the connection to the internal network because the person who configured it used old CISCO command for saving configuration, which did not work on this version and thus the configuration was not saved. Stupidity? Yes. Pure ? Yes (Checking that the backup was made before logging out after making any changes is sine qua non  of any network engineer)  Simple? No (the command set of the switch is pretty complex and it is easy to make a mistake, if you mostly work with older version; also there was no diagnostics that the  command  was wrong -- it was simply ignored )

The most common example is attempt to reboot the box without gathering all the necessary information about the problems (and all evidence will be destroyed by the reboot, which might or might not help).  A good procedure to follow is that "emergency reboot", if possible, can be done only after one hour of analyzing the situation.

Problems with the server  often cascade and for example hang NFS server exposes "embellishments" such as the sad fact that parts of  root dot files were sourced from NFS filesystem. And you can't login to root without cancelling dot files by execution with Ctrl-C. But in this changed atmostphere it is not that easy to  understand that there is a problem with your dot files in addition to problem with the NFS server and that they are related requires. It requres some time and a cool head.  

In this sense using the default dot files of the root user is probably a good, safe practice.

As Linux is excessively complex OS. That means that from one problem to another you usually forget most of the vital information, or the fact that there were similar situations in the past, so instead of jumping into action reading your notes about previous crash often helps. This is probably the most intelligent  step in troubleshooting you can do. Assuming you have such notes. Not having them is a classic  example of pure stupidity ;-). People forget important things. For example that they changes the scheme of passwords for the DRAC/ILO access, that they put some additional backup that can save the situation of an FIT flash drive installed in one of the server USB ports. That  there is a copy of destroyed data on a remote server because at one time you put rsync command in crontab for some reason and it remains indefinitely. 

Not having adequate internal documentation is a classic  example of "pure stupidity" ;-). systems are just too complex to remeber all nuances, all actiona, all decisions made, all script written, etc. Often the consequences are as tragic as not having the latest backup of the data in case of a harddrive crash. For example it is very eary to wipe out data in a heat of the moment if you do not fully understand the situation and do not remeber your own past actions.  In other words, any sysadmin who does not create his/her own database of relevant notes, clips, etc is a bad, stupid sysadmin. And one day he/she might  pay the price such such recklessness. No question about it.

Of course, no matter how hard your try, your notes typically will be inadequate in the crisis you experinces. But they provide a  very important and often indispensible starting point. Vital context for the situation you got in. 

Sometime trying to meet tight deadline  (and being exhausted), or in the situation when "nothing works" ( and being frustrated) people mechanically (on "autopilot") type wrong, but similar command as if part of the brain functions autonomously from the rest ( for example mkfs.ext3 instead of fsck.ext3; or /etc instead of etc) and instantly recognize the blunder, but only after hitting Enter key.

One  simple recommendation is to make hitting Enter key providing a prompt in "fishy" circumstances by writing simple wrappers tha tare "Are you really want to do this?" for such commands as rm, chown, reboot and like.

 Another typical source of blunders is using find -exec option without sufficient testing under time pressure. Hurry slowly is one of the saying that are very true for sysadmin. Sometimes your emotional state contribute to the problems: you didn’t have much sleep or your mind was distracted by your personal life problems. In such days it is important to slow down and be extra cautious.

Often some stupid mistake is complicated by the subsequent attempt of cover-up as it is too embarrassing to admit the real cause of the problem

Often some stupid mistake is complicated by the subsequent attempt of  cover-up as it is too embarrassing to admit the real cause of the problem.  Users and fellow sysadmins are usually good at breaking stuff by ''changing nothing'', ''touching nothing'' or even "doing nothing'"... 

Commands like rm -rf, chown -r, chmod -r, kill -9, and shutdown/reboot/halt  troika often play nasty tricks with people only peripherally acquainted with Unix but for some reason who are given root access. 

If I had to hire a Unix sysadmin, the first thing I'd look for is experience.  Nothing can substitute for real-life experience in this field and often the way to acquire it are living through some really nasty situations. In Army jargon  a bad situation, mistake, or cause of trouble often is called SNAFU.  Here is one interesting example:

... My friend was forced to work with some sysadmins who didn't have their act together.  One day, one of them was "cleaning" the filesystem and saw a file called "vmunix" in /. "Hmm, this is taking up a lot of space - let's delete it".  "rm /vmunix".

My friend had to reinstall the entire OS on that machine after his coworker did this "cleanup".  Ahh, the hazards of working with sysadmins who really shouldn't be sysadmins in the first place.

Few more examples from the Unofficial Unix Administration Horror Story Summary:

Typos in the commands with disastrous consequences are rare, but pressing Enter before checking the command can lead to a real SNAFU:

A classic, probably unmatched case of corporate security departments stupidity: blocking ICMP

  TL;DR, short for "too long; didn't read", is Internet slang to say that some text being replied to has been ignored because of its length. In slang it can also stand for "Too lazy; didn't read". It is also used as a signifier for a summary of an online post or news article.

[Jan 10, 2019]  When idiots are offloaded to security department, interesting things with network eventually happen

Security department often does more damage to the network then any sophisticated hacker can.  Especially if they are populated with morons, as they usually are.  One of the  most blatant examples is below... Those idiots decided to disable Traceroute (which means ICMP) in order to increase security.
An this is not a unique case; I observed such cases in large companies.
May 27, 2018 | linux.slashdot.org

jfdavis668 ( 1414919 ) , Sunday May 27, 2018 @11:09AM ( #56682996 )

Re:So ( Score: 5 , Interesting)

Traceroute is disabled on every network I work with to prevent intruders from determining the network structure. Real pain in the neck, but one of those things we face to secure systems.

Anonymous Coward writes:
Re: ( Score: 2 , Insightful)

What is the point? If an intruder is already there couldn't they just upload their own binary?

Hylandr ( 813770 ) , Sunday May 27, 2018 @05:57PM ( #56685274 )
Re: So ( Score: 5 , Interesting)

They can easily. And often time will compile their own tools, versions of Apache, etc..

At best it slows down incident response and resolution while doing nothing to prevent discovery of their networks. If you only use Vlans to segregate your architecture you're boned.

gweihir ( 88907 ) , Sunday May 27, 2018 @12:19PM ( #56683422 )
Re: So ( Score: 5 , Interesting)

Also really stupid. A competent attacker (and only those manage it into your network, right?) is not even slowed down by things like this.

bferrell ( 253291 ) , Sunday May 27, 2018 @12:20PM ( #56683430 ) Homepage Journal
Re: So ( Score: 4 , Interesting)

Except it DOESN'T secure anything, simply renders things a little more obscure... Since when is obscurity security?

fluffernutter ( 1411889 ) writes:
Re: ( Score: 3 )

Doing something to make things more difficult for a hacker is better than doing nothing to make things more difficult for a hacker. Unless you're lazy, as many of these things should be done as possible.

DamnOregonian ( 963763 ) , Sunday May 27, 2018 @04:37PM ( #56684878 )
Re:So ( Score: 5 , Insightful)

No.

Things like this don't slow down "hackers" with even a modicum of network knowledge inside of a functioning network. What they do slow down is your ability to troubleshoot network problems.

Breaking into a network is a slow process. Slow and precise. Trying to fix problems is a fast reactionary process. Who do you really think you're hurting? Yes another example of how ignorant opinions can become common sense.

mSparks43 ( 757109 ) writes:
Re: So ( Score: 2 )

Pretty much my reaction. like WTF? OTON, redhat flavors all still on glibc2 starting to become a regular p.i.t.a. so the chances of this actually becoming a thing to be concerned about seem very low.

Kinda like gdpr, same kind of groupthink that anyone actually cares or concerns themselves with policy these days.

ruir ( 2709173 ) writes:
Re: ( Score: 3 )

Disable all ICMP is not feasible as you will be disabling MTU negotiation and destination unreachable messages. You are essentially breaking the TCP/IP protocol. And if you want the protocol working OK, then people can do traceroute via HTTP messages or ICMP echo and reply.

Or they can do reverse traceroute at least until the border edge of your firewall via an external site.

DamnOregonian ( 963763 ) , Sunday May 27, 2018 @04:32PM ( #56684858 )
Re:So ( Score: 4 , Insightful)

You have no fucking idea what you're talking about. I run a multi-regional network with over 130 peers. Nobody "disables ICMP". IP breaks without it. Some folks, generally the dimmer of us, will disable echo responses or TTL expiration notices thinking it is somehow secure (and they are very fucking wrong) but nobody blocks all ICMP, except for very very dim witted humans, and only on endpoint nodes.

DamnOregonian ( 963763 ) writes:
Re: ( Score: 3 )

That's hilarious... I am *the guy* who runs the network. I am our senior network engineer. Every line in every router -- mine.

You have no idea what you're talking about, at any level. "disabled ICMP" - state statement alone requires such ignorance to make that I'm not sure why I'm even replying to ignorant ass.

DamnOregonian ( 963763 ) writes:
Re: ( Score: 3 )

Nonsense. I conceded that morons may actually go through the work to totally break their PMTUD, IP error signaling channels, and make their nodes "invisible"

I understand "networking" at a level I'm pretty sure you only have a foggy understanding of. I write applications that require layer-2 packet building all the way up to layer-4.

In short, he's a moron. I have reason to suspect you might be, too.

DamnOregonian ( 963763 ) writes:
Re: ( Score: 3 )

A CDS is MAC. Turning off ICMP toward people who aren't allowed to access your node/network is understandable. They can't get anything else though, why bother supporting the IP control channel? CDS does *not* say turn off ICMP globally. I deal with CDS, SSAE16 SOC 2, and PCI compliance daily. If your CDS solution only operates with a layer-4 ACL, it's a pretty simple model, or You're Doing It Wrong (TM)

nyet ( 19118 ) writes:
Re: ( Score: 3 )

> I'm not a network person

IOW, nothing you say about networking should be taken seriously.

kevmeister ( 979231 ) , Sunday May 27, 2018 @05:47PM ( #56685234 ) Homepage
Re:So ( Score: 4 , Insightful)

No, TCP/IP is not working fine. It's broken and is costing you performance and $$$. But it is not evident because TCP/IP is very good about dealing with broken networks, like yours.

The problem is that doing this requires things like packet fragmentation which greatly increases router CPU load and reduces the maximum PPS of your network as well s resulting in dropped packets requiring re-transmission and may also result in widow collapse fallowed with slow-start, though rapid recovery mitigates much of this, it's still not free.

It's another example of security by stupidity which seldom provides security, but always buys added cost.

Hylandr ( 813770 ) writes:
Re: ( Score: 3 )

As a server engineer I am experiencing this with our network team right now.

Do you have some reading that I might be able to further educate myself? I would like to be able to prove to the directors why disabling ICMP on the network may be the cause of our issues.

Zaelath ( 2588189 ) , Sunday May 27, 2018 @07:51PM ( #56685758 )
Re:So ( Score: 4 , Informative)

A brief read suggests this is a good resource: https://john.albin.net/essenti... [albin.net]

Bing Tsher E ( 943915 ) , Sunday May 27, 2018 @01:22PM ( #56683792 ) Journal
Re: Denying ICMP echo @ server/workstation level t ( Score: 5 , Insightful)

Linux has one of the few IP stacks that isn't derived from the BSD stack, which in the industry is considered the reference design. Instead for linux, a new stack with it's own bugs and peculiarities was cobbled up.

Reference designs are a good thing to promote interoperability. As far as TCP/IP is concerned, linux is the biggest and ugliest stepchild. A theme that fits well into this whole discussion topic, actually.

 


Top Visited
Switchboard
Latest
Past week
Past month

NEWS CONTENTS

Old News ;-)

[Sep 29, 2019] IPTABLES makes corporate security scans go away

Sep 29, 2019 | www.reddit.com

r/ShittySysadmin • Posted by u/TBoneJeeper 1 month ago IPTABLES makes corporate security scans go away

In a remote office location, corporate's network security scans can cause many false alarms and even take down services if they are tickled the wrong way. Dropping all traffic from the scanner's IP is a great time/resource-saver. No vulnerability reports, no follow-ups with corporate. No time for that. 12 comments 93% Upvoted What are your thoughts? Log in or Sign up log in sign up Sort by level 1 name_censored_ 9 points · 1 month ago

Seems a bit like a bandaid to me.

level 2 TBoneJeeper 3 points · 1 month ago

Good ideas, but sounds like a lot of work. Just dropping their packets had the desired effect and took 30 seconds. level 3 name_censored_ 5 points · 1 month ago

No-one ever said being lazy was supposed to be easy. level 2 spyingwind 2 points · 1 month ago

To be serious, closing of unused ports is good practice. Even better if used services can only be accessed from know sources. Such as the DB only allows access from the App server. A jump box, like a guacd server for remote access for things like RDP and SSH, would help reduce the threat surface. Or go further and setup Ansible/Chef/etc to allow only authorized changes. level 3 gortonsfiJr 2 points · 1 month ago

Except, seriously, in my experience the security teams demand that you make big security holes for them in your boxes, so that they can hammer away at them looking for security holes. level 4 asmiggs 1 point · 1 month ago

Security teams will always invoke the worst case scenario, 'what if your firewall is borked?', 'what if your jumpbox is hacked?' etc. You can usually give their scanner exclusive access to get past these things but surprise surprise the only worst case scenario I've faced is 'what if your security scanner goes rogue?'. level 5 gortonsfiJr 1 point · 1 month ago

What if you lose control of your AD domain and some rogue agent gets domain admin rights? Also, we're going to need domain admin rights.

...Is this a test? level 6 spyingwind 1 point · 1 month ago

What if an attacker was pretending o be a security company? No DA access! You can plug in anywhere, but if port security blocks your scanner, then I can't help. Also only 80 and 443 are allowed into our network. level 3 TBoneJeeper 1 point · 1 month ago

Agree. But in rare cases, the ports/services are still used (maybe rarely), yet have "vulnerabilities" that are difficult to address. Some of these scanners hammer services so hard, trying every CGI/PHP/java exploit known to man in rapid succession, and older hardware/services cannot keep up and get wedged. I remember every Tuesday night I would have to go restart services because this is when they were scanned. Either vendor support for this software version was no longer available, or would simply require too much time to open vendor support cases to report the issues, argue with 1st level support, escalate, work with engineering, test fixes, etc. level 1 rumplestripeskin 1 point · 1 month ago

Yes... and use Ansible to update iptables on each of your Linux VMs. level 1 rumplestripeskin 1 point · 1 month ago

I know somebody who actually did this. level 2 TBoneJeeper 2 points · 1 month ago

Maybe we worked together :-)

[Jul 26, 2019] The day the virtual machine manager died by Nathan Lager

"Dangerous" commands like dd should probably be always typed first in the editor and only when you verity that you did not make a blunder , executed...
A good decision was to go home and think the situation over, not to aggravate it with impulsive attempts to correct the situation, which typically only make it worse.
Lack of checking of the health of backups suggest that this guy is an arrogant sucker, despite his 20 years of sysadmin experience.
Notable quotes:
"... I started dd as root , over the top of an EXISTING DISK ON A RUNNING VM. What kind of idiot does that?! ..."
"... Since my VMs were still running, and I'd already done enough damage for one night, I stopped touching things and went home. ..."
Jul 26, 2019 | www.redhat.com

... ... ...

See, my RHEV manager was a VM running on a stand-alone Kernel-based Virtual Machine (KVM) host, separate from the cluster it manages. I had been running RHEV since version 3.0, before hosted engines were a thing, and I hadn't gone through the effort of migrating. I was already in the process of building a new set of clusters with a new manager, but this older manager was still controlling most of our production VMs. It had filled its disk again, and the underlying database had stopped itself to avoid corruption.

See, for whatever reason, we had never set up disk space monitoring on this system. It's not like it was an important box, right?

So, I logged into the KVM host that ran the VM, and started the well-known procedure of creating a new empty disk file, and then attaching it via virsh . The procedure goes something like this: Become root , use dd to write a stream of zeros to a new file, of the proper size, in the proper location, then use virsh to attach the new disk to the already running VM. Then, of course, log into the VM and do your disk expansion.

I logged in, ran sudo -i , and started my work. I ran cd /var/lib/libvirt/images , ran ls -l to find the existing disk images, and then started carefully crafting my dd command:

dd ... bs=1k count=40000000 if=/dev/zero ... of=./vmname-disk ...

Which was the next disk again? <Tab> of=vmname-disk2.img <Back arrow, Back arrow, Back arrow, Back arrow, Backspace> Don't want to dd over the existing disk, that'd be bad. Let's change that 2 to a 3 , and Enter . OH CRAP, I CHANGED THE 2 TO A 2 NOT A 3 ! <Ctrl+C><Ctrl+C><Ctrl+C><Ctrl+C><Ctrl+C><Ctrl+C>

I still get sick thinking about this. I'd done the stupidest thing I possibly could have done, I started dd as root , over the top of an EXISTING DISK ON A RUNNING VM. What kind of idiot does that?! (The kind that's at work late, trying to get this one little thing done before he heads off to see his friend. The kind that thinks he knows better, and thought he was careful enough to not make such a newbie mistake. Gah.)

So, how fast does dd start writing zeros? Faster than I can move my fingers from the Enter key to the Ctrl+C keys. I tried a number of things to recover the running disk from memory, but all I did was make things worse, I think. The system was still up, but still broken like it was before I touched it, so it was useless.

Since my VMs were still running, and I'd already done enough damage for one night, I stopped touching things and went home. The next day I owned up to the boss and co-workers pretty much the moment I walked in the door. We started taking an inventory of what we had, and what was lost. I had taken the precaution of setting up backups ages ago. So, we thought we had that to fall back on.

I opened a ticket with Red Hat support and filled them in on how dumb I'd been. I can only imagine the reaction of the support person when they read my ticket. I worked a help desk for years, I know how this usually goes. They probably gathered their closest coworkers to mourn for my loss, or get some entertainment out of the guy who'd been so foolish. (I say this in jest. Red Hat's support was awesome through this whole ordeal, and I'll tell you how soon. )

So, I figured the next thing I would need from my broken server, which was still running, was the backups I'd diligently been collecting. They were on the VM but on a separate virtual disk, so I figured they were safe. The disk I'd overwritten was the last disk I'd made to expand the volume the database was on, so that logical volume was toast, but I've always set up my servers such that the main mounts -- / , /var , /home , /tmp , and /root -- were all separate logical volumes.

In this case, /backup was an entirely separate virtual disk. So, I scp -r 'd the entire /backup mount to my laptop. It copied, and I felt a little sigh of relief. All of my production systems were still running, and I had my backup. My hope was that these factors would mean a relatively simple recovery: Build a new VM, install RHEV-M, and restore my backup. Simple right?

By now, my boss had involved the rest of the directors, and let them know that we were looking down the barrel of a possibly bad time. We started organizing a team meeting to discuss how we were going to get through this. I returned to my desk and looked through the backups I had copied from the broken server. All the files were there, but they were tiny. Like, a couple hundred kilobytes each, instead of the hundreds of megabytes or even gigabytes that they should have been.

Happy feeling, gone.

Turns out, my backups were running, but at some point after an RHEV upgrade, the database backup utility had changed. Remember how I said this system had existed since version 3.0? Well, 3.0 didn't have an engine-backup utility, so in my RHEV training, we'd learned how to make our own. Mine broke when the tools changed, and for who knows how long, it had been getting an incomplete backup -- just some files from /etc .

No database. Ohhhh ... Fudge. (I didn't say "Fudge.")

I updated my support case with the bad news and started wondering what it would take to break through one of these 4th-floor windows right next to my desk. (Ok, not really.)

At this point, we basically had three RHEV clusters with no manager. One of those was for development work, but the other two were all production. We started using these team meetings to discuss how to recover from this mess. I don't know what the rest of my team was thinking about me, but I can say that everyone was surprisingly supportive and un-accusatory. I mean, with one typo I'd thrown off the entire department. Projects were put on hold and workflows were disrupted, but at least we had time: We couldn't reboot machines, we couldn't change configurations, and couldn't get to VM consoles, but at least everything was still up and operating.

Red Hat support had escalated my SNAFU to an RHEV engineer, a guy I'd worked with in the past. I don't know if he remembered me, but I remembered him, and he came through yet again. About a week in, for some unknown reason (we never figured out why), our Windows VMs started dropping offline. They were still running as far as we could tell, but they dropped off the network, Just boom. Offline. In the course of a workday, we lost about a dozen windows systems. All of our RHEL machines were working fine, so it was just some Windows machines, and not even every Windows machine -- about a dozen of them.

Well great, how could this get worse? Oh right, add a ticking time bomb. Why were the Windows servers dropping off? Would they all eventually drop off? Would the RHEL systems eventually drop off? I made a panicked call back to support, emailed my account rep, and called in every favor I'd ever collected from contacts I had within Red Hat to get help as quickly as possible.

I ended up on a conference call with two support engineers, and we got to work. After about 30 minutes on the phone, we'd worked out the most insane recovery method. We had the newer RHEV manager I mentioned earlier, that was still up and running, and had two new clusters attached to it. Our recovery goal was to get all of our workloads moved from the broken clusters to these two new clusters.

Want to know how we ended up doing it? Well, as our Windows VMs were dropping like flies, the engineers and I came up with this plan. My clusters used a Fibre Channel Storage Area Network (SAN) as their storage domains. We took a machine that was not in use, but had a Fibre Channel host bus adapter (HBA) in it, and attached the logical unit numbers (LUNs) for both the old cluster's storage domains and the new cluster's storage domains to it. The plan there was to make a new VM on the new clusters, attach blank disks of the proper size to the new VM, and then use dd (the irony is not lost on me) to block-for-block copy the old broken VM disk over to the newly created empty VM disk.

I don't know if you've ever delved deeply into an RHEV storage domain, but under the covers it's all Logical Volume Manager (LVM). The problem is, the LV's aren't human-readable. They're just universally-unique identifiers (UUIDs) that the RHEV manager's database links from VM to disk. These VMs are running, but we don't have the database to reference. So how do you get this data?

virsh ...

Luckily, I managed KVM and Xen clusters long before RHEV was a thing that was viable. I was no stranger to libvirt 's virsh utility. With the proper authentication -- which the engineers gave to me -- I was able to virsh dumpxml on a source VM while it was running, get all the info I needed about its memory, disk, CPUs, and even MAC address, and then create an empty clone of it on the new clusters.

Once I felt everything was perfect, I would shut down the VM on the broken cluster with either virsh shutdown , or by logging into the VM and shutting it down. The catch here is that if I missed something and shut down that VM, there was no way I'd be able to power it back on. Once the data was no longer in memory, the config would be completely lost, since that information is all in the database -- and I'd hosed that. Once I had everything, I'd log into my migration host (the one that was connected to both storage domains) and use dd to copy, bit-for-bit, the source storage domain disk over to the destination storage domain disk. Talk about nerve-wracking, but it worked! We picked one of the broken windows VMs and followed this process, and within about half an hour we'd completed all of the steps and brought it back online.

We did hit one snag, though. See, we'd used snapshots here and there. RHEV snapshots are lvm snapshots. Consolidating them without the RHEV manager was a bit of a chore, and took even more leg work and research before we could dd the disks. I had to mimic the snapshot tree by creating symbolic links in the right places, and then start the dd process. I worked that one out late that evening after the engineers were off, probably enjoying time with their families. They asked me to write the process up in detail later. I suspect that it turned into some internal Red Hat documentation, never to be given to a customer because of the chance of royally hosing your storage domain.

Somehow, over the course of 3 months and probably a dozen scheduled maintenance windows, I managed to migrate every single VM (of about 100 VMs) from the old zombie clusters to the working clusters. This migration included our Zimbra collaboration system (10 VMs in itself), our file servers (another dozen VMs), our Enterprise Resource Planning (ERP) platform, and even Oracle databases.

We didn't lose a single VM and had no more unplanned outages. The Red Hat Enterprise Linux (RHEL) systems, and even some Windows systems, never fell to the mysterious drop-off that those dozen or so Windows servers did early on. During this ordeal, though, I had trouble sleeping. I was stressed out and felt so guilty for creating all this work for my co-workers, I even had trouble eating. No exaggeration, I lost 10lbs.

So, don't be like Nate. Monitor your important systems, check your backups, and for all that's holy, double-check your dd output file. That way, you won't have drama, and can truly enjoy Sysadmin Appreciation Day!

Nathan Lager is an experienced sysadmin, with 20 years in the industry. He runs his own blog at undrground.org, and hosts the Iron Sysadmin Podcast. More about me

[Jan 29, 2019] A new term PEBKAC

Jan 29, 2019 | thwack.solarwinds.com

dtreloar Jul 30, 2015 8:51 PM PEBKAC

P roblem

E xists

B etween

K eyboard

A nd

C hair

or the most common fault is the id ten t or ID10T

[Jan 29, 2019] Are you sure?

Jan 29, 2019 | thwack.solarwinds.com

RichardLetts

Jul 13, 2015 8:13 PM Dealing with my ISP:

Me: There is a problem with your head-end router, you need to get an engineer to troubleshoot it

Them: no the problem is with your cable modem and router, we can see it fine on our network

Me: That's interesting because I powered it off and disconnected it from the wall before we started this conversation.

Them: Are you sure?

Me: I'm pretty sure that the lack of blinky lights means it's got no power but if you think it's still working fine then I'd suggest the problem at your end of this phone conversation and not at my end.

[Jan 29, 2019] It helps if somebody checked if the equpment really has power, but often this step is skipped.

Notable quotes:
"... On closer inspection, noticed this power lead was only half in the socket... I connected this back to the original switch, grabbed the "I.T manager" and asked him to "just push the power lead"... his face? Looked like Casper the friendly ghost. ..."
Jan 29, 2019 | thwack.solarwinds.com

nantwiched Jul 13, 2015 11:18 AM

I've had a few horrors, heres a few...

Had to travel from Cheshire to Glasgow (4+hours) at 3am to get to a major high street store for 8am, an hour before opening. A switch had failed and taken out a whole floor of the store. So I prepped the new switch, using the same power lead from the failed switch as that was the only available lead / socket. No power. Initially thought the replacement switch was faulty and I would be in trouble for not testing this prior to attending site...

On closer inspection, noticed this power lead was only half in the socket... I connected this back to the original switch, grabbed the "I.T manager" and asked him to "just push the power lead"... his face? Looked like Casper the friendly ghost.

Problem solved at a massive expense to the company due to the out of hours charges. Surely that would be the first thing to check? Obviously not...

The same thing happened in Aberdeen, a 13 hour round trip to resolve a fault on a "failed router". The router looked dead at first glance, but after taking the side panel off the cabinet, I discovered it always helps if the router is actually plugged in...

Yet the customer clearly said everything is plugged in as it should be and it "must be faulty"... It does tend to appear faulty when not supplied with any power...

[Jan 29, 2019] It can be hot inside the rack

Jan 29, 2019 | thwack.solarwinds.com

jemertz Mar 28, 2016 12:16 PM

Shortly after I started my first remote server-monitoring job, I started receiving, one by one, traps for servers that had gone heartbeat missing/no-ping at a remote site. I looked up the site, and there were 16 total servers there, of which about 4 or 5 (and counting) were already down. Clearly not network issues. I remoted into one of the ones that was still up, and found in the Windows event viewer that it was beginning to overheat.

I contacted my front-line team and asked them to call the site to find out if the data center air conditioner had gone out, or if there was something blocking the servers' fans or something. He called, the client at the site checked and said the data center was fine, so I dispatched IBM (our remote hands) to go to the site and check out the servers. They got there and called in laughing.

There was construction in the data center, and the contractors, being thoughtful, had draped a painter's dropcloth over the server racks to keep off saw dust. Of COURSE this caused the servers to overheat. Somehow the client had failed to mention this.

...so after all this went down, the client had the gall to ask us to replace the servers "just in case" there was any damage, despite the fact that each of them had shut itself down in order to prevent thermal damage. We went ahead and replaced them anyway. (I'm sure they were rebuilt and sent to other clients, but installing these servers on site takes about 2-3 hours of IBM's time on site and 60-90 minutes of my remote team's time, not counting the rebuild before recycling.
Oh well. My employer paid me for my time, so no skin off my back.

[Jan 29, 2019] "Sure, I get out my laptop, plug in the network cable, get on the internet from home. I start the VPN client, take out this paper with the code on it, and type it in..." Yup. He wrote down the RSA token's code before he went home.

Jan 29, 2019 | thwack.solarwinds.com

jm_sysadmin Expert Jul 8, 2015 7:04 AM

I was just starting my IT career, and I was told a VIP user couldn't VPN in, and I was asked to help. Everything checked out with the computer, so I asked the user to try it in front of me. He took out his RSA token, knew what to do with it, and it worked.

I also knew this user had been complaining of this issue for some time, and I wasn't the first person to try to fix this. Something wasn't right.

I asked him to walk me through every step he took from when it failed the night before.

"Sure, I get out my laptop, plug in the network cable, get on the internet from home. I start the VPN client, take out this paper with the code on it, and type it in..." Yup. He wrote down the RSA token's code before he went home. See that little thing was expensive, and he didn't want to lose it. I explained that the number changes all time, and that he needed to have it with him. VPN issue resolved.

[Jan 29, 2019] How electricians can help to improve server uptime

Notable quotes:
"... "Oh my God, the server room is full of smoke!" Somehow they hooked up things wrong and fed 220v instead of 110v to all the circuits. Every single UPS was dead. Several of the server power supplies were fried. ..."
Jan 29, 2019 | thwack.solarwinds.com

wfordham Jul 13, 2015 1:09 PM

This happened back when we had an individual APC UPS for each server. Most of the servers were really just whitebox PCs in a rack mount case running a server OS.

The facilities department was doing some planned maintenance on the electrical panel in the server room over the weekend. They assured me that they were not going to touch any of the circuits for the server room, just for the rooms across the hallway. Well, they disconnected power to the entire panel. Then they called me to let me know what they did. I was able to remotely verify that everything was running on battery just fine. I let them know that they had about 20 minutes to restore power or I would need to start shutting down servers. They called me again and said,

"Oh my God, the server room is full of smoke!" Somehow they hooked up things wrong and fed 220v instead of 110v to all the circuits. Every single UPS was dead. Several of the server power supplies were fried.

And a few motherboards didn't make it either. It took me the rest of the weekend kludging things together to get the critical systems back online.

[Jan 28, 2019] Format of wrong particon initiated during RHEL install

Notable quotes:
"... Look at the screen, check out what it is doing, realize that the installer had grabbed the backend and he said yeah format all(we are not sure exactly how he did it). ..."
Jan 28, 2019 | www.reddit.com

kitched 5 points 6 points 7 points 3 years ago (2 children)

~10 years ago. 100GB drives on a node attached to an 8TB SAN. Cabling is all hooked up as we are adding this new node to manage the existing data on the SAN. A guy that is training up to help, we let him install RedHat and go through the GUI setup. Did not pay attention to him, and after a while wonder what is taking so long. Walk over to him and he is still staring at the install screen and says, "Hey guys, this format sure is taking a while".

Look at the screen, check out what it is doing, realize that the installer had grabbed the backend and he said yeah format all(we are not sure exactly how he did it).

Middle of the day, better kick off the tape restore for 8TB of data.

[Jan 28, 2019] Something about the meaning of the word space

Jul 13, 2015 | thwack.solarwinds.com

Jul 13, 2015 7:44 AM

Trying to walk a tech through some switch config.

me: type config space t

them: it doesn't work

me: <sigh> <spells out config> space the single letter t

them: it still doesn't work

--- try some other rudimentary things ---

me: uh, are you typing in the word 'space'?

them: you said to

[Jan 28, 2019] Those power cables ;-)

Jan 28, 2019 | opensource.com

John Fano on 31 Jul 2016

I was reaching down to power up the new UPS as my guy was stepping out from behind the rack and the whole rack went dark. His foot caught the power cord of the working UPS and pulled it just enough to break the contacts and since the battery was failed it couldn't provide power and shut off. It took about 30 minutes to bring everything back up..

Things went much better with the second UPS replacement. :-)

[Jan 10, 2019] When idiots are offloaded to security department, interesting things with network eventually happen

Highly recommended!
Security department often does more damage to the network then any sophisticated hacker can. Especially if they are populated with morons, as they usually are. One of the most blatant examples is below... Those idiots decided to disable Traceroute (which means ICMP) in order to increase security.
Notable quotes:
"... Traceroute is disabled on every network I work with to prevent intruders from determining the network structure. Real pain in the neck, but one of those things we face to secure systems. ..."
"... Also really stupid. A competent attacker (and only those manage it into your network, right?) is not even slowed down by things like this. ..."
"... Breaking into a network is a slow process. Slow and precise. Trying to fix problems is a fast reactionary process. Who do you really think you're hurting? Yes another example of how ignorant opinions can become common sense. ..."
"... Disable all ICMP is not feasible as you will be disabling MTU negotiation and destination unreachable messages. You are essentially breaking the TCP/IP protocol. And if you want the protocol working OK, then people can do traceroute via HTTP messages or ICMP echo and reply. ..."
"... You have no fucking idea what you're talking about. I run a multi-regional network with over 130 peers. Nobody "disables ICMP". IP breaks without it. Some folks, generally the dimmer of us, will disable echo responses or TTL expiration notices thinking it is somehow secure (and they are very fucking wrong) but nobody blocks all ICMP, except for very very dim witted humans, and only on endpoint nodes. ..."
"... You have no idea what you're talking about, at any level. "disabled ICMP" - state statement alone requires such ignorance to make that I'm not sure why I'm even replying to ignorant ass. ..."
"... In short, he's a moron. I have reason to suspect you might be, too. ..."
"... No, TCP/IP is not working fine. It's broken and is costing you performance and $$$. But it is not evident because TCP/IP is very good about dealing with broken networks, like yours. ..."
"... It's another example of security by stupidity which seldom provides security, but always buys added cost. ..."
"... A brief read suggests this is a good resource: https://john.albin.net/essenti... [albin.net] ..."
"... Linux has one of the few IP stacks that isn't derived from the BSD stack, which in the industry is considered the reference design. Instead for linux, a new stack with it's own bugs and peculiarities was cobbled up. ..."
"... Reference designs are a good thing to promote interoperability. As far as TCP/IP is concerned, linux is the biggest and ugliest stepchild. A theme that fits well into this whole discussion topic, actually. ..."
May 27, 2018 | linux.slashdot.org

jfdavis668 ( 1414919 ) , Sunday May 27, 2018 @11:09AM ( #56682996 )

Re:So ( Score: 5 , Interesting)

Traceroute is disabled on every network I work with to prevent intruders from determining the network structure. Real pain in the neck, but one of those things we face to secure systems.

Anonymous Coward writes:
Re: ( Score: 2 , Insightful)

What is the point? If an intruder is already there couldn't they just upload their own binary?

Hylandr ( 813770 ) , Sunday May 27, 2018 @05:57PM ( #56685274 )
Re: So ( Score: 5 , Interesting)

They can easily. And often time will compile their own tools, versions of Apache, etc..

At best it slows down incident response and resolution while doing nothing to prevent discovery of their networks. If you only use Vlans to segregate your architecture you're boned.

gweihir ( 88907 ) , Sunday May 27, 2018 @12:19PM ( #56683422 )
Re: So ( Score: 5 , Interesting)

Also really stupid. A competent attacker (and only those manage it into your network, right?) is not even slowed down by things like this.

bferrell ( 253291 ) , Sunday May 27, 2018 @12:20PM ( #56683430 ) Homepage Journal
Re: So ( Score: 4 , Interesting)

Except it DOESN'T secure anything, simply renders things a little more obscure... Since when is obscurity security?

fluffernutter ( 1411889 ) writes:
Re: ( Score: 3 )

Doing something to make things more difficult for a hacker is better than doing nothing to make things more difficult for a hacker. Unless you're lazy, as many of these things should be done as possible.

DamnOregonian ( 963763 ) , Sunday May 27, 2018 @04:37PM ( #56684878 )
Re:So ( Score: 5 , Insightful)

No.

Things like this don't slow down "hackers" with even a modicum of network knowledge inside of a functioning network. What they do slow down is your ability to troubleshoot network problems.

Breaking into a network is a slow process. Slow and precise. Trying to fix problems is a fast reactionary process. Who do you really think you're hurting? Yes another example of how ignorant opinions can become common sense.

mSparks43 ( 757109 ) writes:
Re: So ( Score: 2 )

Pretty much my reaction. like WTF? OTON, redhat flavors all still on glibc2 starting to become a regular p.i.t.a. so the chances of this actually becoming a thing to be concerned about seem very low.

Kinda like gdpr, same kind of groupthink that anyone actually cares or concerns themselves with policy these days.

ruir ( 2709173 ) writes:
Re: ( Score: 3 )

Disable all ICMP is not feasible as you will be disabling MTU negotiation and destination unreachable messages. You are essentially breaking the TCP/IP protocol. And if you want the protocol working OK, then people can do traceroute via HTTP messages or ICMP echo and reply.

Or they can do reverse traceroute at least until the border edge of your firewall via an external site.

DamnOregonian ( 963763 ) , Sunday May 27, 2018 @04:32PM ( #56684858 )
Re:So ( Score: 4 , Insightful)

You have no fucking idea what you're talking about. I run a multi-regional network with over 130 peers. Nobody "disables ICMP". IP breaks without it. Some folks, generally the dimmer of us, will disable echo responses or TTL expiration notices thinking it is somehow secure (and they are very fucking wrong) but nobody blocks all ICMP, except for very very dim witted humans, and only on endpoint nodes.

DamnOregonian ( 963763 ) writes:
Re: ( Score: 3 )

That's hilarious... I am *the guy* who runs the network. I am our senior network engineer. Every line in every router -- mine.

You have no idea what you're talking about, at any level. "disabled ICMP" - state statement alone requires such ignorance to make that I'm not sure why I'm even replying to ignorant ass.

DamnOregonian ( 963763 ) writes:
Re: ( Score: 3 )

Nonsense. I conceded that morons may actually go through the work to totally break their PMTUD, IP error signaling channels, and make their nodes "invisible"

I understand "networking" at a level I'm pretty sure you only have a foggy understanding of. I write applications that require layer-2 packet building all the way up to layer-4.

In short, he's a moron. I have reason to suspect you might be, too.

DamnOregonian ( 963763 ) writes:
Re: ( Score: 3 )

A CDS is MAC. Turning off ICMP toward people who aren't allowed to access your node/network is understandable. They can't get anything else though, why bother supporting the IP control channel? CDS does *not* say turn off ICMP globally. I deal with CDS, SSAE16 SOC 2, and PCI compliance daily. If your CDS solution only operates with a layer-4 ACL, it's a pretty simple model, or You're Doing It Wrong (TM)

nyet ( 19118 ) writes:
Re: ( Score: 3 )

> I'm not a network person

IOW, nothing you say about networking should be taken seriously.

kevmeister ( 979231 ) , Sunday May 27, 2018 @05:47PM ( #56685234 ) Homepage
Re:So ( Score: 4 , Insightful)

No, TCP/IP is not working fine. It's broken and is costing you performance and $$$. But it is not evident because TCP/IP is very good about dealing with broken networks, like yours.

The problem is that doing this requires things like packet fragmentation which greatly increases router CPU load and reduces the maximum PPS of your network as well s resulting in dropped packets requiring re-transmission and may also result in widow collapse fallowed with slow-start, though rapid recovery mitigates much of this, it's still not free.

It's another example of security by stupidity which seldom provides security, but always buys added cost.

Hylandr ( 813770 ) writes:
Re: ( Score: 3 )

As a server engineer I am experiencing this with our network team right now.

Do you have some reading that I might be able to further educate myself? I would like to be able to prove to the directors why disabling ICMP on the network may be the cause of our issues.

Zaelath ( 2588189 ) , Sunday May 27, 2018 @07:51PM ( #56685758 )
Re:So ( Score: 4 , Informative)

A brief read suggests this is a good resource: https://john.albin.net/essenti... [albin.net]

Bing Tsher E ( 943915 ) , Sunday May 27, 2018 @01:22PM ( #56683792 ) Journal
Re: Denying ICMP echo @ server/workstation level t ( Score: 5 , Insightful)

Linux has one of the few IP stacks that isn't derived from the BSD stack, which in the industry is considered the reference design. Instead for linux, a new stack with it's own bugs and peculiarities was cobbled up.

Reference designs are a good thing to promote interoperability. As far as TCP/IP is concerned, linux is the biggest and ugliest stepchild. A theme that fits well into this whole discussion topic, actually.

[Oct 05, 2018] When some filenames are etched in your brain you can type them several times repeating the same blunder again and again by Anatoly Ivasyuk

Notable quotes:
"... I was working on a line printer spooler, which lived in /etc. I wanted to remove it, and so issued the command "rm /etc/lpspl." There was only one problem. Out of habit, I typed "passwd" after "/etc/" and removed the password file. Oops. ..."
Oct 05, 2018 | cam.ac.uk

From Unix Admin. Horror Story Summary, version 1.0 by Anatoly Ivasyuk
From: [email protected] (Tim Smith)

Organization: University of Washington, Seattle

I was working on a line printer spooler, which lived in /etc. I wanted to remove it, and so issued the command "rm /etc/lpspl." There was only
one problem. Out of habit, I typed "passwd" after "/etc/" and removed the password file. Oops.

I called up the person who handled backups, and he restored the password file.

A couple of days later, I did it again! This time, after he restored it, he made a link, /etc/safe_from_tim.

About a week later, I overwrote /etc/passwd, rather than removing it. After he restored it again, he installed a daemon that kept a copy of /etc/passwd, on another file system, and automatically restored it if it appeared to have been damaged.

Fortunately, I finished my work on /etc/lpspl around this time, so we didn't have to see if I could find a way to wipe out a couple of filesystems...

--Tim Smith

[Oct 05, 2018] I learned a valuable lesson about pressing buttons without first fully understanding what they do.

Oct 05, 2018 | www.reddit.com

WorkOfOz (0 children)

This is actually one of my standard interview questions since I believe any sys admin that's worth a crap has made a mistake they'll never forget.

Here's mine, circa 2001. In response to a security audit, I had to track down which version of the Symantec Antivirus was running and what definition was installed on every machine in the company. I had been working through this for awhile and got a bit reckless.

There was a button in the console that read 'Virus Sweep'. Thinking it'd get the info from each machine and give me the details, I pressed it.. I was wrong..

Very Wrong. Instead it proceeded to initiate a virus scan on every machine including all of the servers.

Less than 5 minutes later, many of our older servers and most importantly our file servers froze. In the process, I took down a trade floor for about 45 minutes while we got things back up. I learned a valuable lesson about pressing buttons without first fully understanding what they do.

[Oct 05, 2018] A newbie turned production server off to replace a monitor

Oct 05, 2018 | www.reddit.com

just_call_in_sick 5 years ago (1 child)

A friend of the family was an IT guy and he gave me the usual high school unpaid intern job. My first day, he told me that a computer needed the monitor replaced. He gave me this 13" CRT and sent me on my way. I found the room (a wiring closet) with a tiny desk and a large desktop tower on it.

TURNED OFF THE COMPUTER and went about replacing the monitor. I think it took about 5 minutes for people start wondering why they can no longer use the file server and can't save their files they have been working on all day.

It turns out that you don't have to turn off computers to replace the monitor.

[Jul 20, 2017] These Guys Didn't Back Up Their Files, Now Look What Happened

Notable quotes:
"... Unfortunately, even today, people have not learned that lesson. Whether it's at work, at home, or talking with friends, I keep hearing stories of people losing hundreds to thousands of files, sometimes they lose data worth actual dollars in time and resources that were used to develop the information. ..."
"... "I lost all my files from my hard drive? help please? I did a project that took me 3 days and now i lost it, its powerpoint presentation, where can i look for it? its not there where i save it, thank you" ..."
"... Please someone help me I last week brought a Toshiba Satellite laptop running windows 7, to replace my blue screening Dell vista laptop. On plugged in my sumo external hard drive to copy over some much treasured photos and some of my (work – music/writing.) it said installing driver. it said completed I clicked on the hard drive and found a copy of my documents from the new laptop and nothing else. ..."
Jul 20, 2017 | www.makeuseof.com
Back in college, I used to work just about every day as a computer cluster consultant. I remember a month after getting promoted to a supervisor, I was in the process of training a new consultant in the library computer cluster. Suddenly, someone tapped me on the shoulder, and when I turned around I was confronted with a frantic graduate student – a 30-something year old man who I believe was Eastern European based on his accent – who was nearly in tears.

"Please need help – my document is all gone and disk stuck!" he said as he frantically pointed to his PC.

Now, right off the bat I could have told you three facts about the guy. One glance at the blue screen of the archaic DOS-based version of Wordperfect told me that – like most of the other graduate students at the time – he had not yet decided to upgrade to the newer, point-and-click style word processing software. For some reason, graduate students had become so accustomed to all of the keyboard hot-keys associated with typing in a DOS-like environment that they all refused to evolve into point-and-click users.

The second fact, gathered from a quick glance at his blank document screen and the sweat on his brow told me that he had not saved his document as he worked. The last fact, based on his thick accent, was that communicating the gravity of his situation wouldn't be easy. In fact, it was made even worse by his answer to my question when I asked him when he last saved.

"I wrote 30 pages."

Calculated out at about 600 words a page, that's 18000 words. Ouch.

Then he pointed at the disk drive. The floppy disk was stuck, and from the marks on the drive he had clearly tried to get it out with something like a paper clip. By the time I had carefully fished the torn and destroyed disk out of the drive, it was clear he'd never recover anything off of it. I asked him what was on it.

"My thesis."

I gulped. I asked him if he was serious. He was. I asked him if he'd made any backups. He hadn't.

Making Backups of Backups

If there is anything I learned during those early years of working with computers (and the people that use them), it was how critical it is to not only save important stuff, but also to save it in different places. I would back up floppy drives to those cool new zip drives as well as the local PC hard drive. Never, ever had a single copy of anything.

Unfortunately, even today, people have not learned that lesson. Whether it's at work, at home, or talking with friends, I keep hearing stories of people losing hundreds to thousands of files, sometimes they lose data worth actual dollars in time and resources that were used to develop the information.

To drive that lesson home, I wanted to share a collection of stories that I found around the Internet about some recent cases were people suffered that horrible fate – from thousands of files to entire drives worth of data completely lost. These are people where the only remaining option is to start running recovery software and praying, or in other cases paying thousands of dollars to a data recovery firm and hoping there's something to find.

Not Backing Up Projects

The first example comes from Yahoo Answers , where a user that only provided a "?" for a user name (out of embarrassment probably), posted:

"I lost all my files from my hard drive? help please? I did a project that took me 3 days and now i lost it, its powerpoint presentation, where can i look for it? its not there where i save it, thank you"

The folks answering immediately dove into suggesting that the person run recovery software, and one person suggested that the person run a search on the computer for *.ppt.

... ... ...

Doing Backups Wrong

Then, there's a scenario of actually trying to do a backup and doing it wrong, losing all of the files on the original drive. That was the case for the person who posted on Tech Support Forum , that after purchasing a brand new Toshiba Laptop and attempting to transfer old files from an external hard drive, inadvertently wiped the files on the hard drive.

Please someone help me I last week brought a Toshiba Satellite laptop running windows 7, to replace my blue screening Dell vista laptop. On plugged in my sumo external hard drive to copy over some much treasured photos and some of my (work – music/writing.) it said installing driver. it said completed I clicked on the hard drive and found a copy of my documents from the new laptop and nothing else.

While the description of the problem is a little broken, from the sound of it, the person thought they were backing up from one direction, while they were actually backing up in the other direction. At least in this case not all of the original files were deleted, but a majority were.

[Jul 08, 2010] OT SysAdmin Stories

Under the category of "learn from the mistake of others..."

About eight years ago, I was working on a program with tight deadlines. I'd worked through the night, only catching an hour or two of sleep in the office.

The next morning, one of the servers remounted it's file systems read-only. Being a small shop, I decided to just take the server down to run a quick fsck.ext2. In my sleepiness though, I typed 'mkfs.ext2'.

When people say that "root" is god, well, no one asks god "Are you sure?".

===

All nighters are bad news, mistakes are easily made at these times as we have all learnt the hard way ;)

*cough* erased the backups and spent the night re-backing up data so nothing actually got done *cough*

I do remember spending a few days putting together some systems check for my self and my colleague to use such as daily, weekly and monthly systems checks for all IT aspects (physical, virtual, power, redundancy, connectivity etc...) only to have something fail the next day (so it really paid off!) and then nothing has broken since?...Just goes to show you never know!

Also recently upgraded my personal Ubuntu server to a RAID 6 from a RAID 5 (about a week ago) and now it looks like one of the drives is dying, again, just in time!

--
Regards,
James.

http://www.jamesbensley.co.cc/

There are 10 kinds of people in the world; Those who understand Vigesimal, and 9 others...?

===

Does this go to show the value of preparedness? Or does it illustrate the power of luck? Or some intersection? I've often been lucky about when and how stuff breaks down. And I've known people with what looked like real computer jinxes. On the one hand, you never want to just trust your luck. On the other, if luck can be involved, could it be that the profession selects for those who have it?

Whit

===

I think Whit you have raised some deeper questions maybe about probability, sod's law, the uncertanty principle, karma, etc etc...Maybe a venn diagram covering luck and preparedness is in order, who knows, we/I am digressing....

I would like to point out that at home I'm pretty sure I'm jinxed; my ubuntu server has decided X ins't going to work any more, nor the sound (may be related) and the raid is dying, all on the same day?!?!?!

--
Regards,
James.

http://www.jamesbensley.co.cc/

There are 10 kinds of people in the world; Those who understand
Vigesimal, and 9 others...?

===

Way back in the stone age, I was a sys admin at my university, working the graveyard (i.e., backup) shift two days a week and an occasional Sunday. On Sundays, we did the full backup and restore, but we switched out the disk packs (I said this was a long time ago) so we never lost more than a week's worth of data at the time. Well, almost never....

My last Sunday there, I accidentally reinitialized all the disks after the backup but before I had switched them. Then, I realized what I did, switched them anyway, and reinitialized them again, then did a full restore.

Everything would have been fine if the file system hadn't crashed that Friday afternoon....

This was on a Xerox Sigma 7 (I'm dating myself).

UNIX horror story: 24 years ago, I was working on a development system (i.e., nothing critical on it) and my latest build didn't work the way I expected, so I erased it with an 'rm -rf *' - except that I was in the root directory at the time, not my build directory. By the time I realized what I had done, it was too far gone to recover, so I wound up reinstalling the whole system.

No harm done (I did things like that sometimes on purpose, when it was *my* machine involved), but I don't do 'rm -rf' of anything any more without double-checking where I am FIRST, even if the default "-v" is set.

(unsigned confession) ===

I had quite simmilar experience, but I typed `chown -R user:group' /
(instead of ./). Now I'm also checking it for few times and I learned to
use `.' instead of `./', :)

--
Dominik Zyla

[Jun 26, 2010] Non-sysadmin related but pretty amusing story about some wrong assumptions ;-)

A carpet layer had just finished installing carpet for a lady. He stepped out for a smoke, only to realize he'd lost his cigarettes. He went back in and in the middle of the room, under the carpet, was a bump. "No sense pulling up the entire floor for one pack of smokes," he said to himself. He got out his hammer and flattened the hump.

As he was cleaning up, the lady came in. "Here," she said, handing him his pack of cigarettes. "I found them in the hallway."

"Now," she said, "If only I could find my parakeet."

[Jun 21, 2010] Memorable Quotes from Alt.Sysadmin.Recovery

Recommended Links

Google matched content

Softpanorama Recommended

Top articles

[Jan 10, 2019] When idiots are offloaded to security department, interesting things with network eventually happen Published on May 27, 2018 | linux.slashdot.org

Sites



Etc

Society

Groupthink : Two Party System as Polyarchy : Corruption of Regulators : Bureaucracies : Understanding Micromanagers and Control Freaks : Toxic Managers :   Harvard Mafia : Diplomatic Communication : Surviving a Bad Performance Review : Insufficient Retirement Funds as Immanent Problem of Neoliberal Regime : PseudoScience : Who Rules America : Neoliberalism  : The Iron Law of Oligarchy : Libertarian Philosophy

Quotes

War and Peace : Skeptical Finance : John Kenneth Galbraith :Talleyrand : Oscar Wilde : Otto Von Bismarck : Keynes : George Carlin : Skeptics : Propaganda  : SE quotes : Language Design and Programming Quotes : Random IT-related quotesSomerset Maugham : Marcus Aurelius : Kurt Vonnegut : Eric Hoffer : Winston Churchill : Napoleon Bonaparte : Ambrose BierceBernard Shaw : Mark Twain Quotes

Bulletin:

Vol 25, No.12 (December, 2013) Rational Fools vs. Efficient Crooks The efficient markets hypothesis : Political Skeptic Bulletin, 2013 : Unemployment Bulletin, 2010 :  Vol 23, No.10 (October, 2011) An observation about corporate security departments : Slightly Skeptical Euromaydan Chronicles, June 2014 : Greenspan legacy bulletin, 2008 : Vol 25, No.10 (October, 2013) Cryptolocker Trojan (Win32/Crilock.A) : Vol 25, No.08 (August, 2013) Cloud providers as intelligence collection hubs : Financial Humor Bulletin, 2010 : Inequality Bulletin, 2009 : Financial Humor Bulletin, 2008 : Copyleft Problems Bulletin, 2004 : Financial Humor Bulletin, 2011 : Energy Bulletin, 2010 : Malware Protection Bulletin, 2010 : Vol 26, No.1 (January, 2013) Object-Oriented Cult : Political Skeptic Bulletin, 2011 : Vol 23, No.11 (November, 2011) Softpanorama classification of sysadmin horror stories : Vol 25, No.05 (May, 2013) Corporate bullshit as a communication method  : Vol 25, No.06 (June, 2013) A Note on the Relationship of Brooks Law and Conway Law

History:

Fifty glorious years (1950-2000): the triumph of the US computer engineering : Donald Knuth : TAoCP and its Influence of Computer Science : Richard Stallman : Linus Torvalds  : Larry Wall  : John K. Ousterhout : CTSS : Multix OS Unix History : Unix shell history : VI editor : History of pipes concept : Solaris : MS DOSProgramming Languages History : PL/1 : Simula 67 : C : History of GCC developmentScripting Languages : Perl history   : OS History : Mail : DNS : SSH : CPU Instruction Sets : SPARC systems 1987-2006 : Norton Commander : Norton Utilities : Norton Ghost : Frontpage history : Malware Defense History : GNU Screen : OSS early history

Classic books:

The Peter Principle : Parkinson Law : 1984 : The Mythical Man-MonthHow to Solve It by George Polya : The Art of Computer Programming : The Elements of Programming Style : The Unix Hater’s Handbook : The Jargon file : The True Believer : Programming Pearls : The Good Soldier Svejk : The Power Elite

Most popular humor pages:

Manifest of the Softpanorama IT Slacker Society : Ten Commandments of the IT Slackers Society : Computer Humor Collection : BSD Logo Story : The Cuckoo's Egg : IT Slang : C++ Humor : ARE YOU A BBS ADDICT? : The Perl Purity Test : Object oriented programmers of all nations : Financial Humor : Financial Humor Bulletin, 2008 : Financial Humor Bulletin, 2010 : The Most Comprehensive Collection of Editor-related Humor : Programming Language Humor : Goldman Sachs related humor : Greenspan humor : C Humor : Scripting Humor : Real Programmers Humor : Web Humor : GPL-related Humor : OFM Humor : Politically Incorrect Humor : IDS Humor : "Linux Sucks" Humor : Russian Musical Humor : Best Russian Programmer Humor : Microsoft plans to buy Catholic Church : Richard Stallman Related Humor : Admin Humor : Perl-related Humor : Linus Torvalds Related humor : PseudoScience Related Humor : Networking Humor : Shell Humor : Financial Humor Bulletin, 2011 : Financial Humor Bulletin, 2012 : Financial Humor Bulletin, 2013 : Java Humor : Software Engineering Humor : Sun Solaris Related Humor : Education Humor : IBM Humor : Assembler-related Humor : VIM Humor : Computer Viruses Humor : Bright tomorrow is rescheduled to a day after tomorrow : Classic Computer Humor

The Last but not Least Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand ~Archibald Putt. Ph.D


Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain some broken links as it develops like a living tree...

You can use PayPal to to buy a cup of coffee for authors of this site

Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. You you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.

Last modified: February 07, 2020