May the source be with you, but remember the KISS principle ;-)
Bigger doesn't imply better. Bigger often is a sign of obesity, of lost control, of overcomplexity, of cancerous cells
If you try to distill the essence of horror stories, most of them are about inadequate backups. Anyone who has worked as a system administrator in a large corporation for a substantial period of time can confirm a general observation: large organizations tend to opt for incredibly expensive, incredibly complex, incredibly overblown backup "solutions" sold to them by vendors (e.g., Data Protector, Tivoli backup, or other expensive, closed-source, proprietary, non-portable, slow, bulky software) rather than using the stock, well-tested, reliable tools that they already have.
Home users have their own set of problems. According to a recent Carnegie-Mellon University report, hard drive failures affect up to 13 percent of all personal computer users each year. And yet surveys show almost half of users do not back up their data. Of course, SSDs are no longer that expensive, but they fail too, although they survive a fall from the desk to the floor better than hard drives.
Having a good recent backup that can be restored is the key feature that distinguishes a mere nuisance from a full-blown disaster. Note the phrase "that can be restored". This point is very difficult for novice enterprise administrators to grasp. Often the "missing backup" situation arises when a backup is available but can't be used for restoration, restores only a part of the filesystem, or is not current. There are some rules that help both to prevent such situations and to recover from them.
Rephrasing Benjamin Franklin, we can say "Experience keeps the most expensive school, but most sysadmins are unable to learn anywhere else". Please remember that in an enterprise environment you will almost never be rewarded for innovations and contributions, but in many cases you will be severely punished for blunders. In other words, typical enterprise IT is a risk-averse environment, and you had better understand that sooner rather than later...
You should not be passive in accepting your fate. There should be conscious efforts to locate and test a backup before engaging in potentially dangerous manipulations with the OS.
Test your backups to make sure they are readable before starting any potentially dangerous manipulations with the OS.
Handle the format program (and anything else that writes directly to disk devices) like nitroglycerine.
If you've never done sysadmin work before, take a formal vendor training class even if this means paying your own money.
Testing your backups periodically should be a habit, and it is better still to integrate it into your monitoring system. At a minimum, browse the backup and check that the data are intact; comparing it with the current state of the server is even better. Skipping this step is negligence on the part of the system administrator.
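A periodic check of this kind could be sketched as follows. This is a minimal sketch, not a replacement for a real restore test: it assumes the backup is a tar archive whose member names are paths relative to the backed-up directory, and the function names `sha256_of` and `verify_backup` are illustrative, not part of any backup product.

```python
import hashlib
import os
import tarfile


def sha256_of(fileobj, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of an open binary file object."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_backup(archive_path, source_root):
    """Read every regular file in the archive and compare it with the live copy.

    Assumption: member names in the archive are relative to source_root.
    Returns a list of (member_name, problem) tuples; an empty list means every
    archived file was readable and matched the file under source_root.
    """
    problems = []
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # directories, links and devices are skipped here
            try:
                archived_sum = sha256_of(archive.extractfile(member))
            except (tarfile.TarError, OSError, EOFError) as exc:
                problems.append((member.name, f"unreadable in archive: {exc}"))
                continue
            live_path = os.path.join(source_root, member.name)
            if not os.path.isfile(live_path):
                problems.append((member.name, "missing on server"))
                continue
            with open(live_path, "rb") as live:
                if sha256_of(live) != archived_sum:
                    problems.append((member.name, "differs from server copy"))
    return problems
```

Files that legitimately changed after the backup was taken will also be reported as differing, so the output is a list to review rather than a pass/fail verdict. The important part is that every archived file is actually read back, which is what catches an unreadable backup before you need it.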
Please remember that a backup is your last chance to restore the system if something goes terribly wrong. That means that before any dangerous step you need to locate the backup and check that it exists.
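As an illustration, a pre-flight check run before a risky operation might look like the sketch below. The function name `backup_is_usable` and its thresholds (24 hours, 1 byte) are assumptions chosen for the example; it only confirms that a backup file exists, is not empty, and is recent, and says nothing about whether the file can actually be restored.

```python
import os
import time


def backup_is_usable(path, max_age_hours=24, min_size_bytes=1):
    """Return (ok, reason) for a backup file that must exist, be non-empty and recent.

    The thresholds are illustrative defaults, not recommendations.
    """
    if not os.path.isfile(path):
        return False, "backup file not found"
    info = os.stat(path)
    if info.st_size < min_size_bytes:
        return False, "backup file is suspiciously small"
    age_hours = (time.time() - info.st_mtime) / 3600
    if age_hours > max_age_hours:
        return False, f"backup is {age_hours:.1f} hours old"
    return True, "ok"
```

A wrapper script can call this first and refuse to continue with the dangerous step unless the first element of the result is `True`.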
In an enterprise environment, making a private backup is also a good idea, so that you have two or more recent copies of your OS and some user and data directories. It does not need to be complete. FIT flash drives limit the total size to 128GB, but they are almost invisible after you insert them into a USB port on the server, and they provide important and cheap insurance for your OS, baseline, and critical user and data files.
The feeling of desperation one experiences after getting into this classic horror story is well reflected in the following parody of "Yesterday":
All those backups seemed a waste of pay.
Now my database has gone away.
Oh I believe in yesterday.
There's not half the files there used to be,
And there's a milestone hanging over me
The system crashed so suddenly.
I pushed something wrong
What it was I could not say.
Now all my data's gone
and I long for yesterday-ay-ay-ay.
The need for back-ups seemed so far away.
I knew my data was all here to stay,
Now I believe in yesterday.
Nov 01, 2018 | opensource.com
The ghost of the failed restore
In a well-known data center (whose name I do not want to remember), one cold October night we had a production outage in which thousands of web servers stopped responding due to downtime in the main database. The database administrator asked me, the rookie sysadmin, to recover the database's last full backup and restore it to bring the service back online.
But, at the end of the process, the database was still broken. I didn't worry, because there were other full backup files in stock. However, even after doing the process several times, the result didn't change.
With great fear, I asked the senior sysadmin what to do to fix this behavior.
"You remember when I showed you, a few days ago, how the full backup script was running? Something about how important it was to validate the backup?" responded the sysadmin.
"Of course! You told me that I had to stay a couple of extra hours to perform that task," I answered.
"Exactly! But you preferred to leave early without finishing that task," he said.
"Oh my! I thought it was optional!" I exclaimed.
"It was, it was "
Moral of the story: Even with the best solution that promises to make the most thorough backups, the ghost of the failed restoration can appear, darkening our job skills, if we don't make a habit of validating the backup every time.
Jul 20, 2017 | www.makeuseof.com
Back in college, I used to work just about every day as a computer cluster consultant. I remember a month after getting promoted to a supervisor, I was in the process of training a new consultant in the library computer cluster. Suddenly, someone tapped me on the shoulder, and when I turned around I was confronted with a frantic graduate student – a 30-something year old man who I believe was Eastern European based on his accent – who was nearly in tears.
"Please need help – my document is all gone and disk stuck!" he said as he frantically pointed to his PC.
Now, right off the bat I could have told you three facts about the guy. One glance at the blue screen of the archaic DOS-based version of Wordperfect told me that – like most of the other graduate students at the time – he had not yet decided to upgrade to the newer, point-and-click style word processing software. For some reason, graduate students had become so accustomed to all of the keyboard hot-keys associated with typing in a DOS-like environment that they all refused to evolve into point-and-click users.
The second fact, gathered from a quick glance at his blank document screen and the sweat on his brow told me that he had not saved his document as he worked. The last fact, based on his thick accent, was that communicating the gravity of his situation wouldn't be easy. In fact, it was made even worse by his answer to my question when I asked him when he last saved.
"I wrote 30 pages."
Calculated out at about 600 words a page, that's 18000 words. Ouch.
Then he pointed at the disk drive. The floppy disk was stuck, and from the marks on the drive he had clearly tried to get it out with something like a paper clip. By the time I had carefully fished the torn and destroyed disk out of the drive, it was clear he'd never recover anything off of it. I asked him what was on it.
I gulped. I asked him if he was serious. He was. I asked him if he'd made any backups. He hadn't.
Making Backups of Backups
If there is anything I learned during those early years of working with computers (and the people that use them), it was how critical it is to not only save important stuff, but also to save it in different places. I would back up floppy drives to those cool new zip drives as well as the local PC hard drive. Never, ever had a single copy of anything.
Unfortunately, even today, people have not learned that lesson. Whether it's at work, at home, or talking with friends, I keep hearing stories of people losing hundreds to thousands of files, sometimes they lose data worth actual dollars in time and resources that were used to develop the information.
To drive that lesson home, I wanted to share a collection of stories that I found around the Internet about some recent cases where people suffered that horrible fate – from thousands of files to entire drives worth of data completely lost. These are people for whom the only remaining option is to start running recovery software and praying, or in other cases paying thousands of dollars to a data recovery firm and hoping there's something to find.
Not Backing Up Projects
The first example comes from Yahoo Answers, where a user who only provided a "?" for a user name (out of embarrassment probably) posted:
"I lost all my files from my hard drive? help please? I did a project that took me 3 days and now i lost it, its powerpoint presentation, where can i look for it? its not there where i save it, thank you"
The folks answering immediately dove into suggesting that the person run recovery software, and one person suggested that the person run a search on the computer for *.ppt.
... ... ...
Doing Backups Wrong
Then, there's a scenario of actually trying to do a backup and doing it wrong, losing all of the files on the original drive. That was the case for the person who posted on Tech Support Forum that, after purchasing a brand new Toshiba laptop and attempting to transfer old files from an external hard drive, they inadvertently wiped the files on the hard drive.
Please someone help me I last week brought a Toshiba Satellite laptop running windows 7, to replace my blue screening Dell vista laptop. On plugged in my sumo external hard drive to copy over some much treasured photos and some of my (work – music/writing.) it said installing driver. it said completed I clicked on the hard drive and found a copy of my documents from the new laptop and nothing else.
While the description of the problem is a little broken, from the sound of it, the person thought they were backing up from one direction, while they were actually backing up in the other direction. At least in this case not all of the original files were deleted, but a majority were.
If you are not careful, you can wipe out your C disk while restoring an image of the Windows C partition to a USB drive, as selecting a bootable recovery image somehow redirects the recovery to disk C. The warning sign is when Acronis True Image wants to reboot the computer to proceed.
If you are brave enough to go past this point, then even though you explicitly chose a target different from the boot drive, you have to face the unpleasant consequences: your C partition is now gone.
You can imagine your surprise at the results. I once did that. Thank God there was no critical data on the wiped C drive, as I had already migrated it to a new PC. My first reaction was to throw this garbage program where it belongs. But the problem is that other similar programs are not much better, and now I am trained not to trust Acronis and can probably do better in the future. Another factor is that if you don't use Acronis True Image often, you forget about its capabilities. In this case the right decision would have been to use the disk cloning operation, not restoration from the image, but the problem was that the disk and the image were slightly different and I wanted the content of the image, not the content of the disk.
Still, the right way would have been to first clone the disk and then restore the image to that drive. As I don't perform complex operations with Acronis often, I forgot about that and was punished. And believe me, your jaw really drops when you see the results...
Another time, our AIX/370 cluster managed to trash the /etc/passwd file. All 4 machines in the cluster lost their copies within milliseconds. In the next few minutes, I discovered that (a) the nightly script that stashed an archive copy hadn't run the night before and (b) that our backups were pure zorkumblattum as well. (The joys of running very beta-test software).
I finally got saved when I realized the cluster had *5* machines in it - a lone PS/2 had crashed the night before, and failed to reboot. So it had a propagated copy of /etc/passwd as of the previous night.
Go to that PS/2, unplug its Ethernet.. reboot it. Copy /etc/passwd to floppy, carry to a working (?) PS/2 in the cluster, tar it off, let it propagate to other cluster sites. Go back, hook up the crashed PS/2's Ethernet.. All done.
Only time in my career that having beta-test software crash a machine saved me from bugs in beta-test software. ;)
Once I was in the position of upgrading a Gould PN/9080. I was a good sysadmin, took a backup before I started, since the README said that they had changed the I-node format slightly. I do the upgrade, and it goes with unprecedented (for Gould) smoothness. mkfs all the user partitions, start restoring files. Blam.
I/O error on the tape. All 12 tapes. Both Sets of backups.
However, 'dd' could read the tape just fine.
36 straight hours later, I finally track it down to a bad chip on the tape controller board - the chip was involved in the buffer/convert from a 32-bit backplane to an 8-bit I/O cable. Every 4 bytes, the 5th bit would reverse sense. 20 mins later, I had a program written, and 'dd | my_twiddle | restore -f -' running.
Moral: Always *verify* the backups - the tape drive didn't report a write error, because what it *received* and what went on the tape were the same....
I'm sure I have other sagas, but those are some of the more memorable ones I've had...
Computer Systems Engineer
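The workaround in the Gould story above, a small filter placed between `dd` and `restore` that reverses a repeatable bit error, could look roughly like the sketch below. The original `my_twiddle` program is not shown in the post, so the details here are assumptions: that the fault hits the last byte of each 4-byte group, and that "the 5th bit" means mask `0x10` (counting from 1 at the low-order end). The names `twiddle` and `filter_stream` are illustrative.

```python
import sys

GROUP_SIZE = 4      # assumption: the fault repeats once per 4-byte group
BYTE_IN_GROUP = 3   # assumption: it hits the last byte of each group (0-based)
BIT_MASK = 0x10     # assumption: "5th bit", counted from 1 at the low-order end


def twiddle(data, start_offset=0):
    """Return a copy of data with the affected bit flipped in every group.

    start_offset is the position of data[0] within the whole stream, so the
    function can be applied chunk by chunk without losing alignment.
    """
    fixed = bytearray(data)
    first = (BYTE_IN_GROUP - start_offset) % GROUP_SIZE
    for i in range(first, len(fixed), GROUP_SIZE):
        fixed[i] ^= BIT_MASK  # XOR is its own inverse: flipping twice restores
    return bytes(fixed)


def filter_stream(src, dst, chunk_size=1 << 16):
    """Copy binary stream src to dst, applying twiddle() to everything read."""
    offset = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(twiddle(chunk, offset))
        offset += len(chunk)
```

Ending the script with `filter_stream(sys.stdin.buffer, sys.stdout.buffer)` turns it into a pipe filter. Because the corruption was a deterministic bit inversion, the same XOR that models the fault also undoes it, which is why a 20-minute program was enough once the pattern was known.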
From: rca@Ingres.COM (Bob Arnold)
Organization: Ask Computer Systems Inc., Ingres Division, Alameda CA 94501
Many moons ago, in my first sysadmin job, learning via "on-the-job training", I was in charge of a UNIX box whose user disk developed a bad block. (Maybe you can see it already ...)
The "format" man page seemed to indicate that it could repair bad blocks. (Can you see it now?) I read the man page very carefully. Nowhere did it indicate any kind of destructive behavior.
I was brave and bold, not to mention boneheaded, and formatted the user disk.
The good news:
1) The bad block was gone.
2) I was about to learn a lot real fast :-)
The bad news:
1) The user data was gone too.
2) The users weren't happy, to say the least.
Having recently made a full backup of the disk, I knew I was in for a miserable all day restore. Why all day? It took 8 hours to dump that disk to 40 floppies. And I had incrementals (levels 1, 2, 3, 4, and 5, which were another sign of my novice state) to layer on top of the full.
Only it got worse. The floppy drive had intermittent problems reading some of the floppies. So I had to go back and retry to get the files which were missed on the first attempt.
This was also a port of Version 7 UNIX (like I said, this was many moons ago). It had a program called "restor", primordial ancestor of BSD's "restore". If you used the "x" option to extract selected files (the ones missed on earlier attempts), "restor" would use the *inode number* as the name of the extracted files. You had to move the extracted files to their correct locations yourself (the man page said to write a shellscript to do this :-(). I didn't know much about shell scripts at the time, but I learned a lot more that week.
Yes, it took me a full week, including the weekend, maybe 120 hours or more, to get what I could (probably 95% of the data) off the backups.
And there were a few ownership and permissions problems to be cleaned up after that.
Once burned twice shy. This is the only truly catastrophic mistake I've ever made as a sysadmin, I'm glad to be able to say.
I kept a copy of my memo to the users after I had done what I could. Reading it over now is sobering indeed. I also kept my extensive notes on the restore process - thank goodness I've never had to use them since.
1) The "man" pages don't tell you everything you need to know.
2) Don't do backups to floppies.
3) Test your backups to make sure they are readable.
4) Handle the format program (and anything else that writes directly to disk devices) like nitroglycerine.
5) Strenuously avoid systems with inadequate backup and restore programs wherever possible (thank goodness for "restore" with an "e"!).
6) If you've never done sysadmin work before, take a formal training class.
Well, I haven't thought about that one in a while! I can laugh about it now ....
From: rca@Ingres.COM (Bob Arnold)
Organization: Ask Computer Systems Inc., Ingres Division, Alameda CA 94501
In article <1992Oct12.233524.13463@pony.Ingres.COM> I wrote:
>I was brave and bold, not to mention boneheaded, and formatted the user disk.
> [ rest of story deleted ... Bob ]
> 1) The "man" pages don't tell you everything you need to know.
> 2) Don't do backups to floppies.
> 3) Test your backups to make sure they are readable.
> 4) Handle the format program (and anything else that writes directly
> to disk devices) like nitroglycerine.
> 5) Strenuously avoid systems with inadequate backup and restore
> programs wherever possible (thank goodness for "restore" with
> an "e"!).
> 6) If you've never done sysadmin work before, take a formal
> training class.
Just thought of a few more related morals (managers pay attention now):
7) You get what you pay for.
8) There's no substitute for experience.
9) It's a lot less painful to learn from someone else's experience than your own (that's what this thread is about, I guess :-) )
Part of the story I should tell here. My employer had been looking for a way to cut costs. I was 15% cheaper than their previous sysadmin so they let him go and hired me. It wasn't as nasty as it sounds, since they kept him on as a consultant at 4 hours a week and he ended up with a better job too (so did I). Everyone benefited in the end. I leaned heavily on his consulting, which was great. He was older and wiser, and probably had his own horror stories to tell. After this one, so did I!
Copyright © 1996-2018 by Dr. Nikolai Bezroukov. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) in the author free time and without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
Last modified: July 01, 2018