|Home||Switchboard||Unix Administration||Red Hat||TCP/IP Networks||Neoliberalism||Toxic Managers|
May the source be with you, but remember the KISS principle ;-)
Skepticism and critical thinking is not panacea, but can help to understand the world better
|News||Sysadmin Horror Stories||Recommended Links||Creative uses of rm||Mistakes made because of the differences between various Unix/Linux flavors||Missing backup horror stories||Lack of testing complex, potentially destructive, commands before execution of production box||Pure stupidity|
|Locking yourself out||Premature or misguided optimization||Reboot Blunders||Performing the operation on a wrong server||Executing command in a wrong directory||Side effects of performing operations on home or application directories||Typos in the commands with disastrous consequences||Side effects of patching|
|Multiple sysadmin working on the same box||Side effects of patching of the customized server||Ownership changing blunders||Dot-star-errors and regular expressions blunders||Excessive zeal in improving security of the system||Unintended consequences of automatic system maintenance scripts||LVM mishaps||Abuse of privileges|
|Safe-rm||Workagolism and Burnout||Coping with the toxic stress in IT environment||The Unix Haters Handbook||Tips||Horror stories History||Humor||Etc|
If you try to distill the essence of horror stories, most of them are about inadequate, corrupt or completely missing backups.
Everyone who has worked as system administrator in a large corporation for substantial period of time can tell that as a general observation, large organizations/corporations tend to opt for incredibly expensive, incredibly complex, incredibly overblown backup "solutions" sold to them by one of the major vendors (IBM, HP, etc) vendors rather than using the stock, well-tested, reliable tools that they already have. (e.g., they are using Data Protector, Tivoli backup, or other expensive closed-source/proprietary/non-portable/slow/bulky software. Those are fine for application data backup/restore, but might have problem with bare metal restore.
Home users have their own set of problems: according to a recent Carnegie-Mellon University report, hard drive failures affect up to 13% of all personal computer users each year. Now with SSD drive being dominant in laptop this problem (especially accidental laptop drop) became less common. And yet surveys show almost half of users do not back up their data. SSD now are not that expensive and are "drop-resistant", but they fail too. and please note that for most home users FIT drive which can be permanently inserted in USB port represents an adequate backup solution. But they do not implement even this minimal backup solution, preferring to rely of pure lack. Then the disaster strikes and that realize how stupid they are, but it too late.
SSD USB drives are also resistant when failing from the desk to the floor. Although enclosure is often badly damaged drive continues to work OK.
Having a good recent backup that can be restored is the key feature that distinguishes mere nuisance from full blown disaster and professional system administrator from an amateur or wannabe.
|Having a good recent backup that can be restored is the key feature that distinguishes mere nuisance from full blown disaster and professional system administrator from an amateur or wannabe.|
Note that phase " that can be restored". This point is very difficult to understand by novice enterprise administrators. Often the "missing backup" situation arise when backup is available but can't be used for restoration or restores only a part of filesystem, or is not current.
Here there are two basic rules to follow:
Implementing a backup system without regular testing of restore process is suicidal.
Rephrasing Benjamin Franklin we can say "Experience keeps the most expensive school, but most sysadmins are unable to learn anywhere else". Please remember that in enterprise environment you will almost never be rewarded for innovations and contributions but in many cases you will be severely punished for blunders. In other words typical enterprise IT is a risk averse environment and you better understand that sooner rather then later...
If you try to distill the essence of horror stories most of them are about inadequate backups. Having a good recent backup is the key feature that distinguishes mere nuisance from full blown disaster.
You should not be passive in accepting you fate. There should be couscous efforts to locate and test backup before engaging in some potentially dangerous manipulations with the OS.
|Test your backups to make sure they are readable before starting any
potentially dangerous manipulations with the OS.
Handle the format program (and anything else that writes directly to disk devices) like nitroglycerine.
If you've never done sysadmin work before, take a formal vendor training class even if this means paying your own money.
Testing your backups periodically should be a habit and it is better to be integrated into your monitoring system. Attempt at least to browse the backup and see if data are intact is a must. comparing it with the server state is even better. In any case that should be done. Skipping this means negligence on the part of system administrator.
Please remember that backup is the last change for you to restore the system if something went terribly wrong. That means that before any dangerous steps you need to locate and check the existence of backup.
There are some rules that help both prevent such situation and recover from it
In enterprise environment making a private backup is also a good idea to that you have two or more recent copies of your OS and some user and data directories. It does not need to be complete. FIT flash drives limit the total size to 128GB, but they are almost invisible after you insert them into USB port on the server and they provide important and cheap insurance for your OS, baseline and critical user and data files.
The felling of desperation one is experiencing after getting into this classic horror story are well reflected in the following parody on Yesterday
RAID 5 drives can recover from the loss of a single disk only, no matter how many disks are configured. So if nobody monitors the health of RAID array, sooner of later it will fail. Usually it takes several months from one harddrive failure to another. So if the first drive failed in say, April, the second might fail only in September. And RAID5 array with two failed drives is usually unrecoverable without especial equipment. Even if you send the last failed drive to OnTrack or similar company and they recover this drive (which will cost you from $800 to several thousand dollars), there is no guarantee that RAID5 array will accept this drive and recover all the information.
As DRAC/ILO is usually not monitored closely it is very important that lights on remote servers be monitored on a regular basis (drive patrol). That means that unless you have ssh script to extract relevant information from DRAC, you are flying blind
System log does not provide any information about health of the drives. Or more correctly it does provide this information, but only when it is too late.
So the quality of DRAC monitoring became the crucial factor. Even simple script that detects failed drive is a huge improvement. For certain types of system, where integristy of data is extremely important you can shutdown the server when a failed drive is detected. Some user jobs will be lost, and there will be complains. But it is much better situration then when user data are lost.
The other issue is how resilient is your disk configuration. If RAID partition is mapped via disk manager this is another level of complexity that prevents data recovery and increase chances of data loss.
All those backups seemed a waste of pay.
Now my database has gone away.
Oh I believe in yesterday.
There's not half the files there used to be,
And there's a milestone hanging over me
The system crashed so suddenly.
I pushed something wrong
What it was I could not say.
Now all my data's gone
and I long for yesterday-ay-ay-ay.
The need for back-ups seemed so far away.
I knew my data was all here to stay,
Now I believe in yesterday.
May 20, 2018 | www.cyberciti.biz
I had only one backup copy of my QT project and I just wanted to get a directory called functions. I end up deleting entire backup (note -c switch instead of -x):
tar -zcvf project.tar.gz functions
I had no backup. Similarly I end up running rsync command and deleted all new files by overwriting files from backup set (now I have switched to rsnapshot )
rsync -av -delete /dest /src
Again, I had no backup.
... ... ...All men make mistakes, but only wise men learn from their mistakes -- Winston Churchill .
From all those mistakes I have learn that:
- You must keep a good set of backups. Test your backups regularly too.
- The clear choice for preserving all data of UNIX file systems is dump, which is only tool that guaranties recovery under all conditions. (see Torture-testing Backup and Archive Programs paper).
- Never use rsync with single backup directory. Create a snapshots using rsync or rsnapshots .
- Use CVS/git to store configuration files.
- Wait and read command line twice before hitting the dam [Enter] key.
- Use your well tested perl / shell scripts and open source configuration management software such as puppet, Ansible, Cfengine or Chef to configure all servers. This also applies to day today jobs such as creating the users and more.
Mistakes are the inevitable, so have you made any mistakes that have caused some sort of downtime? Please add them into the comments section below.
This happen with a small "kiosk type" cluster that served multiple research groups. In order to get account all users were required to sign a disclaimer that they will backup the data themselves. But of course almost nobody did.
Due to outsourcing "disk patrol" on this cluster (actually one Dell enclose with 16 blades and a rach server headnode with the large disk array on top of it was not performed and 2 disks in RAID5 array attached to the headnode went bad. This was 40TB filesystem with around 20TB of research data. All disappeared at one moment. LV consisted of two OV -- one smaller on the headnode itself (15 1TB drives) and the other PV was larger storage attachment with 12 4 TB drives. Only the second part of LV failed so so a partial recovery was possible. Dell H840 RAID controller also has in interesting feature -- it allows not to reinitialize a new volume after replacing failed drives. That also helped to limit the damage.
After frantic efforts parts of the filesystem were on the second PV rescued and LV restored, but those data that were on the failed disk remains missing. While 4TB disks were used in the disk array the damage generally canbe muich larger dire to striping. The commandfind . -type f -print0 | file -0 | grep -P ' data$"
Allowed to find most of corrupted files (verified later for each important data type separate (gz, pdf, html, etc) And there was several thousand of them. Corrupted GZ files proved are typically irrecoverable but corrupted text files and tar archives can be partially recovered if only a few blocks are missing in a large file.
Of course, disk space to perform backup of most user data after that was quickly found and backups institutes, but it was too late.
What is interesting that sysadmin in charge make a lot of efforts to preserve the operating system image and restore the OS in case of failu, but never gave any thought about the value of user data despite the fact that user data were much more valuable and he has plenty of space to backup at least 50$ of them (if not more).
As the result he spend several night recovering the data from failed LVM.
Jul 20, 2017 | www.linuxjournal.com
At an unnamed location it happened thus... The customer had been using a home built 'tar' -based backup system for a long time. They were informed enough to have even tested and verified that recovery would work also.
Everything had been working fine, and they even had to do a recovery which went fine. Well, one day something evil happened to a disk and they had to replace the unit and do a full recovery.
Things looked fine until someone noticed that a directory with critically important and sensitive data was missing. Turned out that some manager had decided to 'secure' the directory by doing 'chmod 000 dir' to protect the data from inquisitive eyes when the data was not being used.
Of course, tar complained about the situation and returned with non-null status, but since the backup procedure had seemed to work fine, no one thought it necessary to view the logs...
Jul 20, 2017 | www.linuxjournal.com
Anonymous on Fri, 11/08/2002 - 03:00.Anonymous on Sun, 11/10/2002 - 03:00.
The Subject, not the content, really brings back memories.
Imagine this, your tasked with complete control over the network in a multi-million dollar company. You've had some experience in the real world of network maintaince, but mostly you've learned from breaking things at home.
Time comes to implement (yes this was a startup company), a backup routine. You carefully consider the best way to do it and decide copying data to a holding disk before the tape run would be perfect in the situation, faster restore if the holding disk is still alive.
So off you go configuring all your servers for ssh pass through, and create the rsync scripts. Then before the trial run you think it would be a good idea to create a local backup of all the websites.
You logon to the web server, create a temp directory and start testing your newly advance rsync skills. After a couple of goes, you think your ready for the real thing, but you decide to run the test one more time.
Everything seems fine so you delete the temp directory. You pause for a second and your month drops open wider than it has ever opened before, and a feeling of terror overcomes you. You want to hide in a hole and hope you didn't see what you saw.
I RECURSIVELY DELETED ALL THE LIVE CORPORATE WEBSITES ON FRIDAY AFTERNOON AT 4PM!
This is why it's ALWAYS A GOOD IDEA to use Midnight Commander or something similar to delete directories!!
...Root for (5) years and never trashed a filesystem yet (knockwoody)...
Anonymous on Fri, 11/08/2002 - 03:00.
rsync with ssh as the transport mechanism works very well with my nightly LAN backups. I've found this page to be very helpful: http://www.mikerubel.org/computers/rsync_snapshots/
Nov 01, 2018 | opensource.com
In a well-known data center (whose name I do not want to remember), one cold October night we had a production outage in which thousands of web servers stopped responding due to downtime in the main database. The database administrator asked me, the rookie sysadmin, to recover the database's last full backup and restore it to bring the service back online.
But, at the end of the process, the database was still broken. I didn't worry, because there were other full backup files in stock. However, even after doing the process several times, the result didn't change.
With great fear, I asked the senior sysadmin what to do to fix this behavior.
"You remember when I showed you, a few days ago, how the full backup script was running? Something about how important it was to validate the backup?" responded the sysadmin.
"Of course! You told me that I had to stay a couple of extra hours to perform that task," I answered. "Exactly! But you preferred to leave early without finishing that task," he said. "Oh my! I thought it was optional!" I exclaimed.
"It was, it was "
Moral of the story: Even with the best solution that promises to make the most thorough backups, the ghost of the failed restoration can appear, darkening our job skills, if we don't make a habit of validating the backup every time.
Nov 08, 2002 | www.linuxjournal.com
Anonymous on Fri, 11/08/2002
Why don't you just buy an extra hard disk and have a copy of your important data there. With today's prices it doesn't cost anything.
Anonymous on Fri, 11/08/2002 - 03:00. A lot of people seams to have this idea, and in many situations it should work fine.
However, there is the human factor. Sometimes simple things go wrong (as simple as copying a file), and it takes a while before anybody notices that the contents of this file is not what is expected. This means you have to have many "generations" of backup of the file in order to be able to restore it, and in order to not put all the "eggs in the same basket" each of the file backups should be on a physical device.
Also, backing up to another disk in the same computer will probably not save you when lighting strikes, as the backup disk is just as likely to be fried as the main disk.
In real life, the backup strategy and hardware/software choices to support it is (as most other things) a balancing act. The important thing is that you have a strategy, and that you test it regularly to make sure it works as intended (as the main point is in the article). Also, realizing that achieving 100% backup security is impossible might save a lot of time in setting up the strategy.
(I.e. you have to say that this strategy has certain specified limits, like not being able to restore a file to its intermediate state sometime during a workday, only to the state it had when it was last backed up, which should be a maximum of xxx hours ago and so on...)
Nov 08, 2002 | www.linuxjournal.com
Anonymous on Fri, 11/08/2002 - 03:00.
Its here .. Unbeliveable..
[I had intended to leave the discussion of "rm -r *" behind after the compendium I sent earlier, but I couldn't resist this one.
I also received a response from rutgers!seismo!hadron!jsdy (Joseph S. D. Yao) that described building a list of "dangerous" commands into a shell and dropping into a query when a glob turns up. They built it in so it couldn't be removed, like an alias. Anyway, on to the story! RWH.] I didn't see the message that opened up the discussion on rm, but thought you might like to read this sorry tale about the perils of rm....
(It was posted to net.unix some time ago, but I think our postnews didn't send it as far as it should have!)
Have you ever left your terminal logged in, only to find when you came back to it that a (supposed) friend had typed "rm -rf ~/*" and was hovering over the keyboard with threats along the lines of "lend me a fiver 'til Thursday, or I hit return"? Undoubtedly the person in question would not have had the nerve to inflict such a trauma upon you, and was doing it in jest. So you've probably never experienced the worst of such disasters....
It was a quiet Wednesday afternoon. Wednesday, 1st October, 15:15 BST, to be precise, when Peter, an office-mate of mine, leaned away from his terminal and said to me, "Mario, I'm having a little trouble sending mail." Knowing that msg was capable of confusing even the most capable of people, I sauntered over to his terminal to see what was wrong. A strange error message of the form (I forget the exact details) "cannot access /foo/bar for userid 147" had been issued by msg.
My first thought was "Who's userid 147?; the sender of the message, the destination, or what?" So I leant over to another terminal, already logged in, and typedgrep 147 /etc/passwd
only to receive the response/etc/passwd: No such file or directory.
Instantly, I guessed that something was amiss. This was confirmed when in response tols /etc
I gotls: not found.
I suggested to Peter that it would be a good idea not to try anything for a while, and went off to find our system manager. When I arrived at his office, his door was ajar, and within ten seconds I realised what the problem was. James, our manager, was sat down, head in hands, hands between knees, as one whose world has just come to an end. Our newly-appointed system programmer, Neil, was beside him, gazing listlessly at the screen of his terminal. And at the top of the screen I spied the following lines:# cd # rm -rf *
Oh, *****, I thought. That would just about explain it.
I can't remember what happened in the succeeding minutes; my memory is just a blur. I do remember trying ls (again), ps, who and maybe a few other commands beside, all to no avail. The next thing I remember was being at my terminal again (a multi-window graphics terminal), and typingcd / echo *
I owe a debt of thanks to David Korn for making echo a built-in of his shell; needless to say, /bin, together with /bin/echo, had been deleted. What transpired in the next few minutes was that /dev, /etc and /lib had also gone in their entirety; fortunately Neil had interrupted rm while it was somewhere down below /news, and /tmp, /usr and /users were all untouched.
Meanwhile James had made for our tape cupboard and had retrieved what claimed to be a dump tape of the root filesystem, taken four weeks earlier. The pressing question was, "How do we recover the contents of the tape?". Not only had we lost /etc/restore, but all of the device entries for the tape deck had vanished. And where does mknod live?
You guessed it, /etc.
How about recovery across Ethernet of any of this from another VAX? Well, /bin/tar had gone, and thoughtfully the Berkeley people had put rcp in /bin in the 4.3 distribution. What's more, none of the Ether stuff wanted to know without /etc/hosts at least. We found a version of cpio in /usr/local, but that was unlikely to do us any good without a tape deck.
Alternatively, we could get the boot tape out and rebuild the root filesystem, but neither James nor Neil had done that before, and we weren't sure that the first thing to happen would be that the whole disk would be re-formatted, losing all our user files. (We take dumps of the user files every Thursday; by Murphy's Law this had to happen on a Wednesday).
Another solution might be to borrow a disk from another VAX, boot off that, and tidy up later, but that would have entailed calling the DEC engineer out, at the very least. We had a number of users in the final throes of writing up PhD theses and the loss of a maybe a weeks' work (not to mention the machine down time) was unthinkable.
So, what to do? The next idea was to write a program to make a device descriptor for the tape deck, but we all know where cc, as and ld live. Or maybe make skeletal entries for /etc/passwd, /etc/hosts and so on, so that /usr/bin/ftp would work. By sheer luck, I had a gnuemacs still running in one of my windows, which we could use to create passwd, etc., but the first step was to create a directory to put them in.
Of course /bin/mkdir had gone, and so had /bin/mv, so we couldn't rename /tmp to /etc. However, this looked like a reasonable line of attack.
By now we had been joined by Alasdair, our resident UNIX guru, and as luck would have it, someone who knows VAX assembler. So our plan became this: write a program in assembler which would either rename /tmp to /etc, or make /etc, assemble it on another VAX, uuencode it, type in the uuencoded file using my gnu, uudecode it (some bright spark had thought to put uudecode in /usr/bin), run it, and hey presto, it would all be plain sailing from there. By yet another miracle of good fortune, the terminal from which the damage had been done was still su'd to root (su is in /bin, remember?), so at least we stood a chance of all this working.
Off we set on our merry way, and within only an hour we had managed to concoct the dozen or so lines of assembler to create /etc. The stripped binary was only 76 bytes long, so we converted it to hex (slightly more readable than the output of uuencode), and typed it in using my editor. If any of you ever have the same problem, here's the hex for future reference:070100002c000000000000000000000000000000000000000000000000000000 0000dd8fff010000dd8f27000000fb02ef07000000fb01ef070000000000bc8f 8800040000bc012f65746300
I had a handy program around (doesn't everybody?) for converting ASCII hex to binary, and the output of /usr/bin/sum tallied with our original binary. But hang on---how do you set execute permission without /bin/chmod? A few seconds thought (which as usual, lasted a couple of minutes) suggested that we write the binary on top of an already existing binary, owned by me...problem solved.
So along we trotted to the terminal with the root login, carefully remembered to set the umask to 0 (so that I could create files in it using my gnu), and ran the binary. So now we had a /etc, writable by all.
From there it was but a few easy steps to creating passwd, hosts, services, protocols, (etc), and then ftp was willing to play ball. Then we recovered the contents of /bin across the ether (it's amazing how much you come to miss ls after just a few, short hours), and selected files from /etc. The key file was /etc/rrestore, with which we recovered /dev from the dump tape, and the rest is history.
Now, you're asking yourself (as I am), what's the moral of this story? Well, for one thing, you must always remember the immortal words, DON'T PANIC. Our initial reaction was to reboot the machine and try everything as single user, but it's unlikely it would have come up without /etc/init and /bin/sh. Rational thought saved us from this one.
The next thing to remember is that UNIX tools really can be put to unusual purposes. Even without my gnuemacs, we could have survived by using, say, /usr/bin/grep as a substitute for /bin/cat. And the final thing is, it's amazing how much of the system you can delete without it falling apart completely. Apart from the fact that nobody could login (/bin/login?), and most of the useful commands had gone, everything else seemed normal. Of course, some things can't stand life without say /etc/termcap, or /dev/kmem, or /etc/utmp, but by and large it all hangs together.
I shall leave you with this question: if you were placed in the same situation, and had the presence of mind that always comes with hindsight, could you have got out of it in a simpler or easier way?
Answers on a postage stamp to:
Dept. of Computer Science ARPA: email@example.com
The University USENET: mcvax!ukc!man.cs.ux!miw
Manchester M13 9PL JANET: firstname.lastname@example.org
U.K. 061-273 7121 x 5699
Jul 20, 2017 | www.makeuseof.comBack in college, I used to work just about every day as a computer cluster consultant. I remember a month after getting promoted to a supervisor, I was in the process of training a new consultant in the library computer cluster. Suddenly, someone tapped me on the shoulder, and when I turned around I was confronted with a frantic graduate student a 30-something year old man who I believe was Eastern European based on his accent who was nearly in tears.
"Please need help my document is all gone and disk stuck!" he said as he frantically pointed to his PC.
Now, right off the bat I could have told you three facts about the guy. One glance at the blue screen of the archaic DOS-based version of Wordperfect told me that like most of the other graduate students at the time he had not yet decided to upgrade to the newer, point-and-click style word processing software. For some reason, graduate students had become so accustomed to all of the keyboard hot-keys associated with typing in a DOS-like environment that they all refused to evolve into point-and-click users.
The second fact, gathered from a quick glance at his blank document screen and the sweat on his brow told me that he had not saved his document as he worked. The last fact, based on his thick accent, was that communicating the gravity of his situation wouldn't be easy. In fact, it was made even worse by his answer to my question when I asked him when he last saved.
"I wrote 30 pages."
Calculated out at about 600 words a page, that's 18000 words. Ouch.
Then he pointed at the disk drive. The floppy disk was stuck, and from the marks on the drive he had clearly tried to get it out with something like a paper clip. By the time I had carefully fished the torn and destroyed disk out of the drive, it was clear he'd never recover anything off of it. I asked him what was on it.
I gulped. I asked him if he was serious. He was. I asked him if he'd made any backups. He hadn't.Making Backups of Backups
If there is anything I learned during those early years of working with computers (and the people that use them), it was how critical it is to not only save important stuff, but also to save it in different places. I would back up floppy drives to those cool new zip drives as well as the local PC hard drive. Never, ever had a single copy of anything.
Unfortunately, even today, people have not learned that lesson. Whether it's at work, at home, or talking with friends, I keep hearing stories of people losing hundreds to thousands of files, sometimes they lose data worth actual dollars in time and resources that were used to develop the information.
To drive that lesson home, I wanted to share a collection of stories that I found around the Internet about some recent cases were people suffered that horrible fate from thousands of files to entire drives worth of data completely lost. These are people where the only remaining option is to start running recovery software and praying, or in other cases paying thousands of dollars to a data recovery firm and hoping there's something to find.Not Backing Up Projects
The first example comes from Yahoo Answers , where a user that only provided a "?" for a user name (out of embarrassment probably), posted:
"I lost all my files from my hard drive? help please? I did a project that took me 3 days and now i lost it, its powerpoint presentation, where can i look for it? its not there where i save it, thank you"
The folks answering immediately dove into suggesting that the person run recovery software, and one person suggested that the person run a search on the computer for *.ppt.
... ... ...
Doing Backups Wrong
Then, there's a scenario of actually trying to do a backup and doing it wrong, losing all of the files on the original drive. That was the case for the person who posted on Tech Support Forum , that after purchasing a brand new Toshiba Laptop and attempting to transfer old files from an external hard drive, inadvertently wiped the files on the hard drive.
Please someone help me I last week brought a Toshiba Satellite laptop running windows 7, to replace my blue screening Dell vista laptop. On plugged in my sumo external hard drive to copy over some much treasured photos and some of my (work music/writing.) it said installing driver. it said completed I clicked on the hard drive and found a copy of my documents from the new laptop and nothing else.
While the description of the problem is a little broken, from the sound of it, the person thought they were backing up from one direction, while they were actually backing up in the other direction. At least in this case not all of the original files were deleted, but a majority were.
May 01, 2018 | Techdirt
Here's a random story, found via Kottke , highlighting how Pixar came very close to losing a very large portion of Toy Story 2 , because someone did an rm * (non geek: "remove all" command). And that's when they realized that their backups hadn't been working for a month. Then, the technical director of the film noted that, because she wanted to see her family and kids, she had been making copies of the entire film and transferring it to her home computer. After a careful trip from the Pixar offices to her home and back, they discovered that, indeed, most of the film was saved:
Now, mostly, this is just an amusing little anecdote, but two things struck me:
How in the world do they not have more "official" backups of something as major as Toy Story 2 . In the clip they admit that it was potentially 20 to 30 man-years of work that may have been lost. It makes no sense to me that this would include a single backup system. I wonder if the copy, made by technical director Galyn Susman, was outside of corporate policy. You would have to imagine that at a place like Pixar, there were significant concerns about things "getting out," and so the policy likely wouldn't have looked all that kindly on copies being used on home computers.The Mythbusters folks wonder if this story was a little over-dramatized , and others have wondered how the technical director would have "multiple terabytes of source material" on her home computer back in 1999. That resulted in an explanation from someone who was there that what was deleted was actually the database containing the master copies of the characters, sets, animation, etc. rather than the movie itself. Of course, once again, that makes you wonder how it is that no one else had a simple backup. You'd think such a thing would be backed up in dozens of places around the globe for safe keeping...
Hans B PUFAL ( profile ), 18 May 2012 @ 5:53amReminds me of .... Some decades ago I was called to a customer site, a bank, to diagnose a computer problem. On my arrival early in the morning I noted a certain panic in the air. On querying my hosts I was told that there had been an "issue" the previous night and that they were trying, unsuccessfully, to recover data from backup tapes. The process was failing and panic ensued.Anonymous Coward , 18 May 2012 @ 6:00am
Though this was not the problem I had been called on to investigate, I asked some probing questions, made a short phone call, and provided the answer, much to the customer's relief.
What I found was that for months if not years the customer had been performing backups of indexed sequential files, that is data files with associated index files, without once verifying that the backed-up data could be recovered. On the first occasion of a problem requiring such a recovery they discovered that they just did not work.
The answer? Simply recreate the index files from the data. For efficiency reasons (this was a LONG time ago) the index files referenced the data files by physical disk addresses. When the backup tapes were restored the data was of course no longer at the original place on the disk and the index files were useless. A simple procedure to recreate the index files solved the problem.
Clearly whoever had designed that system had never tested a recovery, nor read the documentation which clearly stated the issue and its simple solution.
So here is a case of making backups, but then finding them flawed when needed.Re: Reminds me of .... That's why, in the IT world, you ALWAYS do a "dry run" when you want to deploy something, and you monitor the heck out of critical systems.Rich Kulawiec , 18 May 2012 @ 6:30amTwo notes on backupssaulgoode ( profile ), 18 May 2012 @ 6:38am
1. Everyone who has worked in computing for any period of time has their own backup horror story. I'll spare you mine, but note that as a general observation, large organizations/corporations tend to opt for incredibly expensive, incredibly complex, incredibly overblown backup "solutions" sold to them by vendors rather than using the stock, well-tested, reliable tools that they already have. (e.g., "why should we use dump, which is open-source/reliable/portable/tested/proven/efficient/etc., when we could drop $40K on closed-source/proprietary/non-portable/slow/bulky software from a vendor?"
Okay, okay, one comment: in over 30 years of working in the field, the second-worst product I have ever had the misfortune to deal with is Legato (now EMC) NetWorker.
2. Hollywood has a massive backup and archiving problem. How do we know? Because they keep telling us about it. There are a series of self-promoting commercials that they run in theaters before movies, in which they talk about all of the old films that are slowly decaying in their canisters in vast warehouses, and how terrible this is, and how badly they need charitable contributions from the public to save these treasures of cinema before they erode into dust, etc.
Let's skip the irony of Hollywood begging for money while they're paying professional liar Chris Dodd millions and get to the technical point: the easiest and cheapest way to preserve all of these would be to back them up to the Internet. Yes, there's a one-time expense of cleaning up the analog versions and then digitizing them at high resolution, but once that's done, all the copies are free. There's no need for a data center or elaborate IT infrastructure: put 'em on BitTorrent and let the world do the work. Or give copies to the Internet Archive. Whatever -- the point is that once we get past the analog issues, the only reason that this is a problem is that theymade it a problem by refusing to surrender control.Re: Two notes on backups "Real Men don't make backups. They upload it via ftp and let the world mirror it." - Linus TorvaldsAnonymous Coward , 18 May 2012 @ 7:02amWhat I suspect is that she was copying the rendered footage. If the footage was rendered at a resolution and rate fitting to DVD spec, that'd put the raw footage at around 3GB to 4GB for a full 90min, which just might fit on the 10GB HDD that were available back then on a laptop computer (remember how small OSes were back then).aldestrawk ( profile ), 18 May 2012 @ 8:34am
Even losing just the rendered raw footage (or even processed footage), would be a massive setback. It takes a long time across a lot of very powerful computers to render film quality footage. If it was processed footage then it's even more valuable as that takes a lot of man hours of post fx to make raw footage presentable to a consumer audience.a retelling by Oren Jacob Oren Jacob, the Pixar director featured in the animation, has made a comment on the Quora post that explains things in much more detail. The narration and animation was telling a story, as in storytelling. Despite the 99% true caption at the end, a lot of details were left out which misrepresented what had happened. Still, it was a fun tale for anyone who had dealt with backup problems. Oren Jacob's retelling in the comment makes it much more realistic and believable.Mason Wheeler , 18 May 2012 @ 10:01am
The terabytes level of data came from whoever posted the video on Quora. The video itself never mentions the actual amount of data lost or the total amount the raw files represent. Oren says, vaguely, that it was much less than a terabyte. There were backups! The last one was from two days previous to the delete event. The backup was flawed in that it produced files that when tested, by rendering, exhibited errors.
They ended up patching a two-month old backup together with the home computer version (two weeks old). This was labor intensive as some 30k files had to be individually checked.
The moral of the story.
- Firstly, always test a restore at some point when implementing a backup system.
- Secondly, don't panic! Panic can lead to further problems. They could well have introduced corruption in files by abruptly unplugging the computer.
- Thirdly, don't panic! Despite, somehow, deleting a large set of files these can be recovered apart from a backup system.
Deleting files, under Linux as well as just about any OS, only involves deleting the directory entries. There is software which can recover those files as long as further use of the computer system doesn't end up overwriting what is now free space.Re: a retelling by Oren Jacobaldestrawk ( profile ), 18 May 2012 @ 10:38amPanic can lead to further problems. They could well have introduced corruption in files by abruptly unplugging the computer.
What's worse? Corrupting some files or deleting all files?Re: Re: a retelling by Oren JacobDanny ( profile ), 18 May 2012 @ 10:49am
In this case they were not dealing with unknown malware that was steadily erasing the system as they watched. There was, apparently, a delete event at a single point in time that had repercussions that made things disappear while people worked on the movie.
I'll bet things disappeared when whatever editing was being done required a file to be refreshed.
A refresh operation would make the related object disappear when the underlying file was no longer available.
Apart from the set of files that had already been deleted, more files could have been corrupted when the computer was unplugged.
Having said that, this occurred in 1999 when they were probably using the Ext2 filesystem under Linux. These days most everyone uses a filesystem that includes journaling which protects against corruption that may occur when a computer loses power. Ext3 is a journaling filesystem and was introduced in 2001.
In 1998 I had to rebuild my entire home computer system. A power glitch introduced corruption in a Windows 95 system file and use of a Norton recovery tool rendered the entire disk into a handful of unusable files. It took me ten hours to rebuild the OS and re-install all the added hardware, software, and copy personal files from backup floppies. The next day I went out and bought a UPS. Nowadays, sometimes the UPS for one of my computers will fail during one of the three dozen power outages a year I get here. I no longer have problems with that because of journaling.I've gotta story like this too Ive posted in athe past on Techdirt that I used to work for Ticketmaster. The is an interesting TM story that I don't think ever made it into the public, so I will do it now.aldestrawk ( profile ), 18 May 2012 @ 11:30am
Back in the 1980s each TM city was on an independent computer system (PDP unibus systems with RM05 or CDC9766 disk drives. The drives were fixed removable boxes about the size of a washing machine, the removable disk platters about the size of the proverbial breadbox. Each platter held 256mb formatted.
Each city had itts own operations policies, but generally, the systems ran with mirrored drives, the database was backed up every night, archival copies were made monthly. In Chicago, where I worked, we did not have offsite backup in the 1980s. The Bay Area had the most interesting system for offsite backup.
The Bay Area BASS operation, bought by TM in the mid 1980s, had a deal with a taxi driver. They would make their nightly backup copies in house, and make an extra copy on a spare disk platter. Tis cabbie would come by the office about 2am each morning, and they'd put the spare disk platter in his trunk, swapping it for the previous day's copy that had been his truck for 24 hours. So, for the cost of about two platters ($700 at the time) and whatever cash they'd pay the cabbie, they had a mobile offsite copy of their database circulating the Bay Area at all times.
When the World Series earthquake hit in October 1988, the TM office in downtown Oakland was badly damaged. The only copy of the database that survived was the copy in the taxi cab.
That incident led TM corporate to establish much more sophisticated and redundant data redundancy policies.Re: I've gotta story like this too I like that story. Not that it matters anymore, but taxi cab storage was probably a bad idea. The disks were undoubtedly the "Winchester" type and when powered down the head would be parked on a "landing strip". Still, subjecting these drives to jolts from a taxi riding over bumps in the road could damage the head or cause it to be misaligned. You would have known though it that actually turned out to be a problem. Also, I wouldn't trust a taxi driver with the company database. Although, that is probably due to an unreasonable bias towards cab drivers. I won't mention the numerous arguments with them (not in the U.S.) over fares and the one physical fight with a driver who nearly ran me down while I was walking.Huw Davies , 19 May 2012 @ 1:20amRe: Re: I've gotta story like this too RM05s are removable pack drives. The heads stay in the washing machine size unit - all you remove are the platters.That One Guy ( profile ), 18 May 2012 @ 5:00pmWhat I want to know is this... She copied bits of a movie to her home system... how hard did they have to pull in the leashes to keep Disney's lawyers from suing her to infinity and beyond after she admitted she'd done so(never mind the fact that he doing so saved them apparently years of work...)?Lance , 3 May 2014 @ 8:53am
Evidently, the film data only took up 10 GB in those days. Nowhere near TB...
Nov 07, 2002 | Linux Journal
The dangers of not testing your backup procedures and some common pitfalls to avoid.
Backups. We all know the importance of making a backup of our most important systems. Unfortunately, some of us also know that realizing the importance of performing backups often is a lesson learned the hard way. Everyone has their scary backup stories. Here are mine. Scary Story #1
Like a lot of people, my professional career started out in technical support. In my case, I was part of a help-desk team for a large professional practice. Among other things, we were responsible for performing PC LAN backups for a number of systems used by other departments. For one especially important system, we acquired fancy new tape-backup equipment and a large collection of tapes. A procedure was put in place, and before-you-go-home-at-night backups became a standard. Some months later, a crash brought down the system, and all the data was lost. Shortly thereafter, a call came in for the latest backup tape. It was located and dispatched, and a recovery was attempted. The recovery failed, however, as the tape was blank . A call came in for the next-to-last backup tape. Nervously, it was located and dispatched, and a recovery was attempted. It also failed because this tape also was blank. Amid long silences and pink-slip glares, panic started to set in as the tape from three nights prior was called up. This attempt resulted in a lot of shouting.
All the tapes were then checked, and they were all blank. To add insult to injury, the problem wasn't only that the tapes were blank--they weren't even formatted! The fancy new backup equipment wasn't smart enough to realize the tapes were not formatted, so it allowed them to be used. Note: writing good data to an unformatted tape is never a good idea.
Now, don't get me wrong, the backup procedures themselves were good. The problem was that no one had ever tested the whole process--no one had ever attempted a recovery. Was it no small wonder then that each recovery failed?
For backups to work, you need to do two things: (1) define and implement a good procedure and (2) test that it works.
To this day, I can't fathom how my boss (who had overall responsibility for the backup procedures) managed not to get fired over this incident. And what happened there has always stayed with me.
A Good Solution
When it comes to doing backups on Linux systems, a number of standard tools can help avoid the problems discussed above. Marcel Gagné's excellent book (see Resources) contains a simple yet useful script that not only performs the backup but verifies that things went well. Then, after each backup, the script sends an e-mail to root detailing what occurred.
I'll run through the guts of a modified version of Marcel's script here, to show you how easy this process actually is. This bash script starts by defining the location of a log and an error file. Two mv commands then copy the previous log and error files to allow for the examination of the next-to-last backup (if required):#! /bin/bash backup_log=/usr/local/.Backups/backup.log backup_err=/usr/local/.Backups/backup.err mv $backup_log $backup_log.old mv $backup_err $backup_err.old
With the log and error files ready, a few echo commands append messages (note the use of >>) to each of the files. The messages include the current date and time (which is accessed using the back-ticked date command). The cd command then changes to the location of the directory to be backed up. In this example, that directory is /mnt/data, but it could be any location:echo "Starting backup of /mnt/data: `date`." >> $backup_log echo "Errors reported for backup/verify: `date`." >> $backup_err cd /mnt/data
The backup then starts, using the tried and true tar command. The -cvf options request the creation of a new archive (c), verbose mode (v) and the name of the file/device to backup to (f). In this example, we backup to /dev/st0, the location of an attached SCSI tape drive:tar -cvf /dev/st0 . 2>>$backup_err
Any errors produced by this command are sent to STDERR (standard error). The above command exploits this behaviour by appending anything sent to STDERR to the error file as well (using the 2>> directive).
When the backup completes, the script then rewinds the tape using the mt command, before listing the files on the tape with another tar command (the -t option lists the files in the named archive). This is a simple way of verifying the contents of the tape. As before, we append any errors reported during this tar command to the error file. Additionally, informational messages are added to the log file at appropriate times:mt -f /dev/st0 rewind echo "Verifying this backup: `date`" >>$backup_log tar -tvf /dev/st0 2>>$backup_err echo "Backup complete: `date`" >>$backup_log
To conclude the script, we concatenate the error file to the log file (with cat ), then e-mail the log file to root (where the -s option to the mail command allows the specification of an appropriate subject line):cat $backup_err >> $backup_log mail -s "Backup status report for /mnt/data" root < $backup_log
And there you have it, Marcel's deceptively simple solution to performing a verified backup and e-mailing the results to an interested party. If only we'd had something similar all those years ago.
... ... ...
If you are not careful you can wipe out your C disk performing a restore of the Windows C partition image to a USB drive, as selection of bootable recovery image somehow redirects recovery to disk C. The warning sign is when Acronis True Image wants to reboot computer to proceed.
If you are brave enough to go past this point, then despite the fact that you explicitly made your target different from bootable drive you need to face unpleasant consequences -- your C partition is now gone.
You can imagine your surprise with the results. I once did that. Thanks God there was no critical data on this wiped C drive. I already migrate it to a new PC. My first reaction was to throw this garbage program where it belongs. But the problem is that other similar programs are not much better and now I am trained not to trust Acronis and probably can do better in future. Another factor is that if you don't use Acronis True Image often you forget about it capabilities (in this case the write decision would be to use cloning of the disk operating, not restoration from the image but the problem was that the disk and image were slightly different and I want the content of the image not the content of the disk.
Still right way would be to do first clone of the disk and then perform restoration of the image to this drive. As I don't use complex operations with Acronis often, I forgot about that and was punished. And believe me you jaw really drops in such cases when you see the results...
Another time, our AIX/370 cluster managed to trash the /etc/passwd file. All 4 machines in the cluster lost their copies within milliseconds. In the next few minutes, I discovered that (a) the nightly script that stashed an archive copy hadn't run the night before and (b) that our backups were pure zorkumblattum as well. (The joys of running very beta-test software).
I finally got saved when I realized the cluster had *5* machines in it - a lone PS/2 had crashed the night before, and failed to reboot. So it had a propogated copy of /etc/passwd as of the previous night.
Go to that PS/2, unplug it's Ethernet.. reboot it. Copy /etc/passwd to floppy, carry to a working (?) PS/2 in the cluster, tar it off, let it propagate to other cluster sites. Go back, hook up the crashed PS/2s ethernet.. All done.
Only time in my career that having beta-test software crash a machine saved me from bugs in beta-test software. ;)
Once I was in the position of upgrading a Gould PN/9080. I was a good sysadmin, took a backup before I started, since the README said that they had changed the I-node format slightly. I do the upgrade, and it goes with unprecidented (for Gould) smoothness. mkfs all the user partitions, start restoring files. Blam.
I/O error on the tape. All 12 tapes. Both Sets of backups.
However, 'dd' could read the tape just fine.
36 straight hours later, I finally track it down to a bad chip on the tape controller board - the chip was involved in the buffer/convert from a 32-bit backplane to a 8-bit I/O cable. Every 4 bytes, the 5th bit would reverse sense. 20 mins later, I had a program written, and 'dd 3 my_twiddle 3 restore -f -' running.
Moral: Always *verify* the backups - the tape drive didn't report a write error, because what it *received* and what went on the tape were the same....
I'm sure I have other sagas, but those are some of the more memorable ones I've had...
Computer Systems Engineer
From: rca@Ingres.COM (Bob Arnold)
Organization: Ask Computer Systems Inc., Ingres Division, Alameda CA 94501
Many moons ago, in my first sysadmin job, learning via "on-the-job training", I was in charge of a UNIX box who's user disk developed a bad block. (Maybe you can see it already ...)
The "format" man page seemed to indicate that it could repair bad blocks. (Can you see it now?) I read the man page very carefully. Nowhere did it indicate any kind of destructive behavior.
I was brave and bold, not to mention boneheaded, and formatted the user disk.
The good news:
1) The bad block was gone.
2) I was about to learn a lot real fast :-)
The bad news:
1) The user data was gone too.
2) The users weren't happy, to say the least.
Having recently made a full backup of the disk, I knew I was in for a miserable all day restore. Why all day? It took 8 hours to dump that disk to 40 floppies. And I had incrementals (levels 1, 2, 3, 4, and 5, which were another sign of my novice state) to layer on top of the full.
Only it got worse. The floppy drive had intermittent problems reading some of the floppies. So I had to go back and retry to get the files which were missed on the first attempt.
This was also a port of Version 7 UNIX (like I said, this was many moons ago). It had a program called "restor", primordial ancestor of BSD's "restore". If you used the "x" option to extract selected files (the ones missed on earlier attempts), "restor" would use the *inode number* as the name of the extracted files. You had to move the extracted files to their correct locations yourself (the man page said to write a shellscript to do this :-(). I didn't know much about shell scripts at the time, but I learned a lot more that week.
Yes, it took me a full week, including the weekend, maybe 120 hours or more, to get what I could (probably 95% of the data) off the backups.
And there were a few ownership and permissions problems to be cleaned up after that.
Once burned twice shy. This is the only truly catastrophic mistake I've ever made as a sysadmin, I'm glad to be able to say.
I kept a copy of my memo to the users after I had done what I could. Reading it over now is sobering indeed. I also kept my extensive notes on the restore process - thank goodness I've never had to use them since.
1) The "man" pages don't tell you everything you need to know.
2) Don't do backups to floppies.
3) Test your backups to make sure they are readable.
4) Handle the format program (and anything else that writes directly to disk devices) like nitroglycerine.
5) Strenuously avoid systems with inadequate backup and restore programs wherever possible (thank goodness for "restore" with an "e"!).
6) If you've never done sysadmin work before, take a formal training class.
Well, I haven't thought about that one in a while! I can laugh about it now ....
From: rca@Ingres.COM (Bob Arnold)
Organization: Ask Computer Systems Inc., Ingres Division, Alameda CA 94501
In article <1992Oct12.233524.13463@pony.Ingres.COM> I wrote:
>I was brave and bold, not to mention boneheaded, and formatted the user disk.
> U rest of story deleted ... Bob ~
> 1) The "man" pages don't tell you everything you need to know.
> 2) Don't do backups to floppies.
> 3) Test your backups to make sure they are readable.
> 4) Handle the format program (and anything else that writes directly
> to disk devices) like nitroglycerine.
> 5) Strenuously avoid systems with inadequate backup and restore
> programs wherever possible (thank goodness for "restore" with
> an "e"!).
> 6) If you've never done sysadmin work before, take a formal
> training class.
Just thought of a few more related morals (managers pay attention now):
7) You get what you pay for.
8) There's no substitute for experience.
9) It's a lot less painful to learn from someone else's experience than your own (that's what this thread is about, I guess :-) )
Part of the story I should tell here. My employer had been looking for a way to cut costs. I was 15% cheaper than their previous sysadmin so they let him go and hired me. It wasn't as nasty as it sounds, since they kept him on as a consultant at 4 hours a week and he ended up with a better job too (so did I). Everyone benefited in the end. I leaned heavily on his consulting, which was great. He was older and wiser, and probably had his own horror stories to tell. After this one, so did I!
Google matched content
The Last but not Least Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand ~Archibald Putt. Ph.D
Copyright © 1996-2018 by Dr. Nikolai Bezroukov. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) in the author free time and without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.
This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain some broken links as it develops like a living tree...
|You can use PayPal to make a contribution, supporting development of this site and speed up access. In case softpanorama.org is down you can use the at softpanorama.info|
The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the author present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.
Last modified: November 08, 2019