
Softpanorama classification of sysadmin horror stories

A modest attempt to share experience and non-obvious mistakes in order to keep others from making them

Ten to fifteen minutes spent re-reading this page once a month can help you avoid some of the situations described below. A spectacular blunder is often too valuable to be forgotten, as it tends to repeat itself ;-).

Version 1.7 (July 2, 2019)



Introduction

"More systems have been wiped out by admins
than any hacker could do in a lifetime"

Rick Furniss

“Experience fails to teach where there is no desire to learn.”
"Everything happens to everybody sooner or later if there is time enough."

George Bernard Shaw

“Experience is the most expensive teacher, but a fool will learn from no other.”

Benjamin Franklin

Unix system administration is an interesting and complex craft. It's good if your work demands the use of your technical skills, creativity, and judgment. If it doesn't, then you're in the absurd world of Dilbertized cubicle farms and bureaucratic stupidity. Unfortunately, that happens too.

There is a lot of deep elegance in Unix, and a talented sysadmin, like any talented craftsman, is able to expose this hidden beauty through masterful manipulation of complex objects using classic Unix tools and the command line, which often amazes observers with a Windows background. In Unix administration you need to improvise on the job to get things done, create your own tools, and master the command line environment; to work at an advanced level you can't go "by the manual", you need to improvise. Unfortunately, some of these improvisations produce unexpected side effects ;-)

In a way, not only the execution of complex sequences of commands is a part of this craftsmanship. Blunders, and the folklore about them, are also a legitimate part of the craft. It's human to err, after all. And if you are working as root, such an error can easily wipe out a vital part of the system. If you are unlucky, this is a production system. If you are especially unlucky, there is no backup. It is the presence or absence of a recent backup that often distinguishes a horror story from a minor nuisance. That's why many veteran sysadmins create a personal backup before doing anything complex and/or risky. At the very least you should back up /etc on your first login as root each day. That can be done from /root/.bash_profile.
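Here is a minimal sketch of such a snippet for /root/.bash_profile. The backup location and the 90-day retention are illustrative assumptions, not a fixed convention:

# Keep one dated tar copy of /etc per day (first root login of the day wins)
backup_dir=/var/backups/etc-snapshots
snapshot="$backup_dir/etc-$(date +%F).tar.gz"
mkdir -p "$backup_dir"
if [ ! -f "$snapshot" ]; then
    tar czf "$snapshot" /etc 2>/dev/null
fi
# Optionally expire snapshots older than 90 days
find "$backup_dir" -name 'etc-*.tar.gz' -mtime +90 -delete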

Sh*t happens, but there is a system in any madness ;-). That's why it is important to try to classify typical sysadmin mistakes. People learn from experience, and that's why each sysadmin should maintain his own lab journal. Regardless of the reason, every mistake should be documented, as it constitutes an important lesson pointing to a whole class of similar possible errors. As the saying goes, "never waste a good crisis". For example, in many cases when a simple mistake causes serious problems, we observe the absence of backups and the absence of a baseline, often both.

The most common blunder in Unix/Linux is probably wiping out useful data with a wrong rm command. This danger can't be avoided entirely, as the power of rm is necessary, but in addition to having an up-to-date backup, there are several steps that you can take to mitigate the damage:

  1. Block the operation for top-level system directories such as /etc or /usr (a limited form of this measure is now implemented in GNU rm used in Linux, which by default refuses to operate recursively on / itself). Also block the operation on any directory in your list of favorite directories, if you have one (this requires a script).
  2. Many such blunders occur when you operate on a backup copy of a directory and automatically type the name of the directory with a slash in front of it (rm -r /etc instead of rm -r etc), because you are conditioned to type it this way. To counter this, you can first rename the directory to something else. A script can also check whether another directory with this name exists and warn you. Alternatively, you can simply move the directory to a /Trash folder with, say, a 90-day expiration period for files in it.
  3. You can write a wrapper script such as rmm or saferm which, among other useful preventive checks, introduces a delay between hitting Enter and the start of the operation, and lists the affected files, or at least their number and the first five of them (see the sketch below).
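Here is a minimal sketch of such a wrapper, assuming bash and GNU coreutils; it does not handle rm options, and the list of protected directories is only an illustration:

#!/bin/bash
# saferm -- preview what would be removed, then delay before deleting
[ $# -eq 0 ] && { echo "usage: saferm file..." >&2; exit 1; }
targets=( "$@" )
# Refuse to touch critical system directories outright
for t in "${targets[@]}"; do
    case "$(readlink -f -- "$t")" in
        /|/etc|/usr|/bin|/var|/home) echo "saferm: refusing to remove $t" >&2; exit 1 ;;
    esac
done
count=$(ls -d -- "${targets[@]}" 2>/dev/null | wc -l)
echo "About to remove $count item(s); the first few are:"
ls -d -- "${targets[@]}" 2>/dev/null | head -5
echo "Press Ctrl-C within 5 seconds to abort..."
sleep 5
rm -r -- "${targets[@]}"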

So avoiding the dangers of rm blunders is an important part of the art of Unix system administration. We have a separate page for this: Creative uses of rm

Another common and disastrous blunder for a Unix sysadmin, who often juggles many dozens of servers, is performing a destructive operation on the wrong server. Such cases vary from rebooting a production server instead of its QA counterpart, to removing a file on the original box instead of the backup (while this file does not exist on the backup), and many others. See Performing the operation on a wrong server.

Learning from your own mistakes, as well as from the mistakes of others, is an important part of learning the craft. That's why it is important to periodically reread such pages: they can prevent some horrible blunders. This page provides such generic information.

It is important to reread it periodically, as the memory of a recent blunder usually fades quickly, and in a year most of us are ready to repeat it again ;-). Also, a Unix sysadmin who does not understand, or does not remember, the danger of rm -rf .* is a bad sysadmin. In addition, keep a personal journal of your SNAFUs (a typical SNAFU, like a traffic accident, is a confluence of several mistakes, simultaneous maneuvers, misunderstandings, etc.; also, as in the army, incompetent bosses often play a prominent role in such incidents).

Periodically browsing this personal log is really important, as each of these incidents can easily be, as they politically correctly say, a "career limiting move". While bad incidents stimulate learning and the personal growth of a system administrator, there are less painful ways to grow your knowledge, including knowledge of bad incidents. That's why this set of pages was created. Reading it and other similar pages is a must for any aspiring sysadmin.

There are several fundamental reasons for the blunders sysadmins commit:

  1. Excessive zeal. As Talleyrand advised young diplomats: "First and foremost, try to avoid excessive zeal." That very wise recommendation is fully applicable to sysadmin activities, especially important ones. Often doing nothing NOW is the optimal course of action: it gives you time to think about the situation and understand it better. "Wait until the next morning" is often not bad advice. Another trivial corollary of this maxim is that you should never start anything important just before a vacation (the urge "to finish everything before vacation" is a road to hell paved with good intentions ;-), unless you really want your vacation to be spoiled. Excessive zeal is probably the source of the most horrible blunders. Doing something "quick and fast" to help the company, or your manager, or your colleagues can often turn into an unmitigated disaster. So resist urges to violate established procedures even when you are pressed. That relates first of all to rule no. 1: create a backup before starting any activity that can screw up the OS or important components.

  2. Listening to the user without checking the gory details. Users often do not understand what they want, so blindly following their instructions is a sure recipe for disaster. So, do users know what they want? No, no, and no. Three times no. For example, if a user wrote you an email requesting that a newer version of the R language interpreter be installed on your servers ASAP, because the previous version is too old (which is true), then without checking you might miss the real meaning of his message, which is quite different from the requested action (and that means that following his request leads to a rather big SNAFU):

    1. The user is a typical luser (idiot/novice/incompetent) who knows neither Linux nor R well and tried to install some R package (or a group of packages). When the installation failed, he/she simply jumped to the conclusion that the problem is with the R interpreter version, having heard that there is a newer one.

    2. The user inherited some code, which he does not understand, and it does not run under the currently installed interpreter. In his infinite wisdom the user decided that the problem is not in him/her but in the R interpreter.

    3. A combination of (1) and (2).

    4. Some other reason, with incompetence as the root cause.

    If in this case you jump to action and update the interpreter, you can now face several more serious problems:

    1. You might need to restore everything from backup (and that means, for example, doing it on all 16 or more servers that you just updated, spending part of your weekend ;-), but for some parts that the update overwrote you do not have a reliable backup.

    2. The problem that the user faces has become much worse, and now you are in the loop to help him/her, because it is you who made it worse.

    3. Other users start experiencing serious problems with their scripts.

    Another, more humiliating, story of the same type comes from Opensource.com:

    The accidental spammer (An anonymous story)

    It's a pretty common story that new sys admins have to tell: They set up an email server and don't restrict access as a relay, and months later they discover they've been sending millions of spam email across the world. That's not what happened to me.

    I set up a Postfix and Dovecot email server, it was running fine, it had all the right permissions and all the right restrictions. It worked brilliantly for years. Then one morning, I was given a file of a few hundred email addresses. I was told it was an art organization list, and there was an urgent announcement that must be made to the list as soon as possible. So, I got right on it. I set up an email list, I wrote a quick sed command to pull out the addresses from the file, and I imported all the addresses. Then, I activated everything.

    Within ten minutes, my server nearly falls over. It turns out I had been asked to set up a mailing list for people we'd never met, never contacted before, and who had no idea they were being added to a mailing list. I had unknowingly set up a way for us to spam hundreds of people at arts organizations and universities. Our address got blacklisted by a few places, and it took a week for the angry emails to stop. Lesson: Ask for more information, especially if someone is asking you to import hundreds of addresses.

  3. False sense of security. A false sense of security invites performing dangerous actions without proper preparation. For example, the fact that you have used a command for a decade or more does not actually shield you from committing horrible blunders if you are not careful, especially if you prefer, as many sysadmins do, to work as root. Verifying commands by typing them in an editor first and only then running them is a good practice, especially if you work with a host hundreds of miles away (see the preview sketch after this list). It is so easy one day to absolutely automatically type something like

    rm * 171206.log

    instead   of

    rm *171206.log

    Our brains sometimes tend to play jokes on us.

    Another aspect of the same problem is that the complexity of the environment and the hidden interactions between components are ignored, and you jump into action without investigating the possible consequences of the move. For example, even a trivial operation like fixing the way the year is represented (the Year 2000 problem) proved to be a mess. Similarly, even a simple upgrade of the version of a compiler or interpreter, done at the request of one user, can disrupt the work of others. This is typical of both hardware and software operations. For example, sysadmins sometimes lock themselves out of a remote box by performing a network reconfiguration operation that does not take into account what type of network connection they are using.
     

  4. Absence of backup. This is another typical case of a "false sense of security", but it deserves to be a separate point. One thing that distinguishes a professional Unix sysadmin from an amateur is the attitude to backup and the level of knowledge of backup technologies. A professional sysadmin knows all too well that the difference between a major SNAFU and a nuisance is often the availability of an up-to-date backup of the data. Here is a pretty telling poem on the subject (from an unknown source; well, originally Paul McCartney :-):

    Yesterday,
    All those backups seemed a waste of pay.
    Now my database has gone away.
    Oh I believe in yesterday.

  5. Loss of situational awareness. Situational awareness is the ability to identify, process, and comprehend the critical elements of information about what is happening; be alert to any clues which might indicate that you have lost it. Loss of situational awareness typically happens when you are tired, exhausted, or under pressure. It is often connected with lack of preparedness, when the person has already forgotten important details about a particular procedure or subsystem due to very infrequent problems with it, but failed to RTFM and jumped into action.

    In this sense, while a long troubleshooting session can be beneficial, as only this way do you get a "mental picture" (like a traffic controller) of what is happening, extremely long troubleshooting sessions (all-nighters) are counterproductive, and even dangerous, precisely because of this factor: in such conditions you can accidentally destroy part of the OS with one stroke and create a much bigger problem than the one you were dealing with.

    Avoiding any complex or potentially destructive operation when tired is prudent advice, but due to the specifics of sysadmin work, with its unpredictable load peaks, it is very difficult to follow.

    Here are a couple of tips: postpone any risky operation that is not truly urgent until you are rested, and if you must act anyway, create a backup first and write the commands in an editor before executing them.

  6. "Misperception of the level of risks": often the most damage is caused not the failure itself but subsequent hasty and badly thought out "recovery" actions. People tend to react to disaster on an emotional basis, with feelings overweighting the logic and rush to actions trying to save the situation, while making it worse. Humans are “wired” biologically to fear first and think second. So after experiencing first, often relatively minor problem sysadmin often overreact and commits a huge blunder, trying to correct this error without full understanding of the situation. At this time minor problem became real SNAFU. The key in facing any serious problem is to give yourself some "cool down" period. Just a couple of minutes of thinking about the problem can save you from making a misguided action that makes the situation tremendously worse, sometime irreparable. In any case creating a backup is not a step you can skip. This is a step that differentiates between amateur and professional.
     
  7. Modern hardware is way too complex, and sometimes dealing with the components of a modern server becomes a minefield (especially if this happens rarely and previous lessons and knowledge are long forgotten), which can also lead to disasters. For example, the HP P410 RAID controller has the interesting property of "forgetting" its configuration in certain circumstances if you remove a drive that is not used by the controller while the server is up. In this case, on reboot you get something like
    <4>cciss 0000:05:00.0: cciss: Trying to put board into performant mode
    <4>cciss 0000:05:00.0: Placing controller into performant mode
    <6> cciss/c0d0: unknown partition table    

    Here your jaw drops, especially if you realize that you have no recent backup.
     

  8. Reckless actions. An "ohnosecond" is defined as the period of time between when you hit the Enter key and when you realize what you just did. The desire to "cut corners" is often connected with being tired, personal problems, excessive hubris, bravado, being over-caffeinated, etc. It is very similar to reckless driving. For example, people often forget that ".*" matches "." and ".." and run rm or another "destructive" command on a production server without first testing which set of files is affected. The difference between a test server and a production server is that any action on a production server should be verified prior to execution. The similarity with traffic accidents here is that, like a reckless driver, a reckless sysadmin is aware of the risk and consciously disregards it.
    State laws usually define reckless driving as “driving with a willful or a wanton disregard for the safety of persons or property,” or in similar terms. Courts consider alcohol and drug use as a factor in deciding whether the driver’s actions were reckless.
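Here is the preview habit mentioned in point 3: expand the glob harmlessly first, and run the destructive command only after the preview looks right. A minimal sketch, reusing the file names from the example above:

echo rm *171206.log      # shows exactly what the shell will expand
ls -ld *171206.log       # or list the matches first
rm *171206.log           # run only after the preview looks right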

Raising situational awareness by doing self-safety training

"Those Who Forget History Are Doomed to Repeat It"
Multiple authors

"Those who cannot remember the past are condemned to repeat it."

George Santayana

Having even a primitive record of your blunders in the form of, say, an HTML page, a Word document, or a special logbook is a good way to increase situational awareness. People are usually unable to learn from blunders committed by others. They prefer to make their own... And even then, after a year or two the lesson is typically completely forgotten.

Rereading the description of your own blunder typically provokes a strong emotional reaction and reinforces the understanding of the dangers related to it. This type of "emotional memory" is very important in helping to avoid a similar blunder in the future. That means that periodic review of descriptions of your own blunders is a really necessary part of the sysadmin arsenal. Re-reading those descriptions should be periodic (for example, once a quarter) self-safety training, much like safety training in large corporations.

I can attest that 10-15 minutes spent on re-reading and enhancing this material once a month can help you avoid some of the situations described below. A spectacular blunder is often too valuable to be forgotten, as it tends to repeat itself ;-). And people tend to commit the same blunders again and again. If you read some of the stories from the late 90s, they often sound as if they were written yesterday.


Reading about somebody else's blunder does not fully convey the gravity of the situation in which you can find yourself by repeating it, but it can serve as a weaker substitute for a log of your own blunders. For example, the understanding that dealing with files and directories starting with a dot in Unix requires extreme caution can probably be acquired only by committing one (just one) such blunder.

Dealing with RAID controllers is another area that requires extreme caution, good planning, and the availability of a verified backup. Sometimes even a routine firmware update turns into an unmitigated disaster. This is also an area where the difference between a minor nuisance and a major disaster is the presence of a recent backup.


In this page we present the "Softpanorama classification of sysadmin horror stories". It is not the first such attempt, and hopefully not the last. There is a very true saying that experience keeps the most expensive school, but fools are unable to learn in any other ;-). In order to learn from others, we need to classify classic sysadmin horror stories. One such classification, created by the author, is presented below.

The initial source, from which this page has grown, is an old but still quite relevant list of horror stories, "The Unofficial Unix Administration Horror Story Summary", created by Anatoly Ivasyuk, to whom the author is indebted; it exists in several versions. Here is the author's attempt to reorganize the categories and enhance the material by adding more modern stories.

The issues connected with ego and hubris

  hubris: Overbearing pride or presumption; arrogance:
"There is no safety in unlimited technological hubris”

( McGeorge Bundy)

All the world's a stage,
And all the men and women merely players;
They have their exits and their entrances,
And one man in his time plays many parts,

Shakespeare, As You Like It Act 2, scene 7, 139–143

I think there's a lot of naivete and hubris within our mix of personalities.

- Ian Williams

 

Hubris (/ˈhjuːbrɪs/, also hybris, from ancient Greek ὕβρις) describes a personality quality of extreme or foolish pride or dangerous over-confidence. In its ancient Greek context, it typically describes behavior that defies the norms of behavior or challenges the gods, and which in turn brings about the downfall, or nemesis, of the perpetrator of hubris (Hubris - Wikipedia).

Larry Wall once said that "the three chief virtues of a programmer are: laziness, impatience and hubris". I assume that this was a joke; it is not really true even for programmers. But for system administrators those three qualities are mortal sins, especially the last two. Hubris alone will never let you be a good system administrator. That's what distinguishes system administrators from artists.

We're all victims of our own hubris at times. Success usually breeds a degree of hubris, but some people are more affected than others. The problems start when people are too shy to ask more experienced colleagues for advice or information, because they are afraid to demonstrate that they do not know something which others assume they know. Sometimes this is exactly what leads to disasters.

 If the senior, more experienced  sysadmin looks at you like you’re an idiot, ask him why. It's better to be thought an idiot for asking than proven to be an idiot by not asking!

Softpanorama classification of sysadmin blunders

Backup = ( Full + Removable tapes (or media) + Offline + Offsite + Tested )

Vivek Gite

  1. Creative uses of rm with unintended consequences. This is an intrinsic, unavoidable danger in Linux, like using a sharp blade or a chainsaw. Such blunders happen very infrequently, but even a single one can be devastating, and if it happens on a production server it can cost you your job. That means that the level of knowledge of the intricacies of the rm command is directly correlated with the level of qualification of a Linux sysadmin. Please read the recommendations in Creative uses of rm with unintended consequences. They were created as a generalization of unfortunate episodes (usually called SNAFUs) of many sysadmins, including myself.
  2. Missing backup. Please remember that the backup is the last chance for you to restore the system if something goes terribly wrong. That means that before any dangerous step you need to locate the backup and check its existence. Making another backup is also a good idea, so that you have two or more recent copies. Attempting at least to browse the backup and see whether the data are intact is a must.
  3. Missing baseline and losing the initial configuration in the stream of changes. The most typical mistake in network troubleshooting and optimization is losing your initial configuration. This also may indicate lack of preparation and lack of situational awareness. You need to take several steps to prevent this blunder from occurring, and the most important of them are baselines and backups.
  4. Locking yourself out
  5. Performing an operation on the wrong computer. The naming schemes used by large corporations usually do not have enough distance between names to avoid such blunders. Also, if you work in multiple terminals and do not distinguish them by color, you can easily make such a blunder. For example, you can type XYZ300 instead of XYZ200 and log in to the wrong box. If you are in a hurry and do not check the name, you proceed with the operation intended for a different box. Another common situation is when you have several terminal windows open and in a hurry start working on the wrong server. That's why it is important that the shell prompt shows the name of the host (but that is not enough; the color of the terminal background is also important, probably more important). Often, if you have both a production server and a QA server for some application, it is wise never to have two terminals opened simultaneously when you are doing some tricky stuff that is potentially disastrous if done on the wrong box. Reopening a terminal is not a big deal, but it can save you from some very unpleasant situations.
  6. Forgetting which directory you are in and executing a command in the wrong directory. This is a common mistake when you work under severe time pressure or are very tired.
  7. Regular-expression-related blunders. Novice sysadmins usually do not realize that '.*' also matches '..', often with disastrous consequences if commands like chmod, chown, or rm are used recursively or in a find command (see the sketch after this list).
  8. Find filesystem traversal errors and other errors related to find. This is a very common class of errors, and it is covered in a separate page: Typical Errors In Using Find.
  9. Side effects of performing operations on home or application directories due to links to system directories. This is a pretty common mistake, and I have committed it myself several times, with various, but always unpleasant, consequences.
  10. Misunderstanding the syntax of an important command and/or not testing a complex command before executing it on a production box. Such errors are often made under time pressure. One such case is using recursive rm, chown, chmod, or find commands. Each of them deserves a category of its own.
  11. Ownership changing blunders. These are common when using chown with find, so you need to test the command first.
  12. Excessive zeal in improving the security of the system ;-). A lot of current security recommendations are either stupid or counterproductive. In the hands of an overly enthusiastic and semi-competent administrator, they become a weapon that no hacker can ever match. I think more systems have been destroyed by idiotic security measures than by hackers.
  13. Mistakes done under time pressure. Some of them were discussed above, but generally time pressure serves as a powerful catalyst for the most devastating mistakes.
  14. Patching horrors
  15. Unintended consequences of automatic system maintenance scripts
  16. Side effects/unintended consequences of multiple sysadmin working on the same box
  17. Premature or misguided optimization and/or cleanup of the system. Changing settings without fully understanding the consequences of such changes; misguided attempts to get rid of unwanted files or directories (cleaning the system).
  18. Mistakes made because of the differences between various Unix/Linux flavors. For example, in Solaris run level 5 shuts the system down and powers it off, while in Linux run level 5 is a running system with networking and X11.
  19. Stupid or preventable mistakes, including those made when dealing with complex server hardware.
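Here is the sketch referenced in point 7, showing why '.*' is dangerous and two safer alternatives; the user name is only an illustration:

# In most shells '.*' matches '..', so a recursive command can climb
# out of the directory you meant to touch:
cd /home/joeuser
echo .*                  # typically prints: . .. .bashrc .ssh ...
# A glob pair that matches dotfiles but never '.' or '..':
echo .[!.]* ..?*
# Or sidestep globbing entirely and operate on the directory by path:
chown -R joeuser:joeuser /home/joeuser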

Some personal experience

Cleaning NFS mounted home directory to save space

To speed up the installation of a server, I mounted my home directory from another server. Then I forgot about it, and it remained mounted. CentOS 6.9 was installed on the server. Later a researcher asked to reinstall RHEL on it, as one of his applications was supported only on RHEL, and I started by backing up all critical directories "just in case". Thinking that I already had a copy of my home directory elsewhere, I decided to shrink the space used on the /home filesystem and, not realizing that it was NFS-mounted, deleted it.
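A quick pre-flight check could have caught this: before any bulk deletion, verify whether the target lives on a local filesystem or on an NFS mount. A minimal sketch, assuming the findmnt utility from util-linux (the output shown is illustrative):

findmnt -T /home
# TARGET SOURCE               FSTYPE OPTIONS
# /home  server:/export/home  nfs    rw,relatime,...
df -PT /home    # on older boxes, the Type column tells the same story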

Reboot of wrong server

Commands such as reboot or mkinitrd can be pretty devastating when applied to the wrong server. That mishap happens to a lot of administrators, including myself, so it is prudent to take special measures to make it less probable.

This situation is often made more probable by the non-fault-tolerant naming schemes employed in many corporations, where the names of servers differ by one symbol. For example, the scheme serv01, serv02, serv03, and so on is a pretty dangerous naming scheme, as server names differ by only a single digit, and thus errors like working on the wrong server are much more probable.

The typical case of loss of situational awareness is performing some critical operation on the wrong server. If you use a Windows desktop to connect to Unix servers, use MSVDM to create multiple desktops and change the background of each, to make typing a command into the wrong terminal window less likely.
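The same idea can be pushed into the shell prompt itself. A minimal sketch for ~/.bashrc, assuming bash and hostnames that contain a recognizable "prod" marker (both assumptions are illustrative):

# Make the host impossible to miss: white on red for production,
# black on green for everything else
case "$(hostname -s)" in
    *prod*) host_color='\[\e[41;97m\]' ;;
    *)      host_color='\[\e[42;30m\]' ;;
esac
PS1="${host_color}\u@\h\[\e[0m\]:\w\\$ "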

Even more complex schemes like Bsn01dls9 or Nyc02sns10, where the first three letters encode the location, followed by a numeric suffix and then the vendor of the hardware and the OS installed, are prone to such errors. My impression is that unless the first letters differ, there is a substantial chance of working on the wrong server. Using favorite sports team names is a better strategy; the "formal" names can then be used as aliases.

Inadequate backup

If you try to distill the essence of horror stories, most of them were upgraded from errors to horror stories by inadequate backups.

Having a good recent backup is the key feature that distinguishes a mere nuisance from a full-blown disaster. This point is very difficult for novice enterprise administrators to understand. Paraphrasing Bernard Shaw, we can say: "Experience keeps the most expensive school, but most sysadmins are unable to learn anywhere else." Please remember that in an enterprise environment you will almost never be rewarded for innovations and contributions, but in many cases you will be severely punished for blunders. In other words, typical enterprise IT is a risk-averse environment, and you had better understand that sooner rather than later...


Rush and absence of planning are probably the second most important reason. In many cases the sysadmin is stressed, and that impairs judgment.

Forgetting to chroot affected subtree

Another typical reason is abuse of privileges. If you have access to root, that does not mean that you need to perform all operations as root. For example, such a simple operation as

cd /home/joeuser
chown -R joeuser:joeuser .* 

performed as root causes substantial problems and time lost in recovering the ownership of system files (since '.*' matches '..', the command climbs out of the home directory). Computers are really fast now, and on a modern server such an operation can take only a second or two :-(.

Even with plain user privileges there will be some damage: the command will affect all world-writable files and directories it reaches.

This is the case where chroot can provide tremendous help:

cd /home/joeuser 
chroot /home/joeuser
chown -R joeuser:joeuser .* 
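Note that this chroot trick assumes /home/joeuser contains enough of a runtime (a shell, the chown binary, and its libraries) for the command to execute at all; on a plain home directory it will fail. A simpler sketch that gives the same protection is to avoid '.*' and name the directory itself, so '..' can never be matched:

chown -R joeuser:joeuser /home/joeuser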

Abuse of root privileges

Another typical reason is abuse of root privileges. Using sudo or RBAC (on Solaris) you can avoid some unpleasant surprises. Another good practice is to use screen, with one window for root operations and another for operations that can be performed under your own ID or under the privileges of the wheel group (or whatever group all sysadmins belong to).

Many Unix sysadmin horror stories are related to unintended consequences and unanticipated side effects of particular Unix commands, such as find and rm, performed with root privileges. Unix is a complex OS, and many intricate details (like the behavior of commands such as rm -r .* or chown -R a:a .*) can easily be forgotten from one encounter to the next, especially if the sysadmin works with several flavors of Unix, or with Unix and Windows servers.

For example, recursive deletion of files, either via rm -r or via find -exec rm {} \;, has a lot of pitfalls that can destroy a server pretty thoroughly in less than a minute if run without testing.
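A cautious pattern that removes most of the risk is to preview the matches first and only then make the command destructive. A minimal sketch, assuming GNU find (the path and pattern are illustrative):

# Preview exactly what would match before deleting anything
find /var/tmp/build -xdev -name '*.o' -print
# Then reuse the very same expression for the destructive pass;
# -xdev keeps find from crossing into other mounted filesystems
find /var/tmp/build -xdev -name '*.o' -delete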

Some of those pitfalls can be viewed as a deficiency of the rm implementation (it should automatically block deletion of system directories like /, /etc, and so on unless a special flag is specified; but Unix lacks a system attribute for files, although in some cases the sticky bit on directories like /tmp can help).

That means that it is wise to use wrappers for rm. There are several more or less usable approaches to writing such a wrapper; a minimal example was sketched in the introduction above.

Another important source of blunders is time pressure. Trying to do something quickly, cutting corners (such as skipping the creation of a verified backup), often leads to substantial downtime. "Hurry slowly" is one of those sayings that are very true for sysadmins, but unfortunately very difficult to follow.

In any case, always back up the /etc directory on your first login of the day (this should be done from your profile or bashrc script, as sketched in the introduction above).

Sometimes your emotional state contributes to the problem: you didn't have much sleep, or your mind was distracted by problems in your personal life. On such days it is important to slow down and be extra cautious. Doing nothing in such cases is much better than creating another SNAFU.

Typos are another common source of serious, sometimes disastrous, errors. One rule should be followed (though as the memory of the last incident fades, this rule, like any safety rule, is usually forgotten :-): if you are working as root and performing dangerous operations, never type the directory path; always copy it from the history if possible, or list it via the ls command and copy it from the screen.

If you are working as root and performing dangerous operations, never type a directory path, especially a complex path. Always try to copy it from the history if possible, or list it via the ls command and then copy it from the screen.

I once automatically typed /etc instead of etc while trying to delete a directory to free space in a backup directory on a production server (/etc is probably engraved in a sysadmin's head, as it is typed so often, and can be substituted for etc subconsciously). I realized that it was a mistake and cancelled the command, but it was a fast server, and one third of /etc was gone. The rest of the day was spoiled... Actually not completely: that day I learned quite a bit about the behavior of AIX in this situation and about the structure of the AIX /etc directory, so each such disaster is actually a great learning experience, almost like a one-day or even one-week training course ;-). But it's much less nerve-wracking to get this knowledge from a regular course...

Another interesting thing is that having a backup was not enough in this case: backup software sometimes stops working, and the server has only the illusion of a backup, not an actual backup. That happens with HP Data Protector, which is too complex a piece of software to operate reliably. The same can be true for ssh- and rsync-based backups: something in the configuration changes, and that goes unnoticed until it is too late. And this was a remote server in a datacenter across the country. I restored the directory on another, non-production server (overwriting the /etc directory on this second box with the help of operations staff; tell me about cascading errors and Murphy's law :-). Then netcat helped to transfer the tar file.


In such cases, network services with authentication stop working, and the only way to transfer files is using a CD/DVD, a USB drive, or netcat. That's why it is useful to have netcat on servers: it is the last-resort file transfer program for when services with authentication, like ftp or scp, stop working. It is especially useful to have it if the datacenter is remote.

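Here is a minimal sketch of such a last-resort transfer, reconstructing the /etc scenario above; the hostname and port are placeholders, and option syntax differs between netcat variants:

# On the receiving (damaged) box -- traditional netcat syntax:
nc -l -p 9999 > etc-restore.tar
# On the healthy box that holds the backup:
tar cf - etc | nc damaged-server 9999
# Note: BSD/OpenBSD netcat drops the -p; use 'nc -l 9999' there.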

What other authors are saying

Linux Server Hacks, Volume Two: Tips & Tools for Connecting, Monitoring, and Troubleshooting, by William von Hagen and Brian K. Jones

Avoid Common Junior Mistakes

Get over the junior admin hump and land in guru territory.

No matter how "senior" you become, and no matter how omnipotent you feel in your current role, you will eventually make mistakes. Some of them may be quite large. Some will wipe entire weekends right off the calendar. However, the key to success in administering servers is to mitigate risk, have an exit plan, and try to make sure that the damage caused by potential mistakes is limited. Here are some common mistakes to avoid on your road to senior-level guru status.

Don't Take the root Name in Vain

Try really hard to forget about root. Here's a quick comparison of the usage of root by a seasoned vet versus by a junior administrator.

Solid, experienced administrators will occasionally forget that they need to be root to perform some function. Of course they know they need to be root as soon as they see their terminal filling with errors, but running su - root occasionally slips their mind. No big deal. They switch to root, they run the command, and they exit the root shell. If they need to run only a single command, such as a make install, they probably just run it like this:

	$ su -c 'make install'

This will prompt you for the root password and, if the password is correct, will run the command and dump you back to your lowly user shell.

A junior-level admin, on the other hand, is likely to have five terminals open on the same box, all logged in as root. Junior admins don't consider keeping a terminal that isn't logged in as root open on a production machine, because "you need root to do anything anyway." This is horribly bad form, and it can lead to some really horrid results. Don't become root if you don't have to be root!

Building software is a good example. After you download a source package, unzip it in a place you have access to as a user. Then, as a normal user, run your ./configure and make commands. If you're installing the package to your ~/bin directory, you can run make install as yourself. You only need root access if the program will be installed into directories to which only root has write access, such as /usr/local.

My mind was blown one day when I was introduced to an entirely new meaning of "taking the root name in vain." It doesn't just apply to running commands as root unnecessarily. It also applies to becoming root specifically to grant unprivileged access to things that should only be accessible by root!

I was logged into a client's machine (as a normal user, of course), poking around because the user had reported seeing some odd log messages. One of my favorite commands for tracking down issues like this is ls -lahrt /etc, which does a long listing of everything in the directory, reverse sorted by modification time. In this case, the last thing listed (and hence, the last thing modified) was /etc/shadow. Not too odd if someone had added a user to the local machine recently, but it so happened that this company used NIS+, and the permissions had been changed on the file!

I called the number they'd told me to call if I found anything, and a junior administrator admitted that he had done that himself because he was writing a script that needed to access that file. Ugh.

Don't Get Too Comfortable

Junior admins tend to get really into customizing their environments. They like to show off all the cool things they've recently learned, so they have custom window manager setups, custom logging setups, custom email configurations, custom tunneling scripts to do work from their home machines, and, of course, custom shells and shell initializations.

That last one can cause a bit of headache. If you have a million aliases set up on your local machine and some other set of machines that mount your home directory (thereby making your shell initialization accessible), things will probably work out for that set of machines. More likely, however, is that you're working in a mixed environment with Linux and some other Unix variant. Furthermore, the powers that be may have standard aliases and system-wide shell profiles that were there long before you were.

At the very least, if you modify the shell you have to test that everything you're doing works as expected on all the platforms you administer. Better is just to keep a relatively bare-bones administrative shell. Sure, set the proper environment variables, create three or four aliases, and certainly customize the command prompt if you like, but don't fly off into the wild blue yonder sourcing all kinds of bash completion commands, printing the system load to your terminal window, and using shell functions to create your shell prompt. Why not?

Well, because you can't assume that the same version of your shell is running everywhere, or that the shell was built with the same options across multiple versions of multiple platforms! Furthermore, you might not always be logging in from your desktop. Ever see what happens if you mistakenly set up your initialization file to print stuff to your terminal's titlebar without checking where you're coming from? The first time you log in from a dumb terminal, you'll realize it wasn't the best of ideas. Your prompt can wind up being longer than the screen!

Just as versions and build options for your shell can vary across machines, so too can "standard" commands-drastically! Running chown -R has wildly different effects on Solaris than it does on Linux machines, for example. Solaris will follow symbolic links and keep on truckin', happily skipping about your directory hierarchy and recursively changing ownership of files in places you forgot existed. This doesn't happen under Linux. To get Linux to behave the same way, you need to use the -H flag explicitly. There are lots of commands that exhibit different behavior on different operating systems, so be on your toes!

Also, test your shell scripts across platforms to make sure that the commands you call from within the scripts act as expected in any environments they may wind up in.

Don't Perform Production Commands "Off the Cuff"

Many environments have strict rules about how software gets installed, how new machines are built and pushed into production, and so on. However, there are also thousands of sites that don't enforce any such rules, which quite frankly can be a bit scary.

Not having the funds to come up with a proper testing and development environment is one thing. Having a blatant disregard for the availability of production services is quite another. When performing software installations, configuration changes, mass data migrations, and the like, do yourself a huge favor (actually, a couple of favors):

Script the procedure!
Script it and include checks to make sure that everything in the script runs without making any assumptions. Check to make sure each step has succeeded before moving on.
Script a backout procedure.
If you've moved all the data, changed the configuration, added a user for an application to run as, and installed the application, and something blows up, you really will not want to spend another 40 minutes cleaning things up so that you can get things back to normal. In addition, if things blow up in production, you could panic, causing you to misjudge, mistype, and possibly make things worse. Script it!

The process of scripting these procedures also forces you to think about the consequences of what you're doing, which can have surprising results. I once got a quarter of the way through a script before realizing that there was an unmet dependency that nobody had considered. This realization saved us a lot of time and some cleanup as well.

Ask Questions

The best tip any administrator can give is to be conscious of your own ignorance. Don't assume you know every conceivable side effect.

Dr. Nikolai Bezroukov



Old News ;-)

"Those Who Forget History Are Doomed to Repeat It"

Multiple authors

"Those who cannot remember the past are condemned to repeat it."

George Santayana

An "Ohnosecond" is defined as the period of time between when you hit enter and you realize what you just did.

[Nov 01, 2018] 3 scary sysadmin stories

Notable quotes:
"... "It was, it was " ..."
Nov 01, 2018 | opensource.com

The ghost of the failed restore

In a well-known data center (whose name I do not want to remember), one cold October night we had a production outage in which thousands of web servers stopped responding due to downtime in the main database. The database administrator asked me, the rookie sysadmin, to recover the database's last full backup and restore it to bring the service back online.

But, at the end of the process, the database was still broken. I didn't worry, because there were other full backup files in stock. However, even after doing the process several times, the result didn't change.

With great fear, I asked the senior sysadmin what to do to fix this behavior.

"You remember when I showed you, a few days ago, how the full backup script was running? Something about how important it was to validate the backup?" responded the sysadmin.

"Of course! You told me that I had to stay a couple of extra hours to perform that task," I answered.

"Exactly! But you preferred to leave early without finishing that task," he said.

"Oh my! I thought it was optional!" I exclaimed.

"It was, it was "

Moral of the story: Even with the best solution that promises to make the most thorough backups, the ghost of the failed restoration can appear, darkening our job skills, if we don't make a habit of validating the backup every time.

[Oct 22, 2018] linux - If I rm -rf a symlink will the data the link points to get erased, too?

Notable quotes:
"... Put it in another words, those symlink-files will be deleted. The files they "point"/"link" to will not be touch. ..."
Oct 22, 2018 | unix.stackexchange.com

user4951 ,Jan 25, 2013 at 2:40

This is the contents of the /home3 directory on my system:
./   backup/    hearsttr@  lost+found/  randomvi@  sexsmovi@
../  freemark@  investgr@  nudenude@    romanced@  wallpape@

I want to clean this up but I am worried because of the symlinks, which point to another drive.

If I say rm -rf /home3 will it delete the other drive?

John Sui

rm -rf /home3 will delete all files and directories within home3, and home3 itself, including symlink files, but it will not "follow" (de-reference) those symlinks.

Put in other words: the symlink files themselves will be deleted; the files they point/link to will not be touched.

[Oct 22, 2018] Does rm -rf follow symbolic links?

Jan 25, 2012 | superuser.com
I have a directory like this:
$ ls -l
total 899166
drwxr-xr-x 12 me scicomp       324 Jan 24 13:47 data
-rw-r--r--  1 me scicomp     84188 Jan 24 13:47 lod-thin-1.000000-0.010000-0.030000.rda
drwxr-xr-x  2 me scicomp       808 Jan 24 13:47 log
lrwxrwxrwx  1 me scicomp        17 Jan 25 09:41 msg -> /home/me/msg

And I want to remove it using rm -r .

However I'm scared rm -r will follow the symlink and delete everything in that directory (which is very bad).

I can't find anything about this in the man pages. What would be the exact behavior of running rm -rf from a directory above this one?

LordDoskias, Jan 25, 2012 at 16:43

How hard it is to create a dummy dir with a symlink pointing to a dummy file and execute the scenario? Then you will know for sure how it works! –

hakre, Feb 4, 2015 at 13:09

X-Ref: If I rm -rf a symlink will the data the link points to get erased, too?; Deleting a folder that contains symlinks

Susam Pal, Jan 25, 2012 at 16:47

Example 1: Deleting a directory containing a soft link to another directory.
susam@nifty:~/so$ mkdir foo bar
susam@nifty:~/so$ touch bar/a.txt
susam@nifty:~/so$ ln -s /home/susam/so/bar/ foo/baz
susam@nifty:~/so$ tree
.
├── bar
│   └── a.txt
└── foo
    └── baz -> /home/susam/so/bar/

3 directories, 1 file
susam@nifty:~/so$ rm -r foo
susam@nifty:~/so$ tree
.
└── bar
    └── a.txt

1 directory, 1 file
susam@nifty:~/so$

So, we see that the target of the soft-link survives.

Example 2: Deleting a soft link to a directory

susam@nifty:~/so$ ln -s /home/susam/so/bar baz
susam@nifty:~/so$ tree
.
├── bar
│   └── a.txt
└── baz -> /home/susam/so/bar

2 directories, 1 file
susam@nifty:~/so$ rm -r baz
susam@nifty:~/so$ tree
.
└── bar
    └── a.txt

1 directory, 1 file
susam@nifty:~/so$

Only, the soft link is deleted. The target of the soft-link survives.

Example 3: Attempting to delete the target of a soft-link

susam@nifty:~/so$ ln -s /home/susam/so/bar baz
susam@nifty:~/so$ tree
.
├── bar
│   └── a.txt
└── baz -> /home/susam/so/bar

2 directories, 1 file
susam@nifty:~/so$ rm -r baz/
rm: cannot remove 'baz/': Not a directory
susam@nifty:~/so$ tree
.
├── bar
└── baz -> /home/susam/so/bar

2 directories, 0 files

The file in the target of the symbolic link does not survive.

The above experiments were done on a Debian GNU/Linux 9.0 (stretch) system.

Wyrmwood, Oct 30, 2014 at 20:36

rm -rf baz/* will remove the contents – Wyrmwood Oct 30 '14 at 20:36

Buttle Butkus, Jan 12, 2016 at 0:35

Yes, if you do rm -rf [symlink], then the contents of the original directory will be obliterated! Be very careful. – Buttle Butkus Jan 12 '16 at 0:35

frnknstn, Sep 11, 2017 at 10:22

Your example 3 is incorrect! On each system I have tried, the file a.txt will be removed in that scenario. – frnknstn Sep 11 '17 at 10:22

Susam Pal, Sep 11, 2017 at 15:20

@frnknstn You are right. I see the same behaviour you mention on my latest Debian system. I don't remember on which version of Debian I performed the earlier experiments. In my earlier experiments on an older version of Debian, either a.txt must have survived in the third example or I must have made an error in my experiment. I have updated the answer with the current behaviour I observe on Debian 9 and this behaviour is consistent with what you mention. – Susam Pal Sep 11 '17 at 15:20

Ken Simon, Jan 25, 2012 at 16:43

Your /home/me/msg directory will be safe if you rm -rf the directory from which you ran ls. Only the symlink itself will be removed, not the directory it points to.

The only thing I would be cautious of, would be if you called something like "rm -rf msg/" (with the trailing slash.) Do not do that because it will remove the directory that msg points to, rather than the msg symlink itself.

(anonymous), Jan 25, 2012 at 16:54

"The only thing I would be cautious of, would be if you called something like "rm -rf msg/" (with the trailing slash.) Do not do that because it will remove the directory that msg points to, rather than the msg symlink itself." - I don't find this to be true. See the third example in my response below. – Susam Pal Jan 25 '12 at 16:54

Andrew Crabb, Nov 26, 2013 at 21:52

I get the same result as @Susam ('rm -r symlink/' does not delete the target of symlink), which I am pleased about as it would be a very easy mistake to make. – Andrew Crabb Nov 26 '13 at 21:52


rm should remove files and directories. If the file is a symbolic link, the link is removed, not the target; rm does not interpret (follow) a symbolic link. Consider, for example, what the behavior should be when deleting a 'broken link': rm exits with 0, not with a non-zero failure code.

[Oct 14, 2018] When idiots are offloaded to the security department, interesting things eventually happen with the network

Oct 14, 2018 | linux.slashdot.org

jfdavis668 ( 1414919 ) , Sunday May 27, 2018 @11:09AM ( #56682996 )

Re:So ( Score: 5 , Interesting)

Traceroute is disabled on every network I work with to prevent intruders from determining the network structure. Real pain in the neck, but one of those things we face to secure systems.

Anonymous Coward writes:
Re: ( Score: 2 , Insightful)

What is the point? If an intruder is already there couldn't they just upload their own binary?

Hylandr ( 813770 ) , Sunday May 27, 2018 @05:57PM ( #56685274 )
Re:So ( Score: 5 , Interesting)

They can easily. And often time will compile their own tools, versions of Apache, etc..

At best it slows down incident response and resolution while doing nothing to prevent discovery of their networks. If you only use Vlans to segregate your architecture you're boned.

gweihir ( 88907 ) , Sunday May 27, 2018 @12:19PM ( #56683422 )
Re:So ( Score: 5 , Interesting)

Also really stupid. A competent attacker (and only those manage it into your network, right?) is not even slowed down by things like this.

bferrell ( 253291 ) , Sunday May 27, 2018 @12:20PM ( #56683430 ) Homepage Journal
Re:So ( Score: 4 , Interesting)

except it DOESN'T secure anything, simply renders things a little more obscure... Since when is obscurity security?

fluffernutter ( 1411889 ) writes:
Re: ( Score: 3 )

Doing something to make things more difficult for a hacker is better than doing nothing to make things more difficult for a hacker. Unless you're lazy, as many of these things should be done as possible.

DamnOregonian ( 963763 ) , Sunday May 27, 2018 @04:37PM ( #56684878 )
Re:So ( Score: 5 , Insightful)

No.
Things like this don't slow down "hackers" with even a modicum of network knowledge inside of a functioning network.
What they do slow down is your ability to troubleshoot network problems.
Breaking into a network is a slow process. Slow and precise. Trying to fix problems is a fast reactionary process. Who do you really think you're hurting?
Yes another example of how ignorant opinions can become common sense.

mSparks43 ( 757109 ) writes:
Re: So ( Score: 2 )

pretty much my reaction. like wtf? otoh, redhat flavours all still on glibc2 starting to become a regular p.i.t.a. so the chances of this actually becoming a thing to be concerned about seem very low.

kinda like gdpr, same kind of group think that anyone actually cares or concerns themselves with policy these days.

ruir ( 2709173 ) writes:
Re: ( Score: 3 )

disable all ICMP is not feasible as you will be disabling MTU negotiation and destination unreachable messages. You are essentially breaking the TCP/IP protocol. And if you want the protocol working OK, then people can do traceroute via HTTP messages or ICMP echo and reply.
Or they can do reverse traceroute at least until the border edge of your firewall via an external site.

DamnOregonian ( 963763 ) , Sunday May 27, 2018 @04:32PM ( #56684858 )
Re:So ( Score: 4 , Insightful)

You have no fucking idea what you're talking about. I run a multi-regional network with over 130 peers. Nobody "disables ICMP". IP breaks without it.
Some folks, generally the dimmer of us, will disable echo responses or TTL expiration notices thinking it is somehow secure (and they are very fucking wrong) but nobody blocks all ICMP, except for very very dim witted humans, and only on endpoint nodes.

DamnOregonian ( 963763 ) writes:
Re: ( Score: 3 )

That's hilarious...
I am *the guy* who runs the network. I am our senior network engineer. Every line in every router- mine.
You have no idea what you're talking about, at any level. "disabled ICMP"- state statement alone requires such ignorance to make that I'm not sure why I'm even replying to ignorant ass.

DamnOregonian ( 963763 ) writes:
Re: ( Score: 3 )

Nonsense. I conceded that morons may actually go through the work to totally break their PMTUD, IP error signaling channels, and make their nodes "invisible"

I understand "networking" at a level I'm pretty sure you only have a foggy understanding of.
I write applications that require layer-2 packet building all the way up to layer-4.

In short, he's a moron. I have reason to suspect you might be, too.

DamnOregonian ( 963763 ) writes:
Re: ( Score: 3 )

A CDS is MAC. Turning off ICMP toward people who aren't allowed to access your node/network is understandable. They can't get anything else though, why bother supporting the IP control channel? CDS does *not* say turn off ICMP globally. I deal with CDS, SSAE16 SOC 2, and PCI compliance daily. If your CDS solution only operates with a layer-4 ACL, it's a pretty simple model, or You're Doing It Wrong (TM)

nyet ( 19118 ) writes:
Re: ( Score: 3 )

> I'm not a network person

IOW, nothing you say about networking should be taken seriously.

kevmeister ( 979231 ) , Sunday May 27, 2018 @05:47PM ( #56685234 ) Homepage
Re:So ( Score: 4 , Insightful)

No, TCP/IP is not working fine. It's broken and is costing you performance and $$$. But it is not evident because TCP/IP is very good about dealing with broken networks, like yours.

The problem is that doing this requires things like packet fragmentation, which greatly increases router CPU load and reduces the maximum PPS of your network, as well as resulting in dropped packets requiring re-transmission; it may also result in window collapse followed by slow-start. Though rapid recovery mitigates much of this, it's still not free.

It's another example of security by stupidity which seldom provides security, but always buys added cost.

Hylandr ( 813770 ) writes:
Re: ( Score: 3 )

As a server engineer I am experiencing this with our network team right now.

Do you have some reading I could use to further educate myself? I would like to be able to prove to the directors why disabling ICMP on the network may be the cause of our issues.

Zaelath ( 2588189 ) , Sunday May 27, 2018 @07:51PM ( #56685758 )
Re:So ( Score: 4 , Informative)

A brief read suggests this is a good resource: https://john.albin.net/essenti... [albin.net]

Bing Tsher E ( 943915 ) , Sunday May 27, 2018 @01:22PM ( #56683792 ) Journal
Re: Denying ICMP echo @ server/workstation level t ( Score: 5 , Insightful)

Linux has one of the few IP stacks that isn't derived from the BSD stack, which the industry considers the reference design. Instead, for Linux, a new stack with its own bugs and peculiarities was cobbled up.

Reference designs are a good thing to promote interoperability. As far as TCP/IP is concerned, linux is the biggest and ugliest stepchild. A theme that fits well into this whole discussion topic, actually.

[Oct 05, 2018] Unix Admin. Horror Story Summary, version 1.0 by Anatoly Ivasyuk

Oct 05, 2018 | cam.ac.uk

From: mfraioli@grebyn.com (Marc Fraioli)
Organization: Grebyn Timesharing

Well, here's a good one for you:

I was happily churning along developing something on a Sun workstation, and was getting a number of annoying permission-denieds from trying to write into a directory hierarchy that I didn't own. Getting tired of that, I decided to set the permissions on that subtree to 777 while I was working, so I wouldn't have to worry about it.

Someone had recently told me that rather than using plain "su", it was good to use "su -", but the implications had not yet sunk in. (You can probably see where this is going already, but I'll go to the bitter end.)

Anyway, I cd'd to where I wanted to be, the top of my subtree, and did su -. Then I did chmod -R 777. I then started to wonder why it was taking so damn long when there were only about 45 files in 20 directories under where (I thought) I was. Well, needless to say, su - simulates a real login, and had put me into root's home directory, /, so I was proceeding to set file permissions for the whole system to wide open.

I aborted it before it finished, realizing that something was wrong, but this took quite a while to straighten out.

Marc Fraioli

[Oct 05, 2018] One wrong find command can create one week of frantic recovery efforts

This is a classic SNAFU, known and described for more than 30 years. It is still repeated in various forms thousands of times by different system administrators. You can get the permissions of files installed via RPM back rather quickly and without problems. For all other files you need a backup, or an educated guess.
Ahh, the hazards of working with sysadmins who are not ready to be sysadmins in the first place
Oct 05, 2018 | cam.ac.uk

From: jerry@incc.com (Jerry Rocteur)
Organization: InCC.com Perwez Belgium

Horror story,

I sent one of my support guys to do an Oracle update in Madrid.

As instructed he created a new user called esf and changed the files in /u/appl to owner esf; however, in doing so he *must* have cocked up his find command. The command was:

find /u/appl -user appl -exec chown esf {} \;

He rang me up to tell me there was a problem, I logged in via x25 and
about 75% of the files on the system belonged to owner esf.

VERY little worked on system.

What a mess, it took me a while and I came up with a brain wave to
fix it but it really screwed up the system.

Moral: be *very* careful of find execs, get the syntax right!!!!
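
The RPM-based recovery mentioned in the note above, plus the dry run that would have avoided the mess, as a sketch (paths echo the story):

    # Preview what find will select before wiring it to -exec:
    find /u/appl -user appl -print

    # After the fact, owners and permissions of packaged files can be
    # restored from the RPM database (run as root; non-packaged files
    # still need a backup or an educated guess):
    rpm --setugids -a
    rpm --setperms -a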

[Oct 05, 2018] When some filenames are etched in your brain, you can type them several times, repeating the same blunder again and again by Anatoly Ivasyuk

Notable quotes:
"... I was working on a line printer spooler, which lived in /etc. I wanted to remove it, and so issued the command "rm /etc/lpspl." There was only one problem. Out of habit, I typed "passwd" after "/etc/" and removed the password file. Oops. ..."
Oct 05, 2018 | cam.ac.uk

From Unix Admin. Horror Story Summary, version 1.0 by Anatoly Ivasyuk
From: tzs@stein.u.washington.edu (Tim Smith)

Organization: University of Washington, Seattle

I was working on a line printer spooler, which lived in /etc. I wanted to remove it, and so issued the command "rm /etc/lpspl." There was only
one problem. Out of habit, I typed "passwd" after "/etc/" and removed the password file. Oops.

I called up the person who handled backups, and he restored the password file.

A couple of days later, I did it again! This time, after he restored it, he made a link, /etc/safe_from_tim.

About a week later, I overwrote /etc/passwd, rather than removing it. After he restored it again, he installed a daemon that kept a copy of /etc/passwd, on another file system, and automatically restored it if it appeared to have been damaged.

Fortunately, I finished my work on /etc/lpspl around this time, so we didn't have to see if I could find a way to wipe out a couple of filesystems...

--Tim Smith

[Oct 05, 2018] Due to a configuration change I wasn't privy to, the software I was responsible for rebooted all the 911 operators servers at once

Oct 05, 2018 | www.reddit.com

ardwin 5 years ago (9 children)

Due to a configuration change I wasn't privy to, the software I was responsible for rebooted all the 911 operators servers at once.
cobra10101010 5 years ago (1 child)
Oh God..that is scary in true sense..hope everything was okay
ardwin 5 years ago (0 children)
I quickly learned that the 911 operators are trained to do their jobs without any kind of computer support. It made me feel better.
reebzor 5 years ago (1 child)
I did this too!

edit: except I was the one that deployed the software that rebooted the machines

vocatus 5 years ago (0 children)
Hey, maybe you should go apologize to ardwin. I bet he was pissed.

[Oct 05, 2018] sudo yum -y remove krb5 (this removes coreutils)

Oct 05, 2018 | www.reddit.com

DrGirlfriend Systems Architect 5 years ago (5 children)

2960G 5 years ago (1 child)
+1 for the "yum -y". Had the 'pleasure' of fixing a box one of my colleagues did "yum -y remove openssl" on. Through utter magic managed to recover it without reinstalling :-)
chriscowley DevOps 5 years ago (0 children)
Do I need to explain? I would probably have curled the RPMs off the repo, extracted them with cpio, and put the files into place manually (been there).
vocatus NSA/DOD/USAR/USAP/AEXP [ S ] 5 years ago (1 child)
That last one gave me the shivers.

[Oct 05, 2018] Trying to preserve the connection after a networking change while working on the core switch remotely backfired, as the sysadmin forgot to cancel the scheduled reload command after testing the change

Notable quotes:
"... "All monitoring for customer is showing down except the edge firewalls". ..."
"... as soon as they said it I knew I forgot to cancel the reload. ..."
"... That was a fun day.... What's worse is I was following a change plan, I just missed the "reload cancel". Stupid, stupid, stupid, stupid. ..."
Oct 05, 2018 | www.reddit.com

Making some network changes in a core switch, I used 'reload in 5' as I wasn't 100% certain the changes wouldn't kill my remote connection.

Changes go in, everything stays up, no apparent issues. Save changes, log out.

"All monitoring for customer is showing down except the edge firewalls".

... as soon as they said it I knew I forgot to cancel the reload.

0xD6 5 years ago

This one hit pretty close to home having spent the last month at a small Service Provider with some serious redundancy issues. We're working through them one by one, but there is one outage in particular that was caused by the same situation... Only the scope was pretty "large".

Performed change, was distracted by phone call. Had an SMS notifying me of problems with a legacy border that I had just performed my changes on. See my PuTTY terminal and my blood starts to run cold. "Reload requested by 0xd6".

...Fuck I'm thinking, but everything should be back soon, not much I can do now.

However, not only did our primary transit terminate on this legacy device, so did our old non-HSRP L3 gateways and the BGP nail-down routes for one of our /20s and a /24... So, because of my forgotten reload, I withdrew the majority of our network from all peers and the internet at large.

That was a fun day.... What's worse is I was following a change plan, I just missed the "reload cancel". Stupid, stupid, stupid, stupid.
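
For the record, the pattern the posters describe is sound when the last step isn't skipped; a sketch assuming Cisco IOS:

    ! Schedule a fallback reboot before touching anything risky:
    reload in 5
    configure terminal
    !  ... make the changes that might cut off your session ...
    end
    ! Session still alive? Save and, above all, cancel the fallback:
    write memory
    reload cancel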

[Oct 05, 2018] I learned a valuable lesson about pressing buttons without first fully understanding what they do.

Oct 05, 2018 | www.reddit.com

WorkOfOz (0 children)

This is actually one of my standard interview questions since I believe any sys admin that's worth a crap has made a mistake they'll never forget.

Here's mine, circa 2001. In response to a security audit, I had to track down which version of the Symantec Antivirus was running and which definitions were installed on every machine in the company. I had been working through this for a while and got a bit reckless.

There was a button in the console that read 'Virus Sweep'. Thinking it'd get the info from each machine and give me the details, I pressed it.. I was wrong..

Very Wrong. Instead it proceeded to initiate a virus scan on every machine including all of the servers.

Less than 5 minutes later, many of our older servers and most importantly our file servers froze. In the process, I took down a trade floor for about 45 minutes while we got things back up. I learned a valuable lesson about pressing buttons without first fully understanding what they do.

[Oct 05, 2018] A newbie turned production server off to replace a monitor

Oct 05, 2018 | www.reddit.com

just_call_in_sick 5 years ago (1 child)

A friend of the family was an IT guy and he gave me the usual high school unpaid intern job. My first day, he told me that a computer needed the monitor replaced. He gave me this 13" CRT and sent me on my way. I found the room (a wiring closet) with a tiny desk and a large desktop tower on it.

TURNED OFF THE COMPUTER and went about replacing the monitor. I think it took about 5 minutes for people to start wondering why they could no longer use the file server or save the files they had been working on all day.

It turns out that you don't have to turn off computers to replace the monitor.

[Oct 05, 2018] Sometimes one extra space makes a big difference

Oct 05, 2018 | cam.ac.uk

From: rheiger@renext.open.ch (Richard H. E. Eiger)
Organization: Olivetti (Schweiz) AG, Branch Office Berne

In article <1992Oct9.100444.27928@u.washington.edu> tzs@stein.u.washington.edu
(Tim Smith) writes:
> I was working on a line printer spooler, which lived in /etc. I wanted
> to remove it, and so issued the command "rm /etc/lpspl." There was only
> one problem. Out of habit, I typed "passwd" after "/etc/" and removed
> the password file. Oops.
>
[deleted to save space]
>
> --Tim Smith

Here's another story. Just imagine having the sendmail.cf file in /etc. Now, I was working on the sendmail stuff and had come up with lots of sendmail.cf.xxx files which I wanted to get rid of, so I typed "rm -f sendmail.cf. *". At first I was surprised about how much time it took to remove some 10 files or so. Hitting the interrupt key, when I finally saw what had happened, was way too late, though.

Fortune has it that I'm a very lazy person. That's why I never bothered to just
back up directories with data that changes often. Therefore I managed to
restore /etc successfully before rebooting... :-) Happy end, after all. Of
course I had lost the only well working version of my sendmail.cf...

Richard

[Oct 05, 2018] Deleting files whose purpose you do not understand sometimes backfires by Anatoly Ivasyuk

Oct 05, 2018 | cam.ac.uk

Unix Admin. Horror Story Summary, version 1.0 by Anatoly Ivasyuk

From: philip@haas.berkeley.edu (Philip Enteles)
Organization: Haas School of Business, Berkeley

As a new system administrator of a Unix machine with limited space, I thought I was doing myself a favor by keeping things neat and clean. One day as I was 'cleaning up' I removed a file called 'bzero'.

Strange things started to happen: vi didn't work, then the complaints started coming in. Mail didn't work. The compilers didn't work. About this time the REAL system administrator poked his head in and asked what I had done.

Further examination showed that bzero is the zeroed memory without which the OS had no operating space so anything using temporary memory was non-functional.

The repair? Well, things are tough to do when most of the utilities don't work. Eventually the REAL system administrator took the system to single user and rebuilt it, including full restores from a tape system. The moral is: don't be too anal about things you don't understand.

Take the time to learn what those strange files are before removing them and screwing yourself.

Philip Enteles

[Oct 05, 2018] Danger of test accounts with / as the home directory

Oct 05, 2018 | cam.ac.uk

From: cjc@ulysses.att.com (Chris Calabrese)
Organization: AT&T Bell Labs, Murray Hill, NJ, USA

In article <7515@blue.cis.pitt.edu.UUCP> broadley@neurocog.lrdc.pitt.edu writes:
>On a old decstation 3100

I was deleting last semester's users to try to dig up some disk space; I also deleted some test users at the same time.

One user took longer than usual, so I hit control-c and tried ls. "ls: command not found"

Turns out that the test user had / as the home directory and the remove user script in Ultrix just happily blew away the whole disk.


[Oct 05, 2018] Hidden symlinks and recursive deletion of the directories

Notable quotes:
"... Fucking asshole ex-sysadmin taught me a good lesson about checking for symlink bombs. ..."
Oct 05, 2018 | www.reddit.com

mavantix Jack of All Trades, Master of Some; 5 years ago (4 children)

I was cleaning up old temp folders of junk on Windows 2003 server, and C:\temp was full of shit. Most of it junk. Rooted deep in the junk, some asshole admin had apparently symlink'd sysvol to a folder in there. Deleting wiped sysvol.

There were no usable backups; well, there were, but the ArcServe install was screwed by lack of maintenance.

Spent days rebuilding policies.

Fucking asshole ex-sysadmin taught me a good lesson about checking for symlink bombs.

...and no I didn't tell this story to teach any of your little princesses to do the same when you leave your company.
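
On the Unix side, the equivalent pre-flight check is a one-liner; a sketch (the path is illustrative):

    # List any symlinks lurking under a tree before deleting it recursively:
    find /tmp/cleanup-target -type l -ls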

[Oct 05, 2018] Automatically putting a slash in front of a directory with a system-like name (bin, etc, usr, var) is etched into sysadmin muscle memory

This is why you should never type an rm command directly on the command line. Type it in an editor first.
Oct 05, 2018 | www.reddit.com

aultl Senior DevOps Engineer

rm -rf /var

I was trying to delete /var/named/var

nekoeth0 Linux Admin, 5 years ago
Haha, that happened to me too. I had to use a live distro, chroot, copy, what not. It was fun!
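
Or, staying on the command line, a sketch of the preview habit:

    # Preview the exact argument the shell will hand to rm:
    echo rm -rf /var/named/var
    # Read it back carefully, then recall the line and delete the leading echo.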

[Oct 05, 2018] I corrupted a 400TB data warehouse.

Oct 05, 2018 | www.reddit.com

I corrupted a 400TB data warehouse.

Took 6 days to restore from tape.

mcowger VCDX | DevOps Guy 5 years ago (0 children)

Meh - happened a long time ago.

Had a big Solaris box (E6900) running Oracle 10 for the DW. Was going to add some new LUNs to the box and also change some of the fiber pathing to go through a new set of faster switches. Had the MDS changes prebuilt, confirmed them with another admin, went through change control, etc.

Did fabric A, which went through fine, and then did fabric B without pausing or checking that the new paths came up on side A before I knocked over side B (in violation of my own approved plan). For the briefest of instants there were no paths to the devices, and Oracle was configured in full async write mode :(. Instant corruption of the tables that were active. Tried to use archive logs to bring it back, but no dice (and this was before Flashback, etc). So we were hosed.

Had to have my DBA babysit the RMAN restore for the entire weekend :(. 1GbE links to the backup infrastructure.

RCA resulted in MANY MANY changes to the design of that system, and me just barely keeping my job.

invisibo DevOps 5 years ago (0 children)
You just made me say "holy shit!" out loud. You win.
FooHentai 5 years ago (0 children)
Ouch.

I dropped a 500GB RAID set. There were 2 identical servers in the rack right next to each other. Both OpenFiler, both unlabeled. Didn't know about the other one and was told to 'wipe the OpenFiler'. Got a call half an hour later from a team wondering where all their test VMs had gone.

vocatus NSA/DOD/USAR/USAP/AEXP [ S ] 5 years ago (0 children)
I have to hear the story.

[Oct 02, 2018] Rookie almost wipes customer's entire inventory unbeknownst to sysadmin

Notable quotes:
"... At that moment, everything from / and onward began deleting forcefully and Reginald described his subsequent actions as being akin to "flying flat like a dart in the air, arms stretched out, pointy finger fully extended" towards the power switch on the mini computer. ..."
Oct 02, 2018 | theregister.co.uk

I was going to type rm -rf /*.old* – which would have forcibly removed all /dgux.old stuff, including any sub-directories I may have created with that name," he said.

But – as regular readers will no doubt have guessed – he didn't.

"I fat fingered and typed rm -rf /* – and then I accidentally hit enter instead of the "." key."

At that moment, everything from / and onward began deleting forcefully and Reginald described his subsequent actions as being akin to "flying flat like a dart in the air, arms stretched out, pointy finger fully extended" towards the power switch on the mini computer.

"Everything got quiet."

Reginald tried to boot up the system, but it wouldn't. So instead he booted up off a tape drive to run the mini Unix installer and mounted the boot "/" file system as if he were upgrading – and then checked out the damage.

"Everything down to /dev was deleted, but I was so relieved I hadn't deleted the customer's database and only system files."

Reginald did what all the best accident-prone people do – kept the cock-up to himself, hoped no one would notice and started covering his tracks, by recreating all the system files.

Over the next three hours, he "painstakingly recreated the entire device tree by hand", at which point he could boot the machine properly – "and even the application worked out".

Jubilant at having managed the task, Reginald tried to keep a lid on the heart that was no doubt in his throat by this point and closed off his work, said goodbye to the sysadmin and went home to calm down. Luckily no one was any the wiser.

"If the admins read this message, this would be the first time they hear about it," he said.

"At the time they didn't come in to check what I was doing, and the system was inaccessible to the users due to planned maintenance anyway."

Re: If rm -rf /* doesn't delete anything valuable

Eh? As I read it, Reginald kicked off the rm -rf /*, then hit the power switch before it deleted too much. The tape rescue revealed that "everything down to /dev" had been deleted, i.e. everything in / beginning with a, b, c and some d. On a modern system that might include /boot and /bin, but evidently it was not a total disaster on Reg's server.


Anonymous Coward

title="Inappropriate post? Report it to our moderators" type="submit" value="Report abuse"> I remember discovering the hard way that when you delete an email account in Thunderbird and it asks if you want to delete all the files associated with it it actually means do you want to delete the entire directory tree below where the account is stored .... so, as I discovered, saying "yes" when the reason you are deleting the account is because you'd just created it in the wrong place in the the directory tree is not a good idea - instead of just deleting the new account I nuked all the data associated with all our family email accounts!

bpfh Monday 1st October 2018 10:05 GMT
Re: .cobol

"Delete is right above Rename in the bloody menu"

Probably designed by the same person who designed the crontab app then, with the command-line options -e to edit and -r to remove immediately without confirmation. Mistype at your peril...

I found this out - to my peril - about 3 seconds before I realised that it was a good idea for a server's crontab to include a daily executed crontab -l > /foo/bar/crontab-backup.txt ...

Jason Bloomberg
Re: .cobol

I went to delete the original files, but I only got as far as "del *.COB" before hitting return.

I managed a similar thing but more deliberately; belatedly finding "DEL FOOBAR.???" included files with no extensions when it didn't on a previous version (Win3.1?).

That wasn't the disaster it could have been but I've had my share of all-nighters making it look like I hadn't accidentally scrubbed a system clean.

Down not across
Re: .cobol

Probably designed by the same person who designed the crontab app then, with the command-line options -e to edit and -r to remove immediately without confirmation. Mistype at your peril...

Using crontab -e is asking for trouble even without mistypes. I've seen too many corrupted or truncated crontabs after someone has edited them with crontab -e. crontab -l > crontab.txt; vi crontab.txt; crontab crontab.txt is a much better way.

You mean not everyone has crontab entry that backs up crontab at least daily?
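
A sketch of such an entry (the backup path is illustrative; cron sets LOGNAME on most systems):

    # Daily self-backup of the crontab at 02:00:
    0 2 * * * crontab -l > /var/backups/crontab.$LOGNAME.txt 2>/dev/null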

MrBanana
Re: .cobol

"WAH! I copied the .COBOL back to .COB and started over again. As I knew what I wanted to do this time, it only took about a day to re-do what I had deleted."

When this has happened to me, I end up with better code than I had before. Re-doing the work gives you a better perspective. Even if functionally no different it will be cleaner, well commented, and laid out more consistently. I sometimes now do it deliberately (although just saving the first new version, not deleting it) to clean up the code.

big_D
Re: .cobol

I totally agree, the resultant code was better than what I had previously written, because some of the mistakes and assumptions I'd made the first time round and worked around didn't make it into the new code.

Woza
Reminds me of the classic

https://www.ee.ryerson.ca/~elf/hack/recovery.html

Anonymous South African Coward
Re: Reminds me of the classic

https://www.ee.ryerson.ca/~elf/hack/recovery.html

Was about to post the same. It is a legendary classic by now.

Chairman of the Bored
One simple trick...

...depending on your shell and its configuration, a zero-size file called '-i' in each directory you care about will force a rampaging recursive rm, mv, or whatever back into interactive mode. By and large it won't defend you against mistakes in a script, but it's definitely saved me from myself when running an interactive shell.

It's proven useful enough to earn its own cronjob that runs once a week and features a 'find -type d' and touch '-i' combo on systems I like.

Glad the OP's mad dive for the power switch saved him, I wasn't so speedy once. Total bustification. Hence this one simple trick...

Now if I could ever fdisk the right f$cking disk, I'd be set!
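
His weekly cronjob might look like this sketch (GNU find assumed, since it expands {} inside an argument; paths are illustrative):

    # Seed an empty file literally named '-i' in every directory of the chosen trees:
    find /home /srv -type d -exec touch {}/-i \;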

PickledAardvark
Re: One simple trick...

"Can't you enter a command to abort the wipe?"

Maybe. But you still have to work out what got deleted.

On the first Unix system I used, an admin configured the rm command with a system alias so that rm required a confirmation. Annoying after a while but handy when learning.

When you are reconfiguring a system, delete/rm is not the only option. Move/mv protects you from your errors. If the OS has no move/mv, then copy, verify before delete.

Doctor Syntax
Re: One simple trick...

"Move/mv protects you from your errors."

Not entirely. I had a similar experience with mv. I was left with a running shell, so I could cd through the remains of the file system and list files with echo *, but not repair it.

Although we had the CDs (SCO), rebooting the system required a specific driver which wasn't included on the CDs and hadn't been provided by the vendor. It took most of a day before they emailed the correct driver to put on a floppy so I could reboot. After that it only took a few minutes to put everything back in place.

Chairman of the Bored
Re: One simple trick...

@Chris Evans,

Yes, there are a number of things you can do. Just like on Windows, a quick ctrl-C will abort an rm operation taking place in an interactive shell. Destroying the window in which the shell running rm lives will work too (alt-F4 in most window managers, or 'x' out of the window).

If you know the process id of the rm process you can 'kill $pid' or do a 'killall -KILL rm'

Couple of problems:

(1) law of maximum perversity says that the most important bits will be destroyed first in any accident sequence

(2) by the time you realize the mistake there is no time to kill rm before law 1 is satisfied

The OP's mad dive for the power button is probably the very best move... provided you are right there at the console. And provided the big red switch is actually connected to anything

Colin Bull 1
cp can also be dangerous

After several years working in a DOS environment I got a job as project Manager / Sys admin on a Unix based customer site for a six month stint. On my second day I wanted to use a test system to learn the software more, so decided to copy the live order files to the test system.

Unfortunately I forgot the trailing full stop, as it was not needed in DOS - so the live order index file overwrote the live data file. And the company only took orders for next-day delivery, so it wiped all current orders.

Luckily it printed a sales acknowledgement every time an order was placed, so I escaped death, and I learned never to miss the second parameter of the cp command.

Anonymous Coward

title="Inappropriate post? Report it to our moderators" type="submit" value="Report abuse"> i'd written a script to deploy the latest changes to the live environment. worked great. except one day i'd entered a typo and it was now deploying the same files to the remote directory, over and again.

It did that for 2 whole years with around 7 code releases. Not a single person realised the production system was running the same code after each release with no change in functionality. All the customer cared about was 'was the site up?'

Not a single person realised. Not the developers. Not the support staff. Not me. Not the testers. Not the customer. It just makes you think... wtf had we been doing for 2 years???

Yet Another Anonymous coward

Look on the bright side: any bugs your team had introduced in those 2 years had been blocked by your intrinsically secure script.
Prst. V.Jeltz
not a single person realised. not the developers. not the support staff. not me. not the testers. not the customer. just made you think... wtf had we been doing for 2 years???

That is classic! Not surprised about the AC!

Bet some of the beancounters were less than impressed, probably on the customer side :)

Anonymous Coward

title="Inappropriate post? Report it to our moderators" type="submit" value="Report abuse"> Re: ...then there's backup stories...

Many years ago (pre-internet times) a client phoned at 5:30 Friday afternoon. It was the IT guy wanting to run through the steps involved in recovering from a backup. Their US headquarters had had a hard disk fail on their accounting system. He was talking the Financial Controller through a recovery, and while he knew his stuff he just wanted to double-check everything.

8pm the same night the phone rang again - how soon could I fly to the states? Only one of the backup tapes was good. The financial controller had put the sole remaining good backup tape in the drive, then popped out to get a bite to eat at 7pm because it was going to be a late night. At 7:30pm the scheduled backup process copied the corrupted database over the only remaining backup.

Saturday was spent on the phone trying to talk them through everything I could think of.

Sunday afternoon I was sitting in a private jet winging its way to their US HQ. Three days of very hard work later we'd managed to recreate the accounting database from pieces of corrupted databases and log files. Another private jet ride home - this time the pilot was kind enough to tell me there was a cooler full of beer behind my seat.

Olivier2553

Re: Welcome to the club!

"Lesson learned: NEVER decide to "clean up some old files" at 4:30 on a Friday afternoon. You WILL look for shortcuts and it WILL bite you on the ass."

Do not do anything of significance on a Friday. At all. Any major change, big operation, etc. must be made by Thursday at the latest, so in case of a cock-up you have Friday (plus the weekend) to repair it.

JQW
I once wiped a large portion of a hard drive after using find with -exec rm -rf {} - due to not taking into account the fact that some directories on the system had spaces in them.
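
The whitespace-safe idioms, for the record; the '+' form is POSIX, while -print0/-0 are GNU extensions (pattern and path are illustrative):

    # Let find itself hand each path to rm as one argument:
    find /path -name '*.old' -exec rm -rf {} +
    # Or NUL-delimit the names so embedded spaces survive the pipe:
    find /path -name '*.old' -print0 | xargs -0 rm -rf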
Will Godfrey
Defensive typing

I've long been in the habit of entering dangerous commands partially in reverse, so in the case of the OP's one I'd have done:

' -rf /*.old* '

then gone back to the start of the line and entered the ' rm ' bit.

sisk
A couple months ago on my home computer (which has several Linux distros installed, all sharing a common /home, because I apparently like to make life difficult for myself - and yes, that's as close to a logical reason as I have for having multiple distros on one machine) I was going to get rid of one of the extraneous Linux installs and use the space to expand the root partition of one of the other distros. I realized I'd typed /dev/sdc2 instead of /dev/sdc3 at the same moment that I confirmed that, yes, I wanted to delete the partition. And sdc2 is where the above-mentioned shared /home lives. Doh.

Fortunately I have a good file server and a cron job running rsync every night, so I didn't actually lose any data, but I think my heart stopped for a few seconds before I realized that.

Kevin Fairhurst
Came in to work one Monday to find that the Unix system was borked... on investigation it appeared that a large number of files & folders had gone missing, probably through someone doing an incorrect rm.

Our systems were shared with our US office who supported the UK outside of our core hours (we were in from 7am to ensure trading was ready for 8am, they were available to field staff until 10pm UK time) so we suspected it was one of our US counterparts who had done it, but had no way to prove it.

Rather than try and fix anything, they'd gone through and deleted all logs and history entries so we could never find the evidence we needed!

Restoring the system from a recent backup brought everything back online again, as one would expect!

DavidRa
Sure they did, but the universe invented better idiots

Of course. However, the incompletely-experienced often choose to bypass that configuration. For example, a lot of systems aliased rm to "rm -i" by default, which would force interactive confirmation. People would then say "UGH, I hate having to do this" and add their own customisations to their shells/profiles etc.:

unalias rm
alias rm='rm -f'

Lo and behold, now no silly confirmations, regardless of stupidity/typos/etc.

[Jul 30, 2018] Sudo related horror story

Jul 30, 2018 | www.sott.net

A new sysadmin decided to scratch his itch in the sudoers file, in the standard definition that grants additional sysadmins root via the wheel group:

## Allows people in group wheel to run all commands
%wheel        ALL=(ALL)       ALL

He replaced ALL with localhost:

## Allows people in group wheel to run all commands
%wheel        localhost=(ALL)       ALL
Then, without testing, he distributed this file to all servers in the datacenter. The sysadmins who worked after him discovered that the sudo su - command no longer worked and that they couldn't get root using their tried and true method ;-)
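
The standard guard against exactly this blunder, as a sketch: validate the candidate file before it goes anywhere, and verify on one box before distributing.

    # Syntax-check a candidate sudoers file without installing it:
    visudo -c -f /tmp/sudoers.new
    # After deploying to a single test server, confirm the rules still apply
    # (keeping a separate root shell open while you test):
    sudo -l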

[Apr 22, 2018] Unix Horror Stories: The good thing about Unix is that when it screws up, it does so very quickly

Notable quotes:
"... And then I realized I had thrashed the server. Completely. ..."
"... There must be a way to fix this , I thought. HP-UX has a package installer like any modern Linux/Unix distribution, that is swinstall . That utility has a repair command, swrepair . ..."
"... you probably don't want that user owning /bin/nologin. ..."
Aug 04, 2011 | unixhorrorstories.blogspot.com

Unix Horror Stories: The good thing about Unix is that when it screws up, it does so very quickly

The project to deploy a new, multi-million-dollar commercial system on two big, brand-new HP-UX servers at a brewing company that shall not be named had been running on time and within budget for several months. Just a few steps remained, among them the migration of users from the old servers to the new ones.

The task was going to be simple: just copy the home directories of each user from the old server to the new ones, and a simple script to change the owner so as to make sure that each home directory was owned by the correct user. The script went something like this:

#!/bin/bash

cat /etc/passwd|while read line
      do
         USER=$(echo $line|cut -d: -f1)
         HOME=$(echo $line|cut -d: -f6)
         chown -R $USER $HOME
      done

[NOTE: the script does not filter out system ids from user ids, and that's a grave mistake. Also, it was run before it was tested ;-) -- NNB]

As you see, this script is pretty simple: obtain the user and the home directory from the password file, and then execute the chown command recursively on the home directory. I copied the files, executed the script, and thought, great, just 10 minutes and all is done.

That's when the calls started.

It turns out that while I was executing those seemingly harmless commands, the server was under acceptance test. You see, we were just one week away from going live and the final touches were everything that was required. So the users in the brewing company started testing if everything they needed was working like in the old servers. And suddenly, the users noticed that their system was malfunctioning and started making furious phone calls to my boss and then my boss started to call me.

And then I realized I had thrashed the server. Completely. My console was still open and I could see that the processes started failing, one by one, reporting very strange messages to the console, that didn't look any good. I started to panic. My workmate Ayelen and I (who just copied my script and executed it in the mirror server) realized only too late that the home directory of the root user was / -the root filesystem- so we changed the owner of every single file in the filesystem to root!!! That's what I love about Unix: when it screws up, it does so very quickly, and thoroughly.

There must be a way to fix this , I thought. HP-UX has a package installer like any modern Linux/Unix distribution, that is swinstall . That utility has a repair command, swrepair . So the following command put the system files back in order, needing a few permission changes on the application directories that weren't installed with the package manager:

swrepair -F

But the story doesn't end here. The next week, we were going live, and I knew that the migration of the users would be for real this time, not just a test. My boss and I were going to the brewing company, and he received a phone call. Then he turned to me and asked, "What was the command that you used last week?". I told him and I noticed that he was dictating it very carefully. When we arrived, we saw why: before the final deployment, a Unix administrator from the company made the same mistake I had, but this time people from the whole country were connecting to the system, and he received phone calls from a lot of angry users. Luckily, the mistake could be fixed, and we all, young and old, went back to reading the HP-UX manual. Those things can come in handy sometimes!

Moral of this story: before doing something to users' directories, take the time to check the User IDs of actual users - which usually start at 500, but it's configuration-dependent - because system users' IDs are lower than that.
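
Applying that moral to the script above gives something like this sketch (the UID threshold of 500 is the story's own assumption; adjust per system):

#!/bin/bash
# Safer variant: parse each passwd field, skip system accounts and
# suspicious home directories before the recursive chown.
while IFS=: read -r user _ uid _ _ home _; do
    [ "$uid" -ge 500 ] || continue                    # skip system ids
    [ -n "$home" ] && [ "$home" != "/" ] || continue  # never chown -R /
    chown -R "$user" "$home"
done < /etc/passwd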

Send in your Unix horror story, and it will be featured here in the blog!

Greetings,
Agustin

Colin McD, 16 March 2017, 15:02

This script is so dangerous. You are giving home directories to, say, the apache user, and you probably don't want that user owning /bin/nologin.

[Apr 22, 2018] Unix Horror story script question (Unix/Linux Forums, Shell Programming and Scripting)

Apr 22, 2018 | www.unix.com

scottsiddharth Registered User

Unix Horror story script question


This text and script is borrowed from the "Unix Horror Stories" document.

It states as follows

"""""Management told us to email a security notice to every user on the our system (at that time, around 3000 users). A certain novice administrator on our system wanted to do it, so I instructed them to extract a list of users from /etc/passwd, write a simple shell loop to do the job, and throw it in the background.
Here's what they wrote (bourne shell)...

for USER in `cat user.list`;
do mail $USER <message.text &
done

Have you ever seen a load average of over 300???" [end of quoted text]

My question is this- What is wrong with the script above? Why did it find a place in the Horror stories? It worked well when I tried it.

Maybe he intended to throw the whole script in the background and not just the Mail part. But even so it works just as well... So?

Thunderbolt

RE:Unix Horror story script question
I think it well deserves its place in the horror stories.

Whether or not the given server has an SMTP service role, this script tries to run 3000 mail commands in parallel, one per recipient.

Have you ever tried it with 3000 valid e-mail IDs? You can feel the heat of the CPU (sar 1 100).

P.S.: I did not test it, but theoretically I am sure of it.

Best Regards.

Thunderbolt, 11-24-2008 - Original Discussion by scottsiddharth

Quote:

Originally Posted by scottsiddharth

Thank you for the reply. But isn't that exactly what the real admin asked the novice admin to do.

Is there a better script or solution ?

Well, let me try to make it sequential to reduce the CPU load, though it will take (number of users) * SLP_INT (default 1) seconds to execute...

# Interval between consecutive mail commands, in seconds (minimum 1 second).

      SLP_INT=1
      for USER in `cat user.list`
      do
         mail "$USER" < message.text
         sleep "${SLP_INT:-1}"
      done
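
If strictly serial is too slow, a middle ground is to cap the concurrency; a sketch assuming GNU xargs (-P is a GNU extension):

      # Run at most 10 mail processes at a time, one recipient per invocation:
      xargs -n1 -P10 sh -c 'mail "$1" < message.text' _ < user.list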

[Apr 22, 2018] THE classic Unix horror story programming

Looks like not much has changed since 1986. I am amazed how little Unix has changed over the years. rm remains a danger, although zsh and the -I option of GNU rm are improvements. I think every sysadmin has wiped out important data with rm at least once in his career, so more work on this problem is needed.
Notable quotes:
"... Because we are creatures of habit. If you ALWAYS have to type 'yes' for every single deletion, it will become habitual, and you will start doing it without conscious thought. ..."
"... Amazing what kind of damage you can recover from given enough motivation. ..."
"... " in "rm -rf ~/ ..."
Apr 22, 2008 | www.reddit.com

probablycorey 10 years ago (35 children)

A little trick I use to ensure I never delete the root or home dir... Put a file called -i in / and ~

If you ever call rm -rf *, -i (the request confirmation option) will be the first path expanded. So your command becomes...

rm -rf -i

Catastrophe Averted!

mshade 10 years ago (0 children)
That's a pretty good trick! Unfortunately it doesn't work if you specify the path of course, but will keep you from doing it with a PWD of ~ or /.

Thanks!

aythun 10 years ago (2 children)
Or just use zsh. It's awesome in every possible way.
brian@xenon:~/tmp/test% rm -rf *
zsh: sure you want to delete all the files in /home/brian/tmp/test [yn]?
rex5249 10 years ago (1 child)
I keep a daily clone of my laptop and I usually do some backups in the middle of the day, so if I lose a disk it isn't a big deal other than the time wasted copying files.
MyrddinE 10 years ago (1 child)
Because we are creatures of habit. If you ALWAYS have to type 'yes' for every single deletion, it will become habitual, and you will start doing it without conscious thought.

Warnings must only pop up when there is actual danger, or you will become acclimated to, and cease to see, the warning.

This is exactly the problem with Windows Vista, and why so many people harbor such ill-will towards its 'security' system.

zakk 10 years ago (3 children)
and if I want to delete that file?!? ;-)
alanpost 10 years ago (0 children)
I use the same trick, so either of:

$ rm -- -i

or

$ rm ./-i

will work.

emag 10 years ago (0 children)
rm /-i ~/-i
nasorenga 10 years ago * (2 children)
The part that made me the most nostalgic was his email address: mcvax!ukc!man.cs.ux!miw

Gee whiz, those were the days... (Edit: typo)

floweryleatherboy 10 years ago (6 children)
One of my engineering managers wiped out an important server with rm -rf. Later it turned out he had a giant stock of kiddy porn on company servers.
monstermunch 10 years ago (16 children)
Whenever I use rm -rf, I always make sure to type the full path name in (never just use *) and put the -rf at the end, not after the rm. This means you don't have to worry about hitting "enter" in the middle of typing the path name (it won't delete the directory because the -rf is at the end) and you don't have to worry as much about data deletion from accidentally copy/pasting the command somewhere with middle click or if you redo the command while looking in your bash history.

Hmm, couldn't you alias "rm -rf" to mv the directory/files to a temp directory to be on the safe side?

branston 10 years ago (8 children)
Aliasing 'rm' is fairly common practice in some circles. It can have its own set of problems however (filling up partitions, running out of inodes...)
amnezia 10 years ago (5 children)
you could alias it with a script that prevents rm -rf * being run in certain directories.
jemminger 10 years ago (4 children)
you could also alias it to 'ls' :)
derefr 10 years ago * (1 child)
One could write a daemon that lets the oldest files in that directory be "garbage collected" when those conditions are approaching. I think this is, in a roundabout way, how Windows' shadow copy works.
branston 10 years ago (0 children)
Could do. Think we might be walking into the over-complexity trap however. The only time I've ever had an rm related disaster was when accidentally repeating an rm that was already in my command buffer. I looked at trying to exclude patterns from the command history but csh doesn't seem to support doing that so I gave up.

A decent solution just occurred to me for when the underlying file system supports snapshots (UFS2, for example). Just snap the fs the to-be-deleted items are on prior to the delete. That needs barely any IO and you can set the snapshots to expire after 10 minutes.

Hmm... Might look at implementing that..

mbm 10 years ago (0 children)
Most of the original UNIX tools took the arguments in strict order, requiring that the options came first; you can even see this on some modern *BSD systems.
shadowsurge 10 years ago (1 child)
I just always format the command with ls first just to make sure everything is in working order. Then my neurosis kicks in and I do it again... and a couple more times just to make sure nothing bad happens.
Jonathan_the_Nerd 10 years ago (0 children)
If you're unsure about your wildcards, you can use echo to see exactly how the shell will expand your arguments.
splidge 10 years ago (0 children)
A better trick IMO is to use ls on the directory first.. then when you are sure that's what you meant type rm -rf !$ to delete it.
earthboundkid 10 years ago * (0 children)
Ever since I got burned by letting my pinky slip on the enter key years ago, I've been typing echo path first, then going back and adding the rm after the fact.
zerokey 10 years ago * (2 children)
Great story. Halfway through reading, I had a major wtf moment. I wasn't surprised by the use of a VAX, as my old department just retired their last VAX a year ago. The whole time, I'm thinking, "hello..mount the tape hardware on another system and, worst case scenario, boot from a live cd!"

Then I got to, "The next idea was to write a program to make a device descriptor for the tape deck" and looked back at the title and realized that it was from 1986 and realized, "oh..oh yeah...that's pretty fucked."

iluvatar 10 years ago (0 children)

Great story

Yeah, but really, he had way too much of a working system to qualify for true geek godhood. That title belongs to Al Viro . Even though I've read it several times, I'm still in awe every time I see that story...

cdesignproponentsist 10 years ago (0 children)
FreeBSD has backup statically-linked copies of essential system recovery tools in /rescue, just in case you toast /bin, /sbin, /lib, ld-elf.so.1, etc.

It won't protect against a rm -rf / though (and is not intended to), although you could chflags -R schg /rescue to make them immune to rm -rf.

clytle374 10 years ago * (9 children)
It happens, I tried a few months back to rm -rf bin to delete a directory and did a rm -rf /bin instead.

First thought: That took a long time.

Second thought: What do you mean ls not found.

I was amazed that the desktop survived for nearly an hour before crashing.

earthboundkid 10 years ago (8 children)
This really is a situation where GUIs are better than CLIs. There's nothing like the visual confirmation of seeing what you're obliterating to set your heart into the pit of your stomach.
jib 10 years ago (0 children)
If you're using a GUI, you probably already have that. If you're using a command line, use mv instead of rm.

In general, if you want the computer to do something, tell it what you want it to do, rather than telling it to do something you don't want and then complaining when it does what you say.

earthboundkid 10 years ago (3 children)
Yes, but trash cans aren't manly enough for vi and emacs users to take seriously. If it made sense and kept you from shooting yourself in the foot, it wouldn't be in the Unix tradition.
earthboundkid 10 years ago (1 child)
  1. Are you so low on disk space that it's important for your trash can to be empty at all times?
  2. Why should we humans have to adapt our directory names to route around the busted-ass-ness of our tools? The tools should be made to work with capital letters and spaces. Or better, use a GUI for deleting so that you don't have to worry about OMG, I forgot to put a slash in front of my space!

Seriously, I use the command line multiple times every day, but there are some tasks for which it is just not well suited compared to a GUI, and (bizarrely considering it's one thing the CLI is most used for) one of them is moving around and deleting files.

easytiger 10 years ago (0 children)
That's a very simple bash/ksh/python/etc script (see the sketch after this list):
  1. script a move op to a hidden dir on the /current/ partition.
  2. alias this to rm
  3. wrap rm as an alias to delete the contents of the hidden folder, with confirmation
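
A minimal sketch of that idea in shell (names are illustrative, not a hardened replacement for rm):

    TRASH="$HOME/.trash"
    mkdir -p "$TRASH"
    trash()       { mv -- "$@" "$TRASH"/; }    # steps 1+2: "rm" only moves files
    empty_trash() {                            # step 3: the only real delete
        printf 'Really purge %s? [y/N] ' "$TRASH"
        read -r ans && [ "$ans" = y ] && /bin/rm -rf -- "$TRASH"/*
    }
    alias rm=trash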
mattucf 10 years ago (3 children)
I'd like to think that most systems these days don't have / set as root's home directory, but I've seen a few that do. :/
dsfox 10 years ago (0 children)
This is a good approach in 1986. Today I would just pop in a bootable CDROM.
fjhqjv 10 years ago * (5 children)
That's why I always keep stringent file permissions and never act as the root user.

I'd have to try to rm -rf, get a permission denied error, then retype sudo rm -rf and then type in my password to ever have a mistake like that happen.

But I'm not a systems administrator, so maybe it's not the same thing.

toast_and_oj 10 years ago (2 children)
I aliased "rm -rf" to "omnomnom" and got myself into the habit of using that. I can barely type "omnomnom" when I really want to, let alone when I'm not really paying attention. It's saved one of my projects once already.
shen 10 years ago (0 children)
I've aliased "rm -rf" to "rmrf". Maybe I'm just a sucker for punishment.

I haven't been bit by it yet, the defining word being yet.

robreim 10 years ago (0 children)
I would have thought tab completion would have made omnomnom potentially easier to type than rm -rf (since the -rf part needs to be typed explicitly)
immure 10 years ago (0 children)
It's not.
lespea 10 years ago (0 children)
before I ever do something like that I make sure I don't have permissions so I get an error, then I press up, home, and type sudo <space> <enter> and it works as expected :)
kirun 10 years ago (0 children)
And I was pleased the other day how easy it was to fix the system after I accidentally removed kdm, konqueror and kdesktop... but these guys are hardcore.
austin_k 10 years ago (0 children)
I actually started to feel sick reading that. I've been in a IT disaster before where we almost lost a huge database. Ugh.. I still have nightmares.
umilmi81 10 years ago (4 children)
Task number 1 with a UNIX system. Alias rm to rm -i. Call the explicit path when you want to avoid the -i (ie: /bin/rm -f). Nobody is too cool to skip this basic protection.
flinchn 10 years ago (0 children)
I did an application install at an LE agency last fall - stupid me typed mv /etc /etcbk instead of mv ./etc ./etcbk

ahh that damned period

DrunkenAsshole 10 years ago (0 children)
Were the "*"s really needed for a story that has plagued, at one point or another, all OS users?
xelfer 10 years ago (0 children)
Is the home directory for root / on some unix systems? I thought 'cd' then 'rm -rf *' would have deleted whatever's in his home directory (or whatever $HOME points to).
srparish 10 years ago (0 children)
Couldn't he just have used the editor to create the etc files he wanted, and used cpio as root to copy that over as an /etc?

sRp

stox 10 years ago (1 child)
Been there, done that. Have the soiled underwear to prove it. Amazing what kind of damage you can recover from given enough motivation.
sheepskin 10 years ago * (0 children)
I had a customer do this; he killed it at about the same point. I told him he was screwed and that I'd charge him a bunch of money to take down his server, rebuild it from a working one, and put it back up. But the customer happened to have a root ftp session open, and was able to upload what he needed to bring the system back. By the time he was done I rebooted it to make sure it was cool, and it booted all the way back up.

Of course I've also had a lot of customers who have done it and they were screwed, and I got to charge them a bunch of money.

jemminger 10 years ago (0 children)
pfft. that's why lusers don't get root access.
supersan 10 years ago (2 children)
I had the same thing happen to me once... my C:\ drive was running NTFS and I accidentally deleted the "ntldr" system file in the C:\ root (because the name didn't mean much to me). Then later I couldn't even boot into safe mode! And my bootable disk didn't recognize the C:\ drive because it was NTFS!! So, sadly, I had to reinstall everything :( Wasted a whole day over it.
b100dian 10 years ago (0 children)
Yes, but that's a single file. I suppose anyone can write hex into mbr to copy ntldr from a samba share!
bobcat 10 years ago (0 children)
http://en.wikipedia.org/wiki/Emergency_Repair_Disk
boredzo 10 years ago (0 children)
Neither one is the original source. The original source is Usenet, and I can't find it with Google Groups. So either of these webpages is as good as the other.
docgnome 10 years ago (0 children)
In 1986? On a VAX?
MarlonBain 10 years ago (0 children)

This classic article from Mario Wolczko first appeared on Usenet in 1986 .

amoore 10 years ago (0 children)
I got sidetracked trying to figure out why the fictional antagonist would type the extra "/ " in "rm -rf ~/ ".
Zombine 10 years ago (2 children)

...it's amazing how much of the system you can delete without it falling apart completely. Apart from the fact that nobody could login (/bin/login?), and most of the useful commands had gone, everything else seemed normal.

Yeah. So apart from the fact that no one could get any work done or really do anything, things were working great!

I think a more rational reaction would be "Why on Earth is this big, important system on which many people rely designed in such a way that a simple easy-to-make human error can screw it up so comprehensively?" or perhaps "Why on Earth don't we have a proper backup system?"

daniels220 10 years ago (1 child)
The problem wasn't the backup system, it was the restore system, which relied on the machine having a "copy" command. Perfectly reasonable assumption that happened not to be true.
Zombine 10 years ago * (0 children)
Neither backup nor restoration serves any purpose in isolation. Most people would group those operations together under the heading "backup;" certainly you win only a semantic victory by doing otherwise. Their fail-safe data-protection system, call it what you will, turned out not to work, and had to be re-engineered on-the-fly.

I generally figure that the assumptions I make that turn out to be entirely wrong were not "perfectly reasonable" assumptions in the first place. Call me a traditionalist.

[Apr 22, 2018] rm and Its Dangers (Unix Power Tools, 3rd Edition)

Apr 22, 2018 | docstore.mik.ua
14.3. rm and Its Dangers

Under Unix, you use the rm command to delete files. The command is simple enough; you just type rm followed by a list of files. If anything, rm is too simple. It's easy to delete more than you want, and once something is gone, it's permanently gone. There are a few hacks that make rm somewhat safer, and we'll get to those momentarily. But first, here's a quick look at some of the dangers.

To understand why it's impossible to reclaim deleted files, you need to know a bit about how the Unix filesystem works. The system contains a "free list," which is a list of disk blocks that aren't used. When you delete a file, its directory entry (which gives it its name) is removed. If there are no more links ( Section 10.3 ) to the file (i.e., if the file only had one name), its inode ( Section 14.2 ) is added to the list of free inodes, and its datablocks are added to the free list.

Well, why can't you get the file back from the free list? After all, there are DOS utilities that can reclaim deleted files by doing something similar. Remember, though, Unix is a multitasking operating system. Even if you think your system is a single-user system, there are a lot of things going on "behind your back": daemons are writing to log files, handling network connections, processing electronic mail, and so on. You could theoretically reclaim a file if you could "freeze" the filesystem the instant your file was deleted -- but that's not possible. With Unix, everything is always active. By the time you realize you made a mistake, your file's data blocks may well have been reused for something else.

When you're deleting files, it's important to use wildcards carefully. Simple typing errors can have disastrous consequences. Let's say you want to delete all your object (.o) files. You want to type:

% rm *.o

But because of a nervous twitch, you add an extra space and type:

% rm * .o

It looks right, and you might not even notice the error. But before you know it, all the files in the current directory will be gone, irretrievably.

If you don't think this can happen to you, here's something that actually did happen to me. At one point, when I was a relatively new Unix user, I was working on my company's business plan. The executives thought that, so as to be "secure," they'd set the business plan's permissions so you had to be root (Section 1.18) to modify it. (A mistake in its own right, but that's another story.) I was using a terminal I wasn't familiar with and accidentally created a bunch of files with four control characters at the beginning of their names. To get rid of these, I typed (as root):

# rm ????*

This command took a long time to execute. When about two-thirds of the directory was gone, I realized (with horror) what was happening: I was deleting all files with four or more characters in the filename.

The story got worse. They hadn't made a backup in about five months. (By the way, this article should give you plenty of reasons for making regular backups ( Section 38.3 ).) By the time I had restored the files I had deleted (a several-hour process in itself; this was on an ancient version of Unix with a horrible backup utility) and checked (by hand) all the files against our printed copy of the business plan, I had resolved to be very careful with my rm commands.

[Some shells have safeguards that work against Mike's first disastrous example -- but not the second one. Automatic safeguards like these can become a crutch, though . . . when you use another shell temporarily and don't have them, or when you type an expression like Mike's very destructive second example. I agree with his simple advice: check your rm commands carefully! -- JP ]

-- ML

[Apr 22, 2018] How to prevent a mistaken rm -rf for specific folders?

Notable quotes:
"... There's nothing more on a traditional Linux, but you can set Apparmor/SELinux/ rules that prevent rm from accessing certain directories. ..."
"... Probably your best bet with it would be to alias rm -ri into something memorable like kill_it_with_fire . This way whenever you feel like removing something, go ahead and kill it with fire. ..."
Jan 20, 2013 | unix.stackexchange.com



amyassin, Jan 20, 2013 at 17:26

I think pretty much people here mistakenly ' rm -rf 'ed the wrong directory, and hopefully it did not cause a huge damage.. Is there any way to prevent users from doing a similar unix horror story ?? Someone mentioned (in the comments section of the previous link ) that

... I am pretty sure now every unix course or company using unix sets rm -fr to disable accounts of people trying to run it or stop them from running it ...

Is there any implementation of that in any current Unix or Linux distro? And what is the common practice to prevent that error even from a sysadmin (with root access)?

It seems that there was some protection for the root directory ( / ) in Solaris (since 2005) and GNU (since 2006). Is there anyway to implement the same protection way to some other folders as well??

To give it more clarity, I was not asking about general advice about rm usage (and I've updated the title to indicate that more), I want something more like the root folder protection: in order to rm -rf / you have to pass a specific parameter: rm -rf --no-preserve-root / .. Is there similar implementations for customized set of directories? Or can I specify files in addition to / to be protected by the preserve-root option?

mattdm, Jan 20, 2013 at 17:33

1) Change management 2) Backups. – mattdm Jan 20 '13 at 17:33

Keith, Jan 20, 2013 at 17:40

probably the only way would be to replace the rm command with one that doesn't have that feature. – Keith Jan 20 '13 at 17:40

sr_, Jan 20, 2013 at 18:28

safe-rm maybe – sr_ Jan 20 '13 at 18:28

Bananguin, Jan 20, 2013 at 21:07

most distros do `alias rm='rm -i'` which makes rm ask you if you are sure.

Besides that: know what you are doing. Only become root if necessary. For any user with root privileges, security of any kind must be implemented in and by the user. Hire somebody if you can't do it yourself. Over time any countermeasure becomes equivalent to the alias line above if you can't wrap your own head around the problem. – Bananguin Jan 20 '13 at 21:07

midnightsteel, Jan 22, 2013 at 14:21

@amyassin using rm -rf can be a resume generating event. Check and triple check before executing it – midnightsteel Jan 22 '13 at 14:21

Gilles, Jan 22, 2013 at 0:18

To avoid a mistaken rm -rf, do not type rm -rf .

If you need to delete a directory tree, I recommend the following workflow:

  1. Rename the directory you want to delete to DELETE.
  2. Inspect the renamed directory and confirm it holds what you expect.
  3. Run rm -rf DELETE.

Never call rm -rf with an argument other than DELETE. Doing the deletion in several stages gives you an opportunity to verify that you aren't deleting the wrong thing, either because of a typo (as in rm -rf /foo /bar instead of rm -rf /foo/bar ) or because of a braino (oops, no, I meant to delete foo.old and keep foo.new ).
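In command form the workflow is just three lines (a sketch; some-dir is a placeholder):

mv some-dir DELETE
ls -la DELETE      # look at what you are about to destroy
rm -rf DELETE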

If your problem is that you can't trust others not to type rm -rf, consider removing their admin privileges. There's a lot more that can go wrong than rm .


Always make backups .

Periodically verify that your backups are working and up-to-date.

Keep everything that can't be easily downloaded from somewhere under version control.


With a basic unix system, if you really want to make some directories undeletable by rm, replace (or better shadow) rm by a custom script that rejects certain arguments. Or by hg rm .
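A minimal sketch of such a shadow script (the protected list, and the assumption that the script sits earlier in PATH than /bin/rm, are illustrative, not from the answer):

#!/bin/sh
# rm wrapper: refuse a hard-coded list of protected paths,
# then hand everything else to the real rm.
for arg in "$@"; do
    case "$arg" in
        /|/etc|/etc/*|/home|/home/*)
            echo "rm: refusing to operate on protected path: $arg" >&2
            exit 1
            ;;
    esac
done
exec /bin/rm "$@"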

Some unix variants offer more possibilities.

amyassin, Jan 22, 2013 at 9:41

Yeah backing up is the most amazing solution, but I was thinking of something like the --no-preserve-root option, for other important folders.. And that apparently does not exist even as a practice... – amyassin Jan 22 '13 at 9:41

Gilles, Jan 22, 2013 at 20:32

@amyassin I'm afraid there's nothing more (at least not on Linux). rm -rf already means "delete this, yes I'm sure I know what I'm doing". If you want more, replace rm by a script that refuses to delete certain directories. – Gilles Jan 22 '13 at 20:32

Gilles, Jan 22, 2013 at 22:17

@amyassin Actually, I take this back. There's nothing more on a traditional Linux, but you can set Apparmor/SELinux rules that prevent rm from accessing certain directories. Also, since your question isn't only about Linux, I should have mentioned OSX, which has something a bit like what you want. – Gilles Jan 22 '13 at 22:17

qbi, Jan 22, 2013 at 21:29

If you are using rm * in zsh, you can set the option rmstarwait:
setopt rmstarwait

Now the shell warns when you're using the * :

> zsh -f
> setopt rmstarwait
> touch a b c
> rm *
zsh: sure you want to delete all the files in /home/unixuser [yn]? _

When you reject it ( n ), nothing happens. Otherwise all files will be deleted.

Drake Clarris, Jan 22, 2013 at 14:11

EDIT as suggested by comment:

You can change the attribute of the file or directory to immutable, and then it cannot be deleted even by root until the attribute is removed.

chattr +i /some/important/file

This also means that the file cannot be written to or changed in any way, even by root. Another attribute apparently available that I haven't used myself is the append attribute (chattr +a /some/important/file). Then the file can only be opened in append mode, meaning no deletion as well, but you can add to it (say a log file). This means you won't be able to edit it in vim for example, but you can do echo 'this adds a line' >> /some/important/file. Using > instead of >> will fail.

These attributes can be unset using a minus sign, i.e. chattr -i file
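A quick illustration (a sketch; the error text is what GNU rm typically prints when unlink fails on an immutable file):

# chattr +i /some/important/file
# rm /some/important/file
rm: cannot remove '/some/important/file': Operation not permitted
# chattr -i /some/important/file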

Otherwise, if this is not suitable, one thing I practice is to always ls /some/dir first, and then instead of retyping the command, press up-arrow then Ctrl-A, delete the ls and type in my rm -rf if I need it. Not perfect, but by looking at the results of ls, you know beforehand if it is what you wanted.

NlightNFotis, Jan 22, 2013 at 8:27

One possible choice is to stop using rm -rf and start using rm -ri . The extra i parameter there is to make sure that it asks if you are sure you want to delete the file.

Probably your best bet with it would be to alias rm -ri into something memorable like kill_it_with_fire . This way whenever you feel like removing something, go ahead and kill it with fire.
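In ~/.bashrc that is a one-liner (a sketch; remember that aliases take effect only in interactive shells):

alias kill_it_with_fire='rm -ri'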

amyassin, Jan 22, 2013 at 14:24

I like the name, but isn't f the exact opposite of the i option?? I tried it and it worked though... – amyassin Jan 22 '13 at 14:24

NlightNFotis, Jan 22, 2013 at 16:09

@amyassin Yes it is. For some strange kind of fashion, I thought I only had r in there. Just fixed it. – NlightNFotis Jan 22 '13 at 16:09

Silverrocker, Jan 22, 2013 at 14:46

To protect against an accidental rm -rf * in a directory, create a file called "-i" (you can do this with emacs or some other program) in that directory. When the shell expands *, the -i lands in rm's argument list as an option and sends rm into interactive mode.

For example: You have a directory called rmtest with the file named -i inside. If you try to rm everything inside the directory, rm will first get -i passed to it and will go into interactive mode. If you put such a file inside the directories you would like to have some protection on, it might help.

Note that this is ineffective against rm -rf rmtest .
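Creating the guard file takes one command (a sketch; the ./ prefix or -- keeps the name from being parsed as an option):

cd rmtest
touch ./-i       # or: touch -- -i

Because - sorts before alphanumeric characters in most locales, an expanded * typically hands -i to rm before the real file names.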

ValeriRangelov, Dec 21, 2014 at 3:03

If you understand the C programming language, I think it is possible to rewrite the rm source code and make a little patch for the kernel. I saw this on one server: it was impossible to delete some important directories, and when you typed 'rm -rf /directory' it sent email to the sysadmin.

[Apr 21, 2018] Any alias of rm is a very stupid idea

Option -I is more modern and more useful than the old option -i: it prompts just once before removing more than three files or before removing recursively. It is highly recommended, and contrary to what the author states below, it makes sense to use an alias with it (he probably does not understand that aliases do not work in non-interactive sessions).
The point the author makes is that when you come to expect rm to be aliased to rm -i, you get into trouble on machines where this is not the case. And that's completely true.
But the alias alone does not solve the problem, as respondents noted: reaching for it soon becomes automatic. Writing your own wrapper is a better deal. One such wrapper, safe-rm, already exists, and while not perfect it is useful.
Notable quotes:
"... A co-worker had such an alias. Imagine the disaster when, visiting a customer site, he did "rm *" in the customer's work directory and all he got was the prompt for the next command after rm had done what it was told to do. ..."
"... It you want a safety net, do "alias del='rm -I –preserve_root'", ..."
Feb 14, 2017 | www.cyberciti.biz
Art Protin June 12, 2012, 9:53 pm

Any alias of rm is a very stupid idea (except maybe alias rm=echo fool).

A co-worker had such an alias. Imagine the disaster when, visiting a customer site, he did "rm *" in the customer's work directory and all he got was the prompt for the next command after rm had done what it was told to do.

If you want a safety net, do "alias del='rm -I --preserve-root'".

Drew Hammond March 26, 2014, 7:41 pm
^ This x10000.

I've made the same mistake before and it's horrible.

[Mar 28, 2018] Sysadmin wiped two servers, left the country to escape the shame by Simon Sharwood

Mar 26, 2018 | theregister.co.uk
"This revolutionary product allowed you to basically 'mirror' two file servers," Graham told The Register . "It was clever stuff back then with a high speed 100mb FDDI link doing the mirroring and the 10Mb LAN doing business as usual."

Graham was called upon to install said software at a British insurance company, which involved a 300km trip on Britain's famously brilliant motorways with a pair of servers in the back of a company car.

Maybe that drive was why Graham made a mistake after the first part of the job: getting the servers set up and talking.

"Sadly the software didn't make identifying the location of each disk easy," Graham told us. "And – ummm - I mirrored it the wrong way."

"The net result was two empty but beautifully-mirrored servers."

Oops.

Graham tried to find someone to blame, but as he was the only one on the job that wouldn't work.

His next instinct was to run, but as the site had a stack of Quarter Inch Cartridge backup tapes, he quickly learned that "incremental back-ups are the work of the devil."

Happily, all was well in the end.

[Dec 07, 2017] First Rule of Usability Don't Listen to Users

Notable quotes:
"... So, do users know what they want? No, no, and no. Three times no. ..."
Dec 07, 2017 | www.nngroup.com

But ultimately, the way to get user data boils down to the basic rules of usability

... ... ...

So, do users know what they want? No, no, and no. Three times no.

Finally, you must consider how and when to solicit feedback. Although it might be tempting to simply post a survey online, you're unlikely to get reliable input (if you get any at all). Users who see the survey and fill it out before they've used the site will offer irrelevant answers. Users who see the survey after they've used the site will most likely leave without answering the questions. One question that does work well in a website survey is "Why are you visiting our site today?" This question goes to users' motivation and they can answer it as soon as they arrive.

[Dec 07, 2017] The rogue DHCP server

Notable quotes:
"... from Don Watkins ..."
Dec 07, 2017 | opensource.com

from Don Watkins

I am a liberal arts person who wound up being a technology director. With the exception of 15 credit hours earned on my way to a Cisco Certified Network Associate credential, all of the rest of my learning came on the job. I believe that learning what not to do from real experiences is often the best teacher. However, those experiences can frequently come at the expense of emotional pain. Prior to my Cisco experience, I had very little experience with TCP/IP networking and the kinds of havoc I could create, albeit innocently, due to my lack of understanding of the nuances of routing and DHCP.

At the time our school network was an Active Directory domain with DHCP and DNS provided by a Windows 2000 server. All of our staff access to email, the Internet, and network shares was served this way. I had been researching the use of the K12 Linux Terminal Server (K12LTSP) project and had built a Fedora Core box with a single network card in it. I wanted to see how well my new project worked, so without talking to my network support specialists I connected it to our main LAN segment. In a very short period of time our help desk phones were ringing with principals, teachers, and other staff who could no longer access their email, printers, shared directories, and more. I had no idea that the Windows clients would see another DHCP server on our network (my test computer) and pick up an IP address and DNS information from it.

I had unwittingly created a "rogue" DHCP server and was oblivious to the havoc that it would create. I shared with the support specialist what had happened, and I can still see him making a bee-line for that rogue computer and disconnecting it from the network. All of our client computers had to be rebooted, along with many of our switches, which resulted in a lot of confusion and lost time due to my ignorance. That's when I learned that it is best to test new products on their own subnet.

[Jul 20, 2017] These Guys Didnt Back Up Their Files, Now Look What Happened

Notable quotes:
"... Unfortunately, even today, people have not learned that lesson. Whether it's at work, at home, or talking with friends, I keep hearing stories of people losing hundreds to thousands of files, sometimes they lose data worth actual dollars in time and resources that were used to develop the information. ..."
"... "I lost all my files from my hard drive? help please? I did a project that took me 3 days and now i lost it, its powerpoint presentation, where can i look for it? its not there where i save it, thank you" ..."
"... Please someone help me I last week brought a Toshiba Satellite laptop running windows 7, to replace my blue screening Dell vista laptop. On plugged in my sumo external hard drive to copy over some much treasured photos and some of my (work – music/writing.) it said installing driver. it said completed I clicked on the hard drive and found a copy of my documents from the new laptop and nothing else. ..."
Jul 20, 2017 | www.makeuseof.com
Back in college, I used to work just about every day as a computer cluster consultant. I remember a month after getting promoted to a supervisor, I was in the process of training a new consultant in the library computer cluster. Suddenly, someone tapped me on the shoulder, and when I turned around I was confronted with a frantic graduate student – a 30-something year old man who I believe was Eastern European based on his accent – who was nearly in tears.

"Please need help – my document is all gone and disk stuck!" he said as he frantically pointed to his PC.

Now, right off the bat I could have told you three facts about the guy. One glance at the blue screen of the archaic DOS-based version of Wordperfect told me that – like most of the other graduate students at the time – he had not yet decided to upgrade to the newer, point-and-click style word processing software. For some reason, graduate students had become so accustomed to all of the keyboard hot-keys associated with typing in a DOS-like environment that they all refused to evolve into point-and-click users.

The second fact, gathered from a quick glance at his blank document screen and the sweat on his brow told me that he had not saved his document as he worked. The last fact, based on his thick accent, was that communicating the gravity of his situation wouldn't be easy. In fact, it was made even worse by his answer to my question when I asked him when he last saved.

"I wrote 30 pages."

Calculated out at about 600 words a page, that's 18000 words. Ouch.

Then he pointed at the disk drive. The floppy disk was stuck, and from the marks on the drive he had clearly tried to get it out with something like a paper clip. By the time I had carefully fished the torn and destroyed disk out of the drive, it was clear he'd never recover anything off of it. I asked him what was on it.

"My thesis."

I gulped. I asked him if he was serious. He was. I asked him if he'd made any backups. He hadn't.

Making Backups of Backups

If there is anything I learned during those early years of working with computers (and the people that use them), it was how critical it is to not only save important stuff, but also to save it in different places. I would back up floppy drives to those cool new zip drives as well as the local PC hard drive. Never, ever had a single copy of anything.

Unfortunately, even today, people have not learned that lesson. Whether it's at work, at home, or talking with friends, I keep hearing stories of people losing hundreds to thousands of files; sometimes they lose data worth real dollars in the time and resources that were used to develop the information.

To drive that lesson home, I wanted to share a collection of stories that I found around the Internet about some recent cases where people suffered that horrible fate – from thousands of files to entire drives' worth of data completely lost. These are people for whom the only remaining option is to start running recovery software and praying, or in other cases paying thousands of dollars to a data recovery firm and hoping there's something to find.

Not Backing Up Projects

The first example comes from Yahoo Answers , where a user that only provided a "?" for a user name (out of embarrassment probably), posted:

"I lost all my files from my hard drive? help please? I did a project that took me 3 days and now i lost it, its powerpoint presentation, where can i look for it? its not there where i save it, thank you"

The folks answering immediately dove into suggesting that the person run recovery software, and one person suggested that the person run a search on the computer for *.ppt.

... ... ...

Doing Backups Wrong

Then there's the scenario of actually trying to do a backup and doing it wrong, losing all of the files on the original drive. That was the case for the person who posted on Tech Support Forum who, after purchasing a brand new Toshiba laptop and attempting to transfer old files from an external hard drive, inadvertently wiped the files on the external drive.

Please someone help me I last week brought a Toshiba Satellite laptop running windows 7, to replace my blue screening Dell vista laptop. On plugged in my sumo external hard drive to copy over some much treasured photos and some of my (work – music/writing.) it said installing driver. it said completed I clicked on the hard drive and found a copy of my documents from the new laptop and nothing else.

While the description of the problem is a little broken, from the sound of it the person thought they were backing up in one direction while they were actually backing up in the other. At least in this case not all of the original files were deleted, but a majority were.

[May 07, 2017] centos - Do not play those dangerous games with resizing of partitions unless absolutely necessary

Copying to an additional drive (it can be USB), repartitioning, and then copying everything back is a safer bet
www.softpanorama.org

In theory, you could reduce the size of sda1, increase the size of the extended partition, shift the contents of the extended partition down, then increase the size of the PV on the extended partition and you'd have the extra room. However, the number of possible things that can go wrong there is just astronomical, so I'd recommend either buying a second hard drive (and possibly transferring everything onto it in a more sensible layout, then repartitioning your current drive better) or just making some bind mounts of various bits and pieces out of /home into / to free up a bit more space.

--womble

[May 05, 2017] As Unix does not have a rename command, usage of mv for renaming can lead to a SNAFU

www.softpanorama.org

If the destination does not exist, mv behaves as a rename command; but if the destination exists and is a directory, mv moves the source into it, one level down.

For example, if you have directories /home and /home2 and want to move all subdirectories from /home2 to /home, and the directory /home is empty, you can't use

mv home2 home

If you forget to remove the directory /home first, mv will silently create the /home/home2 directory, and you have a problem if these are user home directories.
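What was probably intended, under the same setup (a sketch; note that * does not match dot files):

mv /home2/* /home/               # move the contents, leaving /home in place
# or, since /home is empty, remove it first so mv performs a true rename:
rmdir /home && mv /home2 /home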

[May 05, 2017] The key problem with the cp utility is that it does not preserve the timestamps of files.

Windows users expect a copy command to preserve file attributes, but this is not true of the Unix cp command.
Using the -r option without the -p option destroys all timestamps.
www.vanityfair.com

-p -- Preserve the characteristics of the source_file. Copy the contents, modification times, and permission modes of the source_file to the destination files.

You might wish to create an alias

alias cp='cp -p'

as I can't imagine a case where the regular Unix behaviour is desirable.
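On GNU systems, cp -a (archive mode) goes further still: it implies -R and preserves mode, ownership, timestamps, and symlinks, which makes it the natural spelling for backup-style copies (a sketch):

cp -a project/ project.bak-$(date +%F)    # dated copy with attributes intact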

[Feb 14, 2017] My 10 UNIX Command Line Mistakes

Feb 14, 2017 | www.cyberciti.biz
Destroyed named.conf

I wanted to append a new zone to the /var/named/chroot/etc/named.conf file, but ended up running:
./mkzone example.com > /var/named/chroot/etc/named.conf
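The single > truncated the existing file. A safer habit (a sketch, not from the original article) is to keep a dated copy and append with >>:

cp -a /var/named/chroot/etc/named.conf /var/named/chroot/etc/named.conf.bak-$(date +%F)
./mkzone example.com >> /var/named/chroot/etc/named.conf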

Destroyed Working Backups with Tar and Rsync (personal backups)

I had only one backup copy of my QT project and I just wanted to get a directory called functions. I ended up deleting the entire backup (note the -c switch instead of -x):
cd /mnt/bacupusbharddisk
tar -zcvf project.tar.gz functions

I had no backup. Similarly I ended up running an rsync command and deleted all new files by overwriting them with files from the backup set (now I've switched to rsnapshot)
rsync -av -delete /dest /src
Again, I had no backup.
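The intended command extracts (x) rather than creates (c); a sketch, assuming the archive stores the directory under the name functions:

cd /mnt/bacupusbharddisk
tar -zxvf project.tar.gz functions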

[Feb 12, 2017] Vendor support vs local support

Jonathan.White Jul 9, 2015 10:14 AM (in response to nickzourdos)

We had a client that said their IBM application was running slow because of the "network". (The mysterious place that packets vanish into like a black hole...lol) I explained to them that the application spans two data centers in separate states across several different pieces of equipment. They said they didn't feel like going through the process of opening another ticket with IBM since IBM would require them to gather a bunch of logs and do a lot of investigation work on their side. Instead they decided to punt it over to the networking team by opening a ticket/incident that read something along the lines that their application was slow due to network related issues.

To help get things moving along I set up a weekly call to get a status on where we were with the troubleshooting process. The first thing I would do was a roll call. I would ask who was on the line and then very specifically ask if IBM was on the call. Every time they informed us that IBM wasn't on the call and hadn't been engaged. We were at a standstill and the calls would end very quickly after roll call because IBM was the missing piece. We needed someone with enough knowledge of the application to tell us what exactly was slow so we could track it down across the network. Based on the client's initial thought process of punting it over to networking, you can imagine how well they knew their application.

Needless to say, after a few weeks of roll call they asked me to cancel the meetings, since they had contacted IBM and tweaked a few application settings that corrected the problems. The issue was resolved on our end by a simple roll call, which was strategically done to get the problem routed to the proper group despite the client's laziness....

[Feb 12, 2017] Stupidity of the manager effect

nickzourdos Jul 9, 2015 10:13 AM

So the Exchange server had a bit of a hiccup one day, back when I was on the help desk. There was an hour window where one of the databases got behind and the queues had to catch up. This caused ~200 users to have slow or unresponsive Outlook clients. I got an angry call from someone in accounting after about 20 minutes of downtime, and she proceeded to assume the role of tech support manager:

Her: So is the email down?

Me: Yes, we've notified our system administrators and they have already fixed the issue. We're waiting for the server to return to normal, but for now it's playing catchup.

Her: So how are you going to prevent the help desk from getting swamped with calls? Don't you think it would be a good idea to help deflect the calls you're getting?

Me: We're actually not that swamped. The outage only applies to 205 users in the company that are on that specific database.

Her: Ok but what are you going to do about it? What about those 205 people who are having problems? Shouldn't you notify them? How hard is it to send a mass email letting them know that the server is down?

Me: I... don't think they would get the email if the email server is down for them.

Her: Well I'm going to send a mass email to the accounting department, I suggest you do the same for the rest of the company.

[Feb 12, 2017] Just the push of the button in the opened datacenter

atreides71 Jul 15, 2015 6:49 PM
My first job was at a Hewlett Packard reseller company. The small datacenter was in plain sight from the lobby, so our sales executives could talk to visitors about the infrastructure we were using to run the company systems (ERP, email, BI, etc.), and they had the bad habit of letting people in so they could see the different solutions up close.

One day one of those executives must have left the door open. It was summer holiday time, so we had a visit from a reseller accompanied by his young son, who quickly found that the door was open, came into the datacenter and pushed a single button: the on/off button of the Progress database server that kept the ERP information. Then he left the datacenter without being noticed.

In just a couple of minutes we had a lot of calls from all the branch offices asking about the ERP service; it took us 1 or 2 hours to find the failure, check the RAID status and the database integrity, and put it online again. We had a meeting looking for the root cause of the outage until someone had the idea to check the video from the security cameras; then we found who was really responsible for the failure of the system.

Since then, the Datacenter remained closed.

[Feb 11, 2017] Being way too lazy is not always beneficial

mleon Jul 15, 2015 9:13 AM

When a customer gets a replacement disk for their SAN and doesn't replace it for a week saying "I just couldn't bring myself to care about the SAN this week." and then another disk goes bad the next day.

[Feb 10, 2017] An inventive idea of reusing the socket into which the switch was plugged

jimtech18 Jul 31, 2015 1:34 PM

No hazard pay:

Replacing a failing switch in a high pressure test lab (one with signs that warn of the danger of pinhole leaks being able to KILL you). Up near the top of the stupid-tall 20' step ladder when the lab tech holding the ladder tells me about the guy who fell off this same ladder last year and broke his hip (now he tells me). (Who puts a switch in the ceiling supports anyway? Apparently there used to be a wall there that the switch was mounted to. Construction guys removed the wall so the switch and wires just got moved up and mounted at the ceiling! duh what else would you do? long before my time) Back to the challenge at hand. As I'm messing around with the switch the Hydrogen alarm mounted near the ceiling starts wailing, and the guy on the floor says "That's not good!" and leaves the room, remember him, he was steadying the ladder that someone fell off of last year that I am still at the top of. He soon returns and holds the ladder as I climb all the way down, seems like twice as far as when I climbed up.

By this time the Hydrogen alarm has stopped and both techs say that there is nothing to worry about and that I should finish the switch replacement so they can get back to work. Of course, as a SYSADMIN, I go back up the crappy 20' step ladder and finish swapping out the failed switch with a POE powered one, problem resolved. I take the failed switch back to my office and it works fine, what? How could that be? Turns out the extension cord that the switch (mounted at the ceiling) was plugged into had been unplugged by the first tech who was HELPING me because he needed an outlet. Once I pointed out the cause of the whole issue he said "Oh Yeah, that's where that cord goes, Oh Well, it's fixed now and I get to keep using the outlet, thanks".

[Feb 08, 2017] A side effect of simultaneous changes on many boxes can be a networking storm when boxes start communicating all at once

Deltona Jul 31, 2015 4:50 AM

I was supposed to do some routine redundancy tests at a remote site in another country. After implementing and testing everything successfully, I enabled EnergyWise on a couple hundred switches in one go. The broadcast storm that followed brought everything in the DC down to a halt.

It took me two hours to figure out why this happened, and I missed my flight back home. A couple of months later, a dozen-plus firmwares were released to address this issue.

[Feb 07, 2017] Troubleshooting method for networking problems: work up the OSI model - layer 1 - check the cabling


pseudocyber Jul 28, 2015 9:20 AM (in response to adcast)

Troubleshooting method - work up the OSI model - layer 1 - check the cabling. After checking the cabling, check the cabling again. Before you're ready to escalate, ask for help, check the cabling again.

[Feb 06, 2017] The way to keep senior management informed

rbrickler Jul 17, 2015 11:52 AM

I was working for Network Operations in a company several years back. It was a small company and we had a VP that was not tech savvy. We were having an issue one day, and he came running into the Network Operations Center asking what was going on.

One of our coworkers looked at him and said: relax, it's no big deal, we have everything under control. He asked what the problem was.

Our coworker said, "the flux capacitor stopped working, but we got it restarted." The VP said OK, turned around and left the room to go report to the execs about our Flux Capacitor issue....

[Feb 05, 2017] Cutting yourself from the networked server by putting down and then up eth0 interface

jemertz Mar 30, 2016 10:26 AM
When working in a remote lab, on a Linux server which you're connecting to through eth0:

use: ifdown eth0; ifup eth0

not:

ifdown eth0 
ifup eth0

Doing it on one line means it comes back up right after it goes down. Doing it on two lines means you lose connection before you can type the second line. I figured this out the hard way, and haven't made the same mistake a second time.
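A related safety net (a sketch, not from the original comment; it assumes the at daemon is available): schedule the interface to come back up before touching it, so a lost connection heals itself:

echo "ifup eth0" | at now + 2 minutes    # disarm with atrm if everything went fine
ifdown eth0; ifup eth0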

[Feb 04, 2017] How do I fix mess created by accidentally untarred files in the current dir, aka tar bomb

In such cases the UID of the files is often different from the UID of "legitimate" files in the polluted directories, and you can probably use this fact for quick elimination of the tar bomb. But the idea of using the list of files from the tar bomb to eliminate offending files also works, if you observe some precautions -- some directories that were created can have the same names as existing directories. Never do rm in -exec or via xargs without testing.
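For instance (a sketch of the UID idea; joe and known-old-file are placeholders):

find . -maxdepth 1 -user joe               # top-level entries owned by whoever unpacked the bomb
find . -maxdepth 1 -newer known-old-file   # or top-level entries newer than a file that predates the accident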
Notable quotes:
"... You don't want to just rm -r everything that tar tf tells you, since it might include directories that were not empty before unpacking! ..."
"... Another nice trick by @glennjackman, which preserves the order of files, starting from the deepest ones. Again, remove echo when done. ..."
"... One other thing: you may need to use the tar option --numeric-owner if the user names and/or group names in the tar listing make the names start in an unpredictable column. ..."
"... That kind of (antisocial) archive is called a tar bomb because of what it does. Once one of these "explodes" on you, the solutions in the other answers are way better than what I would have suggested. ..."
"... The easiest (laziest) way to do that is to always unpack a tar archive into an empty directory. ..."
"... The t option also comes in handy if you want to inspect the contents of an archive just to see if it has something you're looking for in it. If it does, you can, optionally, just extract the file(s) you want. ..."
Feb 04, 2017 | superuser.com

linux - Undo tar file extraction mess - Super User

first try to issue

tar tf archive
tar will list the contents line by line.

This can be piped to xargs directly, but beware : do the deletion very carefully. You don't want to just rm -r everything that tar tf tells you, since it might include directories that were not empty before unpacking!

You could do

tar tf archive.tar | xargs -d'\n' rm -v
tar tf archive.tar | sort -r | xargs -d'\n' rmdir -v

to first remove all files that were in the archive, and then the directories that are left empty.

sort -r (glennjackman suggested tac instead of sort -r in the comments to the accepted answer, which also works since tar 's output is regular enough) is needed to delete the deepest directories first; otherwise a case where dir1 contains a single empty directory dir2 will leave dir1 after the rmdir pass, since it was not empty before dir2 was removed.

This will generate a lot of

rm: cannot remove `dir/': Is a directory


and

rmdir: failed to remove `dir/': Directory not empty
rmdir: failed to remove `file': Not a directory

Shut this up with 2>/dev/null if it annoys you, but I'd prefer to keep as much information on the process as possible.

And don't do it until you are sure that you match the right files. And perhaps try rm -i to confirm everything. And have backups, eat your breakfast, brush your teeth, etc.

===

List the contents of the tar file like so:

tar tzf myarchive.tar.gz

Then, delete those file names by iterating over that list:

while IFS= read -r file; do echo "$file"; done < <(tar tzf myarchive.tar.gz)

This will still just list the files that would be deleted. Replace echo with rm if you're really sure these are the ones you want to remove. And maybe make a backup to be sure.

In a second pass, remove the directories that are left over:

while IFS= read -r file; do rmdir "$file"; done < <(tar tzf myarchive.tar.gz)

This prevents directories from being deleted if they already existed before.

Another nice trick by @glennjackman, which preserves the order of files, starting from the deepest ones. Again, remove echo when done.

tar tf myarchive.tar | tac | xargs -d'\n' echo rm

This could then be followed by the normal rmdir cleanup.


Here's a possibility that will take the extracted files and move them to a subdirectory, cleaning up your main folder.
    #!/usr/bin/perl -w

    use strict;
    use Getopt::Long;

    my $clean_folder = "clean";
    my $DRY_RUN;
    die "Usage: $0 [--dry] [--clean=dir-name]\n"
        if ( !GetOptions("dry!"    => \$DRY_RUN,
                         "clean=s" => \$clean_folder) );

    # Protect the 'clean_folder' string from shell substitution
    $clean_folder =~ s/'/'\\''/g;

    # Process the "tar tv" listing and output a shell script.
    print "#!/bin/sh\n" if ( !$DRY_RUN );
    while (<>)
    {
        chomp;

        # Strip out the permissions string and the directory entry from the 'tar' list
        my $perms  = substr($_, 0, 10);
        my $dirent = substr($_, 48);

        # Drop entries that are in subdirectories
        next if ( $dirent =~ m:/.: );

        # If we're in "dry run" mode, just list the permissions and the
        # directory entries.
        #
        if ( $DRY_RUN )
        {
            print "$perms|$dirent\n";
            next;
        }

        # Emit the shell code to clean up the folder
        $dirent =~ s/'/'\\''/g;
        print "mv -i '$dirent' '$clean_folder'/.\n";
    }

Save this to the file fix-tar.pl and then execute it like this:

 $ tar tvf myarchive.tar | perl fix-tar.pl --dry

This will confirm that your tar list is like mine. You should get output like:

  -rw-rw-r--|batch
  -rw-rw-r--|book-report.png
  -rwx------|CaseReports.png
  -rw-rw-r--|caseTree.png
  -rw-rw-r--|tree.png
  drwxrwxr-x|sample/

If that looks good, then run it again like this:

$ mkdir cleanup
$ tar tvf myarchive.tar | perl fix-tar.pl --clean=cleanup > fixup.sh

The fixup.sh script will be the shell commands that will move the top-level files and directories into a "clean" folder (in this instance, the folder called cleanup). Have a peek through this script to confirm that it's all kosher. If it is, you can now clean up your mess with:

 $ sh fixup.sh

I prefer this kind of cleanup because it doesn't destroy anything that isn't already destroyed by being overwritten by that initial tar xv.

Note: if that initial dry run output doesn't look right, you should be able to fiddle with the numbers in the two substr function calls until they look proper. The $perms variable is used only for the dry run so really only the $dirent substring needs to be proper.

One other thing: you may need to use the tar option --numeric-owner if the user names and/or group names in the tar listing make the names start in an unpredictable column.


===

That kind of (antisocial) archive is called a tar bomb because of what it does. Once one of these "explodes" on you, the solutions in the other answers are way better than what I would have suggested.

The best "solution", however, is to prevent the problem in the first place.

The easiest (laziest) way to do that is to always unpack a tar archive into an empty directory. If it includes a top level directory, then you just move that to the desired destination. If not, then just rename your working directory (the one that was empty) and move that to the desired location.

If you just want to get it right the first time, you can run tar -tvf archive-file.tar | less and it will list the contents of the archive so you can see how it is structured and then do what is necessary to extract it to the desired location to start with.

The t option also comes in handy if you want to inspect the contents of an archive just to see if it has something you're looking for in it. If it does, you can, optionally, just extract the file(s) you want.
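In shell terms the lazy approach is just this (a sketch; tar's -C option switches into the directory before extracting):

mkdir unpack-tmp
tar -xvf archive.tar -C unpack-tmp
ls unpack-tmp     # inspect, then move the contents wherever they belong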

[Feb 04, 2017] Restoring deleted /tmp folder

Jan 13, 2015 | cyberciti.biz

As my journey continues with Linux and the Unix shell, I have made a few mistakes. I accidentally deleted the /tmp folder. To restore it, all you have to do is:

mkdir /tmp
chmod 1777 /tmp
chown root:root /tmp
ls -ld /tmp
 

[Feb 04, 2017] Use CDPATH to access frequent directories in bash - Mac OS X Hints

Feb 04, 2017 | hints.macworld.com
The variable CDPATH defines the search path for the cd command: the list of directories in which cd looks for the directory you name, so it serves much like a "home for directories". The danger is in creating too complex a CDPATH; often a single directory works best. For example: export CDPATH=/srv/www/public_html. Now, instead of typing cd /srv/www/public_html/CSS I can simply type: cd CSS
Use CDPATH to access frequent directories in bash
Mar 21, '05 10:01:00AM • Contributed by: jonbauman

I often find myself wanting to cd to the various directories beneath my home directory (i.e. ~/Library, ~/Music, etc.), but being lazy, I find it painful to have to type the ~/ if I'm not in my home directory already. Enter CDPATH, as described in man bash:

The search path for the cd command. This is a colon-separated list of directories in which the shell looks for destination directories specified by the cd command. A sample value is ".:~:/usr".
Personally, I use the following command (either on the command line for use in just that session, or in .bash_profile for permanent use):
CDPATH=".:~:~/Library"

This way, no matter where I am in the directory tree, I can just cd dirname , and it will take me to the directory that is a subdirectory of any of the ones in the list. For example:
$ cd
$ cd Documents 
/Users/baumanj/Documents
$ cd Pictures
/Users/username/Pictures
$ cd Preferences
/Users/username/Library/Preferences
etc...
[ robg adds: No, this isn't some deeply buried treasure of OS X, but I'd never heard of the CDPATH variable, so I'm assuming it will be of interest to some other readers as well.]

cdable_vars is also nice
Authored by: clh on Mar 21, '05 08:16:26PM

Check out the bash command shopt -s cdable_vars

From the man bash page:

cdable_vars

If set, an argument to the cd builtin command that is not a directory is assumed to be the name of a variable whose value is the directory to change to.

With this set, if I give the following bash command:

export d="/Users/chap/Desktop"

I can then simply type

cd d

to change to my Desktop directory.

I put the shopt command and the various export commands in my .bashrc file.

[Aug 04, 2015] My 10 UNIX Command Line Mistakes by Vivek Gite

The thread of comments after the article is very educational. We reproduce only a small fraction.
June 21, 2009

Anyone who has never made a mistake has never tried anything new. -- Albert Einstein.

Here are a few mistakes that I made while working at the UNIX prompt. Some mistakes caused me a good amount of downtime. Most of these mistakes are from my early days as a UNIX admin.

userdel Command

The file /etc/deluser.conf was configured to remove the home directory (it was done by the previous sysadmin and it was my first day at work) and mail spool of the user to be removed. I just wanted to remove the user account and I ended up deleting everything (note: -r was activated via deluser.conf):

userdel foo

... ... ...

Destroyed Working Backups with Tar and Rsync (personal backups)

I had only one backup copy of my QT project and I just wanted to get a directory called functions. I ended up deleting the entire backup (note the -c switch instead of -x):

cd /mnt/bacupusbharddisk
tar -zcvf project.tar.gz functions

I had no backup. Similarly I ended up running an rsync command and deleted all new files by overwriting them with files from the backup set (now I've switched to rsnapshot)

rsync -av -delete /dest /src
Again, I had no backup.
Deleted Apache DocumentRoot

I had symlinks for my web server docroot (/home/httpd/http was symlinked to /www). I forgot about the symlink issue. To save disk space, I ran rm -rf on the http directory. Luckily, I had a full working backup set.

... ... ...

Public Network Interface Shutdown

I wanted to shut down the VPN interface eth0, but ended up shutting down eth1 while I was logged in via SSH:

ifconfig eth1 down
Firewall Lockdown

I made changes to sshd_config and changed the ssh port number from 22 to 1022, but failed to update the firewall rules. After a quick kernel upgrade, I rebooted the box. I had to call the remote data center tech to reset the firewall settings. (Now I use a firewall reset script to avoid lockdowns; one common form of the idea is sketched below.)
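One common form of such a reset script (a sketch; the rules file is a placeholder and the at daemon is assumed): arm a timed flush before loading new rules, and disarm it once you confirm you can still log in:

echo "iptables -F" | at now + 5 minutes    # panic button; remove with atrm once the new rules prove good
iptables-restore < /etc/sysconfig/iptables.new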

Typing UNIX Commands on Wrong Box

I wanted to shut down my local Fedora desktop system, but I issued halt on a remote server (I was logged into the remote box via SSH):

halt
service httpd stop
Wrong CNAME DNS Entry

Created a wrong DNS CNAME entry in example.com zone file. The end result - a few visitors went to /dev/null:

echo 'foo 86400 IN CNAME lb0.example.com' >> example.com && rndc reload
Failed To Update Postfix RBL Configuration

In 2006 ORDB went out of operation. But, I failed to update my Postfix RBL settings. One day ORDB was re-activated and it was returning every IP address queried as being on its blacklist. The end result was a disaster.

Conclusion

All men make mistakes, but only wise men learn from their mistakes -- Winston Churchill.

From all those mistakes I've learnt that:

  1. Backup = ( Full + Removable tapes (or media) + Offline + Offsite + Tested )
  2. The clear choice for preserving all data of UNIX file systems is dump, which is the only tool that guarantees recovery under all conditions. (See the Torture-testing Backup and Archive Programs paper.)
  3. Never use rsync with a single backup directory. Create snapshots using rsync or rsnapshot.
  4. Use CVS to store configuration files.
  5. Wait and read the command line again before hitting the damn [Enter] key.
  6. Use your well-tested perl / shell scripts and open source configuration management software such as Puppet, Cfengine or Chef to configure all servers. This also applies to day-to-day jobs such as creating users and so on.

Mistakes are inevitable, so did you make any mistakes that caused some sort of downtime? Please add them in the comments below.

Jon June 21, 2009, 2:42 am

My all time favorite mistake was a simple extra space:

cd /usr/lib
ls /tmp/foo/bar

I typed

rm -rf /tmp/foo/bar/ *

instead of

rm -rf /tmp/foo/bar/*

The system doesn't run very well without all of its libraries…

Vinicius August 21, 2010, 5:42 pm

I did something similar on a remote server:
I was going to type 'chmod -R 755 ./' but I typed 'chmod -R 755 /' |:

Daniel December 30, 2013, 9:40 pm

I typed 'chmod -R 777', to allow all files to have rwx permissions for all users (RPi).

Doesn't work that well without sudo!

robert wlaschin May 1, 2012, 9:57 pm

Hm… I was trying to format a USB flash drive:

dd if=big_null_file of=/dev/sdb

Unfortunately /dev/sdb was my local secondary drive; sdc was the USB one… shucks.

I discovered this after I rebooted.
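A habit that catches this class of mistake (a sketch; /dev/sdX is deliberately a placeholder to fill in only after checking): list the block devices and verify sizes and models immediately before running dd:

lsblk -o NAME,SIZE,MODEL                     # confirm which device really is the USB stick
dd if=/dev/zero of=/dev/sdX bs=1M status=progress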

Jeff April 21, 2011, 10:46 pm

I did something similar on my first day as a junior admin. As root, I copied my buddy's dot files (.profile, etc.) from his home directory to mine because he had some cool customizations. He also had some scripts in a directory called .scripts/ that he wanted me to copy. I gave myself ownership of the dot files and the contents of the .scripts directory with this command:

cd ~jeff; chown -R jeff .*

It was only later that I realized that ".*" matched "." and "..", so my userid owned the entire machine… which happened to be our production Oracle database.

That was 15 years ago and we've both changed jobs a few times, but that friend reminds me of that mistake every time I see him.
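The usual workaround (a sketch; bash is assumed) is a glob that cannot match . or ..:

chown -R jeff .[!.]* ..?*    # dot entries, but never . or ..
# or, in bash, let * match dot files for a single command:
shopt -s dotglob; chown -R jeff *; shopt -u dotglob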

Garry April 11, 2014, 8:02 pm

I once had a bunch of dot files I wanted to remove. So I did:

rm -r .*

This, of course, includes ".." – recursively.

I had taken over SysAdmin of a server. The server had a cron job that ran, as root, that cd'ed into a directory and did a find, removing any files older than 3 days. It was to clean up the log files of some program they had. They quit using the program. About a year later, someone removed the directory. The cron job ran. The cd into the log file directory didn't work, but the cron job kept going. It was still in / – removing any files older than 3 days old! I restored the filesystems and went home to get some sleep, thinking I would investigate root cause after I had some rest. As soon as my head hit the pillow, the phone rang. "It did it again". The cron job had run again.

Lastly, I once had an accidental copy & paste, which renamed (mv) /usr/lib. Did you know the "mv" command uses libraries in /usr/lib? I found that out the hard way when I discovered I could not move it back to its original pathname. Nor could I copy it (cp uses /usr/lib).

An "Ohnosecond" is defined as the period of time between when you hit enter and you realize what you just did.

Michael Shigorin April 12, 2014, 8:14 am

That's why set -e or #!/bin/sh -e (in this particular case I'd just tell find that_dir … though). --[The -e flag's long name is errexit, causing the script to immediately exit on the first error. -- NNB]

My .. incident has taught me to hit tab just in case, to see what actually gets removed; BTW zsh is very helpful in that regard, as it has some safety-net measures for the usual * and ~ cases. But then again, touching nothing with destructive tools when tired, especially as root, is a bitter but prudent decision.

Regarding /usr/lib: ALT Linux coreutils are built properly ;-) (although there are some leftovers as we've found when looking with some Gentoo guys at LVEE conference)
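Garry's cron trap above reduces to one missing guard (a sketch; the path and age are placeholders): never let find run if the cd failed:

cd /var/log/someapp && find . -type f -mtime +3 -exec rm -- {} +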

georgesdev June 21, 2009, 9:15 am

never type anything such as:

rm -rf /usr/tmp/whatever

maybe you are going to hit enter by mistake before the end of the line. You would then, for example, erase your whole disk starting at /.

if you want to use -rf option, add it at the end on the line:

rm /usr/tmp/whatever -rf

and even this way, read your line twice before adding -rf

Daniel Hoherd May 4, 2012, 4:58 pm

Another good test is to first do "echo rm -rf /dir/whatever/*" to see the expansion of the glob and what will be deleted. I especially do this when writing loops, then just pipe to bash when I know I've got it right.

Denis November 23, 2010, 9:27 am

I think it is a good practice to use the parameter i with -rf:

rm -rfi /usr/tmp/whatever

-i will ask you whether you are sure you want to delete all that stuff.

John February 25, 2011, 11:11 am

I worked with a guy who always used "rm -rf" to delete anything. And he always logged in as root. Another worker set the stage for him by creating a file called "~" in a visible location (that would be a file entered as "\~", so as not to expand to the user's home directory). User one then dealt with that file with "rm -rf ~". This was when the root home directory was / and not something like /root. You got it.

Cody March 22, 2011, 1:33 pm

(Note to mod: put this in wrong place initially; sorry about that. here is the correct place).

This reminds me of when I told a friend about a way to auto-logout on login (many ways, but this one is more obscure). He then told someone who was "annoying" him to try it in his shell. The end result was that this person was furious. Quite so. And although I don't find it so funny now (keyword: not as; I still think it's amusing), I found it hilarious then (hey, I was young and as obnoxious as can be!).

The command, for what its worth :

echo "PS1=`kill -9 0`" >> ~/.bash_profile

Yes, that's setting the prompt to run the command kill -9 0 upon sourcing of ~/.bash_profile, which means: kill that shell. Bad idea!

I don't even remember what inspired me to think of that command, as this was years and years ago. However, it does bring up an important point:

Word to the wise: if you do not know what a command does, don't run it! Amazing how many fail that one…

Peter Odding January 7, 2012, 6:40 pm

I once read a nasty trick that's also fun in a very sadistic kind of way:

echo 'echo sleep 1 >> ~/.profile' >> /home/unhappy-user/.profile

The idea is that every time the user logs in it will take a second longer than the previous time… This stacks up quickly and gets reallllly annoying :-)

Daniel April 23, 2015, 10:53 am

What about echo "PS1=$PS1 ; `sleep 1`" >> ~/.bash_profile
I'm not sure if it works, but it's pretty cool.


3ToKoJ June 21, 2009, 9:26 am

public network interface shutdown … done

typing unix command on wrong box … done

Delete apache DocumentRoot … done

Firewall lockdown … done, with a NAT rule redirecting the configuration interface of the firewall to another box; a serial connection saved me

I can add: being trapped by aptitude keeping track of previously planned - but not executed - actions, like "remove slapd from the master directory server"

UnixEagle June 21, 2009, 11:03 am

Rebooted the wrong box
While adding an alias to the main network interface I ended up changing the main IP address; the system froze right away and I had to call for a reboot

Instead of appending text to an Apache config file, I overwrote its contents

Firewall lockdown while changing the ssh port

Wrongly ran a script containing a recursive chmod and chown as root on /; it caused me a downtime of about 12 hours and a complete re-install

Some mistakes are really silly, and when they happen you can't believe you did that; but every mistake, regardless of its silliness, should be a lesson learned.
If you make a trivial mistake, you should not just overlook it; you have to think about the reasons that made you make it, like: you didn't have much sleep, or your mind was preoccupied with personal life, etc.

I like Einstein's quote, you really have to do mistakes to learn.

smaramba June 21, 2009, 11:31 am

Typing unix commands on the wrong box and firewall lockdown are all-time classics: been there, done that. But for me the absolute worst, on Linux, was checking a mounted filesystem on a production server…

fsck /dev/sda2

The root filesystem was rendered unreadable. System down. Dead. Users really pissed off. Fortunately there was a full backup and the machine rebooted within an hour.
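Running fsck on a filesystem that is mounted read-write is exactly how you get that result. A minimal habit that prevents it, sketched with the device name from the story above:

# check whether the device is still mounted before touching it
mount | grep /dev/sda2                 # any output means it is mounted -- stop
umount /dev/sda2 && fsck /dev/sda2     # fsck only after a successful umount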

Don May 10, 2011, 4:14 pm

I know this thread is a couple of years old but …

Using lpr from the command line, forgetting that I was logged in to a remote machine in another state. My print job contained sensitive information which was now on a printer several hundred miles away! Fortunately, a friend intercepted the message and emailed me while I was trying to figure out what was wrong with my printer :-)

od June 21, 2009, 12:50 pm

"Typing UNIX Commands on Wrong Box"

Yea, I did that one too. Wanted to shut down my own vm but I issued init 0 on a remote server which I accessed via ssh. And oh yes, it was the production server.

Adi June 21, 2009, 10:24 pm

tar -czvf /path/to/file file_archive.tgz

instead of

tar -czvf file_archive.tgz /path/to/file

I ended up destroying that file and had no backup, as this command was intended to produce the first backup – it was on the DHCP Linux production server, and the file was dhcpd.conf!
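The rule tar follows here is simple and merciless: with -c, the argument after -f is the archive tar will create, and everything after that is what gets read. A quick sanity check, sketched with illustrative paths:

# archive name first, sources after; then list the archive before trusting it
tar -czvf /tmp/dhcpd-backup.tgz /etc/dhcpd.conf
tar -tzvf /tmp/dhcpd-backup.tgz      # -t lists contents without extracting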

The Unix Hater's Handbook

wayback.archive.org

"rm" Is Forever

The principles above combine into real-life horror stories. A series of exchanges on the Usenet news group alt.folklore.computers illustrates our case:

Date: Wed, 10 Jan 90
From: djones@megatest.uucp (Dave Jones)
Subject: rm *
Newsgroups: alt.folklore.computers

Anybody else ever intend to type:

% rm *.o

And type this by accident:

% rm *>o

Now you've got one new empty file called "o", but plenty of room for it!

Actually, you might not even get a file named "o" since the shell documentation doesn't specify if the output file "o" gets created before or after the wildcard expansion takes place. The shell may be a programming language, but it isn't a very precise one.

Date: Wed, 10 Jan 90 15:51 CST
From: ram@attcan.uucp
Subject: Re: rm *
Newsgroups: alt.folklore.computers

I too have had a similar disaster using rm. Once I was removing a file system from my disk which was something like /usr/foo/bin. I was in /usr/foo and had removed several parts of the system by:

% rm -r ./etc
% rm -r ./adm

…and so on. But when it came time to do ./bin, I missed the period. System didn't like that too much.

Unix wasn't designed to live after the mortal blow of losing its /bin directory. An intelligent operating system would have given the user a chance to recover (or at least confirm whether he really wanted to render the operating system inoperable).

Unix aficionados accept occasional file deletion as normal. For example, consider the following excerpt from the comp.unix.questions FAQ:

6) How do I "undelete" a file?

Someday, you are going to accidentally type something like:

% rm * .foo

and find you just deleted "*" instead of "*.foo". Consider it a rite of passage.

Of course, any decent systems administrator should be doing regular backups. Check with your sysadmin to see if a recent backup copy of your file is available.

"A rite of passage"? In no other industry could a manufacturer take such a

cavalier attitude toward a faulty product. "But your honor, the exploding

gas tank was just a rite of passage." "Ladies and gentlemen of the jury, we

will prove that the damage caused by the failure of the safety catch on our

3comp.unix.questions is an international bulletin-board where users new to the

Unix Gulag ask questions of others who have been there so long that they don't

know of any other world. The FAQ is a list of Frequently Asked Questions garnered

Changing rm's Behavior Is Not an Option

After being bitten by rm a few times, the impulse rises to alias the rm command so that it does an "rm -i" or, better yet, to replace the rm command with a program that moves the files to be deleted to a special hidden directory, such as ~/.deleted. These tricks lull innocent users into a false sense of security.

Date: Mon, 16 Apr 90 18:46:33 1990
From: Phil Agre <agre@gargoyle.uchicago.edu>
To: UNIX-HATERS
Subject: deletion

On our system, "rm" doesn't delete the file, rather it renames in some obscure way the file so that something called "undelete" (not "unrm") can get it back. This has made me somewhat incautious about deleting files, since of course I can always undelete them. Well, no I can't. The Delete File command in Emacs doesn't work this way, nor does the D command in Dired. This, of course, is because the undeletion protocol is not part of the operating system's model of files but simply part of a kludge someone put in a shell command that happens to be called "rm."

As a result, I have to keep two separate concepts in my head, "deleting" a file and "rm'ing" it, and remind myself of which of the two of them I am actually performing when my head says to my hands "delete it."

Some Unix experts follow Phil's argument to its logical absurdity and maintain that it is better not to make commands like rm even a slight bit friendly. They argue, though not quite in the terms we use, that trying to make Unix friendlier, to give it basic amenities, will actually make it worse. Unfortunately, they are right.

[Sep 04, 2014] Blunders with expansion of tar files whose structure you do not understand

If you expand a tar file in some production directory, you can accidentally overwrite files and change the ownership of directories, and then spend a lot of time restoring the status quo. It is safer to expand such tar files in /tmp first; only after seeing the results should you decide whether to copy some directories over or re-expand the tar file, this time in the production directory.
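A minimal staging routine along these lines (the archive name is illustrative):

# look before you leap: list paths and ownership, then expand in scratch space
tar -tvf unknown.tar | head -40
mkdir /tmp/staging
tar -xvf unknown.tar -C /tmp/staging   # -C keeps the expansion out of production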

[Sep 03, 2014] Doing operation in a wrong directory among several similar directories

Sometimes directories are very similar, for example numbered directories created by some application, such as task0001, task0002, ... task0256. In this case you can easily perform an operation on the wrong directory. For example, send tech support a tar file of a directory that, instead of test data, contains a production run.

[Oct 17, 2013] Crontab file - The UNIX and Linux Forums

The loss of a crontab is serious trouble. This is one of the typical sysadmin blunders (Crontab file - The UNIX and Linux Forums)

mradsus

Hi All,
I created a crontab entry in a cron.txt file and accidentally entered

crontab cron.txt.

Now my previous crontab -l entries are not showing up; that means I removed the scheduling of the previous jobs by running the command "crontab cron.txt".

How do I revert back to the previously scheduled jobs?
Please help, this is urgent.
Thanks.

In this case, if you do not have a backup, your only remedy is to try to extract the cron commands from /var/log/messages.
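A cheap insurance policy, plus one recovery avenue, sketched under the assumption of a Linux box where cron logs through syslog (the log location varies by distribution):

# snapshot the crontab before editing or replacing it
crontab -l > /root/crontab.$(date +%Y%m%d)
# if it is already gone, cron's log may let you reconstruct the entries
# (try /var/log/cron, /var/log/messages, or /var/log/syslog)
grep CRON /var/log/messages | grep CMD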

[Jul 17, 2012] My 10 UNIX Command Line Mistakes

Destroyed Working Backups with Tar and Rsync (personal backups)

I had only one backup copy of my QT project and I just wanted to get a directory called functions. I ended up deleting the entire backup (note the -c switch instead of -x):
cd /mnt/bacupusbharddisk
tar -zcvf project.tar.gz functions

I had no backup. Similarly, I ended up running an rsync command and deleted all new files by overwriting them from the backup set (now I've switched to rsnapshot):
rsync -av --delete /dest /src
Again, I had no backup.

Deleted Apache DocumentRoot

I had symlinks for my web server docroot (/home/httpd/http was symlinked to /www). I forgot about the symlink issue. To save disk space, I ran rm -rf on the http directory. Luckily, I had a full working backup set.

Public Network Interface Shutdown

I wanted to shut down the VPN interface eth0, but ended up shutting down eth1 while logged in via SSH:
ifconfig eth1 down

Firewall Lockdown

I made changes to sshd_config and changed the ssh port number from 22 to 1022, but failed to update the firewall rules. After a quick kernel upgrade, I rebooted the box. I had to call the remote data center tech to reset the firewall settings. (Now I use a firewall reset script to avoid lockdowns.)

Typing UNIX Commands on Wrong Box

I wanted to shut down my local Fedora desktop system, but I issued halt on a remote server (I was logged into the remote box via SSH):
halt
service httpd stop

Conclusion

All men make mistakes, but only wise men learn from their mistakes -- Winston Churchill.

From all those mistakes I've learnt that:

  1. Backup = ( Full + Removable tapes (or media) + Offline + Offsite + Tested )
  2. The clear choice for preserving all data of UNIX file systems is dump, which is the only tool that guarantees recovery under all conditions. (See the Torture-testing Backup and Archive Programs paper.)
  3. Never use rsync with a single backup directory. Create snapshots using rsync or rsnapshot (a minimal sketch follows this list).
  4. Use CVS to store configuration files.
  5. Wait and read the command line again before hitting the damn [Enter] key.
  6. Use your well-tested perl / shell scripts and open source configuration management software such as Puppet, Cfengine or Chef to configure all servers. This also applies to day-to-day jobs such as creating users and so on.
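For item 3, a minimal snapshot-style sketch in the spirit of rsnapshot (/data and /backup are illustrative paths; --link-dest and ln -n assume GNU tools):

# each run gets its own dated directory; unchanged files are hard-linked to
# the previous snapshot, so no single run can destroy yesterday's copy
today=$(date +%Y-%m-%d)
rsync -av --delete --link-dest=/backup/latest /data/ /backup/$today/
ln -snf /backup/$today /backup/latest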

Mistakes are inevitable, so did you make any mistakes that caused some sort of downtime? Please add them in the comments below.

[May 17, 2012] Pixar's The Movie Vanishes, How Toy Story 2 Was Nearly Lost

In the 2010 animated short titled Studio Stories: The Movie Vanishes, we learn from Pixar's Oren Jacob and Galyn Susman how a big chunk of the Toy Story 2 movie files was nearly lost due to the accidental use of a Linux rm command (and a poor backup system). This short was included in the Toy Story 2 DVD extras.

Pixar studio stories - The movie vanishes (full) - YouTube

[Mar 16, 2012] Using the right command in the wrong place

From email to Editor of Softpanorama...

This happened with OpenView. It has a command for agent reinstallation, opc-inst -r. The problem is that it needs to be run on the node, not on the server, and it does not accept any arguments.

In this case it was run on the server, with predictable results. This was a production server of a large corporation, so you can imagine the level of stress involved in putting out this fire...

[Oct 14, 2011] Nasty surprise with the command cd joeuser; chown -R joeuser:joeuser .*

This is a classic case of the side effect of the dot glob .* combined with the -R flag, which causes complete tree traversal in Unix. The key issue here is not to panic. Recovery is possible even if you do not have a map of all file ownership and permissions (and you had better create one on a regular basis). The first step is to use
for p in $(rpm -qa); do rpm --setugids $p; done
The second is to copy the remaining ownership info from some similar system. Especially important is restoring ownership in the /dev directory.
A similar approach can be used for restoring permissions:
for p in $(rpm -qa); do rpm --setperms $p; done
Please note that the rpm --setperms command actually resets the setuid, setgid, and sticky bits. These must be set manually using some existing system as a baseline.
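Safer ways to aim at only the dot files, without dragging in ".." (a sketch; the find form assumes GNU find for -maxdepth):

# globs that match dot files but can never expand to . or ..
chown -R joeuser:joeuser .[!.]* ..?*
# or spell it out with find, which never ascends to the parent directory
find . -maxdepth 1 -name '.*' ! -name '.' -exec chown -R joeuser:joeuser {} +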

[Jul 22, 2011] Mailbag by Marcello Romani

Feb 02, 2011 | LG #186

Hi, I had a horror story similar to Ben's, about two years ago. I backed up a PC and reinstalled the OS with the backup usb disk still attached. The OS I was reinstalling was a version of Windows (2000 or XP, I don't remember right now). When the partition creation screen appeared, the list items looked a bit different from what I was expecting, but as soon as I realized why, my fingers had already pressed the keys, deleting the existing partitions and creating a new ntfs one. Luckily, I stopped just before the "quick format" command... Searching the 'net for data recovery software, I came across TestDisk, which is targeted at partition table recovery. I was lucky enough to have wiped out only that portion of the usb disk, so in less than an hour I was able to regain access to all of my data. Since then I always "safely remove" usb disks from the machine before doing anything potentially dangerous, and check "fdisk -l" at least three times before deciding that the arguments to "dd" are written correctly...

Marcello Romani TAG mailing list TAG@lists.linuxgazette.net http://lists.linuxgazette.net/listinfo.cgi/tag-linuxgazette.net

[Jul 03, 2011] Be careful when naming servers

Some applications, like Oracle products, are sensitive to the DNS names you use, especially the hostname. They store them in multiple places, and there is no easy way to change the name in all those places after the Oracle product is installed. They also accept only the long hostname (i.e. box.location.firm.com) instead of the short one.

If you mess up your hostname after a DBA has installed an Oracle product, you usually need to reinstall the box.

Such errors can happen if you copy files from one server to another to speed up the installation and forget to modify the /etc/hosts file, or modify it incorrectly.

[Jun 03, 2011] Sysadmin Tales of Terror by Carla Schroder

February 19, 2003 | Enterprise Networking Planet

Cover One's Behind With Glory

Now let's be honest, documentation is boring and no fun. I don't care; just do it. Keep a project diary. Record everything you find. You don't want to shoulder the blame for someone else's mistakes or malfeasance. It is unlikely you'll get into legal trouble, but the possibility always exists. Record progress and milestones as well. Those in management tend to have short memories and limited attention spans when it comes to technical matters, so put everything in writing and make a point of reviewing your progress periodically. No need to put on long, windy presentations -- take ten minutes once a week to hit the high points. Emphasize the good news; after all, as the ace sysadmin, it is your job to make things work. Any dork can make a mess; it takes a real star to deliver the goods.

Be sure to couch your progress in terms meaningful to the person(s) you're talking to. A non-technical manager doesn't want to hear how many scripts you rewrote or how many routers you re-programmed. She wants to hear "Group A's email works flawlessly now, and I fixed their database server so it doesn't crash anymore. No more downtime for Group A." That kind of talk is music to a manager's ears.

Managing Users

In every business there are certain key people who wield great influence. They can make or break you. Don't focus exclusively on management -- the people who really run the show are the secretaries and administrative assistants. They know more than anyone about how things work, what's really important, and who is really important. Consult them. Listen to them. Suck up to them. Trust me, this will pay off handsomely. Also worth cultivating are relationships with the cleaning and maintenance people -- they see things no one else even knows about.

When you're new on the job and still figuring things out, the last thing you need is to field endless phone calls from users with problems. Make them put it in writing -- email, yellow pad, elaborate trouble-ticket system, whatever suits you. This gives you useful information and time to do some triage.

Managing Remote Users

If you have remote offices under your care, the phone can save a lot of travel. There's almost always one computer-savvy person in every office; make this person your ally and helper. At the very least, this person will be able to give you coherent, understandable explanations. At best, they will be your remote hands and eyes, and will save you much trouble.

Such a person may be a candidate for training and possibly transferring to IT. Some people are afraid of helping someone like this for fear of losing out to them in some way. The truth, though, is that you never lose by helping people, so don't let that idea scare you off from giving a boost to a worthy person.

Getting Help

We all know how to use Google, Usenet, and other online resources to get assistance. By all means, don't be too proud -- ask! And by all means, don't be stupid either -- use a fake name and don't mention the company you work for. There's absolutely no upside to making such information public; there are, however, many downsides to doing so, like inviting security breaches, giving away too much information, making your company look bad, and besmirching your own reputation.

As I said at the beginning, these are strategies that have served me well. Feel free to send me your own ideas; I especially love to hear about true-life horror stories that have happy endings.

Resources

Life in the Trenches: A Sysadmin Speaks
10 Tips for Getting Along with People at Work
Linux Administration Books

[Jun 20, 2010] IT Resource Center forums - greatest blunders

Bill McNAMARA

I've done this with people looking over my shoulder (while in single user):

echo "/dev/vg00/lvol6 /tmp vxfs delaylog 0 2" > /etc/fstab
reboot!!

Other good ones:
mv /dev/ /Dev
(try it - and don't ask why!!)

Later,
Bill

Christian Gebhardt

Hi
As a newbie in UNIX I had an Oracle test installation on a production system
production directory: /u01/...
test directory: /test/u01/...

deleting the test installation:
cd /test
rm /u01

OOPS ...

After several bdf commands I noticed that the wrong lvol was shrinking, and I stopped the delete command with Ctrl-C

The database still worked without most of its binaries and libraries, and after a restore from tape - without stopping and starting the database - all was ok.

I love oracle ;-)

Chris

harry d brown jr

Learning hpux? Naw, that's not it....maybe it was learning to spell aix?? sco?? osf?? Nope, none of those.

The biggest blunder:

One morning I came in at my usual time of 6am, and had an operator ask me what was wrong with one of our production servers (servicing 6 banks). Well nothing worked at the console (it was already logged in as root). Even a "cat *" produced nothing but another shell prompt. I stopped and restarted the machine and when it attempted to come back up it didn't have any OS to run. Major issue, but we got our backup tapes from that night and restored the machine back to normal. I was clueless (sort of like today)

The next morning, the same operator caught me again, and this time I was getting angry (imagine that). Same crap, different day. Nothing was on any disk. This of course was before we had raid available (not that it would have helped). So we restored the system from that night's backups, and by 8am the banks had their systems up.

So now I have to fix this issue, but where the hell to start? I knew that production batch processing was done by 9PM, and that the backups started right after that. The backups completed around 1am, and they were good backups, because we never lost a single transaction. But around 6am the stuff hit the fan. So I had a time frame: 1am-6am, something was clobbering the system. I went through the crons, but nothing really stood out, so I had to really dive into them. This is the code (well, almost) I found in the script:

cd /tmp/uniplex/trash/garbage
rm -rf *

As soon as I saw those two lines, I realized that I was the one that had caused the system to crap out every morning. See, I needed some disk space, and I was doing some house cleaning, and I deleted the sub-directory "garbage" from the "/tmp/uniplex/trash" directory. Of course the script is run by root; it attempted to "cd" to a non-existent directory, which failed, so cron was still cd'd to "/", and it then proceeded to "rm -rf *" my system!

live free or die
harry
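The fix for this whole class of accidents is cheap: never let the rm run if the cd failed. A minimal sketch of the same cleanup job done defensively:

#!/bin/sh
set -u                                   # unset variables become errors
cd /tmp/uniplex/trash/garbage || exit 1  # bail out instead of running rm from /
rm -rf -- *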

Bill Hassell

I guess my blunder sets the record for "most clobbered machines" in one day:

I created an inventory script to be used in the Response Center to track all the systems across the United States (about 320 systems). These are all test and problem replication machines but necessary for the R/C engineers to replicate customer problems.

The script was written about 1992 to handle version 7.0 and higher. About 1995, I had a number of useful scripts that it seemed reasonable to drop into all 300 machines as a part of the inventory process (so far, so good). Then about that time, 10.01 was released and I made a few changes to the script. One was to change the useful script location from /usr/local/bin to /usr/contrib/bin because of bad directory permissions. I considered 'fixing' the bad permissions, but since these systems must represent the customer environment, I decided to move everything.

Enter the shell option -u. I did not use that option in my scripts, and due to a spelling error, an environment variable used in an rm -r was null, thus removing the entire /usr/local directory on 320 machines overnight.

Needless to say, I never write a script without set -u at the top.
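Bill's failure mode in miniature, with a hypothetical variable name; note how each guard turns the silent catastrophe into a loud error:

DEST=/usr/local/bin
rm -rf "$DST"/*        # typo: $DST is unset, so this expands to rm -rf /*
set -u
rm -rf "$DST"/*        # now the shell aborts: DST: unbound variable
rm -rf "${DST:?}"/*    # ${VAR:?} aborts even without set -u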

John Poff

We were doing a disaster recovery drill. I was busy Igniting a V-class server for our database server. I had finally gotten the OS on it after about three hours and I was running a slick little script I had written to recreate all the volume groups and filesystems. My script takes a list of available PVs and does a 'pvcreate -f' on them. Well, we started our drill at midnight [not our idea but we had little choice], so around about 3:30am I was trying to run this script. It was chugging along just fine, pvcreating disks, and then the system hung. Not completely, but pretty much dead. After trying to reboot it, I eventually figured out that when I went through the interactive Ignite, I hadn't paid close attention to which disk Ignite had selected to load the OS on, and it had chosen one of the disks in the EMC array instead of one of the local Jamaica disks. My slick script came along and had pvcreated the disk that had the OS on it. Oops. There went a few more hours of work.

The good news is that after that mess they decided that we would never start a DR drill at midnight!

JP

Dave Johnson

Here is my worst.
We use BCs on our XP512. We stop the application, resync the BC, split the BC, start the application, mount the BC on the same server, and start the backup to tape from the BC. Well, I had to add a LUN to the primary and the BC. I recreated the BC. I forgot to change the script that mounts the BC to include the new LUN. The error message from vgimport when you do not include all the LUNs is just a warning, and it makes the volume group available. The backups seemed to be working just fine.
Well, 2 months go by. I did not have enough available disk space to test my backups. (That has since been changed.) Then I decided to be proactive about deleting old files. So I wrote a script:
cd /the/directory/I/want/to/thin/out
find . -mtime +30 -exec rm {} \;

Well, that was scheduled on cron to run just before backups one night. The next morning I get the call that the system is not responding. (I guessed later that the cd command had failed and the find ran from /.)
After a reboot I find lots of files missing from /etc, /var, /usr, /stand and so on. No problem: just rebuild from the make_recovery tape created 2 nights before, then restore the rest from backup.
Well, step 1 was fine, but the backup tape was bad. The database was incomplete. It took 3 days (that is, 24 hours per day) to find the most recent tape with a valid database. Then we had to reload all the data. After the 3rd day I was able to turn over recovery to the developers. It took about a week to get the application back on-line.
I have sent a request to HP to have the vgimport command changed so that a vgimport that does not specify all the LUNs will fail unless some new command line param is used. They had not yet provided this "enhancement" as of the last time I checked, a couple of months ago. I now test for this condition and send mail to root, as well as failing the BC mount, if it happens.

Life lesson: TEST YOUR BACKUPS!!

Dave Unverhau

This is probably not too uncommon... I needed to shut down a server for service (one of several lined up along the floor... no, not racked). I grabbed the keyboard sitting on that box, quickly typed the shutdown string (with a -hy 0, of course), and got ready to service the box.

...ALWAYS make sure the keyboard is sitting on the box to which it is connected!

Deepak Extross

We had this developer who claimed that when he ran his program, it complained about /usr/bin/ld. (This was because of a missing shared library, he later discovered.) It was decided to back up /usr/bin/ld and replace it with 'ld' from another machine on which his program worked.
No sooner was ld moved than all hell broke loose.
Users got coredumps in response to even simple commands like "ls", "pwd", "cd"... New users could not telnet into the system, and those who were logged in were frozen in their tracks.

Both the developer and admin are still working with us...

RAC

Well, I was very, very new to HP-UX. I wanted to set up a PPP connection with a password borrowed from a friend so that I could browse the net.

I did not realize that the remote support modem cannot dial out from the remote support port.
I went through all the documents available and created device files a dozen times, but it never worked. In anguish I did rm -fr `ltr|tail -4|awk '{print $9}'`
(That was to pacify myself that I know complex commands.)

But alas, I was in /sbin/rc3.d.

I decided this was not going to work and left it at that.
Another colleague, not aware of this, rebooted the system for a Veritas NetBackup problem.

Within the next two hours an HP engineer was on-site, called by my colleague.

I watched the whole recovery process, repeatedly saying "I want to learn, I want to learn".

Then I came to know that it could not be done.

Dave Johnson

Hey Bill,

When I reinstalled the OS from the make_recovery tape, it wiped out the script I wrote and the item on the cron. There is no evidence of what happened or who was responsible. I did, however, go straight to my boss to confess and take the blame. That, above all - next to being able to recover at least some of the data - is probably the strongest reason why I was not terminated for it.

Did I mention in the first post that this happened in Feb of 2002????

Simon Hargrave

1. On a live Sun Enterprise server, you turn the key one way for maintenance, and one way for off. I wanted to turn it to maintenance but wasn't sure which way to turn it. Guess which way I chose...

2. On an XP512 I accidentally business-copied a new LUN over the top of a live LUN, because I put the wrong LUN ID in!!! Luckily the live data's backup had finished a full 3 minutes earlier... phew!

3. I can't take credit for this one - my ex-boss did it - but I had to include it. On Solaris he added a filesystem in the vfstab file, but put the wrong device in the raw-device field. Consequently all backups backed up the wrong device, so when the data got trashed and required restoring, it... um... didn't exist on tape! Luckily for him he'd left the company 2 months before, and I was left to explain what a halfwit he was ;)

Dave Chamberlin

I have stepped in tar on a couple of occasions. I moved a tar file from a production box to a development box, but I had tarred it with an absolute path. When I untarred it, it overwrote the existing directory - destroying all the developers' updates! I have also been burned by the fact that xvf and cvf are very close on the keyboard - so my command to extract a tar once came out as tar -cvf, which of course erased the tar file.
My only other bad blunder was doing an lvreduce on a mounted file system - I thought I was recovering space without affecting the other files on the volume. Luckily, they were backed up...

Martin Johnson

One of my coworkers decided to set up a pseudo-root (UID=0) account for himself. He used useradd to create the account and made / his home directory. He was unaware that useradd does a "chown -R" on the home directory. So he became the owner of all files on the system. This was a pop3 mail server system, and the mail services did not like the change.

My coworker left for the day, leaving me with angry VPs looking over my shoulder demanding to know when email services will be back.

Marty <the coworker is now known as "chown boy">

fg

Greatest gaffe: taking the word of someone who I thought knew what they were doing and had taken the proper precautions to ensure a recovery method for a rebuild of filesystems.

To make a long story short: no backup, no make_recovery, and then rebuilt filesystems. Data was lost and had to be rebuilt. We recovered most of the data, except for the previous 24 hours.

MORAL of the story: Always have backups and make_recovery tapes done.

Richard Darling

When I upgraded from 10.20 to 11.0 I finished the system install and then used cpio to copy my user applications. One of the vendors had originally installed their app in /usr (before my time), and I copied the app up one directory and wiped out /usr. By the way, I didn't back up the installation before the cpio copy. It was a Friday night and I wanted to get out... figured I could back up after getting the apps copied over... learnt an important lesson regarding backups that night...
RD

Belinda Dermody

I was writing a script to chmod -R to r/w for the world on a dir, and did not check whether I was in the proper directory; all of a sudden my bin directory files were all 666. Luckily I had multiple windows open and it hadn't gotten to the sbin directory yet. I had a few inquiries about why certain commands wouldn't work before I got it all back correctly. From then on, I check the return status with $? before I issue any remove or chmod commands.

Ian Kidd

I was going to vi a script that performs a cold-backup of an oracle database. Since we prefer not to be root all the time, we use sudo.

So I typed, "sudo", but then was interrupted by someone. I then typed the name of the script when that person left. Nothing appeared on the screen immediately, so I got a coffee.

When I came back, I saw "sudo {script}" and realized what I had done - within a minute the DBAs started screaming that their database was down - I had started a cold backup in the middle of a production day.

Duncan Edmonstone

My worst two:

Installing a server in a major call centre of a US bank...

I built the OS as required by our apps team in the US, and following our build standards put the system into trusted mode.

They installed the app, and realised they'd forgotten to ask me to put the system into NIS (the system could be used by any of the call centre reps in over 40 call centers - a total of 15,000 NIS-based accounts!). It's the middle of the night in the UK, so the apps team get a US admin to set up the system as a NIS client. (Yes, it shouldn't work when the box is trusted, but it does!)

Next day, the apps team is complaining about some stuff not working - can I take the system out of trusted mode so we can rule that out? Sure, of course I can - I run tsconvert and wait.... and wait.... and wait.... hmmm - this usually takes about 30 seconds - what gives?

I try to open another window to check what's happening - I can't log in as root; the password that worked two minutes ago no longer works!

Next, "root file system full" messages start to scroll up the screen!

It turns out that tsconvert is busy taking ALL the NIS accounts and putting them in the /etc/passwd file (yes all 15,000 of them) and guess what? There's a root account in NIS!

All I can say is thank god for good backups!

The other one was a typical junior admin mistake which comes from not understanding shell file name generation fully:

A user can't log in. I go take a look at his home directory and note that the permissions on his .profile are incorrect. I also note that the other '.' files are incorrect, so I do this:

cd /home/user
chmod 400 .*

I call the user and tell him to try again - he says he still can't log in! Huh?

So I go back and carry on looking for the problem, but before I know it the phone is ringing off the hook! No-one can log in now!

And then it dawns on me

I type the following:

cd /home/user
echo .*

and that returns (of course)

. .. .cshrc .exrc .login .profile .sh_history

Oops. I didn't just change the permissions on the user's '.' files - I also changed the permissions on the user's directory, and (crucially!) the user's parent directory /home!

These days I always use echo to check my file name pattern matching logic when doing this kind of thing...

We live and learn


Duncan

Vincent Fleming

I have been way too fortunate not to have really blundered all that bad (I've mostly done development), but one I've seen was a real good one...

The "security auditor", who apparently knew absolutely nothing about UNIX, was reviewing our development system, and decided that /tmp having world read/write permissions was not a good thing for security - so, in the middle of the day, he chmod 744 /tmp ... suddenly, 200+ developers (including myself) on the machine (it was a *very* large machine back in 1990) were unable to save their editor sessions!

So, of course, I used the "wall" command to point out their error, so they could fix it quickly and I could save my 2+ hours of edits:

$ wall
who's the moron who changed the permissions on /tmp????
.
$

The funny thing was that I was the one they escorted out of the building that day...

The hazards of being a contractor and publicly humiliating an employee...
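For the record, /tmp is supposed to be world-writable with the sticky bit set, so anyone can create files but only a file's owner can delete them:

chmod 1777 /tmp     # the correct mode: drwxrwxrwt
ls -ld /tmp         # verify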

Jerry Jordak

This one wasn't my fault, but is still funny.

One time, we had to add disk space to one of our servers. My manager at the time was also in charge of the EMC disk environment, so he allocated an extra disk to the server. I configured the disk into the OS, did a pvcreate on it, and proceeded to add it to the volume group, extend the filesystem, etc...

At about that same time, another one of our servers started going absolutely nuts. It turns out that he had accidentally given me a disk that was already allocated to that other system. That drive had held the binaries for that server's application. Oops...

Tom Danzig

As root:

find / -u 201 -chown dansmith

I did this after changing a user ID to another number. I should have used "-user" and not -u (I had usermod on my mind). The system gladly ignored the -u and started changing all files to user dansmith (/etc/passwd, /etc/hosts, etc). Needless to say, the system was hosed.

Was able to recover fine from make_recovery tape. Fortunately this was also a test box and not production.

Oh well ... live and learn! Mistakes are only bad if you don't learn from them.

Mark Fenton

Back in '92 on a NIS network, I meant to wipe out a particular user's directory, but was one level up from it when I issued rm -r *. It took three hours to restore all home directories on the network....

Last year, I discovered that new is not necessarily better. Updating DB software, I blithely stopped the DB, copied the new software in, and restarted. Users couldn't get any processing done that day -- it seems that there was a conversion program that was *supposed* to run that didn't. But that wasn't the blunder -- the blunder is that the most recent backup had been taken two days previous, so all the previous day's processing was gone... (and that had been an overtime day, too!)

Keely Jackson

My greatest blunder:

The guy who set up the live database had done it as himself rather than as a separate dba user. He left the company and his user id was re-allocated to somebody in HR. The guy in HR subsequently left as well.

One day I decided to tidy up the system and remove this user. I did this via SAM and selected the option to delete all the user's files, thinking that nobody in HR could possibly own any important files.

Unfortunately I was somewhat mistaken. Of course the guy in HR now owned all the database files. The first thing I knew was when the users started to complain that the database was no longer available. I got the db back from a restore, but everybody had lost half a day's work.

Needless to say, I now do not delete old users' files but re-allocate them to a special 'leavers' user and check them all before deleting anything.

A good HP blunder.

HP were moving the live server - a K420 - between sites, and the removal men managed to drop it down a flight of stairs. It landed on one of them, who then had to be taken to hospital. Fortunately he was only bruised, while the machine had a huge dent in it. Anyway, it got moved to the other site and booted up straight away with no problems. That is what I call resilient hardware. As a precaution disks etc. were changed, but it is still running quite happily today.

Cheers
Keely

Michael Steele

    When I was first starting out I worked for a Telecom as an 'Application Administrator'. I sat in a small room with half a dozen other admins, and together we took calls from users as their calls escalated up from tier I support. We were tier II in a three-tier organization.

    A month earlier, someone from tier I had confused a production server with a test server and rebooted it in the middle of the day. These servers were administered remotely over large distances, so it is easy to get confused. Care is needed before rebooting.

    The tier I culprit took a great deal of abuse for this mistake and soon became the victim of several jokes. An outage had been caused in a high-availability environment, which meant management, interviews, reports; it went on and on and was pretty brutal.

    And I was just as brutal as anyone.

    Their entire organization soon became victimized by everyone from our organization. The abuse traveled right up the management tree, and all participated.

    It was hilarious, for us.

    Until I did the same thing a month later.

    There is nothing more humbling than 2000 people all knowing who you are for the wrong reason, and I have never longed for anonymity more.

    Now I always do a 'uname' or 'hostname' before a reboot, even when I'm right in front of the box.

Geoff Wild

Problem Exists Between Keyboard And Chair:

Just did this yesterday:

tar cvf - /sapmnt/XXX | tar xvf -

Meant to do:

tar cvf - /sapmnt/XXX | (cd /sapmnttest/XXX ;tar xvf -)

Needless to say, I corrupted most of the files in /sapmnt/XXX

Rgds....Geoff

Suhas

1. Imagine what happened when, on a Solaris box, while taking a backup of ld.so.1, instead of a "cp", a "mv" was done!!! As most of you will be aware, ld.so.1 is the runtime linker, needed by every dynamically linked program. The next hour was sheer chaos .. and the worst hour ever experienced!!!!
Lesson Learnt: "Look before you leap !!!"

2. I was responsible for changing the date on the back-up master server by nearly a year. That was a horrifying night of my life.
Lesson Learnt: "A typo can cost you anything between $0 and infinity."

Keep forumming !!!!
Suhas

[Jun 12, 2010] Sysadmin Blunder (3rd Response) - Toolbox for IT Groups

chrisz

Did one also.

I was in a directory off of root and performed the following command:

chown -R someuser.somegroup .*

I didn't think much of the command; I just wanted to change the owner and group for all files with a . in front of them, and their subdirs. It went well for the files in the current directory, until it reached the .. entry (the parent directory). All the files and subdirs off of root changed to the owner and group specified. I was wondering why the command was taking so long to complete. BTW, it changed the owner and group of all the NFS files too! That's when the real fun started.
Some days you're the windshield, other days you're the bug!

Dan Wright

It didn't really cause any significant damage, but about 10 years ago, I had recently become the admin of a network of mostly NeXT machines, which were new to me, and the default shell was c-shell, which I also wasn't very familiar with.

I had dialed in from home one night to play around and become more familiar with how things worked on NeXTStep.

In an attempt to kill a job, I typed in "kill 1" instead of "kill %1" - and it probably was actually a "kill -9 1" - and of course I was root.

And of course, 1 was the init process. I immediately lost the connection and had to do a hard reboot on that machine the next day before that user got in (for some reason, the machine with the modem wasn't in my office, it was in someone else's).

Fortunately, that wasn't a critical machine outside of normal business hours. No harm, no foul, eh?


If you like this kind of story, there are a bunch here:

http://www2.hunter.com/~skh/humor/admin-horror.html

User123731

I have in the past touched a file called "-i" in important directories. This will cause rm to see the "-i" and become interactive before it acts on the other files/dirs, if you do not specify a particular directory.
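The decoy works because glob expansion sorts "-i" ahead of most file names (in the C locale, at least), so rm parses it as an option rather than a file. A sketch:

touch -- -i    # '--' stops touch itself from parsing the name as an option
rm *           # expands to: rm -i file1 file2 ...  -- prompts before deleting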

User451715:

Ha!

That's an easy one.. My first position was as a Junior Admin in HP-UX, working in first-line support, about eight years ago..

I was working on a server, moving some files around, and mistakenly moved all of the files in the /etc directory to a lower-level directory (about 10 sub-dirs down)..

I sat there at the console wide-eyed, my heart dropped, and I turned and looked out the window and saw my job sailing out of it, since this was a server that was being prepared for deployment, and a month's worth of work would have been wasted.

Luckily, a Senior Admin who later became my greatest mentor (Phil Gifford) took pity on my situation, and we sat there and recovered the /etc directory before anyone knew what had happened.. The key here was that he walked me through the necessary steps to recover the files from an Ignite tape, and voila!

Needless to say, I learned all about why seasoned UNIX admins protect root privilege as if it were the 'Family Jewels'.. <chuckle>

Mike E.-

Bryan Irvine :

My biggest blunder wasn't on an AIX box, but it applies to the thread. I once made an access list for a cisco box and forgot that there is an implicit "deny all" rule at the end. So I made my nifty access list and enabled it, tested the traffic to see if it was blocked, and lo and behold it seemed to be working. Great! I went on with my life and figured I'd go read some news or something. Uhhh, that didn't work. Tried email... that didn't work. Tried traceroutes; they all died at the router I was just working on... then the phones started ringing.

*click*

A lightbulb went on in my head and I ran as fast as I could to the router to reboot it (lucky I hadn't written the nvram).

The phone didn't stop ringing for 45 minutes even though the problem only existed for about 4 minutes.

But then, what do you expect when you kill internet traffic at 5 locations across 2 states?

The guys on the cisco list said that if you haven't done similar you are lying about the 5 years of experience on your resume ;-)


--Bryan

jxtmills:

I needed to clear a directory full of ".directories" and I issued:

rm -r .*

It seemed to find ".." awfully fast. That was a bad day, but we restored most of the files in about an hour.

John T. Mills

bryanwun:

I thought I was in an application dir but instead was in /usr, and did a chown -R to a low-level user. On top of that, I did not have a mksysb backup, and the machine was in production. It continued to function for the users OK, but most shell commands returned nothing. I had to find another machine with the same OS and maintenance level, write a script to gather ownership and permissions, then write another script to apply the permissions to the damaged machine. This returned most functionality, but I still can't install new software with installp: it goes through the motions, then nothing is changed.

alain:

hi everyone

for me, I remember 2:

1 - 'rm *;old' in the / directory - note the ';' instead of '.'

2 - killed the pid of the informix process (oninit) and deleted it (i dreamed)

jturner:

Variation on a theme ... the 'rm -r theme'

as a junior admin on AIX 3.2.5, I had been left to my own devices to create some housekeeping scripts - all my draft scripts being created and tested in my home directory with a '.jt' suffix. After completing the scripts I found that I had inadvertently placed some copies in / with a .jt suffix. An easy job, then, to issue a 'rm *.jt' in / and all would be well. Well, it would have been if I hadn't put a space between the * and the .jt. And the worst thing of all: not being a touch typist and looking at the keys, I glanced at the screen before hitting the enter key, realised with horror what was going to happen, and STILL my little finger continued toward the enter key.

Talk about 'Gone in 60 seconds' - my life was at an end - over - finished - perhaps I could park cars or pump gas for a living. Like other correspondents, a helpful senior admin was on hand to smile kindly and show me how to restore from mksysb - fortunately taken daily on these production systems. (Thanks Pete :))) )

To this day, rm -i is my first choice with multiple rm's just as a
test!!!!!!

Happy rm-ing :)

daguenet:

I know that one. Does anybody remember when the rm man page had a warning not to do rm -rf / as root? How many systems were rebuilt due to that blunder? Not that I have ever done something like that, nor will I ever admit to it :).

Aaron

cal.staples:

That is a no brainer!

First a little background. I cooked up a script called "killme" which
would ask for a search string then parse the process table and return a
list of all matches. If the list contained the processes you wanted to
kill then you would answer "Yes", not once, but twice just to be sure
you thought about it. This was very handy at times so I put it out on
all of our servers.

Some time had passed and I had not used it for a while when I had a need to kill a group of processes. So I typed the command, not realizing that I had forgotten the script's name. Of course I was on our biggest production system at the time, and everything stopped in its tracks!

Unknown to me was that there is an AIX command called "killall" which is
what I typed.

From the MAN page: "The killall command cancels all processes that you
started, except those producing the killall process. This command
provides a convenient means of canceling all processes created by the
shell that you control. When started by a root user, the killall command
cancels all cancellable processes except those processes that started
it."

And it doesn't ask for confirmation or anything! Fortunately the
database didn't get corrupted and we were able to bring everything back
on line fairly quickly. Needless to say we changed the name of this
command so it couldn't be run so easily.

"Killall" is a great command for a programmer developing an application
which goes wild and he/she needs to kill the processes and retain
control of the system, but it is very dangerous in the real world!

Jeff Scott:

The silliest mistake? That had to be a permissions change on /bin. I got a call from an Oracle DBA that $ORACLE_HOME/bin no longer belonged to oracle:dba. We never found out how that happened. I logged in to change the permissions. I accidentally typed cd /oracle.... /bin (note the space before /bin), then cheerfully entered the following command:

#chown -R oracle:dba ./*

Fortunately, the command did not climb up to root, but it really made a mess. We ended up restoring /bin from a backup taken the previous evening.


Jeff Scott
Darwin Partners/EMC



tzhou:

crontab -r when I wanted to do crontab -e. See, the letters e and r are side by side on the keyboard. I had 2 pages of crontab and had no backup on the machine!

Jeff Scott:

I've seen the rm -fr effect before. There were no entries in any crontab.
Before switching to sudo, the company used a homegrown utility to grant
things root access. The server accepted ETLs from other databases, acting as
a data warehouse. This utility logged locally, instead of logging via
syslogd with local and remote logging. So, when the system was erased, there
really was no way to determine the actual cause. Most of the interactive
users shared a set of group accounts, instead of enforcing individual
accounts and using su - or sudo. The outage cost the company $8 million USD
due to lost labor that had to be repeated. Clearly, it was caused by a
cleanup script, but it is anyone's guess which one.

Technically, this was not a sysadmin blunder, but it underscores the need
for individual accounts and for remote logging of ALL uses of group
accounts, including those performed by scripts.

It also underscores the absolute requirement that all scripts have error
trapping mechanisms. In this case, the rm -fr was likely preceded by a cd to
a nonexistent directory. Since this was performed as root, the script cd'd
to / instead. The rm -fr then removed everything. The other possibility is
that it applied itself to another directory, but, again, root privileges
allowed it to climb up the directory tree to root.

Aneesh Mohan

Hi Siv,

The greatest blunder from my side: I created an lvol named /dev/vg00/lvol11 and did newfs on /dev/vg00/rlvol1 :)

The second greatest blunder from my side was corrupting the root filesystem with the 2 steps below :)


#lvchange -C n /dev/vg00/lvol03

#lvextend -L 100 /dev/vg00/lvol03

Cheers,
Aneesh

[Jun 09, 2010] Halloween - IT Admin Horror Stories

Zimbra Forums

Well ... I was working for a large multi-national running HP-UX systems and Oracle/SAP, and one day the clock struck twelve and the OS just started to disappear. Down went SAP and Oracle like a sack of spuds!

Mayhem broke out, with the IT manager standing over my shoulder wanting to know what had happened ... I did not have a clue, and I could not even get onto the system as it was completely hosed! So the task of restoring the server began, and after 30 minutes I had everything back up and running again. Phewww.

Until 1pm! The system was disappearing again. What the hell is going on? Panic set in, but this time I managed to keep a couple of sessions open to allow me to check the system.

And then it clicked .... I wonder .... Yep, indeed: somebody had set up a cronjob AS ROOT that attempted to 'cd' to a directory and then proceeded with a 'rm -rf *'.

Though the ******* other admin did not verify that the directory existed before performing the remove! Well, once we had restored the system again, the cronjob was removed, and we were all running fine again.

Moral of the story: always protect root access and ensure you have adequate backups!!!

[Jun 06, 2010] NFS-export as a poor man's backdoor

You can't log-in to the box if /etc/passwd or /etc/shadow are gone...

Ric Werme: Oct 10, 2007 18:05:52 -0700

Bill McGonigle once learned:

> rm lets you remove libc too.  DAMHINT.

I managed to salvage one system because I had NFS-exported / and
could gain write access from another system.

After that I often did the export before replacing humorless files like libc.so, and sometimes did the update over NFS. It was a struggle to remember to type the /mnt before the /etc/passwd, so I tried to cd to the target directory and copy files in.

  -Ric Werme

[Jun 06, 2010] Security zeal ;-)

Good judgment comes from experience, experience comes from poor judgment. So do new jobs...Sometimes, even entirely new careers!

> On 10/9/07, John Abreau <[EMAIL PROTECTED]> wrote:
>> ... I looked in /bin for suspicious files, and that was the
>> first time I ever noticed the file [ . It looked suspicious, so
>> of course I deleted it. :-/
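The joke, of course, is that [ is a perfectly legitimate program: it is the test(1) utility that countless shell scripts invoke on every boot:

ls -l /bin/[    # (or /usr/bin/[) -- a real executable, not an intruder
[ -f /etc/passwd ] && echo "scripts depend on this every day"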

[Jun 05, 2010] Directory formerly known as /etc ;-)

Tom Buskey
Thu, 11 Oct 2007 06:18:27 -0700

On 10/10/07, Bill McGonigle <[EMAIL PROTECTED]> wrote:
>
>
> On Oct 9, 2007, at 17:31, Ben Scott wrote:
>
> >   Did you know 'rpm' will let you remove every package from the
> > system?
>
> rm lets you remove libc too.  DAMHINT.


I had a user call about a user-supported system that was having issues. We explicitly do not support it, and the users only use the root account.

He gave me the root account to log in, and I couldn't. I went to his system and looked around. /etc was empty. I told him he was fsked and that he should ftp any files he wanted to elsewhere, and that he wouldn't be able to log in again or reboot. In any event, we were not supporting it.

Sure enough, a help desk ticket came in for another admin, claiming that the system got corrupted during bootup. Why do users lie so often? All it does is obscure the problem...
Hmmm. Did you check lost+found? I've had similar symptoms, only to discover that there was indeed a bad sector that remapped all of /etc/ and some of /var and /usr. fsck didn't help much until I moved the drive to another system and ran fsck there.

But you're right - if it's not supported, then they'll have to go elsewhere to get this done.

BTW: My point is: the user may not have lied, but was just calling the shots as s/he saw them.
--Bruce

[May 26, 2010] Never play fast and loose with the /boot partition

Here is a recent story connected with an upgrade of the OS (in this case Suse 10) to a new service pack (SP3).

After the upgrade, the sysadmin discovered that no /boot partition was mounted; instead there was a /boot directory on the root filesystem, populated by the update. This is the so-called "split kernel" situation, when one (older) version of the kernel boots and then finds different (more recent) modules in /lib/modules and complains. The reason for this strange behavior of the Suse update was convoluted and connected with the LVM upgrade it contained, after which LVM blocked the mounting of the /boot partition.

Easy, he thought. Let's boot from the DVD, mount the boot partition as, say, /boot2, and copy all the files from the /boot directory to the boot partition.

And he did exactly that. To make things "clean", he first wiped the "old" boot partition and then copied the directory over.

After rebooting the server, he saw the GRUB prompt; it never got to the menu. This was a production server, and the time slot for the upgrade was 30 minutes. The investigation, which now involved other sysadmins and took three hours - the server needed to be rebooted, backups retrieved from tape to another server, etc. - revealed that the /boot directory had not contained a couple of critical files, such as /boot/message and /boot/grub/menu.lst. Remember, the /boot partition had been wiped clean.

BTW, /boot/message is a binary file used by the graphical menu, and GRUB stops execution of /boot/grub/menu.lst when it encounters the instruction

gfxmenu (hd0,1)/message

Here is the actual /boot/grub/menu.lst:

# Modified by YaST2. Last modification on Thu May 13 13:43:35 EDT 2010
default 0
timeout 8
gfxmenu (hd0,1)/message
##YaST - activate

###Don't change this comment - YaST2 identifier: Original name: linux###
title SUSE Linux Enterprise Server 10 SP3
root (hd0,1)
kernel /vmlinuz-2.6.16.60-0.54.5-smp root=/dev/vg01/root vga=0x317 splash=silent showopts
initrd /initrd-2.6.16.60-0.54.5-smp

###Don't change this comment - YaST2 identifier: Original name: failsafe###
title Failsafe -- SUSE Linux Enterprise Server 10 SP3
root (hd0,1)
kernel /vmlinuz-2.6.16.60-0.54.5-smp root=/dev/vg01/root vga=0x317 showopts ide=nodma apm=off acpi=off noresume edd=off 3
initrd /initrd-2.6.16.60-0.54.5-smp

Luckily there was a backup done before this "fix". Four hours later the server was bootable again.

Sysadmin Stories: Moral of these stories

October 19, 2009 | UnixNewbie.org

From: jarocki@dvorak.amd.com (John Jarocki)
Organization: Advanced Micro Devices, Inc.; Austin, Texas

- Never hand out directions on "how to" do some sysadmin task until the directions have been tested thoroughly.

– Corollary: Just because it works on one flavor of *nix says nothing about the others. '-}

– Corollary: This goes for changes to rc.local (and other such "vital" scripties).

2

From: ericw@hobbes.amd.com (Eric Wedaa)
Organization: Advanced Micro Devices, Inc.

-NEVER use 'rm', use 'rm -i' instead.
-Do backups more often than you go to church.
-Read the backup media at least as often as you go to church.
-Set up your prompt to do a `pwd` every time you cd (see the sketch after this list).
-Always do a `cd .` before doing anything.
-DOCUMENT all your changes to the system (we use a text file called /Changes).
-Don't nuke stuff you are not sure about.
-Do major changes to the system on Saturday morning so you will have all weekend to fix it.
-Have a shadow watching you when you do anything major.
-Don't do systems work on a Friday afternoon (or any other time when you are tired and not paying attention).
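The prompt advice from the list above, sketched for bash (ksh and other shells have equivalents):

PS1='\u@\h:\w\$ '                   # user, host, and current directory in every prompt
cd() { builtin cd "$@" && pwd; }    # or literally print pwd on every cd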

3

From: rca@Ingres.COM (Bob Arnold)
Organization: Ask Computer Systems Inc., Ingres Division, Alameda CA 94501

1) The "man" pages don't tell you everything you need to know.
2) Don't do backups to floppies.
3) Test your backups to make sure they are readable.
4) Handle the format program (and anything else that writes directly to disk devices) like nitroglycerine.
5) Strenuously avoid systems with inadequate backup and restore programs wherever possible (thank goodness for "restore" with an "e"!).
6) If you've never done sysadmin work before, take a formal training class.
7) You get what you pay for. There's no substitute for experience.
9) It's a lot less painful to learn from someone else's experience than your own (that's what this thread is about, I guess).

4

From: jimh@pacdata.uucp (Jim Harkins)
Organization: Pacific Data Products

If you appoint someone to admin your machine you better be willing to train them. If they've never had a hard disk crash on them you might want to ensure they understand hardware does stuff like that.

5

From: dvsc-a@minster.york.ac.uk
Organization: Department of Computer Science, University of York, England

Beware anything recursive when logged in as root!

6

From: matthews@oberon.umd.edu (Mike Matthews)
Organization: /etc/organization

*NEVER* move something important. Copy, VERIFY, and THEN delete.

7

From: almquist@chopin.udel.edu (Squish)
Organization: Human Interface Technology Lab (on vacation)

When you are typing some BIG command, reread what you've typed about 100 times to make sure it's sunk in (:

8

From: Nick Sayer

If / is full, du /dev.
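The point of this terse tip: a mistyped device path (say, a tar to a misspelled tape device) creates an ordinary file under /dev that silently fills the root filesystem. A hedged way to spot such files:

# Regular files under /dev are almost always accidents:
find /dev -type f -size +1M -exec ls -lh {} +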

9

From: TRIEMER@EAGLE.WESLEYAN.EDU
Organization: Wesleyan College

Never ever assume that some prepackaged script that you are running does anything right.

Admin Stories | UnixNewbie.org

This is a modified list from "The Unofficial Unix Administration Horror Story Summary" by Anatoly Ivasyuk.

- Creative uses of rm
- How not to free up space on your drive
- Dealing with /dev files
- Making backups
- Blaming it on the hardware
- Partitioning the drives
- Configuring the system
- Upgrading the system
- All about file permissions
- Machine dependencies
- Miscellaneous stories (a.k.a. 'oops')
- What we have learned

My 10 UNIX Command Line Mistakes by Vivek Gite


Anyone who has never made a mistake has never tried anything new. -- Albert Einstein.

Here are a few mistakes that I made while working at the UNIX prompt. Some caused me a good amount of downtime. Most of these mistakes are from my early days as a UNIX admin.

userdel Command

The file /etc/deluser.conf was configured to remove the home directory and mail spool of the user being removed (this had been done by the previous sysadmin, and it was my first day at work). I just wanted to remove the user account and ended up deleting everything (note: -r was effectively activated via deluser.conf):
userdel foo
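The original post mixes userdel with Debian's deluser.conf (which is read by deluser, not userdel), but either way the check that would have prevented this is to inspect the distro's removal defaults first. A hedged sketch:

# Debian-style: see whether account removal is configured to take home/mail along:
grep -Ei 'remove_home|remove_all_files' /etc/deluser.conf
# Remove only the account, keeping the home directory and mail spool:
deluser foo      # honors deluser.conf defaults
userdel foo      # shadow-utils equivalent; add -r only when you really mean it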

Rebooted Solaris Box

On Linux the killall command kills processes by name (killall httpd). On Solaris it kills all active processes. As root I killed every active process; this was our main Oracle db box:
killall process-name
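The portable habit, sketched below (pkill ships with Solaris 7 and later as well as modern Linux):

killall httpd    # Linux: kills processes named httpd
                 # Solaris: kills ALL processes you can signal (it is a shutdown helper)
pkill httpd      # kills by name on both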

Destroyed named.conf

I wanted to append a new zone to the /var/named/chroot/etc/named.conf file, but ended up running:
./mkzone example.com > /var/named/chroot/etc/named.conf
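Two cheap defenses, sketched for bash (mkzone is the poster's own script; the path is from the story):

set -o noclobber                                            # make '>' refuse to overwrite existing files
./mkzone example.com >> /var/named/chroot/etc/named.conf    # '>>' appends instead of truncating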

Destroyed Working Backups with Tar and Rsync (personal backups)

I had only one backup copy of my QT project and I just wanted to pull out a directory called functions. I ended up destroying the entire backup (note the -c switch instead of -x):

cd /mnt/bacupusbharddisk

tar -zcvf project.tar.gz functions

I had no backup. Similarly, I once ran an rsync command and deleted all new files by overwriting them from the backup set (now I've switched to rsnapshot):
rsync -av --delete /dest /src
Again, I had no backup.
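A sketch of the corresponding safe habits (paths taken from the story):

tar -ztvf project.tar.gz                     # 't' only lists; verify before touching anything
tar -zxvf project.tar.gz functions           # 'x' extracts just the functions directory
rsync -av --delete --dry-run /src/ /dest/    # --dry-run (-n) previews what would be deleted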

Deleted Apache DocumentRoot

I had symlinks in my web server docroot (/home/httpd/http was symlinked to /www). I forgot about the symlink. To save disk space, I ran rm -rf on the http directory. Luckily, I had a full working backup set.

Accidentally Changed Hostname and Triggered False Alarm

I accidentally changed the current hostname (I wanted to see the current hostname settings) on one of our cluster nodes. Within minutes I received an alert message on both mobile and email.
hostname foo.example.com

Public Network Interface Shutdown

I wanted to shut down the VPN interface eth0, but ended up shutting down eth1, the public interface my SSH session was on:
ifconfig eth1 down

Firewall Lockdown

I made changes to sshd_config and changed the ssh port number from 22 to 1022, but failed to update the firewall rules. After a quick kernel upgrade, I rebooted the box. I had to call a remote data center tech to reset the firewall settings. (Now I use a firewall reset script to avoid lockouts.)
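A hedged sketch of the safer sequence (iptables syntax is an assumption; the story does not say which firewall was in use):

iptables -I INPUT -p tcp --dport 1022 -j ACCEPT   # open the new port first, and persist it in the firewall config
sshd -t                                           # syntax-check sshd_config before restarting
service sshd restart
# then verify with a NEW session (ssh -p 1022 host) before closing the old one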

Typing UNIX Commands on Wrong Box

I wanted to shut down my local Fedora desktop system, but issued halt on a remote server instead (I was logged into the remote box via SSH):
halt
service httpd stop

Wrong CNAME DNS Entry

I created a wrong DNS CNAME entry in the example.com zone file. The end result: a few visitors went to /dev/null:
echo 'foo 86400 IN CNAME lb0.example.com' >> example.com && rndc reload
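BIND can validate a zone before it goes live; a minimal sketch (the zone file path is an illustrative assumption):

named-checkzone example.com /var/named/chroot/var/named/example.com.zone \
  && rndc reload            # reload only if the zone parses cleanly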

Conclusion

All men make mistakes, but only wise men learn from their mistakes -- Winston Churchill.

From all those mistakes I've learnt that:

  1. Backup = ( Full + Removable tapes (or media) + Offline + Offsite + Tested )
  2. The clear choice for preserving all data of UNIX file systems is dump, which is the only tool that guarantees recovery under all conditions (see the "Torture-testing Backup and Archive Programs" paper).
  3. Never use rsync with a single backup directory. Create snapshots using rsync or rsnapshot.
  4. Use CVS to store configuration files (see the sketch after this list).
  5. Wait and read the command line again before hitting the damn [Enter] key.
  6. Use your well-tested perl / shell scripts and open source configuration management software such as Puppet, Cfengine or Chef to configure all servers. This also applies to day-to-day jobs such as creating users and so on.
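A minimal sketch of item 4; the post suggests CVS, but git (a deliberate modern substitute here) shows the same idea with less ceremony:

cd /etc
git init && git add -A && git commit -m 'baseline'    # one-time setup
# after any configuration change:
git add -A && git commit -m 'sshd: move port 22 -> 1022'
git log --oneline -- ssh/sshd_config                  # history of a single file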

Mistakes are inevitable. So, did you make any mistakes that caused some sort of downtime? Please add them in the comments below.

Jon 06.21.09 at 2:42 am
My all time favorite mistake was a simple extra space:

cd /usr/lib
ls /tmp/foo/bar

I typed
rm -rf /tmp/foo/bar/ *
instead of
rm -rf /tmp/foo/bar/*
The system doesn't run very well without all of its libraries……
georgesdev 06.21.09 at 9:15 am
never type anything such as:
rm -rf /usr/tmp/whatever
maybe you are going to hit enter by mistake before the end of the line. You would then, for example, erase your whole disk starting from /.

if you want to use the -rf option, add it at the end of the line:
rm /usr/tmp/whatever -rf
and even this way, read your line twice before adding -rf

3ToKoJ 06.21.09 at 9:26 am
public network interface shutdown … done
typing unix command on wrong box … done
Delete apache DocumentRoot … done
Firewall lockdown … done, with a NAT rule redirecting the configuration interface of the firewall to another box; a serial connection saved me

I can add: being trapped by aptitude keeping track of previously planned - but not executed - actions, like "remove slapd from the master directory server"

UnixEagle 06.21.09 at 11:03 am

Rebooted the wrong box
While adding an alias to the main network interface I ended up changing the main IP address; the system froze right away and I had to call for a reboot
Instead of appending text to an Apache config file, I overwrote its contents
Firewall lockdown while changing the ssh port
Mistakenly ran a script containing a recursive chmod and chown as root on /, which cost me about 12 hours of downtime and a complete re-install

Some mistakes are really silly, and when they happen you can't believe you did that, but every mistake, regardless of its silliness, should be a lesson learned.
If you made a trivial mistake, you should not just overlook it; think about the reasons that made you do it, like: you didn't have much sleep, or your mind was preoccupied with personal life, etc.

I like Einstein's quote; you really do have to make mistakes to learn.

Selected Comments
7 smaramba 06.21.09 at 11:31 am
typing unix command on wrong box and firewall lockdown are all time classics: been there, done that.
but for me the absolute worst, on linux, was checking a mounted filesystem on a production server…

fsck /dev/sda2

The root filesystem was rendered unreadable. System down. Dead. Users really pissed off.
Fortunately there was a full backup and the machine rebooted within an hour.

8 od 06.21.09 at 12:50 pm
"Typing UNIX Commands on Wrong Box"

Yea, I did that one too. Wanted to shut down my own vm but I issued init 0 on a remote server which I accessed via ssh. And oh yes, it was the production server.

10 sims 06.22.09 at 2:23 am
Funny thing, I don't remember ever typing in the wrong console. I think that's because I usually have the hostname right there in the prompt. Fortunately, I don't do the same things over and over again very much, which means I don't remember command syntax for all but the most-used commands.

Locking myself out while configuring the firewall – done – more than once. It wasn't really a CLI mistake though. Just being a n00b.

georgesdev, good one. I usually:

ls -a /path/to/files
to double check the contents
then up arrowkey homekey hit del a few times and type rm. I always get nervous with rm sitting at the prompt. I'll have to remember that -rf at the end of the line.

I always make mistakes making links. I can never remember the syntax. :/

Here's to fewer CLI mistakes… (beer)

Grant D. Vallance 06.22.09 at 7:56 am
A couple of days ago I typed and executed (as root): rm -rf /* on my home development server. Not good. Thankfully, the server had nothing important on it at the time, which is why I had no backups …

I am still not sure *why* I did it when I have read all the warnings about using this command. (A dyslexic moment with the syntax?)

Ah well, a good lesson learned. At least it was not the disaster it could have been. I shall be *very* paranoid about this command in the future.

Joren 06.22.09 at 9:30 am
I wanted to remove the subfolder etc from the /usr/local/matlab/ directory. Out of force of habit (from going to the /etc folder) I accidentally added the '/' symbol, and from the /usr/local/matlab directory I typed:

sudo rm /etc

instead of

sudo rm etc

Without the entire /etc folder the computer didn't work anymore (which was to be expected, of course) and I ended up reinstalling my computer.

Robsteranium 06.22.09 at 11:05 am
Aza Raskin explains how habituation can lead to stupid errors: confirming "yes I'm sure / overwrite file etc." automatically without realising it. Perhaps rm and the > operator need an undo / built-in backup…
Ramaswamy 06.22.09 at 10:47 am
Deleted the files
I used to place some files in /tmp/rama and some conf files at /home//httpd/conf
I used to swap between these two directories with "cd -"
I executed the command rm -fr ./*
intending to remove the files at /tmp/rama/*, but ended up removing the files at /home//httpd/conf/*, without any backup
Yonitg 06.23.09 at 8:06 am
Great post!
I did my share of system mishaps,
killing servers in production, etc.
The most embarrassing one was sending 70K users the wrong message.
Or better yet, telling the CEO we have a major crisis, gathering up many people to solve it, and finding that it is nothing at all while all the management is standing in my cube.
Solaris 06.23.09 at 8:37 pm
Firewall lock out: done.
Command on wrong server: done.

And the worst: update and upgrade while some important applications were running, of
course on a production server.. as someone mentioned, the system doesn't run very well
without all of its original libraries :)

Peko 06.30.09 at 8:46 am
I invented a new one today.

Just assuming that a [-v] option stands for –verbose

Yep, most of the time. But not on a [pkill] command.
[pkill -v myprocess] will kill _any_ process you can kill - except those whose name contains "myprocess". Ooooops. :-!
(I just wanted pkill to display "verbose" information when killing processes)

Yes, I know. Pretty dumb thing. Lesson learned ?

I would suggest adding another critical rule to your list:
" Read The Fantastic Manual - First" ;-)
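Since pgrep shares pkill's matching flags, a preview shows exactly how -v inverts the match (a sketch):

pgrep -l myprocess      # what 'pkill myprocess' would kill
pgrep -l -v myprocess   # what 'pkill -v myprocess' would kill: everything else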

Jai Prakash 07.03.09 at 1:43 pm
Mistake 1:

My Friend tried to see last reboot time and mistakenly executed command "last | reboot" instead of "last | grep reboot"

It caused an outage on a production DB server.

Mistake 2:

Another guy wanted to see the FQDN on a Solaris box and executed "hostname -f".
It changed the hostname to "-f", and clients faced a lot of connectivity issues due to this mistake.
[ hostname -f shows the FQDN on Linux, but on Solaris its usage is different ]

32 Name 07.04.09 at 5:20 pm
Worst thing I've done so far: I accidentally dropped a MySQL database containing 13k accounts for a gameserver :D

Luckily I had backups, but it took a little while to restore.

33 Vince Stevenson 07.06.09 at 6:23 pm
I was dragged into a meeting one day and forgot to secure my Solaris session. A colleague and former friend did this: alias ll='/usr/sbin/shutdown -g5 -i5 "Bye bye Vince"'. He must have thought that I was logged into my personal host machine, not the company's cash-cow server. That's what happens when it all goes wrong. Secure your session… Rgds Vince
Bjarne Rasmussen 07.07.09 at 7:56 pm
well, tried many times, the crontab fast-typing failure…

crontab -r instead of -e
e for edit
r for remove..

now I always use -l to list before editing…
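The habit that makes a stray -r survivable, sketched:

crontab -l > /root/crontab.bak   # snapshot before editing
crontab -e
# ...and if -r ever slips out:
crontab /root/crontab.bak        # reload the snapshot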

35 Ian 07.08.09 at 4:15 am
Made a script that automatically removes all files from a directory. Rather than writing it safely (this was early on), I did it stupidly:

cd /tmp/files
rm ./*

Of course, eventually someone removed /tmp/files, the cd failed, and the rm ran in whatever directory the script was started from..

36 shlomi 07.12.09 at 9:21 am
Hi

On my RHEL 5 server I made /tmp a mount point on my storage, and the tmpwatch script that runs under cron.daily removed files which had not been accessed in 12 hours !!!

M.S. Babaei 08.01.09 at 3:39 am
Once upon a time mkfs killed me on the ext3 partition:
instead of
mkfs.ext3 /dev/sda1
I did this
mkfs.ext3 /dev/sdb1

I'll never forget what I lost.

Simon B 08.07.09 at 2:47 pm
Whilst a colleague was away from their keyboard I entered:

rm -rf *

… but did not press enter on the last line (as a joke). I expected them to come back, see the joke, and rofl… backspace… The unthinkable happened: the screen went to sleep and they banged the enter key a couple of times to wake it up. We lost 3 days worth of business and some new clients. Estimated cost: $50,000+.

ginzero 08.17.09 at 5:10 am
tar cvf /dev/sda1 blah blah…
47 Kevin 08.25.09 at 10:50 am
tar cvf my_dir/* dir.tar
and you write your archive into the first file of the directory …
48 ST 09.17.09 at 10:14 am
I've done the wrong server thing. SSH'd into the mailserver to archive some old messages and clear up space.
Mistake #1: I didn't log off when I was done, but simply minimized the terminal and kept working
Mistake #2: At the end of the day I opened what I thought was a local terminal and typed:
/sbin/shutdown -h now
thinking I was bringing down my laptop. The angry phone calls started less than a minute later. Thankfully, I just had to run to the server room and press power.

I never thought about using CVS to back up config files. After doing some really dumb things to files in /etc (deleting, stupid edits, etc.), I started creating a directory to hold original config files, renaming them to things like httpd.conf.orig or httpd.conf.091709.

As always, the best way to learn this operating system is to break it…however unintentionally.

49 Wolf Halton 09.21.09 at 3:16 pm
Attempting to update a Fedora box over the wire from Fedora 8 to Fedora 9, I updated the repositories to the Fedora 9 repos and ran:
# yum -y upgrade
I have now tested this on a couple of boxes, and without exception the upgrades failed with many loose older-version packages and dozens of missing dependencies, as well as some fun circular dependencies which cannot be resolved. By the time it is done, eth0 is disabled and a reboot will not get to the kernel-choice stage.

Oddly, this kind of upgrade works great in Ubuntu.

50 Ruben 09.24.09 at 8:23 pm
while cleaning the backup hdd late at night, a '/' can change everything…

"rm -fr /home" instead of "rm -fr home/"

It was a sleepless night, but thanks to Carlo Wood and his ext3grep I rescued about 95% of the data ;-)

51 foo 09.25.09 at 9:36 pm
# svn add foo
--> Added 5 extra files that were not meant to be committed, so I decided to undo the change, delete the files, and add them to svn again…
# svn rm foo --force

and it deleted the foo directory from disk :( …lost all my code just before the deadline :(

52 foo 09.25.09 at 9:41 pm
wanted to kill all the instances of a service on HP-UX (no pkill-like util available)…

# ps -aef | grep -v foo | awk '{print $2}' | xargs kill -9

Typed "grep -v" instead of "grep -i" and you can guess what happened :(

53 LinAdmin 09.29.09 at 2:38 pm
Typing rm -Rf /var/* on the wrong box. Recovered in a few minutes by doing scp root@healthy_box:/var . – the ssh session on the just-broken box was still open. This saved my life :-P
54 Deltaray 10.03.09 at 4:37 am
Like Peko above, I too once ran pkill with the -v option and ended up killing everything else. This was on a very important enterprise production machine, and it taught me the hard lesson of checking the man page before trying a new option.

I understand where pkill gets its -v functionality from (pgrep and thus from grep), but honestly I don't see what use of -v would be for pkill. When do you really need to say something like kill all processes except this one? Seems reckless. Maybe 1 in a million times you'd use it properly, but probably most of the time people just get burned by it. I wrote to the author of pkill about this but never heard anything back. Oh well.

55 Guntram 10.05.09 at 7:51 pm
This is why i never use pkill; always use something like "ps ….| grep …" and, when it's ok, type a " | awk '{print $2}' | xargs kill" behind it. But, as a normal user, something like "pkill -v bash" might make perfect sense if you're sitting at the console (so you can't just switch to a different window or something) and have a background program rapidly filling your screen.

Worst thing that ever happened to me:
Our oracle database runs some rdbms jobs at midnight to clean out very old rows from various tables, along the lines of "delete from XXXX where last_access < sysdate-3650". One Sunday I installed ntp on all machines and made a start script that does an ntpdate first, then runs ntpd. Tested it:
$ date 010100002030; /etc/init.d/ntpd start; date
Worked great, current time was ok.
$ date 010100002030; reboot
After the machine was back up I noticed I had forgotten the /etc/rc*.d symlinks, so the clock was still set to the year 2030, and the midnight job happily deleted nearly every row as older than ten years. But I never thought of the database until a lot of people were very angry Monday morning. Fortunately, there's an automated backup every Saturday.

56 sqn 10.07.09 at 6:05 pm
As a beginner, I tried to lock down a folder by removing its permissions (chmod 000) and, wanting to impress myself, did:

# cd /folder
# chmod 000 .. -R

I used two dots instead of one, and of course the system applied the change to the parent folder, which is /.
I ended up leaving home and going to the server to reset the permissions back to normal. I got lucky because I had just done a dd to move the system from one HDD to another and hadn't deleted the old one yet :)
And of course the classics: configuring the wrong box, firewall lockout :)

57 dev 10.15.09 at 10:15 am
while I was working in many ssh windows:

rm -rf *

I intended to remove all files under a site after changing the current working
directory, then to replace them with the stable version.

Wrong window, wrong server, and I did it on a production server xx((
I only became aware of the mistake about 1.5 seconds after typing [ENTER].
No backup. Maybe luckily, the site kept running smoothly..

It seems the deleted files were images and other media content;
1-2 seconds of accidental removal on a fast machine cost me approx. 20 MB.

58 LMatt 10.17.09 at 3:36 pm
In a hurry to get a db back up for a user, I had to parse through a nearly-several-terabyte .tar.gz for the correct SQL dumpfile. Being the good sysadmin, I located it within an hour, and in my hurry to get the db up for the client, who was on the phone the entire time:
mysql > dbdump.sql
Fortunately I didn't sit and wait all that long before checking to make sure that the database size was increasing, and the client was on hold when I realized my error.
mysql > dbdump.sql - SHOULD be -
mysql < dbdump.sql
I had just sent stdout of the mysql CLI to a file named dbdump.sql, overwriting it. I had to re-retrieve the damn sqldump file and start over!
BAH! FOILED AGAIN!
59 Mr Z 10.18.09 at 5:13 am
After 10+ years I've made a lot of mistakes. Early on I got myself in the habit of testing commands before using them. For instance:

ls ~usr/tar/foo/bar then rm -f ~usr/tar/foo/bar – make sure you know what you will delete

When working with SSH, always make sure what system you are on. Modifying system prompts generally eliminates all confusion there.

It's all just creating a habit of doing things safely… at least for me.

60 chris 10.22.09 at 11:15 pm
cd /var/opt/sysadmin/etc
rm -f /etc

Note the leading / in /etc: it was supposed to be rm -rf etc

61 Jonix 10.23.09 at 11:18 am
The deadline was coming too close for comfort; I'd worked too-long hours for months.

We were developing a website, and I was in charge of the CGI scripts, which generated a lot of temporary files, so on pure routine I worked in "/var/www/web/" and entered "rm temp/*", which I mistyped at some point as "rm tmp/ *". In my overtired brain I kind of wondered what took the delete so long to finish; it should only have been 20 small files.

The very next morning the paying client was to fly in, pay us a visit, and get a demonstration of the project.

P.S. Thanks to Subversion and files still open in Emacs buffers I managed to get almost all files back, and I had rewritten the missing files before morning.

62 Cougar 10.29.09 at 3:00 pm
rm * in one of my project directories (no backup). I planned to do rm *~ to delete backup files, but I was using an international keyboard where a space was required after ~ (a dead key for letters like õ)..
63 BattleHardened 10.30.09 at 1:33 am
Some of my more choice moments:

postsuper -d ALL (instead of -r ALL, adjacent keys – 80k spooled mails gone). No recovery possible – ramfs :/

Had a .pl script to delete mails in .Spam directories older than X days; didn't put in enough error checking. Some helpdesk guy provisioned a domain with a leading space in it, and the script rm'd the whole mailstore (rm -rf /mailstore/ domain.com/.Spam/*). (250k users – 500GB used) – Hooray for the 1-day-old backup

chown -R named:named /var/named when there was a proc filesystem mounted under /var/named/proc. Every running process on the system got chowned: /bin/bash, /usr/sbin/sshd and so on. Took hours of manual find's to fix.

.. and pretty much all the ones everyone else listed :)

You break it, you fix it.

64 PowerPeeCee 11.02.09 at 1:01 am
As an Ubuntu user for a while, Y'all are giving me nightmares, I will make extra discs and keep them handy. Eek! I am sure that I will break it somehow rather spectacularly at some point.
65 mahelious 11.02.09 at 10:44 pm
Second day on the job I restarted Apache on the live web server, forgetting to first check the cert password. I was finally able to find it in an obscure doc file after about 30 minutes. The resulting firestorm of angry clients would have made Nero proud. I was very, very surprised to find out I still had a job after that debacle.

Lesson learned: keep your passwords secure, but handy

66 Shantanu Oak 11.03.09 at 11:20 am
scp overwrites an existing file if it exists on the destination server. I just used the following command and soon realised that it had replaced the "somefile" on that server!!
scp somefile root@192.168.0.1:/root/
67 thatguy 11.04.09 at 3:37 pm
Hmm, most of these mistakes I have done – but my personal favourite.

# cd /usr/local/bin
# ls -l -> that displayed some binaries that I didn't need / want.
# cd ..
# rm -Rf /bin
– Yeah, you guessed it – smoked the bin folder ! The system wasn't happy after that. This is what happens when you are root and do something without reading the command before hitting [enter] late at night. First and last time …

68 Gurudatt 11.06.09 at 12:05 am
chmod 777 /

never try this; if you do, even root will not be able to log in

69 richard 11.09.09 at 6:59 pm
So, in recovering a binary backup of a large mysql database, produced by copying and tarballing '/var/lib/mysql', I untarred it in /tmp and did the recovery without incident (at 2am, when it went down). Feeling rather pleased with myself for such a quick and successful recovery, I went to delete the 'var' directory in '/tmp'. I wanted to type:
rm -rf var/

instead I typed:
rm -rf /var

Unfortunately I didn't spot it for a while, and not until after did I realize that my on-site backups were stored in /var/backups …
It was a truly miserable few days that followed while I pieced the box together from SVN and various other sources …

70 Henry 11.10.09 at 6:00 pm
Nice post, and familiar with the classic mistakes.

My all-time classic:
- rm -rf /foo/bar/ * [space between / and *]

Be careful with clamscan's options:
--detect-pua=yes --detect-structured=yes --remove=no --move=DIRECTORY

I chose to scan / instead of /home/user and ended up with a screwed apt, libs, and missing files from all over the place :D I luckily had --log=/home/user/scan.log and not console output, so I could restore the moved files one by one.
Next time I'll use --copy instead of --move and never start with /.

These 2 happened at home; while working I learned a long time ago (SCO Unix times) to back up files before rm :D

71 Derek 11.12.09 at 10:26 pm
Heh,
these were great.
I have many of the above.. my first was
reboot
….Connection reset by peer. Unfortunately, I thought I was rebooting my desktop. Luckily, the performance test server I was on hadn't been running tests (normally they can take 24-72 hours to run)..

Symlinks… ack! I was cleaning up space and thought, weird.. I don't remember having a bunch of databases in this location.. rm -f * Unfortunately, it was a symlink to my /db slice, which DID have my databases. Friday afternoon fun.

I did something similar by being in the wrong directory… deleted all my mysql binaries.

This was also after we had acquired a company, and the same thing had happened on one of their servers months before.. we never realized that, and the server had an issue one day… so we rebooted. Mysql had been running in memory for months, and upon reboot there was no more mysql. Took us a while to figure that out, because no one had thought that the mysql binaries were GONE! Luckily I wasn't the one who had deleted the binaries; I just got to witness the aftermath.

72 Ahmad Abubakr 11.13.09 at 2:23 pm
My favourite :)

sudo chmod 777 /
73 jason 11.18.09 at 4:19 pm
The best ones are when you f*ck up and take down the production server and are then asked to investigate what happened and report on it to management….
74 Mr Z 11.19.09 at 3:02 pm
@jason
That sort of situation leads to this tee-shirt
http://www.rfcafe.com/business/images/Engineer%27s%20Troubleshooting%20Flow%20Chart.jpg
75 John 11.20.09 at 2:29 am
Clearing up space used by no-longer-needed archive files:

# du -sh /home/myuser/oldserver/var
32G /home/myuser/oldserver/var
# cd /home/myuser/oldserver
# rm -rf /var

The box ran for 6 months after doing this, by the way, until I had to shut it down to upgrade the RAM…although of course all the mail, Web content, and cron jobs were gone. *sigh*

76 Erick Mendes 11.24.09 at 7:55 pm
Yesterday I locked myself out of a switch I was setting up. lol
I was setting up a VLAN on it, and my PC was directly connected through one of the ports I messed up.

Had to get in through serial to undo the VLAN config.

Oh, the funny thing is that some hours later my boss made the same mistake lol

77 John Kennedy 11.25.09 at 2:09 pm
Remotely logged into a (Solaris) box at 3am. Made some changes that required a reboot. Being too lazy to even try to remember the difference between Solaris and Linux shutdown commands, I decided to use init. I typed init 0… No one at work to hit the power switch for me, so I had to make the 30-minute drive into work.
This one I chalked up to being a noob… I was on an X terminal connected to a Solaris machine. I wanted to reboot the terminal due to display problems… Instead of just powering off the terminal, I typed reboot on the command line. I was logged in as root…
78 bram 11.27.09 at 8:45 pm
on a remote freebsd box:

[root@localhost ~]# pkg_delete bash

The next time I tried to log in, it kept telling me access denied… hmmmm… ow sh#t

(since my default shell in /etc/passwd was still pointing to the now-nonexistent /usr/local/bin/bash, I would never be able to log in)

79 Li Tai Fang 11.29.09 at 8:02 am
On a number of occasions, I typed "rm" when I wanted to type "mv," i.e., I wanted to rename a file, but instead I deleted it.
80 vmware 11.30.09 at 4:59 am
last | reboot
instead
last | grep reboot
81 ColtonCat 12.02.09 at 4:21 am
I have a habit of renaming config files I work on to the same name with a "~" at the end as a backup, so that I can roll back if I make a mistake; once all is well I just do rm *~. Trouble came when I accidentally typed rm * ~ and, as Murphy would have it, it was on a production Asterisk telephony server.
82 bye bye box 12.02.09 at 7:54 pm
Slicked the wrong box in a production data center at my old job.

In all fairness, it was labeled wrong on both the box and the KVM ID.

Now I've learned to check hostname before decom'ing anything.

83 Murphy's Red 12.02.09 at 9:11 pm
Running out of diskspace while updating a kernel on FreeBSD.

Not fully inserting a memory module on my home machine, which short-circuited my motherboard.

On several occasions I had to use an rdesktop session to a Windows machine and use putty from there to connect to a box (yep.. I know it sounds weird ;-) ). Anyway, text copied in Windows is stored differently than text copied in the shell. While changing a root passwd on a box (password copied using putty) I just control-V'ed it and logged off. I had to go to the datacenter and boot into single-user mode to access the box again.

Using the same crappy setup, I copied some text in Windows and accidentally hit control-V in the putty screen of the box I was logged into as root; the first word was halt, the last character an enter.

Configuring NAT on the wrong interface while connected through ssh.

Adding a new interface on a machine, I filled in the details of a home network in kudzu, which changed the default gateway to 192.168.1.1 on the main interface. I checked only the output of ifconfig, not the traffic or the gateway and dns settings.

fsck -y on a filesystem without unmounting it.

84 ehrichweiss 12.03.09 at 6:55 pm
I've definitely rebooted the wrong box, locked myself out with firewall rules, and rm -rf'ed a huge portion of my system. I had my infant son bang on the keyboard of my SGI Indigo2 and somehow hit the right key combo to undo a couple of symlinks I had created for /usr (I had to delete them a couple of times in the process of creating them) AND clear the terminal/history, so I had no idea what was going on when I started getting errors. I had created the symlinks a week prior, so it took me a while to figure out what I had to do to get the system operational again.

My best and most recent FUBAR was when I was backing up my system (I have horrible, HORRIBLE luck with backups, to the point I don't bother doing them any more for the most part). I was using mondorescue, backing the files up to an NTFS partition mounted under /mondo, and had done a backup that wouldn't restore anything because of an apostrophe or single quote in one of the file names it was backing up; so I removed the files causing the problem (not really a biggie), did the backup, then formatted the drive as I had been planning… only to discover that I hadn't remounted the NTFS partition under /mondo as I had thought, and all 30+ GB of data was gone. I attempted recovery several times, but it was just gone.

85 fly 12.04.09 at 3:55 pm
my personal favorite: a script somehow created a few dozen files in the /etc dir … all named ??somestring, so I promptly did rm -rf ??* … (at the point when I hit [enter] I remembered that ? is a wildcard … Too late :)) luckily that was my home box … but a reinstall was imminent :)
86 bips 12.06.09 at 9:56 am
I once managed to do:
crontab -r

instead of:
crontab -e

which had the effect of emptying the crontab list…

87 bips 12.06.09 at 9:59 am
also I've done

shutdown -n
(I thought -n meant "now")

which had the consequence of rebooting the server without networking…

88 Deltaray 12.06.09 at 4:51 pm
bips: What does shutdown -n do? It's not in the shutdown man page.
89 miss 12.14.09 at 8:42 am
crontab -e vs crontab -r is the best :)
90 marty 12.18.09 at 12:21 am
The extra space before a * is one I've done before, only the root cause was tab completion.

#rm /some/directory/FilesToBeDele[TAB]*

Thinking there were multiple files that began with FilesToBeDele. Instead, there was only one, and pressing tab put in the extra space. Luckily I was in my home dir, and there was a file with write-only permission, so rm paused to ask if I was sure. I hit ^C and wiped my brow. Of course the [TAB] is totally unnecessary in this instance, but my pinky is faster than my brain.

Copy Your Linux Install to a Different Partition or Drive

Jul 9, 2009
If you need to move your Linux installation to a different hard drive or partition (and keep it working), and your distro uses GRUB, this tech tip is what you need.

To start, get a live CD and boot into it. I prefer Ubuntu for things like this; it has GParted. Now follow the steps outlined below.

Copying

Configuration

Install Grub
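The step-by-step details of the original tip did not survive in this copy; what follows is a minimal generic sketch of the three steps for a GRUB-legacy system, with all device names as illustrative assumptions:

# Copying: duplicate the source filesystem onto the destination partition
mkdir -p /mnt/src /mnt/dst
mount /dev/sda1 /mnt/src
mount /dev/sdb1 /mnt/dst
cp -ax /mnt/src/. /mnt/dst/        # -a preserves attributes, -x stays on one filesystem

# Configuration: point the copy at its new home
vi /mnt/dst/etc/fstab              # change the root entry to the new partition
vi /mnt/dst/boot/grub/menu.lst     # fix root= and (hdX,Y) references

# Install Grub: put the boot loader on the destination drive
grub --batch <<EOF
root (hd1,0)
setup (hd1)
quit
EOF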

That's it! You should now have a bootable working copy of your source drive on your destination drive! You can use this to move to a different drive, partition, or filesystem.

Related Stories:
Linux - Compare two directories(Feb 18, 2009)
Cloning Linux Systems With CloneZilla Server Edition (CloneZilla SE)(Jan 22, 2009)
Copying a Filesystem between Computers(Oct 28, 2008)
rsnapshot: rsync-Based Filesystem Snapshot(Aug 26, 2008)
K9Copy Helps Make DVD Backups Easy(Aug 23, 2008)

Hosing Your Root Account, by S. Lee Henry

If you manage your own Unix system, you might be interested in hearing how easy it is to make your root account completely inaccessible -- and then how to fix the problem. I have landed in this situation twice in my career and, each time, ended up having to boot my Solaris box off a CD-ROM in order to gain control of it.

The first time I ran into this problem, someone else had made a typing mistake in the root user's shell in the /etc/passwd file. Instead of saying "/bin/sh", the field was made to say "/bin/sch", suggesting to me that the intent had been to switch to /bin/csh. Due to the typing mistake, however, not only could root not log in but no one could su to the root account. Instead, we got error messages like these:

    login: root
    Password:
    Login incorrect

    boson% su -
    Password:
    su: cannot run /bin/sch: No such file or directory

The second time, I rdist'ed a new set of /etc files to a new Solaris box I was setting up without realizing that the root shell on the source system had been set to /bin/tcsh. Because this offspring of the C shell is not available on most Unix boxes (and certainly isn't delivered with Solaris), I found myself facing the same situation that I had run into many years before.

I couldn't log in as root. I couldn't su to the root account. I couldn't use rcp (even from a trusted host) because it checks the shell. I could ftp a copy of tcsh, but could not make it executable. I couldn't boot the system in single user mode (it also looked for a valid shell). The only option at my disposal was to boot the system from a CD-ROM. Once I had done this, I had two choices: 1) I could mount my root partition on /a, cd to /a/etc, replace the shell in the /etc/passwd file, unmount /a, and then reboot; or 2) I could mount my root partition on /a, cd to /a/bin, chmod 755 the copy of tcsh that I had previously ftped there, unmount /a, and then reboot.
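A sketch of the first choice in commands (the Solaris device name is an illustrative assumption; it varies per system):

ok boot cdrom -s              # boot the install CD into single-user mode
# mount /dev/dsk/c0t0d0s0 /a  # mount the root slice
# vi /a/etc/passwd            # set root's shell back to /bin/sh
# umount /a
# reboot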

I fixed root's entry in the /etc/passwd file and made my new tcsh file executable to prevent any possible recurrence of the problem. To avoid these problems, I usually don't allow the root shell to be set to anything other than /bin/sh (or /bin/csh if I'm pressured into it). The Bourne shell (or bash) is generally the best shell for root because it's on every system and the system start/stop scripts (in the /etc/rc?.d or /etc/rc.d/rc?.d directories) are almost exclusively written in sh syntax. Hence, should one of these files fail to include the #!/bin/sh designator, they will still run properly.

Surprised by how easily and completely I had made my system unusable, I was left running around the office looking for the secret stash of Solaris CD-ROMs to repair the damage. By the way, changing the file on the rdist source host and rdist'ing the files a second time would not have worked because even rdist requires the root account on the system be working properly. The rdist tool is based on rcp.

Recommended Links

Unix Admin. Horror Story Summary, version 1.0 compiled by: Anatoly Ivasyuk (anatoly@nick.csh.rit.edu)
The Unofficial Unix Administration Horror Story Summary, version 1.1

Stupid user tricks: Eleven IT horror stories (InfoWorld)

Any horror stories about fired sysadmins? (r/sysadmin)

developerWorks Linux Technical library view

More 2 Cent Tips

Two Cent BASH Shell Script Tips

Lots More 2 Cent Tips...

Some great 2¢ Tips...


