Softpanorama: May the source be with you, but remember the KISS principle ;-)
	(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and bastardization of classic Unix

Accidental Shutdowns/Reboot Blunders


Introduction
What are constructive ideas in preventing rebooting of the wrong server or accidental rebooting
Molly guard script

Introduction

An accidental reboot is a serious blunder, especially if you reboot a production box in the middle of the day. Not all applications survive such an operation gracefully, especially if it happens in the middle of a write operation.

This is one of the CLMs (career-limiting moves) for any administrator. Typically it happens with remote sessions: the administrator opens a shell on one box, then opens another on a second box, gets distracted, returns to the wrong terminal window, and sends the command to the wrong machine. Nobody is perfect.

The rule before issuing any reboot command should be the classic "Stop. Think. Click".

A more constructive approach is to prevent such mistakes in interactive sessions by redefining reboot and other dangerous commands as aliases to wrapper scripts that display a warning with the name of the box and pause before proceeding.

Such scripts are usually called molly-guards. We discuss them below.

NOTE:

In RHEL 7, /sbin is a symlink to /usr/sbin
In RHEL 7, reboot and shutdown are both symlinks to systemctl
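
To see how these commands resolve on a particular host, a small helper can follow the symlink chain. This is only a minimal sketch: the function name `real_path_of` is hypothetical, and it assumes a `readlink` that supports `-f` (GNU coreutils does).

```shell
# real_path_of CMD: hypothetical helper that prints the fully resolved
# path of an external command found in PATH, following every symlink.
# Intended for executables on disk, not for shell builtins or functions.
real_path_of() {
    cmd_path=$(command -v "$1") || return 1
    readlink -f "$cmd_path"
}

# Example calls (each prints one path, or nothing if the command is absent):
#   real_path_of reboot
#   real_path_of shutdown
```

If the note above holds for the host in question, both example calls should print the path of systemctl.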

What are constructive ideas in preventing rebooting of the wrong server or accidental rebooting

There are several such ideas.

Write a wrapper that introduces a delay between submission and execution of the command and asks for the server name. You can also simply rename the original commands: for example, if the server is test1, rename the reboot command to reboot_test1. After that you can alias reboot to the wrapper:
```
alias reboot='/root/bin/safereboot'
```
Here safereboot is a simple script that runs instead of the original command, something like the following:

#!/bin/bash
#: safereboot: verify the name of the server before reboot
#: Nikolai Bezroukov, 2013-2017. Released under Artistic license
#: Version 3.5 (October, 2017)
#:
#: Invocation:
#: safereboot [short name of the server]
#

   DEBUG=0
   SHUTDOWN=$(which shutdown)
   # In debug mode only print the command that would have been run
   (( DEBUG > 0 )) && SHUTDOWN='echo shutdown'

   HOST=$(hostname -s)

   # A lock file blocks reboots of this server entirely
   if [[ -f /root/noreboot ]] ; then
      echo "Reboot of this server is currently prohibited. Please remove the file /root/noreboot to continue..."
      exit 255
   fi

# Ask the operator to type the short host name; exit if it does not match
function verify_name
{
   if [[ "$HOSTNAME" != "$HOST" ]] ; then
      echo "The current server is $HOST (full name $HOSTNAME). Please enter the short name if this is the server you intend to reboot: "
   else
      echo "The current server is $HOST. Please enter this name if this is the server you intend to reboot: "
   fi
   read -r answer
   if [[ "$answer" != "$HOST" ]]; then
      echo "Wrong answer $answer. Reboot cancelled..."
      exit 255
   fi
   return 0
}

# Reboot at once if only root is logged in; otherwise give a one-minute grace period
function rebootme
{
     users=$(who | grep -v '^root ' | wc -l)
     if (( users == 0 )) ; then
       $SHUTDOWN -r now
     else
       echo "ATTENTION: There are $users non-root sessions on this system"
       who
       echo "You have 1 min to change your mind... Use shutdown -c to cancel"
       $SHUTDOWN -r +1
     fi
}

# The host name given as an argument counts as confirmation
if [[ "$1" = "$HOST" ]] ; then
    rebootme
    exit 0
fi
verify_name
(( $? == 0 )) && rebootme
exit 0
This can be done in five minutes and is definitely better than no protection at all.

Use files or environment variables (such as SSH_CONNECTION, which contains the IP address from which the user connected to the server) to check for and prevent reboots of important infrastructure servers. For example, the file /root/noreboot is used for this purpose in the wrapper above.
The original programs can be moved to a subdirectory outside the PATH, such as /sbin/Caution (in this case scripts should use init or systemctl). This does not work on RHEL 6, because the reboot-related commands in /usr/bin are symlinks to consolehelper, which controls access to these commands through PAM.
Use extended attributes to protect vital directories. For system directories this is a little cumbersome because it affects patching: you need to remove the extended attributes before patching and reinstate them afterwards.
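
The SSH_CONNECTION idea from the list above can be sketched as follows. The variable holds four space-separated fields (client IP, client port, server IP, server port). The function names and the ADMIN_HOSTS allow-list are hypothetical, introduced here only for illustration.

```shell
# Hypothetical allow-list of addresses from which a reboot may be requested.
ADMIN_HOSTS="${ADMIN_HOSTS:-10.0.0.5 10.0.0.6}"

# Print the client IP address of the current SSH session, or fail if
# SSH_CONNECTION is not set (for example, on the local console).
ssh_client_ip() {
    [ -n "${SSH_CONNECTION:-}" ] || return 1
    echo "${SSH_CONNECTION%% *}"
}

# Succeed for non-SSH sessions and for SSH sessions from an allowed address.
reboot_allowed_from_here() {
    client=$(ssh_client_ip) || return 0   # not an SSH session
    for allowed in $ADMIN_HOSTS; do
        [ "$client" = "$allowed" ] && return 0
    done
    return 1
}
```

A wrapper could call `reboot_allowed_from_here || exit 255` before doing anything else. Keep in mind that the variable may be missing even in a remote session (sudo, for instance, can strip it), so this check is a convenience and not a security boundary.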

Molly guard script

The name is interesting and has historical significance. It was originally used for the plexiglass covers improvised for the BRS (Big Red Switch) on an IBM 4341 mainframe after a programmer's toddler daughter (named Molly) tripped it twice in one day. It was later generalized to covers over stop/reset switches on disk drives and networking equipment.

Ubuntu and Debian have a shell script called molly-guard, which guards against accidental shutdowns and reboots (a port exists for RHEL 5 and RHEL 6, but not for RHEL 7). It uses an interesting trick: if /usr/sbin comes before /sbin in your PATH (on systems that have two separate directories; this does not apply to RHEL 7 ;-), your own script is invoked instead of the system executable when a command like reboot is submitted.

This script was written in 2008 by Martin F. Krafft and has not changed much since then.
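
Whether the PATH trick can work in a given session depends on the order of directories in PATH. A minimal sketch of such a check; the function name `path_precedes` is hypothetical and not part of molly-guard.

```shell
# path_precedes FIRST SECOND [LIST]: hypothetical helper. Succeeds if
# directory FIRST appears in the colon-separated LIST (default: $PATH)
# and SECOND is either absent or appears later.
path_precedes() {
    first_pos=0
    second_pos=0
    pos=0
    old_ifs=$IFS
    IFS=:
    for dir in ${3:-$PATH}; do
        pos=$((pos + 1))
        if [ "$dir" = "$1" ] && [ "$first_pos" -eq 0 ]; then first_pos=$pos; fi
        if [ "$dir" = "$2" ] && [ "$second_pos" -eq 0 ]; then second_pos=$pos; fi
    done
    IFS=$old_ifs
    [ "$first_pos" -gt 0 ] || return 1
    [ "$second_pos" -eq 0 ] || [ "$first_pos" -lt "$second_pos" ]
}

if path_precedes /usr/sbin /sbin; then
    echo "wrappers in /usr/sbin would be found before /sbin"
else
    echo "wrappers in /usr/sbin would NOT be found first"
fi
```

The comparison is purely textual, so it says nothing about systems where the two directories are the same place on disk.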

NAME

       molly-guard - guard against accidental shutdowns/reboots

SYNOPSIS

       shutdown [-hV] [--molly-guard-do-nothing] [-- script_options]

       halt [-hV] [--molly-guard-do-nothing] [-- script_options]

       reboot [-hV] [--molly-guard-do-nothing] [-- script_options]

       poweroff [-hV] [--molly-guard-do-nothing] [-- script_options]

DESCRIPTION

       molly-guard attempts to prevent you from accidentally shutting down or
       rebooting machines. It does this by injecting a couple of checks before
       the existing commands: halt, reboot, shutdown, and poweroff.

       This happens via scripts with the same names in /usr/sbin, so it only
       works if you have /usr/sbin before /sbin in your PATH!

       Before molly-guard invokes the real command, all scripts in
       /etc/molly-guard/run.d/ have to run and exit successfully; else, it
       aborts the command.  run-parts(1) is used to process the directory.

       molly-guard passes any script_options to the scripts, and also
       populates the environment with the following variables:

       ·   MOLLYGUARD_CMD - the actual command invoked by the user.

       ·   MOLLYGUARD_DO_NOTHING - set to 1 if this is a demo-run.

       ·   MOLLYGUARD_SETTINGS - the path to a shell script snippet which
           scripts can source to obtain settings.

       molly-guard prints the contents of /etc/molly-guard/messages.d/COMMAND
       or /etc/molly-guard/messages.d/default to the console, if either
       exists. This is due to /etc/molly-guard/run.d/10-print-message.

GUARDING SSH SESSIONS

       molly-guard was primarily designed to shield SSH connections. This
       functionality (which should arguably be provided by the openssh-server
       package) is implemented in /etc/molly-guard/run.d/30-query-hostname.

       This script first tests whether the command is being executed from a
       tty which has been created by sshd. It also checks whether the variable
       SSH_CONNECTION is defined. If any of these tests are successful, the
       script queries the user for the machine's hostname, which should be
       sufficient to prevent the user from doing something by accident.

       You can pass the --pretend-ssh script option to molly-guard to pretend
       that those tests succeed. Alternatively, setting ALWAYS_QUERY_HOSTNAME
       in /etc/molly-guard/rc causes the script to always query.

       The following situations are still UNGUARDED. If you can think of ways
       to protect against those, please let me know!

       ·   running sudo within screen or screen within sudo; sudo eats the
           SSH_CONNECTION variable, and screen creates a new pty.

       ·   executing those command in a remote terminal window, that is a
           XTerm started on a remote machine but displaying on the local X
           server.

       You have been warned. You can use the --molly-guard-do-nothing switch
       to prevent anything from happening, e.g.  halt
       --molly-guard-do-nothing.

OPTIONS

       --molly-guard-do-nothing
           Cause molly-guard to print the command which would be executed,
           after processing all scripts, instead of executing it.

       -h, --help
           Display usage information.

       -V, --version
           Display version information.

LEGALESE

       molly-guard is copyright by martin f. krafft. Andrew Ruthven came up
       with the idea of using the scripts directory and submitted a patch,
       which I modified a bit.

       This manual page was written by martin f. krafft <[email protected]>.

       Permission is granted to copy, distribute and/or modify this document
       under the terms of the Artistic License 2.0

COPYRIGHT

       Copyright © 2008 martin f. krafft
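
As the DESCRIPTION section above explains, any script placed in /etc/molly-guard/run.d/ can veto the guarded command by exiting with a non-zero status. Here is a minimal sketch of such a check, reusing the /root/noreboot lock-file convention from the wrapper earlier on this page. The file name 20-check-noreboot and the NOREBOOT_FILE variable are hypothetical and not part of molly-guard.

```shell
#!/bin/sh
# Hypothetical /etc/molly-guard/run.d/20-check-noreboot
# Vetoes the guarded command while a lock file exists.

# check_noreboot: fail (and explain why) if the lock file is present.
check_noreboot() {
    lock_file=${NOREBOOT_FILE:-/root/noreboot}
    if [ -e "$lock_file" ]; then
        echo "E: ${MOLLYGUARD_CMD:-command} is prohibited while $lock_file exists." >&2
        return 1
    fi
    return 0
}

# The exit status of the last command becomes the status of the script.
check_noreboot
```

MOLLYGUARD_CMD is the variable documented in the man page above. The `:-command` fallback only keeps the message readable when the script is tried by hand outside molly-guard.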

The script itself is pretty short: just 126 lines of shell code.

#!/bin/sh
#
# shutdown -- wrapper script to guard against accidental shutdowns
#
# Copyright © martin f. krafft 
# Released under the terms of the Artistic Licence 2.0
#
set -eu

ME=molly-guard
VERSION=0.4

SCRIPTSDIR="@cfgdir@/run.d"

CMD="${0##*/}"
EXEC="@REALPATH@/$CMD"

case "$CMD" in
  halt|reboot|shutdown|poweroff|coldreboot|pm-hibernate|pm-suspend|pm-suspend-hybrid)
    if [ ! -f $EXEC ]; then
      echo "E: not a regular file: $EXEC" >&2
      exit 4
    fi
    if [ ! -x $EXEC ]; then
      echo "E: not an executable: $EXEC" >&2
      exit 3
    fi
    ;;
  *)
    echo "E: unsupported command: $CMD" >&2
    exit 1
    ;;
esac

usage()
{
  cat <<-_eousage
	Usage: $ME [options] [-- script options]
	       (shielding $EXEC)
	
	molly-guard's primary goal is to guard against accidental
	shutdowns/reboots. $ME will run all scripts in $SCRIPTSDIR and only
	invokes $EXEC if all scripts exited successfully.

	Specifying --molly-guard-do-nothing as argument to the command will
	make $ME echo the command it would execute rather than actually
	executing it.

	Options following the double hyphen will be passed unchanged to the
	scripts.

	Please see molly-guard(8) for more information.

	The actual command's help output follows:

	_eousage
}

CMDARGS=
SCRIPTARGS=
END_OF_ARGS=0
DO_NOTHING=0
for arg in "$@"; do
  case "$arg" in
    (*-molly-guard-do-nothing) DO_NOTHING=1;;
    (*-help)
      usage 2>&1
      eval $EXEC --help 2>&1
      exit 0
      ;;
    --) END_OF_ARGS=1;;
    *\"*)
      echo 'E: cannot use double-quotes (") in arguments' >&2
      exit 1
      ;;
    *)
      if [ $END_OF_ARGS -eq 0 ]; then
        CMDARGS="${CMDARGS:+$CMDARGS }\"$arg\""
      else
        SCRIPTARGS="${SCRIPTARGS:+$SCRIPTARGS }--arg \"$arg\""
      fi
      ;;
  esac
done

do_real_cmd()
{
  if [ $DO_NOTHING -eq 1 ]; then
    echo "$ME: would run: $EXEC $CMDARGS"
    exit 0
  else
    eval exec $EXEC "$CMDARGS"
  fi
}

if [ $DO_NOTHING -eq 1 ]; then
  echo "I: demo mode; $ME will not do anything due to --molly-guard-do-nothing." >&2
fi

if [ -n "${MOLLYGUARD_CMD:-}" ]; then
  do_real_cmd
fi

MOLLYGUARD_CMD=$CMD; export MOLLYGUARD_CMD
MOLLYGUARD_DO_NOTHING=$DO_NOTHING; export MOLLYGUARD_DO_NOTHING
MOLLYGUARD_SETTINGS="@cfgdir@/rc"; export MOLLYGUARD_SETTINGS

# pass through certain commands
case "$CMD $CMDARGS" in
  (*shutdown\ *-c*|*halt\ *-w*|*halt\ *-f*|*reboot\ *-f*)
    # allow canceling shutdowns, only write wtmp and force immediate halt
    echo "I: executing $CMD $CMDARGS regardless of check results." >&2
    do_real_cmd
    ;;
esac

for script in $(run-parts --test $SCRIPTSDIR); do
  ret=0
  eval $script $SCRIPTARGS || ret=$?
  if [ $ret -ne 0 ]; then
    echo "W: aborting $CMD due to ${script##*/} exiting with code $ret." >&2
    exit $ret
  fi
done

do_real_cmd

It uses only one important "rules file", which can definitely be greatly simplified to just asking for the server name in all cases.

Analysis of the process tree can be performed with pstree from the package psmisc.x86_64, which is available from the standard repositories in RHEL, or by compiling ftp://ftp.thp.uni-duisburg.de/pub/source/pstree.tar.gz
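
Where pstree is not installed, the same ancestry can be read directly from /proc. This is a minimal sketch, assuming Linux with /proc mounted; the function name `ancestor_pids` is hypothetical.

```shell
# ancestor_pids PID: hypothetical helper. Prints PID and then each
# ancestor PID, one per line, by following the PPid: field in
# /proc/<pid>/status until the parent is 0 or cannot be read.
ancestor_pids() {
    pid=$1
    while [ -n "$pid" ] && [ "$pid" -gt 0 ] 2>/dev/null; do
        echo "$pid"
        [ -r "/proc/$pid/status" ] || break
        pid=$(sed -n 's/^PPid:[[:space:]]*//p' "/proc/$pid/status") || break
    done
}

# Show the command line of the current shell and of every ancestor.
for p in $(ancestor_pids $$); do
    printf '%s\t%s\n' "$p" "$(tr '\0' ' ' 2>/dev/null < "/proc/$p/cmdline")"
done
```

In an SSH session one of the printed command lines would normally belong to sshd, which is the kind of evidence a guard script can look for.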

#!/bin/sh
#
# 30-ask-hostname - request the user to type in the hostname of the local host
#
# Copyright © 2006-2009 martin f. krafft 
# Copyright © 2012 Ludovico Gardenghi 
# Copyright © 2014 Josh Triplett 
# Copyright © 2015 Francois Marier 
# Copyright © 2017 Simó Albert i Beltran 
# Released under the terms of the Artistic Licence 2.0
#
set -eu

ME=molly-guard

# Walk up the process tree until PID 1 is reached or a process with 'sshd' in
# its /proc/<pid>/cmdline is met. Return success if such a process is found.
is_child_of_sshd_or_mosh_server() {
  pid=$$
  ppid=$PPID
  # Be a bit paranoid with the guard, should some horribly broken system
  # provide a strange process hierarchy. '[ $pid -ne 1 ]' should be enough for
  # sane systems.
  [ -z "$pid" ] || [ -z "$ppid" ] && return 2
  while [ $pid -gt 1 ] && [ $pid -ne $ppid ]; do
    if egrep -q 'sshd|mosh-server' /proc/$ppid/cmdline; then
      return 0
    fi
    pid=$ppid
    ppid=$(grep ^PPid: /proc/$pid/status | tr -dc 0-9)
  done
  return 1
}

[ -f "$MOLLYGUARD_SETTINGS" ] && . "$MOLLYGUARD_SETTINGS"

PRETEND_SSH=0
for arg in "$@"; do
  case "$arg" in
    (*-pretend-ssh) PRETEND_SSH=1;;
  esac
done

# require an interactive terminal connected to stdin
test -t 0 || exit 0

# we've been asked to always protect this host
case "${ALWAYS_QUERY_HOSTNAME:-0}" in
  0|false|False|no|No|off|Off)
    # only run if we are being called over SSH, that is if the current terminal
    # was created by sshd.
    command -v tty >/dev/null 2>&1 || exit 0
    PTS=$(tty)
    if ! pgrep -f "^sshd.+${PTS#/dev/}\>" >/dev/null \
      && [ -z "${SSH_CONNECTION:-}" ] \
      && ! is_child_of_sshd_or_mosh_server; then
        if [ $PRETEND_SSH -eq 1 ]; then
          echo "I: $ME: this is not an SSH session, but --pretend-ssh was given..." >&2
        else
          exit 0
        fi
    else
      echo "W: $ME: SSH session detected!" >&2
    fi
    ;;
  *)
    echo "I: $ME: $MOLLYGUARD_CMD is always molly-guarded on this system." >&2
    ;;
esac

case "${USE_FQDN:-0}" in
  0|false|False|no|No|off|Off)
    HOSTNAME="$(hostname --short)"
    ;;
  *)
    HOSTNAME="$(hostname --fqdn)"
    ;;
esac

sigh()
{
  echo "Good thing I asked; I won't $MOLLYGUARD_CMD $HOSTNAME ..." >&2
  exit 1
}

trap 'echo;sigh' 1 2 3 9 10 12 15

echo -n "Please type in hostname of the machine to $MOLLYGUARD_CMD: "
read HOSTNAME_USER || :

HOSTNAME="$(echo "$HOSTNAME" | tr '[:upper:]' '[:lower:]')"
HOSTNAME_USER="$(echo "$HOSTNAME_USER" | tr '[:upper:]' '[:lower:]')"

[ "$HOSTNAME_USER" = "$HOSTNAME" ] || sigh

trap - 1 2 3 9 10 12 15

exit 0


NEWS CONTENTS

20190301 : Molly-guard for CentOS 7 UoB Unix by dg12158 ( Sep 21, 2015 , bris.ac.uk )
20190301 : molly-guard protects machines from accidental shutdowns-reboots by ruchi ( Nov 28, 2009 , www.ubuntugeek.com )
20190301 : Confirm before executing shutdown-reboot command on linux by Ilija Matoski ( Oct 23, 2017 , matoski.com )
20190129 : hardware - Is post-sudden-power-loss filesystem corruption on an SSD drive's ext3 partition expected behavior ( Dec 04, 2012 , serverfault.com )
20190129* xfs corrupted after power failure ( Oct 15, 2013 , www.linuxquestions.org ) [Recommended]
20190129 : an HVAC tech that confused the BLACK button that got pushed to exit the room with the RED button clearly marked EMERGENCY POWER OFF. ( Jan 29, 2019 , thwack.solarwinds.com )
20190129 : HVAC units greatly help to increase reliability ( Jan 29, 2019 , thwack.solarwinds.com )
20190129 : In a former life, I had every server crash over the weekend when the facilities group took down the climate control and HVAC systems without warning ( Jan 29, 2019 , thwack.solarwinds.com )
20190129 : [SOLVED] Unable to mount root file system after a power failure ( Jan 29, 2019 , www.linuxquestions.org )
20190128* Testing backup system as the main source of power outages ( Jan 28, 2019 , thwack.solarwinds.com ) [Recommended]
20190128 : False alarm: bad smell in machine room due to electrical light not a server ( Jan 28, 2019 , www.reddit.com )
20190128 : Loss of power problems: Machines are running, but every switch in the cabinet is dead. Some servers are dead. Panic sets in. ( Jan 28, 2019 , www.reddit.com )
20190128 : That's how I learned to always check with somebody else before rebooting a production server, no matter how minor it may seem ( Jan 28, 2019 , www.reddit.com )
20190114 : Safe rm stops you accidentally wiping the system! @ New Zealand Linux ( Jan 14, 2019 , www.nzlinux.com )
20181005 : Due to a configuration change I wasn't privy to, the software I was responsible for rebooted all the 911 operators servers at once ( Oct 05, 2018 , www.reddit.com )
20100616 : Prevent Accidental Shutdown/Reboot in Ubuntu ( Jan 15, 2010 , Linux Today )
20100614 : IT Resource Center forums - greatest blunders ( IT Resource Center forums - greatest blunders, Jun 14, 2010 )
20100609 : horror - University of Cambridge Computing Service - Unix Support ( horror - University of Cambridge Computing Service - Unix Support, Jun 9, 2010 )
20100609 : My 10 UNIX Command Line Mistakes by Vivek Gite ( Selected Comments )
20090521 : Accidental shutdown ( Accidental shutdown, May 21, 2009 )

Old News ;-)

[Mar 01, 2019] Molly-guard for CentOS 7 UoB Unix by dg12158

Sep 21, 2015 | bris.ac.uk

Since I was looking at this already and had a few things to investigate and fix in our systemd-using hosts, I checked how plausible it is to insert a molly-guard-like password prompt as part of the reboot/shutdown process on CentOS 7 (i.e. using systemd).

Problems encountered include:

Asking for a password from a service/unit in systemd -- Use systemd-ask-password and needs some agent setup to reply to this correctly?

The reboot command always walls a message to all logged in users before it even runs the new reboot-molly unit, as it expects a reboot to happen. The argument --no-wall stops this but that requires a change to the reboot command. Hence back to the original problem of replacing packaged files/symlinks with RPM

The reboot.target unit is a "systemd.special" unit, which means that it has some special behaviour and cannot be renamed. We can modify it, of course, by editing the reboot.target file.

How do we get a systemd unit to run first and block anything later from running until it is complete? (In fact to abort the reboot but just for this time rather than being set as permanently failed. Reboot failing is a bit of a strange situation for it to be in ) The dependencies appear to work but the reboot target is quite keen on running other items from the dependency list -- I'm more than likely doing something wrong here!

So for now this is shelved. It would be nice to have a solution though, so any hints from systemd experts are gratefully received!

(Note that CentOS 7 uses systemd 208, so new features in later versions which help won't be available to us)

[Mar 01, 2019] molly-guard protects machines from accidental shutdowns-reboots by ruchi

Nov 28, 2009 | www.ubuntugeek.com
molly-guard installs a shell script that overrides the existing shutdown/reboot/halt/poweroff commands and first runs a set of scripts, which all have to exit successfully, before molly-guard invokes the real command.
One of the scripts checks for existing SSH sessions. If any of the four commands are called interactively over an SSH session, the shell script prompts you to enter the name of the host you wish to shut down. This should adequately prevent you from accidental shutdowns and reboots.

This shell script passes through the commands to the respective binaries in /sbin and should thus not get in the way if called non-interactively, or locally.

The tool is basically a replacement for halt, reboot and shutdown to prevent such accidents.

Install molly-guard in ubuntu

sudo apt-get install molly-guard

or click on the following link

apt://molly-guard

Now that it's installed, try it out (on a non production box). Here you can see it save me from rebooting the box Ubuntu-test

Ubuntu-test:~$ sudo reboot
W: molly-guard: SSH session detected!
Please type in hostname of the machine to reboot: ruchi
Good thing I asked; I won't reboot Ubuntu-test ...
W: aborting reboot due to 30-query-hostname exiting with code 1.
Ubuntu-Test:~$

By default you're only protected on sessions that look like SSH sessions (have $SSH_CONNECTION set). If, like us, you use a lot of virtual machines and RILOE cards, edit /etc/molly-guard/rc and uncomment ALWAYS_QUERY_HOSTNAME=true. Now you should be prompted for any interactive session.
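
The edit described above can also be scripted. This is a minimal sketch, assuming the rc file contains a commented-out ALWAYS_QUERY_HOSTNAME=true line. The function name `enable_always_query` is hypothetical, and the file path is a parameter so the change can be tried on a copy first.

```shell
# enable_always_query [RCFILE]: hypothetical helper that uncomments the
# ALWAYS_QUERY_HOSTNAME=true line in a molly-guard rc file.
enable_always_query() {
    rc_file=${1:-/etc/molly-guard/rc}
    # Keep a backup as RCFILE.bak, then strip the leading "#" from the setting.
    sed -i.bak 's/^#[[:space:]]*\(ALWAYS_QUERY_HOSTNAME=true\)/\1/' "$rc_file"
}

# Example on a scratch copy:
#   cp /etc/molly-guard/rc /tmp/rc.test && enable_always_query /tmp/rc.test
```

Other lines in the file, including ordinary comments, are left untouched.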

[Mar 01, 2019] Confirm before executing shutdown-reboot command on linux by Ilija Matoski

Oct 23, 2017 | matoski.com
rushing to leave and was still logged into a server so I wanted to shutdown my laptop, but what I didn't notice is that I was still connected to the remote server. Luckily before pressing enter I noticed I'm not on my machine but on a remote server. So I was thinking there should be a very easy way to prevent it from happening again, to me or to anyone else.
So the first thing we need to do is create a new bash script at /usr/local/bin/confirm with the contents below and with execute permissions
#!/usr/bin/env bash
echo "About to execute $1 command"
echo -n "Would you like to proceed y/n? "
read reply

if [ "$reply" = y -o "$reply" = Y ]
then
   $1 "${@:2}"
else
   echo "$1 ${@:2} cancelled"
fi
Now the only thing left to do is to set up the aliases so they go through this command to confirm instead of calling the command directly.

So I create the following files

/etc/profile.d/confirm-shutdown.sh
alias shutdown="/usr/local/bin/confirm /sbin/shutdown"
/etc/profile.d/confirm-reboot.sh
alias reboot="/usr/local/bin/confirm /sbin/reboot"
Now when I actually try to do a shutdown/reboot it will prompt me like so.
ilijamt@x1 ~ $ reboot 
Before proceeding to perform /sbin/reboot, please ensure you have approval to perform this task
Would you like to proceed y/n? n
/sbin/reboot  cancelled
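
A variation on the confirm script above, offered only as a sketch. It is written as a shell function (the name `confirm_run` is hypothetical) and shows the host name so the operator sees which machine is affected. It passes the command and its arguments through "$@" so that arguments containing spaces survive intact.

```shell
# confirm_run CMD [ARGS...]: hypothetical helper. Shows the host name and
# the command, then runs the command only if the reply is y or Y.
# Prompts go to stderr so they do not mix with the command's own output.
confirm_run() {
    printf 'About to execute on %s: %s\n' "$(uname -n)" "$*" >&2
    printf 'Would you like to proceed y/n? ' >&2
    read -r reply || reply=n          # treat end of input as "no"
    case "$reply" in
        y|Y) "$@" ;;
        *)   echo "$* cancelled" >&2; return 1 ;;
    esac
}

# Example alias in the style used above:
#   alias reboot='confirm_run /sbin/reboot'
```

Like any alias-based guard, this protects only interactive shells that have loaded the definition. Commands run through sudo or from scripts bypass it.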

[Jan 29, 2019] hardware - Is post-sudden-power-loss filesystem corruption on an SSD drive's ext3 partition expected behavior

Dec 04, 2012 | serverfault.com

My company makes an embedded Debian Linux device that boots from an ext3 partition on an internal SSD drive. Because the device is an embedded "black box", it is usually shut down the rude way, by simply cutting power to the device via an external switch.

This is normally okay, as ext3's journalling keeps things in order, so other than the occasional loss of part of a log file, things keep chugging along fine.

However, we've recently seen a number of units where after a number of hard-power-cycles the ext3 partition starts to develop structural issues -- in particular, we run e2fsck on the ext3 partition and it finds a number of issues like those shown in the output listing at the bottom of this Question. Running e2fsck until it stops reporting errors (or reformatting the partition) clears the issues.

My question is... what are the implications of seeing problems like this on an ext3/SSD system that has been subjected to lots of sudden/unexpected shutdowns?

My feeling is that this might be a sign of a software or hardware problem in our system, since my understanding is that (barring a bug or hardware problem) ext3's journalling feature is supposed to prevent these sorts of filesystem-integrity errors. (Note: I understand that user-data is not journalled and so munged/missing/truncated user-files can happen; I'm specifically talking here about filesystem-metadata errors like those shown below)

My co-worker, on the other hand, says that this is known/expected behavior because SSD controllers sometimes re-order write commands and that can cause the ext3 journal to get confused. In particular, he believes that even given normally functioning hardware and bug-free software, the ext3 journal only makes filesystem corruption less likely, not impossible, so we should not be surprised to see problems like this from time to time.

Which of us is right?
Embedded-PC-failsafe:~# ls
Embedded-PC-failsafe:~# umount /mnt/unionfs
Embedded-PC-failsafe:~# e2fsck /dev/sda3
e2fsck 1.41.3 (12-Oct-2008)
embeddedrootwrite contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Invalid inode number for '.' in directory inode 46948.
Fix<y>? yes

Directory inode 46948, block 0, offset 12: directory corrupted
Salvage<y>? yes

Entry 'status_2012-11-26_14h13m41.csv' in /var/log/status_logs (46956) has deleted/unused inode 47075.  Clear<y>? yes
Entry 'status_2012-11-26_10h42m58.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47076.  Clear<y>? yes
Entry 'status_2012-11-26_11h29m41.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47080.  Clear<y>? yes
Entry 'status_2012-11-26_11h42m13.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47081.  Clear<y>? yes
Entry 'status_2012-11-26_12h07m17.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47083.  Clear<y>? yes
Entry 'status_2012-11-26_12h14m53.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47085.  Clear<y>? yes
Entry 'status_2012-11-26_15h06m49.csv' in /var/log/status_logs (46956) has deleted/unused inode 47088.  Clear<y>? yes
Entry 'status_2012-11-20_14h50m09.csv' in /var/log/status_logs (46956) has deleted/unused inode 47073.  Clear<y>? yes
Entry 'status_2012-11-20_14h55m32.csv' in /var/log/status_logs (46956) has deleted/unused inode 47074.  Clear<y>? yes
Entry 'status_2012-11-26_11h04m36.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47078.  Clear<y>? yes
Entry 'status_2012-11-26_11h54m45.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47082.  Clear<y>? yes
Entry 'status_2012-11-26_12h12m20.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47084.  Clear<y>? yes
Entry 'status_2012-11-26_12h33m52.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47086.  Clear<y>? yes
Entry 'status_2012-11-26_10h51m59.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47077.  Clear<y>? yes
Entry 'status_2012-11-26_11h17m09.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47079.  Clear<y>? yes
Entry 'status_2012-11-26_12h54m11.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47087.  Clear<y>? yes

Pass 3: Checking directory connectivity
'..' in /etc/network/run (46948) is <The NULL inode> (0), should be /etc/network (46953).
Fix<y>? yes

Couldn't fix parent of inode 46948: Couldn't find parent directory entry

Pass 4: Checking reference counts
Unattached inode 46945
Connect to /lost+found<y>? yes

Inode 46945 ref count is 2, should be 1.  Fix<y>? yes
Inode 46953 ref count is 5, should be 4.  Fix<y>? yes

Pass 5: Checking group summary information
Block bitmap differences:  -(208264--208266) -(210062--210068) -(211343--211491) -(213241--213250) -(213344--213393) -213397 -(213457--213463) -(213516--213521) -(213628--213655) -(213683--213688) -(213709--213728) -(215265--215300) -(215346--215365) -(221541--221551) -(221696--221704) -227517
Fix<y>? yes

Free blocks count wrong for group #6 (17247, counted=17611).
Fix<y>? yes

Free blocks count wrong (161691, counted=162055).
Fix<y>? yes

Inode bitmap differences:  +(47089--47090) +47093 +47095 +(47097--47099) +(47101--47104) -(47219--47220) -47222 -47224 -47228 -47231 -(47347--47348) -47350 -47352 -47356 -47359 -(47457--47488) -47985 -47996 -(47999--48000) -48017 -(48027--48028) -(48030--48032) -48049 -(48059--48060) -(48062--48064) -48081 -(48091--48092) -(48094--48096)
Fix<y>? yes

Free inodes count wrong for group #6 (7608, counted=7624).
Fix<y>? yes

Free inodes count wrong (61919, counted=61935).
Fix<y>? yes


embeddedrootwrite: ***** FILE SYSTEM WAS MODIFIED *****

embeddedrootwrite: ********** WARNING: Filesystem still has errors **********

embeddedrootwrite: 657/62592 files (24.4% non-contiguous), 87882/249937 blocks

Embedded-PC-failsafe:~# 
Embedded-PC-failsafe:~# e2fsck /dev/sda3
e2fsck 1.41.3 (12-Oct-2008)
embeddedrootwrite contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Directory entry for '.' in ... (46948) is big.
Split<y>? yes

Missing '..' in directory inode 46948.
Fix<y>? yes

Setting filetype for entry '..' in ... (46948) to 2.
Pass 3: Checking directory connectivity
'..' in /etc/network/run (46948) is <The NULL inode> (0), should be /etc/network (46953).
Fix<y>? yes

Pass 4: Checking reference counts
Inode 2 ref count is 12, should be 13.  Fix<y>? yes

Pass 5: Checking group summary information

embeddedrootwrite: ***** FILE SYSTEM WAS MODIFIED *****
embeddedrootwrite: 657/62592 files (24.4% non-contiguous), 87882/249937 blocks
Embedded-PC-failsafe:~# 
Embedded-PC-failsafe:~# e2fsck /dev/sda3
e2fsck 1.41.3 (12-Oct-2008)
embeddedrootwrite: clean, 657/62592 files, 87882/249937 blocks
Tags: filesystems, hardware, ssd, ext3. Asked Dec 4 '12 at 1:13 by Jeremy Friesner; edited Dec 5 '12 at 18:40 by ewwhite.

Have you all thought of changing to ext4 or ZFS? – mdpc Dec 4 '12 at 2:14

I've thought about changing to ext4, at least... would that help address this issue? Would ZFS be better still? – Jeremy Friesner Dec 4 '12 at 2:17

Neither option would fix this. We still use devices with supercapacitors in ZFS, and battery or flash-protected cache is recommended for ext4 in server applications. – ewwhite Dec 4 '12 at 2:54

2 Answers

You're both wrong (maybe?)... ext3 is coping the best it can with having its underlying storage removed so abruptly.
Your SSD probably has some type of onboard cache. You don't mention the make/model of SSD in use, but this sounds like a consumer-level SSD versus an enterprise or industrial-grade model.

Either way, the cache is used to help coalesce writes and prolong the life of the drive. If there are writes in-transit, the sudden loss of power is definitely the source of your corruption. True enterprise and industrial SSD's have supercapacitors that maintain power long enough to move data from cache to nonvolatile storage, much in the same way battery-backed and flash-backed RAID controller caches work.

If your drive doesn't have a supercap, the in-flight transactions are being lost, hence the filesystem corruption. ext3 is probably being told that everything is on stable storage, but that's just a function of the cache. – ewwhite, answered Dec 4 '12 at 1:24 (edited Apr 13 '17 at 12:14)

Sorry to you and everyone who upvoted this, but you're just wrong. Handling the loss of in progress writes is exactly what the journal is for, and as long as the drive correctly reports whether it has a write cache and obeys commands to flush it, the journal guarantees that the metadata will not be inconsistent. You only need a supercap or battery backed raid cache so you can enable write cache without having to enable barriers, which sacrifices some performance to maintain data correctness. – psusi Dec 5 '12 at 19:12

@psusi The SSD in use probably has cache explicitly enabled or relies on an internal buffer regardless of the OS_level setting. The data in that cache is what a supercapacitor-enabled SSD would protect. – ewwhite Dec 5 '12 at 19:30

The data in the cache doesn't need protecting if you enable IO barriers. Most consumer type drives ship with write caching disabled by default and you have to enable it if you want it, exactly because it causes corruption issues if the OS is not careful. – psusi Dec 5 '12 at 19:35

@psusi Old now, but you mention this: "as long as the drive correctly reports whether it has a write cache and obeys commands to flush it, the journal guarantees that the metadata will not be inconsistent." That's the thing: because of storage controllers that tend to assume older disks, SSDs will sometimes lie about whether data is flushed. You do need that supercap. – Joel Coel Aug 9 '15 at 22:01

You are right and your coworker is wrong. Barring something going wrong, the journal makes sure you never have inconsistent fs metadata. You might check with hdparm to see if the drive's write cache is enabled. If it is, and you have not enabled IO barriers (off by default on ext3, on by default in ext4), then that would be the cause of the problem.
The barriers are needed to force the drive write cache to flush at the correct time to maintain consistency, but some drives are badly behaved and either report that their write cache is disabled when it is not, or silently ignore the flush commands. This prevents the journal from doing its job. – psusi, answered Dec 5 '12 at 19:09
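The hdparm check suggested in this answer could be scripted roughly as follows. This is only a sketch: the function name write_cache_state is made up for illustration, and it assumes that "hdparm -W" prints a line of the form "write-caching =  1 (on)", which may differ between hdparm versions.

```shell
# Hypothetical helper: reads the output of `hdparm -W /dev/sdX` on stdin
# and prints "on", "off", or "unknown".
write_cache_state() {
    # Keep only the first line that mentions write-caching.
    line=$(grep 'write-caching' | head -n 1)
    case "$line" in
        *"(on)"*)  echo on ;;
        *"(off)"*) echo off ;;
        *)         echo unknown ;;
    esac
}

# Usage (as root; /dev/sda is a placeholder device):
#   hdparm -W /dev/sda | write_cache_state
```

If this reports "on" for a drive under an ext3 filesystem, remounting with the barrier=1 option (for example "mount -o remount,barrier=1 /mountpoint") is the usual way to turn barriers on. As the answer notes, that only helps if the drive actually honours flush commands.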

-1 for reading-comprehension... – ewwhite Dec 5 '12 at 19:34

@ewwhite, maybe you should try reading, and actually writing a useful response instead of this childish insult. – psusi Dec 5 '12 at 19:36

+1 this answer probably could be improved, as any other answer in any QA. But at least provides some light and hints. @downvoters: improve the answer yourselves, or comment on possible flaws, but downvoting this answer without proper justification is just disgusting! – Alberto Dec 6 '12 at 21:44

[Jan 29, 2019] xfs corrupted after power failure

Highly recommended!

Oct 15, 2013 | www.linuxquestions.org

katmai90210
hi guys,

i have a problem. yesterday there was a power outage at one of my datacenters, where i have a relatively large fileserver. 2 arrays, 1 x 14 tb and 1 x 18 tb both in raid6, with a 3ware card.

after the outage, the server came back online, the xfs partitions were mounted, and everything looked okay. i could access the data and everything seemed just fine.

today i woke up to lots of i/o errors, and when i rebooted the server, the partitions would not mount:

Oct 14 04:09:17 kp4 kernel:
Oct 14 04:09:17 kp4 kernel: XFS internal error XFS_WANT_CORRUPTED_RETURN a<ffffffff80056933>] pdflush+0x0/0x1fb
Oct 14 04:09:17 kp4 kernel: [<ffffffff80056a84>] pdflush+0x151/0x1fb
Oct 14 04:09:17 kp4 kernel: [<ffffffff800cd931>] wb_kupdate+0x0/0x16a
Oct 14 04:09:17 kp4 kernel: [<ffffffff80032c2b>] kthread+0xfe/0x132
Oct 14 04:09:17 kp4 kernel: [<ffffffff8005dfc1>] child_rip+0xa/0x11
Oct 14 04:09:17 kp4 kernel: [<ffffffff800a3ab7>] keventd_create_kthread+0x0/0xc4
Oct 14 04:09:17 kp4 kernel: [<ffffffff80032b2d>] kthread+0x0/0x132
Oct 14 04:09:17 kp4 kernel: [<ffffffff8005dfb7>] child_rip+0x0/0x11
Oct 14 04:09:17 kp4 kernel:
Oct 14 04:09:17 kp4 kernel: XFS internal error XFS_WANT_CORRUPTED_RETURN at line 279 of file fs/xfs/xfs_alloc.c. Caller 0xffffffff88342331
Oct 14 04:09:17 kp4 kernel:

got a bunch of these in dmesg.

The array is fine:

[root@kp4 ~]# tw_cli
//kp4> focus c6
//kp4/c6> show

Unit  UnitType  Status  %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    OK      -       -       256K    13969.8   RiW    ON
u1    RAID-6    OK      -       -       256K    16763.7   RiW    ON

VPort  Status  Unit  Size     Type  Phy  Encl-Slot  Model
------------------------------------------------------------------------------
p0     OK      u1    2.73 TB  SATA  0    -          Hitachi HDS723030AL
p1     OK      u1    2.73 TB  SATA  1    -          Hitachi HDS723030AL
p2     OK      u1    2.73 TB  SATA  2    -          Hitachi HDS723030AL
p3     OK      u1    2.73 TB  SATA  3    -          Hitachi HDS723030AL
p4     OK      u1    2.73 TB  SATA  4    -          Hitachi HDS723030AL
p5     OK      u1    2.73 TB  SATA  5    -          Hitachi HDS723030AL
p6     OK      u1    2.73 TB  SATA  6    -          Hitachi HDS723030AL
p7     OK      u1    2.73 TB  SATA  7    -          Hitachi HDS723030AL
p8     OK      u0    2.73 TB  SATA  8    -          Hitachi HDS723030AL
p9     OK      u0    2.73 TB  SATA  9    -          Hitachi HDS723030AL
p10    OK      u0    2.73 TB  SATA  10   -          Hitachi HDS723030AL
p11    OK      u0    2.73 TB  SATA  11   -          Hitachi HDS723030AL
p12    OK      u0    2.73 TB  SATA  12   -          Hitachi HDS723030AL
p13    OK      u0    2.73 TB  SATA  13   -          Hitachi HDS723030AL
p14    OK      u0    2.73 TB  SATA  14   -          Hitachi HDS723030AL

Name  OnlineState  BBUReady  Status  Volt  Temp  Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK      OK    OK    0      xx-xxx-xxxx

i googled for solutions and i think i jumped the horse by doing
xfs_repair -L /dev/sdc
it would not clean it with xfs_repair /dev/sdc, and everybody pretty much says the same thing.

this is what i was getting when trying to mount the array.

Filesystem Corruption of in-memory data detected. Shutting down filesystem xfs_check

Did i jump the gun by using the -L switch :/ ?
jefro

Here is the RH data on that.

https://docs.fedoraproject.org/en-US...xfsrepair.html
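For readers in the same position, here is a more cautious order of operations, sketched as a dry-run script. The names run and xfs_cautious_repair, the DRY_RUN switch, and the device and mount point arguments are assumptions made for this illustration. The xfs_repair options used (-n to check without modifying, -L to zero the log) are the documented ones.

```shell
# Sketch only. With DRY_RUN=1 (the assumed default) commands are printed,
# not executed; set DRY_RUN=0 to really run them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

xfs_cautious_repair() {
    dev=$1
    mnt=$2
    # 1. Mounting replays the journal; this alone often clears the error.
    if run mount "$dev" "$mnt"; then
        run umount "$mnt"
    fi
    # 2. Check only: -n reports problems without modifying the filesystem.
    run xfs_repair -n "$dev"
    # 3. Normal repair; it refuses to proceed while the log is still dirty.
    run xfs_repair "$dev" && return 0
    # 4. Last resort: -L zeroes the log and may discard recent metadata updates.
    echo "repair failed; use 'xfs_repair -L $dev' only after imaging the device" >&2
    return 1
}

# Usage (placeholders): xfs_cautious_repair /dev/sdX /mnt/recovery
```

The point of the ordering is that -L is reached only after mounting (to replay the log) and a plain xfs_repair have both failed.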

[Jan 29, 2019] an HVAC tech that confused the BLACK button that got pushed to exit the room with the RED button clearly marked EMERGENCY POWER OFF.

Jan 29, 2019 | thwack.solarwinds.com

George Sutherland Jul 8, 2015 9:58 AM (in response to RandyBrown)

Had a similar thing happen with an HVAC tech that confused the BLACK button that got pushed to exit the room with the RED button clearly marked EMERGENCY POWER OFF. Clear plastic cover installed within 24 hours... after 3 hours of recovery!

PS... He told his boss that he did not do it.... the camera that focused on the door told a much different story. He was persona non grata at our site after that.

[Jan 29, 2019] HVAC units greatly help to increase reliability

Jan 29, 2019 | thwack.solarwinds.com

sleeper_777 Jul 15, 2015 1:07 PM

Worked at a bank. 6" raised floor. Liebert cooling units on floor with all network equipment. Two units developed a water drain issue over a weekend.

About an hour into Monday morning, devices, servers, and routers, in a domino effect, started shorting out and shutting down or blowing up, literally.

Opened the floor tiles to find three inches of water.

We did not have water alarms on the floor at the time.

Shortly after the incident, we did.

But the mistake was very costly and multiple 24 hour shifts of IT people made it a week of pure h3ll.

[Jan 29, 2019] In a former life, I had every server crash over the weekend when the facilities group took down the climate control and HVAC systems without warning

Jan 29, 2019 | thwack.solarwinds.com

aaronleet Jul 13, 2015 8:45 AM

In a former life, I had every server crash over the weekend when the facilities group took down the climate control and HVAC systems without warning.

[Jan 29, 2019] [SOLVED] Unable to mount root file system after a power failure

Jan 29, 2019 | www.linuxquestions.org
damateem (LQ Newbie), 07-01-2012, 12:56 PM, #1

We had a storm yesterday and the power dropped out, causing my Ubuntu server to shut off. Now, when booting, I get
[ 0.564310] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

It looks like a file system corruption, but I'm having a hard time fixing the problem. I'm using Rescue Remix 12-04 to boot from USB and get access to the system.

Using

sudo fdisk -l

Shows the hard drive as

/dev/sda1: Linux
/dev/sda2: Extended
/dev/sda5: Linux LVM

Using

sudo lvdisplay

Shows LV Names as

/dev/server1/root
/dev/server1/swap_1

Using

sudo blkid

Shows types as

/dev/sda1: ext2
/dev/sda5: LVM2_member
/dev/mapper/server1-root: ext4
/dev/mapper/server1-swap_1: swap

I can mount sda1 and server1/root and all the files appear normal, although I'm not really sure what issues I should be looking for. On sda1, I see a grub folder and several other files. On root, I see the file system as it was before I started having trouble.

I've ran the following fsck commands and none of them report any errors

sudo fsck -f /dev/sda1
sudo fsck -f /dev/server1/root
sudo fsck.ext2 -f /dev/sda1
sudo fsck.ext4 -f /dev/server1/root

and I still get the same error when the system boots.

I've hit a brick wall.

What should I try next?

What can I look at to give me a better understanding of what the problem is?

Thanks,
David


syg00 (LQ Veteran), 07-02-2012, 05:58 AM, #2

Might depend a bit on what messages we aren't seeing.
Normally I'd reckon that means that either the filesystem or disk controller support isn't available. But with something like Ubuntu you'd expect that to all be in place from the initrd. And that is on the /boot partition, and shouldn't be subject to update activity in a normal environment. Unless maybe you're real unlucky and an update was in flight.

Can you chroot into the server (disk) install and run from there successfully?


damateem (LQ Newbie, original poster), 07-02-2012, 06:08 PM, #3

I had a very hard time getting the Grub menu to appear. There must be a very small window for detecting the shift key. Holding it down through the boot didn't work. Repeatedly hitting it at about twice per second didn't work. Increasing the rate to about 4 hits per second got me into it.
Once there, I was able to select an older kernel (2.6.32-39-server). The non-booting kernel was 2.6.32-40-server. 39 booted without any problems.

When I initially set up this system, I couldn't send email from it. It wasn't important to me at the time, so I planned to come back and fix it later. Last week (before the power drop), email suddenly started working on its own. I was surprised because I haven't specifically performed any updates. However, I seem to remember setting up automatic updates, so perhaps an auto update was done that introduced a problem, but it wasn't seen until the reboot that was forced by the power outage.

Next, I'm going to try updating to the latest kernel and see if it has the same problem.

Thanks,
David

frieza (Senior Member), 07-02-2012, 06:24 PM, #4

imho auto updates are dangerous. If you want my opinion, make sure auto updates are off and only have the system tell you there are updates; that way you can choose not to install them during a power failure.
As for a possible future solution for what you went through: unlike other keys, a held shift key doesn't register as a stuck key, to the best of my knowledge, so you can hold the shift key to get into grub. After that, edit the recovery line (the e key) to say init=/bin/bash at the end, then boot the system using the keys specified on the bottom of the screen. Once booted to a prompt, you would run
Code:
fsck -f {root partition}
(in this state, the root partition should be either not mounted or mounted read-only, so you can safely run an fsck on the drive)
note that -f is passed through to e2fsck, where it forces a check even if the filesystem is marked clean.

then reboot, and hopefully that fixes things

glad things seem to be working for the moment though.
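The precondition mentioned above (root either unmounted or mounted read-only before running fsck) can be checked before the command is issued. The following is a minimal sketch. The names mount_state and safe_fsck are hypothetical, and it assumes the device appears in /proc/mounts under the same name you pass in. That is not always true: LVM volumes usually show up as /dev/mapper/..., and root may be listed as /dev/root.

```shell
# Hypothetical helper: prints "rw", "ro", or "unmounted" for DEVICE, based on
# a /proc/mounts-style file (second argument, defaults to /proc/mounts).
mount_state() {
    device=$1
    mounts=${2:-/proc/mounts}
    awk -v dev="$device" '
        $1 == dev {
            if (state == "") state = "ro"
            n = split($4, opts, ",")
            # Any read-write mount of the device makes fsck unsafe.
            for (i = 1; i <= n; i++) if (opts[i] == "rw") state = "rw"
        }
        END { print (state == "" ? "unmounted" : state) }
    ' "$mounts"
}

# Refuses to run fsck -f on a device that is mounted read-write.
safe_fsck() {
    if [ "$(mount_state "$1")" = "rw" ]; then
        echo "refusing: $1 is mounted read-write" >&2
        return 1
    fi
    fsck -f "$1"
}
```

Calling "safe_fsck /dev/mapper/server1-root" from the init=/bin/bash prompt would then run the check only when it is safe to do so.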

suicidaleggroll (LQ Guru), 07-02-2012, 06:32 PM, #5

Quote (originally posted by damateem): However, I seem to remember setting up automatic updates, so perhaps an auto update was done that introduced a problem, but it wasn't seen until the reboot that was forced by the power outage.

I think this is very likely. Delayed reboots after performing an update can make tracking down errors impossibly difficult. I had a system a while back that wouldn't boot, turns out it was caused by an update I had done 6 MONTHS earlier, and the system had simply never been restarted afterward.


damateem (LQ Newbie, original poster), 07-04-2012, 10:18 AM, #6

I discovered the root cause of the problem. When I attempted the update, I found that the boot partition was full. So I suspect that caused issues for the auto update, but they went undetected until the reboot.
I next tried to purge old kernels using the instructions at

http://www.liberiangeek.net/2011/11/...neiric-ocelot/

but that failed because a previous install had not completed, and it couldn't complete because of the full partition. So I had no choice but to manually rm the oldest kernel and its associated files. With that done, the command

apt-get -f install

got far enough that I could then purge the unwanted kernels. Finally,

sudo apt-get update
sudo apt-get upgrade

brought everything up to date.

I will be deactivating the auto updates.

Thanks for all the help!

David
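A full /boot like the one in this thread can be caught before it breaks a kernel update. Below is a minimal sketch of such a check. The function names and the 90% threshold are arbitrary choices for illustration, and the parsing assumes the POSIX output format of "df -P", in which the fifth column is the capacity.

```shell
# Hypothetical helper: reads `df -P <mountpoint>` output on stdin and prints
# the use percentage (digits only) from the last data line.
df_use_percent() {
    awk 'NR > 1 { gsub("%", "", $5); pct = $5 } END { print pct }'
}

# Warns and returns non-zero when /boot usage reaches the threshold.
check_boot_space() {
    threshold=${1:-90}
    pct=$(df -P /boot | df_use_percent)
    if [ -n "$pct" ] && [ "$pct" -ge "$threshold" ]; then
        echo "WARNING: /boot is ${pct}% full; kernel updates may fail" >&2
        return 1
    fi
}

# Usage: run check_boot_space from cron or before apt-get upgrade.
```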

[Jan 28, 2019] Testing backup system as the main source of power outages

Highly recommended!

Jan 28, 2019 | thwack.solarwinds.com

gcp Jul 8, 2015 10:33 PM

Many years ago I worked at an IBM Mainframe site. To make systems more robust they installed a UPS system for the mainframe with battery bank and a honkin' great diesel generator in the yard.

During the commissioning of the system, they decided to test the UPS cutover one afternoon - everything goes *dark* in seconds. Frantic running around to get power back on and MF restarted and databases recovered (afternoon, remember? during the work day...). Oh! The UPS batteries were not charged! Oops.

Over the next few weeks, they did two more 'tests' during the working day, with everything going *dark* in seconds for various reasons. Oops.

Then they decided - perhaps we should test this outside of office hours. (YAY!)

Still took a few more efforts to get everything working - diesel generator wouldn't start automatically, fixed that and forgot to fill up the diesel tank so cutover was fine until the fuel ran out.

Many, many lessons learned from this episode.

[Jan 28, 2019] False alarm: bad smell in machine room due to electrical light, not a server

Jan 28, 2019 | www.reddit.com

radiomix, Jack of All Trades, 3 years ago

I was in my main network facility, for a municipal fiber optic ring. Outside were two technicians replacing our backup air conditioning unit. I walk inside after talking with the two technicians, turn on the lights and begin walking around just visually checking things around the room. All of a sudden I started smelling that dreaded electric hot/burning smell. In this place I have my core switch, primary router, a handful of servers, some customer equipment and a couple of racks for my service provider. I start running around the place like a mad man sniffing all the equipment. I even called in the AC technicians to help me sniff.
After 15 minutes we could not narrow down where it was coming from. Finally I noticed that one of the fluorescent lights had not come on. I grabbed a ladder and opened it up.

The ballast had burned out on the light and it just so happened to be the light right in front of the AC vent, blowing the smell all over the room.

The last time I had smelled that smell in that room a major piece of equipment went belly up and there was nothing I could do about it.

benjunmun, 3 years ago
The exact same thing has happened to me. Nothing quite as terrifying as the sudden smell of ozone as you're surrounded by critical computers and electrical gear.

[Jan 28, 2019] Loss of power problems: Machines are running, but every switch in the cabinet is dead. Some servers are dead. Panic sets in.

Jan 28, 2019 | www.reddit.com

eraser_6776, VP IT/Sec (a damn suit), 3 years ago

May 22, 2004. There was a rather massive storm here that spurred one of the biggest tornadoes recorded in Nebraska (www.tornadochaser.net/hallam.html) and I was a sysadmin for a small company. It was a Saturday, aka beer day, and as all hell was breaking loose my friends' and roommates' pagers and phones were all going off. "Ha ha!" I said, looking at a silent cellphone, "sucks to be you!"
Next morning around 10 my phone rings, and I groggily answer it because it's the owner of the company. "You'd better come in here, none of the computers will turn on" he says. Slight panic, but I hadn't received any emails. So it must have been breakers, and I can get that fixed. No problem.

I get into the office and something strikes me. That eerie sound of silence. Not a single machine is on.. why not? Still shaking off too much beer from the night before, I go into the server room and find out why I didn't get paged. Machines are running, but every switch in the cabinet is dead. Some servers are dead. Panic sets in.

I start walking around the office trying to turn on machines and.. dead. All of them. Every last desktop won't power on. That's when panic REALLY set in.

In the aftermath I found out two things - one, when the building was built, it was built with a steel roof and steel trusses. Two, when my predecessor had the network cabling wired he hired an idiot who didn't know fire code and ran the network cabling, conveniently, along the trusses into the ceiling. Thus, when lightning hit the building it had a perfect ground path to every workstation in the company. Some servers that weren't in the primary cabinet had been wired to a wall jack (which, in turn, went up into the ceiling then back down into the cabinet because you know, wire management!). Thankfully they were all "legacy" servers.

The only thing that saved the main servers was that Cisco 2924 XL-EN's are some badass mofo's that would die before they let that voltage pass through to the servers in the cabinet. At least that's what I told myself.

All in all, it ended up being one of the longest work weeks ever as I first had to source a bunch of switches, fast to get things like mail and the core network back up. Next up was feeding my buddies a bunch of beer and pizza after we raided every box store in town for spools of Cat 5 and threw wire along the floor.

Finally I found out that CDW can and would get you a whole lot of desktops delivered to your door with your software pre-installed in less than 24 hours if you have an open checkbook. Thanks to a great insurance policy, we did. Shipping and "handling" for those were more than the cost of the machines (again, this was back in 2004 and they were business desktops so you can imagine).

Still, for weeks after I had non-stop user complaints that generally involved "..I think this is related to the lightning ". I drank a lot that summer.

[Jan 28, 2019] That's how I learned to always check with somebody else before rebooting a production server, no matter how minor it may seem

Jan 28, 2019 | www.reddit.com

VexingRaven, 3 years ago

Not really a horror story but definitely one of my first "Oh shit" moments. I was the FNG helpdesk/sysadmin at a company of 150 people. I start getting calls that something (I think it was Outlook) wasn't working in Citrix, apparently something broken on one of the Citrix servers. I'm 100% positive it will be fixed with a reboot (I've seen this before on individual PCs), so I diligently start working to get people off that Citrix server (one of three) so I can reboot it.
I get it cleared out, hit Reboot... And almost immediately get a call from the call center manager saying every single person just got kicked off Citrix. Oh shit. But there was nobody on that server! Apparently that server also housed the Secure Gateway server which my senior hadn't bothered to tell me or simply didn't know (Set up by a consulting firm). Whoops. Thankfully the servers were pretty fast and people's sessions reconnected a few minutes later, no harm no foul. And on the plus side, it did indeed fix the problem.

And that's how I learned to always check with somebody else before rebooting a production server, no matter how minor it may seem.

[Jan 14, 2019] Safe rm stops you accidentally wiping the system! @ New Zealand Linux

Jan 14, 2019 | www.nzlinux.com

Francois Marier October 21, 2009 at 10:34 am
Another related tool, to prevent accidental reboots of servers this time, is molly-guard:

http://packages.debian.org/sid/molly-guard

It asks you to type the hostname of the machine you want to reboot as an extra confirmation step.

[Oct 05, 2018] Due to a configuration change I wasn't privy to, the software I was responsible for rebooted all the 911 operators' servers at once

Oct 05, 2018 | www.reddit.com

ardwin, 5 years ago

Due to a configuration change I wasn't privy to, the software I was responsible for rebooted all the 911 operators' servers at once.

cobra10101010, 5 years ago
Oh God... that is scary in the true sense. Hope everything was okay.

ardwin, 5 years ago
I quickly learned that the 911 operators are trained to do their jobs without any kind of computer support. It made me feel better.

reebzor, 5 years ago
I did this too!
edit: except I was the one that deployed the software that rebooted the machines

vocatus, 5 years ago
Hey, maybe you should go apologize to ardwin. I bet he was pissed.

[Jun 16, 2010] Prevent Accidental Shutdown/Reboot in Ubuntu

Jan 15, 2010 | Linux Today

blackhole

Re: Solution looking for a problem
> How exactly does someone "accidentally" issue a shutdown or reboot command? ... Failing that highly likely scenario, this is someone shopping around a solution for a problem that doesn't really exist. Give me a break.

I haven't checked out the actual package in question, but based on the fact that (according to the posted output) it notes the connection is via SSH and asks for a hostname, I would say the author of the article did not articulate well what the purpose of the package is. The purpose appears to be to avoid shutting down the *wrong* computer when connecting remotely.

I've never had that problem, but more than once I've shut down the local computer when I intended to shut down a remote computer. I think after the second time (after I stopped swearing!) I created aliases for halt and reboot that first query with a message like: Really halt {hostname} [yn]?
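The aliases described in this comment might look roughly like the sketch below. The wrapper name confirm_on_host and the /sbin paths in the alias lines are assumptions; on some systems (RHEL 7, for example) reboot and halt are symlinks to systemctl and live elsewhere.

```shell
# Hypothetical wrapper: asks "Really <action> <hostname> [yn]?" and runs the
# remaining arguments as a command only on an explicit "y".
confirm_on_host() {
    action=$1
    shift
    printf 'Really %s %s [yn]? ' "$action" "$(hostname)"
    read -r answer
    case "$answer" in
        y|Y) "$@" ;;
        *)   echo "Cancelled."; return 1 ;;
    esac
}

# For interactive shells only, e.g. in ~/.bashrc (paths are assumptions):
#   alias reboot='confirm_on_host reboot /sbin/reboot'
#   alias halt='confirm_on_host halt /sbin/halt'
```

Because aliases are expanded only in interactive shells, scripts that call reboot or halt are not affected. The pause and the hostname in the prompt are what give a distracted administrator the chance to notice the wrong window.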

Marco

Re: Solution looking for a problem
Re: How exactly does someone "accidentally" issue a shutdown or reboot command?
I've done it while I was distracted, open a shell to one box, then open one to another box. Go to lunch. Forget which shell you are using and send the wrong command to the wrong machine. Nobody is perfect.

[Jun 14, 2010] IT Resource Center forums - greatest blunders

Michael Steele

When I was first starting out I worked for a Telecom as an 'Application Administrator' and I sat in a small room with a half a dozen other admins and together we took calls from users as their calls escalated up from tier I support. We were tier II in a three tier organization.
A month earlier someone from tier I confused a production server with a test server and rebooted it in the middle of the day. These servers were remotely connected over a large distance so it can be confusing. Care is needed before rebooting.
The tier I culprit took a great deal of abuse for this mistake and soon became a victim of several jokes. An outage had been caused in a high availability environment which meant management, interviews, reports; It went on and on and was pretty brutal.
And I was just as brutal as anyone.
Their entire organization soon became victimized by everyone from our organization. The abuse traveled right up the management tree and all participated.
It was hilarious, for us.
Until I did the same thing a month later.
There is nothing more humbling than 2000 people all knowing who you are for the wrong reason and I have never longed for anonymity more.
Now I always do a 'uname' or 'hostname' before a reboot, even when I'm right in front of it.

[Jun 9, 2010] horror - University of Cambridge Computing Service - Unix Support

(3) At the same institution, we were running system software that had a serious bug: if anyone logged out ungracefully, the system wouldn't let any more users onto the system, and users who were logged on couldn't execute any new commands. (The newest release of the software later on did fix this bug.) I had to reboot the machine to restore the system to a sane state. I did a wall <<EOF We need to shutdown blah blah... EOF and then shutdown. Well, I should've waited, since at that precise moment one of our users was doing a once-a-year massive conversion of our financial data (talk about bad luck). I had shut down in the middle of a very long disk write and thus data was lost. We did recover that data and life went on.

Moral: make damn sure that *no one* is doing anything on your system before you reboot, even if other users are vociferously clamoring for you to reboot.
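One small part of that moral can be automated: counting who else is logged in before the reboot. The sketch below is only a first filter. The function name other_sessions is hypothetical, and "who" does not show batch jobs or detached processes, so a quiet result does not prove the machine is idle.

```shell
# Hypothetical helper: reads `who` output on stdin and prints the number of
# login sessions that belong to users other than the one named in $1.
other_sessions() {
    me=$1
    awk -v me="$me" 'NF > 0 && $1 != me { n++ } END { print n + 0 }'
}

# Example pre-reboot check:
#   n=$(who | other_sessions "$(id -un)")
#   [ "$n" -eq 0 ] || echo "$n other session(s) active; ask before rebooting" >&2
```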

My 10 UNIX Command Line Mistakes by Vivek Gite

with 90 comments
Anyone who has never made a mistake has never tried anything new. -- Albert Einstein.

Here are a few mistakes that I made while working at UNIX prompt. Some mistakes caused me a good amount of downtime. Most of these mistakes are from my early days as a UNIX admin.

... ... ...

Rebooted Solaris Box

On Linux, the killall command kills processes by name (killall httpd). On Solaris, it kills all active processes. As root I killed every process on what was our main Oracle db box:
killall process-name
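One way to avoid this trap is to pick the kill-by-name command according to the operating system. The sketch below uses a hypothetical function, kill_by_name_cmd, and treats killall as safe only on Linux. Everywhere else it falls back to pkill, which is available on Solaris and matches processes by name.

```shell
# Hypothetical helper: prints the command to use for killing processes by
# name on the given OS (first argument, defaults to `uname -s`).
kill_by_name_cmd() {
    case "${1:-$(uname -s)}" in
        Linux) echo killall ;;   # Linux killall takes a process name
        *)     echo pkill ;;     # e.g. Solaris killall kills every process
    esac
}

# Usage: "$(kill_by_name_cmd)" httpd
```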

Selected Comments

UnixEagle

Rebooted the wrong box

While adding an alias to the main network interface I ended up changing the main IP address; the system froze right away and I had to call for a reboot

Instead of appending text to the Apache config file, I overwrote its contents

Firewall lockdown while changing the ssh port

Wrongly ran a script containing a recursive chmod and chown as root on /, which caused me a downtime of about 12 hours and a complete re-install

Some mistakes are really silly, and when they happen, you can't believe you did that, but every mistake, regardless of its silliness, should be a lesson learned.

If you made a trivial mistake, you should not just overlook it; you have to think of the reasons that made you do it, like: you didn't have much sleep or your mind was confused about personal life or ... etc.

I like Einstein's quote, you really have to make mistakes to learn.

[May 21, 2009] Accidental shutdown

Re: Accidental shutdown
by Todd A. Jacobson 2009-05-21T20:46:53+00:00.

On Thu, May 21, 2009 at 12:31:47AM +0100, Bhasker C V wrote:

> I can rename and shell wrap the binaries poweroff/shutdown/reboot but
> that would not be a clean method and I am sure there should be much
> better way than that.

Nope. You could disable the reboot command in your sudoers file, but that isn't going to prevent you from rebooting the wrong machine if you really make an effort.

You might also consider editing sudoers to change the sudo password prompt to include the hostname of the box you're on, so that you're less likely to issue commands to the wrong box if you're paying attention.

However, the real problem here is that you're assuming Linux should protect you from yourself. It won't; part of being a power user is not running privileged commands without exercising due care. With power comes responsibility!

As has been said before: "*nix is user friendly. It's just picky about who its friends are!"

--
"Oh, look: rocks!"
-- Doctor Who, "Destiny of the Daleks"
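The suggestion above about putting the hostname into the sudo password prompt can be tried without editing sudoers at all. This is a sketch; the prompt wording is an arbitrary example. sudo expands %u to the invoking user and %h to the local short hostname in its password prompt.

```shell
# Per-user variant: sudo uses SUDO_PROMPT as its default password prompt.
SUDO_PROMPT='[sudo] password for %u on %h: '
export SUDO_PROMPT

# System-wide equivalent, added with visudo (sudoers syntax, not shell):
#   Defaults passprompt="[sudo] password for %u on %h: "
```

Putting the export in a shell startup file such as ~/.bashrc makes every password prompt name the box, which helps only when sudo actually asks for a password (not while cached credentials are still valid).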

by Scott Gifford on 2009-05-21T20:51:53+00:00.

Bhasker C V writes:

> Is there a method to prevent accidental powerdown of a linux box ? or atleast alert ?

If you get in the habit of running "shutdown -r +1" instead of "reboot", it will warn users for 1 minute before shutting down the server. That should give you enough time to run "shutdown -c" to cancel the shutdown if you realize it's on the wrong machine.

Hope this helps,
-----Scott.
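The habit described in this reply can be wrapped in two small functions, sketched below. The names delayed_reboot and cancel_reboot and the DRY_RUN switch are assumptions made for illustration. The underlying commands are the ones quoted in the message: "shutdown -r +1" and "shutdown -c".

```shell
# Sketch only. With DRY_RUN=1 (the assumed default) commands are printed,
# not executed; set DRY_RUN=0 to really run them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Schedules a reboot N minutes from now (default 1) and names the host.
delayed_reboot() {
    echo "Rebooting $(hostname) in ${1:-1} minute(s); 'shutdown -c' cancels." >&2
    run shutdown -r "+${1:-1}"
}

# Cancels a pending shutdown or reboot.
cancel_reboot() {
    run shutdown -c
}
```

The message on stderr names the host, so the one-minute window also serves as a last check that the command went to the intended machine.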


The Last but not Least

Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand. ~Archibald Putt, Ph.D.

Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Copyright in original materials belongs to their respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...

Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense, so you need to be aware of Google's privacy policy. If you do not want to be tracked by Google, please disable JavaScript for this site. This site is perfectly usable without JavaScript.

Last modified: February 19, 2020

Old News ;-)

[Mar 01, 2019] Molly-guard for CentOS 7 UoB Unix by dg12158

Sep 21, 2015 | bris.ac.uk

[Mar 01, 2019] molly-guard protects machines from accidental shutdowns-reboots by ruchi

Nov 28, 2009 | www.ubuntugeek.com

[Mar 01, 2019] Confirm before executing shutdown-reboot command on linux by Ilija Matoski

Notable quotes:

"... rushing to leave and was still logged into a server so I wanted to shutdown my laptop, but what I didn't notice is that I was still connected to the remote server. ..."

Oct 23, 2017 | matoski.com

[Jan 29, 2019] hardware - Is post-sudden-power-loss filesystem corruption on an SSD drive's ext3 partition expected behavior

Dec 04, 2012 | serverfault.com

[Jan 29, 2019] xfs corrupted after power failure

Highly recommended!

Oct 15, 2013 | www.linuxquestions.org

[Jan 29, 2019] an HVAC tech that confused the BLACK button that got pushed to exit the room with the RED button clearly marked EMERGENCY POWER OFF.

Jan 29, 2019 | thwack.solarwinds.com

[Jan 29, 2019] HVAC units greatly help to increase reliability

Jan 29, 2019 | thwack.solarwinds.com

[Jan 29, 2019] In a former life, I had every server crash over the weekend when the facilities group took down the climate control and HVAC systems without warning

Jan 29, 2019 | thwack.solarwinds.com

[Jan 29, 2019] [SOLVED] Unable to mount root file system after a power failure

Jan 29, 2019 | www.linuxquestions.org

[Jan 28, 2019] Testing backup system as the main source of power outages

Highly recommended!

Jan 28, 2019 | thwack.solarwinds.com

[Jan 28, 2019] False alarm: bad smell in machine room due to electrical light, not a server

Jan 28, 2019 | www.reddit.com

[Jan 28, 2019] Loss of power problems: Machines are running, but every switch in the cabinet is dead. Some servers are dead. Panic sets in.

Jan 28, 2019 | www.reddit.com

[Jan 28, 2019] That's how I learned to always check with somebody else before rebooting a production server, no matter how minor it may seem

Jan 28, 2019 | www.reddit.com

[Jan 14, 2019] Safe rm stops you accidentally wiping the system! @ New Zealand Linux

Jan 14, 2019 | www.nzlinux.com

[Oct 05, 2018] Due to a configuration change I wasn't privy to, the software I was responsible for rebooted all the 911 operators servers at once

Oct 05, 2018 | www.reddit.com

[Jun 16, 2010] Prevent Accidental Shutdown/Reboot in Ubuntu

Jan 15, 2010 | Linux Today

[Jun 14, 2010] IT Resource Center forums - greatest blunders

[Jun 9, 2010] horror - University of Cambridge Computing Service - Unix Support

My 10 UNIX Command Line Mistakes by Vivek Gite

Selected Comments

[May 21, 2009] Accidental shutdown

Softpanorama Recommended

[Jan 29, 2019] xfs corrupted after power failure Published on Oct 15, 2013 | www.linuxquestions.org

[Jan 28, 2019] Testing backup system as the main source of power outages Published on Jan 28, 2019 | thwack.solarwinds.com