Abrupt loss of power horror stories


The more complex your setup, the more prone your Linux boxes are to problems after a power outage. Abrupt loss of power is a very typical incident, even in large datacenters. Often it is connected with testing of the backup power ;-).

If your server was heavily loaded at the moment of the power outage, the chances that it will survive the abrupt loss of power are lower than when the server was just circulating air.

Servers with systemd are more brittle than servers with SysV init. Servers with LVM partitions are more brittle than servers without LVM.

An especially bad situation is when the root filesystem is on an LV, which, stupidly enough, is the RHEL default. Here you are essentially putting yourself into a situation where recovery is more complex than it should be. That's the worst blunder a sysadmin can commit.

A journaling filesystem helps to minimize this problem because of the way it writes information. Most newer filesystems, including ext3 and ext4 (the most common filesystems for Linux), are journaling filesystems, and that helps to preserve the data. Which Linux filesystem is more tolerant of power failure?

Ext3 defaulted to barrier=0, and the distros I saw did not enable it by default. Without barrier=1 (see the mount(8) man page), ext3 does not place sufficient constraints on write cache reordering to make it safe against power loss.

ext4 does default to using write barriers.

So ext3 can avoid filesystem-level corruption on power-loss, but it does not do so by default. ext4 will do so by default.
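
Below is a minimal illustration of enabling write barriers on an ext3 filesystem; the device name and mount point are placeholders, and on ext4 barriers are already on unless explicitly disabled:

# /etc/fstab entry forcing write barriers on an ext3 filesystem (device is a placeholder)
/dev/sda1   /   ext3   defaults,barrier=1   0  1

# or enable barriers on an already-mounted ext3 filesystem
mount -o remount,barrier=1 /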

The inode table has long been the weakest link in Linux/Unix. Since in ext3 and ext4 files are stored in an unordered list which is constantly being modified, there is no separation between important, static files like kernel binaries and worthless files like temporary files. The last-access timestamp is often useless, but it is still written even for system files unless you specifically disable it. This leads to a situation in which the inode entries of critical system files are re-written on each access to them.
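
A minimal sketch of turning off access-time updates with the standard noatime mount option (device and mount point are placeholders; relatime is a milder alternative and the default on newer kernels):

# /etc/fstab entry that stops last-access timestamp updates
/dev/sda1   /   ext4   defaults,noatime   0  1

# or apply it to a mounted filesystem without a reboot
mount -o remount,noatime /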

When power goes out, it tends to blow away the part of the inode table that was being written. The system files themselves are fine and intact, but the inode table for the directory can be damaged. This situation is made worse by disk caching, which has the effect of increasing the size of the damaged areas.

Usually this situation is recoverable but sometimes not.

SSD disks generally survive loss of power better than rotating disks, so critical files should be kept on SSDs if possible.

Here is an old but still useful advisory (File System Corruption after Power Outage or System Crash):

Although Linux is a stable operating system, should it happen to crash unexpectedly (perhaps due to a kernel bug, or perhaps due to a power outage), your file system(s) will not have been unmounted and will therefore be automatically checked for errors when Linux is restarted.

Most of the time, any file system problems are minor ones caused by file buffers not having been written to the disk, such as deleted inodes still marked in use. In the majority of cases, the file system check will be able to detect and repair such anomalies automatically, and upon completion the Linux boot process will continue normally.

Should a file system problem be more severe (such problems tend to be caused by faulty hardware such as a bad hard drive or memory chip; something to keep in mind should file system corruption happen frequently), the file system check may not be able to repair the problem automatically. This is usually, but not always, the case when the root file system itself is corrupted. In this case, the Red Hat boot process will display an error message and drop you into a shell, allowing you to attempt file system repairs manually.

As the recovery shell unmounts all file systems, and then mounts the root file system "read-only", you will be able to perform full file system checks using the appropriate utilities. Likely you will be able to run e2fsck on the corrupted file system(s) which should hopefully resolve all the problems found.

After you have (hopefully) repaired any file system problems, simply exit the shell to have Linux reboot the system and attempt a subsequent restart.
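
Put together, a manual repair session from the recovery shell might look roughly like this (a sketch only; the device name is a placeholder for whatever filesystem is reported as damaged):

# root is already mounted read-only by the recovery shell
e2fsck -f /dev/sda2        # force a full check of the corrupted filesystem
e2fsck -f -y /dev/sda2     # optionally re-run answering "yes" to all repair prompts
exit                       # leave the shell so the boot process can continue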

Naturally, to be prepared for situations such as a non-recoverable file system problem, you should have one or more of the following things available to you:

 

 



Old News ;-)

[Jan 29, 2019] hardware - Is post-sudden-power-loss filesystem corruption on an SSD drive's ext3 partition expected behavior

Dec 04, 2012 | serverfault.com

My company makes an embedded Debian Linux device that boots from an ext3 partition on an internal SSD drive. Because the device is an embedded "black box", it is usually shut down the rude way, by simply cutting power to the device via an external switch.

This is normally okay, as ext3's journalling keeps things in order, so other than the occasional loss of part of a log file, things keep chugging along fine.

However, we've recently seen a number of units where after a number of hard-power-cycles the ext3 partition starts to develop structural issues -- in particular, we run e2fsck on the ext3 partition and it finds a number of issues like those shown in the output listing at the bottom of this Question. Running e2fsck until it stops reporting errors (or reformatting the partition) clears the issues.

My question is... what are the implications of seeing problems like this on an ext3/SSD system that has been subjected to lots of sudden/unexpected shutdowns?

My feeling is that this might be a sign of a software or hardware problem in our system, since my understanding is that (barring a bug or hardware problem) ext3's journalling feature is supposed to prevent these sorts of filesystem-integrity errors. (Note: I understand that user-data is not journalled and so munged/missing/truncated user-files can happen; I'm specifically talking here about filesystem-metadata errors like those shown below)

My co-worker, on the other hand, says that this is known/expected behavior because SSD controllers sometimes re-order write commands and that can cause the ext3 journal to get confused. In particular, he believes that even given normally functioning hardware and bug-free software, the ext3 journal only makes filesystem corruption less likely, not impossible, so we should not be surprised to see problems like this from time to time.

Which of us is right?

Embedded-PC-failsafe:~# ls
Embedded-PC-failsafe:~# umount /mnt/unionfs
Embedded-PC-failsafe:~# e2fsck /dev/sda3
e2fsck 1.41.3 (12-Oct-2008)
embeddedrootwrite contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Invalid inode number for '.' in directory inode 46948.
Fix<y>? yes

Directory inode 46948, block 0, offset 12: directory corrupted
Salvage<y>? yes

Entry 'status_2012-11-26_14h13m41.csv' in /var/log/status_logs (46956) has deleted/unused inode 47075.  Clear<y>? yes
Entry 'status_2012-11-26_10h42m58.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47076.  Clear<y>? yes
Entry 'status_2012-11-26_11h29m41.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47080.  Clear<y>? yes
Entry 'status_2012-11-26_11h42m13.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47081.  Clear<y>? yes
Entry 'status_2012-11-26_12h07m17.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47083.  Clear<y>? yes
Entry 'status_2012-11-26_12h14m53.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47085.  Clear<y>? yes
Entry 'status_2012-11-26_15h06m49.csv' in /var/log/status_logs (46956) has deleted/unused inode 47088.  Clear<y>? yes
Entry 'status_2012-11-20_14h50m09.csv' in /var/log/status_logs (46956) has deleted/unused inode 47073.  Clear<y>? yes
Entry 'status_2012-11-20_14h55m32.csv' in /var/log/status_logs (46956) has deleted/unused inode 47074.  Clear<y>? yes
Entry 'status_2012-11-26_11h04m36.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47078.  Clear<y>? yes
Entry 'status_2012-11-26_11h54m45.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47082.  Clear<y>? yes
Entry 'status_2012-11-26_12h12m20.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47084.  Clear<y>? yes
Entry 'status_2012-11-26_12h33m52.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47086.  Clear<y>? yes
Entry 'status_2012-11-26_10h51m59.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47077.  Clear<y>? yes
Entry 'status_2012-11-26_11h17m09.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47079.  Clear<y>? yes
Entry 'status_2012-11-26_12h54m11.csv.gz' in /var/log/status_logs (46956) has deleted/unused inode 47087.  Clear<y>? yes

Pass 3: Checking directory connectivity
'..' in /etc/network/run (46948) is <The NULL inode> (0), should be /etc/network (46953).
Fix<y>? yes

Couldn't fix parent of inode 46948: Couldn't find parent directory entry

Pass 4: Checking reference counts
Unattached inode 46945
Connect to /lost+found<y>? yes

Inode 46945 ref count is 2, should be 1.  Fix<y>? yes
Inode 46953 ref count is 5, should be 4.  Fix<y>? yes

Pass 5: Checking group summary information
Block bitmap differences:  -(208264--208266) -(210062--210068) -(211343--211491) -(213241--213250) -(213344--213393) -213397 -(213457--213463) -(213516--213521) -(213628--213655) -(213683--213688) -(213709--213728) -(215265--215300) -(215346--215365) -(221541--221551) -(221696--221704) -227517
Fix<y>? yes

Free blocks count wrong for group #6 (17247, counted=17611).
Fix<y>? yes

Free blocks count wrong (161691, counted=162055).
Fix<y>? yes

Inode bitmap differences:  +(47089--47090) +47093 +47095 +(47097--47099) +(47101--47104) -(47219--47220) -47222 -47224 -47228 -47231 -(47347--47348) -47350 -47352 -47356 -47359 -(47457--47488) -47985 -47996 -(47999--48000) -48017 -(48027--48028) -(48030--48032) -48049 -(48059--48060) -(48062--48064) -48081 -(48091--48092) -(48094--48096)
Fix<y>? yes

Free inodes count wrong for group #6 (7608, counted=7624).
Fix<y>? yes

Free inodes count wrong (61919, counted=61935).
Fix<y>? yes


embeddedrootwrite: ***** FILE SYSTEM WAS MODIFIED *****

embeddedrootwrite: ********** WARNING: Filesystem still has errors **********

embeddedrootwrite: 657/62592 files (24.4% non-contiguous), 87882/249937 blocks

Embedded-PC-failsafe:~# 
Embedded-PC-failsafe:~# e2fsck /dev/sda3
e2fsck 1.41.3 (12-Oct-2008)
embeddedrootwrite contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Directory entry for '.' in ... (46948) is big.
Split<y>? yes

Missing '..' in directory inode 46948.
Fix<y>? yes

Setting filetype for entry '..' in ... (46948) to 2.
Pass 3: Checking directory connectivity
'..' in /etc/network/run (46948) is <The NULL inode> (0), should be /etc/network (46953).
Fix<y>? yes

Pass 4: Checking reference counts
Inode 2 ref count is 12, should be 13.  Fix<y>? yes

Pass 5: Checking group summary information

embeddedrootwrite: ***** FILE SYSTEM WAS MODIFIED *****
embeddedrootwrite: 657/62592 files (24.4% non-contiguous), 87882/249937 blocks
Embedded-PC-failsafe:~# 
Embedded-PC-failsafe:~# e2fsck /dev/sda3
e2fsck 1.41.3 (12-Oct-2008)
embeddedrootwrite: clean, 657/62592 files, 87882/249937 blocks
Answer by ewwhite (Dec 4, 2012): You're both wrong (maybe?)... ext3 is coping the best it can with having its underlying storage removed so abruptly.

Your SSD probably has some type of onboard cache. You don't mention the make/model of the SSD in use, but this sounds like a consumer-level SSD versus an enterprise or industrial-grade model.

Either way, the cache is used to help coalesce writes and prolong the life of the drive. If there are writes in-transit, the sudden loss of power is definitely the source of your corruption. True enterprise and industrial SSDs have supercapacitors that maintain power long enough to move data from cache to nonvolatile storage, much in the same way battery-backed and flash-backed RAID controller caches work.

If your drive doesn't have a supercap, the in-flight transactions are being lost, hence the filesystem corruption. ext3 is probably being told that everything is on stable storage, but that's just a function of the cache.

Answer by psusi (Dec 5, 2012): You are right and your coworker is wrong. Barring something going wrong, the journal makes sure you never have inconsistent fs metadata. You might check with hdparm to see if the drive's write cache is enabled. If it is, and you have not enabled IO barriers (off by default on ext3, on by default in ext4), then that would be the cause of the problem.

The barriers are needed to force the drive write cache to flush at the correct time to maintain consistency, but some drives are badly behaved and either report that their write cache is disabled when it is not, or silently ignore the flush commands. This prevents the journal from doing its job.
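
A minimal sketch of the checks suggested above (the device name is a placeholder; hdparm only works for ATA/SATA drives):

hdparm -W /dev/sda                 # report whether the drive's volatile write cache is on
hdparm -W 0 /dev/sda               # disable the write cache if barriers cannot be used
mount -o remount,barrier=1 /       # or keep the cache and force barriers on an ext3 mount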

[Jan 29, 2019] xfs corrupted after power failure

Highly recommended!
Oct 15, 2013 | www.linuxquestions.org

katmai90210

hi guys,

i have a problem. yesterday there was a power outage at one of my datacenters, where i have a relatively large fileserver. 2 arrays, 1 x 14 tb and 1 x 18 tb both in raid6, with a 3ware card.

after the outage, the server came back online, the xfs partitions were mounted, and everything looked okay. i could access the data and everything seemed just fine.

today i woke up to lots of i/o errors, and when i rebooted the server, the partitions would not mount:

Oct 14 04:09:17 kp4 kernel:
Oct 14 04:09:17 kp4 kernel: XFS internal error XFS_WANT_CORRUPTED_RETURN a<ffffffff80056933>] pdflush+0x0/0x1fb
Oct 14 04:09:17 kp4 kernel: [<ffffffff80056a84>] pdflush+0x151/0x1fb
Oct 14 04:09:17 kp4 kernel: [<ffffffff800cd931>] wb_kupdate+0x0/0x16a
Oct 14 04:09:17 kp4 kernel: [<ffffffff80032c2b>] kthread+0xfe/0x132
Oct 14 04:09:17 kp4 kernel: [<ffffffff8005dfc1>] child_rip+0xa/0x11
Oct 14 04:09:17 kp4 kernel: [<ffffffff800a3ab7>] keventd_create_kthread+0x0/0xc4
Oct 14 04:09:17 kp4 kernel: [<ffffffff80032b2d>] kthread+0x0/0x132
Oct 14 04:09:17 kp4 kernel: [<ffffffff8005dfb7>] child_rip+0x0/0x11
Oct 14 04:09:17 kp4 kernel:
Oct 14 04:09:17 kp4 kernel: XFS internal error XFS_WANT_CORRUPTED_RETURN at line 279 of file fs/xfs/xfs_alloc.c. Caller 0xffffffff88342331
Oct 14 04:09:17 kp4 kernel:

got a bunch of these in dmesg.

The array is fine:

[root@kp4 ~]# tw_cli
//kp4> focus c6
//kp4/c6> show

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-6 OK - - 256K 13969.8 RiW ON
u1 RAID-6 OK - - 256K 16763.7 RiW ON

VPort Status Unit Size Type Phy Encl-Slot Model
------------------------------------------------------------------------------
p0 OK u1 2.73 TB SATA 0 - Hitachi HDS723030AL
p1 OK u1 2.73 TB SATA 1 - Hitachi HDS723030AL
p2 OK u1 2.73 TB SATA 2 - Hitachi HDS723030AL
p3 OK u1 2.73 TB SATA 3 - Hitachi HDS723030AL
p4 OK u1 2.73 TB SATA 4 - Hitachi HDS723030AL
p5 OK u1 2.73 TB SATA 5 - Hitachi HDS723030AL
p6 OK u1 2.73 TB SATA 6 - Hitachi HDS723030AL
p7 OK u1 2.73 TB SATA 7 - Hitachi HDS723030AL
p8 OK u0 2.73 TB SATA 8 - Hitachi HDS723030AL
p9 OK u0 2.73 TB SATA 9 - Hitachi HDS723030AL
p10 OK u0 2.73 TB SATA 10 - Hitachi HDS723030AL
p11 OK u0 2.73 TB SATA 11 - Hitachi HDS723030AL
p12 OK u0 2.73 TB SATA 12 - Hitachi HDS723030AL
p13 OK u0 2.73 TB SATA 13 - Hitachi HDS723030AL
p14 OK u0 2.73 TB SATA 14 - Hitachi HDS723030AL

Name OnlineState BBUReady Status Volt Temp Hours LastCapTest
---------------------------------------------------------------------------
bbu On Yes OK OK OK 0 xx-xxx-xxxx

i googled for solutions and i think i jumped the horse by doing

xfs_repair -L /dev/sdc

it would not clean it with xfs_repair /dev/sdc, and everybody pretty much says the same thing.

this is what i was getting when trying to mount the array.

Filesystem Corruption of in-memory data detected. Shutting down filesystem xfs_check

Did i jump the gun by using the -L switch :/ ?

jefro

Here is the RH data on that.

https://docs.fedoraproject.org/en-US...xfsrepair.html
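
For reference, the order of operations that documentation and the replies recommend is roughly the following (a sketch; device and mount point are placeholders). Zeroing the log with -L really is the last resort, because it discards any metadata updates still sitting in the journal:

mount /dev/sdc /mnt/array     # mounting first lets XFS replay its own journal
umount /mnt/array
xfs_repair /dev/sdc           # repair with the log intact
xfs_repair -L /dev/sdc        # last resort only: zeroes the log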

[Jan 29, 2019] an HVAC tech that confused the BLACK button that got pushed to exit the room with the RED button clearly marked EMERGENCY POWER OFF.

Jan 29, 2019 | thwack.solarwinds.com

George Sutherland Jul 8, 2015 9:58 AM ( in response to RandyBrown ) had similar thing happen with an HVAC tech that confused the BLACK button that got pushed to exit the room with the RED button clearly marked EMERGENCY POWER OFF. Clear plastic cover installed with in 24 hours.... after 3 hours of recovery!

PS... He told his boss that he did not do it.... the camera that focused on the door told a much different story. He was persona non grata at our site after that.

[Jan 29, 2019] HVAC units greatly help to increase reliability

Jan 29, 2019 | thwack.solarwinds.com

sleeper_777 Jul 15, 2015 1:07 PM

Worked at a bank. 6" raised floor. Liebert cooling units on floor with all network equipment. Two units developed a water drain issue over a weekend.

About an hour into Monday morning, devices, servers, and routers, in a domino effect, started shorting out and shutting down or blowing up, literally.

Opened the floor tiles to find three inches of water.

We did not have water alarms on the floor at the time.

Shortly after the incident, we did.

But the mistake was very costly and multiple 24 hour shifts of IT people made it a week of pure h3ll.

[Jan 29, 2019] In a former life, I had every server crash over the weekend when the facilities group took down the climate control and HVAC systems without warning

Jan 29, 2019 | thwack.solarwinds.com

[Jan 29, 2019] [SOLVED] Unable to mount root file system after a power failure

Jan 29, 2019 | www.linuxquestions.org
07-01-2012, 12:56 PM # 1
damateem LQ Newbie
Registered: Dec 2010 Posts: 8
Unable to mount root file system after a power failure

We had a storm yesterday and the power dropped out, causing my Ubuntu server to shut off. Now, when booting, I get

[ 0.564310] Kernel panic - not syncing: VFS: Unable to mount root fs on unkown-block(0,0)

It looks like a file system corruption, but I'm having a hard time fixing the problem. I'm using Rescue Remix 12-04 to boot from USB and get access to the system.

Using

sudo fdisk -l

Shows the hard drive as

/dev/sda1: Linux
/dev/sda2: Extended
/dev/sda5: Linux LVM

Using

sudo lvdisplay

Shows LV Names as

/dev/server1/root
/dev/server1/swap_1

Using

sudo blkid

Shows types as

/dev/sda1: ext2
/dev/sda5: LVM2_member
/dev/mapper/server1-root: ext4
/dev/mapper/server1-swap_1: swap

I can mount sda1 and server1/root and all the files appear normal, although I'm not really sure what issues I should be looking for. On sda1, I see a grub folder and several other files. On root, I see the file system as it was before I started having trouble.

I've ran the following fsck commands and none of them report any errors

sudo fsck -f /dev/sda1
sudo fsck -f /dev/server1/root
sudo fsck.ext2 -f /dev/sda1
sudo fsck.ext4 -f /dev/server1/root

and I still get the same error when the system boots.

I've hit a brick wall.

What should I try next?

What can I look at to give me a better understanding of what the problem is?

Thanks,
David

Old 07-02-2012, 05:58 AM # 2
syg00 LQ Veteran
Registered: Aug 2003 Location: Australia Distribution: Lots ... Posts: 17,415
Might depend a bit on what messages we aren't seeing.

Normally I'd reckon that means that either the filesystem or disk controller support isn't available. But with something like Ubuntu you'd expect that to all be in place from the initrd. And that is on the /boot partition, and shouldn't be subject to update activity in a normal environment. Unless maybe you're real unlucky and an update was in flight.

Can you chroot into the server (disk) install and run from there successfully ?.

Old 07-02-2012, 06:08 PM # 3
damateem LQ Newbie
Registered: Dec 2010 Posts: 8
Original Poster
I had a very hard time getting the Grub menu to appear. There must be a very small window for detecting the shift key. Holding it down through the boot didn't work. Repeatedly hitting it at about twice per second didn't work. Increasing the rate to about 4 hits per second got me into it.

Once there, I was able to select an older kernel (2.6.32-39-server). The non-booting kernel was 2.6.32-40-server. 39 booted without any problems.

When I initially setup this system, I couldn't send email from it. It wasn't important to me at the time, so I planned to come back and fix it later. Last week (before the power drop), email suddenly started working on its own. I was surprised because I haven't specifically performed any updates. However, I seem to remember setting up automatic updates, so perhaps an auto update was done that introduced a problem, but it wasn't seen until the reboot that was forced by the power outage.

Next, I'm going to try updating to the latest kernel and see if it has the same problem.

Thanks,
David

Old 07-02-2012, 06:24 PM # 4
frieza Senior Member Contributing Member
Registered: Feb 2002 Location: harvard, il Distribution: Ubuntu 11.4,DD-WRT micro plus ssh,lfs-6.6,Fedora 15,Fedora 16 Posts: 3,233
imho auto updates are dangerous, if you want my opinion, make sure auto updates are off, and only have the system tell you there are updates, that way you can chose not to install them during a power failure

as for a possible future solution for what you went through, unlike other keys, the shift key being held doesn't register as a stuck key to the best of my knowledge, so you can hold the shift key to get into grub, after that, edit the recovery line (the e key) to say at the end, init=/bin/bash then boot the system using the keys specified on the bottom of the screen, then once booted to a prompt, you would run
Code:

fsck -f {root partition}
(in this state, the root partition should be either not mounted or mounted read-only, so you can safely run an fsck on the drive)

note the -f seems to be an undocumented flag that does a more thorough scan than merely a standard run of fsck.

then reboot, and hopefully that fixes things

glad things seem to be working for the moment though.

Old 07-02-2012, 06:32 PM # 5
suicidaleggroll LQ Guru Contributing Member
Registered: Nov 2010 Location: Colorado Distribution: OpenSUSE, CentOS Posts: 5,573
Quote:
Originally Posted by damateem: However, I seem to remember setting up automatic updates, so perhaps an auto update was done that introduced a problem, but it wasn't seen until the reboot that was forced by the power outage.
I think this is very likely. Delayed reboots after performing an update can make tracking down errors impossibly difficult. I had a system a while back that wouldn't boot, turns out it was caused by an update I had done 6 MONTHS earlier, and the system had simply never been restarted afterward.
Old 07-04-2012, 10:18 AM # 6
damateem LQ Newbie
Registered: Dec 2010 Posts: 8
Original Poster
I discovered the root cause of the problem. When I attempted the update, I found that the boot partition was full. So I suspect that caused issues for the auto update, but they went undetected until the reboot.

I next tried to purge old kernels using the instructions at

http://www.liberiangeek.net/2011/11/...neiric-ocelot/

but that failed because a previous install had not completed, and it couldn't complete because of the full partition. So I had no choice but to manually rm the oldest kernel and its associated files. With that done, the command

apt-get -f install

got far enough that I could then purge the unwanted kernels. Finally,

sudo apt-get update
sudo apt-get upgrade

brought everything up to date.

I will be deactivating the auto updates.

Thanks for all the help!

David
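
To recap the fix in command form, here is a hedged sketch for Debian/Ubuntu systems; the kernel version in the purge command is only an example and must match one of the old kernels actually installed:

df -h /boot                                   # check whether the boot partition is full
dpkg -l 'linux-image-*' | grep ^ii            # list installed kernels
apt-get purge linux-image-2.6.32-38-server    # remove an old kernel to free space (example version)
apt-get -f install                            # let apt finish the interrupted install
apt-get update && apt-get upgrade             # bring the rest of the system up to date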

[Jan 28, 2019] Locking yourself out when restarting the network, or the right way to use the ifdown eth0; ifup eth0 sequence

Notable quotes:
"... Doing it on one line means it comes back up right after it goes down. Doing it on two lines means you lose connection before you can type the second line. I figured this out the hard way, and haven't made the same mistake a second time. ..."
Jan 28, 2019 | thwack.solarwinds.com

jemertz Mar 30, 2016 10:26 AM

When working in a remote lab, on a Linux server which you're connecting to through eth0:

use: ifdown eth0; ifup eth0

not:

ifdown eth0
ifup eth0

Doing it on one line means it comes back up right after it goes down. Doing it on two lines means you lose connection before you can type the second line. I figured this out the hard way, and haven't made the same mistake a second time.
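
A belt-and-suspenders variant of the same idea, assuming the at daemon is installed: schedule the interface to come back up before you touch it, so even a typo cannot lock you out for long.

echo "ifup eth0" | at now + 2 minutes    # safety net: bring the interface back up in 2 minutes
ifdown eth0; ifup eth0                   # do the actual restart on one line
atq                                      # if everything still works, list the pending job...
atrm <job-number>                        # ...and remove it (job number is whatever atq printed)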

[Jan 28, 2019] Testing the backup power system as the main source of power outages

Highly recommended!
Jan 28, 2019 | thwack.solarwinds.com

gcp Jul 8, 2015 10:33 PM

Many years ago I worked at an IBM Mainframe site. To make systems more robust they installed a UPS system for the mainframe with battery bank and a honkin' great diesel generator in the yard.

During the commissioning of the system, they decided to test the UPS cutover one afternoon - everything goes *dark* in seconds. Frantic running around to get power back on and MF restarted and databases recovered (afternoon, remember? during the work day...). Oh! The UPS batteries were not charged! Oops.

Over the next few weeks, they did two more 'tests' during the working day, with everything going *dark* in seconds for various reasons. Oops.

Then they decided - perhaps we should test this outside of office hours. (YAY!)

Still took a few more efforts to get everything working - diesel generator wouldn't start automatically, fixed that and forgot to fill up the diesel tank so cutover was fine until the fuel ran out.

Many, many lessons learned from this episode.

[Jan 28, 2019] False alarm: bad smell in machine room caused by an electrical light, not a server

Jan 28, 2019 | www.reddit.com

radiomix (Jack of All Trades), 3 years ago

I was in my main network facility, for a municipal fiber optic ring. Outside were two technicians replacing our backup air conditioning unit. I walk inside after talking with the two technicians, turn on the lights and begin walking around just visually checking things around the room. All of a sudden I started smelling that dreaded electric hot/burning smell. In this place I have my core switch, primary router, a handful of servers, some customer equipment and a couple of racks for my service provider. I start running around the place like a mad man sniffing all the equipment. I even called in the AC technicians to help me sniff.

After 15 minutes we could not narrow down where it was coming from. Finally I noticed that one of the florescent lights had not come on. I grabbed a ladder and opened it up.

The ballast had burned out on the light, and it just so happened to be the light right in front of the AC vent, blowing the smell all over the room.

The last time I had smelled that smell in that room a major piece of equipment went belly up and there was nothing I could do about it.

benjunmun, 3 years ago
The exact same thing has happened to me. Nothing quite as terrifying as the sudden smell of ozone as you're surrounded by critical computers and electrical gear.

[Jan 28, 2019] Loss of power problems: Machines are running, but every switch in the cabinet is dead. Some servers are dead. Panic sets in.

Jan 28, 2019 | www.reddit.com

eraser_6776, VP IT/Sec (a damn suit), 3 years ago

May 22, 2004. There was a rather massive storm here that spurred one of the biggest tornadoes recorded in Nebraska (www.tornadochaser.net/hallam.html), and I was a sysadmin for a small company. It was a Saturday, aka beer day, and as all hell was breaking loose my friends' and roommates' pagers and phones were all going off. "Ha ha!" I said, looking at a silent cellphone, "sucks to be you!"

Next morning around 10 my phone rings, and I groggily answer it because it's the owner of the company. "You'd better come in here, none of the computers will turn on" he says. Slight panic, but I hadn't received any emails. So it must have been breakers, and I can get that fixed. No problem.

I get into the office and something strikes me. That eery sound of silence. Not a single machine is on.. why not? Still shaking off too much beer from the night before, I go into the server room and find out why I didn't get paged. Machines are running, but every switch in the cabinet is dead. Some servers are dead. Panic sets in.

I start walking around the office trying to turn on machines and.. dead. All of them. Every last desktop won't power on. That's when panic REALLY set in.

In the aftermath I found out two things - one, when the building was built, it was built with a steel roof and steel trusses. Two, when my predecessor had the network cabling wired he hired an idiot who didn't know fire code and ran the network cabling, conveniently, along the trusses into the ceiling. Thus, when lightning hit the building it had a perfect ground path to every workstation in the company. Some servers that weren't in the primary cabinet had been wired to a wall jack (which, in turn, went up into the ceiling then back down into the cabinet because you know, wire management!). Thankfully they were all "legacy" servers.

The only thing that saved the main servers was that Cisco 2924 XL-EN's are some badass mofo's that would die before they let that voltage pass through to the servers in the cabinet. At least that's what I told myself.

All in all, it ended up being one of the longest work weeks ever as I first had to source a bunch of switches, fast to get things like mail and the core network back up. Next up was feeding my buddies a bunch of beer and pizza after we raided every box store in town for spools of Cat 5 and threw wire along the floor.

Finally I found out that CDW can and would get you a whole lot of desktops delivered to your door with your software pre-installed in less than 24 hours if you have an open checkbook. Thanks to a great insurance policy, we did. Shipping and "handling" for those were more than the cost of the machines (again, this was back in 2004 and they were business desktops so you can imagine).

Still, for weeks after I had non-stop user complaints that generally involved "..I think this is related to the lightning ". I drank a lot that summer.

[Jan 28, 2019] That's how I learned to always check with somebody else before rebooting a production server, no matter how minor it may seem

Jan 28, 2019 | www.reddit.com

VexingRaven, 3 years ago

Not really a horror story but definitely one of my first "Oh shit" moments. I was the FNG helpdesk/sysadmin at a company of 150 people. I start getting calls that something (I think it was Outlook) wasn't working in Citrix, apparently something broken on one of the Citrix servers. I'm 100% positive it will be fixed with a reboot (I've seen this before on individual PCs), so I diligently start working to get people off that Citrix server (one of three) so I can reboot it.

I get it cleared out, hit Reboot... And almost immediately get a call from the call center manager saying every single person just got kicked off Citrix. Oh shit. But there was nobody on that server! Apparently that server also housed the Secure Gateway server which my senior hadn't bothered to tell me or simply didn't know (Set up by a consulting firm). Whoops. Thankfully the servers were pretty fast and people's sessions reconnected a few minutes later, no harm no foul. And on the plus side, it did indeed fix the problem.

And that's how I learned to always check with somebody else before rebooting a production server, no matter how minor it may seem.

[Jan 14, 2019] Safe rm stops you accidentally wiping the system! @ New Zealand Linux

Jan 14, 2019 | www.nzlinux.com
  1. Francois Marier October 21, 2009 at 10:34 am

    Another related tool, to prevent accidental reboots of servers this time, is molly-guard:

    http://packages.debian.org/sid/molly-guard

    It asks you to type the hostname of the machine you want to reboot as an extra confirmation step.

[Oct 05, 2018] Due to a configuration change I wasn't privy to, the software I was responsible for rebooted all the 911 operators servers at once

Oct 05, 2018 | www.reddit.com

ardwin, 5 years ago

Due to a configuration change I wasn't privy to, the software I was responsible for rebooted all the 911 operators servers at once.
cobra10101010, 5 years ago
Oh God..that is scary in true sense..hope everything was okay
ardwin, 5 years ago
I quickly learned that the 911 operators, are trained to do their jobs without any kind of computer support. It made me feel better.
reebzor, 5 years ago
I did this too!

edit: except I was the one that deployed the software that rebooted the machines

vocatus, 5 years ago
Hey, maybe you should go apologize to ardwin. I bet he was pissed.

[Jun 16, 2010] Prevent Accidental Shutdown/Reboot in Ubuntu

Jan 15, 2010 | Linux Today

blackhole

Re: Solution looking for a problem
> How exactly does someone "accidentally" issue a shutdown or reboot command? ... Failing that highly likely scenario, this is someone shopping around a solution for a problem that doesn't really exist. Give me a break.

I haven't checked out the actual package in question, but based on the fact that (according to the posted output) it notes the connection is via SSH and asks for a hostname, I would say the author of the article did not articulate well what the purpose of the package is. The purpose appears to be to avoid shutting down the *wrong* computer when connecting remotely.

I've never had that problem, but more than once I've shut down the local computer when I intended to shut down a remote computer. I think after the second time (after I stopped swearing!) I created aliases for halt and reboot that first query with a message like: Really halt {hostname} [yn]?
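
A sketch of such a wrapper as a bash function (names and wording are illustrative, not a drop-in):

# ask for confirmation, showing the hostname, before really halting
halt() {
    read -p "Really halt $(hostname)? [y/N] " answer
    [ "$answer" = "y" ] && command halt "$@"
}
# a similar function can wrap reboot and shutdown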

Marco

Re: Solution looking for a problem
Re: How exactly does someone "accidentally" issue a shutdown or reboot command?

I've done it while I was distracted, open a shell to one box, then open one to another box. Go to lunch. Forget which shell you are using and send the wrong command to the wrong machine. Nobody is perfect.

[Jun 14, 2010] IT Resource Center forums - greatest blunders

Michael Steele

    When I was first starting out I worked for a Telecom as an 'Application Administrator' and I sat in a small room with a half a dozen other admins and together we took calls from users as their calls escalated up from tier I support. We were tier II in a three tier organization.

    A month earlier someone from tier I confused a production server with a test server and rebooted it in the middle of the day. These servers were remotely connected over a large distance so it can be confusing. Care is needed before rebooting.

    The tier I culprit took a great deal of abuse for this mistake and soon became a victim of several jokes. An outage had been caused in a high availability environment which meant management, interviews, reports; It went on and on and was pretty brutal.

    And I was just as brutal as anyone.

    Their entire organization soon became victimized by everyone from our organization. The abuse traveled right up the management tree and all participated.

    It was hilarious, for us.

    Until I did the same thing a month later.

    There is nothing more humbling than 2000 people all knowing who you are for the wrong reason, and I have never longed for anonymity more.

    Now I always do a 'uname' or 'hostname' before a reboot, even when I'm right in front of it.

[Jun 9, 2010] horror - University of Cambridge Computing Service - Unix Support

(3) At the same institution, we were running a system software that had a serious bug where if anyone had logged out ungracefully, the system wouldn't let any more users onto the system and users who were logged on couldn't execute any new commands. (The newest release of the software later on did fix this bug.) I had to reboot the machine to restore the system to a sane state. I did a wall <<EOF We need to shutdown blah blah... EOF and then shutdown. Well, I should've waited since at the precise moment, one of our users was doing a once-a-year massive conversion of our financial data (talk about bad luck). I had shutdown in the middle of a very long disk write and thus, data was lost. We did recover that data and life went on.

Moral: make damn sure that *no one* is doing anything on your system before you reboot, even if other users are vociferously clamoring for you to reboot.

My 10 UNIX Command Line Mistakes by Vivek Gite

with 90 comments

Anyone who has never made a mistake has never tried anything new. -- Albert Einstein.

Here are a few mistakes that I made while working at UNIX prompt. Some mistakes caused me a good amount of downtime. Most of these mistakes are from my early days as a UNIX admin.

... ... ...

Rebooted Solaris Box

On Linux the killall command kills processes by name (killall httpd). On Solaris it kills all active processes. As root I killed all processes; this was our main Oracle db box:
killall process-name
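
A safer and more portable habit than killall is to preview the match with pgrep and then kill with pkill, which behave the same way on Linux and Solaris (a sketch; the process name is an example):

pgrep -l httpd    # list exactly which processes would be hit
pkill httpd       # then signal only those processes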

Selected Comments

UnixEagle

Rebooted the wrong box

While adding an alias to the main network interface, I ended up changing the main IP address; the system froze right away and I had to call for a reboot

Instead of appending text to the Apache config file, I overwrote its contents

Firewall lockdown while changing the ssh port

Wrongly ran a script containing a recursive chmod and chown as root on /, which caused me a downtime of about 12 hours and a complete re-install

Some mistakes are really silly, and when they happen, you don't believe yourself that you did that; but every mistake, regardless of its silliness, should be a learned lesson.

If you made a trivial mistake, you should not just overlook it; you have to think of the reasons that made you do it, like: you didn't have much sleep, or your mind was confused about personal life, or ... etc.

I like Einstein's quote, you really have to do mistakes to learn.

[May 21, 2009] Accidental shutdown

Re: Accidental shutdown

by Todd A. Jacobson 2009-05-21T20:46:53+00:00.

On Thu, May 21, 2009 at 12:31:47AM +0100, Bhasker C V wrote:

> I can rename and shell wrap the binaries poweroff/shutdown/reboot but
> that would not be a clean method and I am sure there should be much
> better way than that.

Nope. You could disable the reboot command in your sudoers file, but that isn't going to prevent you from rebooting the wrong machine if you really make an effort.

You might also consider editing sudoers to change the sudo password prompt to include the hostname of the box you're on, so that you're less likely to issue commands to the wrong box if you're paying attention.
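
A minimal sketch of that sudoers tweak (edit with visudo; %H and %u are standard sudoers escapes for the host and user names):

Defaults passprompt="[sudo on %H] password for %u: "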

However, the real problem here is that you're assuming Linux should protect you from yourself. It won't; part of being a power user is not running privileged commands without exercising due care. With power comes responsibility!

As has been said before: "*nix is user friendly. It's just picky about who its friends are!"

--
"Oh, look: rocks!"
-- Doctor Who, "Destiny of the Daleks"

by Scott Gifford on 2009-05-21T20:51:53+00:00.

Bhasker C V writes:

> Is there a method to prevent accidental powerdown of a linux box ? or atleast alert ?

If you get in the habit of running "shutdown -r +1" instead of "reboot", it will warn users for 1 minute before shutting down the server. That should give you enough time to run "shutdown -c" to cancel the shutdown if you realize it's on the wrong machine.

Hope this helps,
-----Scott.
