Troubleshooting Remote Autonomous Servers


Introduction

Imagine that you are in charge of several dozen Linux servers, a mix of SLES and Red Hat. The servers are located in several remote datacenters with no IT staff. The only connection you have to them is via the iLO/DRAC remote control interface.

The most challenging part starts when one of the servers at a remote location stops responding. We assume that the network connection to the site is working and that the server has a functional remote administration unit (which, unfortunately, is not always the case; see iLO 3 -- HP engineering fiasco).

In this environment, instead of the supplementary role they play in a local datacenter, remote administration units suddenly become critical lifesavers. The art of using them to a large degree determines your skill as a Unix sysadmin in a critical situation, when the server has no functional networking and you cannot connect to it using SSH or X11. That includes the use of virtual media, as there is nobody at the remote site to put a CD/DVD in the slot. The DRAC 7 vFlash functionality makes a real difference in those circumstances, so personally I prefer Dell servers and blades to HP servers or blades. See Dell DRAC.

Theoretically, iLO and DRAC should be as reliable as an AK-47, but unfortunately both HP and Dell overloaded them with features at the expense of reliability.

There are several distinct areas that are unique to the autonomous server environment:

 

Toolset

As reliability and knowledge of tools are the most important factors in this environment, simple tools are preferable to complex ones. You just need to spend time learning them really well. For example, in-depth knowledge of SSH becomes really critical. Here is an interesting snippet from an Internet discussion on serverfault.com (remote access - How do you remotely administer your Linux boxes - Server Fault):

My toolset for these operations is painfully sparse (SSH into the box, edit files in VIM, WGET remote files that I need), and I suspect there is a much better way to do it. I'm curious to hear what other people in my position are doing.

Sparse? What on earth do you mean? Excuse me for ranting, but dismissing ssh, vim and wget as painful is almost insulting. From your question I deduce you are mainly a programmer for your daytime job, so I kinda understand the question. But honestly, I would not hire a Linux admin who is not comfortable with any of the three tools you mentioned.

Are you using some form of Windowing system and remote-desktop equivalent to access the box, or is it all command line? Managing remote Windows boxes is trivial, since you can simply remote desktop in and transfer files over the network. Is there an equivalent to this in the Linux world?

For administrator tasks I never, ever use an X environment. You do not need one; it will only take up system resources and, most of the time, be a hindrance instead of a help. Most GUI configuration tools (well, practically all, really) only offer a subset of the configuration options you can set in a configuration file with vim.

Managing a Linux box is no less trivial than managing a Windows box. It just takes some time to gain a decent skill set.

And a network file transfer equivalent? Plenty. scp, sftp, ftp, nfs, cifs / smb (Windows file sharing protocols), and then some.

Are you doing your config file changes/script tweaks directly on the machine? Or do you have something set up on your local box to edit these files remotely? Or are you simply editing them remotely then transferring them at each save?

Depends on what I am doing. Most of the things I do directly in the config files on the machine (for development and testing boxes) and then I push the file into a configuration channel on our Satellite server, after which I deploy the file to all servers directly (for production boxes). Really, vim is a treasure. That is, when you find out how to use it properly.

How are you moving files back and forth between the server and your local environment? FTP? Some sort of Mapped Drive via VPN?

scp all the way and maybe some sftp, and I suggest you do too. Never, ever use FTP to move sensitive files (e.g. config files) over a public network. I do not use a mapped network drive because, again, all I need is on the server. If you mean C files and not configuration files here, I usually use something like svn or git and then push my changes to the box.

I'd really need to get some best practices in place for administering these boxes. Any suggestions to remove some of the pain would be most welcome!

You are already using them: ssh, scp, wget and vim. Those are not pain. There might be some teething pains, while you figure out how powerful they are. But, to bring the Windows analogy back, I feel seriously hampered when I have to use a Windows box. For you it's the other way around. It's just what you are used to. So, give it some time and it'll come to you.

You mentioned already ssh, vim and wget which is essential and perfect. Some additional tools that can make life easier:

1. GNU Screen

"GNU Screen is a free terminal multiplexer that allows a user to access multiple separate terminal sessions inside a single terminal window or remote terminal session. It is useful for dealing with multiple programs from the command line, and for separating programs from the shell that started the program." (From the GNU Screen page on Wikipedia)

A main advantage is that you can have one or several virtual terminals that are in exactly the same state as you left them when you come back (i.e., re-login via ssh). This is also useful when your connection is broken for some reason.

Screen works independently from the software you use to connect to the box (it lives on the server), so it combines well with PuTTY or most other terminal software.

This article shows some nice things you can do with it: http://www.pastacode.de/extending-gnu-screen-adding-a-taskbar/en/
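A typical Screen workflow looks like this (the session name here is an arbitrary example):

```shell
# Start a named session on the remote server
screen -S maintenance

# ... do your work inside the session; detach with Ctrl-a d ...

# Later (e.g. after your SSH connection dropped), list sessions
screen -ls

# Reattach, forcibly detaching any stale attachment first
screen -dr maintenance
```

If the connection dies mid-operation, the session keeps running on the server and `screen -dr` picks it up exactly where you left it.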

2. Midnight Commander

A console-based, graphical-like tool for viewing and manipulating files and directories. It can also do secure remote transfers.

3. rsync

For fast, secure and reliable file transfer and synchronization between different locations.
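Typical invocations look like the following; the host name and paths are placeholders:

```shell
# Push a local tree to a remote server over SSH (-a preserves
# permissions and timestamps, -z compresses, -v is verbose)
rsync -avz /etc/myapp/ admin@remote.example.com:/etc/myapp/

# Pull a remote tree locally; --delete makes the local copy
# an exact mirror of the remote side
rsync -avz --delete admin@remote.example.com:/var/backups/ /srv/backups/remote/
```

Because rsync transfers only the differences, repeated runs over a slow WAN link to a remote datacenter are cheap.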

4. VCS

Use a distributed version control system like Bazaar, Mercurial or Git to update code. GitHub or Bitbucket offer hosted repositories, but that is not necessary: you can also use such a system efficiently on your own machines.

Joseph Kern: can you elaborate how you exactly use git for remote config organization?
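One common pattern (a sketch of the general idea, not necessarily what the answerer had in mind) is to keep configuration files under git, so every change on the box is a commit you can inspect and revert. A scratch directory is used here; on a real server this would be /etc or a dedicated config tree:

```shell
# Sketch: track configuration files in a git repository
rm -rf /tmp/config-repo && mkdir -p /tmp/config-repo && cd /tmp/config-repo
git init -q .

echo "PermitRootLogin no" > sshd_config
git add sshd_config
git -c user.name="admin" -c user.email="admin@example.com" \
    commit -q -m "Baseline sshd_config"

# Every later change is a commit, so 'git log -p' shows exactly
# what changed on the box and when
echo "MaxAuthTries 3" >> sshd_config
git -c user.name="admin" -c user.email="admin@example.com" \
    commit -q -am "Limit auth tries"

git log --oneline
```

From here the repository can be pushed to a central machine and the same change deployed to other servers.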

5. Terminal Clients

On Unix-like systems they are already on board; on Windows you can use PuTTY, Tera Term, MindTerm or Pandora. Or install Cygwin and ssh from a Cygwin terminal window to the remote boxes (which has additional advantages, but this is a question of preference).

6. Tunneling and Port Forwarding

It can be helpful to forward certain ports securely to your local machine. For example, you could forward the MySQL port (TCP 3306) or the PostgreSQL port (TCP 5432) and install a database administration tool locally.

You can build tunnels from Windows machines with PuTTY (or, on the command line, with its little brother plink); Cygwin and MindTerm can also do port forwarding. If you are locally on a Unix-like machine you can use ssh itself to create such tunnels.

To create more stable and permanent tunnels for various ports I recommend OpenVPN. Its point-to-point "pre-shared key" tunneling mode is not hard to set up.
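For example, to reach a MySQL server that only listens on the remote box itself (host names, user names and ports below are placeholders):

```shell
# Forward local port 3306 through SSH to port 3306 on the remote
# server; -N means "no remote command, just forward", -L sets up
# local-to-remote forwarding
ssh -N -L 3306:localhost:3306 admin@remote.example.com

# In another terminal, a locally installed client can now connect
# to the remote database via the tunnel
mysql -h 127.0.0.1 -P 3306 -u dbuser -p
```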

7. Have a local Unix-like system

When your local machine is a Mac you have this already: you can open a local shell. When your workstation is Windows-based it can be helpful to set up a local Unix-like server on the same local network. This can be a separate machine in a different room connected to the same router or switch. Or, if you want only one machine, you can install the free VMware Server and create a virtual machine, preferably with the same operating system as your remote machines. Install a Samba server on it and you can "net use" the Samba shares from your desktop.

If you run an SSH server on the local machine and open port 22 on your router for it, you can ssh into your local system when you are away.

You can build tunnels to remote machines or transfer and synchronize files and whole file trees with rsync. You can use it for testing, for VCS, for local development, as a local webserver, or for training purposes.

You can pull backups from remote machines, and you can create local cron jobs that do the backups automatically (e.g., databases you want to save locally on a regular basis).
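A minimal sketch of such a pull backup (the user, host, paths and schedule are all placeholders) could be a single crontab entry on the local machine:

```shell
# /etc/cron.d/remote-db-backup (hypothetical): every night at 02:30,
# pull the latest database dumps from the remote server over SSH,
# running as the local "backup" user
30 2 * * * backup rsync -az admin@remote.example.com:/var/backups/db/ /srv/backups/remote-db/
```

This assumes passwordless key-based SSH authentication is already set up for the backup user.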

Server does not boot problems

This is the most difficult problem to troubleshoot with remote autonomous servers. And you are lucky if DRAC/iLO works, because Murphy's Laws apply.

The most common reason for a previously operational server to boot into safe mode is a change to /etc/fstab that contains an error and was not checked after it was made. In this case the server boots into safe mode with the root filesystem mounted read-only. See Troubleshooting Errors in /etc/fstab.
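In that situation, the usual recovery sequence from the DRAC/iLO console is to remount the root filesystem read-write, fix the offending line, and verify the file before rebooting:

```shell
# Remount the read-only root filesystem read-write
mount -o remount,rw /

# Fix the bad entry
vi /etc/fstab

# Verify: try to mount everything listed in fstab; fix any errors
# it reports BEFORE rebooting
mount -a

reboot
```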

Other possible reasons include

As with any complex error, it is prudent to have a baseline of the server stored on some local storage pool. As access is via DRAC or iLO only, it is also prudent to log your actions via the script command. If you run destructive commands like rm or chmod, or find with the -exec option, test such a command on a local server first. Experimenting on the remote server can make a bad problem even worse.
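Logging an administrative session with the script command is as simple as the following (the log file location is an arbitrary example):

```shell
# Record everything typed and printed into a timestamped log file
script /root/logs/session-$(date +%Y%m%d-%H%M).log

# ... perform the risky work ...

# Stop recording (ends the shell started by script)
exit
```

After an incident, the log is often the only reliable record of what was actually done on the box.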

See Sysadmin Horror Stories

Usage of DRAC in command line console mode

Even when you cannot connect to the server directly, it is safer to work with the command line console via Teraterm than with the GUI console, and both DRAC and iLO allow such a mode. In this case you can cut and paste commands from your records or from the Web, minimizing possible errors. You can also log everything that you do via the Teraterm log facility or some similar method. Don't be blindly chained to the GUI console.

On Dell servers, the default settings of both the server BIOS and DRAC allow usage of the serial console via ssh.

Here are several links

I have a dell R710 with the idrac express with versions 1.2 firmware. I can login to the idrac via the web interface. I configured the idrac express according to:

http://support.dell.com/support/edocs/software/smdrac3/idrac/idrac12mono/en/ug/pdf/ug.pdf

Page 86 "Configuring iDRAC6 for Viewing Serial Output Remotely Over SSH/Telnet"

I can ssh in to the idrac, I then execute the command according to the manual:

/admin1-> console com2

I get the message:

Connected to Serial Device 2. To end type: ^\

At this point my login is hung: there is no output, and typing "^\" does nothing; I have to kill my ssh session.

This is a Red Hat 5.4 Linux machine. I do not care at this point about redirecting the console before boot up so I did not implement Page 90 "Configuring Linux for Serial Console Redirection During Boot"

Cannot find any explanation of the BIOS "redirection after boot enable/disable" setting mentioned on Page 86. I tried both enable and disable, no joy. Can not find good doc on overview and settings on console redirection for the various drac versions, that would be nice.

Any suggestions? Thanks.

===

After further investigation:

I did get console output if I rebooted the system, I could access the bios setup etc.

When the Linux kernel is loaded you must perform the steps in "Configuring Linux for Serial Console Redirection During Boot". I believe the "redirection after boot enable/disable" setting would be used to disable the use of the tty port so an external device could be connected to it. It would be nice to get a detailed explanation of all these settings.
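For reference, on RHEL 5 era systems the "Configuring Linux for Serial Console Redirection During Boot" step amounts to roughly the following. The device and speed are examples and must match the BIOS/DRAC serial settings; on many Dell systems the DRAC serial device is COM2, i.e. ttyS1:

```shell
# /boot/grub/menu.lst: send GRUB output to the second serial port
serial --unit=1 --speed=57600
terminal --timeout=5 serial console

# ... and append to the kernel line, so boot messages and the
# console go to both the VGA screen and the serial port:
#   console=tty0 console=ttyS1,57600

# /etc/inittab: spawn a login prompt on the serial port after boot
S1:2345:respawn:/sbin/agetty -L ttyS1 57600 vt100

# /etc/securetty must also list ttyS1 to allow root login there
```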

Usage of baseliners

Modern Linux is far too complex an OS for a person with a single head to fully understand ;-). So you need some helpers. One of the most useful is a baseliner. Baseliners can exist on three different levels
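Whatever the level, a baseliner can be as simple as a shell script that snapshots key system state into a dated directory, so that after an incident you can diff "now" against "known good". The output location and the exact command list below are just an illustration:

```shell
#!/bin/sh
# Minimal baseline snapshot: capture key system state into a dated dir
BASE=/tmp/baseline/$(date +%Y-%m-%d)
mkdir -p "$BASE"

uname -a       > "$BASE/uname.txt"
df -P          > "$BASE/df.txt"
cat /etc/fstab > "$BASE/fstab.txt"    2>/dev/null || true

# Package list: try rpm first, fall back to dpkg
rpm -qa > "$BASE/packages.txt" 2>/dev/null || \
dpkg -l > "$BASE/packages.txt" 2>/dev/null || true

ls "$BASE"
```

Run from cron and pulled to local storage with rsync, such snapshots build exactly the kind of history that makes a rare "serious" encounter with the server much less painful.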

Keeping history

One paradoxical problem with autonomous servers is that they are generally very reliable and can work with minimal administrator support. That is actually the main attraction of the technology. It is not uncommon to have one problem in a couple of years, or even less frequently. That means that from one "serious" encounter with the server to the next you forget everything. Essentially, at the moment you face a problem, this server is new to you and you need to relearn everything.

So you need to create and meticulously maintain an artificial memory, in the form of a wiki or just a regular website with a page for each server that contains its history. This is an extremely important activity, the value of which cannot be overestimated.

Some fragments of history can be extracted from the corporate helpdesk system and from the bash (or other shell) history for root. But the helpdesk system is usually an object of passionate hate in a large corporation, as this part of the infrastructure is so mismanaged by the IT brass that it is useless for troubleshooting.

Some part of the history is preserved in baselines; see above.

Usage of local personnel for troubleshooting

Remote personnel are often available but usually have the skill set of a typical PC user. So the art of using them for troubleshooting is a separate chapter of autonomous datacenter sysadmin skills. There are several recommendations here:

  1. Do not trust the information the user provides. Try to verify it.

  2. If you instruct the user to do something, be very specific. For example, do not just say: please type "yes"; specify also the case in which the user should type it. Otherwise you can have unpleasant surprises.

  3. Pictures from a camera are an important troubleshooting tool.

  4. A remote camera is very helpful in situations when you need to see the lights on the servers, or smoke. A remote microphone allows you to hear the sounds.

You can also buy a remote camera that works over an IP network and ask local personnel to put it in a specific position.



Old News ;-)

Distributed UNIX Systems Administration

Topic: System and Network Administration Author: John R. Wetsch

Author Bio:

John R. Wetsch was granted a Ph.D. for his development of the SAmatrix. He is an accomplished UNIX system administrator and C/C++ programmer who has worked with other administrators in the field validating his practical methods.

Key Benefits:

Are you in charge of a UNIX system distributed across multiple machines and locations? Are you sinking in a morass of system chores and failures? Distributed UNIX System Administration will give you both the solid, practical background you need for solving distributed system administration problems – and SAmatrix, an automated tool for managing administration tasks.

Distributed UNIX System Administration identifies and addresses the key components of a distributed UNIX system, and then organizes these components into 5 distinct modules: software, hardware, network, security and operations. The author also gives you a powerful tool to automate these components – SAmatrix, a C++ program for managing open systems.

SAmatrix brings order, reliability, and exception reporting to your distributed system, thus reducing downtime and increasing efficiency. It allows you to assess your system administration and management practices methodically – and then determine if your system is at risk with regard to reliability and integrity.

John Wetsch, the author, developed Samatrix after careful study of the dynamics of distributed systems. His organizational thesis is presented in these 5 modules:

With this newly gained understanding of the distributed system model, you will be ready to employ the SAmatrix in the regular analysis, appraisal and management of the network.

Module 1: Software Administration

Chapter 1 The basics

Chapter 2 File Systems

Chapter 3 Applications Maintenance

Module 2: Hardware Administration

Chapter 4 Device Drivers & Peripherals

Chapter 5 Hardware Maintenance

Module 3: Network Administration

Chapter 6 Understanding Networks

Chapter 7 The Internet

Module 4: Security Administration

Chapter 8 UNIX Security

Chapter 9 Layered UNIX Security

Module 5: Operations Administration

Chapter 10: Managing the UNIX Environment

Chapter 11: Implementing a Systems Administration Matrix

Chapter 12: Evaluating a Systems Administration Matrix

Published: July 1998
Price: US $49.95, trade paper with disk
Category: System Administration
R&D Books
336 pp, 7 x 9
Product code: rd2486
ISBN: 0-87930-540-1

Network Shell Delivers Concurrent Remote Administration

03/01/00, 3:00 a.m. ET

Network Shell 3.0, a remote systems administration toolset from Shpink Software, offers concurrent management of multiple Unix and Windows 95/98/NT machines from a single Unix or Windows NT administration station.

Network Shell provides a Shell and Perl environment that enables administrators to perform secure, automated, and/or interactive system administration without establishing a remote shell connection to each individual host.

The software helps manage complex Internet environments (via content replication and remote statistics gathering) by issuing commands to multiple machines simultaneously. An optional Windows NT client permits administrators to manage remote Windows and Unix systems.

Network Shell costs $199 per machine for a single Unix/NT server license.

Shpink Software, 3612 Santiago St., Ste. 100, San Mateo, CA 94403, (650) 525-1537 or (888) 492-6867, www.shpink.com .

Problem: You need to install applications, update files, and run commands and scripts on multiple Unix and Windows hosts across the Internet. Old solution: Telnet to or physically visit one machine at a time. New solution: Use Shpink Software's Network Shell. You just type a command here, reference a Perl script and ... yippee! Job done.

The Network Shell (NSH) is similar to a virtual private network, but tailored to the needs of a system administrator, or any reseller that needs to control multiple machines remotely. By installing the NSH daemon on remote Unix hosts, the service on Windows NT and the application on Windows 9x, we created end points connected to our Unix administration machine, which also ran the NSH daemon. Then, with tight authentication, we set up encrypted, secure connections to multiple machines.

Resellers, listen up! Because a single telnet session allows only one connection to another machine at a time, NSH is a major time-saver. You can install applications, perform backups, monitor disk usage, and run scripts and commands on all of your customers' computers in one quick and easy process. Granted, you do have to put in the legwork to install NSH on every computer, but everything has a price.

If you're running any variant of Unix, NSH has got you covered. Solaris, SCO, AIX, IRIX, FreeBSD and HP-UX all are supported, but you'll be out in left field if you run a Linux distribution other than Red Hat. However, Shpink did say it would support any Unix operating system if there was a demand.

We were extremely skeptical about security: by using this product over the Internet, you allow complete access to a machine. But NSH put us at ease. First, NSH must be running on both end points. What's more, the NSH installer is interactive, so it would be extremely difficult for a cracker to place NSH on a machine without a user knowing it. Second, two files (the exports and users files) control access to the host by providing read/write access to a user, a host, a user ID or any combination of those. The access files also are located locally, so permissions aren't transferred across the Internet; and even if they were, the traffic could be encrypted using DES or Blowfish.

We performed numerous tests. Our favorite was rebooting NT; the power surged through our fingers. We also shut down Linux; synchronized file directories using Shpink's own dsync command; installed software; and ran various remote administration commands. And then we found a little gem.

Running df -k (another Shpink command that lists drive statistics) on a remote Windows 98 host, we discovered we had access to any network resource that the machine had access to (in our case, NetWare resources). We didn't even have to log in; the drives appeared at our fingertips, with no need to mount or map any drives or partitions.

NSH is an excellent tool for providing service and support remotely to multiple customers. In addition, it's highly customizable with an application programming interface that allows a C programmer to write applications for distribution over NSH.


Recommended Links




Comparison of open-source configuration management software - Wikipedia, the free encyclopedia

Introduction to Puppet Streamlined System Configuration Linux.com

Puppet (software) - Wikipedia, the free encyclopedia

Puppet is used by the Wikimedia Foundation,[5] ARIN, Reddit,[6] Dell, Rackspace, Zynga, Twitter, the New York Stock Exchange, PayPal, Disney, Citrix Systems, Oracle, Yandex, the University of North Texas, the Los Alamos National Laboratory, Stanford University, Lexmark and Google, among others.[7]

Autonomic computing - Wikipedia, the free encyclopedia

IBM Systems Journal - Vol. 42, No. 1, 2003 - Autonomic Computing

...



Copyright © 1996-2016 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author's free time. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License.

Last modified: October 03, 2017