
The tar pit of Red Hat overcomplexity

The differences between RHEL 6 and RHEL 7 are no smaller than those between SUSE and RHEL, which essentially doubles the workload of sysadmins: the need to administer an "extra" flavor of Linux/Unix leads to mental overflow and loss of productivity. That's why most experienced sysadmins resent RHEL 7. The joke is that "The D in Systemd stands for 'Dammmmit!' "

Version 1.62, Mar 17, 2019

Without that discipline, too often, software teams get lost in what are known in the field as "boil-the-ocean" projects -- vast schemes to improve everything at once. That can be inspiring, but in the end we might prefer that they hunker down and make incremental improvements to rescue us from bugs and viruses and make our computers easier to use.

Idealistic software developers love to dream about world-changing innovations; meanwhile, we wait and wait for all the potholes to be fixed.

Frederick Brooks, Jr,
The Mythical Man-Month, 1975

Acknowledgment. I would like to thank the sysadmins who sent me feedback on version 1.0 of this article. It definitely helped to improve it. Mistakes and errors in the current version are my own.


Abstract

The key problem with RHEL7 is its adoption of systemd, along with multiple significant changes in standard utilities and daemons. The troubleshooting skills needed are different from RHEL6. It also obsoletes "pre-RHEL 7" books, including many (or even most) good books from O'Reilly and other publishers.

The distance between RHEL6 and RHEL7 is approximately the same as the distance between RHEL6 and SUSE 12, so we can view RHEL7 as a new flavor of Linux introduced into the enterprise environment. That increases the load on a sysadmin who previously supported RHEL5/6 and SUSE Enterprise 12 by 30% or more. The amount of headache and the hours you need to spend on the job increased, but salaries not so much. The staff of IT departments continues to shrink, despite the growth of overcomplexity.

Passing the RHCSA exam is now also more difficult than for RHEL 6 and requires at least a year of hands-on experience along with self-study, even for those who were certified on RHEL6. The "Red Hat engineer" certification became useless, as the complexity is such that no single person can master the whole system. Due to growing overcomplexity, tech support for RHEL7 mutated into a Red Hat Knowledgebase indexing service. Most problems now require extensive research to solve: hours of reading articles in the Red Hat Knowledgebase, documentation, posts on Stack Overflow and similar sites. In this sense the "Red Hat engineer" certification probably should be renamed to "Red Hat detective", or something like that.

This behavior of Red Hat, which amounts to the intentional destruction of the existing server ecosystem by misguided "desktop Linux" enthusiasts within the Red Hat developer community, and by Red Hat brass greed, requires some adaptation. The licensing structure of large enterprises regarding Red Hat should probably be changed: tech support should probably be diversified using hardware vendors such as Dell and HP. If you do not use Oracle Linux, it might make sense to add it to the mix, especially on the low end (self-support) and for servers running databases, where you can get higher quality support from Oracle than from Red Hat. As support increasingly became self-support anyway, it might make sense to switch to wider use of CentOS and Academic Linux on low-end, non-critical servers too, which allows buying Premium licenses for the critical servers for the same, or even less, money than a uniform licensing structure. Four-socket servers should now be avoided due to higher licensing costs (and with the ability to get 32 cores in a two-socket server they are not critical anyway). Docker should be more widely used instead of licensing virtual instances from Red Hat (the limitation on virtual instances is a typical IBM-style way of extracting money from customers).

Introduction

Imagine a language in which both grammar and vocabulary change each decade. Add to this that the syntax is complex, the vocabulary is huge, and each verb has a couple of dozen suffixes that often change its meaning, sometimes drastically. This is the "complexity trap" that we have in enterprise Linux. In this sense troubles with RHEL7 are just the tip of the iceberg. You can learn some subset while you closely work with a particular subsystem (package installation, networking, nfsd, httpd, sshd, Puppet/Ansible, Nagios, and so on and so forth), only to forget vital details after a couple of quarters, or a year. I have a feeling that RHEL is spinning out of control at an increasingly rapid pace. Many older sysadmins now have the sense that they can no longer understand the OS they need to manage and were taken hostage by the desktop-enthusiast faction within Red Hat. They now need to work in a kind of hostile environment. Many complain that they feel overwhelmed by this avalanche of changes and can't troubleshoot problems that they used to be able to troubleshoot before. In RHEL7 it is easier for an otherwise competent sysadmin to miss a step, or forget something important in the stress and pressure of the moment, even for tasks that he knows perfectly well. That means SNAFUs become more common. Overcomplexity of RHEL7, of course, increases the importance of checklists and a personal knowledgebase (most often a set of files, a simple website, or a wiki).

But creating those requires time that is difficult to find, and thus is often skipped: it is not uncommon to spend more time creating documentation for a problem than on its resolution. So the "document everything" mantra definitely increases the overload.

The situation has reached the stage where it is impossible to remember the set of information necessary for productive work. Now even "the basic set" is way too large for mere mortals, especially for sysadmins who need to maintain one additional flavor of Linux (say, SUSE Enterprise). Any uncommon task becomes a research project: you need to consult man pages, as well as web pages on the topic such as documents on the Red Hat portal, Stack Overflow discussions, and other sites relevant to the problem at hand. Often you can't repeat a task that you performed a quarter or two ago without consulting your notes and documentation.

Even worse is the "primitivization" of your style of work that results from overcomplexity. Sometimes you discover an interesting and/or more productive way to perform an important task. If you do not perform this task frequently and do not write it up, it will soon be displaced in your memory by other things: you will completely forget about it and degrade to the "basic" scheme of doing things. This is very typical of any environment that is excessively complex. That's probably why so few enterprise sysadmins these days have their personal .profile and .bashrc files and often simply use the defaults. Similarly, the way Ansible/Puppet/Chef are used is often a sad joke: they are used at so primitive a level that execution of scripts via NFS would do the same thing with less fuss. The same is true for Nagios, which in many cases is "primitivized" into a glorified ping.

RHEL6 was complex enough to cause problems due to overcomplexity. But RHEL7 raised the level of complexity to the point where it becomes painful to work with, and especially to troubleshoot complex problems. That's probably why the quality of Red Hat support deteriorated so much (it essentially became a referencing service to Red Hat advisories): they are overwhelmed and can no longer concentrate on a single ticket in the river of RHEL7-related tickets they receive daily. In many cases it is clear from their answers that they did not even try to understand the problem you face and just searched the database for keywords.

While the adoption of the hugely complex, less secure (with a new security exploit approximately once a quarter), and unreliable systemd instead of init probably played a major role in this situation (and if you patch your servers periodically you will notice that the lion's share of updates includes updates to systemd), other changes were also significant enough to classify RHEL 7 as a new flavor of Linux, distinct from the RHEL4-RHEL6 family: along with systemd, multiple new utilities and daemons were introduced, and many methods of doing common tasks changed, often dramatically. Especially in the area of troubleshooting.

RHEL7 should be viewed as a new flavor of Linux, not as a version upgrade. As such, the productivity of sysadmins juggling multiple flavors of Linux drops. The value of many (or most) high quality books about Linux was destroyed by this distribution, and there is no joy in seeing this vandalism.

It might well be that the resulting complexity crossed some "red line" (as in "the straw that broke the camel's back") and instantly became visible and annoying to vastly more people than before. In other words, with the addition of systemd quantity turned into quality. Look, for example, at the recovery of a forgotten root password in RHEL7 vs. RHEL6 (a sketch of the RHEL7 procedure is given below). This made this version of RHEL a dramatically less attractive choice, as the learning curve is steep and the troubleshooting skills needed are different from RHEL6. The value of many high quality books about Linux was destroyed by this distribution, and there is no joy in seeing this vandalism. And you feel somewhat like a prisoner of somebody else's decisions: no matter whether you approve of those changes or not, the choice is not yours. After 2020 most enterprise customers will need to switch, as RHEL6 will be retired. You have essentially no control.

RHEL7 might well have created the situation in which overcomplexity became visible and annoying to vastly more sysadmins than before (as in "the straw that broke the camel's back"). In other words, with the addition of systemd quantity turned into quality.

Still, it would be an exaggeration to say that all sysadmins (or RHEL 7 users in general) resent systemd. Many don't really care. Not much changed for "click-click-click" sysadmins who limit themselves to GUI tools, and for other categories of users who use RHEL7 "as is", such as Linux laptop users. But even for laptop users (who mostly use Ubuntu, not RHEL), there are doubts that they are benefitting from faster boot (systemd is at least 10 times larger than classic init, and SSD drives accelerated the loading of daemons dramatically, making the delayed-start approach adopted by systemd obsolete). It is also easier for entry-level sysadmins who are coming to enterprises right now and for whom RHEL7 is their first system. Most used Ubuntu at home and at university, and they do not know or use any system without systemd. Also, they use a much smaller subset of services and utilities, and generally are not involved with more complex problems connected with hung boots, botched networking, and crashed daemons. But enterprise sysadmins, who often are simultaneously "level 3 support" specialists, were hit hard. Now they understand the internals much less.

In a way, the introduction of systemd signifies the "Microsoftization" of Red Hat. This daemon replaces init in a way I find problematic: the key idea is to replace the language in which init scripts were written (which provided programmability and had its own, still fixable, flaws) with a fixed set of all-singing, all-dancing parameters in so-called "unit files", removing programmability.

In other words, the key idea of systemd is to replace that scripting language with an interpreter, written in C, of an ad hoc non-procedural language defined by those "unit file" parameters. This is a flawed approach from the architectural standpoint, and no amount of systemd fixes can change that.

Systemd is essentially an implementation of an interpreter of an implicit non-procedural language defined by those parameters, written by a person who had never written an interpreter for a programming language in his life and never studied writing compilers and interpreters as a discipline. Of course, that shows.

It supports over a hundred different parameters which serve as keywords of this ad hoc language, which we can call the "Poettering init language." You can verify the number of introduced keywords yourself, using a pipe like the following one:

cat /lib/systemd/system/[a-z]* | egrep -v "^#|^ |^\[" | cut -d '=' -f 1 | sort | uniq -c | sort -rn

Yes, this is "yet another language" (as if sysadmins did not have enough of them already), and a badly designed one judging from to the total number  of keywords/parameters introduced. The most popular are the following 35 (in reverse frequency of occurrence order):

    234 Description
    215 Documentation
    167 After
    145 ExecStart
    125 Type
    124 DefaultDependencies
    105 Before
     78 Conflicts
     68 WantedBy
     56 RemainAfterExit
     55 ConditionPathExists
     47 Requires
     37 Wants
     28 KillMode
     28 Environment
     22 ConditionKernelCommandLine
     21 StandardOutput
     21 AllowIsolate
     19 EnvironmentFile
     19 BusName
     18 ConditionDirectoryNotEmpty
     17 TimeoutSec
     17 StandardInput
     17 Restart
     14 CapabilityBoundingSet
     13 WatchdogSec
     13 StandardError
     13 RefuseManualStart
     13 PrivateTmp
     13 ExecStop
     12 ProtectSystem
     12 ProtectHome
     12 ConditionPathIsReadWrite
     12 Alias
     10 RestartSec
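
For orientation, here is a minimal sketch of a service unit file that uses several of the most common parameters from the list above. The service name and binary path (myservice, /usr/local/bin/myservice) are hypothetical placeholders, not anything shipped by Red Hat.

    [Unit]
    Description=Hypothetical example daemon
    Documentation=man:myservice(8)
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/local/bin/myservice --foreground
    Restart=on-failure
    RestartSec=5
    Environment=MYSERVICE_OPTS=--verbose

    [Install]
    WantedBy=multi-user.target

A file like this would typically be dropped into /etc/systemd/system/myservice.service and activated with systemctl daemon-reload, systemctl enable myservice, and systemctl start myservice.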

Another warning sign about systemd is the outsized attention paid to the subsystem that loads/unloads the initial set of daemons and then manages the working set of daemons, replacing init scripts and runlevels with a different and dramatically more complex alternative. Essentially they replaced a reasonably simple and understandable subsystem that had some flaws with a complex and opaque subsystem built around a non-procedural language defined by a multitude of parameters, while the logic of applying those parameters is hidden inside the systemd code. Which, due to its size, has a lot more flaws, as well as side effects, because the proliferation of those parameters and sub-parameters is a never-ending process: the answer to any new problem discovered in systemd is the creation of additional parameters or, in the best case, modification of existing ones. This looks to me like a self-defeating, never-ending spiral of adding complexity to this subsystem, which requires incorporating into systemd many things that simply do not belong in init, thus making it an "all-singing, all-dancing" super daemon. In other words, systemd expansion might be a perverted attempt to solve problems resulting from the fundamental flaws of the chosen approach.

This fascinating story of personal and corporate ambition gone awry still awaits its researcher. When we think about big disasters like, for example, the Titanic, it is not about the outcome -- yes, the ship sank, lives were lost -- or even the obvious cause (yes, the iceberg), but about learning WHY. Why were the warnings ignored? Why was such a dangerous course chosen? One additional, and perhaps most important for our times, question is: why were the lies that were told believed?

Currently customers are being assured (a ruse) that the difficulties with systemd (and RHEL7 in general) are just temporary, until the software can be improved and stabilized. While some progress was made, that day might never come due to the architectural flaws of the approach and the resulting increase in complexity, as well as the loss of flexibility, since programmability is now more limited. If you run the command find /lib/systemd/system | wc -l on a "pristine" RHEL system (just after installation) you get something like 327 unit files. Does this number raise questions about the level of complexity of systemd? Yes, it does. It is more than three hundred files, and with such a number it is reasonable to assume that some of them might have hidden problems in generating the correct init logic on the fly.
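
The count is easy to reproduce and to break down by unit type; the exact numbers vary with the release and the installed package set, so treat the figures above as rough indicators only.

    # total unit files shipped in the distribution unit directory
    find /lib/systemd/system -type f | wc -l

    # rough breakdown by unit type (service, target, socket, timer, ...)
    find /lib/systemd/system -type f | sed 's/.*\.//' | sort | uniq -c | sort -rn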

You can think of systemd as a "universal" init script that is customized on the fly by the supplied unit file parameters. Previously, part of this functionality was implemented as a PHP-style pseudo-language within the initial comment block of a regular bash script. While that implementation was very weak (it was never written as a specialized interpreter with a formal language definition, diagnostics and such), the approach was not bad at all in comparison with the extreme "everything is a parameter" approach taken by systemd, which eliminated bash from init file space. It might make sense to return to such a "mixed" approach in the future on a new level, as in a way systemd provides the foundation for such a language: parameters used to deal with dependencies and the like can be generated and converted into something less complex and more manageable.

One simple and educational experiment that shows the brittleness of the systemd approach is to replace /etc/passwd, /etc/shadow and /etc/group on a freshly installed RHEL7 VM with files from RHEL6 and see what happens during the reboot. (Such an error can happen in any organization with novice sysadmins who are overworked and want to cut corners during the installation of a new system, so this behavior is not only of theoretical interest.)
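
A rough sketch of that experiment, assuming a throwaway VM with a snapshot taken beforehand; the RHEL6 source host (rhel6box) is a hypothetical name.

    # DESTRUCTIVE: run only on a disposable RHEL7 test VM, never on a real server
    cp -p /etc/passwd /etc/passwd.rhel7.bak     # keep copies of the RHEL7 originals
    cp -p /etc/shadow /etc/shadow.rhel7.bak
    cp -p /etc/group  /etc/group.rhel7.bak
    scp root@rhel6box:/etc/{passwd,shadow,group} /etc/    # overwrite with the RHEL6 versions
    reboot                                      # then watch the console to see how the boot behaves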

Now about the complexity: the growth of the codebase (in lines of code) is probably more than ten-fold. I read that systemd has around 100K lines of C code (or 63K if you exclude journald, udevd, etc.). In comparison, sysvinit has about 7K lines of C code. The total number of lines in all systemd-related subsystems is huge and by some estimates is close to a million (Phoronix Forums). It is well known that the number of bugs grows with the total lines of code. At a certain level of complexity "quantity turns into quality": the number of bugs becomes infinite in the sense that the system can't be fully debugged, due to the intellectual limitations of authors and maintainers in understanding the codebase, as well as the gradual deterioration of conceptual integrity over time.

At a certain level of complexity "quantity turns into quality": the number of bugs becomes infinite in the sense that the system can't be fully debugged, due to the intellectual limitations of authors and maintainers in understanding the codebase, as well as the gradual deterioration of conceptual integrity over time.

It is also easier to implant backdoors in complex code, especially in privileged complex code. In this sense a larger init system means a larger attack surface, which at the current level of Linux complexity is already substantial, with never-ending security patches for major subsystems (look at the security patches in CentOS Pulse Newsletter, January 2019 #1901), as well as the parallel stream of zero-day exploits for each major version of Linux, on which such powerful organizations as the NSA work day and night, as this is now part of their toolbox. As the same systemd code is shared by all four major Linux distributions (RHEL, SUSE, Debian, and Ubuntu), systemd represents a lucrative target for zero-day exploits.

As boot time does not matter for servers (which often are rebooted just a couple of times a year), systemd raised the complexity level and made RHEL7 drastically different from RHEL6 while providing nothing constructive in return. The distance between RHEL6 and RHEL7 is approximately the same as the distance between RHEL6 and SUSE, so we can speak of RHEL7 as a new flavor of Linux and of the introduction of a new flavor of Linux into the enterprise environment. Which, like any new flavor of Linux, raises the cost of system administration (probably by around 20-30%, if the particular enterprise is using mainly RHEL6 with some SLES instances).

Systemd and other changes made RHEL7 as different from RHEL6 as the SUSE distribution is from Red Hat. Which, like any new flavor of Linux, raises the cost of system administration (probably by around 20-30%, if the particular enterprise is running RHEL6 and SUSE 12 right now)

Hopefully, before 2020 we do not need to upgrade to RHEL7 and have time to explore this new monster. But eventually, when support of RHEL6 ends, all RHEL enterprise users will either need to switch to RHEL7 or to an alternative distribution such as Oracle Linux or SUSE Enterprise. Theoretically, the option to abandon RHEL for another distribution that does not use systemd also exists, but corporate inertia is way too high and such a move is too risky to materialize. Some narrow segments, such as research, probably can use Debian without systemd, but in general the answer is no, because this is a Catch-22: adopting a distribution without systemd raises the level of complexity (via increased diversification of Linux flavors) to the same level as adopting RHEL7 with systemd.

Still, at least theoretically, this is an opportunity for Oracle to bring the Solaris solution of this problem to Linux, but I doubt that they want to spend money on it. They will probably continue their successful attempts to clone Red Hat (under the name of Oracle Linux), improving the kernel and some other subsystems to work better with the Oracle database. They might provide some extension of the useful life of RHEL6 though, as they did with RHEL5.

Taking into account the dominant market share of RHEL (which became the Microsoft of Linux), finding an alternative ("no-systemd") distribution is a difficult proposition. But the Red Hat acquisition by IBM might change that. Neither HP nor Oracle has any warm feelings toward IBM, and the pre-existing Red Hat neutrality toward major hardware and enterprise software vendors has now disappeared, making IBM the de-facto owner of enterprise Linux. As IBM enterprise software directly competes with enterprise software from HP and Oracle, that fact gives IBM a substantial advantage.

IBM acquisition of Red Hat made IBM the de-facto owner of Enterprise Linux

One emerging alternative to CentOS in research is Devuan (a clone of Debian created specifically to avoid using systemd). If this distribution survives till 2020, it can be used by customers who do not depend much on commercial software (for example, for genomic computational clusters). But the problem is that right now this is not an enterprise distribution per se, as there is no commercial support for it. I hope that they will create some startup to provide it. Still, it is pretty reliable and can compete with CentOS. The latest versions of Debian can also be installed without systemd and Gnome, but systemd remains the default option.

But most probably you will be forced to convert, in the same way you were forced to convert from Windows 7 to Windows 10 on the enterprise desktop. So much for open source as something providing a choice. This is a choice similar to the famous quote from Henry Ford's days: "A customer can have a car painted any color he wants as long as it’s black". With the current complexity of Linux, the key question is "Open for whom?" For hackers? Availability of the source code does not affect most enterprise customers much, as the main mode of installation of the OS and packages is via binary, precompiled RPMs, not source compiled on the fly as in Gentoo or FreeBSD/OpenBSD. I do not see RHEL7 in this respect as radically different from, say, Solaris 11.4, which is not supplied in source form to most customers. BTW, while more expensive (and with an uncertain future), Solaris is also more secure (due to RBAC and security via obscurity ;-) and has better polished lightweight virtual machines (aka zones, including branded zones), DTrace, as well as a better filesystem (ZFS), although XFS is not bad either, even if it is unable to fully utilize the advantages of SSD disks.

Actually, Lennart Poettering is an interesting Trojan horse within the Linux developer community. This is an example of how one talented, motivated, and productive Apple (or Windows) desktop-biased C programmer can cripple a large open source project when facing no organized resistance. Of course, that means that his goals align with the goals of Red Hat management, which is to control the Linux environment in a way similar to Microsoft -- via complexity (Microsoft can be called the king of software complexity), providing lame sysadmins with GUI-based tools for "click, click, click" style administration. Standardization on GUI-based tools for administration also provides more flexibility to change internals without annoying users -- tools whose inner workings they do not understand and are not interested in understanding. He single-handedly created a movement, which I would call "Occupy Linux" ;-). See Systemd Humor.

And despite wide resentment, I did not see the slogan "Boycott RHEL 7" too often, and none of the major IT websites joined the "resistance". The key to shoving systemd down developers' throats was its close integration with Gnome (both SUSE and Debian/Ubuntu adopted systemd). It is evident that the absence of an "architectural council" in projects like Linux is a serious weakness. It also suggests that developers from companies representing major Linux distributions became an uncontrolled elite of the Linux world, a kind of "open source nomenklatura", if you wish. In other words, we see the "iron law of oligarchy" in action here. See also Systemd invasion into Linux distributions.

Five flavors of Red Hat

 Like ice cream, Red Hat exists in multiple flavors:

  1. RHEL -- the leading commercial distribution of Linux. The lowest cost is the HPC node license, which is $100 per year (it needs a headnode license and provides only patches -- self-support -- with a restricted number of packages in the repositories). The patches-only regular Enterprise version (self-support license) is $350 per year. The Standard license, which provides patches and "mostly" web-based support, is $800 per server per year (two sockets max; 4 sockets are extra). The Premium license, which provides phone support for severity 1 and 2 issues (24x7), is more expensive (~$1300 per year). See for example CDW prices.
    • Three-year licenses are also available and are slightly cheaper. From HP and Dell you can buy five-year licenses as well. There is also an HPC license, in which you license the headnode with a regular or premium license and then each computational node is $100 (with a severely limited set of packages, and no GUI). I think the Oracle Linux self-support license is a better deal than the RHEL HPC license (see below).
    • Red Hat also went the "IBM way" and now charges more for 4-socket systems than for two-socket systems. With 16-core CPUs available (which means 32 cores in a two-socket server), 4-socket systems are now needed only for special, mostly computational and database, applications. Few applications scale well above 32 cores, as memory management becomes more convoluted with each additional core (although some categories of users do not understand that and thus belong to the bizarre pseudo-scientific cult whose main article of faith is "the more cores, the better" ;-).
  2. Oracle Linux. For all practical purposes this is a distribution identical to RHEL, but slightly cheaper. Self-support for the enterprise version is almost three times cheaper (~$120 per year). It can be used either with the Red Hat stock kernel or with the (free) Oracle-supplied kernel, which is supposedly less buggy and better optimized for running the Oracle database and similar applications (MySQL is one). The Oracle move to exploit the weaknesses of the GPL was pretty unique, and it proved to be shrewd and highly successful.
  3. CentOS -- a community-supported distribution that is now sponsored by Red Hat. Both the distribution itself and patches are free and are of reasonable, usable quality, although the general quality is lower than Oracle Linux. It suffers from limited resources, including access to servers to download patches (a kind of "tragedy of the commons" problem common to any free distribution -- in a way CentOS is a victim of its own success). The release of a clone of the most recent Red Hat version is typically delayed by several months, which is actually a good thing, except for beta-addicts. Here is how they define it in the wiki:

    CentOS is an Enterprise Linux distribution based on the freely available sources from Red Hat Enterprise Linux. Each CentOS version is supported for 7 years (by means of security updates). A new CentOS version is released every 2 years and each CentOS version is periodically updated (roughly every 6 months) to support newer hardware. This results in a secure, low-maintenance, reliable, predictable and reproducible Linux environment.

  4. Scientific Linux is an RHEL-based distribution produced by Fermi National Accelerator Laboratory and the European Organization for Nuclear Research (CERN). The web site is Scientific Linux. Like CentOS, it aims to be "as close to the commercial enterprise distribution as we can get it." It is a very attractive option for computational nodes of HPC clusters. It also appeals to individual application programmers in such areas as genomics and molecular modeling, as it provides a slightly better environment for software development than stock Red Hat.
  5. Fedora -- a community distribution, which generally can be considered a beta version of RHEL. Attractive mostly for beta-addicts. Red Hat Enterprise Linux 6 was forked from Fedora 12 and contains backported features from Fedora 13 and 14. Fedora was the first distribution to adopt systemd: versions starting with Fedora 15 use the systemd daemon by Lennart Poettering (a well-known desktop Linux enthusiast and Red Hat developer (mainly in C) who produced, to put it politely, "mixed feelings" in the server sysadmin community; along with systemd he also authored two other desktop-oriented daemons -- pulseaudio and avahi -- which also found their way into the server space).

We can also mention SUSE Enterprise, which, while a separate lineage, also uses RPM packages and is by and large compatible with RHEL, although it uses a different package manager. It also allows the use of the very elegant AppArmor kernel module for enhanced security (RHEL does not allow the use of this module).

Mutation of Red Hat tech support into a Knowledgebase indexing service: further deterioration of the quality of tech support compared to RHEL6, caused by overcomplexity

The quality of Red Hat tech support progressively deteriorated from RHEL4 to RHEL6, so further deterioration is nothing new. It is just a new stage of the same process, caused by the growing overcomplexity of the OS and the corresponding difficulty of providing support for the billions of features present in it.

But while this is nothing new, the drop in the case of RHEL7 was quite noticeable and quite painful. Despite several levels of support included in licenses (with premium supposedly being the higher level), technical support for really complex RHEL7 cases is uniformly weak. It has degenerated into a "look it up in the database" type of service: an indexing service for Red Hat's vast Knowledgebase.

While the Knowledgebase is a useful tool and, along with Stack Overflow, often provides good tips on where to look, it falls short of the classic notion of tech support. Only in rare cases does the recommended article exactly describe your problem. Usually this is just a starting point for your own research, no more, no less. In some cases the references to articles provided are simply misleading, and it is completely clear that the engineers on the other end have no time or desire to work seriously on your tickets and help you.

I would also dispute the notion that the title of Red Hat engineer makes sense for RHEL7. This distribution is so complex that it is clearly beyond the ability of even talented and dedicated professionals to comprehend. Each area, such as networking, process management, installation of packages and patching, security, etc., is incredibly complex. The certification exam barely scratches the surface of this complexity.

Of course, much depends on the individual, and the certification probably does serve a useful role, providing some assurance that the particular person can learn further; but in no way does passing it mean that the person has the ability to troubleshoot complex problems connected with RHEL7 (the usual meaning of the term engineer). This is just an illusion. He might be able to cover "in depth" one or two areas, for example troubleshooting of networking and package installation, or troubleshooting of web servers, or troubleshooting Docker. In all other areas he will be a much less advanced troubleshooter, almost a novice. And in no way can the person possess a "universal" troubleshooting skill for Red Hat. The network stack alone is so complex that it now requires not a vendor certification (like Cisco in the past), but a PhD in computer science in this area from a top university. And even that might not be enough, especially in an environment where a proxy is present and some complex stuff like bonded interfaces is used. Just making Docker pull images from repositories when your network uses a web proxy is a non-trivial exercise ;-).

If you have a complex problem, you are usually stuck, although premium service provides you an opportunity to talk with a live person, which might help. For appliances and hardware you now need to resort to the helpdesk of the OEM (which means getting an extended 5-year warranty from the OEM is now a must, not an option).

In a way, unless you buy a premium license, the only way to use RHEL7 is "as is". While acceptable for most production servers, this is not a very attractive proposition for a research environment.

Of course, this deterioration is only partially connected with RHEL7 idiosyncrasies. Linux generally became a more complex OS. But the trend from RHEL4 to RHEL7 is a clear trend toward a Byzantine OS that nobody actually knows well (even the number of utilities is now such that nobody probably knows more than 30% or 50% of them; yum and the installation of packages in general represent such a complex maze that you might be learning it your entire professional career). Other parts of the IT infrastructure are also growing in complexity, and additional security measures throw an additional monkey wrench into the works.

The level of complexity has definitely reached human limits, and I have observed several times that even if you learn some utility during a particular case of troubleshooting, you will soon forget it if the problem does not reoccur within, say, a year. It looks like acquired knowledge is displaced by new knowledge and wiped out due to the limited capacity of the human brain to hold information. In this sense the title "Red Hat Engineer" became a sad joke. The title probably should be "Red Hat detective", as many problems are really mysteries that deserve a new Agatha Christie to tell them to the world.

With RHEL 7 the "Red Hat Certified Engineer" certification became a sad joke as any such "engineer" ability to troubleshoot complex cases deteriorated significantly.  This is clearly visible when you are working with certified Red Hat engineers from Dell or HP, when using their Professional Services

The premium support level for RHEL7 is very expensive for small and medium firms. But there is no free lunch, and if you are using the commercial version of RHEL you simply need to pay this annual maintenance for some critical servers, just as insurance. Truth be told, for many servers you might well be OK with just patches (and/or buy a higher-level premium support license only for one server out of a bunch of similar servers). Standard support for RHEL7 is only marginally better than self-support using Google and access to the Red Hat Knowledgebase.

Generally, the less static your datacenter is, and the more unique types of servers you use, the more premium support licenses you need. But you rarely need more than one for each distinct type of server (aka "server group"). Moreover, for certain types of problems, for example driver-related problems, you now need such a level of support not from Red Hat but from your hardware vendor, as they provide better quality for problems related to hardware, which they know much better.

For database servers, getting a license from Oracle makes sense too, as Oracle engineers are clearly superior in this area. So diversification of licensing, and minimization and strategic placement of the most expensive premium licenses, now makes perfect sense and provides an opportunity to save some money. This approach has long been used for computational clusters, where typically only one node (the headnode) gets the premium support license, and all other nodes get the cheapest license possible, or even run Academic Linux or CentOS. Now it is time to generalize this approach to other parts of the enterprise datacenter.

Even if you learned something important today, you will soon forget it if you do not use it, as there are way too many utilities, applications, configuration files, etc. You name it. Keeping your own knowledgebase as a set of HTML files or a private wiki is now a necessity. Lab-journal-type logs are no longer enough.

Another factor that makes this "selective licensing" necessary is that prices increased from RHEL5 to RHEL7 (especially if you use virtual guests a lot; see the discussion in "RHEL 6: how much for your package?" (The Register)). Of course, Xen is preferable for the creation of virtual machines, so usage of "native" RHEL VMs persists mostly due to inertia. Also, instead of full virtualization, in many cases you can now use lightweight virtualization (or special packaging, if you wish) via Docker (as long as the major version of the kernel needed is the same). In any case, this Red Hat attempt to milk the cow by charging for each virtual guest is so IBM-like that it can generate nothing but resentment, and the resulting desire to screw Red Hat in return -- a feeling long known to IBM enterprise software customers ;-).

At first I was disappointed with the quality of RHEL7 support and even complained, but with time I started to understand that they have no choice but to screw customers: the product is way over their heads and the number of tickets to be resolved is too high, resulting in overload. The fact that the next day another person may be working on your ticket adds insult to injury. That's why for a typical ticket their help is now limited to pointing you to relevant (or semi-relevant ;-) articles in the Knowledgebase. Premium support is still above average, and they can switch you to a live engineer on a different continent in critical cases during late hours, so if server downtime matters this is a kind of (weak) insurance. In any case, my impression is that Red Hat support is probably overwhelmed by problems with RHEL7, even for subsystems fully developed by Red Hat such as the subscription manager and yum. Unless you are lucky and accidentally get a really knowledgeable guy who is willing to help. Attempting to upgrade the ticket to a higher level of severity sometimes helps, but usually does not.

The other problem (which I already mentioned) is that tickets are now assigned to a new person each day, so if the ticket is not resolved by the first person, you get into a treadmill, with each new person starting from scratch. That's why it is now important to submit a good write-up and an SOS file with each ticket from the very beginning, as this slightly increases your chances that the first support person will give you valid advice, not just a semi-useless reference to an article in the Knowledgebase. Otherwise their replies often demonstrate a complete lack of understanding of the problem you are experiencing.

If the ticket is important to you, there is no way you can just submit a description of the problem. You always need to submit the SOS tarball and, preferably, some results of your initial research. Do not expect that they will be examined closely, but the very process of organizing all the information you know for submission greatly helps to improve your understanding of the problem, and sometimes leads to resolution of the ticket by yourself. As RHEL engineers are used to working with plain vanilla RHEL installations, you generally need to point out what deviations your install has (a proxy is one example -- it should be mentioned and its configuration explicitly listed; complex cards like Intel four-port 10Gbit cards, or Mellanox Ethernet cards with an Infiniband layer from HP, need to be mentioned too).
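
For reference, generating the SOS tarball is a one-liner, assuming the sos package is installed; the case number below is a placeholder, and the exact output directory varies by release (typically /var/tmp on RHEL7).

    yum -y install sos                        # usually already present on RHEL7
    sosreport --case-id 01234567 --batch      # --batch skips the interactive prompts
    ls /var/tmp/sosreport-*.tar.xz            # attach the resulting tarball to the ticket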

Human expert competence vs excessive automation

Linux is a system whose complexity is far beyond the ability of mere mortals. Even seasoned sysadmins with, say, 20 years under their belt know only a tiny portion of the system -- the portion which their previous experience allowed them to study and learn. And in this sense the introduction of more and more complex subsystems into Linux is a crime. It simply makes an already bad situation worse.

The paradox of the complex systems with more "intelligence" that now tend to replace the simpler systems that existed before is that under "normal circumstances" they accommodate incompetence. As the Guardian noted, they are made easy to operate by automatically correcting mistakes.

Earl Wiener, a cult figure in aviation safety, coined what are known as Wiener’s Laws of aviation and human error. One of them was: “Digital devices tune out small errors while creating opportunities for large errors.” We might rephrase it as:

“Automation will routinely tidy up ordinary messes, but occasionally create an extraordinary mess.” It is an insight that applies far beyond aviation.

The behavior of the Anaconda installer in RHEL7 is a good example of this trend. Its behavior is infuriating for an expert (constant "hand-holding"), but is a life-saver for a novice: most decisions are taken automatically (it even prevents novices from accidentally wiping out existing Linux partitions), and while the resulting system is most probably suboptimal both in disk structure (I think it is wrong to put the root partition under LVM: that complicates troubleshooting and increases the possibility of data loss) and in the selection of packages (which can be corrected), it will probably boot after the installation.

Because of this, an unqualified operator can function for a long time before his lack of skill becomes apparent -- his incompetence is hidden, and if everything is OK such a situation can persist almost indefinitely. The problems start as soon as the situation goes out of control, for example when the server crashes or for some reason becomes unbootable. And God forbid such a system contains valuable data. There is some subtle analogy between what Red Hat did with RHEL7 and what Boeing did with the 737 MAX. Both tried to hide the flaws in architecture and overcomplexity via the introduction of additional software to make the product usable by an unqualified operator.

Even if a sysadmin was an expert, automatic systems erode his skills by removing the need for practice. They also diminish the level of understanding by introducing an additional intermediary in situations where previously the sysadmin operated "directly". This is the case with systemd: it introduces a powerful intermediary for many operations that previously were manual, and makes the boot process much less transparent and more difficult to troubleshoot if something goes wrong. In such instances systemd becomes an enemy of the sysadmin instead of a helper. This is a situation similar to HP RAID controllers which, when they lose their configuration, happily wipe out your disk array and create a new one, just to help you.

But such systems tend to fail either in unusual situations or in ways that produce unusual situations, requiring a particularly skilful response. The more capable and reliable an automatic system is, the greater the chance that the situation in case of disaster will be worse.

When something does go wrong in such circumstances, it is hard to deal with a situation that is very likely to be bewildering.

As an aside, there is a way to construct the interface so that it can even train sysadmins in the command line.

Regression to the previous system image as a troubleshooting method

Overcomplexity of RHEL7 also means that you need to spend real money on configuration management (or hire a couple of dedicated guys to create your own system), as comparison of "states" of the system and regression to a previous state are now often the only viable methods of resolving, or at least mitigating, the problem you face.

In view of those complications, the ability to return to a "previous system image" is now a must, and software such as ReaR (Relax-and-Recover) needs to be known very well by each RHEL7 sysadmin. The capacity of low-profile USB 3.0 drives (FIT drives) has now reached 256GB (and a Samsung SSD is also small enough that it can hang from the server USB port without much trouble, which increases this local space to 1TB), so they are now capable of storing at least one system image locally on the server (if you exclude large data directories), providing bare-metal recovery to a "known good" state of the system.

The option of booting into the last good image is what you might desperately need when you can't resolve the problem, which now happens more and more often, as in many cases troubleshooting leads nowhere. Those who are averse to the USB "solution" as "unclean" can, of course, store images on NFS.
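
A minimal ReaR configuration sketch for the NFS case; the share name (backupsrv:/export/rear) is a placeholder, and the exclude line is optional, shown only to illustrate skipping large data directories.

    # /etc/rear/local.conf
    OUTPUT=ISO                                   # produce a bootable rescue ISO
    BACKUP=NETFS                                 # store the backup archive on a network share
    BACKUP_URL=nfs://backupsrv/export/rear       # hypothetical NFS server and export
    # BACKUP_PROG_EXCLUDE=( "${BACKUP_PROG_EXCLUDE[@]}" '/data/*' )   # optionally skip large data filesystems

Then "rear mkbackup" creates both the rescue ISO and the backup archive, and booting that ISO on the same (or replacement) hardware offers a bare-metal restore to that state.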

In any case, this is the capability that you will definitely need in case of some "induced" problem such as a botched upgrade. Dell servers have flash cards in the enterprise version of DRAC, which can serve as a perfect local repository of configuration information (via git or any other suitable method) that will survive a crash of the server. And the configuration information stored in /etc, /boot, /root and a few other places is the most important information for the recovery of the system. What is important is that these repositories be updated continuously with each change of the state of the system.

Using Git or Subversion (or any other version control system that you are comfortable with) also now makes more sense. Git is not ideally suited for sysadmin change management, as by default it does not preserve attributes and ACLs, but packages like etckeeper add this capability (etckeeper is far from perfect, but at least it can serve as a starting point). This capitalizes on the fact that right after installation the OS usually works ;-). It just deteriorates with time, so we are starting to observe a phenomenon well known to Windows administrators: self-disintegration of the OS over time ;-). The problems typically come later, with additional packages, libraries, etc., which often are not critical for system functionality. "Keep it simple, stupid" is a good approach here, although for servers used in research this mantra is impossible to implement.
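
A short sketch of the etckeeper route, assuming the package is available to you (for RHEL7 it usually comes from EPEL rather than the base repositories):

    yum -y install etckeeper
    etckeeper init                               # creates a git repository in /etc and records ownership/permission metadata
    etckeeper commit "baseline after initial install"
    # ... later, after any manual change:
    etckeeper commit "switched ntpd to chronyd and opened udp/123"

By default etckeeper also hooks into the package manager, so installs and updates get committed automatically.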

Due to overcomplexity, you should be extremely frugal with the packages that you keep on the server. For example, for many servers you do not need X11 to be installed (and Gnome is a horrible package, if you ask me). LXDE (which is the default on Knoppix) is smaller and adequate for most sysadmins (even Cockpit, which has an even smaller footprint, might be adequate). That cuts a lot of packages, cuts the size of the /etc directory, and removes a lot of complexity. Avoiding a complex package in favor of a simpler and leaner alternative is almost always a worthwhile approach.
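
As a rough sketch of this kind of pruning (the group names below are the usual RHEL7 ones but vary between releases, so list them first and review what the removal would drag out before confirming):

    yum grouplist                                 # see which environment groups are installed
    yum group remove "GNOME Desktop"              # drop the desktop stack on a headless server
    yum group remove "Graphical Administration Tools"
    package-cleanup --leaves                      # from yum-utils: show leaf packages that nothing else requires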

RHEL 7 fiasco: adding desktop features in a server-oriented distribution

RHEL7 looks like a strange mix of desktop functionality artificially added to a distribution which is oriented strictly toward the server space (hence the word "Enterprise" in the name). This increase in complexity does not provide any noticeable return on the substantial investment in learning yet another flavor of Linux -- an investment which is absolutely necessary, as in no way can RHEL6 and RHEL7 be viewed as a single flavor of Linux. The difference is similar to the difference between RHEL6 and SLES 10.

The emergence of Devuan was a strong kick in the chin for Red Hat honchos, and they doubled their efforts in pushing systemd and resolving existing problems, with some positive results. Improvements to systemd during the life of RHEL7 are undeniable, and its quality is higher in recent RHEL7 releases, starting with RHEL 7.5. But architectural problems can't be solved by doubling the efforts to debug existing problems. They remain, and people who can compare systemd with the Solaris 10 solution instantly see all the defects of the chosen architecture.

Similarly, Red Hat honchos, either out of envy of Apple's (and/or Microsoft's) success in the desktop space, or for some other reason (such as the ability to dictate the composition of enterprise Linux), made this server-oriented Linux distribution a hostage of desktop Linux enthusiasts' whims, and by doing so broke way too many things. Systemd is just one, although the most egregious, example of this activity (see, for example, the discussion at Systemd invasion into Linux Server space).

You can browse horror stories about systemd using any search engine you like (they are very educational). Here is a random find from Down The Tech Rabbit Hole -- systemd, BSD, Dockers (May 31, 2018):

...The comment just below that is telling… (Why I like the BSD forums – so many clueful in so compact a space…)

Jan 31, 2018
#22
CoreOS _heavily_ relies on and makes use of systemd
and provides no secure multi-tenancy as it only leverages cgroups and namespaces and lots of wallpaper and duct tape (called e.g. docker or LXC) over all the air gaps in-between…

Most importantly, systemd can’t be even remotely considered production ready (just 3 random examples that popped up first and/or came to mind…), and secondly, cgroups and namespaces (combined with all the docker/LXC duct tape) might be a convenient toolset for development and offer some great features for this use case, but all 3 were never meant to provide secure isolation for containers; so they shouldn’t be used in production if you want/need secure isolation and multi-tenancy (which IMHO you should always want in a production environment).

SmartOS uses zones, respectively LX-zones, for deployment of docker containers. So each container actually has his own full network stack and is safely contained within a zone. This provides essentially the same level of isolation as running a separate KVM VM for each container (which seems to be the default solution in the linux/docker world today) – but zones run on bare-metal without all the VM and additional kernel/OS/filesystem overhead the fully-fledged KVM VM drags along.[…]

The three links having tree horror stories of systemD induced crap. Makes me feel all warm and fuzzy that I rejected it on sight as the wrong idea. Confirmation, what a joy.

First link:

10 June 2016
Postmortem of yesterday’s downtime

Yesterday we had a bad outage. From 22:25 to 22:58 most of our servers were down and serving 503 errors. As is common with these scenarios the cause was cascading failures which we go into detail below.

Every day we serve millions of API requests, and thousands of businesses depend on us – we deeply regret downtime of any nature, but it’s also an opportunity for us to learn and make sure it doesn’t happen in the future. Below is yesterday’s postmortem. We’ve taken several steps to remove single point of failures and ensure this kind of scenario never repeats again.

Timeline

While investigating high CPU usage on a number of our CoreOS machines we found that systemd-journald was consuming a lot of CPU.

Research led us to https://github.com/coreos/bugs/issues/1162 which included a suggested fix. The fix was tested and we confirmed that systemd-journald CPU usage had dropped significantly. The fix was then tested on two other machines a few minutes apart, also successfully lowering CPU use, with no signs of service interruption.

Satisfied that the fix was safe it was then rolled out to all of our machines sequentially. At this point there was a flood of pages as most of our infrastructure began to fail. Restarting systemd-journald had caused docker to restart on each machine, killing all running containers. As the fix was run on all of our machines in quick succession all of our fleet units went down at roughly the same time, including some that we rely on as part of our service discovery architecture. Several other compounding issues meant that our architecture was unable to heal itself. Once key pieces of our infrastructure were brought back up manually the services were able to recover...

I would just add that any sysadmin who works behind a proxy in an enterprise environment would be happy to smash a cake or two into a systemd developer's face. Here is an explanation of what is involved in adding a proxy to the Docker configuration (Control Docker with systemd, Docker Documentation):

HTTP/HTTPS proxy

The Docker daemon uses the HTTP_PROXY, HTTPS_PROXY, and NO_PROXY environmental variables in its start-up environment to configure HTTP or HTTPS proxy behavior. You cannot configure these environment variables using the daemon.json file.

This example overrides the default docker.service file. If you are behind an HTTP or HTTPS proxy server, for example in corporate settings, you need to add this configuration in the Docker systemd service file.

  1. Create a systemd drop-in directory for the docker service:
    $ sudo mkdir -p /etc/systemd/system/docker.service.d
  2. Create a file called /etc/systemd/system/docker.service.d/http-proxy.conf that adds the HTTP_PROXY environment variable:
    [Service]
    Environment="HTTP_PROXY=http://proxy.example.com:80/"

    Or, if you are behind an HTTPS proxy server, create a file called /etc/systemd/system/docker.service.d/https-proxy.conf that adds the HTTPS_PROXY environment variable:

    [Service]
    Environment="HTTPS_PROXY=https://proxy.example.com:443/"
  3. If you have internal Docker registries that you need to contact without proxying you can specify them via the NO_PROXY environment variable:
    [Service]    
    Environment="HTTP_PROXY=http://proxy.example.com:80/" "NO_PROXY=localhost,127.0.0.1,docker-registry.somecorporation.com"

    Or, if you are behind an HTTPS proxy server:

    [Service]    
    Environment="HTTPS_PROXY=https://proxy.example.com:443/" "NO_PROXY=localhost,127.0.0.1,docker-registry.somecorporation.com"
  4. Flush changes:
    $ sudo systemctl daemon-reload
  5. Restart Docker:
    $ sudo systemctl restart docker
  6. Verify that the configuration has been loaded:
    $ systemctl show --property=Environment docker
    Environment=HTTP_PROXY=http://proxy.example.com:80/

    Or, if you are behind an HTTPS proxy server:

    $ systemctl show --property=Environment docker
    Environment=HTTPS_PROXY=https://proxy.example.com:443/

In other words, you now have to deal with various idiosyncrasies that did not exist with regular startup scripts. As everything in systemd is "yet another setting", it adds an opaque and difficult-to-understand layer of indirection, with the problems amplified by poorly written documentation. The fact that you can no longer edit a shell script to make such a simple change as adding a couple of environment variables to set a proxy (some applications need a different proxy than "normal" applications, for example a proxy without authentication) is an anxiety-producing factor.
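For completeness, the same drop-in mechanism applies to any unit, not only Docker. Here is a minimal sketch, assuming a hypothetical myapp.service and a made-up proxy host (systemctl edit is available in later RHEL 7 releases):

    # create /etc/systemd/system/myapp.service.d/override.conf interactively
    systemctl edit myapp.service
    # put into the override file, for example:
    #   [Service]
    #   Environment="HTTP_PROXY=http://proxy.example.com:3128/"
    systemctl daemon-reload
    systemctl restart myapp.service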

Revamping Anaconda in a "consumer friendly" fashion in RHEL 7 was another "enhancement", which can well be called a blunder.  It can be partially compensated by the use of kickstart files (a minimal kickstart sketch follows the quoted post below).  There is also an option to modify Anaconda itself, but that requires a lot of work; see the Red Hat Enterprise Linux 7 Anaconda Customization Guide. The road to hell is paved with good intentions: in their excessive zeal to protect users from errors in deleting existing partitions (BTW, RHEL stopped displaying labels for partitions some time ago and now displays UUIDs only) and to make Anaconda more "fool proof", they made the life of professional sysadmins really miserable. Recently I tried to install RHEL 7.5 on a server with an existing partition (which required a unique install, so Kickstart was not too suitable) and discovered that in the RHEL 7 installer GUI there is no easy way to delete existing partitions. I was forced to exit the installation, boot from a RHEL 6 ISO, delete the partition, and then continue (at that point I realized that choosing RHEL 7 for this particular server was a mistake and corrected it ;-). Later I found the following post from 2014, which describes very well the same situation that I encountered in 2019: 

Can not install RHEL 7 on disk with existing partitions - Red Hat Customer Portal

 May 12 2014  | access.redhat.com
Can not install RHEL 7 on disk with existing partitions. Menus for old partitions are grayed out. Can not select mount points for existing partitions. Can not delete them and create new ones. Menus say I can but installer says there is a partition error. It shows me a complicated menu that warns me what is about to happen and asks me to accept the changes. It does not seem to accept "accept" as an answer. Booting to the recovery mode on the USB drive and manually deleting all partitions is not a practical solution.

You have to assume that someone who is doing manual partitioning knows what he is doing. Provide a simple tool during install so that the drive can be fully partitioned and formatted. Do not try so hard to make tools that protect us from our own mistakes. In the case of the partition tool, it is too complicated for a common user, so don't assume that the person using it is an idiot.

The common user does not know hardware and doesn't want to know it. He expects security questions during the install and everything else should be defaults. A competent operator will backup anything before doing a new OS install.  He needs the details and doesn't need an installer that tells him he can't do what he has done for 20 years.
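Returning to the kickstart workaround mentioned above: a kickstart fragment can wipe existing partitioning before the GUI ever gets a chance to argue with you. A sketch, under the assumption that the target drive is sda (adjust to your hardware; this destroys existing data):

    zerombr
    clearpart --all --initlabel --drives=sda
    autopart --type=lvm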

RHEL 7 does a lot of other things that are unnecessary in the server space. Most of these changes also increase complexity by hiding "basic" things behind additional layers of indirection.  For example, try to remove Network Manager in RHEL 7.  Now explain to me why we need it in a server room full of servers that are permanently attached by 10 Mbit/sec cables and that do not run any virtual machines or Docker.

RHEL 7 denies the usefulness of previous knowledge which makes overcomplexity more painful

But what is really bad is that RHEL 7 in general, and systemd in particular, denies the usefulness of previous knowledge of Red Hat and destroys the value of several dozen good or even excellent books related to system administration published in 1998-2015.  That's a big and unfortunate loss, as many authors of excellent Linux books will never update them for RHEL 7.  In a way, the deliberate desire to break with the past demonstrated by systemd might well be called vandalism.

And Unix is an almost 50-year-old OS (1973-2019). Linux was created around 1991 and in 2019 will be 28 years old. In other words, this is a mature OS with more than two decades of history under its belt. Any substantial changes to an OS of this age are not only costly, they are highly disruptive.

systemd denies the usefulness of previous knowledge of Red Hat and destroys the value of books related to system administration published in 1998-2015  

This utter disrespect to people who spent years learning Red Hat and writing about it increased my desire to switch to CentOS or Oracle Linux. I do not want to pay Red Hat money any longer; Red Hat support is now outsourced and has deteriorated to the level where it is almost completely useless, a "database query with a human face", unless you buy premium support. And even in that case much depends on which analyst your ticket is assigned to.

Moreover, the architectural quality of RHEL 7 is low. Services mixed into the distribution are produced by different authors and as such have no common configuration standards, including such an important thing for the enterprise as enabling a web proxy for those services that need to deal with the Internet (yum is one example). At the same time the complexity of this distribution is considerably higher than RHEL 6, with almost zero return on investment.  Troubleshooting is different and much more difficult.  A seasoned sysadmin switching to RHEL 7 feels like a novice on a skating rink for several months, and as far as troubleshooting complex problems goes, much longer than that. 

It is clear that the main problem with RHEL 7 is not systemd by itself but the fact that the resulting distribution suffers from overcomplexity. It is way too complex to administer and requires a regular human to remember way too many things -- things that can never fit into one head. That by definition makes it a brittle system, an elephant that nobody understands completely, much like Microsoft Windows. And I am talking not only about systemd, which was introduced in this version; other subsystems suffer from overcomplexity as well. Some of them came directly from Red Hat, some did not (Biosdevname, which is a perversion, came from Dell).

Troubleshooting became not only more difficult; in complex cases you are also provided with less information (especially in cases where the system does not boot).  This is partially due to the general complexity of the modern Linux environment (library hell, introduction of redundant utilities, etc.) and partially due to the increased complexity of hardware. But one of the key reasons is the lack of architectural vision on the part of Red Hat.

Add to this multiple examples of sloppy programming (dealing with proxies is still not unified, and individual packages have their own settings which can conflict with the environment variable settings) and old warts (the RPM format is now old and has partially outlived its usefulness, creating rather complex issues with patching and installing software -- issues that take a lot of sysadmin time to resolve), and you get the picture. 

A classic example of Red Hat ineptitude is how they handle proxy settings. Even for software fully controlled by Red Hat, such as yum and subscription manager, they put proxy settings in each and every configuration file. Why not put them in /etc/sysconfig/network and always, consistently, read them from the environment variables first and from this file second?  Any well-behaved application should read environment variables, which should take precedence over settings in configuration files. They do not do it.  God knows why. 
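For illustration, this is the kind of single, central setting one would expect to define once. A sketch; the proxy host is an assumption, and tools that only read their own config files (yum, subscription-manager) will still ignore it -- which is exactly the complaint:

    # /etc/profile.d/proxy.sh
    export http_proxy=http://proxy.example.com:3128
    export https_proxy=$http_proxy
    export no_proxy=localhost,127.0.0.1,.example.com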

They even managed to embed the proxy settings from /etc/rhsm/rhsm.conf into the yum file /etc/yum.repos.d/redhat.repo, so the proxy value is taken from that file, not from your /etc/yum.conf settings as you would expect.  Moreover, this is done without any elementary consistency checks: if you make a pretty innocent mistake and specify the proxy setting in /etc/rhsm/rhsm.conf as

proxy http://yourproxy.yourdomain.com

the Red Hat registration manager will accept this and will work fine. But for yum to work properly, the proxy specification in /etc/rhsm/rhsm.conf requires just the DNS name, without the http:// or https:// prefix -- the https:// prefix will be added blindly (and that's wrong) in redhat.repo, without checking whether you already specified an http:// (or https://) prefix. This SNAFU leads to redhat.repo containing a proxy statement of the form https://http://yourproxy.yourdomain.com

At this point you are in for a nasty surprise -- yum will not work with any Red Hat repository, and there are no meaningful diagnostic messages. It looks like RHEL managers are either engaged in binge drinking, or watch too much porn on the job ;-). 

Yum, which started as a very helpful utility, has also gradually deteriorated. It turned into a complex monster which requires quite a bit of study and introduces a set of very complex bugs of its own, some of which are almost features.

SELinux was never a transparent security subsystem and has a lot of quirks of its own. Its key idea is far from elegant, unlike the key idea of AppArmor (per-application profiles restricting access to key directories and config files), which has all but disappeared from the mainstream Linux security landscape. Many sysadmins simply disable SELinux, leaving only the firewall to protect the server. Some applications require disabling SELinux for proper functioning, and this is specified in their documentation.
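For reference, checking and relaxing SELinux uses the standard commands below (whether disabling it is an acceptable tradeoff is a judgment call for your environment):

    getenforce                  # prints Enforcing, Permissive, or Disabled
    setenforce 0                # switch to permissive mode until the next reboot
    # permanent change: set SELINUX=permissive (or disabled) in /etc/selinux/config and reboot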

The deterioration of architectural vision within Red Hat as a company is clearly visible in the terrible (simply terrible, without any exaggeration) quality of the customer portal, which is probably the worst I have ever encountered. Sometimes I open tickets just to understand how to perform a particular operation on the portal.  The old, or "Classic" as they call it, RHEL customer portal was actually OK and even had some useful features. Then for some reason they tried to introduce something new and completely messed things up.  As of August 2017 the quality has somewhat improved, but it still leaves much to be desired. Sometimes I wonder why I am still using the distribution, if the company which produces it (and charges substantial money for it) is so tremendously architecturally inept that it is unable to create a usable customer portal. And in view of the existence of Oracle Linux I do not really know the answer to this question.  Although Oracle has its own set of problems, thousands of EPEL packages, signed and built by Oracle, have been added to the Oracle Linux yum server.

Note about moving directories /bin, /lib, /lib64, and /sbin to /usr

In RHEL 7 Red Hat decided to move four system directories (/bin, /lib, /lib64 and /sbin), previously located at /, into /usr.  They were replaced with symlinks to preserve compatibility.

[0]d620@ROOT:/ # ll
total 316K
dr-xr-xr-x.   5 root root 4.0K Jan 17 17:46 boot/
drwxr-xr-x.  18 root root 3.2K Jan 17 17:46 dev/
drwxr-xr-x.  80 root root 8.0K Jan 17 17:46 etc/
drwxr-xr-x.  12 root root  165 Dec  7 02:12 home/
drwxr-xr-x.   2 root root    6 Apr 11  2018 media/
drwxr-xr-x.   2 root root    6 Apr 11  2018 mnt/
drwxr-xr-x.   2 root root    6 Apr 11  2018 opt/
dr-xr-xr-x. 125 root root    0 Jan 17 12:46 proc/
dr-xr-x---.   7 root root 4.0K Jan 10 20:55 root/
drwxr-xr-x.  28 root root  860 Jan 17 17:46 run/
drwxr-xr-x.   2 root root    6 Apr 11  2018 srv/
dr-xr-xr-x.  13 root root    0 Jan 17 17:46 sys/
drwxrwxrwt.  12 root root 4.0K Jan 17 17:47 tmp/
drwxr-xr-x.  13 root root  155 Nov  2 18:22 usr/
drwxr-xr-x.  20 root root  278 Dec 13 13:45 var/
lrwxrwxrwx.   1 root root    7 Nov  2 18:22 bin -> usr/bin/
lrwxrwxrwx.   1 root root    7 Nov  2 18:22 lib -> usr/lib/
lrwxrwxrwx.   1 root root    9 Nov  2 18:22 lib64 -> usr/lib64/
-rw-r--r--.   1 root root 292K Jan  9 12:05 .readahead
lrwxrwxrwx.   1 root root    8 Nov  2 18:22 sbin -> usr/sbin/

You would think that RHEL RPMs now use the new locations. Wrong. You can do a simple experiment on a RHEL 7 or CentOS 7 based VM -- delete those symlinks using the command

rm -f /bin /lib /lib64 /sbin 

and see what happens ;-). Actually, recovery from this situation might make a good interview question for candidates applying for senior sysadmin positions.
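One possible recovery path, sketched under the assumption of an LVM root volume named rhel-root (device names will differ; the installer's rescue mode may also mount the system for you at /mnt/sysimage): boot the installation ISO in rescue mode and recreate the symlinks.

    # from the rescue shell, with the damaged root filesystem mounted at /mnt
    mount /dev/mapper/rhel-root /mnt
    cd /mnt
    ln -s usr/bin   bin
    ln -s usr/sbin  sbin
    ln -s usr/lib   lib
    ln -s usr/lib64 lib64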

Note on the never-ending stream of security patches

First and foremost not too much zeal
( Talleyrand advice to young diplomats)

In security patches Red Hat found its El Dorado. I would say that they greatly helped Red Hat to become a billion-dollar-revenue company, despite the obvious weakness of GPL for creating a stable revenue stream. The constantly shifting sands created by over-emphasizing security patches compensate for the weakness of GPL to the extent that profitability can approach the level of a typical commercial software company.

And Red Hat produces them at the rate of around a dozen a week or so ;-). Keeping up with patches is an easy metric to implement in the Dilbertized neoliberal enterprise, and sometimes it even becomes the key metric by which sysadmins are judged, along with another completely fake metric -- average time to resolve tickets of a given category. So clueless corporate brass often play the role of a Trojan horse for Red Hat by enforcing overly strict (and mostly useless) patching policies, which in such cases are often not adhered to in practice, creating the gap between documentation and reality typical of any corporate bureaucracy. If this part of IT is outsourced, everything becomes contract negotiation, including this, and the patching situation becomes even more complicated, with additional levels of indirection and bureaucratic avoidance of responsibility. 

Often patching is elevated to such a high priority that it consumes the lion's share of sysadmin effort (and is loved by sysadmins who are not good at anything else, as they can report their accomplishments each month ;-), while other important issues linger in obscurity and disrepair.

While not patching servers at all is a questionable policy (one that borders on recklessness), too frequent patching of servers is the other extreme, and extremes meet. As Talleyrand advised young diplomats, "First and foremost, not too much zeal." This is fully applicable to the patching of RHEL servers.

Security of a Linux server (taking into account that RHEL is the Microsoft Windows of the Linux world and the most attacked flavor of Linux) is mostly an architectural issue -- especially a network architecture issue.

For example, using address translation (a private subnet within the enterprise) is now a must. A proxy for some often-abused protocols is desirable.

As an attack can happen not only via the main interface but also via the management interface, and you do not control the remote management firmware, providing a separate, tightly controlled subnet for DRAC/ILO, accessible only from selected IPs belonging to the sysadmin team, improves security far more than patching DRAC firmware ever can (see Linux Servers Most Affected by IPMI Enabled JungleSec Ransomware by Christine Hall).

Similarly, implementing a jumpbox for accessing servers from the DHCP segment (corporate desktops) via ssh also improves ssh security without any patches (and in the past OpenSSH was a source of very nasty exploits, as it is a very complex subsystem). There is no reason why "regular" desktops on the DHCP segment should be able to initiate a login directly (typically IP addresses for sysadmin desktops are made "pseudo-static" by supplying the DHCP server with the MAC addresses of those desktops). 

In any case, as soon as you see an email from the top IT honcho congratulating subordinates on achieving 100% compliance with the patching schedule in an enterprise that does not implement those measures, you know what kind of IT and what kind of enterprise that is.

I mentioned two no-nonsense measures (a separate subnet for DRAC/ILO and a jumpbox for ssh), but there are several other "common sense" measures which affect security far more than any tight patching schedule. Among them:

So if those measures are not implemented, creating spreadsheets with the dates of installation of patches is just a perversion of security, not a real security improvement -- yet another Dilbertized enterprise activity, as well as a measure that provides some justification for the existence of the corporate security department.  In reality the level of security is by and large defined by the IT network architecture, and no amount of patching will improve a bad enterprise IT architecture. 

I think this never-ending stream of security patches serves a dual purpose, with the secondary goal being to entice customers to keep a continuous Red Hat subscription for the life of the server.  For research servers, for example, I am not at all sure that updating only at the next minor release (say, from CentOS 7.4 to CentOS 7.5) provides a less adequate level of security.  And such a schedule creates much less fuss, enabling people to concentrate on more important problems. 

And if the server is Internet-facing, then a well-thought-out firewall policy and an enforcing SELinux policy block most "remote exploits" even without any patching. Usually only the HTTP port is open (the ssh port in most cases should be served via VPN or an internal DMZ subnet if the server is on premises), and that limits exposure, patching or no patching, although of course it does not eliminate it.  Please understand that the mere existence of the NSA and CIA with their own teams of hackers (and their foreign counterparts) guarantees, at any point in time, the availability of zero-day exploits for "lucrative" targets at the level of the network stack or lower. Against these zero-day threats, reactive security is helpless (and patching is a reactive security measure).  In this sense, scanning of servers and running a set of predefined exploits (mainly PHP-oriented) by script kiddies is just noise and does not qualify as an attack, no matter how dishonestly security companies try to hype this threat.  The real threat is quite different.

If you need higher security you probably need to switch to a distribution that supports the AppArmor kernel module (SUSE Enterprise is one). Or, if you have competent sysadmins with Solaris certification, to Solaris (on Intel, or even SPARC), which not only provides better overall security but also implements a "security via obscurity" defense that alone stops 99.95% of Linux exploits dead in their tracks (and outside of encryption algorithms there is nothing shameful in using security via obscurity; even for encryption algorithms it can serve a useful role too ;-). Solaris has RBAC (role-based access control) implemented, which makes it possible to compartmentalize many roles (including root). Sudo is just a crude attempt to "bolt on" RBAC onto Linux and FreeBSD. Solaris also has a special hardened version called Trusted Solaris.

Too frequent patching of a server also sometimes introduces subtle changes that break applications in ways that are not easy to detect and debug.

All-in-all "too much  zeal" in patching the servers and keep up with the never ending stream of RHEL patches often represents a voodoo dance around the fire. Such thing happens often when that ability to understand a very complex phenomenon is beyond normal human comprehension. In this case easy "magic" solution, a palliative is adopted.

Also, as I noted above, in this area the real situation on corporate servers most often does not correspond to the spreadsheets and presentations shown to corporate management.

RHEL licensing

We have two distinct problems with RHEL licensing: the licensing model itself, with its multiple types of subscriptions, and the licensing management system (RHSM).

Licensing model: four different types of RHEL licenses

Red Hat is struggling to fence off "copycats" by complicating access to the source of patches, but the problem is that its licensing model is pretty much Byzantine.  It is based on a half-dozen different types of subscriptions, and the most useful ones are pretty expensive.  That dictates the necessity of diversification within this particular vendor and of combining/complementing it with licenses from other vendors. With RHEL 7 it no longer makes sense to license all your servers from Red Hat.

As for the Byzantine structure of Red Hat licensing, I resented paying Red Hat for our 4-socket servers to the extent that I stopped using this type of server and completely switched to two-socket servers -- which, with the rising core count of Intel CPUs, was an easy escape from RHEL restrictions and greed.  Currently Red Hat probably has the most complex, the most Byzantine system of subscriptions, comparable with IBM (which is probably the leader in "licensing obscurantism" and in the ability to force customers to overpay for its software ;-).

And there are at least four different RHEL licenses for real (hardware-based) servers ( https://access.redhat.com/support/offerings/production/sla ):

  1. Self-support. If you have many identical or almost identical servers or virtual machines, it does not make sense to buy standard or premium licenses for all of them. One such license can be bought for one server and all others can run with self-support licenses (which provide access to patches). They are sometimes used for a group of servers, for example a park of webservers with only one server getting standard or premium licensing. 
    • Self-support provides the most restrictive set of repositories. For missing packages you can use the EPEL repository, or download packages from CentOS repositories. (Including CentOS repositories in the yum configuration breaks RHEL.)
    • NOTE: In most cases it makes sense to buy Oracle Linux self-support license instead
  2. Standard. This actually means web-only support, although formally you can try chat and phone during business hours (if you manage to get to a support specialist), so it is mostly web-based support.  This level of subscription provides a better set of repositories, but Red Hat is playing games in this area and some of them now require additional payment.
  3. Premium (web and phone, with phone support 24x7 for severity 1 and 2 problems). Here you really can get a specialist on the phone if you have a critical problem, even after hours.
  4. HPC computational nodes. These are the cheapest, but they have a limited number of packages in the distribution and repositories. In this sense using Oracle Linux self-support licenses is a better deal for computational nodes than this type of RHEL license; sometimes CentOS can be used too, which eliminates the licensing problem completely but adds other problems (access to repositories is sometimes an issue as too many people are using too few mirrors). In any case, I have positive experience with using CentOS for computational clusters that run bioinformatics software.  The headnode can be licensed from Red Hat or Oracle.  Recently I have found that Dell provides pretty good support for headnode-type problems too -- better than Red Hat.  The same is probably true for HP Red Hat support.
    • NOTE: In most cases it makes sense to buy Oracle Linux self-support license instead
  5. The No-Cost RHEL Developer Subscription has been available since March 2016 -- I do not know much about this license.

There are also several restricted support contracts:

The RHEL licensing scheme is based on so-called "entitlements", which, oversimplifying, means one license for a 2-socket server. In the past they were "mergeable", so if your 4-socket license expired and you had two spare two-socket licenses, Red Hat was happy to accommodate your needs. Now they are not, and they use the IBM approach to this issue.  And that's a problem :-).

Now you need the right mix if you have different classes of servers. All is fine until you use a mixture of licenses (for example some cluster licenses, some patch-only (aka self-support), and some premium licenses -- four types of licenses altogether). In the past it was almost impossible to predict where a particular license would land; but if you used a uniform server park, that scheme worked reasonably well (actually better than the current model). Red Hat fixed the problem of unpredictability of where a particular license goes (now you have full control) but created a new problem -- if your license expires, it is now your problem: with subscription manager it is not replaced by a license from the "free license pool". 

Buying the RHEL license together with the server works well, but to cut costs you now need to buy a five-year license with the server, so that the cost of the Red Hat license can be capitalized. A five-year term, however, means that you lose the ability to switch to a different flavor of Linux. Most often this is not a problem, but still.

This is also a good solution for computational cluster licenses -- Dell and HP can install basic cluster software on the enclosure for a minimal fee. They will try to push additional software on you which you might not want or need (Bright Cluster Manager in the case of Dell), but that's a solvable issue (you just do not extend the support contract after the evaluation period ends).  

And believe me, this HPC architecture is very useful outside computational tasks, so HPC node licenses can be used beyond the areas where computational clusters are traditionally deployed, in order to cut licensing costs.  It is actually an interesting paradigm for managing a heterogeneous datacenter; the only problem is that you need to learn to use it :-). For example, SGE is a well-engineered scheduler (originally from Sun, later open-sourced) that, with some minimal additional software, can be used as an enterprise scheduler.  While it is free software it beats many commercial offerings, and while it lacks calendar scheduling, any calendar scheduler can be used with it to compensate (even cron -- in this case each cron task becomes an SGE submit script; see the sketch below).  Another advantage is that it is no longer updated, as updates of commercial software are often just directed at milking customers and add nothing to, or even subtract from, the value of the software.  Other cluster schedulers can be used instead of SGE as well.
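To illustrate the cron-plus-SGE point: a crontab entry can simply hand the real work to the scheduler. A sketch; the SGE settings path, queue name and job script are assumptions specific to your installation:

    # crontab entry: the heavy lifting is submitted to SGE instead of running locally
    0 2 * * *  . /opt/sge/default/common/settings.sh && qsub -q batch.q /cluster/jobs/nightly_report.sh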

Using an HPC-like configuration with a "headnode" and "computational nodes" is an option to lower the fees if you run multiple similar servers which do not need any fancy software (for example, a blade enclosure with 16 blades used as an HTTP server farm, or an Oracle DB farm). It is relatively easy to organize a particular set of servers as a cluster with SGE (or another similar scheduler) installed on the head node and a common NFS (or GPFS) filesystem exported to all nodes.  Such a common filesystem is ideal for complex maintenance tasks and by itself serves as a poor man's software configuration system. Just add a few simple scripts and parallel execution software (see the sketch below), and in many cases that is all you need for configuration management ;-).  It is amazing that many aspects/advantages of this functionality are not that easy to replicate with Ansible, Puppet, or Chef. 
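A sketch of the "few simple scripts plus parallel execution" idea, assuming pdsh is installed and the nodes share an NFS-mounted /cluster filesystem (node names are made up):

    pdsh -w node[01-16] 'bash /cluster/scripts/apply_baseline.sh'
    pdsh -w node[01-16] 'yum -y update openssl' | dshbak -c   # collate identical output per node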

BTW, Hadoop is now a fashionable thing (while being just a simple case of distributed search), and you can always claim that this is a Hadoop-type service, which justifies calling such an architecture a cluster.  In this case you can easily pay for a premium license for the headnode and one node, while all other computational nodes are $100 a year each or so. Although you can get a self-support license from Oracle for almost the same price ($119 per year the last time I checked) without any of the Red Hat restrictions.

As you can mix and match licenses, it is recommended to buy Oracle self-support licenses instead of RHEL self-support licenses (the Oracle license costs $119 per year the last time I checked). It provides a better set of repositories and does not entail any Red Hat restrictions, so, again, why bother licensing self-support servers from Red Hat?

The idea of a Guinea Pig node

Rising costs also created a strong preference for creating server groups in which only one server has an "expensive" license and is used as a guinea pig for problems, while all others enjoy self-support licenses. That allows you to get full support for complex problems by replicating them on the Guinea Pig node. And outside the financial industry, companies are now typically tight on money for IT. 

It is natural that at a certain, critical level of complexity, qualified support disappears. Real RHEL gurus are unique people with tremendous talent, and they are exceedingly rare -- a dying breed.  What you get now is low-level support which mostly consists of pointing you to an article in the Red Hat knowledgebase that is relevant (or often irrelevant) to your case. With patience you can escalate your ticket and get to a proper specialist, but the amount of effort might not justify it -- for the same time and money you might be better off using good external consultants, for example from universities, which have high-quality people in need of additional money. 

In most cases the level of support can still be viewed as acceptable only if you have a Premium subscription. But at least for me this creates resentment; that's why I now try to install CentOS or Scientific Linux instead when they work OK for a particular application.  You can buy a Premium license for only one node out of several similar ones, saving on the rest and using this node as the Guinea Pig for problems.

The new licensing scheme, while improving the situation in many areas, has a couple of serious drawbacks, such as the handling of 4-socket servers and the license expiration problem. We will talk about 4-socket servers later (and who needs them now that Intel packs 16 or more cores into one CPU ;-). But the overly tight control of licenses that Red Hat probably enjoys is a hassle for me: if a license expires, it is now "your problem", as there is no automatic renewal from the pool of available licenses (which was one of the few positive things about the now-discontinued RHN).

Instead you need to write scripts and deploy them on all nodes to be able to run the patch command on all nodes simultaneously via some parallel execution tool (which is an acceptable way to deploy security patches after testing). 

And please note that large corporations typically overpay Red Hat for licensing, as they have bad or non-operational licensing controls and prefer to err on the side of more licenses than they need. This adds insult to injury -- why on earth can't an available free license automatically replace an expired one when an equivalent pool of licenses is available and unused? 

Diversification of RHEL licensing and tech support providers

Of course, diversifying licensing away from Red Hat is now an option that should be given serious consideration. One important fact is that college graduates now come to the enterprise with knowledge of Ubuntu, and because of that they tend to deploy applications using Docker, as they are more comfortable and more knowledgeable in Ubuntu than in CentOS or Red Hat Enterprise Linux. This is a factor that should be considered.

But there is an obvious Catch-22 with switching to another distribution: adding Debian/Devuan to the mix is also not free -- it increases the cost of administration by around the same amount as the switch to RHEL 7. Introducing "yet another Linux distribution" into the mix usually carries approximately the same cost, estimated at around a 20-30% loss of sysadmin productivity. As such, "diversification" of flavors of Linux in an enterprise environment should generally be avoided at all costs. 

So, while paying Red Hat to continue systemd development is not the best strategy, switching to an alternative distribution which is not a RHEL derivative and typically uses a different package manager entails substantial costs and substantial uncertainty. Many enterprise applications simply do not support flavors other than Red Hat and require a licensed server for installation.  So the pressure to conform to the whims of the Red Hat brass is high, and most people are not ready to drop Red Hat just because of the problems with systemd. Strategies for mitigating the damage caused by systemd are therefore probably the most valuable avenue of action.  One such strategy is diversification of RHEL licensing and of tech support providers, while not abandoning RHEL-compatible flavors.

This diversification strategy should first of all include wider use of CentOS as well as Oracle Linux as more cost-effective alternatives.  That allows purchasing the more costly "premium" license for the servers that really matter. Taking into account the increased complexity, buying such a license is simply sound insurance against increased uncertainty.  You do need at least one such license for each given class of servers in the datacenter, as well as for the cluster headnode if you have computational clusters in your datacenter.

So a logical step would be switching to "license groups", in which only one server is licensed with an expensive premium subscription and all others run with the minimal self-support license.  This plan can be executed even better with Oracle than with Red Hat, as Oracle has a lower price for the self-support subscription. 

The second issue is connected with using Red Hat as the support provider. Here we also have alternatives, as both Dell and HP provide "in house" support of RHEL for their hardware. While less known, SUSE provides RHEL support too.

It is important to understand that due to the excessive complexity of RHEL 7, and the flow of tickets related to systemd, Red Hat tech support has mostly degenerated to the level of "pointing to a relevant Knowledgebase article."  Sometimes the article is relevant and helps to solve the problem, but often it is just a "go to hell" type of response -- an imitation of support, if you wish.  In the past (in the days of RHEL 4) the quality of support was much better and you could even discuss your problem with a support engineer.  Now it is unclear what we are paying for.  My experience suggests that the most complex problems are typically, in one way or another, connected with the interaction of hardware with the OS. If this observation is true, it might be better to use alternative providers, which in many cases provide higher quality tech support as they are more specialized.

So if you have substantial money allocated to support (and here I mean 100 or more systems to support), you should probably be thinking about a third party that suits your needs too. There are two viable options here, as mentioned above: "in house" RHEL support from your hardware vendor (Dell or HP), and SUSE.

Note about licensing management system

There are two licensing systems used by Red Hat:

  1. Classic (RHN) -- the old system that was phased out in mid-2017 (now of historical interest only)
  2. "New" (RHSM) -- a newer, better system used predominantly on RHEL 6 and 7. Obligatory since August 2017.

RHSM is complex and requires study.  Many hours of sysadmin time are wasted on mastering its complexities, while in reality it is just overhead that allows Red Hat to charge money for the product. So the fact that they do NOT support it well tells us a lot about the level of deterioration of the company.  Those highly paid Red Hat honchos essentially created a new job in the enterprise environment -- license administrator. Congratulations!

All in all, Red Hat has successfully created an almost impenetrable mess of obsolete and semi-obsolete notes, poorly written and incomplete documentation, dismal diagnostics and poor troubleshooting tools. The level of frustration sometimes reaches such heights that people just abandon RHEL. I did for several non-critical systems. If CentOS or Scientific Linux works, there is no reason to suffer from Red Hat licensing issues. That also makes Oracle, surprisingly, a more attractive option :-). Oracle Linux is also cheaper. But usually you are bound by corporate policy here.

"New" subscription system (RHSM) is slightly better then RHN for large organizations, but it created new probalme, for example a problem with 4 socket servers which now are treated as a distinct entity from two socket servers. In old RHN the set of your "entitlements" was treated uniformly as licensing tokens and can cover various  number of sockets (the default is 2). For 4 socket server it will just take two 2-socket licenses. This was pretty logical (albeit expensive) solution. This is not the case with RHNSM. They want you to buy specific license for 4 socket server and generally those are tiled to the upper levels on RHEL licensing (no self-support for 4 socket servers). In RHN, at least, licenses were eventually converted into some kind of uniform licensing tokens that are assigned to unlicensed systems more or less automatically (for example if you have 4 socket system then two tokens were consumed). With RHNSM this is not true, which creating for large enterprises a set of complex problems. In general licensing by physical socket (or worse by number of cores -- an old IBM trick) is a dirty trick step of Red Hat which point its direction in the future.

RHSM allows you to assign a specific license to a specific box and to list the current status of licensing.  But like RHN it requires proxy settings in its configuration file; it does not take them from the environment. If the company has several proxies and you have a mismatch, you can be royally screwed. In general you need to check the consistency of your environment settings with the settings in the conf files.  The level of understanding of proxy environments by RHEL tech support is basic or worse, so they fall back on their database of articles instead of actually troubleshooting based on sosreport data. Moreover, each day a new person might be working on your ticket, so there is no continuity.
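If you have to live with it, at least set and verify the proxy through subscription-manager itself rather than editing files in several places. A sketch; the proxy host and port are assumptions:

    subscription-manager config --server.proxy_hostname=proxy.example.com --server.proxy_port=8080
    subscription-manager config --list | grep -i proxy      # what RHSM thinks it is using
    grep -i proxy /etc/rhsm/rhsm.conf /etc/yum.conf /etc/yum.repos.d/redhat.repo   # check for conflicts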

The RHEL System Registration Guide (https://access.redhat.com/articles/737393) is weak and does not cover the more complex cases and typical mishaps.

"New" subscription system (RHSM) is slightly better then old RHN in a sense that it creates for you a pool of licensing and gives you the ability to assign more expensive licensees to the most valuable servers.  It allows to assign specific license to specific box and to list the current status of licensing. 

Troubleshooting

Learn More

The RHEL System Registration Guide (https://access.redhat.com/articles/737393) outlines the major options available for registering a system (and carefully avoids mentioning bugs and pitfalls, of which there are many).  For some reason, migration from RHN to RHSM usually worked well.

Also potentially useful (to the extent any poorly written Red Hat documentation is useful) is the document How to register and subscribe a system to the Red Hat Customer Portal using Red Hat Subscription-Manager (https://access.redhat.com/solutions/253273). At least it tries to answer some of the most basic questions. 

Other problems with RHEL 7

Mitigating damage done by systemd

Some  tips:

Sometimes you just need to be inventive and add an additional startup script to mitigate the damage. Here is one realistic example (systemd sucks):

How to umount NFS before killing any processes. or How to save your state before umounting NFS. or The blind and angry leading the blind and angry down a thorny path full of goblins. April 29, 2017

A narrative, because the reference-style documentation sucks.

So, rudderless Debian installed yet another god-forsaken solipsist piece of over-reaching GNOME-tainted garbage on your system: systemd. And you've got some process like openvpn or a userspace fs daemon or so on that you have been explicitly managing for years. But on shutdown or reboot, you need to run something to clean up before it dies, like umount. If it waits until too late in the shutdown process, your umounts will hang.

This is a very very very common variety of problem. To throw salt in your wounds, systemd is needlessly opaque even about the things that it will share with you.

"This will be easy."

Here's the rough framework for how to make a service unit that runs a script before shutdown. I made a file /etc/systemd/system/greg.service (you might want to avoid naming it something meaningful because there is probably already an opaque and dysfunctional service with the same name already, and that will obfuscate everything):

[Unit]
Description=umount nfs to save the world
After=networking.service

[Service]
ExecStart=/bin/true
ExecStop=/root/bin/umountnfs
TimeoutSec=10
Type=oneshot
RemainAfterExit=yes

The man pages systemd.unit(5) and systemd.service(5) are handy references for this file format. Roughly, After= indicates which service this one is nested inside -- units can be nested, and this one starts after networking.service and therefore stops before it. The ExecStart is executed when it starts, and because of RemainAfterExit=yes it will be considered active even after /bin/true completes. ExecStop is executed when it ends, and because of Type=oneshot, networking.service cannot be terminated until ExecStop has finished (which must happen within TimeoutSec=10 seconds or the ExecStop is killed).

If networking.service actually provides your network facility, congratulations, all you need to do is systemctl start greg.service, and you're done! But you wouldn't be reading this if that were the case. You've decided already that you just need to find the right thing to put in that After= line to make your ExecStop actually get run before your manually-started service is killed. Well, let's take a trip down that rabbit hole.

The most basic status information comes from just running systemctl without arguments (equivalent to list-units). It gives you a useful triple of information for each service:

greg.service                loaded active exited

loaded means it is supposed to be running. active means that, according to systemd's criteria, it is currently running and its ExecStop needs to be executed some time in the future. exited means the ExecStart has already finished.

People will tell you to put LogLevel=debug in /etc/systemd/system.conf. That will give you a few more clues. There are two important steps about unit shutdown that you can see (maybe in syslog or maybe in journalctl):

systemd[4798]: greg.service: Executing: /root/bin/umountnfs
systemd[1]: rsyslog.service: Changed running -> stop-sigterm

That is, it tells you about the ExecStart and ExecStop rules running. And it tells you about the unit going into a mode where it starts killing off the cgroup (I think cgroup used to be called process group). But it doesn't tell you what processes are actually killed, and here's the important part: systemd is solipsist. Systemd believes that when it closes its eyes, the whole universe blinks out of existence.

Once systemd has determined that a process is orphaned -- not associated with any active unit -- it just kills it outright. This is why, if you start a service that forks into the background, you must use Type=forking, because otherwise systemd will consider any forked children of your ExecStart command to be orphans when the top-level ExecStart exits.

So, very early in shutdown, it transitions a ton of processes into the orphaned category and kills them without explanation. And it is nigh unto impossible to tell how a given process becomes orphaned. Is it because a unit associated with the top level process (like getty) transitioned to stop-sigterm, and then after getty died, all of its children became orphans? If that were the case, it seems like you could simply add to your After rule.

After=networking.service getty.target

For example, my openvpn process was started from /etc/rc.local, so systemd considers it part of the unit rc-local.service (defined in /lib/systemd/system/rc-local.service). So After=rc-local.service saves the day!

Not so fast! The openvpn process is started from /etc/rc.local on bootup, but on resume from sleep it winds up being executed from /etc/acpi/actions/lm_lid.sh. And if it failed for some reason, then I start it again manually under su.

So the inclination is to just make a longer After= line:

After=networking.service getty.target acpid.service

Maybe getty@.service? Maybe systemd-user-sessions.service? How about adding all the items from After= to Requires= too? Sadly, no. It seems that anyone who goes down this road meets with failure. But I did find something which might help you if you really want to:

systemctl status 1234

That will tell you what unit systemd thinks that pid 1234 belongs to. For example, an openvpn started under su winds up owned by /run/systemd/transient/session-c1.scope. Does that mean if I put After=session-c1.scope, I would win? I have no idea, but I have even less faith. systemd is meddlesome garbage, and this is not the correct way to pay fealty to it.

I'd love to know what you can put in After= to actually run before vast and random chunks of userland get killed, but I am a mere mortal and systemd has closed its eyes to my existence. I have forsaken that road.

I give up, I will let systemd manage the service, but I'll do it my way!

What you really want is to put your process in an explicit cgroup, and then you can control it easily enough. And luckily that is not inordinately difficult, though systemd still has surprises up its sleeve for you.

So this is what I wound up with, in /etc/systemd/system/greg.service:

[Unit]
Description=openvpn and nfs mounts
After=networking.service

[Service]
ExecStart=/root/bin/openvpn_start
ExecStop=/root/bin/umountnfs
TimeoutSec=10
Type=forking

Here's roughly the narrative of how all that plays out:

So, this EXIT_STATUS hack... If I had made the NFS its own service, it might be strictly nested within the openvpn service, but that isn't actually what I desire -- I want the NFS mounts to stick around until we are shutting down, on the assumption that at all other times, we are on the verge of openvpn restoring the connection. So I use the EXIT_STATUS to determine if umountnfs is being called because of shutdown or just because openvpn died (anyways, the umount won't succeed if openvpn is already dead!). You might want to add an export > /tmp/foo to see what environment variables are defined.

And there is a huge caveat here: if something else in the shutdown process interferes with the network, such as a call to ifdown, then we will need to be After= that as well. And, worse, the documentation doesn't say (and user reports vary wildly) whether it will wait until your ExecStop completes before starting the dependent ExecStop. My experiments suggest Type=oneshot will cause that sort of delay...not so sure about Type=forking.

Fine, sure, whatever. Let's sing Kumbaya with systemd.

I have the idea that Wants= vs. Requires= will let us use two services and do it almost how a real systemd fan would do it. So here's my files:

Then I replace the killall -9 openvpn with systemctl stop greg-openvpn.service, and I replace systemctl start greg.service with systemctl start greg-nfs.service, and that's it.

The Requires=networking.service enforces the strict nesting rule. If you run systemctl stop networking.service, for example, it will stop greg-openvpn.service first.

On the other hand, Wants=greg-openvpn.service is not as strict. On systemctl start greg-nfs.service, it launches greg-openvpn.service, even if greg-nfs.service is already active. But if greg-openvpn.service stops or dies or fails, greg-nfs.service is unaffected, which is exactly what we want. The icing on the cake is that if greg-nfs.service is going down anyways, and greg-openvpn.service is running, then it won't stop greg-openvpn.service (or networking.service) until after /root/bin/umountnfs is done.

Exactly the behavior I wanted. Exactly the behavior I've had for 14 years with a couple readable shell scripts. Great, now I've learned another fly-by-night proprietary system.

GNOME, you're as bad as MacOS X. No, really. In February of 2006 I went through almost identical trouble learning Apple's configd and Kicker for almost exactly the same purpose, and never used that knowledge again -- Kicker had already been officially deprecated before I even learned how to use it. People who will fix what isn't broken never stop.

As an aside - Allan Nathanson at Apple was a way friendlier guy to talk to than Lennart Poettering is. Of course, that's easy for Allan -- he isn't universally reviled.

A side story

If you've had systemd foisted on you, odds are you've got Adwaita theme too.

rm -rf /usr/share/icons/Adwaita/cursors/

You're welcome. Especially if you were using one of the X servers where animated cursors are a DoS. People who will fix what isn't broken never stop.

[update August 10, 2017]

I found out the reason my laptop double-unsuspends and other crazy behavior is systemd. I found out systemd has hacks that enable a service to call into it through dbus and tell it not to be stupid, but those hacks have to be done as a service! You can't just run dbus on the commandline, or edit a config file. So in a fit of pique I ran the directions for uninstalling systemd.

It worked marvelously and everything bad fixed itself immediately. The coolest part is restoring my hack to run openvpn without systemd didn't take any effort or thought, even though I had not bothered to preserve the original shell script. Unix provides some really powerful, simple, and *general* paradigms for process management. You really do already know it. It really is easy to use.

I've been using sysvinit on my laptop for several weeks now. Come on in, the water's warm!

So this is still a valuable tutorial for using systemd, but the steps have been reduced to one: DON'T.

[update September 27, 2017]

systemd reinvents the system log as a "journal", which is a binary format log that is hard to read with standard command-line tools. This was irritating to me from the start because systemd components are staggeringly verbose, and all that shit gets sent to the console when the services start/stop in the wrong order such that the journal daemon isn't available. (side note, despite the intense verbosity, it is impossible to learn anything useful about why systemd is doing what it is doing)

What could possibly motivate such a fundamental redesign? I can think of two things off the top of my head: The need to handle such tremendous verbosity efficiently, and the need to support laptops. The first need is obviously bullshit, right -- a mistake in search of a problem. But laptops do present a logging challenge. Most laptops sleep during the night and thus never run nightly maintenance (which is configured to run at 6am on my laptop). So nothing ever rotates the logs and they just keep getting bigger and bigger and bigger.

But still, that doesn't call for a ground-up redesign, an unreadable binary format, and certainly not deeper integration. There are so many regular userland hacks that would resolve such a requirement. But nevermind, because.

I went space-hunting on my laptop yesterday and found an 800MB journal. Since I've removed systemd, I couldn't read it to see how much time it had covered, but let me just say, they didn't solve the problem. It was neither an efficient representation where the verbosity cost is ameliorated, nor a laptop-aware logging system.

When people are serious about re-inventing core Unix utilities, like ChromeOS or Android, they solve the log-rotation-on-laptops problem.

Pretty convoluted RPM packaging system which creates problems

The idea of RPM was to simplify the installation of complex packages. But it created a set of problems of its own, especially connected with libraries (which is not exactly a Red Hat problem; it is a Linux problem called "library hell"). One example is the so-called multilib problem detected by yum:

--> Finished Dependency Resolution

Error:  Multilib version problems found. This often means that the root
       cause is something else and multilib version checking is just
       pointing out that there is a problem. Eg.:

         1. You have an upgrade for libicu which is missing some
            dependency that another package requires. Yum is trying to
            solve this by installing an older version of libicu of the
            different architecture. If you exclude the bad architecture
            yum will tell you what the root cause is (which package
            requires what). You can try redoing the upgrade with
            --exclude libicu.otherarch ... this should give you an error
            message showing the root cause of the problem.

         2. You have multiple architectures of libicu installed, but
            yum can only see an upgrade for one of those arcitectures.
            If you don't want/need both architectures anymore then you
            can remove the one with the missing update and everything
            will work.

         3. You have duplicate versions of libicu installed already.
            You can use "yum check" to get yum show these errors.

       ...you can also use --setopt=protected_multilib=false to remove
       this checking, however this is almost never the correct thing to
       do as something else is very likely to go wrong (often causing
       much more problems).

       Protected multilib versions: libicu-4.2.1-14.el6.x86_64 != libicu-4.2.1-11.el6.i686

The idea of a precompiled package is great until it is not. And that's where we are now. Important packages such as the R language or the Infiniband drivers from Mellanox routinely block the ability to patch systems in RHEL 6.
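When yum reports multilib or duplicate problems, the yum-utils package gives a few starting points (standard tools, though they will not fix a broken third-party package for you):

    yum check                      # report dependency and duplicate problems
    package-cleanup --dupes        # list duplicate package versions
    package-cleanup --cleandupes   # remove the older duplicates (use with care)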

The total number of packages installed is just way too high, with many overlapping packages. Typically it is over one thousand, unless you use the base system or the HPC computational node distribution; in the latter case it is still over six hundred.  So there is no way you can understand the package structure of your system without special tools (see the sketch below). It has become a complex and not very transparent layer of indirection between you and the installed binaries.  And the level at which you know this subsystem is now an important indication of the qualification of a sysadmin, along with networking and the LAMP stack or its enterprise variations.  
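Some of the "special tools" in question are just the standard rpm/yum query options; for example (paths are from a stock RHEL 7 box):

    rpm -qa | wc -l                          # total number of installed packages
    rpm -qf /usr/sbin/ss                     # which package owns a given binary
    rpm -q --requires openssh-server         # what a package pulls in
    repoquery --whatrequires glibc | wc -l   # reverse dependencies (needs yum-utils)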

The number of daemons running in a default RHEL installation is also very high, and few sysadmins understand what all those daemons are doing and why they are running after startup (the commands below help to find out). In other words, we have already entered the Microsoft world: RHEL is the Microsoft Windows of the Linux world.  And with systemd pushed down the throat of enterprise customers, you will understand even less.
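A quick way to see what is actually running versus what is merely enabled (standard systemd commands):

    systemctl list-units --type=service --state=running
    systemctl list-unit-files --type=service --state=enabled
    systemd-analyze blame | head -20         # which units slow the boot down the most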

And while Red Hat support is expensive, the help from Red Hat support in such cases is marginal. It looks like they are afraid not only of customers, but of their own packages too. All those guys do is look into the database to see if a similar problem has been described. That works for some problems, but for the more difficult ones it usually does not. Using a free version of Linux such as CentOS is an escape, but with commercial applications you are facing trouble: the vendor can easily blame the OS for the problem you are having, and then you are left holding the bag.

No effort is made to consolidate those hundreds of overlapping packages (some barely supported or unsupported).  This "library hell" is a distinct feature of the modern enterprise Linux distribution.

 When /etc/resolv.conf is no longer a valid DNS configuration file

In RHEL 7 Network Manager is the default configuration tool for the entire networking stack, including DNS resolution. One interesting problem with Network Manager is that in the default installation it happily overwrites /etc/resolv.conf, putting an end to the Unix era during which you could assume that if you wrote something into a config file it would stay intact, and that any script that generates such a file would detect that it was changed manually and either re-parse the changes or produce a warning. 

In RHEL 6 most sysadmins simply uninstalled Network Manager on servers, and thus did not face its idiosyncrasies. BTW, Network Manager is a part of the GNOME project, which is pretty symbolic taking into account that the word "GNOME" has recently become a kind of curse among Linux users (GNOME 3.x has alienated both users and developers ;-).

In RHEL 6 and before, certain installation options excluded Network Manager from installation by default (Minimal server in RHEL 6 is one); for all others you could uninstall it after the installation. This is no longer recommended in RHEL 7, but you can still disable it. See How to disable NetworkManager on CentOS - RHEL 7 for details.  Still, as Network Manager is Red Hat's preferred intermediary for connecting to the network in RHEL 7 (and is now present even in the minimal installation), many sysadmins prefer to keep it. People who tried to remove it ran into various types of problems if their setup was more or less non-trivial (see for example How to uninstall NetworkManager in RHEL7), and packages like Docker expect it to be present as well.
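If you do decide to disable it on a server, a minimal sketch follows; the legacy "network" service from the initscripts package must then manage your ifcfg files:

    systemctl stop NetworkManager
    systemctl disable NetworkManager
    systemctl enable network     # legacy initscripts-based networking
    systemctl start network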

A solution not documented by Red Hat exists even with Network Manager running (CentOS 7 NetworkManager Keeps Overwriting -etc-resolv.conf ):

To prevent Network Manager from overwriting your resolv.conf changes, remove the DNS1, DNS2, ... lines from /etc/sysconfig/network-scripts/ifcfg-*.

and

...tell NetworkManager to not modify the DNS settings:

 /etc/NetworkManager/NetworkManager.conf
 [main]
 dns=none

After that you need to restart the NetworkManager service. Red Hat does not address this problem properly even in its own training courses:

Alex Wednesday, November 16, 2016 at 00:38 - Reply ↓

Ironically, Redhat’s own training manual does not address this problem properly.

I was taking a RHEL 7 Sysadmin course when I ran into this bug. I used nmcli thinking it would save me time in creating a static connection. Well, the connection was able to ping IPs immediately, but was not able to resolve any host addresses. I noticed that /etc/resolv.conf was being overwritten and cleared of it’s settings.

No matter what we tried, there was nothing the instructor and I could do to fix the issue. We finally used the “dns=none” solution posted here to fix the problem.

Actually the behaviour is more complex and trickier than I described, and hostnamectl produces the same effect of overwriting the file:

Brian Wednesday, March 7, 2018 at 01:08 -

Thank you Ken – yes – that fix finally worked for me on RHEL 7!

Set “dns=none” in the [man] section of /etc/NetworkManager/NetworkManager.conf
Tested: with “$ cat /etc/resolv.conf” both before and after “# service network restart” and got the same output!

Otherwise I could not find out how to reliably set the “search” domains list, as I did not see an option in the /etc/sysconfig/network-scripts/ifcfg-INT files.

Brian Wednesday, March 7, 2018 at 01:41 -

Brian again here… note that I also had “DNS1, DNS2” removed from /etc/sysconfig/nework-scripts/ifcfg-INT.

CAUTION: the “hostnamectl”[1] command will also reset /etc/resolv.conf rather bluntly… replacing the default “search” domain and deleting any “nameserver” entries. The file will also include the “# Generated by NetworkManager” header comment.

[1] e.g. #hostnamectl set-hostname newhost.domain –static; hostnamectl status
Then notice how that will overwrite /etc/resolv.conf as well
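
Putting the pieces from the comments above together, the whole workaround on RHEL 7 boils down to a few commands. A minimal, hedged sketch (eth0 and the ifcfg file name are placeholders for your actual interface):

  # 1. remove static DNS entries from the interface config, e.g.:
  sed -i '/^DNS[0-9]*=/d' /etc/sysconfig/network-scripts/ifcfg-eth0
  # 2. add dns=none to the [main] section of /etc/NetworkManager/NetworkManager.conf (as shown above)
  # 3. restart NetworkManager so it stops touching the file
  systemctl restart NetworkManager      # on RHEL 6: service NetworkManager restart
  # 4. now edit /etc/resolv.conf by hand and verify it survives a network restart
  systemctl restart network && cat /etc/resolv.conf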

But this is a troubling sign: now, in addition to knowing which configuration file to edit, you need to know which files you can edit and which you can't (or how to disable the default behaviour). This is a completely different environment, just one step away from the Microsoft Registry.

Moreover, even a minimal installation of RHEL7 has over a hundred *.conf files.
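
You can get a rough count on your own box with a one-liner such as the following (the number will obviously vary by installation):

  find /etc -name '*.conf' 2>/dev/null | wc -l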

Problems with architectural vision of Red Hat brass

Both the architectural level of thinking of the Red Hat brass (with daemons like avahi, systemd and Network Manager installed by default for all types of servers) and clear "not invented here" tendencies in virtualization create concerns. It is clear that Red Hat by itself can't become a major virtualization player like VMware; it simply does not have enough money for development and marketing. That's why they now try to become a major player in the "private cloud" space with Docker.

You would think that the safest bet is to reuse the leader among open source offerings, which is currently Xen. But the Red Hat brass thinks differently and wants to play a more dangerous poker game: it started promoting KVM, making it an obligatory part of the RHCSA exam. Red Hat actually released Enterprise Linux 5 with integrated Xen and then changed its mind around RHEL 5.5; in RHEL 6 Xen is no longer present even as an option. It was replaced by KVM.

What is good is that after ten years they eventually managed, to some extent, to re-implement Solaris 10 zones (without RBAC). In RHEL 7 they are more or less usable.

Security overkill with SELinux

RHEL contains a security layer called SELinux, but in most corporate deployments it is either disabled or operates in permissive mode. The reason is that it is notoriously difficult to configure correctly, and in many cases the game is not worth the candle. This is a classic problem created by overcomplexity: the functionality to make the OS more secure is there, but it is not used or configured properly, because very few sysadmins can operate at the required level of complexity.
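
For reference, checking the current mode and relaxing SELinux (the step many shops end up taking, whether we like it or not) looks roughly like this:

  getenforce                     # Enforcing, Permissive or Disabled
  setenforce 0                   # switch to permissive mode until the next reboot
  sed -i 's/^SELINUX=.*/SELINUX=permissive/' /etc/selinux/config   # persistent; use "disabled" to turn it off completely
  sestatus                       # verify the result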

The firewall, which is a much simpler concept, proved to be tremendously more usable in corporate deployments, especially when you have an obnoxious or incompetent security department (a pretty typical situation in a large corporation ;-). It prevents a lot of stupid questions from self-proclaimed "security gurus" about open ports, and can stop dead the scanning attempts of the vulnerability-testing tools with which security departments try to justify their existence. Those tools sometimes crash production servers.

Generally it is dangerous to allow the exploits used in such tools, which local script kiddies (aka the "security team") recklessly launch against your production servers without understanding their value (which is often zero) or their possible consequences (which are sometimes non-zero ;-), as if checking for a particular vulnerability with an internal script were an inferior solution.

Another interesting but seldom utilized option is AppArmor, which has been part of the mainline Linux kernel since 2.6.36. It is considered an alternative to SELinux and is IMHO a more elegant, more understandable solution to the same problem. But you might need to switch to SUSE in this case: the Red Hat Enterprise Linux kernel doesn't support AppArmor security modules ( Does Red Hat Enterprise Linux support AppArmor ).

To get an idea of the level of complexity of SELinux, try to read the RHEL7 Deployment Guide (the full set of documentation is available from www.redhat.com/docs/manuals/enterprise/). So it is not accidental that in many enterprise installations SELinux is disabled; some commercial software packages explicitly recommend disabling it in their installation manuals.

Unfortunately AppArmor, which is/was used in SLES (only by knowledgeable sysadmins ;-), never got traction and never achieved real popularity either (SLES now has SELinux as well, and as such suffers from overcomplexity even more than RHEL ;-). It is essentially the idea of introducing a "per-application" umask for all major directories, which can stop dead attempts to write to them and/or read information from certain sensitive files, even if the application has a vulnerability.

Escaping potential exploits by using Solaris on Intel

As I mentioned above, patching became a high-priority activity for large Red Hat enterprise customers. Guidelines now are strict and usually specify monthly, or at best quarterly, application of security patches. This amount of effort might be better applied elsewhere, with a much better return on investment.

Solaris has an interesting security subsystem called RBAC, which allows selectively granting parts of root privileges and can be considered a generalization of sudo. But Solaris is dying in the enterprise environment as Oracle limits it to its own (rather expensive) hardware. You can, however, run Solaris for x86 in a VM (Xen). IMHO, if you need a really high level of security for a particular server that does not have any fancy applications installed, this sometimes might be a preferable path despite being a "security via obscurity" solution. There is no value in using Linux in high-security applications, as Linux is the most hacked flavor of Unix, and this situation will not change in the foreseeable future.

This solution is especially attractive if you still have a knowledgeable Solaris sysadmin from "the old guard" on the floor. Security via obscurity actually works pretty well in this case; add RBAC capabilities and you have a winner. The question here is why take additional risks with "zero-day" Linux exploits if you can avoid them. See Potemkin Villages of Computer Security for a more detailed discussion.

I never understood IT managers who spend additional money on enhancing Linux security, especially via "security monitoring" solutions from third-party providers, which more often than not are a complete, but expensive, fake. That market was pioneered by ISS more than ten years ago.

Normal hardening scripts are OK, but spending money on some fancy and expensive system to enhance Linux security is a questionable option in my opinion, as Linux will always be the most attacked flavor of Unix, with the greatest number of zero-day exploits. That is especially true for foreign companies operating in the USA: you can be sure that specialists at the NSA are well ahead of any hackers in zero-day exploits for Linux (and if not Linux, then Cisco is good enough too ;-).

So instead of the Sisyphean task of enhancing Linux security by keeping up with the patching schedule (a typical large-enterprise mantra, as this is the most obvious path and the one best understood by corporate IT brass), it makes sense to switch to a different OS for critical servers, especially Internet-facing ones, or to use a security appliance to block most paths to a particular server group. For example, one typical blunder is allocating IP addresses for DRAC/ILO remote controls on the same segment as the main server interface; those "internal appliances" rarely have up-to-date firmware, sometimes still have default passwords, are in general much more hackable, and need additional protection by a separate firewall that limits access to selected sysadmin desktops with static IP addresses.

My choice would be Oracle Solaris, as this is an OS well architected by Sun Microsystems, with an excellent security record and additional, unique security mechanisms (up to the Trusted Solaris level). And a good thing is that Oracle (so far) did not spoil it with excessive updates the way Red Hat spoiled its distribution with RHEL 7. Your mileage may vary.

Also important in today's "semi-outsourced" IT environments are the competence and loyalty of the people responsible for selecting and implementing security solutions. Low loyalty of contractor-based technical personnel naturally leads to an increased probability of security incidents and/or of signing useless or harmful contracts: security represents the new Eldorado for snake-oil sellers. A good example was the tremendous market success of ISS intrusion detection appliances, which were as close to snake oil as one can get. Maybe that's why they were bought by IBM for a completely obscene amount of money: a list of fools has tremendous commercial value for a player as shrewd as IBM.

In such an environment "security via obscurity" is probably the optimal path to increasing the security of both the OS and typical applications. Yes, Oracle is more expensive, but there is no free lunch. I am actually disgusted with the proliferation of security products for Linux ;-)

The road to hell is paved with good intentions: the biosdevname package on Dell servers

The loss of architectural integrity of Unix is now very pronounced in RHEL, both RHEL6 and RHEL7, although 7 is definitely worse. And this is not only the systemd fiasco. For example, I recently spent a day troubleshooting an interesting and unusual problem: one out of 16 identical (both hardware- and software-wise) blades in a small HPC cluster, and only one, failed to enable its bonded interface on boot and thus remained offline. Most sysadmins would think that something is wrong with the hardware, for example the Ethernet card on the blade, the switch port, or even the internal enclosure interconnect. I also initially thought this way. But this was not the case ;-)

Tech support from Dell was not able to locate any hardware problem, although they diligently upgraded the CMC on the enclosure and the BIOS and firmware on the blade. BTW this blade had similar problems in the past, and Dell tech support once even replaced the Ethernet card in it, thinking it was the culprit. Now I know that this was a completely wrong decision on their part and a waste of both time and money :-). They came to this conclusion by swapping the blade to a different slot and seeing that the problem migrated with it. Bingo -- the card is the root cause. The problem is that it was not. What is especially funny is that replacing the card did solve the problem for a while. After reading the information provided below you will be as puzzled as me why it happened.

To make a long story short, the card was changed, but after yet another power outage the problem returned. This time I started to suspect that the card had nothing to do with the problem. After closer examination I discovered that in its infinite wisdom Red Hat introduced in RHEL 6 a package called biosdevname. The package was developed by Dell (a fact which seriously undermined my trust in Dell hardware ;-). It renames network interfaces to a new set of names, supposedly consistent with their etching on the case of rack servers. It is useless (or, more correctly, harmful) for blades: the package is primitive and does not understand whether the server is a rack server or a blade. Moreover, while doing this supposedly useful renaming, the package introduces into the 70-persistent-net.rules file a stealth rule:

KERNEL=="eth*", ACTION=="add", PROGRAM="/sbin/biosdevname -i %k", NAME="%c"

I did not look at the code, but from the observed behaviour it looks like in some cases in RHEL 6 (and most probably in RHEL 7 too) the package adds this "stealth" rule to the END (not the beginning, the end !!!) of the /etc/udev/rules.d/70-persistent-net.rules file, which means that if a similar rule already exists in 70-persistent-net.rules it is effectively overridden. Or something similar to this effect.

If you look at the Dell knowledge base, there are dozens of advisories related to this package (search just for biosdevname), which suggests that something is deeply wrong with its architecture.

What I observed is that on some blades (the key word is some, which converts the situation into an Alice-in-Wonderland environment) the rules for interfaces listed in the 70-persistent-net.rules file simply do not work if this package is enabled. For example, Dell Professional Services in their infinite wisdom renamed the interfaces back to eth0-eth3 for the Intel X710 4-port 10Gb Ethernet card that we have on some blades. On 15 out of 16 blades in the Dell enclosure this absolutely wrong solution works perfectly well. But on blade 16 it sometimes does not, and as a result the blade does not come up after a power outage or reboot. When this happens is unpredictable: sometimes it boots, sometimes it does not. And you can't understand what is happening, no matter how hard you try, because of the stealth nature of the changes introduced by the biosdevname package.

Two interfaces on this blade (as you now suspect, eth0 and eth1) were bonded. After around six hours of poking at the problem I discovered that despite the presence of the eth0-eth3 rules in the 70-persistent-net.rules file, RHEL 6.7 still renames all four interfaces on boot to the em* scheme, and naturally bonding fails, as the eth0 and eth1 interfaces do not exist.

First I decided to deinstall the biosdevname package and see what would happen. That did not work: the de-installation script in this RPM is incorrect, as it is not enough to remove the files, you also need to regenerate the initial ramdisk (dracut -f on RHEL; the tip I got mentioned update-initramfs -u, which is the Debian/Ubuntu equivalent -- hat tip to Oler). Searching for "Renaming em to eth" I found a post in which the author recommended disabling this "feature" by adding biosdevname=0 to the kernel parameters in /etc/grub.conf. That worked. So two days of my life were lost finding a way to disable this RHEL "enhancement", which is completely unnecessary on blades.
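
For the record, here is what the fix amounts to. On RHEL 6 you append biosdevname=0 to the kernel line in /etc/grub.conf by hand; on RHEL 7 the equivalent (commonly paired with net.ifnames=0 to get the old ethN names back) goes through the grub2 tools. A hedged sketch for a BIOS-booted box:

  # RHEL 6: append to the kernel line in /etc/grub.conf, e.g.
  #   kernel /vmlinuz-... ro root=... biosdevname=0
  # RHEL 7:
  sed -i 's/^GRUB_CMDLINE_LINUX="/&biosdevname=0 net.ifnames=0 /' /etc/default/grub
  grub2-mkconfig -o /boot/grub2/grub.cfg
  reboot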

Here is some information about this package

biosdevname
Copyright (c) 2006, 2007 Dell, Inc.  
Licensed under the GNU General Public License, Version 2.

biosdevname in its simplest form takes a kernel device name as an argument, and returns the BIOS-given name it "should" be.  This is
necessary on systems where the BIOS name for a given device (e.g. the label on the chassis is "Gb1") doesn't map directly 
and obviously to the kernel name (e.g. eth0).
The distro-patches/sles10/ directory contains a patch needed to integrate biosdevname into the SLES10 udev ethernet naming rules. This also works as a straight udev rule. On RHEL4, that looks like:
KERNEL=="eth*", ACTION=="add", PROGRAM="/sbin/biosdevname -i %k", NAME="%c"
This makes use of various BIOS-provided tables:
PCI Configuration Space
PCI IRQ Routing Table ($PIR)
PCMCIA Card Information Structure
SMBIOS 2.6 Type 9, Type 41, and HP OEM-specific types
therefore it's likely that this will only work well on architectures that provide such information in their BIOS.

To add insult to injury, this behaviour was demonstrated on only one of 16 absolutely identically configured Dell M630 blades with identical hardware and absolutely identical (cloned) OS instances, which makes RHEL a real "Alice in Wonderland" system. This is just one example; I have more similar stories to tell.

I would like to repeat that while the idea is not completely wrong and sometimes might even make sense, the package itself is very primitive, and the utility included in it does not understand that the target for installation is a blade (note to Dell: there are no etchings on blade network interfaces ;-).

If you look at this topic using your favorite search engine (which does not have to be Google anymore ;-) you will find dozens of posts in which people try to resolve this problem with various levels of competence and success. Such a tremendous waste of time and effort. Among the best that I have found were:

But this is not the only example of harmful packages being installed: they also install audio packages on servers that have no sound card ;-)

Current versions and year of end of support

As of October 2018 the supported versions of RHEL are 6.10 and 7.3-7.5. A large enterprise usually runs a mixture of versions, so knowing the end of support for each is a common problem. Compatibility within a single major version of RHEL is usually very good (I would say on par with Solaris), and the risk of upgrading from, say, 6.5 to 6.10 is minimal. There were some broken minor upgrades, like RHEL 6.6, but this is a rare event, and that particular one is in the past.

Problems arise with major version upgrades; usually a total reinstallation is the best bet. Formally you can upgrade from RHEL 6.10 to RHEL 7.5, but only for a very narrow class of servers with not much installed (or with much de-installed ;-). You need to remove Gnome via yum groupremove gnome-desktop, if it works (how to remove gnome-desktop using yum ), and several other packages.

Again, here your mileage may vary and reinstallation might be a better option (RedHat 6 to RedHat 7 upgrade):

1. Prerequisites

1. Make sure that you are running on the latest minor version (i.e. RHEL 6.8).

2. The upgrade process can handle only the following package groups and packages: Minimal (@minimal), Base (@base), Web Server (@web-server), DHCP Server, File Server (@nfs-server), and Print Server (@print-server).

3. Upgrading GNOME and KDE is not supported, so please uninstall the GUI desktop before the upgrade and install it after the upgrade.

4. Back up the entire system to avoid potential data loss.

5. If your system is registered with RHN Classic, you must first unregister from RHN Classic and register with subscription-manager.

6. Make sure /usr is not on a separate partition.

2. Assessment

Before upgrading we need to assess the machine first and check whether it is eligible for the upgrade; this can be done by a utility called "Preupgrade Assistant".

Preupgrade Assistant does the following:

2.1. Installing dependencies of Preupgrade Assistant:

Preupgrade Assistant needs some dependency packages (openscap, openscap-engine-sce, openscap-utils, pykickstart, mod_wsgi). All of these can be installed from the installation media, but the "openscap-engine-sce" package needs to be downloaded from the Red Hat portal.
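
For orientation only, here is roughly what the assessment step looks like once the Preupgrade Assistant packages are installed. The command names and paths are the ones from Red Hat's documented RHEL 6-to-7 procedure as I recall them, and the repository URL is a placeholder; verify against the current knowledge-base article before relying on it:

  preupg -l                             # list the available assessment content
  preupg                                # run the assessment; results land in /root/preupgrade/
  less /root/preupgrade/result.html     # review the findings before going any further
  # only if the report is clean does the actual upgrade make sense:
  redhat-upgrade-tool --network 7.0 --instrepo <RHEL7-repo-URL>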

See Red Hat Enterprise Linux - Wikipedia and DistroWatch.com Red Hat Enterprise Linux.

Tip:

In Linux there is no single convention for determining which flavor of Linux you are running. For Red Hat, to determine which version is installed on the server you can use the command

cat /etc/redhat-release

Oracle Linux adds its own release file while preserving the RHEL file, so a more appropriate command would be

cat /etc/*release
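
On RHEL 7 and later /etc/os-release is also available and easier to parse in scripts. Illustrative (abridged, hypothetical) output on an Oracle Linux box:

  # cat /etc/*release
  Oracle Linux Server release 7.6
  Red Hat Enterprise Linux Server release 7.6 (Maipo)
  # grep PRETTY_NAME /etc/os-release
  PRETTY_NAME="Oracle Linux Server 7.6"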

End of support issues

See Red Hat Enterprise Linux Life Cycle - Red Hat Customer Portal.

Avoiding useless daemons during installation

While the new Anaconda sucks, you can still improve the typical RHEL situation of having a lot of useless daemons installed by carefully selecting packages and then reusing the generated kickstart file. That can be done via the advanced menu for one box, and then this kickstart file can be used for all other boxes with minor modifications.

Kickstart still works, despite the trend toward overcomplexity in other parts of the distribution. They have not managed to screw it up yet ;-)
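
A minimal, hedged %packages sketch of the idea (group and package names are the usual RHEL ones; adjust to what your install tree actually provides, and think twice before dropping NetworkManager on RHEL 7, as discussed above):

  %packages
  @core
  # daemons and packages that rarely belong on a headless server
  -avahi
  -avahi-autoipd
  -alsa-*
  -iwl*-firmware
  %end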

What's new in RHEL7

RHEL 7 was released in June 2014. With it we see a hard push to systemd exclusivity: runlevels are gone. The release of RHEL 7 with systemd as the only option for system and process management has reignited the old debate about whether Red Hat is trying to establish a Microsoft-style monopoly over enterprise Linux and move Linux closer to Windows: a closed but user-friendly system.

Scripting languages version in RHEL7

RHEL7 uses Bash 4.2, Perl 5.16.3, PHP 5.4.16, Python 2.7.5. 
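
To verify what a given box actually has (PHP may not be installed everywhere):

  bash --version | head -1
  perl -e 'print "$^V\n"'      # e.g. v5.16.3
  python -V
  php -v | head -1             # only if PHP is installed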

XFS filesystem

The high-capacity, 64-bit XFS file system, which was already available in RHEL6, is now the default file system. It originated in the Silicon Graphics IRIX operating system. In RHEL 7 it supports file systems of up to 500 TB, whereas the previous default, ext4, was limited to 16 TB.

For more package version information see DistroWatch.com Red Hat Enterprise Linux.

Systemd

For server sysadmins systemd is a massive, fundamental change to core Linux administration for no perceivable gain. So while there is a high level of support for systemd among Linux users who run Linux on their laptops, and maybe on a home server, there is a strong backlash against it from system administrators who are responsible for a significant number of Linux servers in enterprise environments.

After all, runlevels were used in production environments, if only to run the system with or without X11. Please read an interesting essay on systemd (ProSystemdAntiSystemd); extensive excerpts follow:

Often initiated by opponents, they will lament on the horrors of PulseAudio and point out their scorn for Lennart Poettering. This later became a common canard for proponents to dismiss criticism as Lennart-bashing. Futile to even discuss, but it’s a staple.

Lennart’s character is actually, at times, relevant.. Trying to have a large discussion about systemd without ever invoking him is like discussing glibc in detail without ever mentioning Ulrich Drepper. Most people take it overboard, however.

A lot of systemd opponents will express their opinions regarding a supposed takeover of the Linux ecosystem by systemd, as its auxiliaries (all requiring governance by the systemd init) expose APIs, which are then used by various software in the desktop stack, creating dependency chains between it and systemd that the opponents deemed unwarranted. They will also point out the udev debacle and occasionally quote Lennart. Opponents see this as anti-competitive behavior and liken it to “embrace, extend, extinguish”. They often exaggerate and go all out with their vitriol though, as they start to contemplate shadowy backroom conspiracies at Red Hat (admittedly it is pretty fun to pretend that anyone defending a given piece of software is actually a shill who secretly works for it, but I digress), leaving many of their concerns to be ignored and deem ridiculous altogether.

... ... ...

In addition, the Linux community is known for reinventing the square wheel over and over again. Chaos is both Linux’s greatest strength and its greatest weakness. Remember HAL? Distro adoption is not an indicator of something being good, so much as something having sufficient mindshare.

... ... ...

The observation that sysinit is dumb and heavily flawed with its clunky inittab and runlevel abstractions, is absolutely nothing new. Richard Gooch wrote a paper back in 2002 entitled “Linux Boot Scripts”, which criticized both the SysV and BSD approaches, based on his earlier work on simpleinit(8). That said, his solution is still firmly rooted in the SysV and BSD philosophies, but he makes it more elegant by supplying primitives for modularity and expressing dependencies.

Even before that, DJB wrote the famous daemontools suite which has had many successors influenced by its approach, including s6, perp, runit and daemontools-encore. The former two are completely independent implementations, but based on similar principles, though with significant improvements. An article dated to 2007 entitled “Init Scripts Considered Harmful” encourages this approach and criticizes initscripts.

Around 2002, Richard Lightman wrote depinit(8), which introduced parallel service start, a dependency system, named service groups rather than runlevels (similar to systemd targets), its own unmount logic on shutdown, arbitrary pipelines between daemons for logging purposes, and more. It failed to gain traction and is now a historical relic.

Other systems like initng and eINIT came afterward, which were based on highly modular plugin-based architectures, implementing large parts of their logic as plugins, for a wide variety of actions that software like systemd implements as an inseparable part of its core. Initmacs, anyone?

Even Fefe, anti-bloat activist extraordinaire, wrote his own system called minit early on, which could handle dependencies and autorestart. As is typical of Fefe’s software, it is painful to read and makes you want to contemplate seppuku with a pizza cutter.

And that’s just Linux. Partial list, obviously.

At the end of the day, all comparing to sysvinit does is show that you’ve been living under a rock for years. What’s more, it is no secret to a lot of people that the way distros have been writing initscripts has been totally anathema to basic software development practices, like modularizing and reusing common functions, for years. Among other concerns such as inadequate use of already leaky abstractions like start-stop-daemon(8). Though sysvinit does encourage poor work like this to an extent, it’s distro maintainers who do share a deal of the blame for the mess. See the BSDs for a sane example of writing initscripts. OpenRC was directly inspired by the BSDs’ example. Hint: it’s in the name - “RC”.

The rather huge scope and opinionated nature of systemd leads to people yearning for the days of sysvinit. A lot of this is ignorance about good design principles, but a good part may also be motivated from an inability to properly convey desires of simple and transparent systems. In this way, proponents and opponents get caught in feedback loops of incessantly going nowhere with flame wars over one initd implementation (that happened to be dominant), completely ignoring all the previous research on improving init, as it all gets left to bite the dust. Even further, most people fail to differentiate init from rc scripts, and sort of hold sysvinit to be equivalent to the shoddy initscripts that distros have written, and all the hacks they bolted on top like LSB headers and startpar(2). This is a huge misunderstanding that leads to a lot of wasted energy.

Don’t talk about sysvinit. Talk about systemd on its own merits and the advantages or disadvantages of how it solves problems, potentially contrasting them to other init systems. But don’t immediately go “SysV initscripts were way better and more configurable, I don’t see what systemd helps solve beyond faster boot times.”, or from the other side “systemd is way better than sysvinit, look at how clean unit files are compared to this horribly written initscript I cherrypicked! Why wouldn’t you switch?”

... ... ...

Now that we pointed out how most systemd debates play out in practice and why it’s usually a colossal waste of time to partake in them, let’s do a crude overview of the personalities that make this clusterfuck possible.

The technically competent sides tend to largely fall in these two broad categories:

a) Proponents are usually part of the modern Desktop Linux bandwagon. They run contemporary mainstream distributions with the latest software, use and contribute to large desktop environment initiatives and related standards like the *kits. They’re not necessarily purely focused on the Linux desktop. They’ll often work on features ostensibly meant for enterprise server management, cloud computing, embedded systems and other needs, but the rhetoric of needing a better desktop and following the example set by Windows and OS X is largely pervasive amongst their ranks. They will decry what they perceive as “integration failures”, “fragmentation” and are generally hostile towards research projects and anything they see as “toy projects”. They are hackers, but their mindset is largely geared towards reducing interface complexity, instead of implementation complexity, and will frequently argue against the alleged pitfalls of too much configurability, while seeing computers as appliances instead of tools.

b) Opponents are a bit more varied in their backgrounds, but they typically hail from more niche distributions like Slackware, Gentoo, CRUX and others. They are largely uninterested in many of the Desktop Linux “advancements”, value configuration, minimalism and care about malleability more than user friendliness. They’re often familiar with many other Unix-like environments besides Linux, though they retain a fondness for the latter. They have their own pet projects and are likely to use, contribute to or at least follow a lot of small projects in the low-level system plumbing area. They can likely name at least a dozen alternatives to the GNU coreutils (I can name about 7, I think), generally favor traditional Unix principles and see computers as tools. These are the people more likely to be sympathetic to things like the suckless philosophy.

It should really come as no surprise that the former group dominates. They’re the ones that largely shape the end user experience. The latter are pretty apathetic or even critical of it, in contrast. Additionally, the former group simply has far more manpower in the right places. Red Hat’s employees alone dominate much of the Linux kernel, the GNU base system, GNOME, NetworkManager, many projects affiliated with Freedesktop.org standards (including Polkit) and more. There’s no way to compete with a vast group of paid individuals like those.

Conclusion

The “Year of the Linux Desktop” has become a meme at this point, one that is used most often sarcastically. Yet there are still a lot of people who deeply hold onto it and think that if only Linux had a good abstraction engine for package manager backends, those Windows users will be running Fedora in no time.

What we’re seeing is undoubtedly a cultural clash by two polar opposites that coexist in the Linux community. We can see it in action through the vitriol against Red Hat developers, and conversely the derision against Gentoo users on part of Lennart Poettering, Greg K-H and others. Though it appears in this case “Gentoo user” is meant as a metonym for Linux users whose needs fall outside the mainstream application set. Theo de Raadt infamously quipped that Linux is “for people who hate Microsoft”, but that quote is starting to appear outdated.

Many of the more technically competent people with views critical of systemd have been rather quiet in public, for some reason. Likely it’s a realization that the Linux desktop’s direction is inevitable, and thus trying to criticize it is a futile endeavor. There are people who still think GNOME abandoning Sawfish was a mistake, so yes.

The non-desktop people still have their own turf, but they feel threatened by systemd to one degree or another. Still, I personally do not see them dwindling down. What I believe will happen is that they will become even more segregated than they already are from mainstream Linux and that using their software will feel more otherworldly as time goes on.

There are many who are predicting a huge renaissance for BSD in the aftermath of systemd, but I’m skeptical of this. No doubt there will be increased interest, but as a whole it seems most of the anti-systemd crowd is still deeply invested in sticking to Linux.

Ultimately, the cruel irony is that in systemd’s attempt to supposedly unify the distributions, it has created a huge rift unlike any other and is exacerbating the long-present hostilities between desktop Linux and minimalist Linux sides at rates that are absolutely atypical. What will end up of systemd remains unknown. Given Linux’s tendency for chaos, it might end up the new HAL, though with a significantly more painful aftermath, or it might continue on its merry way and become a Linux standard set in stone, in which case the Linux community will see a sharp ideological divide. Or perhaps it won’t. Perhaps things will go on as usual, on an endless spiral of reinvention without climax. Perhaps we will be doomed to flame on systemd for all eternity. Perhaps we’ll eventually get sick of it and just part our own ways into different corners.

Either way, I’ve become less and less fond of politics for uselessd and see systemd debates as being metaphorically like car crashes. I likely won’t help but chime in at times, though I intend uselessd to branch off into its own direction with time.

SYSTEMD

A very controversial subsystem, systemd, is now the init system. systemd is a suite of system management daemons, libraries, and utilities designed for Linux and programmed exclusively for the Linux API. There are no more runlevels. For servers systemd makes little sense. Sysadmins now need to learn the new systemctl commands for starting and stopping services. The 'service' command is still included for backwards compatibility, but it may go away in future releases. See CentOS 7 - RHEL 7 systemd commands (Linux Brigade).
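
A short, hedged cheat sheet of the most common translations (sshd is just an example service):

  service sshd status     ->  systemctl status sshd.service
  service sshd restart    ->  systemctl restart sshd.service
  chkconfig sshd on       ->  systemctl enable sshd.service
  chkconfig --list        ->  systemctl list-unit-files --type=service
  runlevel                ->  systemctl get-default
  telinit 3               ->  systemctl isolate multi-user.target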

From Wikipedia (systemd)

In a 2012 interview, Slackware's founder Patrick Volkerding expressed the following reservations about the systemd architecture, which are fully applicable to the server environment:

Concerning systemd, I do like the idea of a faster boot time (obviously), but I also like controlling the startup of the system with shell scripts that are readable, and I'm guessing that's what most Slackware users prefer too. I don't spend all day rebooting my machine, and having looked at systemd config files it seems to me a very foreign way of controlling a system to me, and attempting to control services, sockets, devices, mounts, etc., all within one daemon flies in the face of the UNIX concept of doing one thing and doing it well.

In an August 2014 article published in InfoWorld, Paul Venezia wrote about the systemd controversy, and attributed the controversy to violation of the Unix philosophy, and to "enormous egos who firmly believe they can do no wrong."[42] The article also characterizes the architecture of systemd as more similar to that of Microsoft Windows software:[42]

While systemd has succeeded in its original goals, it's not stopping there. systemd is becoming the Svchost of Linux – which I don't think most Linux folks want. You see, systemd is growing, like wildfire, well outside the bounds of enhancing the Linux boot experience. systemd wants to control most, if not all, of the fundamental functional aspects of a Linux system – from authentication to mounting shares to network configuration to syslog to cron.
 

LINUX CONTAINERS

About ten years after Solaris 10, Linux at last got them.

Linux containers have emerged as a key open source application packaging and delivery technology, combining lightweight application isolation with the flexibility of image-based deployment methods. Developers have rapidly embraced Linux containers because they simplify and accelerate application deployment, and many Platform-as-a-Service (PaaS) platforms are built around Linux container technology, including OpenShift by Red Hat.

Red Hat Enterprise Linux 7 implements Linux containers using core technologies such as control groups (cGroups) for resource management, namespaces for process isolation, and SELinux for security, enabling secure multi-tenancy and reducing the potential for security exploits. The Red Hat container certification ensures that application containers built using Red Hat Enterprise Linux will operate seamlessly across certified container hosts.

NUMA AFFINITY

With more and more systems, even at the low end, presenting non-uniform memory access (NUMA) topologies, Red Hat Enterprise Linux 7 addresses the performance irregularities that such systems present. A new, kernel-based NUMA affinity mechanism automates memory and scheduler optimization. It attempts to match processes that consume significant resources with available memory and CPU resources in order to reduce cross-node traffic. The resulting improved NUMA resource alignment improves performance for applications and virtual machines, especially when running memory-intensive workloads.
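
To see what a given box looks like from the NUMA point of view, the usual tools still apply (a hedged example; both utilities ship with RHEL 7):

  numactl --hardware                    # NUMA topology of the machine
  numastat -c                           # per-node allocation statistics, compact form
  cat /proc/sys/kernel/numa_balancing   # 1 = automatic NUMA balancing is enabled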

HARDWARE EVENT REPORTING MECHANISM

Red Hat Enterprise Linux 7 unifies hardware event reporting into a single reporting mechanism. Instead of various tools collecting errors from different sources with different timestamps, a new hardware event reporting mechanism (HERM) will make it easier to correlate events and get an accurate picture of system behavior. HERM reports events in a single location and in a sequential timeline. HERM uses a new userspace daemon, rasdaemon, to catch and log all RAS events coming from the kernel tracing infrastructure.

VIRTUALIZATION GUEST INTEGRATION WITH VMWARE

Red Hat Enterprise Linux 7 advances the level of integration and usability between the Red Hat Enterprise Linux guest and VMware vSphere. Integration now includes: • Open VM Tools — bundled open source virtualization utilities. • 3D graphics drivers for hardware-accelerated OpenGL and X11 rendering. • Fast communication mechanisms between VMware ESX and the virtual machine.

PARTITIONING DEFAULTS FOR ROLLBACK

The ability to revert to a known, good system configuration is crucial in a production environment. Using LVM snapshots with ext4 and XFS (or the integrated snapshotting feature in Btrfs described in the “Snapper” section) an administrator can capture the state of a system and preserve it for future use. An example use case would involve an in-place upgrade that does not present a desired outcome and an administrator who wants to restore the original configuration.

CREATING INSTALLATION MEDIA

Red Hat Enterprise Linux 7 introduces Live Media Creator for creating customized installation media from a kickstart file for a range of deployment use cases. Media can then be used to deploy standardized images whether on standardized corporate desktops, standardized servers, virtual machines, or hyperscale deployments. Live Media Creator, especially when used with templates, provides a way to control and manage configurations across the enterprise.

SERVER PROFILE TEMPLATES

Red Hat Enterprise Linux 7 features the ability to use installation templates to create servers for common workloads. These templates can simplify and speed creating and deploying Red Hat Enterprise Linux servers, even for those with little or no experience with Linux.

What's new in RHEL8

 Python 3.6 is now the default version of Python (see below). Gnome now uses Wayland as its default display server instead of the X.org server, which was used in the previous major version of RHEL. If you upgrade to RHEL 8 from a RHEL 7 system where you used the X.org GNOME session, your system continues to use X.org.

In Red Hat Enterprise Linux 8, the GCC toolchain is based on GCC 8.2.

nftables replaces iptables as the default network packet filtering framework. firewalld uses nftables by default.
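
In practice the firewalld front-end commands are unchanged; only the backend differs, and the generated ruleset can be inspected with nft. A hedged example:

  firewall-cmd --permanent --add-service=http
  firewall-cmd --reload
  nft list ruleset | head               # shows the nftables rules firewalld generated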

Red Hat Enterprise Linux 7 supported two implementations of the NTP protocol: ntp and chrony. In Red Hat Enterprise Linux 8, only chrony is available.

In ssh the Digital Signature Algorithm (DSA) is considered deprecated in Red Hat Enterprise Linux 8. Authentication mechanisms that depend on DSA keys do not work in the default configuration.

Network scripts are deprecated in Red Hat Enterprise Linux 8 and they are no longer provided by default. The basic installation provides a new version of the ifup and ifdown scripts which call the NetworkManager service through the nmcli tool. In Red Hat Enterprise Linux 8, to run the ifup and the ifdown scripts, NetworkManager must be running.

If any of these scripts are required, the installation of the deprecated network scripts in the system is still possible with the following command:

~]# yum install network-scripts
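
For comparison, the NetworkManager-native way of doing what ifup/ifdown and the ifcfg DNS lines used to do (the connection name eth0 and the DNS address are placeholders):

  nmcli connection show
  nmcli connection up eth0              # roughly equivalent to 'ifup eth0'
  nmcli connection down eth0            # roughly equivalent to 'ifdown eth0'
  nmcli connection modify eth0 ipv4.dns 192.168.1.10 ipv4.ignore-auto-dns yes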

With Red Hat Enterprise Linux 8, all packages related to KDE Plasma Workspaces (KDE) have been removed, and it is no longer possible to use KDE as an alternative to the default GNOME desktop environment.

The Btrfs file system has been removed in Red Hat Enterprise Linux 8.

virt-manager has been deprecated

  • The support for rootless containers is available as a technology preview in RHEL 8 Beta.

    Rootless containers are containers that are created and managed by regular system users without administrative permissions.

Chapter 4. New features - Red Hat Customer Portal

YUM performance improvement and support for modular content

On Red Hat Enterprise Linux 8, installing software is ensured by the new version of the YUM tool, which is based on the DNF technology.

YUM based on DNF has the following advantages over the previous YUM v3 used on RHEL 7:

  • Increased performance
  • Support for modular content
  • Well-designed stable API for integration with tooling

For detailed information about differences between the new YUM tool and the previous version YUM v3 from RHEL 7, see http://dnf.readthedocs.io/en/latest/cli_vs_yum.html.

YUM based on DNF is compatible with YUM v3 when used from the command line, or when editing or creating configuration files.

For installing software, you can use the yum command and its particular options in the same way as on RHEL 7. Packages can be installed under the previous names using Provides. Packages also provide compatibility symlinks, so the binaries, configuration files and directories can be found in usual locations.
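
In day-to-day use the difference is mostly invisible. A quick, hedged illustration (httpd and python36 are just example package and module names):

  yum install httpd          # same syntax as on RHEL 7; yum is now a front end to DNF
  dnf install httpd          # identical result
  yum module list            # the new modular (AppStream) content
  yum module install python36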

4.6. Shells and command-line tools

The nobody user replaces nfsnobody

In Red Hat Enterprise Linux 7, there was:

  • the nobody user and group pair with the ID of 99, and
  • the nfsnobody user and group pair with the ID of 65534, which is the default kernel overflow ID, too.

Both of these have been merged into the nobody user and group pair, which uses the 65534 ID in Red Hat Enterprise Linux 8. New installations no longer create the nfsnobody pair.

This change reduces the confusion about files that are owned by nobody but have nothing to do with NFS.

(BZ#1591969)

4.7. Web servers, databases, dynamic languages

Database servers in RHEL 8

RHEL 8 provides the following database servers:

  • MySQL 8.0, a multi-user, multi-threaded SQL database server. It consists of the MySQL server daemon, mysqld, and many client programs.
  • MariaDB 10.3, a multi-user, multi-threaded SQL database server. For all practical purposes, MariaDB is binary-compatible with MySQL.
  • PostgreSQL 10 and PostgreSQL 9.6, an advanced object-relational database management system (DBMS).
  • Redis 4.0, an advanced key-value store. It is often referred to as a data structure server because keys can contain strings, hashes, lists, sets, and sorted sets. Redis is provided for the first time in RHEL.

Note that the NoSQL MongoDB database server is not included in RHEL 8.0 Beta because it uses the Server Side Public License (SSPL).

(BZ#1647908)

Notable changes in MySQL 8.0

RHEL 8 is distributed with MySQL 8.0, which provides, for example, the following enhancements:

  • MySQL now incorporates a transactional data dictionary, which stores information about database objects.
  • MySQL now supports roles, which are collections of privileges.
  • The default character set has been changed from latin1 to utf8mb4.
  • Support for common table expressions, both nonrecursive and recursive, has been added.
  • MySQL now supports window functions, which perform a calculation for each row from a query, using related rows.
  • InnoDB now supports the NOWAIT and SKIP LOCKED options with locking read statements.
  • GIS-related functions have been improved.
  • JSON functionality has been enhanced.
  • The new mariadb-connector-c packages provide a common client library for MySQL and MariaDB. This library is usable with any version of the MySQL and MariaDB database servers. As a result, the user is able to connect one build of an application to any of the MySQL and MariaDB servers distributed with RHEL 8.

In addition, the MySQL 8.0 server distributed with RHEL 8 is configured to use mysql_native_password as the default authentication plug-in because client tools and libraries in RHEL 8 are incompatible with the caching_sha2_password method, which is used by default in the upstream MySQL 8.0 version.

To change the default authentication plug-in to caching_sha2_password, edit the /etc/my.cnf.d/mysql-default-authentication-plugin.cnf file as follows:

[mysqld]
default_authentication_plugin=caching_sha2_password

(BZ#1649891, BZ#1519450, BZ#1631400)

Notable changes in MariaDB 10.3

MariaDB 10.3 provides numerous new features over the version 5.5 distributed in RHEL 7. Some of the most notable changes are:

  • MariaDB Galera Cluster, a synchronous multi-master cluster, is now a standard part of MariaDB.
  • InnoDB is used as the default storage engine instead of XtraDB.
  • Common table expressions
  • System-versioned tables
  • FOR loops
  • Invisible columns
  • Sequences
  • Instant ADD COLUMN for InnoDB
  • Storage-engine independent column compression
  • Parallel replication
  • Multi-source replication

In addition, the new mariadb-connector-c packages provide a common client library for MySQL and MariaDB. This library is usable with any version of the MySQL and MariaDB database servers. As a result, the user is able to connect one build of an application to any of the MySQL and MariaDB servers distributed with RHEL 8.

See also Using MariaDB on Red Hat Enterprise Linux 8.

(BZ#1637034, BZ#1519450)

Python scripts must specify major version in hashbangs at RPM build time

In RHEL 8, executable Python scripts are expected to use hashbangs (shebangs) specifying explicitly at least the major Python version.

The /usr/lib/rpm/redhat/brp-mangle-shebangs buildroot policy (BRP) script is run automatically when building any RPM package, and attempts to correct hashbangs in all executable files. The BRP script will generate errors when encountering a Python script with an ambiguous hashbang, such as:

  • #! /usr/bin/python
  • #! /usr/bin/env python

To modify hashbangs in the Python scripts causing these build errors at RPM build time, use the pathfix.py script from the platform-python-devel package:

pathfix.py -pn -i %{__python3} PATH ...

Multiple PATHs can be specified. If a PATH is a directory, pathfix.py recursively scans for any Python scripts matching the pattern ^[a-zA-Z0-9_]+\.py$, not only those with an ambiguous hashbang. Add this command to the %prep section or at the end of the %install section.

For more information, see Handling hashbangs in Python scripts.

(BZ#1583620)

Python 3 is the default Python implementation in RHEL 8

Red Hat Enterprise Linux 8 is distributed with Python 3.6. The package is not installed by default. To install Python 3.6, use the yum install python3 command.

Python 2.7 is available in the python2 package. However, Python 2 will have a shorter life cycle and its aim is to facilitate smoother transition to Python 3 for customers.

Neither the default python package nor the unversioned /usr/bin/python executable is distributed with RHEL 8. Customers are advised to use python3 or python2 directly. Alternatively, administrators can configure the unversioned python command using the alternatives command.
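
For example, to point the unversioned command at Python 3 (this uses the stock RHEL 8 paths):

  yum install python3
  alternatives --set python /usr/bin/python3
  python --version                             # now reports Python 3.6.x
  # alternatives --set python /usr/bin/python2 # or point it at Python 2 instead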

For details, see Using Python in Red Hat Enterprise Linux 8.

(BZ#1580387)

Notable changes in Ruby

RHEL 8 provides Ruby 2.5, which introduces numerous new features and enhancements over Ruby 2.0.0 available in RHEL 7. Notable changes include:

  • Incremental garbage collector has been added.
  • The Refinements syntax has been added.
  • Symbols are now garbage collected.
  • The $SAFE=2 and $SAFE=3 safe levels are now obsolete.
  • The Fixnum and Bignum classes have been unified into the Integer class.
  • Performance has been improved by optimizing the Hash class, improved access to instance variables, and the Mutex class being smaller and faster.
  • Certain old APIs have been deprecated.
  • Bundled libraries, such as RubyGems, Rake, RDoc, Psych, Minitest, and test-unit, have been updated.
  • Other libraries, such as mathn, DL, ext/tk, and XMLRPC, which were previously distributed with Ruby, are deprecated or no longer included.
  • The SemVer versioning scheme is now used for Ruby versioning.

(BZ#1648843)

Notable changes in PHP

Red Hat Enterprise Linux 8 is distributed with PHP 7.2. This version introduces the following major changes over PHP 5.4, which is available in RHEL 7:

  • PHP uses FastCGI Process Manager (FPM) by default (safe for use with a threaded httpd).
  • The php_value and php_flag variables should no longer be used in the httpd configuration files; they should be set in pool configuration instead: /etc/php-fpm.d/*.conf
  • PHP script errors and warning are logged to the /var/log/php-fpm/www-error.log file instead of /var/log/httpd/error.log
  • When changing the PHP max_execution_time configuration variable, the httpd ProxyTimeout setting should be increased to match
  • The user running PHP scripts is now configured in the FPM pool configuration (the /etc/php-fpm.d/www.conf file; the apache user is the default)
  • The php-fpm service needs to be restarted after a configuration change or after a new extension is installed

The following extensions have been removed:

  • aspell
  • mysql (note that the mysqli and pdo_mysql extensions are still available, provided by php-mysqlnd package)
  • zip
  • memcache

(BZ#1580430)

Notable changes in Perl

Perl 5.26, distributed with RHEL 8, introduces the following changes over the version available in RHEL 7:

  • Unicode 9.0 is now supported.
  • New op-entry, loading-file, and loaded-file SystemTap probes are provided.
  • Copy-on-write mechanism is used when assigning scalars for improved performance.
  • The IO::Socket::IP module for handling IPv4 and IPv6 sockets transparently has been added.
  • The Config::Perl::V module to access perl -V data in a structured way has been added.
  • A new perl-App-cpanminus package has been added, which contains the cpanm utility for getting, extracting, building, and installing modules from the Comprehensive Perl Archive Network (CPAN) repository.
  • The current directory . has been removed from the @INC module search path for security reasons.
  • The do statement now returns a deprecation warning when it fails to load a file because of the behavioral change described above.
  • The do subroutine(LIST) call is no longer supported and results in a syntax error.
  • Hashes are randomized by default now. The order in which keys and values are returned from a hash changes on each perl run. To disable the randomization, set the PERL_PERTURB_KEYS environment variable to 0.
  • Unescaped literal { characters in regular expression patterns are no longer permissible.
  • Lexical scope support for the $_ variable has been removed.
  • Using the defined operator on an array or a hash results in a fatal error.
  • Importing functions from the UNIVERSAL module results in a fatal error.
  • The find2perl, s2p, a2p, c2ph, and pstruct tools have been removed.
  • The ${^ENCODING} facility has been removed. The encoding pragma's default mode is no longer supported. To write source code in other encoding than UTF-8, use the encoding's Filter option.
  • The perl packaging is now aligned with upstream. The perl package installs also core modules, while the /usr/bin/perl interpreter is provided by the perl-interpreter package. In previous releases, the perl package included just a minimal interpreter, whereas the perl-core package included both the interpreter and the core modules.

(BZ#1511131)

Notable changes in Apache httpd

RHEL 8 is distributed with the Apache HTTP Server 2.4.35. This version introduces the following changes over httpd available in RHEL 7:

  • HTTP/2 support is now provided by the mod_http2 package, which is a part of the httpd module.
  • Automated TLS certificate provisioning and renewal using the Automatic Certificate Management Environment (ACME) protocol is now supported with the mod_md package (for use with certificate providers such as Let's Encrypt)
  • The Apache HTTP Server now supports loading TLS certificates and private keys from hardware security tokens directly from PKCS#11 modules. As a result, a mod_ssl configuration can now use PKCS#11 URLs to identify the TLS private key, and, optionally, the TLS certificate in the SSLCertificateKeyFile and SSLCertificateFile directives.
  • The multi-processing module (MPM) configured by default with the Apache HTTP Server has changed from a multi-process, forked model (known as prefork) to a high-performance multi-threaded model, event. Any third-party modules that are not thread-safe need to be replaced or removed. To change the configured MPM, edit the /etc/httpd/conf.modules.d/00-mpm.conf file (a sketch follows this list). See the httpd.service(8) man page for more information.
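
The switch itself is a one-line change in that file: exactly one LoadModule line may be active at a time. A hedged sketch of going back to prefork for the sake of a non-thread-safe module (module paths are the stock ones):

  # /etc/httpd/conf.modules.d/00-mpm.conf
  #LoadModule mpm_event_module modules/mod_mpm_event.so
  LoadModule mpm_prefork_module modules/mod_mpm_prefork.so
  # then: apachectl configtest && systemctl restart httpd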

For more information about httpd, see Setting up the Apache HTTP web server.

(BZ#1632754, BZ#1527084, BZ#1581178)

The nginx web server new in RHEL 8

RHEL 8 introduces nginx 1.14, a web and proxy server supporting HTTP and other protocols, with a focus on high concurrency, performance, and low memory usage. nginx was previously available only as a Software Collection.

The nginx web server now supports loading TLS certificates and private keys from hardware security tokens directly from PKCS#11 modules. As a result, an nginx configuration can use PKCS#11 URLs to identify the TLS private key in the ssl_certificate_key directive.

OpenSSH rebased to version 7.8p1

The openssh packages have been upgraded to upstream version 7.8p1. Notable changes include:

  • Removed support for the SSH version 1 protocol.
  • Removed support for the hmac-ripemd160 message authentication code.
  • Removed support for RC4 (arcfour) ciphers.
  • Removed support for Blowfish ciphers.
  • Removed support for CAST ciphers.
  • Changed the default value of the UseDNS option to no.
  • Disabled DSA public key algorithms by default.
  • Changed the minimal modulus size for Diffie-Hellman parameters to 2048 bits.
  • Changed semantics of the ExposeAuthInfo configuration option.
  • The UsePrivilegeSeparation=sandbox option is now mandatory and cannot be disabled.
  • Set the minimal accepted RSA key size to 1024 bits.

(BZ#1622511)

RSA-PSS is now supported in OpenSC

This update adds support for the RSA-PSS cryptographic signature scheme to the OpenSC smart card driver. The new scheme enables a secure cryptographic algorithm required for the TLS 1.3 support in the client software.

(BZ#1595626)

rsyslog rebased to version 8.37.0

The rsyslog packages have been upgraded to upstream version 8.37.0, which provides many bug fixes and enhancements over the previous versions. Most notable changes include:

  • Enhanced processing of rsyslog internal messages; possibility of rate-limiting them; fixed possible deadlock.
  • Enhanced rate-limiting in general; the actual spam source is now logged.
  • Improved handling of oversized messages - the user can now set how to treat them both in the core and in certain modules with separate actions.
  • mmnormalize rule bases can now be embedded in the config file instead of creating separate files for them.
  • The user can now set the GnuTLS priority string for imtcp that allows fine-grained control over encryption.
  • All config variables, including variables in JSON, are now case-insensitive.
  • Various improvements of PostgreSQL output.
  • Added a possibility to use shell variables to control config processing, such as conditional loading of additional configuration files, executing statements, or including a text in config. Note that an excessive use of this feature can make it very hard to debug problems with rsyslog.
  • 4-digit file creation modes can be now specified in config.
  • Reliable Event Logging Protocol (RELP) input can now bind also only on a specified address.
  • The default value of the enable.body option of mail output is now aligned with the documentation.
  • The user can now specify insertion error codes that should be ignored in MongoDB output.
  • Parallel TCP (pTCP) input now has a configurable backlog for better load balancing.
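
To illustrate the GnuTLS priority string item above, a TLS-enabled imtcp listener could pin its protocol versions roughly as sketched below. The certificate paths are placeholders, and the parameter spelling should be checked against the rsyslog 8.37.0 documentation before use:

global(
    defaultNetstreamDriver="gtls"
    defaultNetstreamDriverCAFile="/etc/rsyslog.d/certs/ca.pem"
    defaultNetstreamDriverCertFile="/etc/rsyslog.d/certs/server-cert.pem"
    defaultNetstreamDriverKeyFile="/etc/rsyslog.d/certs/server-key.pem"
)
# gnutlsPriorityString is the new knob: here it disables TLS 1.0 and 1.1
module(load="imtcp"
       streamDriver.mode="1"
       streamDriver.authMode="anon"
       gnutlsPriorityString="NORMAL:-VERS-TLS1.0:-VERS-TLS1.1")
input(type="imtcp" port="6514")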

Packages in the base version

For the full list, see Appendix B, "Packages in BaseOS", on the Red Hat Customer Portal.



Old News ;-)

[Dec 05, 2019] Life as a Linux system administrator (Enable Sysadmin)

System administration isn't easy nor is it for the thin-skinned.
Notable quotes:
"... System administration covers just about every aspect of hardware and software management for both physical and virtual systems. ..."
"... An SA's day is very full. In fact, you never really finish, but you have to pick a point in time to abandon your activities. Being an SA is a 24x7x365 job, which does take its toll on you physically and mentally. You'll hear a lot about burnout in this field. We, at Enable Sysadmin, have written several articles on the topic. ..."
"... You are the person who gets blamed when things go wrong and when things go right, it's "just part of your job." It's a tough place to be. ..."
"... Dealing with people is hard. Learn to breathe, smile, and comply if you want to survive and maintain your sanity. ..."
Dec 05, 2019 | www.redhat.com

... ... ...

What a Linux System Administrator does

A Linux system administrator wears many hats and the smaller your environment, the more hats you will wear. Linux administration covers backups, file restores, disaster recovery, new system builds, hardware maintenance, automation, user maintenance, filesystem housekeeping, application installation and configuration, system security management, and storage management. System administration covers just about every aspect of hardware and software management for both physical and virtual systems.

Oddly enough, you also need a broad knowledge base of network configuration, virtualization, interoperability, and yes, even Windows operating systems. A Linux system administrator needs to have some technical knowledge of network security, firewalls, databases, and all aspects of a working network. The reason is that, while you're primarily a Linux SA, you're also part of a larger support team that often must work together to solve complex problems.

Security, in some form or another, is often at the root of issues confronting a support team. A user might not have proper access or too much access. A daemon might not have the correct permissions to write to a log directory. A firewall exception hasn't been saved into the running configuration of a network appliance. There are hundreds of fail points in a network and your job is to help locate and resolve failures.

Linux system administration also requires that you stay on top of best practices, learn new software, maintain patches, read and comply with security notifications, and apply hardware updates. An SA's day is very full. In fact, you never really finish, but you have to pick a point in time to abandon your activities. Being an SA is a 24x7x365 job, which does take its toll on you physically and mentally. You'll hear a lot about burnout in this field. We, at Enable Sysadmin, have written several articles on the topic.

The hardest part of the job

Doing the technical stuff is relatively easy. It's dealing with people that makes the job really hard. That sounds terrible but it's true. On one side, you deal with your management, which is not always easy. You are the person who gets blamed when things go wrong and when things go right, it's "just part of your job." It's a tough place to be.

Coworkers don't seem to make life better for the SA. They should, but they often don't. You'll deal with lazy, unmotivated coworkers so often that you'll feel that you're carrying all the weight of the job yourself. Not all coworkers are bad. Some are helpful, diligent, proactive types, but I've never had the pleasure of working with too many of them. It's hard to do your work and then take on the dubious responsibility of making sure everyone else does theirs as well.

And then there are users. Oh the bane of every SA's life, the end user. An SA friend of mine once said, "You know, this would be a great job if I just didn't have to interface with users." Agreed. But then again, with no users, there's probably also not a job. Dealing with computers is easy. Dealing with people is hard. Learn to breathe, smile, and comply if you want to survive and maintain your sanity.

... ... ...

[Nov 09, 2019] Mirroring a running system into a ramdisk (Oracle Linux Blog)

Nov 09, 2019 | blogs.oracle.com


  • October 29, 2019
Mirroring a running system into a ramdisk, by Greg Marsden

In this blog post, Oracle Linux kernel developer William Roche presents a method to mirror a running system into a ramdisk.

A RAM mirrored System ?

There are cases where a system can boot correctly but, after some time, lose access to its system disk - for example, an iSCSI system disk configuration that has network issues, or any other disk driver problem. Once the system disk is no longer accessible, we rapidly face a hang situation followed by I/O failures, without the possibility of local investigation on this machine. I/O errors can be reported on the console:

 XFS (dm-0): Log I/O Error Detected....

Or losing access to basic commands like:

# ls
-bash: /bin/ls: Input/output error

The approach presented here mirrors a small system disk into memory to avoid the I/O failure situation described above, which preserves the ability to investigate the reasons for the disk loss. The loss of the system disk is noticed as an I/O hang, at which point the system transitions to using only the ramdisk.

To enable this, the Oracle Linux developer Philip "Bryce" Copeland created the following method (more details will follow):

  • Create a "small enough" system disk image using LVM (a minimized Oracle Linux installation does that)
  • After the system is started, create a ramdisk and use it as a mirror for the system volume
  • When/if access to the (primary) system disk is lost, the ramdisk continues to provide all necessary system functions.
Disk and memory sizes:

As we are going to mirror the entire system installation to the memory, this system installation image has to fit in a fraction of the memory - giving enough memory room to hold the mirror image and necessary running space.

Of course this is a trade-off between the memory available to the server and the minimal disk size needed to run the system. For example a 12GB disk space can be used for a minimal system installation on a 16GB memory machine.

A standard Oracle Linux installation uses XFS as the root filesystem, which (currently) cannot be shrunk. In order to generate a usable "small enough" system, it is recommended to perform the OS installation on a correctly sized disk space. Of course, a correctly sized installation location can be created using partitions of a larger physical disk. Then, the needed application filesystems can be mounted from their current installation disk(s). Some system adjustments may also be required (services added, configuration changes, etc.).

This configuration phase should not be underestimated as it can be difficult to separate the system from the needed applications, and keeping both on the same space could be too large for a RAM disk mirroring.

The idea is not to keep an entire system load active when disk access is lost, but to have enough of a system available to avoid failures of basic commands and to analyze the situation.

We are also going to avoid the use of swap. When the system disk access is lost, we don't want to require it for swap data. Also, we don't want to use more memory space to hold a swap space mirror. The memory is better used directly by the system itself.

The system installation can have a swap space (for example a 1.2GB space on our 12GB disk example) but we are neither going to mirror it nor use it.

Our 12GB disk example could be used with: 1GB /boot space, 11GB LVM Space (1.2GB swap volume, 9.8 GB root volume).

Ramdisk memory footprint:

The ramdisk has to be a little larger (by 8M) than the root volume we are going to mirror, to make room for metadata. We can use two types of ramdisk:

  • A classical Block Ram Disk (brd) device
  • A memory compressed Ram Block Device (zram)

We can expect roughly 30% to 50% memory space savings from zram compared to brd, but zram can only handle 4k I/O blocks. This means that the root filesystem must issue only I/Os that are multiples of 4k.

Basic commands:

Here is a simple list of commands to manually create and use a ramdisk and mirror the root filesystem space. We create a temporary configuration that needs to be undone or the subsequent reboot will not work. But we also provide below a way of automating at startup and shutdown.

Note the root volume size (considered to be ol/root in this example):

# lvs --units k -o lv_size ol/root
  LSize
  10268672.00k

Create a ramdisk a little larger than that (at least 8M larger):

# modprobe brd rd_nr=1 rd_size=$((10268672 + 8*1024))

Verify the created disk:

# lsblk /dev/ram0
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
ram0   1:0    0 9.8G  0 disk

Put the disk under lvm control

# pvcreate /dev/ram0
  Physical volume "/dev/ram0" successfully created.
# vgextend ol /dev/ram0
  Volume group "ol" successfully extended
# vgscan --cache
  Reading volume groups from cache.
  Found volume group "ol" using metadata type lvm2
# lvconvert -y -m 1 ol/root /dev/ram0
  Logical volume ol/root successfully converted.

We now have ol/root mirrored to our /dev/ram0 disk.

# lvs -a -o +devices
  LV               VG  Attr        LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync  Convert Devices
  root             ol  rwi-aor---  9.79g                                      40.70             root_rimage_0(0),root_rimage_1(0)
  [root_rimage_0]  ol  iwi-aor---  9.79g                                                        /dev/sda2(307)
  [root_rimage_1]  ol  Iwi-aor---  9.79g                                                        /dev/ram0(1)
  [root_rmeta_0]   ol  ewi-aor---  4.00m                                                        /dev/sda2(2814)
  [root_rmeta_1]   ol  ewi-aor---  4.00m                                                        /dev/ram0(0)
  swap             ol  -wi-ao---- <1.20g                                                        /dev/sda2(0)

A few minutes (or seconds) later, the synchronization is completed:

# lvs -a -o +devices
  LV               VG  Attr        LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync  Convert Devices
  root             ol  rwi-aor---  9.79g                                      100.00            root_rimage_0(0),root_rimage_1(0)
  [root_rimage_0]  ol  iwi-aor---  9.79g                                                        /dev/sda2(307)
  [root_rimage_1]  ol  iwi-aor---  9.79g                                                        /dev/ram0(1)
  [root_rmeta_0]   ol  ewi-aor---  4.00m                                                        /dev/sda2(2814)
  [root_rmeta_1]   ol  ewi-aor---  4.00m                                                        /dev/ram0(0)
  swap             ol  -wi-ao---- <1.20g                                                        /dev/sda2(0)

We have our mirrored configuration running !

For security, we can also remove the swap and the /boot and /boot/efi (if it exists) mount points:

# swapoff -a
# umount /boot/efi
# umount /boot

Stopping the system also requires some actions, as you need to clean up the configuration so that the next boot will not look for a ramdisk that is gone.

# lvconvert -y -m 0 ol/root /dev/ram0
  Logical volume ol/root successfully converted.
# vgreduce ol /dev/ram0
  Removed "/dev/ram0" from volume group "ol"
# mount /boot
# mount /boot/efi
# swapon -a
What about in-memory compression ?

As indicated above, zRAM devices can compress data in-memory, but 2 main problems need to be fixed:

  • LVM does not take zRAM devices into account by default
  • zRAM only works with 4K I/Os
Make lvm work with zram:

The LVM configuration file has to be changed to take the "zram" device type into account. Include the following "types" entry in the "devices" section of the /etc/lvm/lvm.conf file:

devices {
        types = [ "zram", 16 ]
}
Root file system I/Os:

A standard Oracle Linux installation uses XFS, and we can check the sector size in use (which depends on the disk type used) with:

# xfs_info /
meta-data=/dev/mapper/ol-root    isize=256    agcount=4, agsize=641792 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=2567168, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

We can notice here that the sector size (sectsz) used on this root fs is a standard 512 bytes. This fs type cannot be mirrored with a zRAM device, and needs to be recreated with 4k sector sizes.

Transforming the root file system to 4k sector size:

This is simply a backup (to a zram disk) and restore procedure after recreating the root FS. To do so, the system has to be booted from another system image; booting from an installation DVD image is a good option.

  • Boot from an OL installation DVD [Choose "Troubleshooting", "Rescue a Oracle Linux system", "3) Skip to shell"]
  • Activate and mount the root volume
sh-4.2# vgchange -a y ol
  2 logical volume(s) in volume group "ol" now active
sh-4.2# mount /dev/mapper/ol-root /mnt
  • create a zram to store our disk backup
sh-4.2# modprobe zram
sh-4.2# echo 10G > /sys/block/zram0/disksize
sh-4.2# mkfs.xfs /dev/zram0
meta-data=/dev/zram0             isize=256    agcount=4, agsize=655360 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0, sparse=0
data     =                       bsize=4096   blocks=2621440, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
sh-4.2# mkdir /mnt2
sh-4.2# mount /dev/zram0 /mnt2
sh-4.2# xfsdump -L BckUp -M dump -f /mnt2/ROOT /mnt
xfsdump: using file dump (drive_simple) strategy
xfsdump: version 3.1.7 (dump format 3.0) - type ^C for status and control
xfsdump: level 0 dump of localhost:/mnt
...
xfsdump: dump complete: 130 seconds elapsed
xfsdump: Dump Summary:
xfsdump:   stream 0 /mnt2/ROOT OK (success)
xfsdump: Dump Status: SUCCESS
sh-4.2# umount /mnt
  • recreate the xfs on the disk with a 4k sector size
sh-4.2# mkfs.xfs -f -s size=4096 /dev/mapper/ol-root
meta-data=/dev/mapper/ol-root    isize=256    agcount=4, agsize=641792 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0, sparse=0
data     =                       bsize=4096   blocks=2567168, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
sh-4.2# mount /dev/mapper/ol-root /mnt
  • restore the backup
sh-4.2# xfsrestore -f /mnt2/ROOT /mnt
xfsrestore: using file dump (drive_simple) strategy
xfsrestore: version 3.1.7 (dump format 3.0) - type ^C for status and control
xfsrestore: searching media for dump
...
xfsrestore: restore complete: 337 seconds elapsed
xfsrestore: Restore Summary:
xfsrestore:   stream 0 /mnt2/ROOT OK (success)
xfsrestore: Restore Status: SUCCESS
sh-4.2# umount /mnt
sh-4.2# umount /mnt2
  • reboot the machine on its disk (may need to remove the DVD)
sh-4.2# reboot
  • login and verify the root filesystem
$ xfs_info /
meta-data=/dev/mapper/ol-root    isize=256    agcount=4, agsize=641792 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=2567168, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

With sectsz=4096, our system is now ready for zRAM mirroring.

Basic commands with a zRAM device:

# modprobe zram
# zramctl --find --size 10G
/dev/zram0
# pvcreate /dev/zram0
  Physical volume "/dev/zram0" successfully created.
# vgextend ol /dev/zram0
  Volume group "ol" successfully extended
# vgscan --cache
  Reading volume groups from cache.
  Found volume group "ol" using metadata type lvm2
# lvconvert -y -m 1 ol/root /dev/zram0
  Logical volume ol/root successfully converted.
# lvs -a -o +devices
  LV               VG  Attr        LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync  Convert Devices
  root             ol  rwi-aor---  9.79g                                      12.38             root_rimage_0(0),root_rimage_1(0)
  [root_rimage_0]  ol  iwi-aor---  9.79g                                                        /dev/sda2(307)
  [root_rimage_1]  ol  Iwi-aor---  9.79g                                                        /dev/zram0(1)
  [root_rmeta_0]   ol  ewi-aor---  4.00m                                                        /dev/sda2(2814)
  [root_rmeta_1]   ol  ewi-aor---  4.00m                                                        /dev/zram0(0)
  swap             ol  -wi-ao---- <1.20g                                                        /dev/sda2(0)
# lvs -a -o +devices
  LV               VG  Attr        LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync  Convert Devices
  root             ol  rwi-aor---  9.79g                                      100.00            root_rimage_0(0),root_rimage_1(0)
  [root_rimage_0]  ol  iwi-aor---  9.79g                                                        /dev/sda2(307)
  [root_rimage_1]  ol  iwi-aor---  9.79g                                                        /dev/zram0(1)
  [root_rmeta_0]   ol  ewi-aor---  4.00m                                                        /dev/sda2(2814)
  [root_rmeta_1]   ol  ewi-aor---  4.00m                                                        /dev/zram0(0)
  swap             ol  -wi-ao---- <1.20g                                                        /dev/sda2(0)
# zramctl
NAME       ALGORITHM DISKSIZE  DATA COMPR TOTAL STREAMS MOUNTPOINT
/dev/zram0 lzo            10G  9.8G  5.3G  5.5G       1

The compressed disk uses a total of 5.5GB of memory to mirror a 9.8G volume (which in this case holds 8.5G of data).

Removal is performed the same way as brd, except that the device is /dev/zram0 instead of /dev/ram0.

Automating the process:

Fortunately, the procedure can be automated on system boot and shutdown with the following scripts (given as examples).

The start method: /usr/sbin/start-raid1-ramdisk: [ https://github.com/oracle/linux-blog-sample-code/blob/ramdisk-system-image/start-raid1-ramdisk ]

After a chmod 555 /usr/sbin/start-raid1-ramdisk, running this script on a 4k xfs root file system should show something like:

# /usr/sbin/start-raid1-ramdisk
  Volume group "ol" is already consistent.
RAID1 ramdisk: intending to use 10276864 K of memory for facilitation of [ / ]
  Physical volume "/dev/zram0" successfully created.
  Volume group "ol" successfully extended
  Logical volume ol/root successfully converted.
Waiting for mirror to synchronize...
LVM RAID1 sync of [ / ] took 00:01:53 sec
  Logical volume ol/root changed.
NAME       ALGORITHM DISKSIZE  DATA COMPR TOTAL STREAMS MOUNTPOINT
/dev/zram0 lz4           9.8G  9.8G  5.5G  5.8G       1

The stop method: /usr/sbin/stop-raid1-ramdisk: [ https://github.com/oracle/linux-blog-sample-code/blob/ramdisk-system-image/stop-raid1-ramdisk ]

After a chmod 555 /usr/sbin/stop-raid1-ramdisk, running this script should show something like:

# /usr/sbin/stop-raid1-ramdisk
  Volume group "ol" is already consistent.
  Logical volume ol/root changed.
  Logical volume ol/root successfully converted.
  Removed "/dev/zram0" from volume group "ol"
  Labels on physical volume "/dev/zram0" successfully wiped.

A service Unit file can also be created: /etc/systemd/system/raid1-ramdisk.service [https://github.com/oracle/linux-blog-sample-code/blob/ramdisk-system-image/raid1-ramdisk.service]

[Unit]
Description=Enable RAMdisk RAID 1 on LVM
After=local-fs.target
Before=shutdown.target reboot.target halt.target

[Service]
ExecStart=/usr/sbin/start-raid1-ramdisk
ExecStop=/usr/sbin/stop-raid1-ramdisk
Type=oneshot
RemainAfterExit=yes
TimeoutSec=0

[Install]
WantedBy=multi-user.target
Conclusion:

When the system disk access problem manifests itself, the ramdisk mirror branch makes it possible to investigate the situation. The goal of this procedure is not to keep the system running on the memory mirror configuration, but to help investigate a bad situation.

When the problem is identified and fixed, I really recommend returning to a standard configuration -- enjoying the entire memory of the system, a standard system disk, a possible swap space, etc.

I hope the method described here can help. I also want to thank Philip "Bryce" Copeland, who created the first prototype of the above scripts, and Mark Kanda, who helped test many aspects of this work, for their reviews.

[Nov 09, 2019] chkservice Is A systemd Unit Manager With A Terminal User Interface

The site is https://github.com/linuxenko/chkservice The tool is written in C++
It looks like in version 0.3 the author increased the complexity by adding features that are probably not needed at all
Nov 07, 2019 | www.linuxuprising.com

chkservice systemd manager
chkservice, a terminal user interface (TUI) for managing systemd units, has been updated recently with window resize and search support.

chkservice is a simplistic systemd unit manager that uses ncurses for its terminal interface. Using it, you can enable or disable, and start or stop, a systemd unit. It also shows each unit's status (enabled, disabled, static or masked).

You can navigate the chkservice user interface using keyboard shortcuts:

  • Up or k to move cursor up
  • Down or j to move cursor down
  • PgUp or b to move page up
  • PgDown or f to move page down
To enable or disable a unit press Space, and to start or stop a unit press s. You can access the help screen, which shows all available keys, by pressing ?.

The command line tool had its first release in August 2017, with no new releases until a few days ago when version 0.2 was released, quickly followed by 0.3.

With the latest 0.3 release, chkservice adds a search feature that allows easily searching through all systemd units.

To search, type / followed by your search query, and press Enter . To search for the next item matching your search query you'll have to type / again, followed by Enter or Ctrl + m (without entering any search text).

Another addition to the latest chkservice is window resize support. In the 0.1 version, the tool would close when the user tried to resize the terminal window. That's no longer the case: chkservice now allows the terminal window it runs in to be resized.

And finally, the last addition in chkservice 0.3 is G/g navigation support. Press G (Shift + g) to navigate to the bottom, and g to navigate to the top.

Download and install chkservice

The initial (0.1) chkservice version can be found in the official repositories of a few Linux distributions, including Debian and Ubuntu (and Debian or Ubuntu based Linux distribution -- e.g. Linux Mint, Pop!_OS, Elementary OS and so on).

There are some third-party repositories available as well, including a Fedora Copr, Ubuntu / Linux Mint PPA, and Arch Linux AUR, but at the time I'm writing this, only the AUR package was updated to the latest chkservice version 0.3.

You may also install chkservice from source. Use the instructions provided in the tool's readme to either create a DEB package or install it directly.
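
For example, on Debian- or Ubuntu-based systems the packaged (older 0.1) version can be installed from the standard repositories, and on Arch the 0.3 release is available from the AUR; the commands below are typical invocations rather than guaranteed package names, so check your distribution first:

# Debian / Ubuntu and derivatives (packaged version may lag behind 0.3)
sudo apt install chkservice

# Arch Linux, via an AUR helper of your choice
yay -S chkservice

# Run it -- managing units requires root
sudo chkservice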

[Nov 08, 2019] How to use cron in Linux by David Both

Nov 06, 2017 | opensource.com
No time for commands? Scheduling tasks with cron means programs can run but you don't have to stay up late.


Instead, I use two service utilities that allow me to run commands, programs, and tasks at predetermined times. The cron and at services enable sysadmins to schedule tasks to run at a specific time in the future. The at service specifies a one-time task that runs at a certain time. The cron service can schedule tasks on a repetitive basis, such as daily, weekly, or monthly.

In this article, I'll introduce the cron service and how to use it.

Common (and uncommon) cron uses

I use the cron service to schedule obvious things, such as regular backups that occur daily at 2 a.m. I also use it for less obvious things.

  • The system times (i.e., the operating system time) on my many computers are set using the Network Time Protocol (NTP). While NTP sets the system time, it does not set the hardware time, which can drift. I use cron to set the hardware time based on the system time.
  • I also have a Bash program I run early every morning that creates a new "message of the day" (MOTD) on each computer. It contains information, such as disk usage, that should be current in order to be useful.
  • Many system processes and services, like Logwatch , logrotate , and Rootkit Hunter , use the cron service to schedule tasks and run programs every day.

The crond daemon is the background service that enables cron functionality.

The cron service checks for files in the /var/spool/cron and /etc/cron.d directories and the /etc/anacrontab file. The contents of these files define cron jobs that are to be run at various intervals. The individual user cron files are located in /var/spool/cron , and system services and applications generally add cron job files in the /etc/cron.d directory. The /etc/anacrontab is a special case that will be covered later in this article.
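
A quick way to see what is already scheduled on a host is to look in those locations directly; a small illustrative check (the output will of course differ per system):

# Per-user crontabs (one file per user)
ls /var/spool/cron
# Jobs dropped in by packages and applications
ls /etc/cron.d
# The anacron schedule discussed later in this article
cat /etc/anacrontab
# The current user's crontab, if any
crontab -l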

Using crontab

The cron utility runs based on commands specified in a cron table ( crontab ). Each user, including root, can have a cron file. These files don't exist by default, but can be created in the /var/spool/cron directory using the crontab -e command that's also used to edit a cron file (see the script below). I strongly recommend that you not use a standard editor (such as Vi, Vim, Emacs, Nano, or any of the many other editors that are available). Using the crontab command not only allows you to edit the command, it also restarts the crond daemon when you save and exit the editor. The crontab command uses Vi as its underlying editor, because Vi is always present (on even the most basic of installations).

New cron files are empty, so commands must be added from scratch. I added the job definition example below to my own cron files, just as a quick reference, so I know what the various parts of a command mean. Feel free to copy it for your own use.

# crontab -e
SHELL=/bin/bash
MAILTO=root@example.com
PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin

# For details see man 4 crontabs

# Example of job definition:
# .---------------- minute (0 - 59)
# | .------------- hour (0 - 23)
# | | .---------- day of month (1 - 31)
# | | | .------- month (1 - 12) OR jan,feb,mar,apr ...
# | | | | .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# | | | | |
# * * * * * user-name command to be executed

# backup using the rsbu program to the internal 4TB HDD and then 4TB external
01 01 * * * /usr/local/bin/rsbu -vbd1 ; /usr/local/bin/rsbu -vbd2

# Set the hardware clock to keep it in sync with the more accurate system clock
03 05 * * * /sbin/hwclock --systohc

# Perform monthly updates on the first of the month
# 25 04 1 * * /usr/bin/dnf -y update

The crontab command is used to view or edit the cron files.

The first three lines in the code above set up a default environment. The environment must be set to whatever is necessary for a given user because cron does not provide an environment of any kind. The SHELL variable specifies the shell to use when commands are executed. This example specifies the Bash shell. The MAILTO variable sets the email address where cron job results will be sent. These emails can provide the status of the cron job (backups, updates, etc.) and consist of the output you would see if you ran the program manually from the command line. The third line sets up the PATH for the environment. Even though the path is set here, I always prepend the fully qualified path to each executable.

There are several comment lines in the example above that detail the syntax required to define a cron job. I'll break those commands down, then add a few more to show you some more advanced capabilities of crontab files.

01 01 * * * /usr/local/bin/rsbu -vbd1 ; /usr/local/bin/rsbu -vbd2

This line in my /etc/crontab runs a script that performs backups for my systems.

This line runs my self-written Bash shell script, rsbu , that backs up all my systems. This job kicks off at 1:01 a.m. (01 01) every day. The asterisks (*) in positions three, four, and five of the time specification are like file globs, or wildcards, for other time divisions; they specify "every day of the month," "every month," and "every day of the week." This line runs my backups twice; one backs up to an internal dedicated backup hard drive, and the other backs up to an external USB drive that I can take to the safe deposit box.

The following line sets the hardware clock on the computer using the system clock as the source of an accurate time. This line is set to run at 5:03 a.m. (03 05) every day.

03 05 * * * /sbin/hwclock --systohc

This line sets the hardware clock using the system time as the source.

I was using the third and final cron job (commented out) to perform a dnf or yum update at 04:25 a.m. on the first day of each month, but I commented it out so it no longer runs.

# 25 04 1 * * /usr/bin/dnf -y update

This line used to perform a monthly update, but I've commented it out.

Other scheduling tricks

Now let's do some things that are a little more interesting than these basics. Suppose you want to run a particular job every Thursday at 3 p.m.:

00 15 * * Thu /usr/local/bin/mycronjob.sh

This line runs mycronjob.sh every Thursday at 3 p.m.

Or, maybe you need to run quarterly reports after the end of each quarter. The cron service has no option for "The last day of the month," so instead you can use the first day of the following month, as shown below. (This assumes that the data needed for the reports will be ready when the job is set to run.)

02 03 1 1,4,7,10 * /usr/local/bin/reports.sh

This cron job runs quarterly reports on the first day of the month after a quarter ends.

The following shows a job that runs one minute past every hour between 9:01 a.m. and 5:01 p.m.

01 09-17 * * * /usr/local/bin/hourlyreminder.sh

Sometimes you want to run jobs at regular times during normal business hours.

I have encountered situations where I need to run a job every two, three, or four hours. That can be accomplished by dividing the hours by the desired interval, such as */3 for every three hours, or 6-18/3 to run every three hours between 6 a.m. and 6 p.m. Other intervals can be divided similarly; for example, the expression */15 in the minutes position means "run the job every 15 minutes."

*/5 08-18/2 * * * /usr/local/bin/mycronjob.sh

This cron job runs every five minutes during every hour between 8 a.m. and 5:58 p.m.

One thing to note: The division expressions must result in a remainder of zero for the job to run. That's why, in this example, the job is set to run every five minutes (08:05, 08:10, 08:15, etc.) during even-numbered hours from 8 a.m. to 6 p.m., but not during any odd-numbered hours. For example, the job will not run at all from 9:00 to 9:59 a.m.

I am sure you can come up with many other possibilities based on these examples.

Limiting cron access

More Linux resources

Regular users with cron access could make mistakes that, for example, might cause system resources (such as memory and CPU time) to be swamped. To prevent possible misuse, the sysadmin can limit user access by creating an /etc/cron.allow file that contains a list of all users with permission to create cron jobs. The root user cannot be prevented from using cron.
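
A minimal sketch of such a file follows; the usernames are just placeholders for the accounts you actually want to allow:

# /etc/cron.allow -- one username per line; only listed users (and root) may use crontab
student
dboth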

If non-root users are prevented from creating their own cron jobs, it may be necessary for root to add their cron jobs to the root crontab. "But wait!" you say. "Doesn't that run those jobs as root?" Not necessarily. In the first example in this article, the username field shown in the comments can be used to specify the user ID a job is to have when it runs. This prevents the specified non-root user's jobs from running as root. The following example shows a job definition that runs a job as the user "student":

04 07 * * * student /usr/local/bin/mycronjob.sh

If no user is specified, the job is run as the user that owns the crontab file, root in this case.

cron.d

The directory /etc/cron.d is where some applications, such as SpamAssassin and sysstat , install cron files. Because there is no spamassassin or sysstat user, these programs need a place to locate cron files, so they are placed in /etc/cron.d .

The /etc/cron.d/sysstat file below contains cron jobs that relate to system activity reporting (SAR). These cron files have the same format as a user cron file.

# Run system activity accounting tool every 10 minutes
*/10 * * * * root /usr/lib64/sa/sa1 1 1
# Generate a daily summary of process accounting at 23:53
53 23 * * * root /usr/lib64/sa/sa2 -A

The sysstat package installs the /etc/cron.d/sysstat cron file to run programs for SAR.

The sysstat cron file has two lines that perform tasks. The first line runs the sa1 program every 10 minutes to collect data stored in special binary files in the /var/log/sa directory. Then, every night at 23:53, the sa2 program runs to create a daily summary.

Scheduling tips

Some of the times I set in the crontab files seem rather random -- and to some extent they are. Trying to schedule cron jobs can be challenging, especially as the number of jobs increases. I usually have only a few tasks to schedule on each of my computers, which is simpler than in some of the production and lab environments where I have worked.

One system I administered had around a dozen cron jobs that ran every night and an additional three or four that ran on weekends or the first of the month. That was a challenge, because if too many jobs ran at the same time -- especially the backups and compiles -- the system would run out of RAM and nearly fill the swap file, which resulted in system thrashing while performance tanked, so nothing got done. We added more memory and improved how we scheduled tasks. We also removed a task that was very poorly written and used large amounts of memory.

The crond service assumes that the host computer runs all the time. That means that if the computer is turned off during a period when cron jobs were scheduled to run, they will not run until the next time they are scheduled. This might cause problems if they are critical cron jobs. Fortunately, there is another option for running jobs at regular intervals: anacron .

anacron

The anacron program performs the same function as crond, but it adds the ability to run jobs that were skipped, such as if the computer was off or otherwise unable to run the job for one or more cycles. This is very useful for laptops and other computers that are turned off or put into sleep mode.

As soon as the computer is turned on and booted, anacron checks to see whether configured jobs missed their last scheduled run. If they have, those jobs run immediately, but only once (no matter how many cycles have been missed). For example, if a weekly job was not run for three weeks because the system was shut down while you were on vacation, it would be run soon after you turn the computer on, but only once, not three times.

The anacron program provides some easy options for running regularly scheduled tasks. Just install your scripts in the /etc/cron.[hourly|daily|weekly|monthly] directories, depending how frequently they need to be run.

How does this work? The sequence is simpler than it first appears.

  1. The crond service runs the cron job specified in /etc/cron.d/0hourly .
# Run the hourly jobs
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
01 * * * * root run-parts /etc/cron.hourly

The contents of /etc/cron.d/0hourly cause the shell scripts located in /etc/cron.hourly to run.

  2. The cron job specified in /etc/cron.d/0hourly runs the run-parts program once per hour.
  3. The run-parts program runs all the scripts located in the /etc/cron.hourly directory.
  4. The /etc/cron.hourly directory contains the 0anacron script, which runs the anacron program using the /etc/anacrontab configuration file shown here.
# /etc/anacrontab: configuration file for anacron

# See anacron(8) and anacrontab(5) for details.

SHELL=/bin/sh
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
# the maximal random delay added to the base delay of the jobs
RANDOM_DELAY=45
# the jobs will be started during the following hours only
START_HOURS_RANGE=3-22

#period in days   delay in minutes   job-identifier   command
1         5       cron.daily         nice run-parts /etc/cron.daily
7         25      cron.weekly        nice run-parts /etc/cron.weekly
@monthly  45      cron.monthly       nice run-parts /etc/cron.monthly

The /etc/anacrontab file runs the executable files in the cron.[daily|weekly|monthly] directories at the appropriate times.

  5. The anacron program runs the programs located in /etc/cron.daily once per day; it runs the jobs located in /etc/cron.weekly once per week, and the jobs in cron.monthly once per month. Note the specified delay times in each line that help prevent these jobs from overlapping themselves and other cron jobs.

Instead of placing complete Bash programs in the cron.X directories, I install them in the /usr/local/bin directory, which allows me to run them easily from the command line. Then I add a symlink in the appropriate cron directory, such as /etc/cron.daily .
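
For instance, a hypothetical daily report script kept in /usr/local/bin could be wired into anacron like this (the script name is illustrative, not one from the article):

# Keep the real program where it can also be run by hand...
chmod 755 /usr/local/bin/daily-report.sh
# ...and let run-parts find it via a symlink in the daily directory
# (no .sh suffix on the link, so run-parts will not skip it)
ln -s /usr/local/bin/daily-report.sh /etc/cron.daily/daily-report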

The anacron program is not designed to run programs at specific times. Rather, it is intended to run programs at intervals that begin at the specified times, such as 3 a.m. (see the START_HOURS_RANGE line in the script just above) of each day, on Sunday (to begin the week), and on the first day of the month. If any one or more cycles are missed, anacron will run the missed jobs once, as soon as possible.

More on setting limits

I use most of these methods for scheduling tasks to run on my computers. All those tasks are ones that need to run with root privileges. It's rare in my experience that regular users really need a cron job. One case was a developer user who needed a cron job to kick off a daily compile in a development lab.

It is important to restrict access to cron functions by non-root users. However, there are circumstances when a user needs to set a task to run at pre-specified times, and cron can allow them to do that. Many users do not understand how to properly configure these tasks using cron and they make mistakes. Those mistakes may be harmless, but, more often than not, they can cause problems. By setting functional policies that cause users to interact with the sysadmin, individual cron jobs are much less likely to interfere with other users and other system functions.

It is possible to set limits on the total resources that can be allocated to individual users or groups, but that is an article for another time.

For more information, the man pages for cron , crontab , anacron , anacrontab , and run-parts all have excellent information and descriptions of how the cron system works.


Ben Cotton on 06 Nov 2017 Permalink

One problem I used to have in an old job was cron jobs that would hang for some reason. This old sysadvent post had some good suggestions for how to deal with that: http://sysadvent.blogspot.com/2009/12/cron-practices.html

Jesper Larsen on 06 Nov 2017 Permalink

Cron is definitely a good tool. But if you need to do more advanced scheduling then Apache Airflow is great for this.

Airflow has a number of advantages over Cron. The most important are: Dependencies (let tasks run after other tasks), nice web based overview, automatic failure recovery and a centralized scheduler. The disadvantages are that you will need to setup the scheduler and some other centralized components on one server and a worker on each machine you want to run stuff on.

You definitely want to use Cron for some stuff. But if you find that Cron is too limited for your use case I would recommend looking into Airflow.

Leslle Satenstein on 13 Nov 2017 Permalink

Hi David,
you have a well-done article. Much appreciated. I make use of the @reboot crontab entry in root's crontab, and I run the following.

@reboot /bin/dofstrim.sh

I wanted to run fstrim for my SSD drive once and only once per week.
dofstrim.sh is a script that runs the "fstrim" program once per week, irrespective of the number of times the system is rebooted. I happen to have several Linux systems sharing one computer, and each system has a root crontab with that entry. Since I may hop from Linux to Linux in the day or several times per week, my dofstrim.sh only runs fstrim once per week, irrespective which Linux system I boot. I make use of a common partition to all Linux systems, a partition mounted as "/scratch" and the wonderful Linux command line "date" program.

The dofstrim.sh listing follows below.

#!/bin/bash
# run fstrim either once/week or once/day not once for every reboot
#
# Use the date function to extract today's day number or week number
# the day number range is 1..366, weekno is 1 to 53
#WEEKLY=0 #once per day
WEEKLY=1 #once per week
lockdir='/scratch/lock/'

if [[ WEEKLY -eq 1 ]]; then
    dayno="$lockdir/dofstrim.weekno"
    today=$(date +%V)
else
    dayno=$lockdir/dofstrim.dayno
    today=$(date +%j)
fi

prevval="000"

if [ -f "$dayno" ]
then
    prevval=$(cat ${dayno})
    if [ x$prevval = x ]; then
        prevval="000"
    fi
else
    mkdir -p $lockdir
fi

if [ ${prevval} -ne ${today} ]
then
    /sbin/fstrim -a
    echo $today > $dayno
fi

I had thought to use anacron, but then fstrim would be run frequently as each linux's anacron would have a similar entry.
The "date" program produces a day number or a week number, depending upon the +%V or +%j

Leslle Satenstein on 13 Nov 2017 Permalink

Running a report on the last day of the month is easy if you use the date program. Use the date function from Linux as shown

*/9 15 28-31 * * [ `date -d +'1 day' +\%d` -eq 1 ] && echo "Tomorrow is the first of month Today(now) is `date`" >> /root/message

Once per day from the 28th to the 31st, the date function is executed.
If the result of date +1day is the first of the month, today must be the last day of the month.

sgtrock on 14 Nov 2017 Permalink

Why not use crontab to launch something like Ansible playbooks instead of simple bash scripts? A lot easier to troubleshoot and manage these days. :-)

[Nov 08, 2019] Bash aliases you can't live without by Seth Kenlon

Jul 31, 2019 | opensource.com

Tired of typing the same long commands over and over? Do you feel inefficient working on the command line? Bash aliases can make a world of difference.

A Bash alias is a method of supplementing or overriding Bash commands with new ones. Bash aliases make it easy for users to customize their experience in a POSIX terminal. They are often defined in $HOME/.bashrc or $HOME/.bash_aliases (which must be loaded by $HOME/.bashrc).

Most distributions add at least some popular aliases in the default .bashrc file of any new user account. These are simple ones to demonstrate the syntax of a Bash alias:

alias ls='ls -F'
alias ll='ls -lh'

Not all distributions ship with pre-populated aliases, though. If you add aliases manually, then you must load them into your current Bash session:

$ source ~/.bashrc

Otherwise, you can close your terminal and re-open it so that it reloads its configuration file.

With those aliases defined in your Bash initialization script, you can then type ll and get the results of ls -lh; and when you type ls you get, instead of the output of plain old ls, a listing with classification suffixes (from ls -F).

Those aliases are great to have, but they just scratch the surface of what's possible. Here are the top 10 Bash aliases that, once you try them, you won't be able to live without.

Set up first

Before beginning, create a file called ~/.bash_aliases :

$ touch ~/.bash_aliases

Then, make sure that this code appears in your ~/.bashrc file:

if [ -e $HOME/.bash_aliases ]; then
    source $HOME/.bash_aliases
fi

If you want to try any of the aliases in this article for yourself, enter them into your .bash_aliases file, and then load them into your Bash session with the source ~/.bashrc command.

Sort by file size

If you started your computing life with GUI file managers like Nautilus in GNOME, the Finder in MacOS, or Explorer in Windows, then you're probably used to sorting a list of files by their size. You can do that in a terminal as well, but it's not exactly succinct.

Add this alias to your configuration on a GNU system:

alias lt='ls --human-readable --size -1 -S --classify'

This alias replaces lt with an ls command that displays the size of each item, and then sorts it by size, in a single column, with a notation to indicate the kind of file. Load your new alias, and then try it out:

$ source ~/.bashrc
$ lt
total 344K
140K configure *
44K aclocal.m4
36K LICENSE
32K config.status *
24K Makefile
24K Makefile.in
12K config.log
8.0K README.md
4.0K info.slackermedia.Git-portal.json
4.0K git-portal.spec
4.0K flatpak.path.patch
4.0K Makefile.am *
4.0K dot-gitlab.ci.yml
4.0K configure.ac *
0 autom4te.cache /
0 share /
0 bin /
0 install-sh @
0 compile @
0 missing @
0 COPYING @

On MacOS or BSD, the ls command doesn't have the same options, so this alias works instead:

alias lt='du -sh * | sort -h'

The results of this version are a little different:

$ du -sh * | sort -h
0 compile
0 COPYING
0 install-sh
0 missing
4.0K configure.ac
4.0K dot-gitlab.ci.yml
4.0K flatpak.path.patch
4.0K git-portal.spec
4.0K info.slackermedia.Git-portal.json
4.0K Makefile.am
8.0K README.md
12K config.log
16K bin
24K Makefile
24K Makefile.in
32K config.status
36K LICENSE
44K aclocal.m4
60K share
140K configure
476K autom4te.cache

In fact, even on Linux, that command is useful, because using ls lists directories and symlinks as being 0 in size, which may not be the information you actually want. It's your choice.

Thanks to Brad Alexander for this alias idea.

View only mounted drives

The mount command used to be so simple. With just one command, you could get a list of all the mounted filesystems on your computer, and it was frequently used for an overview of what drives were attached to a workstation. It used to be impressive to see more than three or four entries because most computers don't have many more USB ports than that, so the results were manageable.

Computers are a little more complicated now, and between LVM, physical drives, network storage, and virtual filesystems, the results of mount can be difficult to parse:

sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,seclabel,size=8131024k,nr_inodes=2032756,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
[...]
/dev/nvme0n1p2 on /boot type ext4 (rw,relatime,seclabel)
/dev/nvme0n1p1 on /boot/efi type vfat (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro)
[...]
gvfsd-fuse on /run/user/100977/gvfs type fuse.gvfsd-fuse (rw,nosuid,nodev,relatime,user_id=100977,group_id=100977)
/dev/sda1 on /run/media/seth/pocket type ext4 (rw,nosuid,nodev,relatime,seclabel,uhelper=udisks2)
/dev/sdc1 on /run/media/seth/trip type ext4 (rw,nosuid,nodev,relatime,seclabel,uhelper=udisks2)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)

To solve that problem, try an alias like this:

alias mnt="mount | awk -F' ' '{ printf \"%s\t%s\n\", \$1, \$3; }' | column -t | egrep ^/dev/ | sort"

This alias uses awk to parse the output of mount by column, reducing the output to what you are probably looking for (which hard drives, rather than which filesystems, are mounted):

$ mnt
/dev/mapper/fedora-root /
/dev/nvme0n1p1 /boot/efi
/dev/nvme0n1p2 /boot
/dev/sda1 /run/media/seth/pocket
/dev/sdc1 /run/media/seth/trip

On MacOS, the mount command doesn't provide terribly verbose output, so an alias may be overkill. However, if you prefer a succinct report, try this:

alias mnt='mount | grep -E ^/dev | column -t'

The results:

$ mnt
/dev/disk1s1 on / (apfs, local, journaled)
/dev/disk1s4 on /private/var/vm (apfs, local, noexec, journaled, noatime, nobrowse)

Find a command in your grep history

Sometimes you figure out how to do something in the terminal, and promise yourself that you'll never forget what you've just learned. Then an hour goes by, and you've completely forgotten what you did.

Searching through your Bash history is something everyone has to do from time to time. If you know exactly what you're searching for, you can use Ctrl+R to do a reverse search through your history, but sometimes you can't remember the exact command you want to find.

Here's an alias to make that task a little easier:

alias gh='history | grep'

Here's an example of how to use it:

$ gh bash
482 cat ~/.bashrc | grep _alias
498 emacs ~/.bashrc
530 emacs ~/.bash_aliases
531 source ~/.bashrc

Sort by modification time

It happens every Monday: You get to work, you sit down at your computer, you open a terminal, and you find you've forgotten what you were doing last Friday. What you need is an alias to list the most recently modified files.

You can use the ls command to create an alias to help you find where you left off:

alias left='ls -t -1'

The output is simple, although you can extend it to a long listing with the -l option if you prefer. The alias, as listed, displays this:

$ left
demo.jpeg
demo.xcf
design-proposal.md
rejects.txt
brainstorm.txt
query-letter.xml

Count files

If you need to know how many files you have in a directory, the solution is one of the most classic examples of UNIX command construction: You list files with the ls command, control its output to be only one column with the -1 option, and then pipe that output to the wc (word count) command to count how many lines of single files there are.

It's a brilliant demonstration of how the UNIX philosophy allows users to build their own solutions using small system components. This command combination is also a lot to type if you happen to do it several times a day, and it doesn't exactly work for a directory of directories without using the -R option, which introduces new lines to the output and renders the exercise useless.

Instead, this alias makes the process easy:

alias count='find . -type f | wc -l'

This one counts files, ignoring directories, but not the contents of directories. If you have a project folder containing two directories, each of which contains two files, the alias returns four, because there are four files in the entire project.

$ ls
foo bar
$ count
4

Create a Python virtual environment

Do you code in Python?

Do you code in Python a lot?

If you do, then you know that creating a Python virtual environment requires, at the very least, 53 keystrokes.
That's 49 too many, but that's easily circumvented with two new aliases called ve and va :

alias ve='python3 -m venv ./venv'
alias va='source ./venv/bin/activate'

Running ve creates a new directory, called venv , containing the usual virtual environment filesystem for Python3. The va alias activates the environment in your current shell:

$ cd my-project
$ ve
$ va
(venv) $

Add a copy progress bar

Everybody pokes fun at progress bars because they're infamously inaccurate. And yet, deep down, we all seem to want them. The UNIX cp command has no progress bar, but it does have a -v option for verbosity, meaning that it echoes the name of each file being copied to your terminal. That's a pretty good hack, but it doesn't work so well when you're copying one big file and want some indication of how much of the file has yet to be transferred.

The pv command provides a progress bar during copy, but it's not common as a default application. On the other hand, the rsync command is included in the default installation of nearly every POSIX system available, and it's widely recognized as one of the smartest ways to copy files both remotely and locally.

Better yet, it has a built-in progress bar.

alias cpv='rsync -ah --info=progress2'

Using this alias is the same as using the cp command:

$ cpv bigfile.flac /run/media/seth/audio/
3.83M 6% 213.15MB/s 0:00:00 (xfr#4, to-chk=0/4)

An interesting side effect of using this command is that rsync copies both files and directories without the -r flag that cp would otherwise require.

Protect yourself from file removal accidents

You shouldn't use the rm command. The rm manual even says so:

Warning : If you use 'rm' to remove a file, it is usually possible to recover the contents of that file. If you want more assurance that the contents are truly unrecoverable, consider using 'shred'.

If you want to remove a file, you should move the file to your Trash, just as you do when using a desktop.

POSIX makes this easy, because the Trash is an accessible, actual location in your filesystem. That location may change, depending on your platform: On a FreeDesktop , the Trash is located at ~/.local/share/Trash , while on MacOS it's ~/.Trash , but either way, it's just a directory into which you place files that you want out of sight until you're ready to erase them forever.

This simple alias provides a way to toss files into the Trash bin from your terminal:

alias tcn='mv --force -t ~/.local/share/Trash '

This alias uses a little-known mv flag that enables you to provide the file you want to move as the final argument, ignoring the usual requirement for that file to be listed first. Now you can use your new command to move files and folders to your system Trash:

$ ls
foo bar
$ tcn foo
$ ls
bar

Now the file is "gone," but only until you realize in a cold sweat that you still need it. At that point, you can rescue the file from your system Trash; be sure to tip the Bash and mv developers on the way out.

Note: If you need a more robust Trash command with better FreeDesktop compliance, see Trashy .

Simplify your Git workflow

Everyone has a unique workflow, but there are usually repetitive tasks no matter what. If you work with Git on a regular basis, then there's probably some sequence you find yourself repeating pretty frequently. Maybe you find yourself going back to the master branch and pulling the latest changes over and over again during the day, or maybe you find yourself creating tags and then pushing them to the remote, or maybe it's something else entirely.

No matter what Git incantation you've grown tired of typing, you may be able to alleviate some pain with a Bash alias. Largely thanks to its ability to pass arguments to hooks, Git has a rich set of introspective commands that save you from having to perform uncanny feats in Bash.

For instance, while you might struggle to locate, in Bash, a project's top-level directory (which, as far as Bash is concerned, is an entirely arbitrary designation, since the absolute top level to a computer is the root directory), Git knows its top level with a simple query. If you study up on Git hooks, you'll find yourself able to find out all kinds of information that Bash knows nothing about, but you can leverage that information with a Bash alias.

Here's an alias to find the top level of a Git project, no matter where in that project you are currently working, and then to change directory to it, change to the master branch, and perform a Git pull:

alias startgit='cd `git rev-parse --show-toplevel` && git checkout master && git pull'

This kind of alias is by no means a universally useful alias, but it demonstrates how a relatively simple alias can eliminate a lot of laborious navigation, commands, and waiting for prompts.

A simpler, and probably more universal, alias returns you to the Git project's top level. This alias is useful because when you're working on a project, that project more or less becomes your "temporary home" directory. It should be as simple to go "home" as it is to go to your actual home, and here's an alias to do it:

alias cg='cd `git rev-parse --show-toplevel`'

Now the command cg takes you to the top of your Git project, no matter how deep into its directory structure you have descended.

Change directories and view the contents at the same time

It was once (allegedly) proposed by a leading scientist that we could solve many of the planet's energy problems by harnessing the energy expended by geeks typing cd followed by ls .
It's a common pattern, because generally when you change directories, you have the impulse or the need to see what's around.

But "walking" your computer's directory tree doesn't have to be a start-and-stop process.

This one's cheating, because it's not an alias at all, but it's a great excuse to explore Bash functions. While aliases are great for quick substitutions, Bash allows you to add local functions in your .bashrc file (or a separate functions file that you load into .bashrc , just as you do your aliases file).

To keep things modular, create a new file called ~/.bash_functions and then have your .bashrc load it:

if [ -e $HOME/.bash_functions ]; then
    source $HOME/.bash_functions
fi

In the functions file, add this code:

function cl() {
    DIR="$*";
    # if no DIR given, go home
    if [ $# -lt 1 ]; then
        DIR=$HOME;
    fi;
    builtin cd "${DIR}" && \
    # use your preferred ls command
    ls -F --color=auto
}

Load the function into your Bash session and then try it out:

$ source ~/.bash_functions
$ cl Documents
foo bar baz
$ pwd
/home/seth/Documents
$ cl ..
Desktop Documents Downloads
[...]
$ pwd
/home/seth

Functions are much more flexible than aliases, but with that flexibility comes the responsibility for you to ensure that your code makes sense and does what you expect. Aliases are meant to be simple, so keep them easy, but useful. For serious modifications to how Bash behaves, use functions or custom shell scripts saved to a location in your PATH .

For the record, there are some clever hacks to implement the cd and ls sequence as an alias, so if you're patient enough, then the sky is the limit even using humble aliases.

Start aliasing and functioning

Customizing your environment is what makes Linux fun, and increasing your efficiency is what makes Linux life-changing. Get started with simple aliases, graduate to functions, and post your must-have aliases in the comments!


ACG on 31 Jul 2019 Permalink

One function I like a lot is a function that diffs a file and its backup.
It goes something like

#!/usr/bin/env bash
file="${1:?File not given}"

if [[ ! -e "$file" || ! -e "$file"~ ]]; then
echo "File doesn't exist or has no backup" 1>&2
exit 1
fi

diff --color=always "$file"{~,} | less -r

I may have gotten the if wrong, but you get the idea. I'm typing this on my phone, away from home.
Cheers

Seth Kenlon on 31 Jul 2019 Permalink

That's pretty slick! I like it.

My backup tool of choice (rdiff-backup) handles these sorts of comparisons pretty well, so I tend to be confident in my backup files. That said, there's always the edge case, and this kind of function is a great solution for those. Thanks!

Kevin Cole on 13 Aug 2019 Permalink

A few of my "cannot-live-withouts" are regex based:

Decomment removes full-line comments and blank lines. For example, when looking at a "stock" /etc/httpd/whatever.conf file that has a gazillion lines in it,

alias decomment='egrep -v "^[[:space:]]*((#|;|//).*)?$" '

will show you that only four lines in the file actually DO anything, and the gazillion minus four are comments. I use this ALL the time with config files, Python (and other languages) code, and god knows where else.

Then there's unprintables and expletives which are both very similar:

alias unprintable='grep --color="auto" -P -n "[\x00-\x1E]"'
alias expletives='grep --color="auto" -P -n "[^\x00-\x7E]" '

The first shows which lines (with line numbers) in a file contain control characters, and the second shows which lines in a file contain anything "above" a RUBOUT, er, excuse me, I mean above ASCII 127. (I feel old.) ;-) Handy when, for example, someone gives you a program that they edited or created with LibreOffice, and oops... half of the quoted strings have "real" curly opening and closing quote marks instead of ASCII 0x22 "straight" quote mark delimiters... But there's actually a few curlies you want to keep, so a "nuke 'em all in one swell foop" approach won't work.

Seth Kenlon on 14 Aug 2019 Permalink

These are great!

Dan Jones on 13 Aug 2019 Permalink

Your `cl` function could be simplified, since `cd` without arguments already goes to home.

```
function cl() {
cd "$@" && \
ls -F --color=auto
}
```

Seth Kenlon on 14 Aug 2019 Permalink

Nice!

jkeener on 20 Aug 2019 Permalink

The first alias in my .bash_aliases file is always:

alias realias='vim ~/.bash_aliases; source ~/.bash_aliases'

replace vim with your favorite editor or $VISUAL

bhuvana on 04 Oct 2019 Permalink

Thanks for this post! I have created a Github repo- https://github.com/bhuvana-guna/awesome-bash-shortcuts
with a motive to create an extended list of aliases/functions for various programs. As I am a newbie to terminal and linux, please do contribute to it with these and other super awesome utilities and help others easily access them.

[Nov 08, 2019] A guide to logging in Python by Mario Corchero

Nov 08, 2019 | opensource.com

13 Sep 2017

Logging is a critical way to understand what's going on inside an application.


There is little worse as a developer than trying to figure out why an application is not working if you don't know what is going on inside it. Sometimes you can't even tell whether the system is working as designed at all.

When applications are running in production, they become black boxes that need to be traced and monitored. One of the simplest, yet most important ways to do so is by logging. Logging allows us -- at the time we develop our software -- to instruct the program to emit information while the system is running that will be useful for us and our sysadmins.

The same way we document code for future developers, we should direct new software to generate adequate logs for developers and sysadmins. Logs are a critical part of the system documentation about an application's runtime status: when instrumenting your software with logs, think of it as writing documentation for the developers and sysadmins who will maintain the system in the future.

Some purists argue that a disciplined developer who uses logging and testing should hardly need an interactive debugger. If we cannot reason about our application during development with verbose logging, it will be even harder to do it when our code is running in production.

This article looks at Python's logging module, its design, and ways to adapt it for more complex use cases. This is not intended as documentation for developers, rather as a guide to show how the Python logging module is built and to encourage the curious to delve deeper.

Why use the logging module?

A developer might argue, why aren't simple print statements sufficient? The logging module offers multiple benefits, including:

  • Multi-threading support
  • Categorization via different levels of logging
  • Flexibility and configurability
  • Separation of the how from the what

This last point, separating what we log from how we log it, enables collaboration between different parts of the software. As an example, it allows the developer of a framework or library to add logs and lets the sysadmin, or whoever is in charge of the runtime configuration, decide what should be logged at a later point.

What's in the logging module

The logging module beautifully separates the responsibility of each of its parts (following the Apache Log4j API's approach). Let's look at how a log line travels around the module's code and explore its different parts.

Logger

Loggers are the objects a developer usually interacts with. They are the main APIs that indicate what we want to log.

Given an instance of a logger , we can categorize our messages and ask for them to be emitted without worrying about how or where that will happen.

For example, when we write logger.info("Stock was sold at %s", price) , we have a simple mental model of what happens next.


We request a line and we assume some code is executed in the logger that makes that line appear in the console/file. But what actually is happening inside?

Log records

Log records are packages that the logging module uses to pass all required information around. They contain information about the function where the log was requested, the string that was passed, arguments, call stack information, etc.

These are the objects that are being logged. Every time we invoke our loggers, we are creating instances of these objects. But how do objects like these get serialized into a stream? Via handlers!

Handlers

Handlers emit the log records to an output destination. They take log records and process them according to what they were built for.

As an example, a FileHandler will take a log record and append it to a file.
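
For instance, here is a minimal sketch of wiring a FileHandler to a logger; the logger name "example" and the app.log file name are just illustrative:

import logging

logger = logging.getLogger("example")
logger.addHandler(logging.FileHandler("app.log"))  # append records to app.log
logger.warning("this line is appended to app.log")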

The standard logging module already comes with multiple built-in handlers like:

  • Multiple file handlers ( TimedRotatingFileHandler , RotatingFileHandler , WatchedFileHandler ) that write to files
  • StreamHandler , which can target a stream like stdout or stderr
  • SMTPHandler , which sends log records via email
  • SocketHandler , which sends LogRecords to a streaming socket
  • SysLogHandler , NTEventLogHandler , HTTPHandler , MemoryHandler , and others

We now have a model that's closer to reality: the logger hands each log record to one or more handlers, and each handler emits it to its own output.


But most handlers work with simple strings (SMTPHandler, FileHandler, etc.), so you may be wondering how those structured LogRecords are transformed into easy-to-serialize bytes...

Formatters

Let me present the Formatters. Formatters are in charge of serializing the metadata-rich LogRecord into a string. There is a default formatter if none is provided.

The generic formatter class provided by the logging library takes a template and style as input. Then placeholders can be declared for all the attributes in a LogRecord object.

As an example: '%(asctime)s %(levelname)s %(name)s: %(message)s' will generate logs like 2017-07-19 15:31:13,942 INFO parent.child: Hello EuroPython .

Note that the attribute message is the result of interpolating the log's original template with the arguments provided. (e.g., for logger.info("Hello %s", "Laszlo") , the message will be "Hello Laszlo").

All default attributes can be found in the logging documentation .
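
The following sketch shows how such a formatter might be attached to a handler; the logger name is arbitrary:

import logging

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(name)s: %(message)s'))

logger = logging.getLogger("parent.child")
logger.addHandler(handler)
logger.error("Hello EuroPython")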

OK, now that we know about formatters, our model has changed again: each handler uses a formatter to turn the LogRecord into a string before emitting it.

Filters

The last objects in our logging toolkit are filters.

Filters allow for finer-grained control of which logs should be emitted. Multiple filters can be attached to both loggers and handlers. For a log to be emitted, all filters should allow the record to pass.

Users can declare their own filters as objects using a filter method that takes a record as input and returns True / False as output.
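
As a rough sketch of what such a filter can look like (the PaymentOnlyFilter class and the "payment" keyword are made up for illustration):

import logging

class PaymentOnlyFilter(logging.Filter):
    def filter(self, record):
        # let a record through only if its final message mentions "payment"
        return "payment" in record.getMessage()

logger = logging.getLogger("example")
logger.addHandler(logging.StreamHandler())
logger.addFilter(PaymentOnlyFilter())
logger.warning("payment received")   # emitted
logger.warning("cache refreshed")    # rejected by the filter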

With this in mind, the logging workflow now includes filters attached to both loggers and handlers, either of which can stop a record from being emitted.

The logger hierarchy

At this point, you might be impressed by the amount of complexity and configuration that the module is hiding so nicely for you, but there is even more to consider: the logger hierarchy.

We can create a logger via logging.getLogger(<logger_name>) . The string passed as an argument to getLogger can define a hierarchy by separating the elements using dots.

As an example, logging.getLogger("parent.child") will create a logger "child" with a parent logger named "parent." Loggers are global objects managed by the logging module, so they can be retrieved conveniently anywhere during our project.

Logger instances are also known as channels. The hierarchy allows the developer to define the channels and their hierarchy.

After the log record has been passed to all the handlers within the logger, the parents' handlers will be called recursively until we reach the top logger (defined as an empty string) or a logger that has configured propagate = False .


Note that the parent logger is not called, only its handlers. This means that filters and other code in the logger class won't be executed on the parents. This is a common pitfall when adding filters to loggers.
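
A small sketch of the propagation behavior described above (the logger names are illustrative):

import logging

parent = logging.getLogger("parent")
parent.setLevel(logging.INFO)
parent.addHandler(logging.StreamHandler())

child = logging.getLogger("parent.child")
child.info("delivered to the parent's StreamHandler via propagation")

child.propagate = False
child.info("not delivered anywhere: no handlers of its own, and propagation is off")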

Recapping the workflow

We've examined the split in responsibility and how we can fine tune log filtering. But there are two other attributes we haven't mentioned yet:

  1. Loggers can be disabled, thereby not allowing any record to be emitted from them.
  2. An effective level can be configured in both loggers and handlers.

As an example, when a logger has configured a level of INFO , only INFO levels and above will be passed. The same rule applies to handlers.
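
A quick sketch of both levels in action (the names and levels are illustrative):

import logging

logger = logging.getLogger("example")
logger.setLevel(logging.INFO)            # the logger drops anything below INFO

handler = logging.StreamHandler()
handler.setLevel(logging.WARNING)        # the handler additionally drops INFO
logger.addHandler(handler)

logger.debug("dropped by the logger")
logger.info("passes the logger, dropped by the handler")
logger.warning("emitted")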

With all this in mind, the complete flow is captured in the final flow diagram in the logging documentation .

How to use logging

Now that we've looked at the logging module's parts and design, it's time to examine how a developer interacts with it. Here is a code example:

import logging

def sample_function(secret_parameter):
    logger = logging.getLogger(__name__)  # __name__ = projectA.moduleB
    logger.debug("Going to perform magic with '%s'", secret_parameter)
    ...
    try:
        result = do_magic(secret_parameter)
    except IndexError:
        logger.exception("OMG it happened again, someone please tell Laszlo")
    except:
        logger.info("Unexpected exception", exc_info=True)
        raise
    else:
        logger.info("Magic with '%s' resulted in '%s'", secret_parameter, result, stack_info=True)

This creates a logger using the module __name__ . It will create channels and hierarchies based on the project structure, as Python modules are concatenated with dots.

The logger variable references the "projectA.moduleB" logger, which has "projectA" as its parent, which in turn has the root logger as its parent.

On line 5, we see how to perform calls to emit logs. We can use one of the methods debug , info , error , or critical to log using the appropriate level.

When logging a message, in addition to the template arguments, we can pass keyword arguments with specific meaning. The most interesting are exc_info and stack_info . These will add information about the current exception and the stack frame, respectively. For convenience, a method exception is available in the logger objects, which is the same as calling error with exc_info=True .

These are the basics of how to use the logging module. ʘ‿ʘ But it is also worth mentioning some uses that are usually considered bad practices.

Greedy string formatting

Using logger.info("string template {}".format(argument)) should be avoided whenever possible in favor of logger.info("string template %s", argument) . This is a better practice, as the actual string interpolation happens only if the record will be emitted. Not doing so wastes cycles when the effective level is above INFO, because the interpolation still occurs for messages that are then discarded.
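
A small sketch of the difference; the list value only stands in for something expensive to format:

import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.WARNING)

value = list(range(1000))

# Eager: the full string (including str(value)) is built even though the
# DEBUG record is discarded.
logger.debug("value is {}".format(value))

# Lazy: the template is interpolated only if the record is actually emitted.
logger.debug("value is %s", value)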

Capturing and formatting exceptions

Quite often, we want to log information about the exception in a catch block, and it might feel intuitive to use:

try:
    ...
except Exception as error:
    logger.info("Something bad happened: %s", error)

But that code can give us log lines like Something bad happened: "secret_key." This is not that useful. If we use exc_info as illustrated previously, it will produce the following:

try:
    ...
except Exception:
    logger.info("Something bad happened", exc_info=True)

Something bad happened
Traceback (most recent call last):
  File "sample_project.py", line 10, in code
    inner_code()
  File "sample_project.py", line 6, in inner_code
    x = data["secret_key"]
KeyError: 'secret_key'

This not only contains the exact source of the exception, but also the type.

Configuring our loggers

Instrumenting our software is the easy part; we also need to configure the logging stack and specify how those records will be emitted.

There are multiple ways to configure the logging stack.

BasicConfig

This is by far the simplest way to configure logging. Just doing logging.basicConfig(level="INFO") sets up a basic StreamHandler that will log everything on the INFO and above levels to the console. There are arguments to customize this basic configuration. Some of them are:

  • filename : Specifies that a FileHandler should be created, using the specified filename, rather than a StreamHandler (example: /var/logs/logs.txt )
  • format : Use the specified format string for the handler (example: '%(asctime)s %(message)s' )
  • datefmt : Use the specified date/time format (example: "%H:%M:%S" )
  • level : Set the root logger level to the specified level (example: "INFO" )

This is a simple and practical way to configure small scripts.
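
For example, a small script might configure itself with something like the following minimal sketch, using the arguments above:

import logging

logging.basicConfig(
    level="INFO",
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    datefmt="%H:%M:%S",
)

logging.getLogger(__name__).info("the root logger now has a StreamHandler")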

Note, basicConfig only works the first time it is called in a runtime. If you have already configured your root logger, calling basicConfig will have no effect.

DictConfig

The configuration for all elements and how to connect them can be specified as a dictionary. This dictionary should have different sections for loggers, handlers, formatters, and some basic global parameters.

Here's an example:

config = {
    'disable_existing_loggers': False,
    'version': 1,
    'formatters': {
        'short': {
            'format': '%(asctime)s %(levelname)s %(name)s: %(message)s'
        },
    },
    'handlers': {
        'console': {
            'level': 'INFO',
            'formatter': 'short',
            'class': 'logging.StreamHandler',
        },
    },
    'loggers': {
        '': {
            'handlers': ['console'],
            'level': 'ERROR',
        },
        'plugins': {
            'handlers': ['console'],
            'level': 'INFO',
            'propagate': False
        }
    },
}

import logging.config
logging.config.dictConfig(config)

When invoked, dictConfig disables all existing loggers unless disable_existing_loggers is set to False . Setting it to False is usually what you want, as many modules declare a global logger that is instantiated at import time, before dictConfig is called, and you don't want those silently disabled.

The schema that dictConfig accepts is described in the logging documentation. Often, this configuration is stored in a YAML file and loaded from there. Many developers prefer this over fileConfig , as it offers better support for customization.
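
As a minimal sketch of that YAML-based approach, assuming PyYAML is installed and a dictConfig-compatible file exists (the logging.yaml name is arbitrary):

import logging.config

import yaml

with open("logging.yaml") as config_file:
    # the file must contain a mapping that follows the dictConfig schema
    logging.config.dictConfig(yaml.safe_load(config_file))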

Extending logging

Thanks to the way it is designed, it is easy to extend the logging module. Let's see some examples:

Logging JSON

If we want, we can log JSON by creating a custom formatter that transforms the log records into a JSON-encoded string:

import logging
import logging.config
import json

ATTR_TO_JSON = ['created', 'filename', 'funcName', 'levelname', 'lineno', 'module',
                'msecs', 'msg', 'name', 'pathname', 'process', 'processName',
                'relativeCreated', 'thread', 'threadName']

class JsonFormatter:
    def format(self, record):
        obj = {attr: getattr(record, attr)
               for attr in ATTR_TO_JSON}
        return json.dumps(obj, indent=4)

handler = logging.StreamHandler()
handler.formatter = JsonFormatter()
logger = logging.getLogger(__name__)
logger.addHandler(handler)
logger.error("Hello")

Adding further context

On the formatters, we can specify any log record attribute.

We can inject attributes in multiple ways. In this example, we abuse filters to enrich the records.

import logging
import logging.config

GLOBAL_STUFF = 1

class ContextFilter(logging.Filter):
    def filter(self, record):
        global GLOBAL_STUFF
        GLOBAL_STUFF += 1
        record.global_data = GLOBAL_STUFF
        return True

handler = logging.StreamHandler()
handler.formatter = logging.Formatter("%(global_data)s %(message)s")
handler.addFilter(ContextFilter())
logger = logging.getLogger(__name__)
logger.addHandler(handler)

logger.error("Hi1")
logger.error("Hi2")

This effectively adds an attribute to all the records that go through that logger. The formatter will then include it in the log line.

Note that this impacts all log records in your application, including libraries or other frameworks that you might be using and for which you are emitting logs. It can be used to log things like a unique request ID on all log lines to track requests or to add extra contextual information.

Starting with Python 3.2, you can use setLogRecordFactory to capture all log record creation and inject extra information. The extra keyword argument and the LoggerAdapter class may also be of interest.
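
A minimal sketch of the record-factory approach; the request_id attribute is a made-up example of injected context:

import logging

old_factory = logging.getLogRecordFactory()

def record_factory(*args, **kwargs):
    record = old_factory(*args, **kwargs)
    record.request_id = "req-42"   # hypothetical extra context added to every record
    return record

logging.setLogRecordFactory(record_factory)

logging.basicConfig(format="%(request_id)s %(message)s")
logging.error("every record now carries request_id")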

Buffering logs

Sometimes we would like to have access to debug logs when an error happens. This is feasible by creating a buffered handler that will log the last debug messages after an error occurs. See the following code as a non-curated example:

import logging
import logging.handlers

class SmartBufferHandler(logging.handlers.MemoryHandler):
    def __init__(self, num_buffered, *args, **kwargs):
        kwargs["capacity"] = num_buffered + 2  # +2: one for current, one for prepop
        super().__init__(*args, **kwargs)

    def emit(self, record):
        if len(self.buffer) == self.capacity - 1:
            self.buffer.pop(0)
        super().emit(record)

handler = SmartBufferHandler(num_buffered=2, target=logging.StreamHandler(), flushLevel=logging.ERROR)
logger = logging.getLogger(__name__)
logger.setLevel("DEBUG")
logger.addHandler(handler)

logger.error("Hello1")
logger.debug("Hello2")  # This line won't be logged
logger.debug("Hello3")
logger.debug("Hello4")
logger.error("Hello5")  # The error flushes the buffer, so the two last debugs are logged as well

For more information

This introduction to the logging library's flexibility and configurability aims to demonstrate the beauty of how its design splits concerns. It also offers a solid foundation for anyone interested in a deeper dive into the logging documentation and the how-to guide . Although this article isn't a comprehensive guide to Python logging, here are answers to a few frequently asked questions.

My library emits a "no logger configured" warning

Check how to configure logging in a library from "The Hitchhiker's Guide to Python."

What happens if a logger has no level configured?

The effective level of the logger will then be defined recursively by its parents.

All my logs are in local time. How do I log in UTC?

Formatters are the answer! You need to set the converter attribute of your formatter to generate UTC times. Use converter = time.gmtime .
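
A minimal sketch of that converter override:

import logging
import time

formatter = logging.Formatter("%(asctime)s %(message)s")
formatter.converter = time.gmtime   # asctime will now be rendered in UTC

handler = logging.StreamHandler()
handler.setFormatter(formatter)

logger = logging.getLogger(__name__)
logger.addHandler(handler)
logger.warning("timestamps on this handler are in UTC")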

[Nov 08, 2019] The Python community is special

Oct 29, 2019 | opensource.com

Moshe Zadka (Community Moderator)

The Python community is amazing. It was one of the first to adopt a code of conduct, first for the Python Software Foundation and then for PyCon . There is a real commitment to diversity and inclusion: blog posts and conference talks on this theme are frequent, thoughtful, and well-read by Python community members.

While the community is global, there is a lot of great activity in the local community as well. Local Python meet-ups are a great place to meet wonderful people who are smart, experienced, and eager to help. A lot of meet-ups will explicitly have time set aside for experienced people to help newcomers who want to learn a new concept or to get past an issue with their code. My local community took the time to support me as I began my Python journey, and I am privileged to continue to give back to new developers.

Whether you can attend a local community meet-up or you spend time with the online Python community across IRC, Slack, and Twitter, I am sure you will meet lovely people who want to help you succeed as a developer.

[Nov 08, 2019] Selected Python-related articles from opensource.com

There are lots of ways to save time as a sysadmin. Here's one way to build a web app using open source tools to automate a chunk of your daily routine away
Nov 08, 2019 | opensource.com
  1. A dozen ways to learn Python These resources will get you started and well on your way to proficiency with Python. Don Watkins (Community Moderator) 27 Aug 2019 51
  2. Building a non-breaking breakpoint for Python debugging Have you ever wondered how to speed up a debugger? Here are some lessons learned while building one for Python. Liran Haimovitch 13 Aug 2019 39
  3. Sending custom emails with Python Customize your group emails with Mailmerge, a command-line program that can handle simple and complex emails. Brian "bex" Exelbierd (Red Hat) 08 Aug 2019 42
  4. Save and load Python data with JSON The JSON format saves you from creating your own data formats, and is particularly easy to learn if you already know Python. Here's how to use it with Python. Seth Kenlon (Red Hat) 16 Jul 2019 36
  5. Parse arguments with Python Parse Python like a pro with the argparse module. Seth Kenlon (Red Hat) 03 Jul 2019 59
  6. Learn object-oriented programming with Python Make your code more modular with Python classes. Seth Kenlon (Red Hat) 05 Jul 2019 62
  7. Get modular with Python functions Minimize your coding workload by using Python functions for repeating tasks. Seth Kenlon (Red Hat) 01 Jul 2019 62
  8. Learn Python with these awesome resources Expand your Python knowledge by adding these resources to your personal learning network. Don Watkins (Community Moderator) 31 May 2019 71
  9. Run your blog on GitHub Pages with Python Create a blog with Pelican, a Python-based blogging platform that works well with GitHub. Erik O'Shaughnessy 23 May 2019 76
  10. 100 ways to learn Python and R for data science Want to learn data science? Find recommended courses in the Data Science Repo, a community-sourced directory of Python and R learning resources. Chris Engelhardt Dorris Scott Annu Singh 15 May 2019 59
  11. How to analyze log data with Python and Apache Spark Case study with NASA logs to show how Spark can be leveraged for analyzing data at scale. Dipanjan (DJ) Sarkar (Red Hat) 14 May 2019 76
  12. Learn Python by teaching in your community A free and fun way to learn by teaching others. Don Watkins (Community Moderator) 13 May 2019 47
  13. Format Python however you like with Black Learn more about solving common Python problems in our series covering seven PyPI libraries. Moshe Zadka (Community Moderator) 02 May 2019 80
  14. Write faster C extensions for Python with Cython Learn more about solving common Python problems in our series covering seven PyPI libraries. Moshe Zadka (Community Moderator) 01 May 2019 64
  15. Managing Python packages the right way Don't fall victim to the perils of Python package management. László Kiss Kollár 04 Apr 2019 75
  16. 7 steps for hunting down Python code bugs Learn some tricks to minimize the time you spend tracking down the reasons your code fails. Maria Mckinley 08 Feb 2019 108
  17. Getting started with Pelican: A Python-based static site generator Pelican is a great choice for Python users who want to self-host a simple website or blog. Craig Sebenik 07 Jan 2019 119
  18. Python 3.7 beginner's cheat sheet Get acquainted with Python's built-in pieces. Nicholas Hunt-Walker 20 Sep 2018 177
  19. 18 Python programming books for beginners and veterans Get started with this popular language or buff up on your coding skills with this curated book list. Jen Wike Huger (Red Hat) 10 Sep 2018 157
  20. Why I love Xonsh Ever wondered if Python can be your shell? Moshe Zadka (Community Moderator) 04 Sep 2018 123

[Nov 08, 2019] Pylint: Making your Python code consistent

Oct 21, 2019 | opensource.com

Pylint is your friend when you want to avoid arguing about code complexity.

Moshe Zadka (Community Moderator)


flake8 and black will take care of "local" style: where the newlines occur, how comments are formatted, and issues like commented-out code or bad practices in log formatting.

Pylint is extremely aggressive by default. It will offer strong opinions on everything from checking whether declared interfaces are actually implemented to opportunities to refactor duplicate code, which can be a lot for a new user. One way of introducing it gently to a project, or a team, is to start by turning all checkers off and then enabling checkers one by one. This is especially useful if you already use flake8, black, and mypy : Pylint has quite a few checkers that overlap in functionality.


However, one of the things unique to Pylint is the ability to enforce higher-level issues: for example, number of lines in a function, or number of methods in a class.

These numbers might be different from project to project and can depend on the development team's preferences. However, once the team comes to an agreement about the parameters, it is useful to enforce those parameters using an automated tool. This is where Pylint shines.

Configuring Pylint

In order to start with an empty configuration, start your .pylintrc with

[MESSAGES CONTROL]

disable=all

This disables all Pylint messages. Since many of them are redundant, this makes sense. In Pylint, a message is a specific kind of warning.

You can check that all messages have been turned off by running pylint :

$ pylint <my package>

In general, it is not a great idea to add parameters to the pylint command-line: the best place to configure your pylint is the .pylintrc . In order to have it do something useful, we need to enable some messages.

In order to enable messages, add them to your .pylintrc under the [MESSAGES CONTROL] section:

enable = <message>,

...

listing the messages (what Pylint calls its different kinds of warnings) that look useful. Some of my favorites include too-many-lines , too-many-arguments , and too-many-branches . All of those limit the complexity of modules or functions, and serve as an objective check, without a human nitpicker needed, for code complexity measurement.

A checker is a source of messages: every message belongs to exactly one checker. Many of the most useful messages are under the design checker. The default numbers are usually good, but tweaking the maximums is straightforward: we can add a section called DESIGN in the .pylintrc .

[ DESIGN ]

max-args=7

max-locals=15

Another good source of useful messages is the refactoring checker. Some of my favorite messages to enable there are consider-using-dict-comprehension , stop-iteration-return (which looks for generators that use raise StopIteration when return is the correct way to stop the iteration), and chained-comparison , which will suggest using syntax like 1 <= x < 5 rather than the less obvious 1 <= x and x < 5 .
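
A small sketch of the kind of code chained-comparison flags, and the form it suggests:

def in_range(x):
    # Pylint's chained-comparison message flags this form...
    if 1 <= x and x < 5:
        return True
    return False

def in_range_chained(x):
    # ...and suggests the chained version instead.
    return 1 <= x < 5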

Finally, an expensive checker, in terms of performance, but highly useful, is similarities . It is designed to enforce "Don't Repeat Yourself" (the DRY principle) by explicitly looking for copy-paste between different parts of the code. It only has one message to enable: duplicate-code . The default "minimum similarity lines" is set to 4 . It is possible to set it to a different value using the .pylintrc .

[ SIMILARITIES ]

min-similarity-lines=3

Pylint makes code reviews easy

If you are sick of code reviews where you point out that a class is too complicated, or that two different functions are basically the same, add Pylint to your Continuous Integration configuration, and only have the arguments about complexity guidelines for your project once .

[Nov 08, 2019] Manage NTP with Chrony by David Both

Dec 03, 2018 | opensource.com

Chronyd is a better choice for most networks than ntpd for keeping computers synchronized with the Network Time Protocol.

"Does anybody really know what time it is? Does anybody really care?"
Chicago , 1969

Perhaps that rock group didn't care what time it was, but our computers do need to know the exact time. Timekeeping is very important to computer networks. In banking, stock markets, and other financial businesses, transactions must be maintained in the proper order, and exact time sequences are critical for that. For sysadmins and DevOps professionals, it's easier to follow the trail of email through a series of servers or to determine the exact sequence of events using log files on geographically dispersed hosts when exact times are kept on the computers in question.

I used to work at an organization that received over 20 million emails per day and had four servers just to accept and do a basic filter on the incoming flood of email. From there, emails were sent to one of four other servers to perform more complex anti-spam assessments, then they were delivered to one of several additional servers where the emails were placed in the correct inboxes. At each layer, the emails would be sent to one of the next-level servers, selected only by the randomness of round-robin DNS. Sometimes we had to trace a new message through the system until we could determine where it "got lost," according to the pointy-haired bosses. We had to do this with frightening regularity.

Most of that email turned out to be spam. Some people actually complained that their [joke, cat pic, recipe, inspirational saying, or other-strange-email]-of-the-day was missing and asked us to find it. We did reject those opportunities.

Our email and other transactional searches were aided by log entries with timestamps that -- today -- can resolve down to the nanosecond in even the slowest of modern Linux computers. In very high-volume transaction environments, even a few microseconds of difference in the system clocks can mean sorting thousands of transactions to find the correct one(s).

The NTP server hierarchy

Computers worldwide use the Network Time Protocol (NTP) to synchronize their times with internet standard reference clocks via a hierarchy of NTP servers. The primary servers are at stratum 1, and they are connected directly to various national time services at stratum 0 via satellite, radio, or even modems over phone lines. The time service at stratum 0 may be an atomic clock, a radio receiver tuned to the signals broadcast by an atomic clock, or a GPS receiver using the highly accurate clock signals broadcast by GPS satellites.

To prevent time requests from time servers lower in the hierarchy (i.e., with a higher stratum number) from overwhelming the primary reference servers, there are several thousand public NTP stratum 2 servers that are open and available for anyone to use. Many organizations with large numbers of hosts that need an NTP server will set up their own time servers so that only one local host accesses the stratum 2 time servers, then they configure the remaining network hosts to use the local time server which, in my case, is a stratum 3 server.

NTP choices

The original NTP daemon, ntpd , has been joined by a newer one, chronyd . Both keep the local host's time synchronized with the time server. Both services are available, and I have seen nothing to indicate that this will change anytime soon.

Chrony has features that make it the better choice for most environments for the following reasons:

  • Chrony can synchronize to the time server much faster than NTP. This is good for laptops or desktops that don't run constantly.
  • It can compensate for fluctuating clock frequencies, such as when a host hibernates or enters sleep mode, or when the clock speed varies due to frequency stepping that slows clock speeds when loads are low.
  • It handles intermittent network connections and bandwidth saturation.
  • It adjusts for network delays and latency.
  • After the initial time sync, Chrony never steps the clock. This ensures stable and consistent time intervals for system services and applications.
  • Chrony can work even without a network connection. In this case, the local host or server can be updated manually.

The NTP and Chrony RPM packages are available from standard Fedora repositories. You can install both and switch between them, but modern Fedora, CentOS, and RHEL releases have moved from NTP to Chrony as their default time-keeping implementation. I have found that Chrony works well, provides a better interface for the sysadmin, presents much more information, and increases control.

Just to make it clear, NTP is a protocol that is implemented with either NTP or Chrony. If you'd like to know more, read this comparison between NTP and Chrony as implementations of the NTP protocol.

This article explains how to configure Chrony clients and servers on a Fedora host, but the configuration for CentOS and RHEL current releases works the same.

Chrony structure

The Chrony daemon, chronyd , runs in the background and monitors the time and status of the time server specified in the chrony.conf file. If the local time needs to be adjusted, chronyd does it smoothly without the programmatic trauma that would occur if the clock were instantly reset to a new time.

Chrony's chronyc tool allows someone to monitor the current status of Chrony and make changes if necessary. The chronyc utility can be used as a command that accepts subcommands, or it can be used as an interactive text-mode program. This article will explain both uses.

Client configuration

The NTP client configuration is simple and requires little or no intervention. The NTP server can be defined during the Linux installation or provided by the DHCP server at boot time. The default /etc/chrony.conf file (shown below in its entirety) requires no intervention to work properly as a client. For Fedora, Chrony uses the Fedora NTP pool, and CentOS and RHEL have their own NTP server pools. Like many Red Hat-based distributions, the configuration file is well commented.

# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
pool 2.fedora.pool.ntp.org iburst

# Record the rate at which the system clock gains/losses time.
driftfile /var/lib/chrony/drift

# Allow the system clock to be stepped in the first three updates
# if its offset is larger than 1 second.
makestep 1.0 3

# Enable kernel synchronization of the real-time clock (RTC).

# Enable hardware timestamping on all interfaces that support it.
#hwtimestamp *

# Increase the minimum number of selectable sources required to adjust
# the system clock.
#minsources 2

# Allow NTP client access from local network.
#allow 192.168.0.0/16

# Serve time even if not synchronized to a time source.
#local stratum 10

# Specify file containing keys for NTP authentication.
keyfile /etc/chrony.keys

# Get TAI-UTC offset and leap seconds from the system tz database.
leapsectz right/UTC

# Specify directory for log files.
logdir /var/log/chrony

# Select which information is logged.
#log measurements statistics tracking

Let's look at the current status of NTP on a virtual machine I use for testing. The chronyc command, when used with the tracking subcommand, provides statistics that report how far off the local system is from the reference server.

[root@studentvm1 ~]# chronyc tracking
Reference ID : 23ABED4D (ec2-35-171-237-77.compute-1.amazonaws.com)
Stratum : 3
Ref time (UTC) : Fri Nov 16 16:21:30 2018
System time : 0.000645622 seconds slow of NTP time
Last offset : -0.000308577 seconds
RMS offset : 0.000786140 seconds
Frequency : 0.147 ppm slow
Residual freq : -0.073 ppm
Skew : 0.062 ppm
Root delay : 0.041452706 seconds
Root dispersion : 0.022665167 seconds
Update interval : 1044.2 seconds
Leap status : Normal
[root@studentvm1 ~]#

The Reference ID in the first line of the result is the server the host is synchronized to -- in this case, a stratum 3 reference server that was last contacted by the host at 16:21:30 2018. The other lines are described in the chronyc(1) man page .

The sources subcommand is also useful because it provides information about the time source configured in chrony.conf .

[root@studentvm1 ~]# chronyc sources
210 Number of sources = 5
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^+ 192.168.0.51 3 6 377 0 -2613us[-2613us] +/- 63ms
^+ dev.smatwebdesign.com 3 10 377 28m -2961us[-3534us] +/- 113ms
^+ propjet.latt.net 2 10 377 465 -1097us[-1085us] +/- 77ms
^* ec2-35-171-237-77.comput> 2 10 377 83 +2388us[+2395us] +/- 95ms
^+ PBX.cytranet.net 3 10 377 507 -1602us[-1589us] +/- 96ms
[root@studentvm1 ~]#

The first source in the list is the time server I set up for my personal network. The others were provided by the pool. Even though my NTP server doesn't appear in the Chrony configuration file above, my DHCP server provides its IP address for the NTP server. The "S" column -- Source State -- indicates with an asterisk ( * ) the server our host is synced to. This is consistent with the data from the tracking subcommand.

The -v option provides a nice description of the fields in this output.

[root@studentvm1 ~]# chronyc sources -v
210 Number of sources = 5

.-- Source mode '^' = server, '=' = peer, '#' = local clock.
/ .- Source state '*' = current synced, '+' = combined , '-' = not combined,
| / '?' = unreachable, 'x' = time may be in error, '~' = time too variable.
|| .- xxxx [ yyyy ] +/- zzzz
|| Reachability register (octal) -. | xxxx = adjusted offset,
|| Log2(Polling interval) --. | | yyyy = measured offset,
|| \ | | zzzz = estimated error.
|| | | \
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^+ 192.168.0.51 3 7 377 28 -2156us[-2156us] +/- 63ms
^+ triton.ellipse.net 2 10 377 24 +5716us[+5716us] +/- 62ms
^+ lithium.constant.com 2 10 377 351 -820us[ -820us] +/- 64ms
^* t2.time.bf1.yahoo.com 2 10 377 453 -992us[ -965us] +/- 46ms
^- ntp.idealab.com 2 10 377 799 +3653us[+3674us] +/- 87ms
[root@studentvm1 ~]#

If I wanted my server to be the preferred reference time source for this host, I would add the line below to the /etc/chrony.conf file.

server 192.168.0.51 iburst prefer

I usually place this line just above the first pool server statement near the top of the file. There is no special reason for this, except I like to keep the server statements together. It would work just as well at the bottom of the file, and I have done that on several hosts. This configuration file is not sequence-sensitive.

The prefer option marks this as the preferred reference source. As such, this host will always be synchronized with this reference source (as long as it is available). We can also use the fully qualified hostname for a remote reference server or the hostname only (without the domain name) for a local reference time source as long as the search statement is set in the /etc/resolv.conf file. I prefer the IP address to ensure that the time source is accessible even if DNS is not working. In most environments, the server name is probably the better option, because NTP will continue to work even if the server's IP address changes.

If you don't have a specific reference source you want to synchronize to, it is fine to use the defaults.

Configuring an NTP server with Chrony

The nice thing about the Chrony configuration file is that this single file configures the host as both a client and a server. To add a server function to our host -- it will always be a client, obtaining its time from a reference server -- we just need to make a couple of changes to the Chrony configuration, then configure the host's firewall to accept NTP requests.

Open the /etc/chrony.conf file in your favorite text editor and uncomment the local stratum 10 line. This enables the Chrony NTP server to continue to act as if it were connected to a remote reference server if the internet connection fails; this enables the host to continue to be an NTP server to other hosts on the local network.

Let's restart chronyd and track how the service is working for a few minutes. Before we enable our host as an NTP server, we want to test a bit.

[root@studentvm1 ~]# systemctl restart chronyd ; watch chronyc tracking

The results should look like this. The watch command runs the chronyc tracking command every two seconds so we can watch changes occur over time.

Every 2.0s: chronyc tracking studentvm1: Fri Nov 16 20:59:31 2018

Reference ID : C0A80033 (192.168.0.51)
Stratum : 4
Ref time (UTC) : Sat Nov 17 01:58:51 2018
System time : 0.001598277 seconds fast of NTP time
Last offset : +0.001791533 seconds
RMS offset : 0.001791533 seconds
Frequency : 0.546 ppm slow
Residual freq : -0.175 ppm
Skew : 0.168 ppm
Root delay : 0.094823152 seconds
Root dispersion : 0.021242738 seconds
Update interval : 65.0 seconds
Leap status : Normal

Notice that my NTP server, the studentvm1 host, synchronizes to the host at 192.168.0.51, which is my internal network NTP server, at stratum 4. Synchronizing directly to the Fedora pool machines would result in synchronization at stratum 3. Notice also that the amount of error decreases over time. Eventually, it should stabilize with a tiny variation around a fairly small range of error. The size of the error depends upon the stratum and other network factors. After a few minutes, use Ctrl+C to break out of the watch loop.

To turn our host into an NTP server, we need to allow it to listen on the local network. Uncomment the following line to allow hosts on the local network to access our NTP server.

# Allow NTP client access from local network.
allow 192.168.0.0/16

Note that the server can listen for requests on any local network it's attached to. The IP address in the "allow" line is just intended for illustrative purposes. Be sure to change the IP network and subnet mask in that line to match your local network's.

Restart chronyd .

[root@studentvm1 ~]# systemctl restart chronyd

To allow other hosts on your network to access this server, configure the firewall to allow inbound UDP packets on port 123. Check your firewall's documentation to find out how to do that.

Testing

Your host is now an NTP server. You can test it with another host or a VM that has access to the network on which the NTP server is listening. Configure the client to use the new NTP server as the preferred server in the /etc/chrony.conf file, then monitor that client using the chronyc tools we used above.

Chronyc as an interactive tool

As I mentioned earlier, chronyc can be used as an interactive command tool. Simply run the command without a subcommand and you get a chronyc command prompt.

[root@studentvm1 ~]# chronyc
chrony version 3.4
Copyright (C) 1997-2003, 2007, 2009-2018 Richard P. Curnow and others
chrony comes with ABSOLUTELY NO WARRANTY. This is free software, and
you are welcome to redistribute it under certain conditions. See the
GNU General Public License version 2 for details.

chronyc>

You can enter just the subcommands at this prompt. Try using the tracking , ntpdata , and sources commands. The chronyc command line allows command recall and editing for chronyc subcommands. You can use the help subcommand to get a list of possible commands and their syntax.

Conclusion

Chrony is a powerful tool for synchronizing the times of client hosts, whether they are all on the local network or scattered around the globe. It's easy to configure because, despite the large number of options available, only a few configurations are required for most circumstances.

After my client computers have synchronized with the NTP server, I like to set the system hardware clock from the system (OS) time by using the following command:

/sbin/hwclock --systohc

This command can be added as a cron job or a script in cron.daily to keep the hardware clock synced with the system time.

Chrony and NTP (the service) both use the same configuration, and the files' contents are interchangeable. The man pages for chronyd , chronyc , and chrony.conf contain an amazing amount of information that can help you get started or learn about esoteric configuration options.

Do you run your own NTP server? Let us know in the comments and be sure to tell us which implementation you are using, NTP or Chrony.

[Nov 08, 2019] Quiet log noise with Python and machine learning by Tristan de Cacqueray

Sep 28, 2018 | opensource.com

The Logreduce machine learning model is trained using previous successful job runs to extract anomalies from failed runs' logs.

This principle can also be applied to other use cases, for example, extracting anomalies from Journald or other systemwide regular log files.

Using machine learning to reduce noise

A typical log file contains many nominal events ("baselines") along with a few exceptions that are relevant to the developer. Baselines may contain random elements such as timestamps or unique identifiers that are difficult to detect and remove. To remove the baseline events, we can use a k -nearest neighbors pattern recognition algorithm ( k -NN).


Log events must be converted to numeric values for k -NN regression. Using the generic feature extraction tool HashingVectorizer enables the process to be applied to any type of log. It hashes each word and encodes each event in a sparse matrix. To further reduce the search space, tokenization removes known random words, such as dates or IP addresses.


Once the model is trained, the k -NN search tells us the distance of each new event from the baseline.
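
As a rough sketch of the underlying idea (not Logreduce's actual implementation), scikit-learn's HashingVectorizer and NearestNeighbors can be combined like this; the sample lines and the 0.2 threshold are purely illustrative:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neighbors import NearestNeighbors

baseline = ["job started", "fetching dependencies", "tests passed", "job finished"]
target = ["job started", "tests passed", "Traceback (most recent call last):"]

# hash each log line into a sparse vector, then index the baseline lines
vectorizer = HashingVectorizer(analyzer="word", n_features=2 ** 12)
model = NearestNeighbors(n_neighbors=1).fit(vectorizer.transform(baseline))

# lines far from their nearest baseline neighbor are reported as anomalies
distances, _ = model.kneighbors(vectorizer.transform(target))
for line, distance in zip(target, distances.ravel()):
    if distance > 0.2:          # threshold chosen arbitrarily for illustration
        print("anomaly (%.2f): %s" % (distance, line))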


This Jupyter notebook demonstrates the process and graphs the sparse matrix vectors.


Introducing Logreduce

The Logreduce Python software transparently implements this process. Logreduce's initial goal was to assist with Zuul CI job failure analyses using the build database, and it is now integrated into the Software Factory development forge's job logs process.

At its simplest, Logreduce compares files or directories and removes lines that are similar. Logreduce builds a model for each source file and outputs any of the target's lines whose distances are above a defined threshold by using the following syntax: distance | filename:line-number: line-content .

$ logreduce diff /var/log/audit/audit.log.1 /var/log/audit/audit.log
INFO  logreduce.Classifier - Training took 21.982s at 0.364MB/s (1.314kl/s) (8.000 MB - 28.884 kilo-lines)
0.244 | audit.log:19963: type=USER_AUTH acct="root" exe="/usr/bin/su" hostname=managesf.sftests.com
INFO  logreduce.Classifier - Testing took 18.297s at 0.306MB/s (1.094kl/s) (5.607 MB - 20.015 kilo-lines)
99.99% reduction (from 20015 lines to 1)

A more advanced Logreduce use can train a model offline to be reused. Many variants of the baselines can be used to fit the k -NN search tree.

$ logreduce dir-train audit.clf /var/log/audit/audit.log.*
INFO  logreduce.Classifier - Training took 80.883s at 0.396MB/s (1.397kl/s) (32.001 MB - 112.977 kilo-lines)
DEBUG logreduce.Classifier - audit.clf: written
$ logreduce dir-run audit.clf /var/log/audit/audit.log

Logreduce also implements interfaces to discover baselines for Journald time ranges (days/weeks/months) and Zuul CI job build histories. It can also generate HTML reports that group anomalies found in multiple files in a simple interface.

Managing baselines

The key to using k -NN regression for anomaly detection is to have a database of known good baselines, which the model uses to detect lines that deviate too far. This method relies on the baselines containing all nominal events, as anything that isn't found in the baseline will be reported as anomalous.

CI jobs are great targets for k -NN regression because the job outputs are often deterministic and previous runs can be automatically used as baselines. Logreduce features Zuul job roles that can be used as part of a failed job post task in order to issue a concise report (instead of the full job's logs). This principle can be applied to other cases, as long as baselines can be constructed in advance. For example, a nominal system's SoS report can be used to find issues in a defective deployment.

Anomaly classification service

The next version of Logreduce introduces a server mode to offload log processing to an external service where reports can be further analyzed. It also supports importing existing reports and requests to analyze a Zuul build. The services run analyses asynchronously and feature a web interface to adjust scores and remove false positives.


Reviewed reports can be archived as a standalone dataset with the target log files and the scores for anomalous lines recorded in a flat JSON file.

Project roadmap

Logreduce is already being used effectively, but there are many opportunities for improving the tool. Plans for the future include:

  • Curating many annotated anomalies found in log files and producing a public domain dataset to enable further research. Anomaly detection in log files is a challenging topic, and having a common dataset to test new models would help identify new solutions.
  • Reusing the annotated anomalies with the model to refine the distances reported. For example, when users mark lines as false positives by setting their distance to zero, the model could reduce the score of those lines in future reports.
  • Fingerprinting archived anomalies to detect when a new report contains an already known anomaly. Thus, instead of reporting the anomaly's content, the service could notify the user that the job hit a known issue. When the issue is fixed, the service could automatically restart the job.
  • Supporting more baseline discovery interfaces for targets such as SOS reports, Jenkins builds, Travis CI, and more.

If you are interested in getting involved in this project, please contact us on the #log-classify Freenode IRC channel. Feedback is always appreciated!


Tristan Cacqueray will present Reduce your log noise using machine learning at the OpenStack Summit , November 13-15 in Berlin.

[Nov 08, 2019] Getting started with Logstash by Jamie Riedesel

Nov 08, 2019 | opensource.com

No longer a simple log-processing pipeline, Logstash has evolved into a powerful and versatile data processing tool. Here are basics to get you started. 19 Oct 2017


Logstash , an open source tool released by Elastic , is designed to ingest and transform data. It was originally built to be a log-processing pipeline to ingest logging data into ElasticSearch . Several versions later, it can do much more.

At its core, Logstash is a form of Extract-Transform-Load (ETL) pipeline. Unstructured log data is extracted , filters transform it, and the results are loaded into some form of data store.

Logstash can take a line of text like this syslog example:

Sep 11 14:13:38 vorthys sshd[16998]: Received disconnect from 192.0.2.11 port 53730:11: disconnected by user

and transform it into a much richer datastructure:

{
  "timestamp": "1505157218000",
  "host": "vorthys",
  "program": "sshd",
  "pid": "16998",
  "message": "Received disconnect from 192.0.2.11 port 53730:11: disconnected by user",
  "sshd_action": "disconnect",
  "sshd_tuple": "192.0.2.11:513730"
}

Depending on what you are using for your backing store, you can find events like this by using indexed fields rather than grepping terabytes of text. If you're generating tens to hundreds of gigabytes of logs a day, that matters.

Internal architecture

Logstash has a three-stage pipeline implemented in JRuby:

The input stage plugins extract data. This can be from logfiles, a TCP or UDP listener, one of several protocol-specific plugins such as syslog or IRC, or even queuing systems such as Redis, AQMP, or Kafka. This stage tags incoming events with metadata surrounding where the events came from.

The filter stage plugins transform and enrich the data. This is the stage that produces the sshd_action and sshd_tuple fields in the example above. This is where you'll find most of Logstash's value.

The output stage plugins load the processed events into something else, such as ElasticSearch or another document-database, or a queuing system such as Redis, AQMP, or Kafka. It can also be configured to communicate with an API. It is also possible to hook up something like PagerDuty to your Logstash outputs.

Have a cron job that checks whether your backups completed successfully? It can emit an alarm into the logging stream. An input picks it up, a filter configured to catch those events marks it up, and a conditional output then knows the event is meant for it. This is how you can add alarms to scripts that would otherwise need their own notification layer, or that run on systems that aren't allowed to communicate with the outside world.

Threads

In general, each input runs in its own thread. The filter and output stages are more complicated. In Logstash 1.5 through 2.1, the filter stage had a configurable number of threads, with the output stage occupying a single thread. That changed in Logstash 2.2, when the filter-stage threads were built to handle the output stage. With one fewer internal queue to keep track of, throughput improved with Logstash 2.2.

If you're running an older version, it's worth upgrading to at least 2.2. When we moved from 1.5 to 2.2, we saw a 20-25% increase in overall throughput. Logstash also spent less time in wait states, so we used more of the CPU (47% vs 75%).

Configuring the pipeline

Logstash can take a single file or a directory for its configuration. If a directory is given, it reads the files in lexical order. This is important, as ordering is significant for filter plugins (we'll discuss that in more detail later).

Here is a bare Logstash config file:

input { }
filter { }
output { }

Each of these will contain zero or more plugin configurations, and there can be multiple blocks.

Input config

An input section can look like this:

input {
  syslog {
    port => 514
    type => "syslog_server"
  }
}

This tells Logstash to open the syslog { } plugin on port 514 and to set the document type of each event coming in through that plugin to syslog_server. This plugin follows RFC 3164 only, not the newer RFC 5424.

Here is a slightly more complex input block:

# Pull in syslog data
input {
  file {
    path => [
      "/var/log/syslog",
      "/var/log/auth.log"
    ]
    type => "syslog"
  }
}

# Pull in application log data; the applications emit it in JSON form.
input {
  file {
    path => [
      "/var/log/app/worker_info.log",
      "/var/log/app/broker_info.log",
      "/var/log/app/supervisor.log"
    ]
    exclude => "*.gz"
    type => "applog"
    codec => "json"
  }
}

This one uses two different input { } blocks to call two invocations of the file { } plugin: one tracks system-level logs, the other tracks application-level logs. Because there are two input { } blocks, Logstash spawns a Java thread for each one. On a multi-core system, different cores keep track of the configured files; if one thread blocks, the other continues to function.

Both of these file { } blocks could be put into the same input { } block; they would simply run in the same thread -- Logstash doesn't really care.

Filter config

The filter section is where you transform your data into something that's neater and easier to work with. Filters can get quite complex. Here are a few examples of filters that accomplish different goals:

filter {
  if [program] == "metrics_fetcher" {
    mutate {
      add_tag => [ 'metrics' ]
    }
  }
}

In this example, if the program field, populated by the syslog plugin in the example input at the top, reads metrics_fetcher , then it tags the event metrics . This tag could be used in a later filter plugin to further enrich the data.

filter {
  if "metrics" in [tags] {
    kv {
      source => "message"
      target => "metrics"
    }
  }
}

This one runs only if metrics is in the list of tags. It then uses the kv { } plugin to populate a new set of fields based on the key=value pairs in the message field. These new keys are placed as sub-fields of the metrics field, allowing the text pages_per_second=42 faults=0 to become metrics.pages_per_second = 42 and metrics.faults = 0 on the event.

Why wouldn't you just put this in the same conditional that set the tag value? Because there are multiple ways an event could get the metrics tag -- this way, the kv filter will handle them all.

Because the filters are ordered, it is important that the filter plugin that defines the metrics tag runs before the conditional that checks for it. Here are guidelines to ensure your filter sections are optimally ordered:

  1. Your early filters should apply as much metadata as possible.
  2. Using the metadata, perform detailed parsing of events.
  3. In your late filters, regularize your data to reduce problems downstream.
    • Ensure field data types get cast to a unified value. priority could be boolean, integer, or string.
      • Some systems, including ElasticSearch, will quietly convert types for you. Sending strings into a boolean field won't give you the results you want.
      • Other systems will reject a value outright if it isn't in the right data type.
    • The mutate { } plugin is helpful here, as it has methods to coerce fields into specific data types.
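
For example, a late filter that forces the priority field from the guideline above into a single type might look like this sketch:

filter {
  mutate {
    convert => { "priority" => "integer" }
  }
}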

Here are useful plugins to extract fields from long strings:

  • date : Many logging systems emit a timestamp. This plugin parses that timestamp and sets the timestamp of the event to be that embedded time. By default, the timestamp of the event is when it was ingested , which could be seconds, hours, or even days later.
  • kv : As previously demonstrated, it can turn strings like backup_state=failed progress=0.24 into fields you can perform operations on.
  • csv : When given a list of columns to expect, it can create fields on the event based on comma-separated values.
  • json : If a field is formatted in JSON, this will turn it into fields. Very powerful!
  • xml : Like the JSON plugin, this will turn a field containing XML data into new fields.
  • grok : This is your regex engine. If you need to translate strings like The accounting backup failed into something that will pass if [backup_status] == 'failed' , this will do it.
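
As a rough illustration of grok (the pattern and the derived field names here are assumptions for illustration, not the exact filter that produced the earlier sshd example):

filter {
  grok {
    match => {
      "message" => "Received disconnect from %{IP:sshd_client_ip} port %{NUMBER:sshd_client_port}"
    }
  }
}
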
Output config

Elastic would like you to send it all into ElasticSearch, but anything that can accept a JSON document, or the datastructure it represents, can be an output. Keep in mind that events can be sent to multiple outputs. Consider this example of metrics:

output {
  # Send to the local ElasticSearch port, and rotate the index daily.
  elasticsearch {
    hosts => [
      "localhost",
      "logelastic.prod.internal"
    ]
    template_name => "logstash"
    index => "logstash-{+YYYY.MM.dd}"
  }

  if "metrics" in [tags] {
    influxdb {
      host => "influx.prod.internal"
      db => "logstash"
      measurement => "appstats"
      # This next bit only works because it is already a hash.
      data_points => "%{metrics}"
      send_as_tags => [ 'environment', 'application' ]
    }
  }
}

Remember the metrics example above? This is how we can output it. The events tagged metrics will get sent to ElasticSearch in their full event form. In addition, the subfields under the metrics field on that event will be sent to influxdb , in the logstash database, under the appstats measurement. Along with the measurements, the values of the environment and application fields will be submitted as indexed tags.

There are a great many outputs. Here are some grouped by type:

  • API endpoints : Jira, PagerDuty, Rackspace, Redmine, Zabbix
  • Queues : Redis, Rabbit, Kafka, SQS
  • Messaging Platforms : IRC, XMPP, HipChat, email, IMAP
  • Document Databases : ElasticSearch, MongoDB, Solr
  • Metrics : OpenTSDB, InfluxDB, Nagios, Graphite, StatsD
  • Files and other static artifacts : File, CSV, S3

There are many more output plugins .

[Nov 08, 2019] 10 resources every sysadmin should know about Opensource.com

Nov 08, 2019 | opensource.com

Cheat

Having a hard time remembering a command? Normally you might resort to a man page, but some man pages have a hard time getting to the point. It's the reason Chris Allen Lane came up with the idea (and more importantly, the code) for a cheat command .

The cheat command displays cheatsheets for common tasks in your terminal. It's a man page without the preamble. It cuts to the chase and tells you exactly how to do whatever it is you're trying to do. And if it lacks a common example that you think ought to be included, you can submit an update.

$ cheat tar
# To extract an uncompressed archive:
tar -xvf '/path/to/foo.tar'

# To extract a .gz archive:
tar -xzvf '/path/to/foo.tgz'
[ ... ]

You can also treat cheat as a local cheatsheet system, which is great for all the in-house commands you and your team have invented over the years. You can easily add a local cheatsheet to your own home directory, and cheat will find and display it just as if it were a popular system command.
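
For example, something like this should work (the exact cheatsheet directory varies between cheat versions, so check the project's README; the ourctl command is an invented in-house example):

mkdir -p ~/.cheat
cat > ~/.cheat/deploy <<'EOF'
# Roll out the in-house app to staging
ourctl deploy --env staging --version 1.2.3
EOF
cheat deploy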

[Nov 08, 2019] 5 alerting and visualization tools for sysadmins

Nov 08, 2019 | opensource.com

Common types of alerts and visualizations

Alerts

Let's first cover what alerts are not . Alerts should not be sent if the human responder can't do anything about the problem. This includes alerts that are sent to multiple individuals with only a few who can respond, or situations where every anomaly in the system triggers an alert. This leads to alert fatigue and receivers ignoring all alerts within a specific medium until the system escalates to a medium that isn't already saturated.

For example, if an operator receives hundreds of emails a day from the alerting system, that operator will soon ignore all emails from the alerting system. The operator will respond to a real incident only when he or she is experiencing the problem, emailed by a customer, or called by the boss. In this case, alerts have lost their meaning and usefulness.

Alerts are not a constant stream of information or a status update. They are meant to convey a problem from which the system can't automatically recover, and they are sent only to the individual most likely to be able to recover the system. Everything that falls outside this definition isn't an alert and will only damage your employees and company culture.

Everyone has a different set of alert types, so I won't discuss things like priority levels (P1-P5) or models that use words like "Informational," "Warning," and "Critical." Instead, I'll describe the generic categories emergent in complex systems' incident response.

You might have noticed I mentioned an "Informational" alert type right after I wrote that alerts shouldn't be informational. Well, not everyone agrees, but I don't consider something an alert if it isn't sent to anyone. It is a data point that many systems refer to as an alert. It represents some event that should be known but not responded to. It is generally part of the visualization system of the alerting tool and not an event that triggers actual notifications. Mike Julian covers this and other aspects of alerting in his book Practical Monitoring . It's a must read for work in this area.

Non-informational alerts consist of types that can be responded to or require action. I group these into two categories: internal outage and external outage. (Most companies have more than two levels for prioritizing their response efforts.) Degraded system performance is considered an outage in this model, as the impact to each user is usually unknown.

Internal outages are a lower priority than external outages, but they still need to be responded to quickly. They often include internal systems that company employees use or components of applications that are visible only to company employees.

External outages consist of any system outage that would immediately impact a customer. These don't include a system outage that prevents releasing updates to the system. They do include customer-facing application failures, database outages, and networking partitions that hurt availability or consistency if either can impact a user. They also include outages of tools that may not have a direct impact on users, as the application continues to run but this transparent dependency impacts performance. This is common when the system uses some external service or data source that isn't necessary for full functionality but may cause delays as the application performs retries or handles errors from this external dependency.

Visualizations

There are many visualization types, and I won't cover them all here. It's a fascinating area of research. On the data analytics side of my career, learning and applying that knowledge is a constant challenge. We need to provide simple representations of complex system outputs for the widest dissemination of information. Google Charts and Tableau have a wide selection of visualization types. We'll cover the most common visualizations and some innovative solutions for quickly understanding systems.

Line chart

The line chart is probably the most common visualization. It does a pretty good job of producing an understanding of a system over time. A line chart in a metrics system would have a line for each unique metric or some aggregation of metrics. This can get confusing when there are a lot of metrics in the same dashboard (as shown below), but most systems can select specific metrics to view rather than having all of them visible. Also, anomalous behavior is easy to spot if it's significant enough to escape the noise of normal operations. Below we can see purple, yellow, and light blue lines that might indicate anomalous behavior.

[Image: line chart with many overlapping metrics (monitoring_guide_line_chart.png)]

Another feature of a line chart is that you can often stack them to show relationships. For example, you might want to look at requests on each server individually, but also in aggregate. This allows you to understand the overall system as well as each instance in the same graph.

[Image: stacked line charts showing per-server and aggregate requests (monitoring_guide_line_chart_aggregate.png)]

Heatmaps

Another common visualization is the heatmap. It is useful when looking at histograms. This type of visualization is similar to a bar chart but can show gradients within the bars representing the different percentiles of the overall metric. For example, suppose you're looking at request latencies and you want to quickly understand the overall trend as well as the distribution of all requests. A heatmap is great for this, and it can use color to disambiguate the quantity of each section with a quick glance.

The heatmap below shows the higher concentration around the centerline of the graph with an easy-to-understand visualization of the distribution vertically for each time bucket. We might want to review a couple of points in time where the distribution gets wide while the others are fairly tight like at 14:00. This distribution might be a negative performance indicator.

[Image: heatmap of request latency distribution over time (monitoring_guide_histogram.png)]

Gauges

The last common visualization I'll cover here is the gauge, which helps users understand a single metric quickly. Gauges can represent a single metric, like your speedometer represents your driving speed or your gas gauge represents the amount of gas in your car. Similar to the gas gauge, most monitoring gauges clearly indicate what is good and what isn't. Often (as is shown below), good is represented by green, getting worse by orange, and "everything is breaking" by red. The middle row below shows traditional gauges.

[Image: gauge and single-stat panels (monitoring_guide_gauges.png). Image source: Grafana.org (© Grafana Labs)]

This image shows more than just traditional gauges. The other gauges are single stat representations that are similar to the function of the classic gauge. They all use the same color scheme to quickly indicate system health with just a glance. Arguably, the bottom row is probably the best example of a gauge that allows you to glance at a dashboard and know that everything is healthy (or not). This type of visualization is usually what I put on a top-level dashboard. It offers a full, high-level understanding of system health in seconds.

Flame graphs

A less common visualization is the flame graph, introduced by Netflix's Brendan Gregg in 2011. It's not ideal for dashboarding or quickly observing high-level system concerns; it's normally seen when trying to understand a specific application problem. This visualization focuses on CPU and memory and the associated frames. The X-axis lists the frames alphabetically, and the Y-axis shows stack depth. Each rectangle is a stack frame and includes the function being called. The wider the rectangle, the more often that frame appears in the sampled stacks. This method is invaluable when trying to diagnose system performance at the application level, and I urge everyone to give it a try.

[Image: flame graph (monitoring_guide_flame_graph.png). Image source: Wikimedia.org (Creative Commons BY SA 3.0)]

Tool options

There are several commercial options for alerting, but since this is Opensource.com, I'll cover only systems that are being used at scale by real companies that you can use at no cost. Hopefully, you'll be able to contribute new and innovative features to make these systems even better.

Alerting tools

Bosun

If you've ever done anything with computers and gotten stuck, the help you received was probably thanks to a Stack Exchange system. Stack Exchange runs many different websites around a crowdsourced question-and-answer model. Stack Overflow is very popular with developers, and Super User is popular with operations. However, there are now hundreds of sites ranging from parenting to sci-fi and philosophy to bicycles.

Stack Exchange open-sourced its alert management system, Bosun , around the same time Prometheus and its AlertManager system were released. There were many similarities in the two systems, and that's a really good thing. Like Prometheus, Bosun is written in Golang. Bosun's scope is more extensive than Prometheus' as it can interact with systems beyond metrics aggregation. It can also ingest data from log and event aggregation systems. It supports Graphite, InfluxDB, OpenTSDB, and Elasticsearch.

Bosun's architecture consists of a single server binary, a metrics backend such as OpenTSDB, Redis, and scollector agents. The scollector agents automatically detect services on a host and report metrics for those processes and other system resources. This data is sent to a metrics backend. The Bosun server binary then queries the backends to determine whether any alerts need to be fired. Bosun can also be used by tools like Grafana to query the underlying backends through one common interface. Redis is used to store state and metadata for Bosun.

A really neat feature of Bosun is that it lets you test your alerts against historical data. This was something I missed in Prometheus several years ago, when I had data for an issue I wanted alerts on but no easy way to test it. To make sure my alerts were working, I had to create and insert dummy data. This system alleviates that very time-consuming process.

Bosun also has the usual features like showing simple graphs and creating alerts. It has a powerful expression language for writing alerting rules. However, it only has email and HTTP notification configurations, which means connecting to Slack and other tools requires a bit more customization ( which its documentation covers ). Similar to Prometheus, Bosun can use templates for these notifications, which means they can look as awesome as you want them to. You can use all your HTML and CSS skills to create the baddest email alert anyone has ever seen.

Cabot

Cabot was created by a company called Arachnys . You may not know who Arachnys is or what it does, but you have probably felt its impact: It built the leading cloud-based solution for fighting financial crimes. That sounds pretty cool, right? At a previous company, I was involved in similar functions around "know your customer" laws. Most companies would consider it a very bad thing to be linked to a terrorist group, for example, funneling money through their systems. These solutions also help defend against less-atrocious offenders like fraudsters who could also pose a risk to the institution.

So why did Arachnys create Cabot? Well, it is kind of a Christmas present to everyone, as it was a Christmas project built because its developers couldn't wrap their heads around Nagios . And really, who can blame them? Cabot was written with Django and Bootstrap, so it should be easy for most to contribute to the project. (Another interesting factoid: The name comes from the creator's dog.)

The Cabot architecture is similar to Bosun in that it doesn't collect any data. Instead, it accesses data through the APIs of the tools it is alerting for. Therefore, Cabot uses a pull (rather than a push) model for alerting. It reaches out into each system's API and retrieves the information it needs to make a decision based on a specific check. Cabot stores the alerting data in a Postgres database and also has a cache using Redis.

Cabot natively supports Graphite , but it also supports Jenkins , which is rare in this area. Arachnys uses Jenkins like a centralized cron, but I like this idea of treating build failures like outages. Obviously, a build failure isn't as critical as a production outage, but it could still alert the team and escalate if the failure isn't resolved. Who actually checks Jenkins every time an email comes in about a build failure? Yeah, me too!

Another interesting feature is that Cabot can integrate with Google Calendar for on-call rotations. Cabot calls this feature Rota, which is a British term for a roster or rotation. This makes a lot of sense, and I wish other systems would take this idea further. Cabot doesn't support anything more complex than primary and backup personnel, but there is certainly room for additional features. The docs say if you want something more advanced, you should look at a commercial option.

StatsAgg

StatsAgg ? How did that make the list? Well, it's not every day you come across a publishing company that has created an alerting platform. I think that deserves recognition. Of course, Pearson isn't just a publishing company anymore; it has several web presences and a joint venture with O'Reilly Media . However, I still think of it as the company that published my schoolbooks and tests.

StatsAgg isn't just an alerting platform; it's also a metrics aggregation platform. And it's kind of like a proxy for other systems. It supports Graphite, StatsD, InfluxDB, and OpenTSDB as inputs, but it can also forward those metrics to their respective platforms. This is an interesting concept, but potentially risky as loads increase on a central service. However, if the StatsAgg infrastructure is robust enough, it can still produce alerts even when a backend storage platform has an outage.

StatsAgg is written in Java and consists only of the main server and UI, which keeps complexity to a minimum. It can send alerts based on regular expression matching and is focused on alerting by service rather than host or instance. Its goal is to fill a void in the open source observability stack, and I think it does that quite well.

Visualization tools

Grafana

Almost everyone knows about Grafana , and many have used it. I have used it for years whenever I need a simple dashboard. The tool I used before was deprecated, and I was fairly distraught about that until Grafana made it okay. Grafana was gifted to us by Torkel Ödegaard. Like Cabot, Grafana was also created around Christmastime, and released in January 2014. It has come a long way in just a few years. It started life as a Kibana dashboarding system, and Torkel forked it into what became Grafana.

Grafana's sole focus is presenting monitoring data in a more usable and pleasing way. It can natively gather data from Graphite, Elasticsearch, OpenTSDB, Prometheus, and InfluxDB. There's an Enterprise version that uses plugins for more data sources, but there's no reason those other data source plugins couldn't be created as open source, as the Grafana plugin ecosystem already offers many other data sources.

What does Grafana do for me? It provides a central location for understanding my system. It is web-based, so anyone can access the information, although it can be restricted using different authentication methods. Grafana can provide knowledge at a glance using many different types of visualizations. However, it has started integrating alerting and other features that aren't traditionally combined with visualizations.

Now you can set alerts visually. That means you can look at a graph, maybe even one showing where an alert should have triggered due to some degradation of the system, click on the graph where you want the alert to trigger, and then tell Grafana where to send the alert. That's a pretty powerful addition that won't necessarily replace an alerting platform, but it can certainly help augment it by providing a different perspective on alerting criteria.

Grafana has also introduced more collaboration features. Users have been able to share dashboards for a long time, meaning you don't have to create your own dashboard for your Kubernetes cluster because there are several already available -- with some maintained by Kubernetes developers and others by Grafana developers.

The most significant addition around collaboration is annotations. Annotations allow a user to add context to part of a graph. Other users can then use this context to understand the system better. This is an invaluable tool when a team is in the middle of an incident and communication and common understanding are critical. Having all the information right where you're already looking makes it much more likely that knowledge will be shared across the team quickly. It's also a nice feature to use during blameless postmortems when the team is trying to understand how the failure occurred and learn more about their system.

Vizceral

Netflix created Vizceral to understand its traffic patterns better when performing a traffic failover. Unlike Grafana, which is a more general tool, Vizceral serves a very specific use case. Netflix no longer uses this tool internally and says it is no longer actively maintained, but it still updates the tool periodically. I highlight it here primarily to point out an interesting visualization mechanism and how it can help solve a problem. It's worth running it in a demo environment just to better grasp the concepts and witness what's possible with these systems.

[Nov 08, 2019] What breaks our systems: A taxonomy of black swans, by Laura Nolan

Oct 25, 2018 | opensource.com
Find and fix outlier events that create issues before they trigger severe production problems.

Black swans are a metaphor for outlier events that are severe in impact (like the 2008 financial crash). In production systems, these are the incidents that trigger problems that you didn't know you had, cause major visible impact, and can't be fixed quickly and easily by a rollback or some other standard response from your on-call playbook. They are the events you tell new engineers about years after the fact.

Black swans, by definition, can't be predicted, but sometimes there are patterns we can find and use to create defenses against categories of related problems.

For example, a large proportion of failures are a direct result of changes (code, environment, or configuration). Each bug triggered in this way is distinctive and unpredictable, but the common practice of canarying all changes is somewhat effective against this class of problems, and automated rollbacks have become a standard mitigation.

As our profession continues to mature, other kinds of problems are becoming well-understood classes of hazards with generalized prevention strategies.

Black swans observed in the wild

All technology organizations have production problems, but not all of them share their analyses. The organizations that publicly discuss incidents are doing us all a service. The following incidents describe one class of a problem and are by no means isolated instances. We all have black swans lurking in our systems; it's just some of us don't know it yet.

Hitting limits


Running headlong into any sort of limit can produce very severe incidents. A canonical example of this was Instapaper's outage in February 2017 . I challenge any engineer who has carried a pager to read the outage report without a chill running up their spine. Instapaper's production database was on a filesystem that, unknown to the team running the service, had a 2TB limit. With no warning, it stopped accepting writes. Full recovery took days and required migrating its database.

Limits can strike in various ways. Sentry hit limits on maximum transaction IDs in Postgres. Platform.sh hit size limits on a pipe buffer. SparkPost triggered AWS's DDoS protection. Foursquare hit a performance cliff when one of its datastores ran out of RAM.

One way to get advance knowledge of system limits is to test periodically. Good load testing (on a production replica) ought to involve write transactions and should involve growing each datastore beyond its current production size. It's easy to forget to test things that aren't your main datastores (such as Zookeeper). If you hit limits during testing, you have time to fix the problems. Given that resolution of limits-related issues can involve major changes (like splitting a datastore), time is invaluable.

When it comes to cloud services, if your service generates unusual loads or uses less widely used products or features (such as older or newer ones), you may be more at risk of hitting limits. It's worth load testing these, too. But warn your cloud provider first.

Finally, where limits are known, add monitoring (with associated documentation) so you will know when your systems are approaching those ceilings. Don't rely on people still being around to remember.
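
As a crude sketch of such monitoring (this catches only the simplest kind of limit, a filling filesystem; the 85% threshold and the logger tag are arbitrary choices, and you would wire the output into your own alerting path):

#!/bin/bash
# Warn when any mounted filesystem crosses a usage threshold.
THRESHOLD=85
df -P | awk 'NR > 1 { gsub("%", "", $5); print $5, $6 }' | while read -r used mount; do
    if [ "$used" -ge "$THRESHOLD" ]; then
        logger -t disk_watch "filesystem_usage_high mount=$mount used=${used}%"
    fi
done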

Spreading slowness
"The world is much more correlated than we give credit to. And so we see more of what Nassim Taleb calls 'black swan events' -- rare events happen more often than they should because the world is more correlated."
-- Richard Thaler

HostedGraphite's postmortem on how an AWS outage took down its load balancers (which are not hosted on AWS) is a good example of just how much correlation exists in distributed computing systems. In this case, the load-balancer connection pools were saturated by slow connections from customers that were hosted in AWS. The same kinds of saturation can happen with application threads, locks, and database connections -- any kind of resource monopolized by slow operations.

HostedGraphite's incident is an example of externally imposed slowness, but often slowness can result from saturation somewhere in your own system creating a cascade and causing other parts of your system to slow down. An incident at Spotify demonstrates such spread -- the streaming service's frontends became unhealthy due to saturation in a different microservice. Enforcing deadlines for all requests, as well as limiting the length of request queues, can prevent such spread. Your service will serve at least some traffic, and recovery will be easier because fewer parts of your system will be broken.

Retries should be limited with exponential backoff and some jitter. An outage at Square, in which its Redis datastore became overloaded due to a piece of code that retried failed transactions up to 500 times with no backoff, demonstrates the potential severity of excessive retries. The Circuit Breaker design pattern can be helpful here, too.
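
A minimal bash sketch of retries with exponential backoff and jitter (some_command stands in for the real operation):

#!/bin/bash
max_attempts=5
delay=1
for attempt in $(seq 1 "$max_attempts"); do
    if some_command; then
        exit 0                        # success, stop retrying
    fi
    jitter=$(( RANDOM % delay + 1 ))  # random sleep within the current window
    echo "Attempt $attempt failed; sleeping ${jitter}s" >&2
    sleep "$jitter"
    delay=$(( delay * 2 ))            # double the window each time
done
echo "Giving up after $max_attempts attempts" >&2
exit 1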

Dashboards should be designed to clearly show utilization, saturation, and errors for all resources so problems can be found quickly.

Thundering herds

Often, failure scenarios arise when a system is under unusually heavy load. This can arise organically from users, but often it arises from systems. A surge of cron jobs that starts at midnight is a venerable example. Mobile clients can also be a source of coordinated demand if they are programmed to fetch updates at the same time (of course, it is much better to jitter such requests).

Events occurring at pre-configured times aren't the only source of thundering herds. Slack experienced multiple outages over a short time due to large numbers of clients being disconnected and immediately reconnecting, causing large spikes of load. CircleCI saw a severe outage when a GitLab outage ended, leading to a surge of builds queued in its database, which became saturated and very slow.

Almost any service can be the target of a thundering herd. Planning for such eventualities -- and testing that your plan works as intended -- is therefore a must. Client backoff and load shedding are often core to such approaches.

If your systems must constantly ingest data that can't be dropped, it's key to have a scalable way to buffer this data in a queue for later processing.

Automation systems are complex systems
"Complex systems are intrinsically hazardous systems."
-- Richard Cook, MD

The trend for the past several years has been strongly towards more automation of software operations. Automation of anything that can reduce your system's capacity (e.g., erasing disks, decommissioning devices, taking down serving jobs) needs to be done with care. Accidents (due to bugs or incorrect invocations) with this kind of automation can take down your system very efficiently, potentially in ways that are hard to recover from.

Christina Schulman and Etienne Perot of Google describe some examples in their talk Help Protect Your Data Centers with Safety Constraints . One incident sent Google's entire in-house content delivery network (CDN) to disk-erase.

Schulman and Perot suggest using a central service to manage constraints, which limits the pace at which destructive automation can operate, and being aware of system conditions (for example, avoiding destructive operations if the service has recently had an alert).

Automation systems can also cause havoc when they interact with operators (or with other automated systems). Reddit experienced a major outage when its automation restarted a system that operators had stopped for maintenance. Once you have multiple automation systems, their potential interactions become extremely complex and impossible to predict.

It will help to deal with the inevitable surprises if all this automation writes logs to an easily searchable, central place. Automation systems should always have a mechanism to allow them to be quickly turned off (fully or only for a subset of operations or targets).

Defense against the dark swans

These are not the only black swans that might be waiting to strike your systems. There are many other kinds of severe problem that can be avoided using techniques such as canarying, load testing, chaos engineering, disaster testing, and fuzz testing -- and of course designing for redundancy and resiliency. Even with all that, at some point your system will fail.

To ensure your organization can respond effectively, make sure your key technical staff and your leadership have a way to coordinate during an outage. For example, one unpleasant issue you might have to deal with is a complete outage of your network. It's important to have a fail-safe communications channel completely independent of your own infrastructure and its dependencies. For instance, if you run on AWS, using a service that also runs on AWS as your fail-safe communication method is not a good idea. A phone bridge or an IRC server that runs somewhere separate from your main systems is good. Make sure everyone knows what the communications platform is and practices using it.

Another principle is to ensure that your monitoring and your operational tools rely on your production systems as little as possible. Separate your control and your data planes so you can make changes even when systems are not healthy. Don't use a single message queue for both data processing and config changes or monitoring, for example -- use separate instances. In SparkPost: The Day the DNS Died , Jeremy Blosser presents an example where critical tools relied on the production DNS setup, which failed.

The psychology of battling the black swan

Dealing with major incidents in production can be stressful. It really helps to have a structured incident-management process in place for these situations. Many technology organizations (including Google) successfully use a version of FEMA's Incident Command System. There should be a clear way for any on-call individual to call for assistance in the event of a major problem they can't resolve alone.

For long-running incidents, it's important to make sure people don't work for unreasonable lengths of time and get breaks to eat and sleep (uninterrupted by a pager). It's easy for exhausted engineers to make a mistake or overlook something that might resolve the incident faster.

Learn more

There are many other things that could be said about black (or formerly black) swans and strategies for dealing with them. If you'd like to learn more, I highly recommend these two books dealing with resilience and stability in production: Susan Fowler's Production-Ready Microservices and Michael T. Nygard's Release It! .


Laura Nolan will present What Breaks Our Systems: A Taxonomy of Black Swans at LISA18 , October 29-31 in Nashville, Tennessee, USA.

[Nov 08, 2019] My first sysadmin mistake by Jim Hall

Wiping out the /etc directory is something sysadmins do accidentally. It often happens when another directory is named etc, for example /Backup/etc: you automatically put a slash in front of etc because it is ingrained in your mind, you do it subconsciously without realizing what you are doing, and then you face the consequences. If you do not use saferm, the results are pretty devastating. In most cases the server does not die, but new logins become impossible; existing SSH sessions survive. That's why it is important to back up /etc at the first login to the server. On modern servers it takes a couple of seconds.
If the subdirectories are intact, you can still copy the content from another server. But the content of the sysconfig subdirectory in Linux is unique to the server, and you need a backup to restore it.
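
One way to take that quick snapshot right after logging in (the path under /root is just a suggestion):

# Archive /etc before touching anything else; takes seconds on modern servers
tar -czf /root/etc-backup-$(date +%F).tar.gz /etc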
Notable quotes:
"... As root. I thought I was deleting some stale cache files for one of our programs. Instead, I wiped out all files in the /etc directory by mistake. Ouch. ..."
"... I put together a simple strategy: Don't reboot the server. Use an identical system as a template, and re-create the ..."
Nov 08, 2019 | opensource.com
I ran the rm command in the wrong directory. As root. I thought I was deleting some stale cache files for one of our programs. Instead, I wiped out all files in the /etc directory by mistake. Ouch.

My clue that I'd done something wrong was an error message that rm couldn't delete certain subdirectories. But the cache directory should contain only files! I immediately stopped the rm command and looked at what I'd done. And then I panicked. All at once, a million thoughts ran through my head. Did I just destroy an important server? What was going to happen to the system? Would I get fired?

Fortunately, I'd run rm * and not rm -rf * so I'd deleted only files. The subdirectories were still there. But that didn't make me feel any better.

Immediately, I went to my supervisor and told her what I'd done. She saw that I felt really dumb about my mistake, but I owned it. Despite the urgency, she took a few minutes to do some coaching with me. "You're not the first person to do this," she said. "What would someone else do in your situation?" That helped me calm down and focus. I started to think less about the stupid thing I had just done, and more about what I was going to do next.

I put together a simple strategy: Don't reboot the server. Use an identical system as a template, and re-create the /etc directory.

Once I had my plan of action, the rest was easy. It was just a matter of running the right commands to copy the /etc files from another server and edit the configuration so it matched the system. Thanks to my practice of documenting everything, I used my existing documentation to make any final adjustments. I avoided having to completely restore the server, which would have meant a huge disruption.

To be sure, I learned from that mistake. For the rest of my years as a systems administrator, I always confirmed what directory I was in before running any command.

I also learned the value of building a "mistake strategy." When things go wrong, it's natural to panic and think about all the bad things that might happen next. That's human nature. But creating a "mistake strategy" helps me stop worrying about what just went wrong and focus on making things better. I may still think about it, but knowing my next steps allows me to "get over it."

[Nov 08, 2019] 13 open source backup solutions by Don Watkins

This is mostly just a list; you need to do your own research. Some important backup applications are not mentioned, and it is unclear what methods each uses or why each is preferable to tar. The stress in the list is on portability (Linux plus MacOS and Windows, not just Linux).
Mar 07, 2019 | opensource.com

Recently, we published a poll that asked readers to vote on their favorite open source backup solution. We offered six solutions recommended by our moderator community -- Cronopete, Deja Dup, Rclone, Rdiff-backup, Restic, and Rsync -- and invited readers to share other options in the comments. And you came through, offering 13 other solutions (so far) that we either hadn't considered or hadn't even heard of.

By far the most popular suggestion was BorgBackup . It is a deduplicating backup solution that features compression and encryption. It is supported on Linux, MacOS, and BSD and has a BSD License.
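
As a quick illustration of what a Borg run can look like (the repository path is hypothetical, and encryption mode and retention are up to you):

borg init --encryption=repokey /backup/borg-repo
borg create --stats /backup/borg-repo::etc-{now:%Y-%m-%d} /etc
borg list /backup/borg-repo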

Second was UrBackup , which does full and incremental image and file backups; you can save whole partitions or single directories. It has clients for Windows, Linux, and MacOS and has a GNU Affero Public License.

Third was LuckyBackup ; according to its website, "it is simple to use, fast (transfers over only changes made and not all data), safe (keeps your data safe by checking all declared directories before proceeding in any data manipulation), reliable, and fully customizable." It carries a GNU Public License.

Casync is content-addressable synchronization -- it's designed for backup and synchronizing and stores and retrieves multiple related versions of large file systems. It is licensed with the GNU Lesser Public License.

Syncthing synchronizes files between two computers. It is licensed with the Mozilla Public License and, according to its website, is secure and private. It works on MacOS, Windows, Linux, FreeBSD, Solaris, and OpenBSD.

Duplicati is a free backup solution that works on Windows, MacOS, and Linux and supports a variety of standard protocols, such as FTP, SSH, and WebDAV, as well as cloud services. It features strong encryption and is licensed with the GPL.

Dirvish is a disk-based virtual image backup system licensed under OSL-3.0. It also requires Rsync, Perl5, and SSH to be installed.

Bacula 's website says it "is a set of computer programs that permits the system administrator to manage backup, recovery, and verification of computer data across a network of computers of different kinds." It is supported on Linux, FreeBSD, Windows, MacOS, OpenBSD, and Solaris and the bulk of its source code is licensed under AGPLv3.

BackupPC "is a high-performance, enterprise-grade system for backing up Linux, Windows, and MacOS PCs and laptops to a server's disk," according to its website. It is licensed under the GPLv3.

Amanda is a backup system written in C and Perl that allows a system administrator to back up an entire network of client machines to a single server using tape, disk, or cloud-based systems. It was developed and copyrighted in 1991 at the University of Maryland and has a BSD-style license.

Back in Time is a simple backup utility designed for Linux. It provides a command line client and a GUI, both written in Python. To do a backup, just specify where to store snapshots, what folders to back up, and the frequency of the backups. BackInTime is licensed with GPLv2.

Timeshift is a backup utility for Linux that is similar to System Restore for Windows and Time Capsule for MacOS. According to its GitHub repository, "Timeshift protects your system by taking incremental snapshots of the file system at regular intervals. These snapshots can be restored at a later date to undo all changes to the system."

Kup is a backup solution that was created to help users back up their files to a USB drive, but it can also be used to perform network backups. According to its GitHub repository, "When you plug in your external hard drive, Kup will automatically start copying your latest changes."

[Nov 08, 2019] What you probably didn't know about sudo

Nov 08, 2019 | opensource.com

Enable features for a certain group of users

The sudo command comes with a huge set of defaults. Still, there are situations when you want to override some of these. This is when you use the Defaults statement in the configuration. Usually, these defaults are enforced on every user, but you can narrow the setting down to a subset of users based on host, username, and so on. Here is an example that my generation of sysadmins loves to hear about: insults. These are just some funny messages for when someone mistypes a password:

czanik@linux-mewy:~> sudo ls
[sudo] password for root:
Hold it up to the light --- not a brain in sight!
[sudo] password for root:
My pet ferret can type better than you!
[sudo] password for root:
sudo: 3 incorrect password attempts
czanik@linux-mewy:~>

Because not everyone is a fan of sysadmin humor, these insults are disabled by default. The following example shows how to enable this setting only for your seasoned sysadmins, who are members of the wheel group:

Defaults !insults
Defaults:%wheel insults

I do not have enough fingers to count how many people thanked me for bringing these messages back.

Digest verification

There are, of course, more serious features in sudo as well. One of them is digest verification. You can include the digest of applications in your configuration:

peter ALL = sha224:11925141bb22866afdf257ce7790bd6275feda80b3b241c108b79c88 /usr/bin/passwd

In this case, sudo checks and compares the digest of the application to the one stored in the configuration before running the application. If they do not match, sudo refuses to run the application. While it is difficult to maintain this information in your configuration -- there are no automated tools for this purpose -- these digests can provide you with an additional layer of protection.

Session recording

Session recording is also a lesser-known feature of sudo . After my demo, many people leave my talk with plans to implement it on their infrastructure. Why? Because with session recording, you see not just the command name, but also everything that happened in the terminal. You can see what your admins are doing even if they have shell access and logs only show that bash is started.
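
Turning it on is a small sudoers change; a minimal sketch (recordings can then be listed and replayed with sudoreplay, and excluding sudoreplay itself from recording is a common precaution):

Defaults log_output
Defaults!/usr/bin/sudoreplay !log_output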

There is one limitation, currently. Records are stored locally, so with enough permissions, users can delete their traces. Stay tuned for upcoming features.

New features

There is a new version of sudo right around the corner. Version 1.9 will include many interesting new features. Here are the most important planned features:

  • A recording service to collect session recordings centrally, which offers many advantages compared to local storage:
    • It is more convenient to search in one place.
    • Recordings are available even if the sender machine is down.
    • Recordings cannot be deleted by someone who wants to delete their tracks.
  • The audit plugin does not add new features to sudoers , but instead provides an API for plugins to easily access any kind of sudo logs. This plugin enables creating custom logs from sudo events using plugins.
  • The approval plugin enables session approvals without using third-party plugins.
  • And my personal favorite: Python support for plugins, which enables you to easily extend sudo using Python code instead of coding natively in C.
Conclusion

I hope this article proved to you that sudo is a lot more than just a simple prefix. There are tons of possibilities to fine-tune permissions on your system. You can not only fine-tune permissions but also improve security by checking digests. Session recordings let you check what is happening on your systems. You can also extend the functionality of sudo using plugins, either using something already available or writing your own. Finally, given the list of upcoming features, you can see that even though sudo is decades old, it is a living project that is constantly evolving.

If you want to learn more about sudo , here are a few resources:

[Nov 08, 2019] Five tips for using revision control in operations Opensource.com

Nov 08, 2019 | opensource.com

Whether you're still using Subversion (SVN), or have moved to a distributed system like Git , revision control has found its place in modern operations infrastructures. If you listen to talks at conferences and see what new companies are doing, it can be easy to assume that everyone is now using revision control, and using it effectively. Unfortunately that's not the case. I routinely interact with organizations who either don't track changes in their infrastructure at all, or are not doing so in an effective manner.

If you're looking for a way to convince your boss to spend the time to set it up, or are simply looking for some tips to improve how you use it, the following are five tips for using revision control in operations.

1. Use revision control

A long time ago in a galaxy far, far away, systems administrators would log into servers and make changes to configuration files and restart services manually. Over time we've worked to automate much of this, to the point that members of your operations team may not even have login access to your servers. Everything may be centrally managed through your configuration management system, or other tooling, to automatically manage services. Whatever you're using to handle your configurations and expectations in your infrastructure, you should have a history of changes made to it.

Having a history of changes allows you to:

  • Debug problems by matching up changes committed to the repository and deployed to outages, infrastructure behavior changes, and other problems. It's tempting for us to always blame developers for problems, but let's be honest, there are rare times when it is our fault.
  • Understand why changes were made. A good commit message, which you should insist upon, will not only explain what the change is doing, but why the change is being made. This will help your future colleagues, and your future self, understand why certain architecture changes were made. The decisions may have been sound ones at the time, and continue to make sense, or they were based on criteria that are no longer applicable to your organization. By tracking these reasons, you can use decisions from the past to make better decisions today.
  • Revert to a prior state. Whether a change to your infrastructure caused problems and you need to do a major roll-back, or a file was simply deleted before a backup was run, having production changes stored in revision control allows you to go back in time to a known state to recover.

This first move can often be the hardest thing for an organization to do. You're moving from static configurations, or configuration management files on a filesystem, into a revision control system which changes the process, and often the speed at which changes can be made. Your engineers need to know how to use revision control and get used to the idea that all changes they put into production will be tracked by other people in the organization.

2. Have a plan for what should be put in revision control

This has a few major components: make sure you have multiple repositories in your infrastructure that are targeted at specific services; don't put auto-generated content or binaries into revision control; and make sure you're working securely.

First, you typically want to split up different parts of your services into different repositories. This allows fine-tuned control of who has access to commit to specific service repositories. It also prevents a repository from getting too large, which can complicate the life of your systems administrators who are trying to copy it onto their systems.

You may not believe that a repository can get very big, since it's just text files, but you'll have a different perspective when you have been using a repository for five years, and every copy includes every change ever made. Let me show you the system-config repository for the OpenStack Infrastructure project. The first commit to this project was made in July 2011:

elizabeth@r2d2$:~$ time git clone https://git.openstack.org/openstack-infra/system-config
Cloning into 'system-config'...
remote: Counting objects: 79237, done.
remote: Compressing objects: 100% (37446/37446), done.
remote: Total 79237 (delta 50215), reused 64257 (delta 35955)
Receiving objects: 100% (79237/79237), 10.52 MiB | 2.78 MiB/s, done.
Resolving deltas: 100% (50215/50215), done.
Checking connectivity... done.

real    0m7.600s
user    0m3.344s
sys     0m0.680s

That's over 10M of data for a text-only repository over five years.

Again, yes, text-only. You typically want to avoid stuffing binaries into revision control. You often don't get diffs on these and they just bloat your repository. Find a better way to distribute your binaries. You also don't want your auto-generated files in revision control. Put the configuration file that creates those auto-generated files into revision control, and let your configuration management tooling do its work to auto-generate the files.

Finally, split out all secret data into separate repositories. Organizations can get a considerable amount of benefit from allowing all of their technical staff see their repositories, but you don't necessarily want to expose every private SSH or SSL key to everyone. You may also consider open sourcing some of your tooling some day, so making sure you have no private data in your repository from the beginning will prevent a lot of headaches later on.

3. Make it your canonical place for changes, and deploy from it

Your revision control system must be a central part of your infrastructure. You can't just update it when you remember to, or add it to a checklist of things you should be doing when you make a change in production. Any time someone makes a change, it must be tracked in revision control. If it's not, file a bug and make sure that configuration file is added to revision control in the future.

This also means you should deploy from the configuration tracked in your revision control system. No one should be able to log into a server and make changes without it being put through revision control, except in rare case of an actual emergency where you have a strict process for getting back on track as soon as the emergency has concluded.

4. Use pre-commit scripts

Humans are pretty forgetful, and we sysadmins are routinely distracted and work in very interrupt-driven environments. Help yourself out and provide your systems administrators with some scripts to remind them what the change message should include.

These reminders may include:

  • Did you explain the reason for your change in the commit message?
  • Did you include a reference to the bug, ticket, issue, etc related to this change?
  • Did you update the documentation to reflect this change?
  • Did you write/update a test for this change? (see the bonus tip at the end of this article)

As a bonus, this also documents what reviewers of your change should look for.
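
A minimal sketch of such a reminder as a Git commit-msg hook (the ticket-reference pattern is just an example of a local convention):

#!/bin/bash
# .git/hooks/commit-msg -- Git passes the path of the proposed message as $1
msg_file="$1"

# Require something that looks like a ticket reference, e.g. OPS-42
if ! grep -qE '[A-Z]+-[0-9]+' "$msg_file"; then
    echo "Commit message should reference a ticket (e.g. OPS-42)." >&2
    exit 1
fi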

Wait, reviewers? That's #5.

5: Hook it all into a code review system

Now that you have a healthy revision control system set up, you have an excellent platform for adding another great tool for systems administrators: code review. This should typically be a peer-review system where your fellow systems administrators of all levels can review changes, and your changes will meet certain agreed-upon criteria for merging. This allows a whole team to take responsibility for a change, and not just the person who came up with it. I like to joke that since changes on our team requires two people to approve, it becomes the fault of three people when something goes wrong, not just one!

Starting out, you don't need to do anything fancy, maybe just have team members submit a merge proposal or pull request and socially make sure someone other than the proposer is the one to merge it. Eventually you can look into more sophisticated code review systems. Most code review tools allow features like inline commenting and discussion-style commenting which provide a non-confrontational way to suggest changes to your colleagues. It's also great for remotely distributed teams who may not be talking face to face about changes, or even awake at the same time.

Bonus: Do tests on every commit!

You have everything in revision control, you have your fellow humans review it, why not add some robots? Start with simple things: Computers are very good at making sure files are alphabetized, humans are not. Computers are really good at making sure syntax is correct (the right number of spaces/tabs?); checking for these things is a waste of time for your brilliant systems engineers. Once you have simple tests in place, start adding unit tests, functional tests, and integration testing so you are confident that your changes won't break in production. This will also help you find where you haven't automated your production infrastructure and solve those problems in your development environment first.

4 Comments

Jenifer Soflous on 25 Jul 2016 Permalink

Great. This article is very useful for my research. Thank you so much to the author.

Ben Cotton on 25 Jul 2016 Permalink

When I worked on a team of 10-ish, we had our SVN server email all commits to the team. This gave everyone a chance to see what changes were being made in real time, even if it was to a part of the system they didn't normally touch. I personally found it really useful in improving my own work. I'd love to see more articles expanding on each of your points here (hint hint!)

Elizabeth K. Joseph on 27 Jul 2016 Permalink

The email point is a good one! We have optional tuning of email via our code review system, and it's also allowed me to improve my own work and keep a better handle on what's going on with the team when I'm traveling (and thus not hovering over IRC).

Shawn H Corey on 25 Jul 2016 Permalink

For anyone using revision control, not just admins.

1. Add a critic to check-in. A critic not only checks syntax but also looks for common bugs.

2. Add a tidy to check-in. It reformats the code into The Style. That way, developers don't have to waste time making code conform to The Style.

[Nov 08, 2019] Winterize your Bash prompt in Linux

Nov 08, 2019 | opensource.com

Your Linux terminal probably supports Unicode, so why not take advantage of that and add a seasonal touch to your prompt? 11 Dec 2018 Jason Baker (Red Hat)

Hello once again for another installment of the Linux command-line toys advent calendar. If this is your first visit to the series, you might be asking yourself what a command-line toy even is. Really, we're keeping it pretty open-ended: It's anything that's a fun diversion at the terminal, and we're giving bonus points for anything holiday-themed.

Maybe you've seen some of these before, maybe you haven't. Either way, we hope you have fun.

Today's toy is super-simple: It's your Bash prompt. Your Bash prompt? Yep! We've got a few more weeks of the holiday season left to stare at it, and even more weeks of winter here in the northern hemisphere, so why not have some fun with it.

Your Bash prompt currently might be a simple dollar sign ( $ ), or more likely, it's something a little longer. If you're not sure what makes up your Bash prompt right now, you can find it in an environment variable called $PS1. To see it, type:

echo $PS1

For me, this returns:

[\u@\h \W]\$

The \u , \h , and \W are special characters for username, hostname, and working directory. There are others you can use as well; for help building out your Bash prompt, you can use EzPrompt , an online generator of PS1 configurations that includes lots of options including date and time, Git status, and more.

You may have other variables that make up your Bash prompt set as well; $PS2 for me contains the closing brace of my command prompt. See this article for more information.

To change your prompt, simply set the environment variable in your terminal like this:

$ PS1='\u is cold: '
jehb is cold:

To set it permanently, add the same code to your ~/.bashrc using your favorite text editor.

So what does this have to do with winterization? Well, chances are on a modern machine, your terminal supports Unicode, so you're not limited to the standard ASCII character set. You can use any emoji that's a part of the Unicode specification, including a snowflake ❄, a snowman ☃, or a pair of skis 🎿. You've got plenty of wintery options to choose from.

🎄 Christmas Tree
🧥 Coat
🦌 Deer
🧤 Gloves
🤶 Mrs. Claus
🎅 Santa Claus
🧣 Scarf
🎿 Skis
🏂 Snowboarder
❄ Snowflake
☃ Snowman
⛄ Snowman Without Snow
🎁 Wrapped Gift

Pick your favorite, and enjoy some winter cheer. Fun fact: modern filesystems also support Unicode characters in their filenames, meaning you can technically name your next program "❄❄❄❄❄.py" . That said, please don't.
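
If you want to try one, here is a minimal sketch; the exact prompt string is just an example, and the ~/.bashrc location assumes Bash reads that file on your system.

# Try it out in the current shell first:
PS1='☃ [\u@\h \W]\$ '

# If you like it, add the same line to ~/.bashrc with your favorite editor
# so it survives new terminal sessions.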

Do you have a favorite command-line toy that you think I ought to include? The calendar for this series is mostly filled out but I've got a few spots left. Let me know in the comments below, and I'll check it out. If there's space, I'll try to include it. If not, but I get some good submissions, I'll do a round-up of honorable mentions at the end.

[Nov 08, 2019] How to change the default shell prompt

Jun 29, 2014 | access.redhat.com
  • The shell prompt is controlled via the PS environment variables.
**PS1** - The value of this parameter is expanded and used as the primary prompt string. The default value is \u@\h \W\\$ .
**PS2** - The value of this parameter is expanded as with PS1 and used as the secondary prompt string. The default is >
**PS3** - The value of this parameter is used as the prompt for the select command
**PS4** - The value of this parameter is expanded as with PS1 and the value is printed before each command bash displays during an execution trace. The first character of PS4 is replicated multiple times, as necessary, to indicate multiple levels of indirection. The default is +
  • PS1 is a primary prompt variable which holds \u@\h \W\\$ special bash characters. This is the default structure of the bash prompt and is displayed every time a user logs in using a terminal. These default values are set in the /etc/bashrc file.
  • The special characters in the default prompt are as follows:
\u = username
\h = hostname
\W = current working directory
  • This command will show the current value.
# echo $PS1
  • This can be modified by changing the PS1 variable:
# PS1='[[prod]\u@\h \W]\$'
  • The modified shell prompt will look like:
[[prod]root@hostname ~]#
  • In order to make these settings permanent, edit the /etc/bashrc file:

Find this line:

[ "$PS1" = "\\s-\\v\\\$ " ] && PS1="[\u@\h \W]\\$ "

And change it as needed:

[ "$PS1" = "\\s-\\v\\\$ " ] && PS1="[[prod]\u@\h \W]\\$ "

This solution is part of Red Hat's fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

2 Comments

6 October 2016 1:53 PM Mike Willis

This solution has simply "Red Hat Enterprise Linux" in the Environment section implying it applies to all versions of Red Hat Enterprise Linux.

Editing /etc/bashrc is against the advice of the comments in /etc/bashrc on Red Hat Enterprise Linux 7 which say

# It's NOT a good idea to change this file unless you know what you
# are doing. It's much better to create a custom.sh shell script in
# /etc/profile.d/ to make custom changes to your environment, as this
# will prevent the need for merging in future updates.

On RHEL 7 instead of the solution suggested above create a /etc/profile.d/custom.sh which contains

PS1="[[prod]\u@\h \W]\\$ "
27 March 2019 12:44 PM Mike Chanslor

Hello Red Hat community! I also found this useful:

Special prompt variable characters:
 \d   The date, in "Weekday Month Date" format (e.g., "Tue May 26"). 

 \h   The hostname, up to the first . (e.g. deckard) 
 \H   The hostname. (e.g. deckard.SS64.com)

 \j   The number of jobs currently managed by the shell. 

 \l   The basename of the shell's terminal device name. 

 \s   The name of the shell, the basename of $0 (the portion following 
      the final slash). 

 \t   The time, in 24-hour HH:MM:SS format. 
 \T   The time, in 12-hour HH:MM:SS format. 
 \@   The time, in 12-hour am/pm format. 

 \u   The username of the current user. 

 \v   The version of Bash (e.g., 2.00) 

 \V   The release of Bash, version + patchlevel (e.g., 2.00.0) 

 \w   The current working directory. 
 \W   The basename of $PWD. 

 \!   The history number of this command. 
 \#   The command number of this command. 

 \$   If you are not root, inserts a "$"; if you are root, you get a "#"  (root uid = 0) 

 \nnn   The character whose ASCII code is the octal value nnn. 

 \n   A newline. 
 \r   A carriage return. 
 \e   An escape character (typically a color code). 
 \a   A bell character.
 \\   A backslash. 

 \[   Begin a sequence of non-printing characters. (like color escape sequences). This
      allows bash to calculate word wrapping correctly.

 \]   End a sequence of non-printing characters.
Using single quotes instead of double quotes when exporting your PS variables is recommended: it makes the prompt a tiny bit faster to evaluate, plus you can then do an echo $PS1 to see the current prompt settings.
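
Building on that advice, here is a small illustrative example (the color and the [prod] tag are arbitrary choices, not part of the knowledge base article). It wraps the color escape sequences in \[ ... \] so Bash calculates line wrapping correctly, and it lives in /etc/profile.d/custom.sh rather than in /etc/bashrc, as recommended in the comments above.

# /etc/profile.d/custom.sh -- example only
# Green "[prod]" marker, then the usual user@host and working directory.
PS1='\[\e[0;32m\][prod]\[\e[0m\][\u@\h \W]\$ '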

[Nov 08, 2019] 7 Bash history shortcuts you will actually use by Ian Miell

Nov 08, 2019 | opensource.com

7 Bash history shortcuts you will actually use. Save time on the command line with these essential Bash shortcuts. 02 Oct 2019

This article outlines the shortcuts I actually use every day. It is based on some of the contents of my book, Learn Bash the hard way (you can read a preview of it to learn more).

When people see me use these shortcuts, they often ask me, "What did you do there!?" There's minimal effort or intelligence required, but to really learn them, I recommend using one each day for a week, then moving to the next one. It's worth taking your time to get them under your fingers, as the time you save will be significant in the long run.

1. The "last argument" one: !$

If you only take one shortcut from this article, make it this one. It substitutes in the last argument of the last command into your line.

Consider this scenario:

$ mv /path/to/wrongfile /some/other/place
mv: cannot stat '/path/to/wrongfile': No such file or directory

Ach, I put the wrongfile filename in my command. I should have put rightfile instead.

You might decide to retype the last command and replace wrongfile with rightfile completely. Instead, you can type:

$ mv /path/to/rightfile !$
mv /path/to/rightfile /some/other/place

and the command will work.

There are other ways to achieve the same thing in Bash with shortcuts, but this trick of reusing the last argument of the last command is one I use the most.

2. The "nth argument" one: !:2

Ever done anything like this?

$ tar -cvf afolder afolder.tar
tar: failed to open

Like many others, I get the arguments to tar (and ln ) wrong more often than I would like to admit.


When you mix up arguments like that, you can run:

$ !:0 !:1 !:3 !:2
tar -cvf afolder.tar afolder

and your reputation will be saved.

The last command's items are zero-indexed and can be substituted in with the number after the !: .

Obviously, you can also use this to reuse specific arguments from the last command rather than all of them.
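
For example (a made-up command, not from the article), you could reuse just the second argument of the previous command:

$ ls -l /etc/hosts
$ cat !:2
cat /etc/hosts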

3. The "all the arguments" one: !:1-$

Imagine I run a command like:

$ grep '(ping|pong)' afile

The arguments are correct; however, I want to match ping or pong in a file, but I used grep rather than egrep .

I start typing egrep, but I don't want to retype the other arguments. So I can use the !:1-$ shortcut to ask for all the arguments to the previous command from the second one (remember they're zero-indexed) to the last one (represented by the $ sign).

$ egrep !:1-$
egrep '(ping|pong)' afile
ping

You don't need to pick 1-$ ; you can pick a subset like 1-2 or 3-9 (if you had that many arguments in the previous command).

4. The "last but n" one: !-2:$

The shortcuts above are great when I know immediately how to correct my last command, but often I run commands after the original one, which means that the last command is no longer the one I want to reference.

For example, using the mv example from before, if I follow up my mistake with an ls check of the folder's contents:

$ mv /path/to/wrongfile /some/other/place
mv: cannot stat '/path/to/wrongfile': No such file or directory
$ ls /path/to/
rightfile

I can no longer use the !$ shortcut.

In these cases, I can insert a -n: (where n is the number of commands to go back in the history) after the ! to grab the last argument from an older command:

$ mv /path/to/rightfile !-2:$
mv /path/to/rightfile /some/other/place

Again, once you learn it, you may be surprised at how often you need it.

5. The "get me the folder" one: !$:h

This one looks less promising on the face of it, but I use it dozens of times daily.

Imagine I run a command like this:

$ tar -cvf system.tar /etc/system
tar: /etc/system: Cannot stat: No such file or directory
tar: Error exit delayed from previous errors.

The first thing I might want to do is go to the /etc folder to see what's in there and work out what I've done wrong.

I can do this at a stroke with:

$ cd !$:h
cd /etc

This one says: "Get the last argument to the last command ( /etc/system ) and take off its last filename component, leaving only the /etc ."

6. The "the current line" one: !#:1

For years, I occasionally wondered if I could reference an argument on the current line before finally looking it up and learning it. I wish I'd done so a long time ago. I most commonly use it to make backup files:

$ cp /path/to/some/file !#:1.bak
cp /path/to/some/file /path/to/some/file.bak

but once under the fingers, it can be a very quick alternative to

7. The "search and replace" one: !!:gs

This one searches across the referenced command and replaces the text between the first and second / characters with the text between the second and third.

Say I want to tell the world that my s key does not work and outputs f instead:

$ echo my f key doef not work
my f key doef not work

Then I realize that I was just hitting the f key by accident. To replace all the f s with s es, I can type:

$ !!:gs/f/s/
echo my s key does not work
my s key does not work

It doesn't work only on single characters; I can replace words or sentences, too:

$ !!:gs/does/did/
echo my s key did not work
my s key did not work

Test them out

Just to show you how these shortcuts can be combined, can you work out what these toenail clippings will output?

$ ping !#:0:gs/i/o
$ vi /tmp/!:0.txt
$ ls !$:h
$ cd !-2:h
$ touch !$!-3:$ !! !$.txt
$ cat !:1-$

Conclusion

Bash can be an elegant source of shortcuts for the day-to-day command-line user. While there are thousands of tips and tricks to learn, these are my favorites that I frequently put to use.

If you want to dive even deeper into all that Bash can teach you, pick up my book, Learn Bash the hard way or check out my online course, Master the Bash shell .


This article was originally posted on Ian's blog, Zwischenzugs.com , and is reused with permission.

[Nov 08, 2019] A guide to logging in Python Opensource.com

Nov 08, 2019 | opensource.com

A guide to logging in Python. Logging is a critical way to understand what's going on inside an application. 13 Sep 2017 Mario Corchero

When applications are running in production, they become black boxes that need to be traced and monitored. One of the simplest, yet most important ways to do so is by logging. Logging allows us -- at the time we develop our software -- to instruct the program to emit information while the system is running that will be useful for us and our sysadmins.


The same way we document code for future developers, we should direct new software to generate adequate logs for developers and sysadmins. Logs are a critical part of the system documentation about an application's runtime status. When instrumenting your software with logs, think of it like writing documentation for developers and sysadmins who will maintain the system in the future.

Some purists argue that a disciplined developer who uses logging and testing should hardly need an interactive debugger. If we cannot reason about our application during development with verbose logging, it will be even harder to do it when our code is running in production.

This article looks at Python's logging module, its design, and ways to adapt it for more complex use cases. This is not intended as documentation for developers, rather as a guide to show how the Python logging module is built and to encourage the curious to delve deeper.

Why use the logging module?

A developer might argue, why aren't simple print statements sufficient? The logging module offers multiple benefits, including:

  • Multi-threading support
  • Categorization via different levels of logging
  • Flexibility and configurability
  • Separation of the how from the what

This last point, the separation of what we log from how we log, enables collaboration between different parts of the software. As an example, it allows the developer of a framework or library to add logs and let the sysadmin or person in charge of the runtime configuration decide what should be logged at a later point.

What's in the logging module

The logging module beautifully separates the responsibility of each of its parts (following the Apache Log4j API's approach). Let's look at how a log line travels around the module's code and explore its different parts.

Logger

Loggers are the objects a developer usually interacts with. They are the main APIs that indicate what we want to log.

Given an instance of a logger , we can categorize and ask for messages to be emitted without worrying about how or where they will be emitted.

For example, when we write logger.info("Stock was sold at %s", price) we have the following model in mind:


We request a line and we assume some code is executed in the logger that makes that line appear in the console/file. But what actually is happening inside?

Log records

Log records are packages that the logging module uses to pass all required information around. They contain information about the function where the log was requested, the string that was passed, arguments, call stack information, etc.

These are the objects that are being logged. Every time we invoke our loggers, we are creating instances of these objects. But how do objects like these get serialized into a stream? Via handlers!

Handlers

Handlers emit the log records into any output. They take log records and handle them according to what they were built for.

As an example, a FileHandler will take a log record and append it to a file.

The standard logging module already comes with multiple built-in handlers like:

  • Multiple file handlers ( TimeRotated , SizeRotated , Watched ) that can write to files
  • StreamHandler can target a stream like stdout or stderr
  • SMTPHandler sends log records via email
  • SocketHandler sends LogRecords to a streaming socket
  • SyslogHandler , NTEventHandler , HTTPHandler , MemoryHandler , and others

We have now a model that's closer to reality:


But most handlers work with simple strings (SMTPHandler, FileHandler, etc.), so you may be wondering how those structured LogRecords are transformed into easy-to-serialize bytes...

Formatters

Let me present the Formatters. Formatters are in charge of serializing the metadata-rich LogRecord into a string. There is a default formatter if none is provided.

The generic formatter class provided by the logging library takes a template and style as input. Then placeholders can be declared for all the attributes in a LogRecord object.

As an example: '%(asctime)s %(levelname)s %(name)s: %(message)s' will generate logs like 2017-07-19 15:31:13,942 INFO parent.child: Hello EuroPython .

Note that the attribute message is the result of interpolating the log's original template with the arguments provided. (e.g., for logger.info("Hello %s", "Laszlo") , the message will be "Hello Laszlo").

All default attributes can be found in the logging documentation .

OK, now that we know about formatters, our model has changed again:

Filters

The last objects in our logging toolkit are filters.

Filters allow for finer-grained control of which logs should be emitted. Multiple filters can be attached to both loggers and handlers. For a log to be emitted, all filters should allow the record to pass.

Users can declare their own filters as objects using a filter method that takes a record as input and returns True / False as output.

With this in mind, here is the current logging workflow:

The logger hierarchy

At this point, you might be impressed by the amount of complexity and configuration that the module is hiding so nicely for you, but there is even more to consider: the logger hierarchy.

We can create a logger via logging.getLogger(<logger_name>) . The string passed as an argument to getLogger can define a hierarchy by separating the elements using dots.

As an example, logging.getLogger("parent.child") will create a logger "child" with a parent logger named "parent." Loggers are global objects managed by the logging module, so they can be retrieved conveniently anywhere during our project.

Logger instances are also known as channels. The hierarchy allows the developer to define the channels and their hierarchy.

After the log record has been passed to all the handlers within the logger, the parents' handlers will be called recursively until we reach the top logger (defined as an empty string) or a logger has configured propagate = False . We can see it in the updated diagram:


Note that the parent logger is not called, only its handlers. This means that filters and other code in the logger class won't be executed on the parents. This is a common pitfall when adding filters to loggers.

Recapping the workflow

We've examined the split in responsibility and how we can fine tune log filtering. But there are two other attributes we haven't mentioned yet:

  1. Loggers can be disabled, thereby not allowing any record to be emitted from them.
  2. An effective level can be configured in both loggers and handlers.

As an example, when a logger has configured a level of INFO , only INFO levels and above will be passed. The same rule applies to handlers.

With all this in mind, the final flow diagram in the logging documentation looks like this:

How to use logging

Now that we've looked at the logging module's parts and design, it's time to examine how a developer interacts with it. Here is a code example:

import logging

def sample_function(secret_parameter):
    logger = logging.getLogger(__name__)  # __name__=projectA.moduleB
    logger.debug("Going to perform magic with '%s'", secret_parameter)
    ...
    try:
        result = do_magic(secret_parameter)
    except IndexError:
        logger.exception("OMG it happened again, someone please tell Laszlo")
    except:
        logger.info("Unexpected exception", exc_info=True)
        raise
    else:
        logger.info("Magic with '%s' resulted in '%s'", secret_parameter, result, stack_info=True)

This creates a logger using the module __name__ . It will create channels and hierarchies based on the project structure, as Python modules are concatenated with dots.

The logger variable references the logger "module," having "projectA" as a parent, which has "root" as its parent.

On line 5, we see how to perform calls to emit logs. We can use one of the methods debug , info , error , or critical to log using the appropriate level.

When logging a message, in addition to the template arguments, we can pass keyword arguments with specific meaning. The most interesting are exc_info and stack_info . These will add information about the current exception and the stack frame, respectively. For convenience, a method exception is available in the logger objects, which is the same as calling error with exc_info=True .

These are the basics of how to use the logger module. ʘ‿ʘ. But it is also worth mentioning some uses that are usually considered bad practices.

Greedy string formatting

Using logger.info("string template {}".format(argument)) should be avoided whenever possible in favor of logger.info("string template %s", argument). This is a better practice, as the actual string interpolation will happen only if the log is going to be emitted. Not doing so can lead to wasted cycles when we are logging at a level above INFO, as the interpolation will still occur even though the message is never emitted.

Capturing and formatting exceptions

Quite often, we want to log information about the exception in a catch block, and it might feel intuitive to use:

try:
    ...
except Exception as error:
    logger.info("Something bad happened: %s", error)

But that code can give us log lines like Something bad happened: "secret_key." This is not that useful. If we use exc_info as illustrated previously, it will produce the following:

try:
    ...
except Exception:
    logger.info("Something bad happened", exc_info=True)

Something bad happened
Traceback (most recent call last):
  File "sample_project.py", line 10, in code
    inner_code()
  File "sample_project.py", line 6, in inner_code
    x = data["secret_key"]
KeyError: 'secret_key'

This not only contains the exact source of the exception, but also the type.

Configuring our loggers

Instrumenting our software is easy, but we still need to configure the logging stack and specify how those records will be emitted.

There are multiple ways to configure the logging stack.

BasicConfig

This is by far the simplest way to configure logging. Just doing logging.basicConfig(level="INFO") sets up a basic StreamHandler that will log everything on the INFO and above levels to the console. There are arguments to customize this basic configuration. Some of them are:

filename - Specifies that a FileHandler should be created, using the specified filename, rather than a StreamHandler (example: /var/logs/logs.txt)
format - Use the specified format string for the handler (example: '%(asctime)s %(message)s')
datefmt - Use the specified date/time format (example: "%H:%M:%S")
level - Set the root logger level to the specified level (example: "INFO")

This is a simple and practical way to configure small scripts.

Note, basicConfig only works the first time it is called in a runtime. If you have already configured your root logger, calling basicConfig will have no effect.

DictConfig

The configuration for all elements and how to connect them can be specified as a dictionary. This dictionary should have different sections for loggers, handlers, formatters, and some basic global parameters.

Here's an example:

config = {
    'disable_existing_loggers': False,
    'version': 1,
    'formatters': {
        'short': {
            'format': '%(asctime)s %(levelname)s %(name)s: %(message)s'
        },
    },
    'handlers': {
        'console': {
            'level': 'INFO',
            'formatter': 'short',
            'class': 'logging.StreamHandler',
        },
    },
    'loggers': {
        '': {
            'handlers': ['console'],
            'level': 'ERROR',
        },
        'plugins': {
            'handlers': ['console'],
            'level': 'INFO',
            'propagate': False
        }
    },
}

import logging.config
logging.config.dictConfig(config)

When invoked, dictConfig will disable all existing loggers, unless disable_existing_loggers is set to false . This is usually desired, as many modules declare a global logger that will be instantiated at import time, before dictConfig is called.

You can see the schema that can be used for the dictConfig method . Often, this configuration is stored in a YAML file and configured from there. Many developers often prefer this over using fileConfig , as it offers better support for customization.

Extending logging

Thanks to the way it is designed, it is easy to extend the logging module. Let's see some examples:

Logging JSON

If we want, we can log JSON by creating a custom formatter that transforms the log records into a JSON-encoded string:

import logging
import logging.config
import json

ATTR_TO_JSON = ['created', 'filename', 'funcName', 'levelname', 'lineno', 'module', 'msecs', 'msg', 'name', 'pathname', 'process', 'processName', 'relativeCreated', 'thread', 'threadName']

class JsonFormatter:
    def format(self, record):
        obj = {attr: getattr(record, attr)
               for attr in ATTR_TO_JSON}
        return json.dumps(obj, indent=4)

handler = logging.StreamHandler()
handler.formatter = JsonFormatter()
logger = logging.getLogger(__name__)
logger.addHandler(handler)
logger.error("Hello")

Adding further context

On the formatters, we can specify any log record attribute.

We can inject attributes in multiple ways. In this example, we abuse filters to enrich the records.

import logging
import logging.config

GLOBAL_STUFF = 1

class ContextFilter(logging.Filter):
    def filter(self, record):
        global GLOBAL_STUFF
        GLOBAL_STUFF += 1
        record.global_data = GLOBAL_STUFF
        return True

handler = logging.StreamHandler()
handler.formatter = logging.Formatter("%(global_data)s %(message)s")
handler.addFilter(ContextFilter())
logger = logging.getLogger(__name__)
logger.addHandler(handler)

logger.error("Hi1")
logger.error("Hi2")

This effectively adds an attribute to all the records that go through that logger. The formatter will then include it in the log line.

Note that this impacts all log records in your application, including libraries or other frameworks that you might be using and for which you are emitting logs. It can be used to log things like a unique request ID on all log lines to track requests or to add extra contextual information.

Starting with Python 3.2, you can use setLogRecordFactory to capture all log record creation and inject extra information. The extra attribute and the LoggerAdapter class may also be of interest.

Buffering logs

Sometimes we would like to have access to debug logs when an error happens. This is feasible by creating a buffered handler that will log the last debug messages after an error occurs. See the following code as a non-curated example:

import logging
import logging.handlers

class SmartBufferHandler(logging.handlers.MemoryHandler):
    def __init__(self, num_buffered, *args, **kwargs):
        kwargs["capacity"] = num_buffered + 2  # +2: one for current, one for prepop
        super().__init__(*args, **kwargs)

    def emit(self, record):
        if len(self.buffer) == self.capacity - 1:
            self.buffer.pop(0)
        super().emit(record)

handler = SmartBufferHandler(num_buffered=2, target=logging.StreamHandler(), flushLevel=logging.ERROR)
logger = logging.getLogger(__name__)
logger.setLevel("DEBUG")
logger.addHandler(handler)

logger.error("Hello1")
logger.debug("Hello2")  # This line won't be logged
logger.debug("Hello3")
logger.debug("Hello4")
logger.error("Hello5")  # As error will flush the buffered logs, the two last debugs will be logged

For more information

This introduction to the logging library's flexibility and configurability aims to demonstrate the beauty of how its design splits concerns. It also offers a solid foundation for anyone interested in a deeper dive into the logging documentation and the how-to guide . Although this article isn't a comprehensive guide to Python logging, here are answers to a few frequently asked questions.

My library emits a "no logger configured" warning

Check how to configure logging in a library from "The Hitchhiker's Guide to Python."

What happens if a logger has no level configured?

The effective level of the logger will then be defined recursively by its parents.

All my logs are in local time. How do I log in UTC?

Formatters are the answer! You need to set the converter attribute of your formatter to generate UTC times. Use converter = time.gmtime .

[Nov 08, 2019] The monumental impact of C Opensource.com

Nov 08, 2019 | opensource.com

The monumental impact of C. The season finale of Command Line Heroes offers a lesson in how a small community of open source enthusiasts can change the world. 01 Oct 2019 Matthew Broberg (Red Hat)

Command Line Heroes podcast explores C's origin story in a way that showcases the longevity and power of its design. It's a perfect synthesis of all the languages discussed throughout the podcast's third season and this series of articles.

C is such a fundamental language that many of us forget how much it has changed. Technically a "high-level language," in the sense that it requires a compiler to be runnable, it's as close to assembly language as people like to get these days (outside of specialized, low-memory environments). It's also considered to be the language that made nearly all languages that came after it possible.

The path to C began with failure

While the myth persists that all great inventions come from highly competitive garage dwellers, C's story is more fit for the Renaissance period.

In the 1960s, Bell Labs in suburban New Jersey was one of the most innovative places of its time. Jon Gertner, author of The idea factory , describes the culture of the time marked by optimism and the excitement to solve tough problems. Instead of monetization pressures with tight timelines, Bell Labs offered seemingly endless funding for wild ideas. It had a research and development ethos that aligns well with today's open leadership principles . The results were significant and prove that brilliance can come without the promise of VC funding or an IPO.


The challenge back then was terminal sharing: finding a way for lots of people to access the (very limited number of) available computers. Before there was a scalable answer for that, and long before we had a shell like Bash, there was the Multics project. It was a hypothetical operating system where hundreds or even thousands of developers could share time on the same system. This was a dream of John McCarthy, creator of Lisp and the term artificial intelligence (AI), as I recently explored.

Joy Lisi Rankin, author of A people's history of computing in the United States, describes what happened next. There was a lot of public interest in driving forward with Multics' vision of more universally available timesharing. Academics, scientists, educators, and some in the broader public were looking forward to this computer-powered future. Many advocated for computing as a public utility, akin to electricity, and the push toward timesharing was a global movement.

Up to that point, high-end mainframes topped out at 40-50 terminals per system. The change of scale was ambitious and eventually failed, as Warren Toomey writes in IEEE Spectrum :

"Over five years, AT&T invested millions in the Multics project, purchasing a GE-645 mainframe computer and dedicating to the effort many of the top researchers at the company's renowned Bell Telephone Laboratories -- including Thompson and Ritchie, Joseph F. Ossanna, Stuart Feldman, M. Douglas McIlroy, and the late Robert Morris. But the new system was too ambitious, and it fell troublingly behind schedule. In the end, AT&T's corporate leaders decided to pull the plug."

Bell Labs pulled out of the Multics program in 1969. Multics wasn't going to happen.

The fellowship of the C

Funding wrapped up, and the powerful GE645 mainframe was assigned to other tasks inside Bell Labs. But that didn't discourage everyone.

Among the last holdouts from the Multics project were four men who felt passionately tied to the project: Ken Thompson, Dennis Ritchie, Doug McIlroy, and J.F. Ossanna. These four diehards continued to muse and scribble ideas on paper. Thompson and Ritchie developed a game called Space Travel for the PDP-7 minicomputer. While they were working on that, Thompson started implementing all those crazy hand-written ideas about filesystems they'd developed among the wreckage of Multics.

A PDP-7 minicomputer was not top-of-the-line technology at the time, but the team implemented foundational technologies that changed the future of programming languages and operating systems alike.

That's worth emphasizing: Some of the original filesystem specifications were written by hand and then programmed on what was effectively a toy compared to the systems they were using to build Multics. Wikipedia's Ken Thompson page dives deeper into what came next:

"While writing Multics, Thompson created the Bon programming language. He also created a video game called Space Travel . Later, Bell Labs withdrew from the MULTICS project. In order to go on playing the game, Thompson found an old PDP-7 machine and rewrote Space Travel on it. Eventually, the tools developed by Thompson became the Unix operating system : Working on a PDP-7, a team of Bell Labs researchers led by Thompson and Ritchie, and including Rudd Canaday, developed a hierarchical file system , the concepts of computer processes and device files , a command-line interpreter , pipes for easy inter-process communication, and some small utility programs. In 1970, Brian Kernighan suggested the name 'Unix,' in a pun on the name 'Multics.' After initial work on Unix, Thompson decided that Unix needed a system programming language and created B , a precursor to Ritchie's C ."

As Warren Toomey documented in the IEEE Spectrum article mentioned above, Unix showed promise in a way the Multics project never did. After winning over the team and doing a lot more programming, the pathway to Unix was paved.

Getting from B to C in Unix

Thompson quickly created a Unix language he called B. B inherited much from its predecessor BCPL, but it wasn't enough of a breakaway from older languages. B didn't know data types, for starters. It's considered a typeless language, which meant its "Hello World" program looked like this:

main( ) {
extrn a, b, c;
putchar(a); putchar(b); putchar(c); putchar('!*n');
}

a 'hell';
b 'o, w';
c 'orld';

Even if you're not a programmer, it's clear that carving up strings four characters at a time would be limiting. It's also worth noting that this text is considered the original "Hello World" from Brian Kernighan's 1972 book, A tutorial introduction to the language B (although that claim is not definitive).


Typelessness aside, B's assembly-language counterparts were still yielding programs faster than was possible using the B compiler's threaded-code technique. So, from 1971 to 1973, Ritchie modified B. He added a "character type" and built a new compiler so that it didn't have to use threaded code anymore. After two years of work, B had become C.

The right abstraction at the right time

C's use of types and ease of compiling down to efficient assembly code made it the perfect language for the rise of minicomputers, which speak in machine code. B was eventually overtaken by C. Once C became the language of Unix, it became the de facto standard across the budding computer industry. Unix was the sharing platform of the pre-internet days. The more people wrote C, the better it got, and the more it was adopted. It eventually became an open standard itself. According to the Brief history of C programming language:

"For many years, the de facto standard for C was the version supplied with the Unix operating system. In the summer of 1983 a committee was established to create an ANSI (American National Standards Institute) standard that would define the C language. The standardization process took six years (much longer than anyone reasonably expected)."

How influential is C today? A quick review reveals:

  • Parts of all major operating systems are written in C, including macOS, Windows, Linux, and Android.
  • The world's most prolific databases, including DB2, MySQL, MS SQL, and PostgreSQL, are written in C.
  • The core implementations of many programming languages are written in C, including Python, Go, Perl's core interpreter, and the R statistical language.

Decades after they started as scrappy outsiders, Thompson and Ritchie are praised as titans of the programming world. They shared 1983's Turing Award, and in 1998, received the National Medal of Science for their work on the C language and Unix.


But Doug McIlroy and J.F. Ossanna deserve their share of praise, too. All four of them are true Command Line Heroes.

Wrapping up the season

Command Line Heroes has completed an entire season of insights into the programming languages that affect how we code today. It's been a joy to learn about these languages and share them with you. I hope you've enjoyed it as well!

[Nov 08, 2019] A Linux user's guide to Logical Volume Management Opensource.com

Nov 08, 2019 | opensource.com

In Figure 1, two complete physical hard drives and one partition from a third hard drive have been combined into a single volume group. Two logical volumes have been created from the space in the volume group, and a filesystem, such as an EXT3 or EXT4 filesystem, has been created on each of the two logical volumes.

Figure 1: LVM allows combining partitions and entire hard drives into Volume Groups.

Adding disk space to a host is fairly straightforward but, in my experience, is done relatively infrequently. The basic steps needed are listed below. You can either create an entirely new volume group or you can add the new space to an existing volume group and either expand an existing logical volume or create a new one.

Adding a new logical volume

There are times when it is necessary to add a new logical volume to a host. For example, after noticing that the directory containing virtual disks for my VirtualBox virtual machines was filling up the /home filesystem, I decided to create a new logical volume in which to store the virtual machine data, including the virtual disks. This would free up a great deal of space in my /home filesystem and also allow me to manage the disk space for the VMs independently.

The basic steps for adding a new logical volume are as follows.

  1. If necessary, install a new hard drive.
  2. Optional: Create a partition on the hard drive.
  3. Create a physical volume (PV) of the complete hard drive or a partition on the hard drive.
  4. Assign the new physical volume to an existing volume group (VG) or create a new volume group.
  5. Create a new logical volume (LV) from the space in the volume group.
  6. Create a filesystem on the new logical volume.
  7. Add appropriate entries to /etc/fstab for mounting the filesystem.
  8. Mount the filesystem.

Now for the details. The following sequence is taken from an example I used as a lab project when teaching about Linux filesystems.

Example

This example shows how to use the CLI to extend an existing volume group to add more space to it, create a new logical volume in that space, and create a filesystem on the logical volume. This procedure can be performed on a running, mounted filesystem.

WARNING: Only the EXT3 and EXT4 filesystems can be resized on the fly on a running, mounted filesystem. Many other filesystems including BTRFS and ZFS cannot be resized.

Install hard drive

If there is not enough space in the volume group on the existing hard drive(s) in the system to add the desired amount of space it may be necessary to add a new hard drive and create the space to add to the Logical Volume. First, install the physical hard drive, and then perform the following steps.

Create Physical Volume from hard drive

It is first necessary to create a new Physical Volume (PV). Use the command below, which assumes that the new hard drive is assigned as /dev/hdd.

pvcreate /dev/hdd

It is not necessary to create a partition of any kind on the new hard drive. The creation of the Physical Volume, which will be recognized by the Logical Volume Manager, can be performed on a newly installed raw disk or on a Linux partition of type 83. If you are going to use the entire hard drive, creating a partition first does not offer any particular advantages and uses disk space for metadata that could otherwise be used as part of the PV.

Extend the existing Volume Group

In this example we will extend an existing volume group rather than creating a new one; you can choose to do it either way. After the Physical Volume has been created, extend the existing Volume Group (VG) to include the space on the new PV. In this example the existing Volume Group is named MyVG01.

vgextend /dev/MyVG01 /dev/hdd
Create the Logical Volume

First create the Logical Volume (LV) from existing free space within the Volume Group. The command below creates a LV with a size of 50GB. The Volume Group name is MyVG01 and the Logical Volume Name is Stuff.

lvcreate -L 50G --name Stuff MyVG01
Create the filesystem

Creating the Logical Volume does not create the filesystem. That task must be performed separately. The command below creates an EXT4 filesystem that fits the newly created Logical Volume.

mkfs -t ext4 /dev/MyVG01/Stuff
Add a filesystem label

Adding a filesystem label makes it easy to identify the filesystem later in case of a crash or other disk related problems.

e2label /dev/MyVG01/Stuff Stuff
Mount the filesystem

At this point you can create a mount point, add an appropriate entry to the /etc/fstab file, and mount the filesystem.

You should also check to verify the volume has been created correctly. You can use the df , lvs, and vgs commands to do this.
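
As a rough sketch of steps 7 and 8 above, using the names from this example (the /Stuff mount point and the fstab options are assumptions, not part of the original article), the remaining commands might look like this:

# Create the mount point, add an fstab entry, and mount the new filesystem
mkdir /Stuff
echo "LABEL=Stuff  /Stuff  ext4  defaults  1 2" >> /etc/fstab
mount /Stuff

# Verify the result
lvs
vgs
df -h /Stuff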

Resizing a logical volume in an LVM filesystem

The need to resize a filesystem has been around since the beginning of the first versions of Unix and has not gone away with Linux. It has gotten easier, however, with Logical Volume Management.

  1. If necessary, install a new hard drive.
  2. Optional: Create a partition on the hard drive.
  3. Create a physical volume (PV) of the complete hard drive or a partition on the hard drive.
  4. Assign the new physical volume to an existing volume group (VG) or create a new volume group.
  5. Create one or more logical volumes (LV) from the space in the volume group, or expand an existing logical volume with some or all of the new space in the volume group.
  6. If you created a new logical volume, create a filesystem on it. If adding space to an existing logical volume, use the resize2fs command to enlarge the filesystem to fill the space in the logical volume.
  7. Add appropriate entries to /etc/fstab for mounting the filesystem.
  8. Mount the filesystem.
Example

This example describes how to resize an existing Logical Volume in an LVM environment using the CLI. It adds about 50GB of space to the /Stuff filesystem. This procedure can be used on a mounted, live filesystem only with the Linux 2.6 Kernel (and higher) and EXT3 and EXT4 filesystems. I do not recommend that you do so on any critical system, but it can be done and I have done so many times; even on the root (/) filesystem. Use your judgment.

WARNING: Only the EXT3 and EXT4 filesystems can be resized on the fly on a running, mounted filesystem. Many other filesystems including BTRFS and ZFS cannot be resized.

Install the hard drive

If there is not enough space on the existing hard drive(s) in the system to add the desired amount of space it may be necessary to add a new hard drive and create the space to add to the Logical Volume. First, install the physical hard drive and then perform the following steps.

Create a Physical Volume from the hard drive

It is first necessary to create a new Physical Volume (PV). Use the command below, which assumes that the new hard drive is assigned as /dev/hdd.

pvcreate /dev/hdd

It is not necessary to create a partition of any kind on the new hard drive. The creation of the Physical Volume, which will be recognized by the Logical Volume Manager, can be performed on a newly installed raw disk or on a Linux partition of type 83. If you are going to use the entire hard drive, creating a partition first does not offer any particular advantages and uses disk space for metadata that could otherwise be used as part of the PV.

Add PV to existing Volume Group

For this example, we will use the new PV to extend an existing Volume Group. After the Physical Volume has been created, extend the existing Volume Group (VG) to include the space on the new PV. In this example, the existing Volume Group is named MyVG01.

vgextend /dev/MyVG01 /dev/hdd
Extend the Logical Volume

Extend the Logical Volume (LV) from existing free space within the Volume Group. The command below expands the LV by 50GB. The Volume Group name is MyVG01 and the Logical Volume Name is Stuff.

lvextend -L +50G /dev/MyVG01/Stuff
Expand the filesystem

Extending the Logical Volume will also expand the filesystem if you use the -r option. If you do not use the -r option, that task must be performed separately. The command below resizes the filesystem to fit the newly resized Logical Volume.

resize2fs /dev/MyVG01/Stuff
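
Alternatively, as a one-step variant using the -r option mentioned above (same example names as before), lvextend can grow the filesystem in the same command:

lvextend -r -L +50G /dev/MyVG01/Stuff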

You should check to verify the resizing has been performed correctly. You can use the df , lvs, and vgs commands to do this.

Tips

Over the years I have learned a few things that can make logical volume management even easier than it already is. Hopefully these tips can prove of some value to you.

  • Use the Extended file systems unless you have a clear reason to use another filesystem. Not all filesystems support resizing, but EXT2, 3, and 4 do. The EXT filesystems are also very fast and efficient. In any event, they can be tuned by a knowledgeable sysadmin to meet the needs of most environments if the default tuning parameters do not.
  • Use meaningful volume and volume group names.
  • Use EXT filesystem labels.

I know that, like me, many sysadmins have resisted the change to Logical Volume Management. I hope that this article will encourage you to at least try LVM. I am really glad that I did; my disk management tasks are much easier since I made the switch.

[Nov 08, 2019] Five open source tools for the remote OpenStack team Opensource.com

Nov 08, 2019 | opensource.com

5 tools to support distributed sysadmin teams. 28 Jul 2016 Elizabeth K. Joseph

We also add in a few more provisos:

  • we must do our work in public
  • we work for different companies
  • we must use open source software for everything we do

The following five open source tools allow us to satisfy these goals while remaining a high-functioning team.

1. Text-based chat

We use Internet Relay Chat (IRC), provided to us by the freenode IRC network, which runs on an open source Internet Relay Chat daemon (IRCd). There is a vast array of open source clients available to connect to it with. Our team channel (a chat room, in IRC terms) is the soul of our team. We discuss ongoing issues and challenges, come up with solutions, inform each other of changes in progress, post project status changes and alerts, and have a bot that reports changes submitted to our infrastructure for review. The IRC channel we use is fully public, and we maintain a server that serves channel logs on a web server where anyone can view them.

The following is a quick snapshot of the channel one morning:

<clarkb> hrm no world dump on that failure?
<openstackgerrit> Anita Kuno proposed openstack-infra/storyboard: Add example commands for the Timeline api https://review.openstack.org/337854
<openstackgerrit> Victor Ryzhenkin proposed openstack-infra/project-config: Add openstack/fuel-plugin-murano-tests project https://review.openstack.org/332151
<clarkb> its definitely an io error of some sort
<clarkb> possibly run out of disk space?
<therve> The df output looks normal...
<greghaynes> or, is it writing out to tmpfs?

It takes some getting used to, but once you're familiar with the flow of the channel and working with the team, these conversations and logs become an invaluable resource for keeping up with our work. You can learn more about IRC, from commands to etiquette guidelines by reading VM (Vicky) Brasseur's pair of articles, Getting started with IRC and An IRC quickstart guide .

There are times when a quick voice call is valuable for high-bandwidth communication, so we also run an Asterisk system to support Voice over IP (VoIP) calls.

Running a private IRCd within a company or other organization is common. There are a variety of open source options available; review them with project activity and security track records in mind to select one that fits your needs. If your team has tried IRC but prefers tooling that has a more modern interface and features, like logs built into the interface even when you're not directly connected, you may want to look into Mattermost, an alternative to proprietary SaaS messaging, which you can read more about in Charlie Reisinger's article on how his organization replaced IRC with Mattermost.

2. Etherpad

Etherpads are hosted collaborative text editors that allow a group to edit a document together in real time. Our team uses these for a variety of purposes: collaborating on announcements that go out to the entire project, sharing ideas, talks and topics for our in-person gatherings, writing out maintenance and upgrade plans, and working through tasks during maintenance windows.

The use of an Etherpad typically goes hand-in-hand with our collaboration on IRC and our mailing list, using the Etherpad as a place to keep track of longer-lived, asynchronous notes while we discuss things more broadly directly with each other. We use the open source Etherpad Lite in our infrastructure.

3. Pastebin

A pastebin allows you to paste a large amount of text and returns a URL that you can easily share with members of the team. For our team, that means the ability to share snippets of logs with members of our team that don't have access to the servers, share configuration files, and share instructions for performing a task that has not yet been documented. We prefer a pastebin to pasting large amounts of text in an IRC channel, or the overhead of an Etherpad for text that makes more sense to be read-only.

There are several open source options for running your own pastebin. We're currently using a version of LodgeIt that we maintain ourselves. Tip: If you're running a public pastebin, use a robots.txt file to stop search engines from indexing it. Pastes rarely contain valuable indexable data, and keeping them out of search results helps keep spammers off your pastebin.
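
For example (the web root path below is a placeholder, not part of the original article), a two-line robots.txt is enough to ask well-behaved crawlers to stay away:

# Adjust the path to wherever your pastebin's web root lives
cat > /var/www/pastebin/robots.txt <<'EOF'
User-agent: *
Disallow: /
EOF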

4. GNU Screen

Officially known as a terminal multiplexer , GNU Screen allows you to run a command, or multiple commands, inside of a screen terminal session and keep applications running after you log out. This has been particularly valuable if we're running a rare long-lived, manually triggered command that would fail if we lose our connection, such as reindexing of our code review system, or performing certain upgrades. Many of the people on our team also use it to keep their IRC clients running 24/7.

Most interestingly, we use GNU Screen sessions for training other root members of our team on system administration tasks that we haven't yet automated or documented. Several users on a system can attach to a screen session at the same time for a collaborative terminal session. We can then walk new members of the team through access to our password vault, or share the procedures for rare or complicated maintenance tasks. It also has built-in tooling to keep a log of the entire session.
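
A minimal sketch of such a shared session (the session name is arbitrary, and both participants are assumed to be logged in as the same user on the same host):

# Trainer: start a named session and log everything it shows to screenlog.0
screen -S maintenance -L

# Trainee: attach to the same session from another terminal
screen -x maintenance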

While it works for us, some critique GNU Screen for being a bit lean on modern features. Alternatives also exist like tmux and Byobu .

5. Git

Git was invented by Linus Torvalds for managing Linux kernel development. Git is now the go-to revision control system for open source projects. It may go without saying that systems administration teams should be using some kind of version control for all changes to their infrastructure, but this is particularly important for a distributed team. With our team spanning time zones, it can be hard to catch up on eight hours of discussion. With Git you can see all changes made to systems by reading commit logs, and learn who changed things.

It also allows us to more easily roll back to prior states, or at least see what prior states existed before possibly disruptive changes were made. When documenting changes more formally for team consumption, the commit history also allows us to effectively describe what and why specific changes were made.
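A minimal sketch of what that looks like in practice (the manifests/ path is a made-up example, and <commit> stands for a real commit hash):

git log --oneline --stat -- manifests/      # who changed what, and when
git show <commit>                           # the full diff plus the commit message explaining why
git revert <commit>                         # roll back a disruptive change with a new, documented commit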

Tip: Always make sure you include why a change was made in your commit message. We can typically figure out what was changed by reading the commit itself, but the rationale can be trickier to track down after a change is made, particularly several months later when few people remember the details.

Learn more about the OpenStack infrastructure, and the other tools we use to support OpenStack's massive development community, by visiting the OpenStack Project Infrastructure documentation.

[Nov 08, 2019] 10 killer tools for the admin in a hurry Opensource.com

Nov 08, 2019 | opensource.com

NixCraft
Use the site's internal search function. With more than a decade of regular updates, there's gold to be found here -- useful scripts and handy hints that can solve your problem straight away. This is often the second place I look after Google.

Webmin
This gives you a nice web interface to remotely edit your configuration files. It cuts down on a lot of time spent having to juggle directory paths and sudo nano , which is handy when you're handling several customers.

Windows Subsystem for Linux
The reality of the modern workplace is that most employees are on Windows, while the grown-up gear in the server room is on Linux. So sometimes you find yourself trying to do admin tasks from (gasp) a Windows desktop.

What do you do? Install a virtual machine? It's actually much faster and far less work to configure if you install the Windows Subsystem for Linux compatibility layer, now available at no cost on Windows 10.

This gives you a Bash terminal in a window where you can run Bash scripts and Linux binaries on the local machine, have full access to both Windows and Linux filesystems, and mount network drives. It's available in Ubuntu, OpenSUSE, SLES, Debian, and Kali flavors.

mRemoteNG
This is an excellent SSH and remote desktop client for when you have 100+ servers to manage.

Setting up a network so you don't have to do it again

A poorly planned network is the sworn enemy of the admin who hates working overtime.

IP Addressing Schemes that Scale
The diabolical thing about running out of IP addresses is that, when it happens, the network's grown large enough that a new addressing scheme is an expensive, time-consuming pain in the proverbial.

Ain't nobody got time for that!

At some point, IPv6 will finally arrive to save the day. Until then, these one-size-fits-most IP addressing schemes should keep you going, no matter how many network-connected wearables, tablets, smart locks, lights, security cameras, VoIP headsets, and espresso machines the world throws at us.

Linux Chmod Permissions Cheat Sheet
A short but sweet cheat sheet of Bash commands to set permissions across the network. This is so when Bill from Customer Service falls for that ransomware scam, you're recovering just his files and not the entire company's.
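For example, re-tightening a department share after a cleanup might look like this (the path and group are hypothetical; the setgid bit on directories keeps new files in the right group):

find /srv/share/customer-service -type d -exec chmod 2770 {} +   # directories: rwx for owner and group, plus setgid
find /srv/share/customer-service -type f -exec chmod 660 {} +    # files: read/write for owner and group only
chgrp -R custsvc /srv/share/customer-service                     # put everything back in the right group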

VLSM Subnet Calculator
Just put in the number of networks you want to create from an address space and the number of hosts you want per network, and it calculates what the subnet mask should be for everything.
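As a quick sanity check on whatever the calculator tells you, the arithmetic for a small case runs like this (addresses are illustrative):

Requirement: 4 subnets with up to 50 hosts each, carved from 192.168.10.0/24
Host bits:   6, because 2^6 - 2 = 62 usable addresses, which covers 50
Mask:        32 - 6 = /26, i.e. 255.255.255.192
Subnets:     192.168.10.0/26, 192.168.10.64/26, 192.168.10.128/26, 192.168.10.192/26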

Single-purpose Linux distributions

Need a Linux box that does just one thing? It helps if someone else has already sweated the small stuff on an operating system you can install and have ready immediately.

Each of these has, at one point, made my work day so much easier.

Porteus Kiosk
This is for when you want a computer totally locked down to just a web browser. With a little tweaking, you can even lock the browser down to just one website. This is great for public access machines. It works with touchscreens or with a keyboard and mouse.

Parted Magic
This is an operating system you can boot from a USB drive to partition hard drives, recover data, and run benchmarking tools.

IPFire
Hahahaha, I still can't believe someone called a router/firewall/proxy combo "I pee fire." That's my second favorite thing about this Linux distribution. My favorite is that it's a seriously solid software suite. It's so easy to set up and configure, and there is a heap of plugins available to extend it.

What about your top tools and cheat sheets?

So, how about you? What tools, resources, and cheat sheets have you found to make the workday easier? I'd love to know. Please share in the comments.

[Nov 08, 2019] Command-line tools for collecting system statistics Opensource.com

Nov 08, 2019 | opensource.com

Examining collected data

The output from the sar command can be detailed, or you can choose to limit the data displayed. For example, enter the sar command with no options, which displays only aggregate CPU performance data. The sar command uses the current day by default, starting at midnight, so you should only see the CPU data for today.

On the other hand, using the sar -A command shows all of the data that has been collected for today. Enter the sar -A | less command now and page through the output to view the many types of data collected by SAR, including disk and network usage, CPU context switches (how many times per second the CPU switched from one program to another), page swaps, memory and swap space usage, and much more. Use the man page for the sar command to interpret the results and to get an idea of the many options available. Many of those options allow you to view specific data, such as network and disk performance.

I typically use the sar -A command because many of the types of data available are interrelated, and sometimes I find something that gives me a clue to a performance problem in a section of the output that I might not have looked at otherwise. The -A option displays all of the collected data types.

Look at the entire output of the sar -A | less command to get a feel for the type and amount of data displayed. Be sure to look at the CPU usage data as well as the processes started per second (proc/s) and context switches per second (cswch/s). If the number of context switches increases rapidly, that can indicate that running processes are being swapped off the CPU very frequently.

You can limit the total amount of data to the total CPU activity with the sar -u command. Try that and notice that you only get the composite CPU data, not the data for the individual CPUs. Also try the -r option for memory, and -S for swap space. You can also combine these options; the following command displays CPU, memory, and swap space data:

sar -urS

Using the -p option displays block device names for hard drives instead of the much more cryptic device identifiers, and -d displays only the block devices -- the hard drives. Issue the following command to view all of the block device data in a readable format using the names as they are found in the /dev directory:

sar -dp | less

If you want only data between certain times, you can use -s and -e to define the start and end times, respectively. The following command displays all CPU data, both individual and aggregate for the time period between 7:50 AM and 8:11 AM today:

sar -P ALL -s 07:50:00 -e 08:11:00

Note that all times must be in 24-hour format. If you have multiple CPUs, each CPU is detailed individually, and the average for all CPUs is also given.

The next command uses the -n option to display network statistics for all interfaces:

sar -n ALL | less
Data for previous days

Data collected for previous days can also be examined by specifying the desired log file. Assume that today's date is September 3 and you want to see the data for yesterday; the following command displays all collected data for September 2. The last two digits of each file name are the day of the month on which the data was collected:

sar -A -f /var/log/sa/sa02 | less

You can use the command below, where DD is the day of the month for yesterday:

sar -A -f /var/log/sa/saDD | less
Realtime data

You can also use SAR to display (nearly) realtime data. The following command displays memory usage in 5- second intervals for 10 iterations:

sar -r 5 10

This is an interesting option for sar as it can provide a series of data points for a defined period of time that can be examined in detail and compared.

The /proc filesystem

All of this data for SAR and the system monitoring tools covered in my previous article must come from somewhere. Fortunately, all of that kernel data is easily available in the /proc filesystem. In fact, because the kernel performance data stored there is all in ASCII text format, it can be displayed using simple commands like cat so that the individual programs do not have to load their own kernel modules to collect it. This saves system resources and makes the data more accurate. SAR and the system monitoring tools I have discussed in my previous article all collect their data from the /proc filesystem.

Note that /proc is a virtual filesystem and only exists in RAM while Linux is running. It is not stored on the hard drive.

Even though I won't get into detail, the /proc filesystem also contains the live kernel tuning parameters and variables. Thus you can change the kernel tuning by simply changing the appropriate kernel tuning variable in /proc; no reboot is required.
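For example (vm.swappiness is just a convenient, low-risk parameter to experiment with; the change below lasts only until the next reboot):

cat /proc/sys/vm/swappiness          # read the current value
echo 10 > /proc/sys/vm/swappiness    # change it for the running kernel only
sysctl vm.swappiness                 # the sysctl command reads the same /proc/sys file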

Change to the /proc directory and list the files there. You will see, in addition to the data files, a large number of numbered directories. Each of these directories represents a process, and the directory name is the process ID (PID). You can delve into those directories to locate information about individual processes that might be of interest.

To view this data, simply cat some of the following files:

  • cmdline -- displays the kernel command line, including all parameters passed to it.
  • cpuinfo -- displays information about the CPU(s), including flags, model name, stepping, and cache size.
  • meminfo -- displays very detailed information about memory, including data such as active and inactive memory and total virtual memory allocated and in use, which is not always displayed by other tools.
  • iomem and ioports -- lists the memory ranges and ports defined for various I/O devices.
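A few quick examples of pulling specific values out of these files:

grep "model name" /proc/cpuinfo      # CPU model for each core
grep MemTotal /proc/meminfo          # total installed memory
cat /proc/cmdline                    # the kernel command line used at boot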

You will see that, although the data is available in these files, much of it is not annotated in any way. That means you will have some work to do to identify and extract the desired data. However, the monitoring tools discussed earlier already do that for the data they are designed to display.

There is so much more data in the /proc filesystem that the best way to learn more about it is to refer to the proc(5) man page, which contains detailed information about the various files found there.

Next time I will pull all this together and discuss how I have used these tools to solve problems.

David Both - David Both is an Open Source Software and GNU/Linux advocate, trainer, writer, and speaker who lives in Raleigh, North Carolina. He is a strong proponent of and evangelist for the "Linux Philosophy." David has been in the IT industry for nearly 50 years. He has taught RHCE classes for Red Hat and has worked at MCI Worldcom, Cisco, and the State of North Carolina. He has been working with Linux and Open Source Software for over 20 years.

[Nov 08, 2019] How to use Sanoid to recover from data disasters Opensource.com

Nov 08, 2019 | opensource.com

Sanoid's companion tool, syncoid, uses filesystem-level snapshot replication to move data from one machine to another, fast. For enormous blobs like virtual machine images, we're talking several orders of magnitude faster than rsync.

If that isn't cool enough already, you don't even necessarily need to restore from backup if you lost the production hardware; you can just boot up the VM directly on the local hotspare hardware, or the remote disaster recovery hardware, as appropriate. So even in case of catastrophic hardware failure , you're still looking at that 59m RPO, <1m RTO.

https://www.youtube.com/embed/5hEixXutaPo

Backups -- and recoveries -- don't get much easier than this.

The syntax is dead simple:

root@box1:~# syncoid pool/images/vmname root@box2:poolname/images/vmname

Or if you have lots of VMs, like I usually do... recursion!

root@box1:~# syncoid -r pool/images/vmname root@box2:poolname/images/vmname

This makes it not only possible, but easy to replicate multiple-terabyte VM images hourly over a local network, and daily over a VPN. We're not talking enterprise 100mbps symmetrical fiber, either. Most of my clients have 5mbps or less available for upload, which doesn't keep them from automated, nightly over-the-air backups, usually to a machine sitting quietly in an owner's house.
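One way to automate that schedule is a plain cron entry. The sketch below assumes syncoid is installed in /usr/local/bin and uses made-up host and pool names:

# /etc/cron.d/syncoid -- hourly to the local hot spare, nightly to the offsite box
0 * * * *   root  /usr/local/bin/syncoid -r pool/images root@hotspare:pool/images
30 1 * * *  root  /usr/local/bin/syncoid -r pool/images root@offsite:backup/images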

Preventing your own Humpty Level Events

Sanoid is open source software, and so are all its dependencies. You can run Sanoid and Syncoid themselves on pretty much anything with ZFS. I developed it and use it on Linux myself, but people are using it (and I support it) on OpenIndiana, FreeBSD, and FreeNAS too.

You can find the GPLv3 licensed code on the website (which actually just redirects to Sanoid's GitHub project page), and there's also a Chef Cookbook and an Arch AUR repo available from third parties.

[Nov 08, 2019] Introduction to the Linux chgrp and newgrp commands by Alan Formy-Duval

Sep 23, 2019 | opensource.com
The chgrp and newgrp commands help you manage files that need to maintain group ownership. In a recent article, I introduced the chown command, which is used for modifying ownership of files on systems. Recall that ownership is the combination of the user and group assigned to an object. The chgrp and newgrp commands provide additional help for managing files that need to maintain group ownership.

Using chgrp

The chgrp command simply changes the group ownership of a file. It is the same as the chown :<group> command. You can use:

$ chown :alan mynotes

or:

$ chgrp alan mynotes
Recursive

A few additional arguments to chgrp can be useful at both the command line and in a script. Just like many other Linux commands, chgrp has a recursive argument, -R . You will need this to operate on a directory and its contents recursively, as I'll demonstrate below. I added the -v ( verbose ) argument so chgrp tells me what it is doing:

$ ls -l . conf
.:
drwxrwxr-x 2 alan alan 4096 Aug 5 15:33 conf

conf:
-rw-rw-r-- 1 alan alan 0 Aug 5 15:33 conf.xml
# chgrp -vR delta conf
changed group of 'conf/conf.xml' from alan to delta
changed group of 'conf' from alan to delta

Reference


A reference file ( --reference=RFILE ) can be used when changing the group on files to match a certain configuration or when you don't know the group, as might be the case when running a script. You can duplicate another file's group ( RFILE ), referred to as a reference file. For example, to undo the changes made above (recall that a dot [ . ] refers to the present working directory):
$ chgrp -vR --reference=. conf
Report changes

Most commands have arguments for controlling their output. The most common is -v to enable verbose, and the chgrp command has a verbose mode. It also has a -c ( --changes ) argument, which instructs chgrp to report only when a change is made. Chgrp will still report other things, such as if an operation is not permitted.

The argument -f ( --silent , --quiet ) is used to suppress most error messages. I will use this argument and -c in the next section so it will show only actual changes.

Preserve root

The root ( / ) of the Linux filesystem should be treated with great respect. If a command mistake is made at this level, the consequences can be terrible and leave a system completely useless, particularly when you are running a recursive command that makes any kind of change -- or worse, deletions. The chgrp command has an argument that can be used to protect and preserve the root: --preserve-root. If this argument is used with a recursive chgrp command on the root, nothing will happen and a message will appear instead:

[root@localhost /]# chgrp -cfR --preserve-root alan /
chgrp: it is dangerous to operate recursively on '/'
chgrp: use --no-preserve-root to override this failsafe

The option has no effect when it's not used in conjunction with recursive. However, if the command is run by the root user, the group of / itself will change, but not that of other files or directories within it:

[alan@localhost /]$ chgrp -c --preserve-root alan /
chgrp: changing group of '/': Operation not permitted
[root@localhost /]# chgrp -c --preserve-root alan /
changed group of '/' from root to alan

Surprisingly, --preserve-root is not the default; --no-preserve-root is. If you run the command above without the "preserve" option, it will default to "no preserve" mode and possibly change the group ownership of files that shouldn't be changed:

[alan@localhost /]$ chgrp -cfR alan /
changed group of '/dev/pts/0' from tty to alan
changed group of '/dev/tty2' from tty to alan
changed group of '/var/spool/mail/alan' from mail to alan

About newgrp

The newgrp command allows a user to override the current primary group. newgrp can be handy when you are working in a directory where all files must have the same group ownership. Suppose you have a directory called share on your intranet server where different teams store marketing photos. The group is share . As different users place files into the directory, the files' primary groups might become mixed up. Whenever new files are added, you can run chgrp to correct any mix-ups by setting the group to share :

$ cd share
$ ls -l
-rw-r--r--. 1 alan share 0 Aug 7 15:35 pic13
-rw-r--r--. 1 alan alan 0 Aug 7 15:35 pic1
-rw-r--r--. 1 susan delta 0 Aug 7 15:35 pic2
-rw-r--r--. 1 james gamma 0 Aug 7 15:35 pic3
-rw-rw-r--. 1 bill contract 0 Aug 7 15:36 pic4

I covered setgid mode in my article on the chmod command . This would be one way to solve this problem. But, suppose the setgid bit was not set for some reason. The newgrp command is useful in this situation. Before any users put files into the share directory, they can run the command newgrp share . This switches their primary group to share so all files they put into the directory will automatically have the group share , rather than the user's primary group. Once they are finished, users can switch back to their regular primary group with (for example):

newgrp alan
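For completeness, the setgid alternative mentioned above would look something like this on the shared directory (the path is an example):

chgrp share /srv/share    # give the directory itself the share group
chmod g+s /srv/share      # setgid: files created here inherit the directory's group automatically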
Conclusion

It is important to understand how to manage users, groups, and permissions. It is also good to know a few alternative ways to work around problems you might encounter since not all environments are set up the same way.

[Nov 07, 2019] Is BDFL a death sentence Opensource.com

Nov 07, 2019 | opensource.com

What happens when a Benevolent Dictator For Life moves on from an open source project? 16 Jul 2018 Jason Baker (Red Hat)


Guido van Rossum, creator of the Python programming language and Benevolent Dictator For Life (BDFL) of the project, announced his intention to step away.

Below is a portion of his message; the entire email is not terribly long and is worth taking the time to read if you're interested in the circumstances leading to van Rossum's departure.

I would like to remove myself entirely from the decision process. I'll still be there for a while as an ordinary core dev, and I'll still be available to mentor people -- possibly more available. But I'm basically giving myself a permanent vacation from being BDFL, and you all will be on your own.

After all that's eventually going to happen regardless -- there's still that bus lurking around the corner, and I'm not getting younger... (I'll spare you the list of medical issues.)

I am not going to appoint a successor.

So what are you all going to do? Create a democracy? Anarchy? A dictatorship? A federation?

It's worth zooming out for a moment to consider the issue at a larger scale. How an open source project is governed can have very real consequences on the long-term sustainability of its user and developer communities alike.

BDFLs tend to emerge from passion projects, where a single individual takes on a project before growing a community around it. Projects emerging from companies or other large organizations often lack this role, as the distribution of authority is more formalized, or at least more dispersed, from the start. Even then, it's not uncommon to need to figure out how to transition from one form of project governance to another as the community grows and expands.


But regardless of how an open source project is structured, ultimately, there needs to be some mechanism for deciding how to make technical decisions. Someone, or some group, has to decide which commits to accept, which to reject, and more broadly what direction the project is going to take from a technical perspective.

Surely the Python project will be okay without van Rossum. The Python Software Foundation has plenty of formalized structure in place bringing in broad representation from across the community. There's even been a humorous April Fools Python Enhancement Proposal (PEP) addressing the BDFL's retirement in the past.

That said, it's interesting that van Rossum did not heed the fifth lesson of Eric S. Raymond's essay, The Mail Must Get Through (part of The Cathedral & the Bazaar), which stipulates: "When you lose interest in a program, your last duty to it is to hand it off to a competent successor." One could certainly argue that letting the community pick its own leadership is an equally valid choice.

What do you think? Are projects better or worse for being run by a BDFL? What can we expect when a BDFL moves on? And can someone truly step away from their passion project after decades of leading it? Will we still turn to them for the hard decisions, or can a community smoothly transition to new leadership without the pitfalls of forks or lost participants?

Can you truly stop being a BDFL? Or is it a title you'll hold, at least informally, until your death?

Jason Baker - I use technology to make the world more open. Linux desktop enthusiast. Map/geospatial nerd. Raspberry Pi tinkerer. Data analysis and visualization geek. Occasional coder. Cloud nativist. Civic tech and open government booster.

2 Comments

Mike James on 17 Jul 2018 Permalink

My take on the issue:
https://www.i-programmer.info/news/216-python/11967-guido-van-rossum-qui...

Maxim Stewart on 05 Aug 2018 Permalink

"So what are you all going to do? Create a democracy? Anarchy? A dictatorship? A federation?"

Power coalesced to one point is always scary when thought about in the context of succession. A vacuum invites anarchy and I often think about this for when Linus Torvalds leaves the picture. We really have no concrete answers for what is the best way forward but my hope is towards a democratic process. But, as current history indicates, a democracy untended by its citizens invites quite the nightmare and so too does this translate to the keeping up of a project.

[Nov 07, 2019] Manage NTP with Chrony Opensource.com

Nov 07, 2019 | opensource.com

Chronyd is a better choice for most networks than ntpd for keeping computers synchronized with the Network Time Protocol. 03 Dec 2018 David Both (Community Moderator)



"Does anybody really know what time it is? Does anybody really care?"
Chicago , 1969

Perhaps that rock group didn't care what time it was, but our computers do need to know the exact time. Timekeeping is very important to computer networks. In banking, stock markets, and other financial businesses, transactions must be maintained in the proper order, and exact time sequences are critical for that. For sysadmins and DevOps professionals, it's easier to follow the trail of email through a series of servers or to determine the exact sequence of events using log files on geographically dispersed hosts when exact times are kept on the computers in question.

I used to work at an organization that received over 20 million emails per day and had four servers just to accept and do a basic filter on the incoming flood of email. From there, emails were sent to one of four other servers to perform more complex anti-spam assessments, then they were delivered to one of several additional servers where the emails were placed in the correct inboxes. At each layer, the emails would be sent to one of the next-level servers, selected only by the randomness of round-robin DNS. Sometimes we had to trace a new message through the system until we could determine where it "got lost," according to the pointy-haired bosses. We had to do this with frightening regularity.

Most of that email turned out to be spam. Some people actually complained that their [joke, cat pic, recipe, inspirational saying, or other-strange-email]-of-the-day was missing and asked us to find it. We did reject those opportunities.

Our email and other transactional searches were aided by log entries with timestamps that -- today -- can resolve down to the nanosecond in even the slowest of modern Linux computers. In very high-volume transaction environments, even a few microseconds of difference in the system clocks can mean sorting thousands of transactions to find the correct one(s).

The NTP server hierarchy

Computers worldwide use the Network Time Protocol (NTP) to synchronize their times with internet standard reference clocks via a hierarchy of NTP servers. The primary servers are at stratum 1, and they are connected directly to various national time services at stratum 0 via satellite, radio, or even modems over phone lines. The time service at stratum 0 may be an atomic clock, a radio receiver tuned to the signals broadcast by an atomic clock, or a GPS receiver using the highly accurate clock signals broadcast by GPS satellites.

To prevent time requests from time servers lower in the hierarchy (i.e., with a higher stratum number) from overwhelming the primary reference servers, there are several thousand public NTP stratum 2 servers that are open and available for anyone to use. Many organizations with large numbers of hosts that need an NTP server will set up their own time servers so that only one local host accesses the stratum 2 time servers, then they configure the remaining network hosts to use the local time server which, in my case, is a stratum 3 server.

NTP choices

The original NTP daemon, ntpd , has been joined by a newer one, chronyd . Both keep the local host's time synchronized with the time server. Both services are available, and I have seen nothing to indicate that this will change anytime soon.

Chrony has features that make it the better choice for most environments for the following reasons:

  • Chrony can synchronize to the time server much faster than NTP. This is good for laptops or desktops that don't run constantly.
  • It can compensate for fluctuating clock frequencies, such as when a host hibernates or enters sleep mode, or when the clock speed varies due to frequency stepping that slows clock speeds when loads are low.
  • It handles intermittent network connections and bandwidth saturation.
  • It adjusts for network delays and latency.
  • After the initial time sync, Chrony never steps the clock. This ensures stable and consistent time intervals for system services and applications.
  • Chrony can work even without a network connection. In this case, the local host or server can be updated manually.

The NTP and Chrony RPM packages are available from standard Fedora repositories. You can install both and switch between them, but modern Fedora, CentOS, and RHEL releases have moved from NTP to Chrony as their default timekeeping implementation. I have found that Chrony works well, provides a better interface for the sysadmin, presents much more information, and gives more control.
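To check which implementation a host is actually running, and to install Chrony if it is missing, something like this works on Fedora, CentOS, and RHEL (use yum instead of dnf on older releases):

systemctl status chronyd            # is chronyd installed and active?
dnf install chrony                  # install it if not
systemctl enable --now chronyd      # start it now and enable it at boot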

Just to be clear: NTP is a protocol, and ntpd and Chrony are two implementations of it. If you'd like to know more, read this comparison between NTP and Chrony as implementations of the NTP protocol.

This article explains how to configure Chrony clients and servers on a Fedora host, but the configuration for CentOS and RHEL current releases works the same.

Chrony structure

The Chrony daemon, chronyd , runs in the background and monitors the time and status of the time server specified in the chrony.conf file. If the local time needs to be adjusted, chronyd does it smoothly without the programmatic trauma that would occur if the clock were instantly reset to a new time.

Chrony's chronyc tool allows someone to monitor the current status of Chrony and make changes if necessary. The chronyc utility can be used as a command that accepts subcommands, or it can be used as an interactive text-mode program. This article will explain both uses.

Client configuration

The NTP client configuration is simple and requires little or no intervention. The NTP server can be defined during the Linux installation or provided by the DHCP server at boot time. The default /etc/chrony.conf file (shown below in its entirety) requires no intervention to work properly as a client. For Fedora, Chrony uses the Fedora NTP pool, and CentOS and RHEL have their own NTP server pools. Like many Red Hat-based distributions, the configuration file is well commented.

# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
pool 2.fedora.pool.ntp.org iburst

# Record the rate at which the system clock gains/losses time.
driftfile /var/lib/chrony/drift

# Allow the system clock to be stepped in the first three updates
# if its offset is larger than 1 second.
makestep 1.0 3

# Enable kernel synchronization of the real-time clock (RTC).

# Enable hardware timestamping on all interfaces that support it.
#hwtimestamp *

# Increase the minimum number of selectable sources required to adjust
# the system clock.
#minsources 2

# Allow NTP client access from local network.
#allow 192.168.0.0/16

# Serve time even if not synchronized to a time source.
#local stratum 10

# Specify file containing keys for NTP authentication.
keyfile /etc/chrony.keys

# Get TAI-UTC offset and leap seconds from the system tz database.
leapsectz right/UTC

# Specify directory for log files.
logdir /var/log/chrony

# Select which information is logged.
#log measurements statistics tracking

Let's look at the current status of NTP on a virtual machine I use for testing. The chronyc command, when used with the tracking subcommand, provides statistics that report how far off the local system is from the reference server.

[root@studentvm1 ~]# chronyc tracking
Reference ID : 23ABED4D (ec2-35-171-237-77.compute-1.amazonaws.com)
Stratum : 3
Ref time (UTC) : Fri Nov 16 16:21:30 2018
System time : 0.000645622 seconds slow of NTP time
Last offset : -0.000308577 seconds
RMS offset : 0.000786140 seconds
Frequency : 0.147 ppm slow
Residual freq : -0.073 ppm
Skew : 0.062 ppm
Root delay : 0.041452706 seconds
Root dispersion : 0.022665167 seconds
Update interval : 1044.2 seconds
Leap status : Normal
[root@studentvm1 ~]#

The Reference ID in the first line of the result is the server the host is synchronized to -- in this case, a stratum 3 reference server that was last contacted by the host at 16:21:30 2018. The other lines are described in the chronyc(1) man page .

The sources subcommand is also useful because it provides information about the time source configured in chrony.conf .

[root@studentvm1 ~]# chronyc sources
210 Number of sources = 5
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^+ 192.168.0.51 3 6 377 0 -2613us[-2613us] +/- 63ms
^+ dev.smatwebdesign.com 3 10 377 28m -2961us[-3534us] +/- 113ms
^+ propjet.latt.net 2 10 377 465 -1097us[-1085us] +/- 77ms
^* ec2-35-171-237-77.comput> 2 10 377 83 +2388us[+2395us] +/- 95ms
^+ PBX.cytranet.net 3 10 377 507 -1602us[-1589us] +/- 96ms
[root@studentvm1 ~]#

The first source in the list is the time server I set up for my personal network. The others were provided by the pool. Even though my NTP server doesn't appear in the Chrony configuration file above, my DHCP server provides its IP address for the NTP server. The "S" column -- Source State -- indicates with an asterisk ( * ) the server our host is synced to. This is consistent with the data from the tracking subcommand.

The -v option provides a nice description of the fields in this output.

[root@studentvm1 ~]# chronyc sources -v
210 Number of sources = 5

.-- Source mode '^' = server, '=' = peer, '#' = local clock.
/ .- Source state '*' = current synced, '+' = combined , '-' = not combined,
| / '?' = unreachable, 'x' = time may be in error, '~' = time too variable.
|| .- xxxx [ yyyy ] +/- zzzz
|| Reachability register (octal) -. | xxxx = adjusted offset,
|| Log2(Polling interval) --. | | yyyy = measured offset,
|| \ | | zzzz = estimated error.
|| | | \
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^+ 192.168.0.51 3 7 377 28 -2156us[-2156us] +/- 63ms
^+ triton.ellipse.net 2 10 377 24 +5716us[+5716us] +/- 62ms
^+ lithium.constant.com 2 10 377 351 -820us[ -820us] +/- 64ms
^* t2.time.bf1.yahoo.com 2 10 377 453 -992us[ -965us] +/- 46ms
^- ntp.idealab.com 2 10 377 799 +3653us[+3674us] +/- 87ms
[root@studentvm1 ~]#

If I wanted my server to be the preferred reference time source for this host, I would add the line below to the /etc/chrony.conf file.

server 192.168.0.51 iburst prefer

I usually place this line just above the first pool server statement near the top of the file. There is no special reason for this, except I like to keep the server statements together. It would work just as well at the bottom of the file, and I have done that on several hosts. This configuration file is not sequence-sensitive.

The prefer option marks this as the preferred reference source. As such, this host will always be synchronized with this reference source (as long as it is available). We can also use the fully qualified hostname for a remote reference server or the hostname only (without the domain name) for a local reference time source as long as the search statement is set in the /etc/resolv.conf file. I prefer the IP address to ensure that the time source is accessible even if DNS is not working. In most environments, the server name is probably the better option, because NTP will continue to work even if the server's IP address changes.

If you don't have a specific reference source you want to synchronize to, it is fine to use the defaults.

Configuring an NTP server with Chrony

The nice thing about the Chrony configuration file is that this single file configures the host as both a client and a server. To add a server function to our host -- it will always be a client, obtaining its time from a reference server -- we just need to make a couple of changes to the Chrony configuration, then configure the host's firewall to accept NTP requests.

Open the /etc/chrony.conf file in your favorite text editor and uncomment the local stratum 10 line. This enables the Chrony NTP server to continue to act as if it were connected to a remote reference server if the internet connection fails; this enables the host to continue to be an NTP server to other hosts on the local network.

Let's restart chronyd and track how the service is working for a few minutes. Before we enable our host as an NTP server, we want to test a bit.

[root@studentvm1 ~]# systemctl restart chronyd ; watch chronyc tracking

The results should look like this. The watch command runs the chronyc tracking command every two seconds so we can watch changes occur over time.

Every 2.0s: chronyc tracking studentvm1: Fri Nov 16 20:59:31 2018

Reference ID : C0A80033 (192.168.0.51)
Stratum : 4
Ref time (UTC) : Sat Nov 17 01:58:51 2018
System time : 0.001598277 seconds fast of NTP time
Last offset : +0.001791533 seconds
RMS offset : 0.001791533 seconds
Frequency : 0.546 ppm slow
Residual freq : -0.175 ppm
Skew : 0.168 ppm
Root delay : 0.094823152 seconds
Root dispersion : 0.021242738 seconds
Update interval : 65.0 seconds
Leap status : Normal

Notice that my NTP server, the studentvm1 host, synchronizes to the host at 192.168.0.51, which is my internal network NTP server, at stratum 4. Synchronizing directly to the Fedora pool machines would result in synchronization at stratum 3. Notice also that the amount of error decreases over time. Eventually, it should stabilize with a tiny variation around a fairly small range of error. The size of the error depends upon the stratum and other network factors. After a few minutes, use Ctrl+C to break out of the watch loop.

To turn our host into an NTP server, we need to allow it to listen on the local network. Uncomment the following line to allow hosts on the local network to access our NTP server.

# Allow NTP client access from local network.
allow 192.168.0.0/16

Note that the server can listen for requests on any local network it's attached to. The IP address in the "allow" line is just intended for illustrative purposes. Be sure to change the IP network and subnet mask in that line to match your local network's.

Restart chronyd .

[root@studentvm1 ~]# systemctl restart chronyd

To allow other hosts on your network to access this server, configure the firewall to allow inbound UDP packets on port 123. Check your firewall's documentation to find out how to do that.
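For example, on hosts that use firewalld (the default on Fedora, CentOS, and RHEL), something along these lines should do it:

firewall-cmd --permanent --add-service=ntp    # allow inbound NTP (UDP port 123)
firewall-cmd --reload                         # apply the permanent change to the running firewall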

Testing

Your host is now an NTP server. You can test it with another host or a VM that has access to the network on which the NTP server is listening. Configure the client to use the new NTP server as the preferred server in the /etc/chrony.conf file, then monitor that client using the chronyc tools we used above.

Chronyc as an interactive tool

As I mentioned earlier, chronyc can be used as an interactive command tool. Simply run the command without a subcommand and you get a chronyc command prompt.

[root@studentvm1 ~]# chronyc
chrony version 3.4
Copyright (C) 1997-2003, 2007, 2009-2018 Richard P. Curnow and others
chrony comes with ABSOLUTELY NO WARRANTY. This is free software, and
you are welcome to redistribute it under certain conditions. See the
GNU General Public License version 2 for details.

chronyc>

You can enter just the subcommands at this prompt. Try using the tracking , ntpdata , and sources commands. The chronyc command line allows command recall and editing for chronyc subcommands. You can use the help subcommand to get a list of possible commands and their syntax.

Conclusion

Chrony is a powerful tool for synchronizing the times of client hosts, whether they are all on the local network or scattered around the globe. It's easy to configure because, despite the large number of options available, only a few configurations are required for most circumstances.

After my client computers have synchronized with the NTP server, I like to set the system hardware clock from the system (OS) time by using the following command:

/sbin/hwclock --systohc

This command can be added as a cron job or a script in cron.daily to keep the hardware clock synced with the system time.
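A minimal cron.daily script for that might look like the following (the file name is arbitrary); remember to make it executable with chmod +x:

#!/bin/bash
# /etc/cron.daily/hwclock-sync -- set the hardware clock from the system clock once a day
/sbin/hwclock --systohc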

Chrony and ntpd serve the same purpose and share many configuration concepts, but their configuration files are not drop-in replacements for each other. The man pages for chronyd, chronyc, and chrony.conf contain an amazing amount of information that can help you get started or learn about esoteric configuration options.

Do you run your own NTP server? Let us know in the comments and be sure to tell us which implementation you are using, NTP or Chrony.

[Nov 07, 2019] 5 alerting and visualization tools for sysadmins Opensource.com

Nov 07, 2019 | opensource.com

Common types of alerts and visualizations

Alerts

Let's first cover what alerts are not . Alerts should not be sent if the human responder can't do anything about the problem. This includes alerts that are sent to multiple individuals with only a few who can respond, or situations where every anomaly in the system triggers an alert. This leads to alert fatigue and receivers ignoring all alerts within a specific medium until the system escalates to a medium that isn't already saturated.

For example, if an operator receives hundreds of emails a day from the alerting system, that operator will soon ignore all emails from the alerting system. The operator will respond to a real incident only when he or she is experiencing the problem, emailed by a customer, or called by the boss. In this case, alerts have lost their meaning and usefulness.

Alerts are not a constant stream of information or a status update. They are meant to convey a problem from which the system can't automatically recover, and they are sent only to the individual most likely to be able to recover the system. Everything that falls outside this definition isn't an alert and will only damage your employees and company culture.

Everyone has a different set of alert types, so I won't discuss things like priority levels (P1-P5) or models that use words like "Informational," "Warning," and "Critical." Instead, I'll describe the generic categories emergent in complex systems' incident response.

You might have noticed I mentioned an "Informational" alert type right after I wrote that alerts shouldn't be informational. Well, not everyone agrees, but I don't consider something an alert if it isn't sent to anyone. It is a data point that many systems refer to as an alert. It represents some event that should be known but not responded to. It is generally part of the visualization system of the alerting tool and not an event that triggers actual notifications. Mike Julian covers this and other aspects of alerting in his book Practical Monitoring . It's a must read for work in this area.

Non-informational alerts consist of types that can be responded to or require action. I group these into two categories: internal outage and external outage. (Most companies have more than two levels for prioritizing their response efforts.) Degraded system performance is considered an outage in this model, as the impact to each user is usually unknown.

Internal outages are a lower priority than external outages, but they still need to be responded to quickly. They often include internal systems that company employees use or components of applications that are visible only to company employees.

External outages consist of any system outage that would immediately impact a customer. These don't include a system outage that prevents releasing updates to the system. They do include customer-facing application failures, database outages, and networking partitions that hurt availability or consistency if either can impact a user. They also include outages of tools that may not have a direct impact on users, as the application continues to run but this transparent dependency impacts performance. This is common when the system uses some external service or data source that isn't necessary for full functionality but may cause delays as the application performs retries or handles errors from this external dependency.

Visualizations

There are many visualization types, and I won't cover them all here. It's a fascinating area of research. On the data analytics side of my career, learning and applying that knowledge is a constant challenge. We need to provide simple representations of complex system outputs for the widest dissemination of information. Google Charts and Tableau have a wide selection of visualization types. We'll cover the most common visualizations and some innovative solutions for quickly understanding systems.

Line chart

The line chart is probably the most common visualization. It does a pretty good job of producing an understanding of a system over time. A line chart in a metrics system would have a line for each unique metric or some aggregation of metrics. This can get confusing when there are a lot of metrics in the same dashboard (as shown below), but most systems can select specific metrics to view rather than having all of them visible. Also, anomalous behavior is easy to spot if it's significant enough to escape the noise of normal operations. Below we can see purple, yellow, and light blue lines that might indicate anomalous behavior.


Another feature of a line chart is that you can often stack them to show relationships. For example, you might want to look at requests on each server individually, but also in aggregate. This allows you to understand the overall system as well as each instance in the same graph.

Heatmaps

Another common visualization is the heatmap. It is useful when looking at histograms. This type of visualization is similar to a bar chart but can show gradients within the bars representing the different percentiles of the overall metric. For example, suppose you're looking at request latencies and you want to quickly understand the overall trend as well as the distribution of all requests. A heatmap is great for this, and it can use color to disambiguate the quantity of each section with a quick glance.

The heatmap below shows the higher concentration around the centerline of the graph with an easy-to-understand visualization of the distribution vertically for each time bucket. We might want to review a couple of points in time where the distribution gets wide while the others are fairly tight like at 14:00. This distribution might be a negative performance indicator.

Gauges

The last common visualization I'll cover here is the gauge, which helps users understand a single metric quickly. Gauges can represent a single metric, like your speedometer represents your driving speed or your gas gauge represents the amount of gas in your car. Similar to the gas gauge, most monitoring gauges clearly indicate what is good and what isn't. Often (as is shown below), good is represented by green, getting worse by orange, and "everything is breaking" by red. The middle row below shows traditional gauges.

Image source: Grafana.org (© Grafana Labs)

This image shows more than just traditional gauges. The other gauges are single stat representations that are similar to the function of the classic gauge. They all use the same color scheme to quickly indicate system health with just a glance. Arguably, the bottom row is probably the best example of a gauge that allows you to glance at a dashboard and know that everything is healthy (or not). This type of visualization is usually what I put on a top-level dashboard. It offers a full, high-level understanding of system health in seconds.

Flame graphs

A less common visualization is the flame graph, introduced by Netflix's Brendan Gregg in 2011. It's not ideal for dashboarding or quickly observing high-level system concerns; it's normally seen when trying to understand a specific application problem. This visualization focuses on CPU and memory and the associated frames. The X-axis lists the frames alphabetically, and the Y-axis shows stack depth. Each rectangle is a stack frame and includes the function being called. The wider the rectangle, the more it appears in the stack. This method is invaluable when trying to diagnose system performance at the application level and I urge everyone to give it a try.

Image source: Wikimedia.org (Creative Commons BY SA 3.0)

Tool options

There are several commercial options for alerting, but since this is Opensource.com, I'll cover only systems that are being used at scale by real companies that you can use at no cost. Hopefully, you'll be able to contribute new and innovative features to make these systems even better.

Alerting tools

Bosun

If you've ever done anything with computers and gotten stuck, the help you received was probably thanks to a Stack Exchange system. Stack Exchange runs many different websites around a crowdsourced question-and-answer model. Stack Overflow is very popular with developers, and Super User is popular with operations. However, there are now hundreds of sites ranging from parenting to sci-fi and philosophy to bicycles.

Stack Exchange open-sourced its alert management system, Bosun , around the same time Prometheus and its AlertManager system were released. There were many similarities in the two systems, and that's a really good thing. Like Prometheus, Bosun is written in Golang. Bosun's scope is more extensive than Prometheus' as it can interact with systems beyond metrics aggregation. It can also ingest data from log and event aggregation systems. It supports Graphite, InfluxDB, OpenTSDB, and Elasticsearch.

Bosun's architecture consists of a single server binary, a backend like OpenTSDB, Redis, and scollector agents . The scollector agents automatically detect services on a host and report metrics for those processes and other system resources. This data is sent to a metrics backend. The Bosun server binary then queries the backends to determine if any alerts need to be fired. Bosun can also be used by tools like Grafana to query the underlying backends through one common interface. Redis is used to store state and metadata for Bosun.

A really neat feature of Bosun is that it lets you test your alerts against historical data. This was something I missed in Prometheus several years ago, when I had data for an issue I wanted alerts on but no easy way to test it. To make sure my alerts were working, I had to create and insert dummy data. This system alleviates that very time-consuming process.

Bosun also has the usual features like showing simple graphs and creating alerts. It has a powerful expression language for writing alerting rules. However, it only has email and HTTP notification configurations, which means connecting to Slack and other tools requires a bit more customization ( which its documentation covers ). Similar to Prometheus, Bosun can use templates for these notifications, which means they can look as awesome as you want them to. You can use all your HTML and CSS skills to create the baddest email alert anyone has ever seen.

Cabot

Cabot was created by a company called Arachnys . You may not know who Arachnys is or what it does, but you have probably felt its impact: It built the leading cloud-based solution for fighting financial crimes. That sounds pretty cool, right? At a previous company, I was involved in similar functions around "know your customer" laws. Most companies would consider it a very bad thing to be linked to a terrorist group, for example, funneling money through their systems. These solutions also help defend against less-atrocious offenders like fraudsters who could also pose a risk to the institution.

So why did Arachnys create Cabot? Well, it is kind of a Christmas present to everyone, as it was a Christmas project built because its developers couldn't wrap their heads around Nagios . And really, who can blame them? Cabot was written with Django and Bootstrap, so it should be easy for most to contribute to the project. (Another interesting factoid: The name comes from the creator's dog.)

The Cabot architecture is similar to Bosun in that it doesn't collect any data. Instead, it accesses data through the APIs of the tools it is alerting for. Therefore, Cabot uses a pull (rather than a push) model for alerting. It reaches out into each system's API and retrieves the information it needs to make a decision based on a specific check. Cabot stores the alerting data in a Postgres database and also has a cache using Redis.

Cabot natively supports Graphite , but it also supports Jenkins , which is rare in this area. Arachnys uses Jenkins like a centralized cron, but I like this idea of treating build failures like outages. Obviously, a build failure isn't as critical as a production outage, but it could still alert the team and escalate if the failure isn't resolved. Who actually checks Jenkins every time an email comes in about a build failure? Yeah, me too!

Another interesting feature is that Cabot can integrate with Google Calendar for on-call rotations. Cabot calls this feature Rota, which is a British term for a roster or rotation. This makes a lot of sense, and I wish other systems would take this idea further. Cabot doesn't support anything more complex than primary and backup personnel, but there is certainly room for additional features. The docs say if you want something more advanced, you should look at a commercial option.

StatsAgg

StatsAgg ? How did that make the list? Well, it's not every day you come across a publishing company that has created an alerting platform. I think that deserves recognition. Of course, Pearson isn't just a publishing company anymore; it has several web presences and a joint venture with O'Reilly Media . However, I still think of it as the company that published my schoolbooks and tests.

StatsAgg isn't just an alerting platform; it's also a metrics aggregation platform. And it's kind of like a proxy for other systems. It supports Graphite, StatsD, InfluxDB, and OpenTSDB as inputs, but it can also forward those metrics to their respective platforms. This is an interesting concept, but potentially risky as loads increase on a central service. However, if the StatsAgg infrastructure is robust enough, it can still produce alerts even when a backend storage platform has an outage.

StatsAgg is written in Java and consists only of the main server and UI, which keeps complexity to a minimum. It can send alerts based on regular expression matching and is focused on alerting by service rather than host or instance. Its goal is to fill a void in the open source observability stack, and I think it does that quite well.

Visualization tools

Grafana

Almost everyone knows about Grafana , and many have used it. I have used it for years whenever I need a simple dashboard. The tool I used before was deprecated, and I was fairly distraught about that until Grafana made it okay. Grafana was gifted to us by Torkel Ödegaard. Like Cabot, Grafana was also created around Christmastime, and released in January 2014. It has come a long way in just a few years. It started life as a Kibana dashboarding system, and Torkel forked it into what became Grafana.

Grafana's sole focus is presenting monitoring data in a more usable and pleasing way. It can natively gather data from Graphite, Elasticsearch, OpenTSDB, Prometheus, and InfluxDB. There's an Enterprise version that uses plugins for more data sources, but there's no reason those other data source plugins couldn't be created as open source, as the Grafana plugin ecosystem already offers many other data sources.

What does Grafana do for me? It provides a central location for understanding my system. It is web-based, so anyone can access the information, although it can be restricted using different authentication methods. Grafana can provide knowledge at a glance using many different types of visualizations. However, it has started integrating alerting and other features that aren't traditionally combined with visualizations.

Now you can set alerts visually. That means you can look at a graph, maybe even one showing where an alert should have triggered due to some degradation of the system, click on the graph where you want the alert to trigger, and then tell Grafana where to send the alert. That's a pretty powerful addition that won't necessarily replace an alerting platform, but it can certainly help augment it by providing a different perspective on alerting criteria.

Grafana has also introduced more collaboration features. Users have been able to share dashboards for a long time, meaning you don't have to create your own dashboard for your Kubernetes cluster because there are several already available -- with some maintained by Kubernetes developers and others by Grafana developers.

The most significant addition around collaboration is annotations. Annotations allow a user to add context to part of a graph. Other users can then use this context to understand the system better. This is an invaluable tool when a team is in the middle of an incident and communication and common understanding are critical. Having all the information right where you're already looking makes it much more likely that knowledge will be shared across the team quickly. It's also a nice feature to use during blameless postmortems when the team is trying to understand how the failure occurred and learn more about their system.

Vizceral

Netflix created Vizceral to understand its traffic patterns better when performing a traffic failover. Unlike Grafana, which is a more general tool, Vizceral serves a very specific use case. Netflix no longer uses this tool internally and says it is no longer actively maintained, but it still updates the tool periodically. I highlight it here primarily to point out an interesting visualization mechanism and how it can help solve a problem. It's worth running it in a demo environment just to better grasp the concepts and witness what's possible with these systems.

[Nov 07, 2019] What breaks our systems A taxonomy of black swans Opensource.com

Nov 07, 2019 | opensource.com

What breaks our systems: A taxonomy of black swans. Find and fix outlier events that create issues before they trigger severe production problems. 25 Oct 2018, Laura Nolan


Black swans, by definition, can't be predicted, but sometimes there are patterns we can find and use to create defenses against categories of related problems.

For example, a large proportion of failures are a direct result of changes (code, environment, or configuration). Each bug triggered in this way is distinctive and unpredictable, but the common practice of canarying all changes is somewhat effective against this class of problems, and automated rollbacks have become a standard mitigation.

As our profession continues to mature, other kinds of problems are becoming well-understood classes of hazards with generalized prevention strategies.

Black swans observed in the wild

All technology organizations have production problems, but not all of them share their analyses. The organizations that publicly discuss incidents are doing us all a service. The incidents described below each represent a class of problem and are by no means isolated instances. We all have black swans lurking in our systems; it's just that some of us don't know it yet.

Hitting limits


Running headlong into any sort of limit can produce very severe incidents. A canonical example of this was Instapaper's outage in February 2017 . I challenge any engineer who has carried a pager to read the outage report without a chill running up their spine. Instapaper's production database was on a filesystem that, unknown to the team running the service, had a 2TB limit. With no warning, it stopped accepting writes. Full recovery took days and required migrating its database.

Limits can strike in various ways. Sentry hit limits on maximum transaction IDs in Postgres. Platform.sh hit size limits on a pipe buffer. SparkPost triggered AWS's DDoS protection. Foursquare hit a performance cliff when one of its datastores ran out of RAM.

One way to get advance knowledge of system limits is to test periodically. Good load testing (on a production replica) ought to involve write transactions and should involve growing each datastore beyond its current production size. It's easy to forget to test things that aren't your main datastores (such as Zookeeper). If you hit limits during testing, you have time to fix the problems. Given that resolution of limits-related issues can involve major changes (like splitting a datastore), time is invaluable.

When it comes to cloud services, if your service generates unusual loads or uses less widely used products or features (such as older or newer ones), you may be more at risk of hitting limits. It's worth load testing these, too. But warn your cloud provider first.

Finally, where limits are known, add monitoring (with associated documentation) so you will know when your systems are approaching those ceilings. Don't rely on people still being around to remember.
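As a minimal sketch of what such monitoring could look like (the mountpoint, the 2TB ceiling, and the alert hook are illustrative placeholders, not taken from any of the incidents above), a small cron-driven shell check is often enough:

    #!/bin/bash
    # warn-fs-limit.sh -- warn when a filesystem approaches a known hard limit
    MOUNTPOINT=/var/lib/pgsql                  # placeholder path
    LIMIT_KB=$((2 * 1024 * 1024 * 1024))       # known 2TB ceiling, expressed in KB
    WARN_PCT=80

    used_kb=$(df -kP "$MOUNTPOINT" | awk 'NR==2 {print $3}')
    pct=$(( used_kb * 100 / LIMIT_KB ))

    if [ "$pct" -ge "$WARN_PCT" ]; then
        # replace logger with a call to your real alerting system
        logger -t fs-limit "WARNING: $MOUNTPOINT is at ${pct}% of its known ${LIMIT_KB}KB limit"
    fi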

Spreading slowness
"The world is much more correlated than we give credit to. And so we see more of what Nassim Taleb calls 'black swan events' -- rare events happen more often than they should because the world is more correlated."
-- Richard Thaler

HostedGraphite's postmortem on how an AWS outage took down its load balancers (which are not hosted on AWS) is a good example of just how much correlation exists in distributed computing systems. In this case, the load-balancer connection pools were saturated by slow connections from customers that were hosted in AWS. The same kinds of saturation can happen with application threads, locks, and database connections -- any kind of resource monopolized by slow operations.

HostedGraphite's incident is an example of externally imposed slowness, but often slowness can result from saturation somewhere in your own system creating a cascade and causing other parts of your system to slow down. An incident at Spotify demonstrates such spread -- the streaming service's frontends became unhealthy due to saturation in a different microservice. Enforcing deadlines for all requests, as well as limiting the length of request queues, can prevent such spread. Your service will serve at least some traffic, and recovery will be easier because fewer parts of your system will be broken.

Retries should be limited with exponential backoff and some jitter. An outage at Square, in which its Redis datastore became overloaded due to a piece of code that retried failed transactions up to 500 times with no backoff, demonstrates the potential severity of excessive retries. The Circuit Breaker design pattern can be helpful here, too.
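A minimal shell sketch of that retry pattern (the command name and limits are placeholders; real services usually implement this in the client library rather than in shell):

    # retry a flaky operation at most 5 times with exponential backoff plus jitter
    max_attempts=5
    attempt=1
    until some_flaky_command; do               # some_flaky_command is a placeholder
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            exit 1
        fi
        backoff=$(( 2 ** attempt ))            # 2, 4, 8, 16 ... seconds
        jitter=$(( RANDOM % backoff ))         # spread retries out so clients don't align
        sleep $(( backoff + jitter ))
        attempt=$(( attempt + 1 ))
    done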

Dashboards should be designed to clearly show utilization, saturation, and errors for all resources so problems can be found quickly.

Thundering herds

Often, failure scenarios arise when a system is under unusually heavy load. This can arise organically from users, but often it arises from systems. A surge of cron jobs that starts at midnight is a venerable example. Mobile clients can also be a source of coordinated demand if they are programmed to fetch updates at the same time (of course, it is much better to jitter such requests).
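As a crude illustration of jittering scheduled work (the job name, user, and five-minute window are arbitrary), a fleet-wide cron entry can simply sleep a random amount before doing anything:

    # /etc/cron.d/nightly-report -- sketch only; spreads the midnight spike over ~5 minutes
    SHELL=/bin/bash
    # note: % must be escaped inside crontab entries
    0 0 * * * appuser sleep $((RANDOM \% 300)) && /usr/local/bin/nightly-report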

Events occurring at pre-configured times aren't the only source of thundering herds. Slack experienced multiple outages over a short time due to large numbers of clients being disconnected and immediately reconnecting, causing large spikes of load. CircleCI saw a severe outage when a GitLab outage ended, leading to a surge of builds queued in its database, which became saturated and very slow.

Almost any service can be the target of a thundering herd. Planning for such eventualities -- and testing that your plan works as intended -- is therefore a must. Client backoff and load shedding are often core to such approaches.

If your systems must constantly ingest data that can't be dropped, it's key to have a scalable way to buffer this data in a queue for later processing.

Automation systems are complex systems
"Complex systems are intrinsically hazardous systems."
-- Richard Cook, MD

The trend for the past several years has been strongly towards more automation of software operations. Automation of anything that can reduce your system's capacity (e.g., erasing disks, decommissioning devices, taking down serving jobs) needs to be done with care. Accidents (due to bugs or incorrect invocations) with this kind of automation can take down your system very efficiently, potentially in ways that are hard to recover from.

Christina Schulman and Etienne Perot of Google describe some examples in their talk Help Protect Your Data Centers with Safety Constraints . One incident sent Google's entire in-house content delivery network (CDN) to disk-erase.

Schulman and Perot suggest using a central service to manage constraints, which limits the pace at which destructive automation can operate, and being aware of system conditions (for example, avoiding destructive operations if the service has recently had an alert).

Automation systems can also cause havoc when they interact with operators (or with other automated systems). Reddit experienced a major outage when its automation restarted a system that operators had stopped for maintenance. Once you have multiple automation systems, their potential interactions become extremely complex and impossible to predict.

It will help to deal with the inevitable surprises if all this automation writes logs to an easily searchable, central place. Automation systems should always have a mechanism to allow them to be quickly turned off (fully or only for a subset of operations or targets).
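A hedged sketch of both ideas, destructive automation that logs to syslog (easy to centralize) and refuses to run while a "big red button" flag file exists; the flag path and logger tag are illustrative, not from any specific tool:

    #!/bin/bash
    # decommission-host.sh (sketch) -- refuse to run if automation has been disabled
    DISABLE_FLAG=/etc/automation/DISABLED       # placeholder kill-switch file
    TARGET_HOST=$1

    if [ -e "$DISABLE_FLAG" ]; then
        logger -t decommission "refusing to touch $TARGET_HOST: $DISABLE_FLAG is present"
        exit 1
    fi

    logger -t decommission "starting decommission of $TARGET_HOST"
    # ... the actual destructive steps would go here, each step logged the same way ...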

Defense against the dark swans

These are not the only black swans that might be waiting to strike your systems. There are many other kinds of severe problem that can be avoided using techniques such as canarying, load testing, chaos engineering, disaster testing, and fuzz testing -- and of course designing for redundancy and resiliency. Even with all that, at some point your system will fail.

To ensure your organization can respond effectively, make sure your key technical staff and your leadership have a way to coordinate during an outage. For example, one unpleasant issue you might have to deal with is a complete outage of your network. It's important to have a fail-safe communications channel completely independent of your own infrastructure and its dependencies. For instance, if you run on AWS, using a service that also runs on AWS as your fail-safe communication method is not a good idea. A phone bridge or an IRC server that runs somewhere separate from your main systems is good. Make sure everyone knows what the communications platform is and practices using it.

Another principle is to ensure that your monitoring and your operational tools rely on your production systems as little as possible. Separate your control and your data planes so you can make changes even when systems are not healthy. Don't use a single message queue for both data processing and config changes or monitoring, for example -- use separate instances. In SparkPost: The Day the DNS Died , Jeremy Blosser presents an example where critical tools relied on the production DNS setup, which failed.

The psychology of battling the black swan

Dealing with major incidents in production can be stressful. It really helps to have a structured incident-management process in place for these situations. Many technology organizations (including Google) successfully use a version of FEMA's Incident Command System. There should be a clear way for any on-call individual to call for assistance in the event of a major problem they can't resolve alone.

For long-running incidents, it's important to make sure people don't work for unreasonable lengths of time and get breaks to eat and sleep (uninterrupted by a pager). It's easy for exhausted engineers to make a mistake or overlook something that might resolve the incident faster.

Learn more

There are many other things that could be said about black (or formerly black) swans and strategies for dealing with them. If you'd like to learn more, I highly recommend these two books dealing with resilience and stability in production: Susan Fowler's Production-Ready Microservices and Michael T. Nygard's Release It!


Laura Nolan will present What Breaks Our Systems: A Taxonomy of Black Swans at LISA18 , October 29-31 in Nashville, Tennessee, USA.

[Nov 07, 2019] Linux commands to display your hardware information

Nov 07, 2019 | opensource.com

Get the details on what's inside your computer from the command line. 16 Sep 2019, Howard Fosdick


The easiest way to do that is with one of the standard Linux GUI programs:

  • i-nex collects hardware information and displays it in a manner similar to the popular CPU-Z under Windows.
  • HardInfo displays hardware specifics and even includes a set of eight popular benchmark programs you can run to gauge your system's performance.
  • KInfoCenter and Lshw also display hardware details and are available in many software repositories.

Alternatively, you could open up the box and read the labels on the disks, memory, and other devices. Or you could enter the boot-time panels -- the so-called UEFI or BIOS panels. Just hit the proper program function key during the boot process to access them. These two methods give you hardware details but omit software information.

Or, you could issue a Linux line command. Wait a minute, that sounds difficult. Why would you do this?

The Linux Terminal

Sometimes it's easy to find a specific bit of information through a well-targeted line command. Perhaps you don't have a GUI program available or don't want to install one.

Probably the main reason to use line commands is for writing scripts. Whether you employ the Linux shell or another programming language, scripting typically requires coding line commands.
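As a small illustration, several of the commands covered below can be strung together into a quick hardware-summary script (a sketch only; not every command is installed by default on every distribution):

    #!/bin/bash
    # hw-summary.sh -- print a short hardware report
    echo "== CPU ==";     lscpu | grep -iE 'model name|^CPU\(s\)|MHz'
    echo "== Memory ==";  free -m
    echo "== Disks ==";   lsblk -d -o NAME,SIZE,MODEL
    echo "== Network =="; ip -brief link show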

Many line commands for detecting hardware must be issued under root authority. So either switch to the root user ID, or issue the command under your regular user ID preceded by sudo :

sudo <the_line_command>

and respond to the password prompt (by default, sudo asks for your own password rather than root's).

This article introduces many of the most useful line commands for system discovery. The quick reference chart at the end summarizes them.

Hardware overview

There are several line commands that will give you a comprehensive overview of your computer's hardware.

The inxi command lists details about your system, CPU, graphics, audio, networking, drives, partitions, sensors, and more. Forum participants often ask for its output when they're trying to help others solve problems. It's a standard diagnostic for problem-solving:

inxi -Fxz

The -F flag means you'll get full output, x adds details, and z masks out personally identifying information like MAC and IP addresses.

The hwinfo and lshw commands display much of the same information in different formats:

hwinfo --short

or

lshw -short

The long forms of these two commands spew out exhaustive -- but hard to read -- output:

hwinfo

or

lshw
CPU details

You can learn everything about your CPU through line commands. View CPU details by issuing either the lscpu command or its close relative lshw :

lscpu

or

lshw -C cpu

In both cases, the last few lines of output list all the CPU's capabilities. Here you can find out whether your processor supports specific features.

With all these commands, you can reduce verbiage and narrow any answer down to a single detail by parsing the command output with the grep command. For example, to view only the CPU make and model:

lshw -C cpu | grep -i product

To view just the CPU's speed in megahertz:

lscpu | grep -i mhz

or its BogoMips power rating:

lscpu | grep -i bogo

The -i flag on the grep command simply makes the search case-insensitive.

Memory

Linux line commands enable you to gather all possible details about your computer's memory. You can even determine whether you can add extra memory to the computer without opening up the box.

To list each memory stick and its capacity, issue the dmidecode command:

dmidecode -t memory | grep -i size

For more specifics on system memory, including type, size, speed, and voltage of each RAM stick, try:

lshw -short -C memory

One thing you'll surely want to know is the maximum memory you can install on your computer:

dmidecode -t memory | grep -i max

Now find out whether there are any open slots to insert additional memory sticks. You can do this without opening your computer by issuing this command:

lshw -short -C memory | grep -i empty

A null response means all the memory slots are already in use.

Determining how much video memory you have requires a pair of commands. First, list all devices with the lspci command and limit the output displayed to the video device you're interested in:

lspci | grep -i vga

The output line that identifies the video controller will typically look something like this:

00:02.0 VGA compatible controller: Intel Corporation 82Q35 Express Integrated Graphics Controller (rev 02)

Now reissue the lspci command, referencing the video device number as the selected device:

lspci -v -s 00:02.0

The output line identified as prefetchable is the amount of video RAM on your system:

...
Memory at f0100000 (32-bit, non-prefetchable) [size=512K]
I/O ports at 1230 [size=8]
Memory at e0000000 (32-bit, prefetchable) [size=256M]
Memory at f0000000 (32-bit, non-prefetchable) [size=1M]
...

Finally, to show current memory use in megabytes, issue:

free -m

This tells how much memory is free, how much is in use, the size of the swap area, and whether it's being used. For example, the output might look like this:

              total        used        free      shared  buff/cache   available
Mem:          11891        1326        8877         212        1687       10077
Swap:          1999           0        1999

The top command gives you more detail on memory use. It shows current overall memory and CPU use and also breaks it down by process ID, user ID, and the commands being run. It displays full-screen text output:

top
Disks, filesystems, and devices

You can easily determine whatever you wish to know about disks, partitions, filesystems, and other devices.

To display a single line describing each disk device:

lshw -short -C disk

Get details on any specific SATA disk, such as its model and serial numbers, supported modes, sector count, and more with:

hdparm -i /dev/sda

Of course, you should replace sda with sdb or another device mnemonic if necessary.

To list all disks with all their defined partitions, along with the size of each, issue:

lsblk

For more detail, including the number of sectors, size, filesystem ID and type, and partition starting and ending sectors:

fdisk -l

To start up Linux, you need to identify mountable partitions to the GRUB bootloader. You can find this information with the blkid command. It lists each partition's unique identifier (UUID) and its filesystem type (e.g., ext3 or ext4):

blkid

To list the mounted filesystems, their mount points, and the space used and available for each (in megabytes):

df -m

Finally, you can list details for all USB and PCI buses and devices with these commands:

lsusb

or

lspci
Network

Linux offers tons of networking line commands. Here are just a few.

To see hardware details about your network card, issue:

lshw -C network

Traditionally, the command to show network interfaces was ifconfig :

ifconfig -a

But many people now use:

ip link show

or

netstat -i

In reading the output, it helps to know common network abbreviations:

Abbreviation Meaning
lo Loopback interface
eth0 or enp* Ethernet interface
wlan0 Wireless interface
ppp0 Point-to-Point Protocol interface (used by a dial-up modem, PPTP VPN connection, or USB modem)
vboxnet0 or vmnet* Virtual machine interface

The asterisks in this table are wildcard characters, serving as a placeholder for whatever series of characters appear from system to system.

To show your default gateway and routing tables, issue either of these commands:

ip route | column -t

or

netstat -r
Software

Let's conclude with two commands that display low-level software details. For example, what if you want to know whether you have the latest firmware installed? This command shows the UEFI or BIOS date and version:

dmidecode -t bios

What is the kernel version, and is it 64-bit? And what is the network hostname? To find out, issue:

uname -a
Quick reference chart

This chart summarizes all the commands covered in this article:

Display info about all hardware: inxi -Fxz, hwinfo --short, or lshw -short
Display all CPU info: lscpu or lshw -C cpu
Show CPU features (e.g., PAE, SSE2): lshw -C cpu | grep -i capabilities
Report whether the CPU is 32- or 64-bit: lshw -C cpu | grep -i width
Show current memory size and configuration: dmidecode -t memory | grep -i size or lshw -short -C memory
Show maximum memory for the hardware: dmidecode -t memory | grep -i max
Determine whether memory slots are available: lshw -short -C memory | grep -i empty (a null answer means no slots available)
Determine the amount of video memory: lspci | grep -i vga, then reissue with the device number, for example lspci -v -s 00:02.0 (the VRAM is the prefetchable value)
Show current memory use: free -m or top
List the disk drives: lshw -short -C disk
Show detailed information about a specific disk drive: hdparm -i /dev/sda (replace sda if necessary)
List information about disks and partitions: lsblk (simple) or fdisk -l (detailed)
List partition IDs (UUIDs): blkid
List mounted filesystems, their mount points, and megabytes used and available for each: df -m
List USB devices: lsusb
List PCI devices: lspci
Show network card details: lshw -C network
Show network interfaces: ifconfig -a, ip link show, or netstat -i
Display routing tables: ip route | column -t or netstat -r
Display UEFI/BIOS info: dmidecode -t bios
Show kernel version, network hostname, and more: uname -a

Do you have a favorite command that I overlooked? Please add a comment and share it.

[Nov 07, 2019] 13 open source backup solutions Opensource.com

Nov 07, 2019 | opensource.com

13 open source backup solutions: Readers suggest more than a dozen of their favorite solutions for protecting data. 07 Mar 2019, Don Watkins (Community Moderator)

This article follows up on a poll that asked readers to vote on their favorite open source backup solution. We offered six solutions recommended by our moderator community -- Cronopete, Deja Dup, Rclone, Rdiff-backup, Restic, and Rsync -- and invited readers to share other options in the comments. And you came through, offering 13 other solutions (so far) that we either hadn't considered or hadn't even heard of.

By far the most popular suggestion was BorgBackup. It is a deduplicating backup solution that features compression and encryption. It is supported on Linux, MacOS, and BSD and has a BSD License.
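For readers who want to try it, a minimal BorgBackup workflow looks roughly like this (the repository path and source directories are placeholders; check Borg's own documentation for current options):

    # one-time: create an encrypted, deduplicating repository
    borg init --encryption=repokey /mnt/backup/borg-repo
    # daily: create a timestamped archive and print what it saved
    borg create --stats --compression lz4 /mnt/backup/borg-repo::'{hostname}-{now}' /home /etc
    # keep 7 daily and 4 weekly archives, prune the rest
    borg prune --keep-daily 7 --keep-weekly 4 /mnt/backup/borg-repo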

Second was UrBackup , which does full and incremental image and file backups; you can save whole partitions or single directories. It has clients for Windows, Linux, and MacOS and has a GNU Affero Public License.

Third was LuckyBackup ; according to its website, "it is simple to use, fast (transfers over only changes made and not all data), safe (keeps your data safe by checking all declared directories before proceeding in any data manipulation), reliable, and fully customizable." It carries a GNU Public License.

Casync is content-addressable synchronization -- it's designed for backup and synchronizing and stores and retrieves multiple related versions of large file systems. It is licensed with the GNU Lesser Public License.

Syncthing synchronizes files between two computers. It is licensed with the Mozilla Public License and, according to its website, is secure and private. It works on MacOS, Windows, Linux, FreeBSD, Solaris, and OpenBSD.

Duplicati is a free backup solution that works on Windows, MacOS, and Linux and supports a variety of standard protocols, such as FTP, SSH, and WebDAV, as well as cloud services. It features strong encryption and is licensed with the GPL.

Dirvish is a disk-based virtual image backup system licensed under OSL-3.0. It also requires Rsync, Perl5, and SSH to be installed.

Bacula 's website says it "is a set of computer programs that permits the system administrator to manage backup, recovery, and verification of computer data across a network of computers of different kinds." It is supported on Linux, FreeBSD, Windows, MacOS, OpenBSD, and Solaris and the bulk of its source code is licensed under AGPLv3.

BackupPC "is a high-performance, enterprise-grade system for backing up Linux, Windows, and MacOS PCs and laptops to a server's disk," according to its website. It is licensed under the GPLv3.

Amanda is a backup system written in C and Perl that allows a system administrator to back up an entire network of client machines to a single server using tape, disk, or cloud-based systems. It was developed and copyrighted in 1991 at the University of Maryland and has a BSD-style license.

Back in Time is a simple backup utility designed for Linux. It provides a command line client and a GUI, both written in Python. To do a backup, just specify where to store snapshots, what folders to back up, and the frequency of the backups. BackInTime is licensed with GPLv2.

Timeshift is a backup utility for Linux that is similar to System Restore for Windows and Time Capsule for MacOS. According to its GitHub repository, "Timeshift protects your system by taking incremental snapshots of the file system at regular intervals. These snapshots can be restored at a later date to undo all changes to the system."

Kup is a backup solution that was created to help users back up their files to a USB drive, but it can also be used to perform network backups. According to its GitHub repository, "When you plug in your external hard drive, Kup will automatically start copying your latest changes."

[Nov 06, 2019] Destroying multiple production databases by Jan Gerrit Kootstra

Aug 08, 2019 | www.redhat.com
In my 22-year career as an IT specialist, I encountered two major issues where -- due to my mistakes -- important production databases were blown apart. Here are my stories.

Freshman mistake

The first time was in the late 1990s when I started working at a service provider for my local municipality's social benefit agency. I got an assignment as a newbie system administrator to remove retired databases from the server where databases for different departments were consolidated.

Due to a typo in a top-level directory path, I removed two live database files instead of the one retired database. Worse, because of the complexity of the consolidated setup, other databases were hit during the restore, too. Repairing all the databases took approximately 22 hours.

What helped

A good backup that was tested each night by recovering an empty file at the end of the tar archive catalog, after the backup was made.

Future-looking statement

It's important to learn from our mistakes. What I learned is this:

  • Write down the steps you will perform and have them checked by a senior sysadmin. It was the first time I did not ask for a review by one of the seniors. My bad.
  • Be nice to colleagues from other teams. It was a DBA that saved me.
  • Do not copy such a complex setup of sharing databases over shared file systems.
  • Before doing a life cycle management migration, go for a separation of the filesystems per database to avoid the complexity and reduce the chances of human error.
  • Change your approach: Later in my career, I always tried to avoid lift and shift migrations.
Senior sysadmin mistake

In a period when partly offshoring IT activities was common practice in order to reduce costs, I had to take over a database filesystem extension on a Red Hat 5 cluster. Given that I had set up this system a couple of years before, I had not checked the current situation.

I assumed the offshore team was familiar with the need to attach all shared LUNs to both nodes of the two-node cluster. My bad; never assume. As an Australian tourist once put it when a friend and I were on vacation in Ireland after my Latin grammar school graduation: "Do not make an arse out of you and me." Or, as another phrase goes: "Assuming is the mother of all mistakes."

Well, I fell into my own trap. I went for the filesystem extension on the active node and, without checking the passive node's (node2) status, tested a failover. Because we had agreed to run the database on node2 until the next update window, I had put myself in trouble.

As the databases started to fail, we brought the database cluster down. No issues yet, but all hell broke loose when I ran a filesystem check on an LVM-based system with missing physical volumes.

Looking back

I would tell myself: that was stupid. Running pvs, lvs, or vgs would have alerted me that LVM had detected issues. Also, comparing the multipath configuration files would have revealed probable issues.

So, next time, I would first check whether LVM reports any issues before going for the last resort: a filesystem check and trying to fix the millions of errors. Most of the time you will destroy files anyway. A quick pre-flight check might look like the sketch below.
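A hedged sketch of that pre-flight check (standard LVM and multipath tools; the node names are placeholders):

    # run before touching clustered LVM storage
    pvs                 # any missing or unknown physical volumes?
    vgs                 # are all volume groups complete and writable?
    lvs -o +devices     # which PVs back each logical volume?
    multipath -ll       # are all expected LUNs visible on this node?
    # compare multipath configuration across the cluster nodes
    diff <(ssh node1 cat /etc/multipath.conf) <(ssh node2 cat /etc/multipath.conf)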

What saved my day

What saved my day back then was:

  • My good relations with colleagues over all teams, where a short talk with a great storage admin created the correct zoning to the required LUNs, and I got great help from a DBA who had deep knowledge of the clustered databases.
  • A good database backup.
  • Great management and a great service manager. They kept the annoyed customer away from us.
  • Not making promises I could not keep, like: "I will fix it in three hours." Instead, statements such as the one below help keep the customer satisfied: "At the current rate of fixing the filesystem, I cannot guarantee a fix within so many hours. As we just passed the 10% mark, I suggest we stop this approach and discuss another way to solve the issue."
Future-looking statement

I definitely learned some things. For example, always check the environment you're about to work on before any change. Never assume that you know how an environment looks -- change is a constant in IT.

Also, share what you learned from your mistakes. Train offshore colleagues instead of blaming them. Also, inform them about the impact the issue had on the customer's business. A continent's major transport hub cannot be put on hold due to a sysadmin's mistake.

A shutdown of the transport hub might have been needed if we had failed to solve the issue and the backup site, in another service provider's data centre, had been hit too. Part of the hub is a harbour, and we could have blown up part of the harbour next to a village of about 10,000 people if a cotton ship and an oil tanker had both disappeared from the harbour master's map and collided.

General lessons learned

I learned some important lessons overall from these and other mistakes:

  • Be humble enough to admit your mistakes.
  • Be arrogant enough to state that you are one of the few people that can help fix the issues you caused.
  • Show leadership of the solvers' team, or at least make sure that all of the team's roles will be fulfilled -- including the customer relations manager.
  • Take back the role of problem-solver after the team is created, if that is what was requested.
  • "Be part of the solution, do not become part of the problem," as a colleague says.

I cannot stress this enough: Learn from your mistakes to avoid them in the future, rather than learning how to make them on a weekly basis. (Jan Gerrit Kootstra, Solution Designer for Telco network services, Red Hat Accelerator)

[Nov 02, 2019] LVM spanning over multiple disks: What disk is a file on? Can I lose a drive without total loss?

Notable quotes:
"... If you lose a drive in a volume group, you can force the volume group online with the missing physical volume, but you will be unable to open the LV's that were contained on the dead PV, whether they be in whole or in part. ..."
"... So, if you had for instance 10 LV's, 3 total on the first drive, #4 partially on first drive and second drive, then 5-7 on drive #2 wholly, then 8-10 on drive 3, you would be potentially able to force the VG online and recover LV's 1,2,3,8,9,10.. #4,5,6,7 would be completely lost. ..."
"... LVM doesn't really have the concept of a partition it uses PVs (Physical Volumes), which can be a partition. These PVs are broken up into extents and then these are mapped to the LVs (Logical Volumes). When you create the LVs you can specify if the data is striped or mirrored but the default is linear allocation. So it would use the extents in the first PV then the 2nd then the 3rd. ..."
"... As Peter has said the blocks appear as 0's if a PV goes missing. So you can potentially do data recovery on files that are on the other PVs. But I wouldn't rely on it. You normally see LVM used in conjunction with RAIDs for this reason. ..."
"... it's effectively as if a huge chunk of your disk suddenly turned to badblocks. You can patch things back together with a new, empty drive to which you give the same UUID, and then run an fsck on any filesystems on logical volumes that went across the bad drive to hope you can salvage something. ..."
Mar 16, 2015 | serverfault.com

LVM spanning over multiple disks: What disk is a file on? Can I lose a drive without total loss? I have three 990GB partitions over three drives in my server. Using LVM, I can create one ~3TB partition for file storage.

1) How does the system determine what partition to use first?
2) Can I find what disk a file or folder is physically on?
3) If I lose a drive in the LVM, do I lose all data, or just data physically on that disk?

(asked by Luke has no name, Dec 2 '10)

  1. The system fills from the first disk in the volume group to the last, unless you configure striping with extents.
  2. I don't think this is possible, but where I'd start to look is in the lvs/vgs commands man pages.
  3. If you lose a drive in a volume group, you can force the volume group online with the missing physical volume, but you will be unable to open the LV's that were contained on the dead PV, whether they be in whole or in part.
  4. So, if you had for instance 10 LV's, 3 total on the first drive, #4 partially on first drive and second drive, then 5-7 on drive #2 wholly, then 8-10 on drive 3, you would be potentially able to force the VG online and recover LV's 1,2,3,8,9,10.. #4,5,6,7 would be completely lost.
(answered by Peter Grace)

1) How does the system determine what partition to use first?

LVM doesn't really have the concept of a partition it uses PVs (Physical Volumes), which can be a partition. These PVs are broken up into extents and then these are mapped to the LVs (Logical Volumes). When you create the LVs you can specify if the data is striped or mirrored but the default is linear allocation. So it would use the extents in the first PV then the 2nd then the 3rd.

2) Can I find what disk a file or folder is physically on?

You can determine what PVs a LV has allocation extents on. But I don't know of a way to get that information for an individual file.

3) If I lose a drive in the LVM, do I lose all data, or just data physically on that disk?

As Peter has said the blocks appear as 0's if a PV goes missing. So you can potentially do data recovery on files that are on the other PVs. But I wouldn't rely on it. You normally see LVM used in conjunction with RAIDs for this reason.

(answered by 3dinfluence)

  • So here's a derivative of my question: I have 3x1 TB drives and I want to use 3TB of that space. What's the best way to configure the drives so I am not splitting my data over folders/mountpoints? or is there a way at all, other than what I've implied above? – Luke has no name Dec 2 '10 at 5:12
  • If you want to use 3TB and aren't willing to split data over folders/mount points I don't see any other way. There may be some virtual filesystem solution to this problem like unionfs although I'm not sure if it would solve your particular problem. But LVM is certainly the most straight forward and simple solution as such it's the one I'd go with. – 3dinfluence Dec 2 '10 at 14:51
I don't know the answer to #2, so I'll leave that to someone else. I suspect "no", but I'm willing to be happily surprised.

1 is: you tell it, when you combine the physical volumes into a volume group.

3 is: it's effectively as if a huge chunk of your disk suddenly turned to badblocks. You can patch things back together with a new, empty drive to which you give the same UUID, and then run an fsck on any filesystems on logical volumes that went across the bad drive to hope you can salvage something.

And to the overall, unasked question: yeah, you probably don't really want to do that.
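For reference, a hedged sketch of forcing a degraded volume group online, as the first answer describes (vg_data is a placeholder; the exact options vary by LVM2 version, and --removemissing permanently discards the LVs that lived on the dead PV):

    pvs                                              # confirm which PV is missing
    vgchange -ay --activationmode partial vg_data    # activate whatever can still be activated
    # after salvaging data, drop the missing PV (destructive for the affected LVs):
    vgreduce --removemissing --force vg_data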

[Nov 02, 2019] Raid-5 is obsolete if you use large drives, such as 2TB or 3TB disks. You should instead use raid-6 (two disks can fail)

Notable quotes:
"... RAID5 can survive a single drive failure. However, once you replace that drive, it has to be initialized. Depending on the controller and other things, this can take anywhere from 5-18 hours. During this time, all drives will be in constant use to re-create the failed drive. It is during this time that people worry that the rebuild would cause another drive near death to die, causing a complete array failure. ..."
"... If during a rebuild one of the remaining disks experiences BER, your rebuild stops and you may have headaches recovering from such a situation, depending on controller design and user interaction. ..."
"... RAID5 + a GOOD backup is something to consider, though. ..."
"... Raid-5 is obsolete if you use large drives , such as 2TB or 3TB disks. You should instead use raid-6 ..."
"... RAID 6 offers more redundancy than RAID 5 (which is absolutely essential, RAID 5 is a walking disaster) at the cost of multiple parity writes per data write. This means the performance will be typically worse (although it's not theoretically much worse, since the parity operations are in parallel). ..."
Oct 03, 2019 | hardforum.com

RAID5 can survive a single drive failure. However, once you replace that drive, it has to be initialized. Depending on the controller and other things, this can take anywhere from 5-18 hours. During this time, all drives will be in constant use to re-create the failed drive. It is during this time that people worry that the rebuild would cause another drive near death to die, causing a complete array failure.

This isn't the only danger. The problem with 2TB disks, especially if they are not 4K sector disks, is that they have relative high BER rate for their capacity, so the likelihood of BER actually occurring and translating into an unreadable sector is something to worry about.

If during a rebuild one of the remaining disks experiences BER, your rebuild stops and you may have headaches recovering from such a situation, depending on controller design and user interaction.

So I would say that with modern high-BER drives:

  • RAID5: 0 complete disk failures, BER covered
  • RAID6: 1 complete disk failure, BER covered

So essentially you'll lose one parity disk alone for the BER issue. Not everyone will agree with my analysis, but considering RAID5 with today's high-capacity drives 'safe' is open for debate.

RAID5 + a GOOD backup is something to consider, though.

  1. So you're saying BER is the error count that 'escapes' the ECC correction? I do not believe that is correct, but I'm open to good arguments or links.

    As I understand it, the BER is what prompts bad sectors, where the number of errors exceeds the ECC error-correcting ability; and you will have an unrecoverable sector (Current Pending Sector in SMART output).

    Also these links are interesting in this context:

    http://blog.econtech.selfip.org/200...s-not-fully-readable-a-lawsuit-in-the-making/

    The short story first: Your consumer level 1TB SATA drive has a 44% chance that it can be completely read without any error. If you run a RAID setup, this is really bad news because it may prevent rebuilding an array in the case of disk failure, making your RAID not so Redundant.
    Not sure on the numbers the article comes up with, though.

    Also this one is interesting:
    http://lefthandnetworks.typepad.com/virtual_view/2008/02/what-does-data.html

    BER simply means that while reading your data from the disk drive you will get an average of one non-recoverable error in so many bits read, as specified by the manufacturer.
    Rebuilding the data on a replacement drive with most RAID algorithms requires that all the other data on the other drives be pristine and error free. If there is a single error in a single sector, then the data for the corresponding sector on the replacement drive cannot be reconstructed, and therefore the RAID rebuild fails and data is lost. The frequency of this disastrous occurrence is derived from the BER. Simple calculations will show that the chance of data loss due to BER is much greater than all other reasons combined.
    These links do suggest that BER works to produce un-recoverable sectors, and not 'escape' them as 'undetected' bad sectors, if i understood you correctly.
  1. parityOCP said:
    That guy's a bit of a scaremonger to be honest. He may have a point with consumer drives, but the article is sensationalised to a certain degree. However, there are still a few outfits that won't go past 500GB/drive in an array (even with enterprise drives), simply to reduce the failure window during a rebuild.
    Why is he a scaremonger? He is correct. Have you read his article? In fact, he has copied his argument from Adam Leventhal, who was one of the ZFS developers, I believe.

    Adam's argument goes like this:
    Disks are getting larger all the time; in fact, storage increases exponentially. At the same time, bandwidth is increasing not nearly as fast -- we are still at around 100MB/sec even after decades. So, bandwidth has increased maybe 20x after decades, while storage has increased from 10MB to 3TB = 300,000 times.

    The trend is clear. In the future when we have 10TB drives, they will not be much faster than today. This means, to repair an raid with 3TB disks today, will take several days, maybe even one week. With 10TB drives, it will take several weeks, maybe a month.

    Repairing a raid stresses the other disks much, which means they can break too. Experienced sysadmins reports that this happens quite often during a repair. Maybe because those disks come from the same batch, they have the same weakness. Some sysadmins therefore mix disks from different vendors and batches.

    Hence, I would not want to run a raid with 3TB disks using only raid-5. During those days of rebuilding, if just one more disk crashes, you have lost all your data.

    Hence, that article is correct, and he is not a scaremonger. Raid-5 is obsolete if you use large drives, such as 2TB or 3TB disks. You should instead use raid-6 (two disks can fail). That is the conclusion of the article: use raid-6 with large disks, forget raid-5. This is true, and not scaremongery.

    In fact, ZFS therefore has something called raidz3 -- which means that three disks can fail without problems. To the OP: no, raid-5 is not safe. Neither is raid-6, because neither of them can always repair, or even detect, corrupted data. There are cases when they don't even notice that you got corrupted bits. See my other thread for more information about this. That is the reason people are switching to ZFS -- which always CAN detect and repair those corrupted bits. I suggest you sell your hardware raid card and use ZFS, which requires no hardware raid. ZFS just uses JBOD.

    Here are research papers on raid-5, raid-6 and ZFS and corruption:
    http://hardforum.com/showpost.php?p=1036404173&postcount=73

  1. brutalizer said:
    The trend is clear. In the future when we have 10TB drives, they will not be much faster than today. This means, to repair an raid with 3TB disks today, will take several days, maybe even one week. With 10TB drives, it will take several weeks, maybe a month.
    While I agree with the general claim that the larger HDDs (1.5, 2, 3TBs) are best used in RAID 6, your claim about rebuild times is way off.

    I think it is not unreasonable to assume that the 10TB drives will be able to read and write at 200 MB/s or more. We already have 2TB drives with 150MB/s sequential speeds, so 200 MB/s is actually a conservative estimate.

    10e12/200e6 = 50000 secs = 13.9 hours. Even if there is 100% overhead (half the throughput), that is less than 28 hours to do the rebuild. It is a long time, but it is nowhere near a month! Try to back your claims in reality.

    And you have again made the false claim that "ZFS - which always CAN detect and repair those corrupted bits". ZFS can usually detect corrupted bits, and can usually correct them if you have duplication or parity, but nothing can always detect and repair. ZFS is safer than many alternatives, but nothing is perfectly safe. Corruption can and has happened with ZFS, and it will happen again.

Is RAID5 safe with Five 2TB Hard Drives? | [H]ard|Forum

https://hardforum.com/threads/is-raid5-safe-with-five-2tb-hard-drives.1560198/


RAID 5 Data Recovery How to Rebuild a Failed RAID 5 - YouTube

RAID 5 vs RAID 10: Recommended RAID For Safety and ...

https://www.cyberciti.biz/tips/raid5-vs-raid-10-safety-performance.html

RAID 6 offers more redundancy than RAID 5 (which is absolutely essential, RAID 5 is a walking disaster) at the cost of multiple parity writes per data write. This means the performance will be typically worse (although it's not theoretically much worse, since the parity operations are in parallel).

[Oct 22, 2019] Equifax Used 'admin' as Username and Password for Sensitive Data: Lawsuit

Oct 22, 2019 | tech.slashdot.org

(yahoo.com) 59 AndrewFlagg writes: When it comes to using strong username and passwords for administrative purposes let alone customer facing portals, Equifax appears to have dropped the ball. Equifax used the word "admin" as both password and username for a portal that contained sensitive information , according to a class action lawsuit filed in federal court in the Northern District of Georgia. The ongoing lawsuit, filed after the breach, went viral on Twitter Friday after Buzzfeed reporter Jane Lytvynenko came across the detail. "Equifax employed the username 'admin' and the password 'admin' to protect a portal used to manage credit disputes, a password that 'is a surefire way to get hacked,'" the lawsuit reads. The lawsuit also notes that Equifax admitted using unencrypted servers to store the sensitive personal information and had it as a public-facing website. When Equifax, one of the three largest consumer credit reporting agencies, did encrypt data, the lawsuit alleges, "it left the keys to unlocking the encryption on the same public-facing servers, making it easy to remove the encryption from the data." The class-action suit consolidated 373 previous lawsuits into one. Unlike other lawsuits against Equifax, these don't come from wronged consumers, but rather shareholders that allege the company didn't adequately disclose risks or its security practices.

[Oct 22, 2019] Flaw In Sudo Enables Non-Privileged Users To Run Commands As Root

Notable quotes:
"... the function which converts user id into its username incorrectly treats -1, or its unsigned equivalent 4294967295, as 0, which is always the user ID of root user. ..."
Oct 22, 2019 | linux.slashdot.org

(thehackernews.com) Posted by BeauHD on Monday October 14, 2019 @07:30PM from the Su-doh dept. exomondo shares a report from The Hacker News:

... ... ...

The vulnerability, tracked as CVE-2019-14287 and discovered by Joe Vennix of Apple Information Security, is more concerning because the sudo utility has been designed to let users use their own login password to execute commands as a different user without requiring their password.

What's more interesting is that this flaw can be exploited by an attacker to run commands as root just by specifying the user ID "-1" or "4294967295."

That's because the function which converts user id into its username incorrectly treats -1, or its unsigned equivalent 4294967295, as 0, which is always the user ID of root user.

The vulnerability affects all Sudo versions prior to the latest released version 1.8.28, which has been released today.
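The widely published proof of concept is short. Given a sudoers rule that lets a user run a command as any user except root, for example "alice ALL = (ALL, !root) /usr/bin/id", the restriction can be bypassed on vulnerable versions by passing user ID -1:

    # on sudo < 1.8.28, with an "(ALL, !root)" rule like the one above:
    sudo -u#-1 id -u
    # prints 0 -- the command runs as root despite the !root restriction
    sudo -u#4294967295 id -u
    # same effect, using the unsigned equivalent of -1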

    • Re: Not many systems vulnerable (comment by mysidia):

      If you have been blessed with the power to run commands as ANY user you want, then you are still specially privileged, even though you are not fully privileged.

      It's a rare/unusual configuration to say (all, !root) --- the people using this configuration on their systems should probably KNOW there are going to exist some ways that access can be abused to ultimately circumvent the intended !root rule -- if not within sudo itself, then by using sudo to get a shell as a different user UID that belongs to some person or program who DOES have root permissions, and then causing crafted code to run as that user --- for example, by installing a trojanned version of the screen command and modifying files in the home directory of a legitimate root user to alias the screen command to the trojanned version, which will log the password the next time that other user logs in normally and uses the sudo command.

[Oct 20, 2019] SSL certificate installation in RHEL 6 and 7

Oct 20, 2019 | access.redhat.com

Resolution

  1. Install/update the latest ca-certificates package:

     # yum install ca-certificates

  2. For example, download the RapidSSL Intermediate CA certificates.

  3. Enable the dynamic CA configuration feature by running:

     # update-ca-trust enable

  4. Copy the RapidSSL Intermediate CA certificate to the directory /etc/pki/ca-trust/source/anchors/:

     # cp rapidSSL-ca.crt /etc/pki/ca-trust/source/anchors/

  5. Extract and add the Intermediate CA certificate to the list of trusted CAs:

     # update-ca-trust extract

  6. Verify an SSL certificate signed by RapidSSL:

     # openssl verify server.crt
     server.crt: OK
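To double-check that a running service presents a chain the updated trust store accepts, it can also help to test against the live endpoint (the hostname is a placeholder; the CA bundle path shown is the usual RHEL location):

     # echo closes the connection right after the handshake
     echo | openssl s_client -connect www.example.com:443 -servername www.example.com \
         -CAfile /etc/pki/tls/certs/ca-bundle.crt 2>/dev/null | grep 'Verify return code'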
    

[Oct 02, 2019] raid5 - Can I recover a RAID 5 array if two drives have failed - Server Fault

Oct 02, 2019 | serverfault.com

Can I recover a RAID 5 array if two drives have failed? I have a Dell 2600 with 6 drives configured in a RAID 5 on a PERC 4 controller. 2 drives failed at the same time, and according to what I know a RAID 5 is recoverable if 1 drive fails. I'm not sure if the fact I had six drives in the array might save my skin.

I bought 2 new drives and plugged them in but no rebuild happened as I expected. Can anyone shed some light?


Regardless of how many drives are in use, a RAID 5 array only allows for recovery in the event that just one disk at a time fails.

What 3molo says is a fair point but even so, not quite correct I think - if two disks in a RAID5 array fail at the exact same time then a hot spare won't help, because a hot spare replaces one of the failed disks and rebuilds the array without any intervention, and a rebuild isn't possible if more than one disk fails.

For now, I am sorry to say that your options for recovering this data are going to involve restoring a backup.

For the future you may want to consider one of the more robust forms of RAID (not sure what options a PERC4 supports) such as RAID 6 or a nested RAID array. Once you get above a certain amount of disks in an array you reach the point where the chance that more than one of them can fail before a replacement is installed and rebuilt becomes unacceptably high. (answered by Rob Moir)

  • thanks Robert, I will take this advice into consideration when I rebuild the server; lucky for me I have full backups that are less than 6 hours old. regards – bonga86 Sep 21 '10 at 15:00
  • If this is (somehow) likely to occur again in the future, you may consider RAID6. Same idea as RAID5 but with two Parity disks, so the array can survive any two disks failing. – gWaldo Sep 21 '10 at 15:04
  • g man(mmm...), i have recreated the entire system from scratch with a RAID 10. So hopefully if 2 drives go out at the same time again the system will still function? Otherwise everything has been restored and working thanks for ideas and input – bonga86 Sep 23 '10 at 11:34
  • Depends which two drives go... RAID 10 means, for example, 4 drives in two mirrored pairs (2 RAID 1 mirrors) striped together (RAID 0) yes? If you lose both disks in 1 of the mirrors then you've still got an outage. – Rob Moir Sep 23 '10 at 11:43
  • Remember that RAID is not a backup. No more robust forms of RAID will save you from data corruption. – Kazimieras Aliulis May 31 '13 at 10:57
You can try to force one or both of the failed disks to be online from the BIOS interface of the controller. Then check that the data and the file system are consistent. (answered by Mircea Vutcovici)
  • Dell systems, especially, in my experience, built on PERC3 or PERC4 cards had a nasty tendency to simply have a hiccup on the SCSI bus which would knock two or more drives off-line. A drive being offline does NOT mean it failed. I've never had two drives fail at the same time, but probably a half dozen times, I've had two or more drives go off-line. I suggest you try Mircea's suggestion first... could save you a LOT of time. – Multiverse IT Sep 21 '10 at 16:32
  • Hey guys, I tried the force option many times. Both "failed" drives would then come back online, but when I do a restart it says logical drive: degraded, and obviously because of that the system still could not boot. – bonga86 Sep 23 '10 at 11:27
Direct answer is "No". Indirect: "It depends". Mainly it depends on whether the disks are partially out of order, or completely. In case they're partially broken, you can give it a try -- I would copy both failed disks (using a tool like ddrescue). Then I'd try to run the bunch of disks using Linux SoftRAID -- re-trying with the proper order of disks and stripe size in read-only mode and counting CRC mismatches. It's quite doable, I should say -- this text in Russian mentions a 12-disk RAID50 recovery using LSR, for example. (answered by poige)

It is possible if the RAID had one spare drive and one of your failed disks died before the second one. So, you just need to try to reconstruct the array virtually with third-party software. Found a small article about this process on this page: http://www.angeldatarecovery.com/raid5-data-recovery/

And if you really need the data from one of the dead drives, you can send it to a recovery shop. With the resulting disk images you can reconstruct the RAID properly, with a good chance of success.
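For readers who want to try the image-first approach described above, here is a minimal sketch. The device names, RAID level, disk count, and chunk size are only illustrative guesses; adjust them to match the actual controller configuration.

# Device names, array geometry, and chunk size below are guesses for illustration only.
# 1. Image every suspect disk before doing anything else; never work on the originals.
ddrescue -d -r3 /dev/sdb /mnt/rescue/disk1.img /mnt/rescue/disk1.map
ddrescue -d -r3 /dev/sdc /mnt/rescue/disk2.img /mnt/rescue/disk2.map
ddrescue -d -r3 /dev/sdd /mnt/rescue/disk3.img /mnt/rescue/disk3.map

# 2. Attach the images as loop devices so the Linux md layer can see them.
losetup -fP --show /mnt/rescue/disk1.img    # prints e.g. /dev/loop0
losetup -fP --show /mnt/rescue/disk2.img    # /dev/loop1
losetup -fP --show /mnt/rescue/disk3.img    # /dev/loop2

# 3. Re-create the array metadata on the copies, guessing disk order and chunk size,
#    without triggering a resync (--assume-clean).
mdadm --create /dev/md0 --level=5 --raid-devices=3 --chunk=64 \
      --assume-clean /dev/loop0 /dev/loop1 /dev/loop2

# 4. Check the result without writing to it; if it looks wrong, stop the array,
#    re-create it with a different disk order or chunk size, and try again.
fsck -n /dev/md0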

[Sep 23, 2019] How to recover deleted files with foremost on Linux - LinuxConfig.org

Sep 23, 2019 | linuxconfig.org
In this article we will talk about foremost, a very useful open source forensic utility which is able to recover deleted files using a technique called data carving. The utility was originally developed by the United States Air Force Office of Special Investigations, and is able to recover several file types (support for specific file types can be added by the user via the configuration file). The program can also work on partition images produced by dd or similar tools.

In this tutorial you will learn:

  • How to install foremost
  • How to use foremost to recover deleted files
  • How to add support for a specific file type
Foremost is a forensic data recovery program for Linux used to recover files using their headers, footers, and data structures through a process known as file carving.

Software Requirements and Conventions Used
Software Requirements and Linux Command Line Conventions
Category Requirements, Conventions or Software Version Used
System Distribution-independent
Software The "foremost" program
Other Familiarity with the command line interface
Conventions # - requires given linux commands to be executed with root privileges either directly as a root user or by use of sudo command
$ - requires given linux commands to be executed as a regular non-privileged user
Installation

Since foremost is already present in all the major Linux distributions' repositories, installing it is a very easy task. All we have to do is use our favorite distribution's package manager. On Debian and Ubuntu, we can use apt:

$ sudo apt install foremost

In recent versions of Fedora, we use the dnf package manager (the successor of yum) to install packages. The name of the package is the same:

$ sudo dnf install foremost

If we are using ArchLinux, we can use pacman to install foremost . The program can be found in the distribution "community" repository:

$ sudo pacman -S foremost



Basic usage
WARNING
No matter which file recovery tool or process you are going to use to recover your files, before you begin it is recommended to perform a low-level backup of the hard drive or partition, to avoid accidentally overwriting data. That way you can retry recovering your files even after an unsuccessful recovery attempt. See a dd command guide on how to perform a low-level hard drive or partition backup.

The foremost utility tries to recover and reconstruct files on the basis of their headers, footers and data structures, without relying on filesystem metadata. This forensic technique is known as file carving. The program supports various types of files, for example:

  • jpg
  • gif
  • png
  • bmp
  • avi
  • exe
  • mpg
  • wav
  • riff
  • wmv
  • mov
  • pdf
  • ole
  • doc
  • zip
  • rar
  • htm
  • cpp

The most basic way to use foremost is by providing a source to scan for deleted files (it can be either a partition or an image file, such as those generated with dd). Let's see an example. Imagine we want to scan the /dev/sdb1 partition: before we begin, a very important thing to remember is to never store retrieved data on the same partition we are retrieving the data from, to avoid overwriting deleted files still present on the block device. The command we would run is:

$ sudo foremost -i /dev/sdb1

By default, the program creates a directory called output inside the directory we launched it from and uses it as destination. Inside this directory, a subdirectory for each supported file type we are attempting to retrieve is created. Each directory will hold the corresponding file type obtained from the data carving process:

output
├── audit.txt
├── avi
├── bmp
├── dll
├── doc
├── docx
├── exe
├── gif
├── htm
├── jar
├── jpg
├── mbd
├── mov
├── mp4
├── mpg
├── ole
├── pdf
├── png  
├── ppt
├── pptx
├── rar
├── rif
├── sdw
├── sx
├── sxc
├── sxi
├── sxw
├── vis
├── wav
├── wmv
├── xls
├── xlsx
└── zip

When foremost completes its job, empty directories are removed. Only the ones containing files are left on the filesystem: this lets us immediately know what types of files were successfully retrieved. By default the program tries to retrieve all the supported file types; to restrict our search, we can, however, use the -t option and provide a list of the file types we want to retrieve, separated by a comma. In the example below, we restrict the search only to gif and pdf files:

$ sudo foremost -t gif,pdf -i /dev/sdb1

(A video in the original article demonstrates foremost recovering a single png file from a /dev/sdb1 partition formatted with the EXT4 filesystem.)


Specifying an alternative destination

As we already said, if a destination is not explicitly declared, foremost creates an output directory inside our cwd. What if we want to specify an alternative path? All we have to do is use the -o option and provide said path as argument. If the specified directory doesn't exist, it is created; if it exists but it's not empty, the program raises a complaint:

ERROR: /home/egdoc/data is not empty
        Please specify another directory or run with -T.

To solve the problem, as suggested by the program itself, we can either use another directory or re-launch the command with the -T option. If we use the -T option, the output directory specified with the -o option is timestamped. This makes it possible to run the program multiple times with the same destination. In our case the directory that would be used to store the retrieved files would be:

/home/egdoc/data_Thu_Sep_12_16_32_38_2019
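For instance, assuming the same base destination as above, a repeated run restricted to pdf files could look like this (the paths are just examples):

$ sudo foremost -t pdf -i /dev/sdb1 -o /home/egdoc/data -T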
The configuration file

The foremost configuration file can be used to specify file formats not natively supported by the program. Inside the file we can find several commented examples showing the syntax that should be used to accomplish the task. Here is an example involving the png type (the lines are commented since the file type is supported by default):

# PNG   (used in web pages)
#       (NOTE THIS FORMAT HAS A BUILTIN EXTRACTION FUNCTION)
#       png     y       200000  \x50\x4e\x47?   \xff\xfc\xfd\xfe

The information to provide in order to add support for a file type is, from left to right, separated by a tab character: the file extension (png in this case), whether the header and footer are case sensitive (y), the maximum file size in bytes (200000), the header (\x50\x4e\x47?) and the footer (\xff\xfc\xfd\xfe). Only the footer is optional and can be omitted.

If the path of the configuration file is not explicitly provided with the -c option, a file named foremost.conf in the current working directory is searched for and used, if present. If it is not found, the default configuration file, /etc/foremost.conf, is used instead.
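As a quick illustration, assuming we keep our customized rules in a file called custom-foremost.conf (a hypothetical name), we could point the program at it explicitly:

$ sudo foremost -c ~/custom-foremost.conf -i /dev/sdb1 -o ~/recovered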

Adding the support for a file type

By reading the examples provided in the configuration file, we can easily add support for a new file type. In this example we will add support for flac audio files. FLAC (Free Lossless Audio Codec) is a non-proprietary lossless audio format which is able to provide compressed audio without quality loss. First of all, we know that the header of this file type in hexadecimal form is 66 4C 61 43 00 00 00 22 (the first four bytes spell fLaC in ASCII), and we can verify it by using a program like hexdump on a flac file:

$ hexdump -C blind_guardian_war_of_wrath.flac | head
00000000  66 4c 61 43 00 00 00 22  12 00 12 00 00 00 0e 00  |fLaC..."........|
00000010  36 f2 0a c4 42 f0 00 4d  04 60 6d 0b 64 36 d7 bd  |6...B..M.`m.d6..|
00000020  3e 4c 0d 8b c1 46 b6 fe  cd 42 04 00 03 db 20 00  |>L...F...B.... .|
00000030  00 00 72 65 66 65 72 65  6e 63 65 20 6c 69 62 46  |..reference libF|
00000040  4c 41 43 20 31 2e 33 2e  31 20 32 30 31 34 31 31  |LAC 1.3.1 201411|
00000050  32 35 21 00 00 00 12 00  00 00 54 49 54 4c 45 3d  |25!.......TITLE=|
00000060  57 61 72 20 6f 66 20 57  72 61 74 68 11 00 00 00  |War of Wrath....|
00000070  52 45 4c 45 41 53 45 43  4f 55 4e 54 52 59 3d 44  |RELEASECOUNTRY=D|
00000080  45 0c 00 00 00 54 4f 54  41 4c 44 49 53 43 53 3d  |E....TOTALDISCS=|
00000090  32 0c 00 00 00 4c 41 42  45 4c 3d 56 69 72 67 69  |2....LABEL=Virgi|

As you can see, the file signature is indeed what we expected. Here we will assume a maximum file size of 30 MB, or 30000000 bytes. Let's add the entry to the configuration file:

flac    y       30000000    \x66\x4c\x61\x43\x00\x00\x00\x22

The footer signature is optional, so here we didn't provide it. The program should now be able to recover deleted flac files. Let's verify it. To test that everything works as expected, I previously placed, and then removed, a flac file from the /dev/sdb1 partition, and then proceeded to run the command:

$ sudo foremost -i /dev/sdb1 -o $HOME/Documents/output

As expected, the program was able to retrieve the deleted flac file (it was the only file on the device, on purpose), although it renamed it with a random string. The original filename cannot be retrieved because, as we know, file metadata is stored in the filesystem, and not in the file itself:

/home/egdoc/Documents
└── output
    ├── audit.txt
    └── flac
        └── 00020482.flac

The audit.txt file contains information about the actions performed by the program, in this case:

Foremost version 1.5.7 by Jesse Kornblum, Kris
Kendall, and Nick Mikus
Audit File

Foremost started at Thu Sep 12 23:47:04 2019
Invocation: foremost -i /dev/sdb1 -o /home/egdoc/Documents/output
Output directory: /home/egdoc/Documents/output
Configuration file: /etc/foremost.conf
------------------------------------------------------------------
File: /dev/sdb1
Start: Thu Sep 12 23:47:04 2019
Length: 200 MB (209715200 bytes)

Num      Name (bs=512)         Size      File Offset     Comment

0:      00020482.flac         28 MB        10486784
Finish: Thu Sep 12 23:47:04 2019

1 FILES EXTRACTED

flac:= 1
------------------------------------------------------------------

Foremost finished at Thu Sep 12 23:47:04 2019
Conclusion

In this article we learned how to use foremost, a forensic program able to retrieve deleted files of various types. We learned that the program works by using a technique called data carving and relies on file signatures to achieve its goal. We saw an example of the program's usage, and we also learned how to add support for a specific file type using the syntax illustrated in the configuration file. For more information about the program, please consult its manual page.

[Sep 22, 2019] Easing into automation with Ansible Enable Sysadmin

Sep 19, 2019 | www.redhat.com
It's easier than you think to get started automating your tasks with Ansible. This gentle introduction gives you the basics you need to begin streamlining your administrative life.

Posted by Jörg Kastning (Red Hat Accelerator)


At the end of 2015 and the beginning of 2016, we decided to use Red Hat Enterprise Linux (RHEL) as our third operating system, next to Solaris and Microsoft Windows. I was part of the team that tested RHEL, among other distributions, and would be involved in the upcoming operation of the new OS. Thinking about a fast-growing number of Red Hat Enterprise Linux systems, it became clear to me that I needed a tool to automate things, because without automation the number of hosts I can manage is limited.

I had experience with Puppet back in the day but did not like that tool because of its complexity. We had more modules and classes than hosts to manage back then. So, I took a look at Ansible version 2.1.1.0 in July 2016.

What I liked about Ansible and still do is that it is push-based. On a target node, only Python and SSH access are needed to control the node and push configuration settings to it. No agent needs to be removed if you decide that Ansible isn't the right tool for you. The YAML syntax is easy to read and write, and the option to use playbooks as well as ad hoc commands makes Ansible a flexible solution that helps save time in our day-to-day business. So, it was at the end of 2016 when we decided to evaluate Ansible in our environment.

First steps

As a rule of thumb, you should begin automating things that you have to do on a daily or at least a regular basis. That way, automation saves time for more interesting or more important things. I followed this rule by using Ansible for the following tasks:

  1. Set a baseline configuration for newly provisioned hosts (set DNS, time, network, sshd, etc.)
  2. Set up patch management to install Red Hat Security Advisories (RHSAs) .
  3. Test how useful the ad hoc commands are, and where we could benefit from them.
Baseline Ansible configuration

For us, baseline configuration is the configuration every newly provisioned host gets. This practice makes sure the host fits into our environment and is able to communicate on the network. Because the same configuration steps have to be made for each new host, this is an awesome step to get started with automation.

The following are the tasks I started with:

(Some of these steps are already published here on Enable Sysadmin, as you can see, and others might follow soon.)

All of these tasks have in common that they are small and easy to start with, letting you gather experience with using different kinds of Ansible modules, roles, variables, and so on. You can run each of these roles and tasks standalone, or tie them all together in one playbook that sets the baseline for your newly provisioned system.
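For example, once such a baseline playbook exists, applying it to freshly provisioned hosts is a single command. The playbook, inventory, and group names below are placeholders, not the author's actual files:

# Dry run first to see what would change on the new hosts (names are placeholders)
ansible-playbook -i inventory baseline.yml --limit new_hosts --check --diff

# Then apply the baseline for real
ansible-playbook -i inventory baseline.yml --limit new_hosts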

Red Hat Enterprise Linux Server patch management with Ansible

As I explained on my GitHub page for ansible-role-rhel-patchmanagement , in our environment, we deploy Red Hat Enterprise Linux Servers for our operating departments to run their applications.

This role was written to provide a mechanism to install Red Hat Security Advisories on target nodes once a month. In our particular use case, only RHSAs are installed, to ensure a minimum security baseline. The installation is enforced once a month. The advisories are summarized in "patch sets." This ensures that the same advisories are used for all stages during a patch cycle.

The Ansible Inventory nodes are summarized in one of the following groups, each of which defines when a node is scheduled for patch installation:

  • [rhel-patch-phase1] - On the second Tuesday of a month.
  • [rhel-patch-phase2] - On the third Tuesday of a month.
  • [rhel-patch-phase3] - On the fourth Tuesday of a month.
  • [rhel-patch-phase4] - On the fourth Wednesday of a month.

If packages were updated on the target nodes, the hosts will reboot afterward.

Because the production systems are most important, they are divided into two separate groups (phase3 and phase4) to decrease the risk of failure and service downtime due to advisory installation.

You can find more about this role in my GitHub repo: https://github.com/Tronde/ansible-role-rhel-patchmanagement .

Updating and patch management are tasks every sysadmin has to deal with. With these roles, Ansible helps me get this task done every month, and I don't have to care about it anymore. Only when a system is not reachable, or yum has a problem, do I get an email report telling me to take a look. But I have been lucky and have not received any mail report for the last couple of months now. (Yes, of course, the system is able to send mail.)

Ad hoc commands

The possibility to run ad hoc commands for quick (and dirty) tasks was one of the reasons I chose Ansible. You can use these commands to gather information when you need them or to get things done without the need to write a playbook first.

I used ad hoc commands in cron jobs until I found the time to write playbooks for them. But, with time comes practice, and today I try to use playbooks and roles for every task that has to run more than once.

Here are small examples of ad hoc commands that provide quick information about your nodes.

Query package version:

ansible all -m command -a '/usr/bin/rpm -qi <PACKAGE NAME>' | grep 'SUCCESS\|Version'

Query OS release:

ansible all -m command -a '/usr/bin/cat /etc/os-release'

Query running kernel version:

ansible all -m command -a '/usr/bin/uname -r'

Query DNS servers in use by nodes:

ansible all -m command -a '/usr/bin/cat /etc/resolv.conf' | grep 'SUCCESS\|nameserver'

Hopefully, these samples give you an idea of what ad hoc commands can be used for.

Summary

It's not hard to start with automation. Just look for small and easy tasks you do every single day, or even more than once a day, and let Ansible do these tasks for you.

Eventually, you will be able to solve more complex tasks as your automation skills grow. But keep things as simple as possible. You gain nothing when you have to troubleshoot a playbook for three days when it solves a task you could have done in an hour.

[Want to learn more about Ansible? Check out these free e-books .]

[Sep 18, 2019] Oracle woos Red Hat users with Autonomous Linux InfoWorld

Notable quotes:
"... Sending a shot across the bow to rival Red Hat, Oracle has introduced Oracle Autonomous Linux, an "autonomous operating system" in the Oracle Cloud that is designed to eliminate manual OS management and human error. Oracle Autonomous Linux is patched, updated, and tuned without human interaction. ..."
Sep 18, 2019 | www.infoworld.com
Sending a shot across the bow to rival Red Hat, Oracle has introduced Oracle Autonomous Linux, an "autonomous operating system" in the Oracle Cloud that is designed to eliminate manual OS management and human error. Oracle Autonomous Linux is patched, updated, and tuned without human interaction.

The Oracle Autonomous Linux service is paired with the new Oracle OS Management Service, which is a cloud infrastructure component providing control over systems running Autonomous Linux, Linux, or Windows. Binary compatibility is offered for IBM Red Hat Enterprise Linux, providing for application compatibility on Oracle Cloud infrastructure.


The benefits of Oracle Autonomous Linux include:

  • Automated daily package updates, enhanced OS parameter tuning, and OS diagnostics gathering.
  • In-depth security protection, with daily updates to the Linux kernel and key user space libraries with no downtime.
  • Exploit detection is provided, offering automated alerts should anyone try to exploit a vulnerability already patched by Oracle.

Leveraging Oracle Cloud Infrastructure, Autonomous Linux has capabilities such as autoscaling, monitoring, and lifecycle management across pools. Autonomous Linux and OS Management Service are included with Oracle Premier Support at no extra charge with Oracle Cloud Infrastructure compute services. Based on Oracle Linux, Autonomous Linux is available for deployment only in Oracle Cloud and not for on-premises deployments.

Where to access Oracle Autonomous Linux

You can access an instance of Oracle Autonomous Linux from the Oracle Cloud Marketplace.

[Sep 16, 2019] 10 Ansible modules you need to know Opensource.com

Sep 16, 2019 | opensource.com

10 Ansible modules you need to know
See examples and learn the most important modules for automating everyday tasks with Ansible. 11 Sep 2019, DirectedSoul (Red Hat)

Ansible is an open source IT configuration management and automation platform. It uses human-readable YAML templates so users can program repetitive tasks to happen automatically without having to learn an advanced programming language.

Ansible is agentless, which means the nodes it manages do not require any software to be installed on them. This eliminates potential security vulnerabilities and makes overall management smoother.

Ansible modules are standalone scripts that can be used inside an Ansible playbook. A playbook consists of a play, and a play consists of tasks. These concepts may seem confusing if you're new to Ansible, but as you begin writing and working more with playbooks, they will become familiar.

There are some modules that are frequently used in automating everyday tasks; those are the ones that we will cover in this article.

Ansible has three main files that you need to consider:

  • Host/inventory file: Contains the entry of the nodes that need to be managed
  • Ansible.cfg file: Located by default at /etc/ansible/ansible.cfg , it has the necessary privilege escalation options and the location of the inventory file
  • Main file: A playbook that has modules that perform various tasks on a host listed in an inventory or host file
Module 1: Package management

There is a module for most popular package managers, such as DNF and APT, to enable you to install any package on a system. Functionality depends entirely on the package manager, but usually these modules can install, upgrade, downgrade, remove, and list packages. The names of the relevant modules are easy to guess: for example, the DNF module is dnf, the older YUM module (required for Python 2 compatibility) is yum, the APT module is apt, the Slackware package module is slackpkg, and so on.

Example 1:

- name: Install the latest version of Apache and MariaDB
  dnf:
    name:
      - httpd
      - mariadb-server
    state: latest

This installs the Apache web server and the MariaDB SQL database.

Example 2:

- name: Install a list of packages
  yum:
    name:
      - nginx
      - postgresql
      - postgresql-server
    state: present

This installs the list of packages and helps download multiple packages.

Module 2: Service

After installing a package, you need a module to start it. The service module enables you to start, stop, and reload installed packages; this comes in pretty handy.

Example 1:

- name: Start service foo, based on running process /usr/bin/foo
  service:
    name: foo
    pattern: /usr/bin/foo
    state: started

This starts the service foo .

Example 2:

- name: Restart network service for interface eth0
  service:
    name: network
    state: restarted
    args: eth0

This restarts the network service of the interface eth0 .

Module 3: Copy

The copy module copies a file from the local or remote machine to a location on the remote machine.

Example 1:

- name: Copy a new "ntp.conf" file into place, backing up the original if it differs from the copied version
  copy:
    src: /mine/ntp.conf
    dest: /etc/ntp.conf
    owner: root
    group: root
    mode: '0644'
    backup: yes

Example 2:

- name: Copy file with owner and permission, using symbolic representation
  copy:
    src: /srv/myfiles/foo.conf
    dest: /etc/foo.conf
    owner: foo
    group: foo
    mode: u=rw,g=r,o=r

Module 4: Debug

The debug module prints statements during execution and can be useful for debugging variables or expressions without having to halt the playbook.

Example 1:

- name: Display all variables/facts known for a host
  debug:
    var: hostvars[inventory_hostname]
    verbosity: 4

This displays all the variable information for a host that is defined in the inventory file.

Example 2:

- name: Write some content in a file /tmp/foo.txt
  copy:
    dest: /tmp/foo.txt
    content: |
      Good Morning!
      Awesome sunshine today.
  register: display_file_content

- name: Debug display_file_content
  debug:
    var: display_file_content
    verbosity: 2

This registers the content of the copy module output and displays it only when you specify verbosity as 2. For example:

ansible-playbook demo.yaml -vv
Module 5: File

The file module manages the file and its properties.

  • It sets attributes of files, symlinks, or directories.
  • It also removes files, symlinks, or directories.
Example 1:

- name: Change file ownership, group and permissions
  file:
    path: /etc/foo.conf
    owner: foo
    group: foo
    mode: '0644'

This sets the ownership of the file /etc/foo.conf to foo:foo and its permissions to 0644.

Example 2:

- name: Create a directory if it does not exist
  file:
    path: /etc/some_directory
    state: directory
    mode: '0755'

This creates a directory named some_directory and sets the permission to 0755 .

Module 6: Lineinfile

The lineinfile module manages lines in a text file.

  • It ensures a particular line is in a file or replaces an existing line using a back-referenced regular expression.
  • It's primarily useful when you want to change just a single line in a file.
Example 1:

- name: Ensure SELinux is set to enforcing mode
  lineinfile:
    path: /etc/selinux/config
    regexp: '^SELINUX='
    line: SELINUX=enforcing

This sets the value of SELINUX=enforcing .

Example 2:

- name: Add a line to a file if the file does not exist, without passing regexp
  lineinfile:
    path: /etc/resolv.conf
    line: 192.168.1.99 foo.lab.net foo
    create: yes

This adds an entry for the IP and hostname in the resolv.conf file.

Module 7: Git

The git module manages git checkouts of repositories to deploy files or software.

Example 1:

# Create a git archive from a repo
- git:
    repo: https://github.com/ansible/ansible-examples.git
    dest: /src/ansible-examples
    archive: /tmp/ansible-examples.zip

Example 2:

- git:
    repo: https://github.com/ansible/ansible-examples.git
    dest: /src/ansible-examples
    separate_git_dir: /src/ansible-examples.git

This clones a repo with a separate Git directory.

Module 8: Cli_command

The cli_command module, first available in Ansible 2.7, provides a platform-agnostic way of running commands on network devices over the network_cli connection plugin; the closely related cli_config module, used in the examples below, pushes text-based configuration in the same way.

Example 1:

- name: Commit with comment
  cli_config:
    config: set system host-name foo
    commit_comment: this is a test

This sets the hostname for a switch and exits with a commit message.

Example 2:

- name: Configurable backup path
  cli_config:
    config: "{{ lookup('template', 'basic/config.j2') }}"
    backup: yes
    backup_options:
      filename: backup.cfg
      dir_path: /home/user

This backs up a config to a different destination file.

Module 9: Archive

The archive module creates a compressed archive of one or more files. By default, it assumes the compression source exists on the target.

Example 1:

- name: Compress directory /path/to/foo/ into /path/to/foo.tgz
  archive:
    path: /path/to/foo
    dest: /path/to/foo.tgz

Example 2:

- name: Create a bz2 archive of multiple files, rooted at /path
  archive:
    path:
      - /path/to/foo
      - /path/wong/foo
    dest: /path/file.tar.bz2
    format: bz2

Module 10: Command

One of the most basic but useful modules, the command module takes the command name followed by a list of space-delimited arguments.

Example 1:

- name: Return motd to registered var
  command: cat /etc/motd
  register: mymotd

Example 2:

- name: Change the working directory to somedir/ and run the command as db_owner if /path/to/database does not exist
  command: /usr/bin/make_database.sh db_user db_name
  become: yes
  become_user: db_owner
  args:
    chdir: somedir/
    creates: /path/to/database

Conclusion

There are tons of modules available in Ansible, but these ten are the most basic and powerful ones you can use for an automation job. As your requirements change, you can learn about other useful modules by entering ansible-doc <module-name> on the command line or by referring to the official documentation.
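As a quick sketch of how ansible-doc can be used to explore modules (the module names here are only examples):

# List all modules known to the local Ansible installation, filtered by keyword
ansible-doc -l | grep -i archive

# Show the full documentation and examples for a single module
ansible-doc archive

# Show just the option summary, handy as a starting point for a playbook task
ansible-doc -s copy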

[Sep 16, 2019] Creating virtual machines with Vagrant and Ansible for website development Opensource.com

Sep 16, 2019 | opensource.com

Using Vagrant and Ansible to deploy virtual machines for web development
23 Feb 2016, Betsy Gamrat

Vagrant and Ansible are tools to efficiently provision virtual machines (VMs), or, in Vagrant terms, "boxes." We begin with a short discussion of why a web developer would invest the time to use these tools, then cover the required software, an overview of how Vagrant works with virtual machine providers, and the use of Ansible to provision a virtual machine.

Background

First, this article builds on our guide to installing and testing eZ Publish 5 in a virtual machine, which may be a useful reference. We'll describe the tools you can use to automate the creation and provisioning of virtual machines. There is much to cover beyond what we talk about here, but for now we will focus on providing a general overview.

Why should you use virtual machines for website development?

The traditional ways that developers work on websites are on remote development machines or locally, directly on their main operating system. There are many advantages to using virtual machines for development, including the following:

  • The entire development team can have the same server and configuration, without purchasing additional hardware.
  • The local virtual machines can be more representative of the production servers.
  • You can bring up the virtual machine when you need to work with it and shut it down when you're done. This is especially helpful if you need different versions of software, such as different version of PHP for different projects.

Vagrant and Ansible help to automate the provisioning of the virtual machines. Vagrant handles the starting and stopping of the machines as well as some configuration, and Ansible breaks down the details of the machines into easy-to-read configuration files and installs and configures the software within the virtual machines.

General approach

We usually keep the site code on the host operating system and share these files from the host operating system to the virtual machines. This way, the virtual machines can be loaded with only the software required to run the site, and each team member can use their favorite local tools for code editing, version control, and more under the operating system they use most often.

Risks

This scheme is not without risk and complexity. A virtual machine is a great tool for your collection, but like any tool, you will need to take some time to learn how to use it.

  • When Vagrant, Ansible, and VirtualBox or another virtual machine provider are running well together, they help development to run more efficiently and can improve the quality of your work. However, when things go wrong they can distract from actual web development. They represent additional tools that you need to maintain and troubleshoot, and you need to train and support your development team to use them properly.
  • Host operating system: Be sure your host operating system supports the tools you're planning to use. As I mentioned, this blog post focuses on Ansible. Ansible is not officially supported on Windows as the host. That means if you only have a Windows machine to work with, you'll need to consider using Linux as the controller. (There are some tricks to get it to work on Windows.)
  • Performance: Remember the intent of these virtual machines is to support development. They will not run as fast as standalone servers. If that is an issue, you will likely need to invest some time in improving the performance.
  • An implicit assumption is made that only one instance of the virtual machine will be running on the host at any given time. If you will be using more than one instance of the virtual machine, you'll need to take that into account as you set it up.
Getting started

The first step is to install the required software: Vagrant , Ansible , and VirtualBox . We will focus only on VirtualBox in this post, but you can also use a different provider , including many options which are open source. You may need some VirtualBox extensions and Vagrant plugins as well. Take the time to read the documentation carefully.

Then, you will need a starting point for your virtual server, typically called a "base box." For your first virtual machine, it is easiest to use an existing box. There are many boxes up at HashiCorp's Atlas and at Vagrantbox.es which are suitable for testing, but be sure you use a trusted provider for any box which is used in production.

Once you've chosen your box, these commands should bring it to life:

$ vagrant box add name-of-box url-of-box
$ vagrant init name-of-box
$ vagrant up
Provisioning, customizing, and accessing the virtual machine

Once the box is up and running, you can start adding software to it using Ansible. Plan to spend a lot of time learning Ansible. It is well worth the investment. You'll use Ansible to load the system software, create databases, configure the server, create users, set file ownership and permissions, set up services, and much more -- basically, to configure the virtual machine to include everything you need. Once you have the Ansible scripts set up you will be able to re-use them with different virtual machines and also run them against remote servers (which is a topic for another day!).

The easiest way to SSH into the box is:

$ vagrant ssh

You may update your /etc/hosts file to map the virtual machine's IP address to an easy-to-remember name for SSH and browser access. Once the box is running and serving pages, you can start working on the site.
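A minimal sketch of that /etc/hosts mapping; the IP address and hostname are made up for the example and should match whatever your Vagrantfile defines:

# IP address and hostname are examples only
$ echo "192.168.33.10  mysite.local" | sudo tee -a /etc/hosts

# Then reach the box by name instead of by IP
$ curl http://mysite.local/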

Development workflow

Using a virtual machine as described here doesn't significantly change normal development workflows when it comes to editing code. The host machine and virtual machine share the application files through the path configured under Vagrant. You can edit the files using your code editor of choice on the host, or make minor adjustments with a text editor on the virtual machine. You can also keep version control tools on the host machine. In other words, all of the code modifications made on the host machine automatically show up on the virtual machine.

You just get to enjoy the convenience of local development, and an environment that mimics the server(s) where your code will be deployed!

This article was originally published at Mugo Web . Republished with permission.

[Sep 16, 2019] An introduction to Virtual Machine Manager Opensource.com

Sep 16, 2019 | opensource.com

When installing a VM to run as a server, I usually disable or remove its sound capability. To do so, select Sound and click Remove or right-click on Sound and choose Remove Hardware .

You can also add hardware with the Add Hardware button at the bottom. This brings up the Add New Virtual Hardware screen where you can add additional storage devices, memory, sound, etc. It's like having access to a very well-stocked (if virtual) computer hardware warehouse.


Once you are happy with your VM configuration, click Begin Installation , and the system will boot and begin installing your specified operating system from the ISO.


Once it completes, it reboots, and your new VM is ready for use.


Virtual Machine Manager is a powerful tool for desktop Linux users. It is open source and an excellent alternative to proprietary and closed virtualization products.

[Sep 13, 2019] How to use Ansible Galaxy Enable Sysadmin

Sep 13, 2019 | www.redhat.com

Ansible is a multiplier, a tool that automates and scales infrastructure of every size. It is considered to be a configuration management, orchestration, and deployment tool. It is easy to get up and running with Ansible. Even a new sysadmin could start automating with Ansible in a matter of a few hours.

Ansible automates using the SSH protocol. The control machine uses an SSH connection to communicate with its target hosts, which are typically Linux hosts. If you're a Windows sysadmin, you can still use Ansible to automate your Windows environments using WinRM as opposed to SSH. Presently, though, the control machine still needs to run Linux.

As a new sysadmin, you might start with just a few playbooks. But as your automation skills continue to grow, and you become more familiar with Ansible, you will learn best practices and further realize that as your playbooks increase, using Ansible Galaxy becomes invaluable.

In this article, you will learn a bit about Ansible Galaxy, its structure, and how and when you can put it to use.

What Ansible does

Common sysadmin tasks that can be performed with Ansible include patching, updating systems, user and group management, and provisioning. Ansible presently has a huge footprint in IT automation -- if not the largest -- and is considered to be the most popular and widely used configuration management, orchestration, and deployment tool available today.

One of the main reasons for its popularity is its simplicity. It's simple, powerful, and agentless, which means a new or entry-level sysadmin can hit the ground automating in a matter of hours. Ansible allows you to scale quickly, efficiently, and cross-functionally.

Create roles with Ansible Galaxy

Ansible Galaxy is essentially a large public repository of Ansible roles. Roles ship with READMEs detailing the role's use and available variables. Galaxy contains a large number of roles that are constantly evolving and increasing.

Galaxy can use git to add other role sources, such as GitHub. You can initialize a new galaxy role using ansible-galaxy init , or you can install a role directly from the Ansible Galaxy role store by executing the command ansible-galaxy install <name of role> .

Here are some helpful ansible-galaxy commands you might use from time to time:

  • ansible-galaxy list displays a list of installed roles, with version numbers.
  • ansible-galaxy remove <role> removes an installed role.
  • ansible-galaxy info provides a variety of information about Ansible Galaxy.
  • ansible-galaxy init can be used to create a role template suitable for submission to Ansible Galaxy.

To create an Ansible role using Ansible Galaxy, we need to use the ansible-galaxy command and its templates. Roles must be downloaded before they can be used in playbooks, and they are placed into the default directory /etc/ansible/roles. You can find role examples at https://galaxy.ansible.com/geerlingguy.
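A minimal sketch of installing and listing a role from Galaxy; the role name below is just a well-known example from that author, and the paths assume the defaults:

# Install a role from Ansible Galaxy (role name is an example)
ansible-galaxy install geerlingguy.nginx

# Install into a project-local roles directory instead of the default path
ansible-galaxy install geerlingguy.nginx -p ./roles

# Show installed roles and their versions
ansible-galaxy list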

Create collections

While Ansible Galaxy has been the go-to tool for constructing and managing roles, with new iterations of Ansible you are bound to see changes or additions. In Ansible version 2.8 you get the new feature of collections.

What are collections and why are they worth mentioning? As the Ansible documentation states:

Collections are a distribution format for Ansible content. They can be used to package and distribute playbooks, roles, modules, and plugins.

Collections follow a simple structure:

collection/
├── docs/
├── galaxy.yml
├── plugins/
│ ├──  modules/
│ │ └──  module1.py
│ ├──  inventory/
│ └──  .../
├── README.md
├── roles/
│ ├──  role1/
│ ├──  role2/
│ └──  .../
├── playbooks/
│ ├──  files/
│ ├──  vars/
│ ├──  templates/
│ └──  tasks/
└──  tests/
Creating a collection skeleton.

The ansible-galaxy collection command implements the following subcommands; notably, a few of them are the same as those used with ansible-galaxy for roles. A minimal sketch of their use follows the list:

  • init creates a basic collection skeleton based on the default template included with Ansible, or your own template.
  • build creates a collection artifact that can be uploaded to Galaxy, or your own repository.
  • publish publishes a built collection artifact to Galaxy.
  • install installs one or more collections.
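The namespace, collection, and version names in this sketch are placeholders, not real published collections:

# Create a new collection skeleton (names are placeholders)
ansible-galaxy collection init my_namespace.my_collection

# Build a distributable artifact from inside the collection directory
cd my_namespace/my_collection
ansible-galaxy collection build

# Install a collection, either by name from Galaxy or from a local tarball
ansible-galaxy collection install my_namespace.my_collection
ansible-galaxy collection install my_namespace-my_collection-1.0.0.tar.gz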

In order to determine what can go into a collection, the Ansible documentation on collections is a great resource.

Conclusion

Establish yourself as a stellar sysadmin with an automation solution that is simple, powerful, agentless, and scales your infrastructure quickly and efficiently. Using Ansible Galaxy to create roles is superb thinking, and an ideal way to be organized and thoughtful in managing your ever-growing playbooks.

The only way to improve your automation skills is to work with a dedicated tool and prove the value and positive impact of automation on your infrastructure.

[Sep 06, 2019] Python vs. Ruby Which is best for web development Opensource.com

Sep 06, 2019 | opensource.com

Python was developed organically in the scientific space as a prototyping language that easily could be translated into C++ if a prototype worked. This happened long before it was first used for web development. Ruby, on the other hand, became a major player specifically because of web development; the Rails framework extended Ruby's popularity with people developing complex websites.

Which programming language best suits your needs? Here is a quick overview of each language to help you choose:

Approach: one best way vs. human-language Python

Python takes a direct approach to programming. Its main goal is to make everything obvious to the programmer. In Python, there is only one "best" way to do something. This philosophy has led to a language strict in layout.

Python's core philosophy consists of three key hierarchical principles:

  • Explicit is better than implicit
  • Simple is better than complex
  • Complex is better than complicated

This regimented philosophy results in Python being eminently readable and easy to learn -- and why Python is great for beginning coders. Python has a big foothold in introductory programming courses . Its syntax is very simple, with little to remember. Because its code structure is explicit, the developer can easily tell where everything comes from, making it relatively easy to debug.

Python's hierarchy of principles is evident in many aspects of the language. Its use of whitespace for flow control as a core part of the language syntax differs from most other languages, including Ruby. The way you indent code determines the meaning of its action. This use of whitespace is a prime example of Python's "explicit" philosophy: the shape a Python app takes spells out its logic and how the app will act.

Ruby

In contrast to Python, Ruby focuses on "human-language" programming, and its code reads like a verbal language rather than a machine-based one, which many programmers, both beginners and experts, like. Ruby follows the principle of " least astonishment ," and offers myriad ways to do the same thing. These similar methods can have multiple names, which many developers find confusing and frustrating.

Unlike Python, Ruby makes use of "blocks," a first-class object that is treated as a unit within a program. In fact, Ruby takes the concept of OOP (Object-Oriented Programming) to its limit. Everything is an object -- even global variables are actually represented within the ObjectSpace object. Classes and modules are themselves objects, and functions and operators are methods of objects. This ability makes Ruby especially powerful, especially when combined with its other primary strength: functional programming and the use of lambdas.

In addition to blocks and functional programming, Ruby provides programmers with many other features, including fragmentation, hashable and unhashable types, and mutable strings.

Ruby's fans find its elegance to be one of its top selling points. At the same time, Ruby's "magical" features and flexibility can make it very hard to track down bugs.

Communities: stability vs. innovation

Although features and coding philosophy are the primary drivers for choosing a given language, the strength of a developer community also plays an important role. Fortunately, both Python and Ruby boast strong communities.

Python

Python's community already includes a large Linux and academic community and therefore offers many academic use cases in both math and science. That support gives the community a stability and diversity that only grows as Python increasingly is used for web development.

Ruby

However, Ruby's community has focused primarily on web development from the get-go. It tends to innovate more quickly than the Python community, but this innovation also causes more things to break. In addition, while it has gotten more diverse, it has yet to reach the level of diversity that Python has.

Final thoughts

For web development, Ruby has Rails and Python has Django. Both are powerful frameworks, so when it comes to web development, you can't go wrong with either language. Your decision will ultimately come down to your level of experience and your philosophical preferences.

If you plan to focus on building web applications, Ruby is popular and flexible. There is a very strong community built upon it and they are always on the bleeding edge of development.

If you are interested in building web applications and would like to learn a language that's used more generally, try Python. You'll get a diverse community and lots of influence and support from the various industries in which it is used.

Tom Radcliffe - Tom Radcliffe has over 20 years experience in software development and management in both academia and industry. He is a professional engineer (PEO and APEGBC) and holds a PhD in physics from Queen's University at Kingston. Tom brings a passion for quantitative, data-driven processes to ActiveState .

[Aug 31, 2019] Linux on your laptop A closer look at EFI boot options

Aug 31, 2019 | www.zdnet.com
Before EFI, the standard boot process for virtually all PC systems was called "MBR", for Master Boot Record; today you are likely to hear it referred to as "Legacy Boot". This process depended on using the first physical block on a disk to hold some information needed to boot the computer (thus the name Master Boot Record); specifically, it held the disk address at which the actual bootloader could be found, and the partition table that defined the layout of the disk. Using this information, the PC firmware could find and execute the bootloader, which would then bring up the computer and run the operating system.

This system had a number of rather obvious weaknesses and shortcomings. One of the biggest was that you could only have one bootable object on each physical disk drive (at least as far as the firmware boot was concerned). Another was that if that first sector on the disk became corrupted somehow, you were in deep trouble.

Over time, as part of the Extensible Firmware Interface, a new approach to boot configuration was developed. Rather than storing critical boot configuration information in a single "magic" location, EFI uses a dedicated "EFI boot partition" on the disk. This is a completely normal, standard disk partition, the same as those which may be used to hold the operating system or system recovery data.

The only requirement is that it be FAT formatted, and it should have the boot and esp partition flags set (esp stands for EFI System Partition). The specific data and programs necessary for booting is then kept in directories on this partition, typically in directories named to indicate what they are for. So if you have a Windows system, you would typically find directories called 'Boot' and 'Microsoft' , and perhaps one named for the manufacturer of the hardware, such as HP. If you have a Linux system, you would find directories called opensuse, debian, ubuntu, or any number of others depending on what particular Linux distribution you are using.

It should be obvious from the description so far that it is perfectly possible with the EFI boot configuration to have multiple boot objects on a single disk drive.

Before going any further, I should make it clear that if you install Linux as the only operating system on a PC, it is not necessary to know all of this configuration information in detail. The installer should take care of setting all of this up, including creating the EFI boot partition (or using an existing EFI boot partition), and further configuring the system boot list so that whatever system you install becomes the default boot target.

If you were to take a brand new computer with UEFI firmware, and load it from scratch with any of the current major Linux distributions, it would all be set up, configured, and working just as it is when you purchase a new computer preloaded with Windows (or when you load a computer from scratch with Windows). It is only when you want to have more than one bootable operating system – especially when you want to have both Linux and Windows on the same computer – that things may become more complicated.

The problems that arise with such "multiboot" systems are generally related to getting the boot priority list defined correctly.

When you buy a new computer with Windows, this list typically includes the Windows bootloader on the primary disk, and then perhaps some other peripheral devices such as USB, network interfaces and such. When you install Linux alongside Windows on such a computer, the installer will add the necessary information to the EFI boot partition, but if the boot priority list is not changed, then when the system is rebooted after installation it will simply boot Windows again, and you are likely to think that the installation didn't work.

There are several ways to modify this boot priority list, but exactly which ones are available and whether or how they work depends on the firmware of the system you are using, and this is where things can get really messy. There are just about as many different UEFI firmware implementations as there are PC manufacturers, and the manufacturers have shown a great deal of creativity in the details of this firmware.

First, in the simplest case, there is a software utility included with Linux called efibootmgr that can be used to modify, add or delete the boot priority list. If this utility works properly, and the changes it makes are permanent on the system, then you would have no other problems to deal with, and after installing it would boot Linux and you would be happy. Unfortunately, while this is sometimes the case it is frequently not. The most common reason for this is that changes made by software utilities are not actually permanently stored by the system BIOS, so when the computer is rebooted the boot priority list is restored to whatever it was before, which generally means that Windows gets booted again.
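If efibootmgr does work on your hardware, the basic usage is short enough to sketch here; the boot entry numbers are examples and will differ on every machine:

# List the current boot entries and the boot order (requires root)
sudo efibootmgr -v

# Put the Linux entry first in the boot order; 0002 and 0000 are example entry numbers
sudo efibootmgr -o 0002,0000

# Or boot a specific entry only on the next restart, leaving the stored order unchanged
sudo efibootmgr -n 0002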

The other common way of modifying the boot priority list is via the computer BIOS configuration program. The details of how to do this are different for every manufacturer, but the general procedure is approximately the same. First you have to press the BIOS configuration key (usually F2, but not always, unfortunately) during system power-on (POST). Then choose the Boot item from the BIOS configuration menu, which should get you to a list of boot targets presented in priority order. Then you need to modify that list; sometimes this can be done directly in that screen, via the usual F5/F6 up/down key process, and sometimes you need to proceed one level deeper to be able to do that. I wish I could give more specific and detailed information about this, but it really is different on every system (sometimes even on different systems produced by the same manufacturer), so you just need to proceed carefully and figure out the steps as you go.

I have seen a few rare cases of systems where neither of these methods works, or at least they don't seem to be permanent, and the system keeps reverting to booting Windows. Again, there are two ways to proceed in this case. The first is by simply pressing the "boot selection" key during POST (power-on). Exactly which key this is varies, I have seen it be F12, F9, Esc, and probably one or two others. Whichever key it turns out to be, when you hit it during POST you should get a list of bootable objects defined in the EFI boot priority list, so assuming your Linux installation worked you should see it listed there. I have known of people who were satisfied with this solution, and would just use the computer this way and have to press boot select each time they wanted to boot Linux.

The alternative is to actually modify the files in the EFI boot partition, so that the (unchangeable) Windows boot procedure would actually boot Linux. This involves overwriting the Windows file bootmgfw.efi with the Linux file grubx64.efi. I have done this, especially in the early days of EFI boot, and it works, but I strongly advise you to be extremely careful if you try it, and make sure that you keep a copy of the original bootmgfw.efi file. Finally, just as a final (depressing) warning, I have also seen systems where this seemed to work, at least for a while, but then at some unpredictable point the boot process seemed to notice that something had changed and it restored bootmgfw.efi to its original state – thus losing the Linux boot configuration again. Sigh.
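For the record, the file swap described above amounts to something like the following sketch. The paths follow the usual ESP layout, but the distribution directory name (opensuse here) and the ESP device are assumptions you must verify on your own system before copying anything:

# Mount the EFI System Partition if it is not already mounted (device name is an example)
sudo mount /dev/sda1 /boot/efi

# Keep a copy of the original Windows bootloader before overwriting anything
sudo cp /boot/efi/EFI/Microsoft/Boot/bootmgfw.efi \
        /boot/efi/EFI/Microsoft/Boot/bootmgfw.efi.orig

# Put GRUB where the firmware expects the Windows bootloader to be
sudo cp /boot/efi/EFI/opensuse/grubx64.efi \
        /boot/efi/EFI/Microsoft/Boot/bootmgfw.efi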

So, that's the basics of EFI boot, and how it can be configured. But there are some important variations possible, and some caveats to be aware of.

[Aug 31, 2019] Slashdot Asks What Are Some Programming Books You Wish You Had Read Earlier - Slashdot

Aug 31, 2019 | developers.slashdot.org

raymorris ( 2726007 ) , Friday February 22, 2019 @02:05PM ( #58164784 ) Journal

It's the exact opposite of systemd ( Score: 5 , Insightful)

Most of the backlash against systemd isn't because it's *bad* per se, but because systemd is in so many ways the opposite of the Unix philosophy.

Windows and Unix have very different approaches. Windows has MS Office and Word, a multi-gigabyte word processor with literally thousands of functions. Unix has sed, awk, grep, sort, and cut, each a few kilobytes at most, each doing one small job. In Unix, complex jobs are done by piping together small, simple pieces.

Unix manages complexity by building on top of simplicity. Windows manages complexity by hiding it under a veneer, putting the complex stuff at the base and trying to build simplicity on top of complexity. Each approach has its own strengths. The first -- building complex systems by stacking simple layers on top of simplicity -- is very much the Unix way. Systemd is very much the Windows way of having a bunch of complexity underneath and then throwing a UI on top that is supposed to make it appear simple.
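A small illustration of that piping style, with each tool doing one job; the log file name and the field position are made up for the example:

# Top ten URLs returning 404 in a web server log: grep filters, awk cuts one field,
# sort/uniq count the duplicates, sort ranks them, head trims the list.
grep ' 404 ' access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -10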

Anonymous Coward , Friday February 22, 2019 @02:39PM ( #58165034 )
Re:It's the exact opposite of systemd ( Score: 1 )

GNU's Not UNIX...

[Aug 28, 2019] How to navigate Ansible documentation Enable Sysadmin

Aug 28, 2019 | www.redhat.com

We take our first glimpse at the Ansible documentation on the official website. While Ansible can be overwhelming with so many immediate options, let's break down what is presented to us here. Putting our attention on the page's main pane, we are given five offerings from Ansible. This pane is a central location, or one-stop shop, to maneuver through the documentation for products like Ansible Tower, Ansible Galaxy, and Ansible Lint.

We can even dive into Ansible Network for specific module documentation that extends the power and ease of Ansible automation to network administrators. The focal point of the rest of this article will be around Ansible Project, to give us a great starting point into our automation journey.

Once we click the Ansible Documentation tile under the Ansible Project section, the first action we should take is to ensure we are viewing the documentation's correct version. We can get our current version of Ansible from our control node's command line by running ansible --version. Armed with the version information provided by the output, we can select the matching version in the site's upper-left-hand corner using the drop-down menu, which by default says "latest".

[Aug 18, 2019] Oracle Linux 7 Update 7 Released

Oracle Linux 7 Update 7 features many of the same changes as RHEL 7.7, but also adds an updated Unbreakable Enterprise Kernel Release 5 based on Linux 4.14.35, with many extra patches compared to RHEL 7's default Linux 3.10-based kernel.
Oracle Linux can be used for free, with paid support available if needed, so it is a more flexible option than CentOS, at least for companies that have no money to buy RHEL and are reluctant to run CentOS because some of their applications are not supported on it.
Aug 15, 2019 | blogs.oracle.com

Oracle Linux 7 Update 7 ships with the following kernel packages, that include bug fixes, security fixes and enhancements:
•Unbreakable Enterprise Kernel (UEK) Release 5 (kernel-uek-4.14.35-1902.3.2.el7) for x86-64 and aarch64
•Red Hat Compatible Kernel (RHCK) (kernel-3.10.0-1062.el7) for x86-64 only

Oracle Linux maintains user space compatibility with Red Hat Enterprise Linux (RHEL), which is independent of the kernel version that underlies the operating system. Existing applications in user space will continue to run unmodified on Oracle Linux 7 Update 7 with UEK Release 5 and no re-certifications are needed for applications already certified with Red Hat Enterprise Linux 7 or Oracle Linux 7.

[Aug 04, 2019] 10 YAML tips for people who hate YAML Enable SysAdmin

Aug 04, 2019 | www.redhat.com

10 YAML tips for people who hate YAML
Do you hate YAML? These tips might ease your pain.

Posted June 10, 2019 | by Seth Kenlon (Red Hat)

There are lots of formats for configuration files: a list of values, key and value pairs, INI files, YAML, JSON, XML, and many more. Of these, YAML sometimes gets cited as a particularly difficult one to handle for a few different reasons. While its ability to reflect hierarchical values is significant and its minimalism can be refreshing to some, its Python-like reliance upon syntactic whitespace can be frustrating.

However, the open source world is diverse and flexible enough that no one has to suffer through abrasive technology, so if you hate YAML, here are 10 things you can (and should!) do to make it tolerable. Starting with zero, as any sensible index should.

0. Make your editor do the work

Whatever text editor you use probably has plugins to make dealing with syntax easier. If you're not using a YAML plugin for your editor, find one and install it. The effort you spend on finding a plugin and configuring it as needed will pay off tenfold the very next time you edit YAML.

For example, the Atom editor comes with a YAML mode by default, and while GNU Emacs ships with minimal support, you can add additional packages like yaml-mode to help.

Emacs in YAML and whitespace mode.

If your favorite text editor lacks a YAML mode, you can address some of your grievances with small configuration changes. For instance, the default text editor for the GNOME desktop, Gedit, doesn't have a YAML mode available, but it does provide YAML syntax highlighting by default and features configurable tab width:

Configuring tab width and type in Gedit.

With the drawspaces Gedit plugin package, you can make white space visible in the form of leading dots, removing any question about levels of indentation.

Take some time to research your favorite text editor. Find out what the editor, or its community, does to make YAML easier, and leverage those features in your work. You won't be sorry.

1. Use a linter

Ideally, programming languages and markup languages use predictable syntax. Computers tend to do well with predictability, so the concept of a linter was invented in 1978. If you're not using a linter for YAML, then it's time to adopt this 40-year-old tradition and use yamllint .

You can install yamllint on Linux using your distribution's package manager. For instance, on Red Hat Enterprise Linux 8 or Fedora :

$ sudo dnf install yamllint

Invoking yamllint is as simple as telling it to check a file. Here's an example of yamllint 's response to a YAML file containing an error:

$ yamllint errorprone.yaml
errorprone.yaml
23:10     error    syntax error: mapping values are not allowed here
23:11     error    trailing spaces  (trailing-spaces)

That's not a time stamp on the left. It's the error's line and column number. You may or may not understand what error it's talking about, but now you know the error's location. Taking a second look at the location often makes the error's nature obvious. Success is eerily silent, so if you want feedback when the lint succeeds, you can add a conditional second command with a double ampersand ( && ). In a POSIX shell, the command after && runs only if the preceding command exits with status 0, so on success your echo command makes that clear. This tactic is somewhat superficial, but some users prefer the assurance that the command did run correctly, rather than succeeding silently. Here's an example:

$ yamllint perfect.yaml && echo "OK"
OK

The reason yamllint is so silent when it succeeds is that it prints nothing and exits with status 0 when it finds no errors.

2. Write in Python, not YAML

If you really hate YAML, stop writing in YAML, at least in the literal sense. You might be stuck with YAML because that's the only format an application accepts, but if the only requirement is to end up in YAML, then work in something else and then convert. Python, along with the excellent pyyaml library, makes this easy, and you have two methods to choose from: self-conversion or scripted.

Self-conversion

In the self-conversion method, your data files are also Python scripts that produce YAML. This works best for small data sets. Just write your JSON data into a Python variable, prepend an import statement, and end the file with a simple three-line output statement.

#!/usr/bin/python3	
import yaml 

d={
"glossary": {
  "title": "example glossary",
  "GlossDiv": {
	"title": "S",
	"GlossList": {
	  "GlossEntry": {
		"ID": "SGML",
		"SortAs": "SGML",
		"GlossTerm": "Standard Generalized Markup Language",
		"Acronym": "SGML",
		"Abbrev": "ISO 8879:1986",
		"GlossDef": {
		  "para": "A meta-markup language, used to create markup languages such as DocBook.",
		  "GlossSeeAlso": ["GML", "XML"]
		  },
		"GlossSee": "markup"
		}
	  }
	}
  }
}

f=open('output.yaml','w')
f.write(yaml.dump(d))
f.close()

Run the file with Python (the example below assumes it was saved as example.json, even though it is really a Python script) to produce a file called output.yaml:

$ python3 ./example.json
$ cat output.yaml
glossary:
  GlossDiv:
	GlossList:
	  GlossEntry:
		Abbrev: ISO 8879:1986
		Acronym: SGML
		GlossDef:
		  GlossSeeAlso: [GML, XML]
		  para: A meta-markup language, used to create markup languages such as DocBook.
		GlossSee: markup
		GlossTerm: Standard Generalized Markup Language
		ID: SGML
		SortAs: SGML
	title: S
  title: example glossary

This output is perfectly valid YAML, although yamllint does issue a warning that the file is not prefaced with --- , which is something you can adjust either in the Python script or manually.
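
For instance, a minimal sketch of the Python-script route, assuming the self-conversion example above: explicit_start and default_flow_style are standard PyYAML dump options.

#!/usr/bin/python3
# Variation of the self-conversion script above: explicit_start=True asks
# PyYAML to emit the leading "---" document marker, which silences the
# yamllint warning; default_flow_style=False keeps the block-style output.
import yaml

d = {"title": "example glossary"}   # stand-in for the full data structure above

with open('output.yaml', 'w') as f:
    f.write(yaml.dump(d, explicit_start=True, default_flow_style=False))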

Scripted conversion

In this method, you write in JSON and then run a Python conversion script to produce YAML. This scales better than self-conversion, because it keeps the converter separate from the data.

Create a JSON file and save it as example.json . Here is an example from json.org :

{
	"glossary": {
	  "title": "example glossary",
	  "GlossDiv": {
		"title": "S",
		"GlossList": {
		  "GlossEntry": {
			"ID": "SGML",
			"SortAs": "SGML",
			"GlossTerm": "Standard Generalized Markup Language",
			"Acronym": "SGML",
			"Abbrev": "ISO 8879:1986",
			"GlossDef": {
			  "para": "A meta-markup language, used to create markup languages such as DocBook.",
			  "GlossSeeAlso": ["GML", "XML"]
			  },
			"GlossSee": "markup"
			}
		  }
		}
	  }
	}

Create a simple converter and save it as json2yaml.py . This script imports both the YAML and JSON Python modules, loads a JSON file defined by the user, performs the conversion, and then writes the data to output.yaml .

#!/usr/bin/python3
import yaml
import sys
import json

OUT=open('output.yaml','w')
IN=open(sys.argv[1], 'r')

JSON = json.load(IN)
IN.close()
yaml.dump(JSON, OUT)
OUT.close()

Save this script in your system path, and execute as needed:

$ ~/bin/json2yaml.py example.json
3. Parse early, parse often

Sometimes it helps to look at a problem from a different angle. If your problem is YAML, and you're having a difficult time visualizing the data's relationships, you might find it useful to restructure that data, temporarily, into something you're more familiar with.

If you're more comfortable with dictionary-style lists or JSON, for instance, you can convert YAML to JSON in a few lines at an interactive Python shell. Assume your YAML file is called mydata.yaml .

$ python3
>>> import yaml
>>> f=open('mydata.yaml','r')
>>> yaml.load(f)
{'document': 34843, 'date': datetime.date(2019, 5, 23), 'bill-to': {'given': 'Seth', 'family': 'Kenlon', 'address': {'street': '51b Mornington Road\n', 'city': 'Brooklyn', 'state': 'Wellington', 'postal': 6021, 'country': 'NZ'}}, 'words': 938, 'comments': 'Good article. Could be better.'}
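
To finish the round trip into JSON (the session above stops at the Python dictionary), one more standard-library call is enough. This is a sketch that reuses the assumed mydata.yaml file; default=str is only there because the sample data contains a date, which the json module cannot serialize by itself.

>>> import yaml, json
>>> with open('mydata.yaml') as f:
...     data = yaml.safe_load(f)
...
>>> print(json.dumps(data, indent=2, default=str))

Using yaml.safe_load instead of yaml.load avoids executing arbitrary YAML tags, which is generally the safer default.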

There are many other examples, and there are plenty of online converters and local parsers, so don't hesitate to reformat data when it starts to look more like a laundry list than markup.

4. Read the spec

After I've been away from YAML for a while and find myself using it again, I go straight back to yaml.org to re-read the spec. If you've never read the specification for YAML and you find YAML confusing, a glance at the spec may provide the clarification you never knew you needed. The specification is surprisingly easy to read, with the requirements for valid YAML spelled out with lots of examples in chapter 6 .

5. Pseudo-config

Before I started writing my book, Developing Games on the Raspberry Pi (Apress, 2019), the publisher asked me for an outline. You'd think an outline would be easy. By definition, it's just the titles of chapters and sections, with no real content. And yet, out of the 300 pages published, the hardest part to write was that initial outline.

YAML can be the same way. You may have a notion of the data you need to record, but that doesn't mean you fully understand how it's all related. So before you sit down to write YAML, try doing a pseudo-config instead.

A pseudo-config is like pseudo-code. You don't have to worry about structure or indentation, parent-child relationships, inheritance, or nesting. You just create iterations of data in the way you currently understand it inside your head.

A pseudo-config.

Once you've got your pseudo-config down on paper, study it, and transform your results into valid YAML.

6. Resolve the spaces vs. tabs debate

OK, maybe you won't definitively resolve the spaces-vs-tabs debate , but you should at least resolve the debate within your project or organization. Whether you resolve this question with a post-process sed script, text editor configuration, or a blood-oath to respect your linter's results, anyone in your team who touches a YAML project must agree to use spaces (in accordance with the YAML spec).

Any good text editor allows you to define a number of spaces instead of a tab character, so the choice shouldn't negatively affect fans of the Tab key.

Tabs and spaces are, as you probably know all too well, essentially invisible. And when something is out of sight, it rarely comes to mind until the bitter end, when you've tested and eliminated all of the "obvious" problems. An hour wasted to an errant tab or group of spaces is your signal to create a policy to use one or the other, and then to develop a fail-safe check for compliance (such as a Git hook to enforce linting).
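
As a sketch of that last idea, here is a hypothetical .git/hooks/pre-commit written in Python; it assumes git and yamllint are installed and on the PATH, and it aborts the commit if any staged YAML file fails linting.

#!/usr/bin/python3
# Hypothetical pre-commit hook: lint staged YAML files before allowing a commit.
import subprocess
import sys

# Files staged for this commit (added, copied, or modified).
staged = subprocess.run(
    ['git', 'diff', '--cached', '--name-only', '--diff-filter=ACM'],
    capture_output=True, text=True, check=True
).stdout.split()

yaml_files = [name for name in staged if name.endswith(('.yml', '.yaml'))]
if not yaml_files:
    sys.exit(0)  # nothing to lint, allow the commit

# yamllint exits non-zero on errors; returning that code aborts the commit.
result = subprocess.run(['yamllint'] + yaml_files)
sys.exit(result.returncode)

Remember to make the hook executable (chmod +x .git/hooks/pre-commit), or git will ignore it.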

7. Less is more (or more is less)

Some people like to write YAML to emphasize its structure. They indent vigorously to help themselves visualize chunks of data. It's a sort of cheat to mimic markup languages that have explicit delimiters.

Here's a good example from Ansible's documentation :

# Employee records
-  martin:
        name: Martin D'vloper
        job: Developer
        skills:
            - python
            - perl
            - pascal
-  tabitha:
        name: Tabitha Bitumen
        job: Developer
        skills:
            - lisp
            - fortran
            - erlang

For some users, this approach is a helpful way to lay out a YAML document, while others find that the seemingly gratuitous white space obscures the structure rather than revealing it.

If you own and maintain a YAML document, then you get to define what "indentation" means. If blocks of horizontal white space distract you, then use the minimal amount of white space required by the YAML spec. For example, the same YAML from the Ansible documentation can be represented with fewer indents without losing any of its validity or meaning:

---
- martin:
   name: Martin D'vloper
   job: Developer
   skills:
   - python
   - perl
   - pascal
- tabitha:
   name: Tabitha Bitumen
   job: Developer
   skills:
   - lisp
   - fortran
   - erlang
8. Make a recipe

I'm a big fan of repetition breeding familiarity, but sometimes repetition just breeds repeated stupid mistakes. Luckily, a clever peasant woman experienced this very phenomenon back in 396 AD (don't fact-check me), and invented the concept of the recipe .

If you find yourself making YAML document mistakes over and over, you can embed a recipe or template in the YAML file as a commented section. When you're adding a section, copy the commented recipe and overwrite the dummy data with your new real data. For example:

---
# - <common name>:
#   name: Given Surname
#   job: JOB
#   skills:
#   - LANG
- martin:
  name: Martin D'vloper
  job: Developer
  skills:
  - python
  - perl
  - pascal
- tabitha:
  name: Tabitha Bitumen
  job: Developer
  skills:
  - lisp
  - fortran
  - erlang
9. Use something else

I'm a fan of YAML, generally, but sometimes YAML isn't the answer. If you're not locked into YAML by the application you're using, then you might be better served by some other configuration format. Sometimes config files outgrow themselves and are better refactored into simple Lua or Python scripts.
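
As an illustration of that refactoring (the names here are invented, not from the article), the configuration simply becomes an ordinary Python module that the application imports, so comments, computed values, and basic validation come for free:

# config.py: configuration as plain Python instead of YAML (illustrative names)
import os

# A value that can be overridden from the environment, something static YAML cannot express.
LISTEN_PORT = int(os.environ.get('APP_PORT', '8080'))
WORKERS = 4
SKILLS = ['python', 'perl', 'pascal']

The application then reads it with a plain "import config" instead of a YAML parser.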

YAML is a great tool and is popular among users for its minimalism and simplicity, but it's not the only tool in your kit. Sometimes it's best to part ways. One of the benefits of YAML is that parsing libraries are common, so as long as you provide migration options, your users should be able to adapt painlessly.

If YAML is a requirement, though, keep these tips in mind and conquer your YAML hatred once and for all!

[Aug 04, 2019] Ansible IT automation for everybody Enable SysAdmin

Aug 04, 2019 | www.redhat.com


Ansible: IT automation for everybody
Kick the tires with Ansible and start automating with these simple tasks.

Posted July 31, 2019 | by Jörg Kastning


Ansible is an open source tool for software provisioning, application deployment, orchestration, configuration, and administration. Its purpose is to help you automate your configuration processes and simplify the administration of multiple systems. Thus, Ansible essentially pursues the same goals as Puppet, Chef, or Saltstack.

What I like about Ansible is that it's flexible, lean, and easy to start with. In most use cases, it keeps the job simple.

I chose to use Ansible back in 2016 because no agent has to be installed on the managed nodes -- a node is what Ansible calls a managed remote system. All you need to start managing a remote system with Ansible is SSH access to the system, and Python installed on it. Python is preinstalled on most Linux systems, and I was already used to managing my hosts via SSH, so I was ready to start right away. And if the day comes where I decide not to use Ansible anymore, I just have to delete my Ansible controller machine (control node) and I'm good to go. There are no agents left on the managed nodes that have to be removed.

Ansible offers two ways to control your nodes. The first one uses playbooks . These are simple ASCII files written in YAML (YAML Ain't Markup Language) , which is easy to read and write. And second, there are the ad-hoc commands , which allow you to run a command or module without having to create a playbook first.

You organize the hosts you would like to manage and control in an inventory file, which offers flexible format options. For example, this could be an INI-like file that looks like:

mail.example.com

[webservers]
foo.example.com
bar.example.com

[dbservers]
one.example.com
two.example.com
three.example.com

[site1:children]
webservers
dbservers
Examples

I would like to give you two small examples of how to use Ansible. I started with these really simple tasks before I used Ansible to take control of more complex tasks in my infrastructure.

Ad-hoc: Check if Ansible can remote manage a system

As you might recall from the beginning of this article, all you need to manage a remote host is SSH access to it, and a working Python interpreter on it. To check if these requirements are fulfilled, run the following ad-hoc command against a host from your inventory:

[jkastning@ansible]$ ansible mail.example.com -m ping
mail.example.com | SUCCESS => {
    "changed": false, 
    "ping": "pong"
}
Playbook: Keep installed packages up to date

This example shows how to use a playbook to keep installed packages up to date. The playbook is an ASCII text file which looks like this:

---
# Make sure all packages are up to date
- name: Update your system
  hosts: mail.example.com
  tasks:
  - name: Make sure all packages are up to date
    yum:
      name: "*"
      state: latest

Now, we are ready to run the playbook:

[jkastning@ansible]$ ansible-playbook yum_update.yml 

PLAY [Update your system] **************************************************************************

TASK [Gathering Facts] *****************************************************************************
ok: [mail.example.com]

TASK [Make sure all packages are up to date] *******************************************************
ok: [mail.example.com]

PLAY RECAP *****************************************************************************************
mail.example.com : ok=2    changed=0    unreachable=0    failed=0

Here everything is ok and there is nothing else to do. All installed packages are already the latest version.

It's simple: Try and use it

The examples above are quite simple and should only give you a first impression. But, from the start, it did not take me long to use Ansible for more complex tasks like the Poor Man's RHEL Mirror or the Ansible Role for RHEL Patchmanagment .

Today, Ansible saves me a lot of time and supports my day-to-day work tasks quite well. So what are you waiting for? Try it, use it, and feel a bit more comfortable at work.

Jörg Kastning has been a sysadmin for over ten years. He is a member of the Red Hat Accelerators and runs his own blog at https://www.my-it-brain.de.


[Aug 03, 2019] Creating Bootable Linux USB Drive with Etcher

Aug 03, 2019 | linuxize.com

There are several free applications available that allow you to flash ISO images to USB drives. In this example, we will use Etcher. It is a free and open-source utility for flashing images to SD cards and USB drives and supports Windows, macOS, and Linux.

Head over to the Etcher downloads page , and download the most recent Etcher version for your operating system. Once the file is downloaded, double-click on it and follow the installation wizard.

Creating a bootable Linux USB drive using Etcher is a relatively straightforward process; just follow the steps outlined below:

  1. Connect the USB flash drive to your system and launch Etcher.
  2. Click on the Select image button and locate the distribution's .iso file.
  3. If only one USB drive is attached to your machine, Etcher will select it automatically. Otherwise, if more than one SD card or USB drive is connected, make sure you select the correct drive before flashing the image.

[Jul 30, 2019] FreeDOS turns 25 years old by Jim Hall

Jul 28, 2019 | opensource.com

FreeDOS turns 25 years old: An origin story
The operating system's history is a great example of the open source software model: developers working together to create something.


This June, FreeDOS turned 25 years old.

That's a major milestone for any open source software project, and I'm proud of the work that we've done on it over the past quarter century. I'm also proud of how we built FreeDOS because it is a great example of how the open source software model works.

For its time, MS-DOS was a powerful operating system. I'd used DOS for years, ever since my parents replaced our aging Apple II computer with a newer IBM machine. MS-DOS provided a flexible command line, which I quite liked and that came in handy to manipulate my files. Over the years, I learned how to write my own utilities in C to expand its command-line capabilities even further.

Around 1994, Microsoft announced that its next planned version of Windows would do away with MS-DOS. But I liked DOS. Even though I had started migrating to Linux, I still booted into MS-DOS to run applications that Linux didn't have yet.

I figured that if we wanted to keep DOS, we would need to write our own. And that's how FreeDOS was born.

On June 29, 1994, I made a small announcement about my idea to the comp.os.msdos.apps newsgroup on Usenet.

ANNOUNCEMENT OF PD-DOS PROJECT:
A few months ago, I posted articles relating to starting a public domain version of DOS. The general support for this at the time was strong, and many people agreed with the statement, "start writing!" So, I have

Announcing the first effort to produce a PD-DOS. I have written up a "manifest" describing the goals of such a project and an outline of the work, as well as a "task list" that shows exactly what needs to be written. I'll post those here, and let discussion follow.

While I announced the project as PD-DOS (for "public domain," although the abbreviation was meant to mimic IBM's "PC-DOS"), we soon changed the name to Free-DOS and later FreeDOS.

I started working on it right away. First, I shared the utilities I had written to expand the DOS command line. Many of them reproduced MS-DOS features, including CLS, DATE, DEL, FIND, HELP, and MORE. Some added new features to DOS that I borrowed from Unix, such as TEE and TRCH (a simple implementation of Unix's tr). Altogether, I contributed over a dozen FreeDOS utilities.

By sharing my utilities, I gave other developers a starting point. And by sharing my source code under the GNU General Public License (GNU GPL), I implicitly allowed others to add new features and fix bugs.

Other developers who saw FreeDOS taking shape contacted me and wanted to help. Tim Norman was one of the first; Tim volunteered to write a command shell (COMMAND.COM, later named FreeCOM). Others contributed utilities that replicated or expanded the DOS command line.

We released our first alpha version as soon as possible. Less than three months after announcing FreeDOS, we had an Alpha 1 distribution that collected our utilities. By the time we released Alpha 5, FreeDOS boasted over 60 utilities. And FreeDOS included features never imagined in MS-DOS, including internet connectivity via a PPP dial-up driver and dual-monitor support using a primary VGA monitor and a secondary Hercules Mono monitor.

New developers joined the project, and we welcomed them. By October 1998, FreeDOS had a working kernel, thanks to Pat Villani. FreeDOS also sported a host of new features that brought not just parity with MS-DOS but surpassed MS-DOS, including ANSI support and a print spooler that resembled Unix lpr.

You may be familiar with other milestones. We crept our way towards the 1.0 label, finally releasing FreeDOS 1.0 in September 2006, FreeDOS 1.1 in January 2012, and FreeDOS 1.2 in December 2016. MS-DOS stopped being a moving target long ago, so we didn't need to update as frequently after the 1.0 release.

Today, FreeDOS is a very modern DOS. We've moved beyond "classic DOS," and now FreeDOS features lots of development tools such as compilers, assemblers, and debuggers. We have lots of editors beyond the plain DOS Edit editor, including Fed, Pico, TDE, and versions of Emacs and Vi. FreeDOS supports networking and even provides a simple graphical web browser (Dillo). And we have tons of new utilities, including many that will make Linux users feel at home.

FreeDOS got where it is because developers worked together to create something. In the spirit of open source software, we contributed to each other's work by fixing bugs and adding new features. We treated our users as co-developers; we always found ways to include people, whether they were writing code or writing documentation. And we made decisions through consensus based on merit. If that sounds familiar, it's because those are the core values of open source software: transparency, collaboration, release early and often, meritocracy, and community. That's the open source way !

I encourage you to download FreeDOS 1.2 and give it a try.

[Jul 26, 2019] How to check open ports in Linux using the CLI by Vivek Gite

Jul 26, 2019 | www.cyberciti.biz

Using netstat to list open ports

Type the following netstat command
sudo netstat -tulpn | grep LISTEN

... ... ...

For example, TCP port 631 is opened by the cupsd process, and cupsd is listening only on the loopback address (127.0.0.1). Similarly, TCP port 22 is opened by the sshd process, and sshd is listening on all IP addresses for SSH connections:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name 
tcp   0      0      127.0.0.1:631           0.0.0.0:*               LISTEN      0          43385      1821/cupsd  
tcp   0      0      0.0.0.0:22              0.0.0.0:*               LISTEN      0          44064      1823/sshd

Where,

  • -t : All TCP ports
  • -u : All UDP ports
  • -l : Display listening server sockets
  • -p : Show the PID and name of the program to which each socket belongs
  • -n : Don't resolve names
  • | grep LISTEN : Only display listening sockets by filtering the output with grep
Use ss to list open ports

The ss command is used to dump socket statistics. It shows information similar to netstat but can display more TCP and state information than other tools. The syntax is:
sudo ss -tulpn

... ... ...
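
If you only need a quick yes/no answer for a single port from a script, rather than the full listing, a few lines of Python will do. This is a sketch, not part of the original article; the host and port are assumed values, and note that it tests whether a connection can be made, which is not exactly the same as listing listening sockets:

#!/usr/bin/python3
# Quick check: is anything accepting TCP connections on host:port?
import socket
import sys

host, port = '127.0.0.1', 22   # assumed values; adjust as needed

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(2)
    is_open = s.connect_ex((host, port)) == 0   # connect_ex returns 0 on success

print('{0}:{1} is {2}'.format(host, port, 'open' if is_open else 'closed'))
sys.exit(0 if is_open else 1)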

Vivek Gite is the creator of nixCraft and a seasoned sysadmin, DevOps engineer, and trainer for the Linux operating system and Unix shell scripting.

[Jul 26, 2019] The day the virtual machine manager died by Nathan Lager

"Dangerous" commands like dd should probably be always typed first in the editor and only when you verity that you did not make a blunder , executed...
A good decision was to go home and think the situation over, not to aggravate it with impulsive attempts to correct the situation, which typically only make it worse.
Lack of checking of the health of backups suggest that this guy is an arrogant sucker, despite his 20 years of sysadmin experience.
Notable quotes:
"... I started dd as root , over the top of an EXISTING DISK ON A RUNNING VM. What kind of idiot does that?! ..."
"... Since my VMs were still running, and I'd already done enough damage for one night, I stopped touching things and went home. ..."
Jul 26, 2019 | www.redhat.com

... ... ...

See, my RHEV manager was a VM running on a stand-alone Kernel-based Virtual Machine (KVM) host, separate from the cluster it manages. I had been running RHEV since version 3.0, before hosted engines were a thing, and I hadn't gone through the effort of migrating. I was already in the process of building a new set of clusters with a new manager, but this older manager was still controlling most of our production VMs. It had filled its disk again, and the underlying database had stopped itself to avoid corruption.

See, for whatever reason, we had never set up disk space monitoring on this system. It's not like it was an important box, right?

So, I logged into the KVM host that ran the VM, and started the well-known procedure of creating a new empty disk file, and then attaching it via virsh . The procedure goes something like this: Become root , use dd to write a stream of zeros to a new file, of the proper size, in the proper location, then use virsh to attach the new disk to the already running VM. Then, of course, log into the VM and do your disk expansion.

I logged in, ran sudo -i , and started my work. I ran cd /var/lib/libvirt/images , ran ls -l to find the existing disk images, and then started carefully crafting my dd command:

dd ... bs=1k count=40000000 if=/dev/zero ... of=./vmname-disk ...

Which was the next disk again? <Tab> of=vmname-disk2.img <Back arrow, Back arrow, Back arrow, Back arrow, Backspace> Don't want to dd over the existing disk, that'd be bad. Let's change that 2 to a 3 , and Enter . OH CRAP, I CHANGED THE 2 TO