Softpanorama

May the source be with you, but remember the KISS principle ;-)
Home Switchboard Unix Administration Red Hat TCP/IP Networks Neoliberalism Toxic Managers
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and  bastardization of classic Unix

HTML Pretty Printing and Beatifying

News Recommended Links  Webmaster Toolset Perl HTML Processors and Converters

Since most documents in the world are getting converted to HTML format, and in many case they are tranformed form different format using some kind of automated tool,  the HTML beautifier is immensely important.

FrontPage provides good HTML pretty printer. It is also available in its free version SharePoint Designer 2007

The other well established package with known (pretty high) quality is tidy. It is written is C and as such it is difficult to maintain, but it does its job well.

tidy is available as a package and can be installed in Cygwin, so it can be used in Windows too. Cygwin for some reason wants to install Gnome with it :-)

 


Top Visited
Switchboard
Latest
Past week
Past month

NEWS CONTENTS

Old News ;-)

[Mar 04, 2020] A command-line HTML pretty-printer Making messy HTML readable - Stack Overflow

Jan 01, 2019 | stackoverflow.com

A command-line HTML pretty-printer: Making messy HTML readable [closed] Ask Question Asked 10 years, 1 month ago Active 10 months ago Viewed 51k times


knorv ,

Closed. This question is off-topic . It is not currently accepting answers.

jonjbar ,

Have a look at the HTML Tidy Project: http://www.html-tidy.org/

The granddaddy of HTML tools, with support for modern standards.

There used to be a fork called tidy-html5 which since became the official thing. Here is its GitHub repository .

Tidy is a console application for Mac OS X, Linux, Windows, UNIX, and more. It corrects and cleans up HTML and XML documents by fixing markup errors and upgrading legacy code to modern standards.

For your needs, here is the command line to call Tidy:

tidy inputfile.html

Paul Brit ,

Update 2018: The homebrew/dupes is now deprecated, tidy-html5 may be directly installed.
brew install tidy-html5

Original reply:

Tidy from OS X doesn't support HTML5 . But there is experimental branch on Github which does.

To get it:

 brew tap homebrew/dupes
 brew install tidy --HEAD
 brew untap homebrew/dupes

That's it! Have fun!

Boris , 2019-11-16 01:27:35

Error: No available formula with the name "tidy" . brew install tidy-html5 works. – Pysis Apr 4 '17 at 13:34

Perl HTMLPrettyPrinter - Handling self-closing tags

Stack Overflow

Perl: HTML::PrettyPrinter - Handling self-closing tags

I am a newcomer to Perl (Strawberry Perl v5.12.3 on Windows 7), trying to write a script to aid me with a repetitive HTML formatting task. The files need to be hand-edited in future and I want them to be human-friendly, so after processing using the HTML package (HTML::TreeBuilder etc.), I am writing the result to a file using HTML::PrettyPrinter. All of this works well and the output from PrettyPrinter is very nice and human-readable. However, PrettyPrinter is not handling self-closing tags well; basically, it seems to be treat the slash as an HTML attribute. With input like:
<img />

PrettyPrinter returns:
<img /="/" >

Is there anything I can do to avoid this other than preprocessing with a regex to remove the backslash?

Not sure it will be helpful, but here is my setup for the pretty printing:
my $hpp = HTML::PrettyPrinter->new('linelength' => 120, 'quote_attr' => 1);
$hpp->allow_forced_nl(1);

my $output = new FileHandle ">output.html";
if (defined $output) {
$hpp->select($output);
my $linearray_ref = $hpp->format($internal);
undef $output;
$hpp->select(undef),
}

perl html-formatting

shareimprove this question

asked Jan 18 '12 at 15:18



SenatorForLife
305

add a comment

1 Answer

active oldest votes



up vote

1

down vote

accepted

You can print formatted human readable html with TreeBuilder method:
$h = HTML::TreeBuilder->new_from_content($html);
print $h->as_HTML('',"\t");

but if you still prefer this bugged prettyprinter try to remove problem tags, no idea why someone need ...
$h = HTML::TreeBuilder->new_from_content($html);
while(my $n = $h->look_down(_tag=>img,'src'=>undef)) { $n->delete }

UPD:

well... then we can fix the PrettyPrinter. It's pure perl module so lets see... No idea where on windows perl modules are for me it's /usr/local/share/perl/5.10.1/HTML/PrettyPrinter.pm

maybe not an elegant solution, but will work i hope. this sub parse attribute/value pairs, a little fix and it will add single '/' at the end

~line 756 in PrettyPrinter.pm I've marked the stings that i added with ###<<<<<< at the end
#
# format the attributes
#
sub _attributes {
my ($self, $e) = @_;
my @result = (); # list of ATTR="value" strings to return

my $self_closing = 0; ###<<<<<<
my @attrs = $e->all_external_attr(); # list (name0, val0, name1, val1, ...)

while (@attrs) {
my ($a,$v) = (shift @attrs,shift @attrs); # get current name, value pair
if($a eq '/') { ###<<<<<<
$self_closing=1; ###<<<<<<
next; ###<<<<<<
} ###<<<<<<

# string for output: 1. attribute name
my $s = $self->uppercase? "\U$a" : $a;.

# value part, skip for boolean attributes if desired
unless ($a eq lc($v) &&
$self->min_bool_attr &&.
exists($HTML::Tagset::boolean_attr{$e->tag}) &&
(ref($HTML::Tagset::boolean_attr{$e->tag}).
? $HTML::Tagset::boolean_attr{$e->tag}{$a}.
: $HTML::Tagset::boolean_attr{$e->tag} eq $a)) {
my $q = '';
# quote value?
if ($self->quote_attr || $v =~ tr/a-zA-Z0-9.-//c) {
# use single quote if value contains double quotes but no single quotes
$q = ($v =~ tr/"// && $v !~ tr/'//) ? "'" : '"'; # catch emacs ");
}
# add value part
$s .= '='.$q.(encode_entities($v,$q.$self->entities)).$q;
}
# add string to resulting list
push @result, $s;
}

push @result,'/' if $self_closing; ###<<<<<<
return @result; # return list ('attr="val"','attr="val"',...);
}

Thanks for the answer. However, neither of these is a solution. Even with the tab indention option you suggested, print $h->as_HTML still runs things together in odd ways that a human never would (for example, all of the h2's are run together on the same line with the preceding p tag). Hence the use of PrettyPrinter. I think you misunderstood my miminal example with regard to PrettyPrinter. There is nothing wrong with my img tags--PrettyPrinter prints all self-closing tags as standard tags with a / property set to "/", e.g. <br /> becomes <br /="/"> � SenatorForLife Jan 19 '12 at 15:05

i updated the post about how to fix the module. hope this will help � Dimanoid Jan 19 '12 at 19:20

This seems to work very nicely, thank you! I'm definitely too much of a Perl newbie to have figured out the hack on my own. For other Strawberry Perl users, here's where HTML::PrettyPrinter showed up on my machine after being installed with cpanm: C:\strawberry\perl\site\lib\HTML\PrettyPrinter.pm � SenatorForLife Jan 19 '12 at 20:20

Thanks. Your fix worked for me. � AJ Dhaliwal Jun 19 '12 at 7:05

[Jan 18, 2012] Prettify - Beautify HTML in Perl

Stack Overflow

I have some HTML output sitting in a variable that I will like to Prettify / Beautify but struggling to make sense out of the results of my web searches.

The example in the HTML::Tidy documentation shows it uses the source parameter as a variable (which is what was requested). � Thomas Dickey Mar 28 '15 at 10:55

HTML::Tidy does exactly that:

#!/usr/bin/perl
use strict;
use warnings;

use HTML::Tidy;

my $str = '<div><p>Text<h2>Heading</h2>';
my $tidy = 'HTML::Tidy'->new;
print $tidy->clean($str);

output

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<meta name="generator" content="tidyp for Linux (v1.04), see www.w3.org">
<title></title>
</head>
<body>
<div>
<p>Text</p>
<h2>Heading</h2>
</div>
</body>
</html>

Given that the HTML is not under your control, you might want to also consider HTML::PrettyPrinter if HTML::TreeBuilder generates the correct syntax tree for your HTML.

[Dec 12,2005] Programming - HTML and CSS Code Beautifier The Tabifier - Arantius.com

I've been working on this tool for a while. It's not quite done yet, but it's definitely at a useful stage. I call it the tabifier, but in truth it's a code beautifier.

The overarching design goal for this tool was to beautify HTML code without breaking it. This is of course not totally possible, but I've strived to get as close as possible. There are great HTML beautifiers out there like HTML Tidy but they generally do too much.

When I take an ugly HTML page that someone else has written and I don't know, I want to pass it through a beautifier to make it easier to work with. Things like tidy, though, will drastically alter the code, often making it more work to turn the result of the beautification back into something that displays like the original than it would be to just plain fix it.

htmlpp A Simple HTML Pretty Printer by Len Budney.

htmlpp is a simple HTML pretty printer, based on nsgmls and SGMLS.pm. The code is pretty alpha, but gives attractive results for many HTML docs. Some things, like nested tables, are rendered only passably. Other deeply-nested structures may render badly as well.

Note that this pretty-printer is oldish, and alpha, and unlikely to be developed any further. It's not a bad illustration of some of the possibilities for SGML technology in web authoring. Perhaps someone will take up the challenge, and build the "right" tool!

Since htmlpp gets its input from nsgmls, invalid documents should not be expected to work. However, a side effect of this approach is that minor errors and inconsistencies are actually fixed. Attribute values are always quoted in the pretty printed version. Characters like "<", ">" and "&" are converted into the appropriate SGML entities in attribute values and in document text. End tags are inserted automatically -- which will surprise you if you thought it was legal to imbed <pre> elements inside <p> elements, for example.

HTMLPrettyPrinter - generate nice HTML files from HTML syntax trees

[June 7, 2002] A prettyprinter for HTML documents

From the author book The Web Architect's Handbook; makes heavy use of modules:

use LWP::Simple;
use HTML::Parse;
use HTML::Entities;
use Text::Wrap;
use Getopt::Long;

[July 14, 2001] Clean up your Web pages with HTML TIDY

a free utility to fix mistakes made while editing HTML and to automatically tidy up sloppy editing into nicely layed out markup.

It also works great on the atrociously hard to read markup generated by specialized HTML editors and conversion tools, and can help you identify where you need to pay further attention on making your pages more accessible to people with disabilities.

[July 14, 1999] hindent

HTML indentation (pretty printing) utility Mar 28th 1999, 19:16 stable: 1.0.1 - devel: none license: GPL
http://www.domtools.com/pub/hindent1.1.0.tar.gz (12 hits)
Homepage: http://www.domtools.com/unix/hindent.shtml (34 hits)
Changelog: http://www.domtools.com/pub/hindent1.1.0-changes.txt

FHTML.PL by John Watson

(Perl) Formats and indents HTML code and writes a new file with the results.

ZDNet Software Library - Pretty HTML

Pretty HTML is an easy-to-use program that formats your HTML Web pages. After processing, your HTML code is neatly arranged, commented, spaced, and indented, making it much easier to read and maintain. You can also use Pretty HTML to compress your Web pages by eliminating unnecessary spaces and carriage returns. Process your Web pages one at a time or batch-format entire folders in a single operation. Pretty HTML offers a number of options to ensure that the HTML formatting is done to your liking. To play it extra safe, you can have the program make backup copies of your originals. Excellent online help is included.

Recommended Links

Google matched content

Softpanorama Recommended

Top articles

Sites

Tidy:

Etc

Validators



Etc

Society

Groupthink : Two Party System as Polyarchy : Corruption of Regulators : Bureaucracies : Understanding Micromanagers and Control Freaks : Toxic Managers :   Harvard Mafia : Diplomatic Communication : Surviving a Bad Performance Review : Insufficient Retirement Funds as Immanent Problem of Neoliberal Regime : PseudoScience : Who Rules America : Neoliberalism  : The Iron Law of Oligarchy : Libertarian Philosophy

Quotes

War and Peace : Skeptical Finance : John Kenneth Galbraith :Talleyrand : Oscar Wilde : Otto Von Bismarck : Keynes : George Carlin : Skeptics : Propaganda  : SE quotes : Language Design and Programming Quotes : Random IT-related quotesSomerset Maugham : Marcus Aurelius : Kurt Vonnegut : Eric Hoffer : Winston Churchill : Napoleon Bonaparte : Ambrose BierceBernard Shaw : Mark Twain Quotes

Bulletin:

Vol 25, No.12 (December, 2013) Rational Fools vs. Efficient Crooks The efficient markets hypothesis : Political Skeptic Bulletin, 2013 : Unemployment Bulletin, 2010 :  Vol 23, No.10 (October, 2011) An observation about corporate security departments : Slightly Skeptical Euromaydan Chronicles, June 2014 : Greenspan legacy bulletin, 2008 : Vol 25, No.10 (October, 2013) Cryptolocker Trojan (Win32/Crilock.A) : Vol 25, No.08 (August, 2013) Cloud providers as intelligence collection hubs : Financial Humor Bulletin, 2010 : Inequality Bulletin, 2009 : Financial Humor Bulletin, 2008 : Copyleft Problems Bulletin, 2004 : Financial Humor Bulletin, 2011 : Energy Bulletin, 2010 : Malware Protection Bulletin, 2010 : Vol 26, No.1 (January, 2013) Object-Oriented Cult : Political Skeptic Bulletin, 2011 : Vol 23, No.11 (November, 2011) Softpanorama classification of sysadmin horror stories : Vol 25, No.05 (May, 2013) Corporate bullshit as a communication method  : Vol 25, No.06 (June, 2013) A Note on the Relationship of Brooks Law and Conway Law

History:

Fifty glorious years (1950-2000): the triumph of the US computer engineering : Donald Knuth : TAoCP and its Influence of Computer Science : Richard Stallman : Linus Torvalds  : Larry Wall  : John K. Ousterhout : CTSS : Multix OS Unix History : Unix shell history : VI editor : History of pipes concept : Solaris : MS DOSProgramming Languages History : PL/1 : Simula 67 : C : History of GCC developmentScripting Languages : Perl history   : OS History : Mail : DNS : SSH : CPU Instruction Sets : SPARC systems 1987-2006 : Norton Commander : Norton Utilities : Norton Ghost : Frontpage history : Malware Defense History : GNU Screen : OSS early history

Classic books:

The Peter Principle : Parkinson Law : 1984 : The Mythical Man-MonthHow to Solve It by George Polya : The Art of Computer Programming : The Elements of Programming Style : The Unix Hater�s Handbook : The Jargon file : The True Believer : Programming Pearls : The Good Soldier Svejk : The Power Elite

Most popular humor pages:

Manifest of the Softpanorama IT Slacker Society : Ten Commandments of the IT Slackers Society : Computer Humor Collection : BSD Logo Story : The Cuckoo's Egg : IT Slang : C++ Humor : ARE YOU A BBS ADDICT? : The Perl Purity Test : Object oriented programmers of all nations : Financial Humor : Financial Humor Bulletin, 2008 : Financial Humor Bulletin, 2010 : The Most Comprehensive Collection of Editor-related Humor : Programming Language Humor : Goldman Sachs related humor : Greenspan humor : C Humor : Scripting Humor : Real Programmers Humor : Web Humor : GPL-related Humor : OFM Humor : Politically Incorrect Humor : IDS Humor : "Linux Sucks" Humor : Russian Musical Humor : Best Russian Programmer Humor : Microsoft plans to buy Catholic Church : Richard Stallman Related Humor : Admin Humor : Perl-related Humor : Linus Torvalds Related humor : PseudoScience Related Humor : Networking Humor : Shell Humor : Financial Humor Bulletin, 2011 : Financial Humor Bulletin, 2012 : Financial Humor Bulletin, 2013 : Java Humor : Software Engineering Humor : Sun Solaris Related Humor : Education Humor : IBM Humor : Assembler-related Humor : VIM Humor : Computer Viruses Humor : Bright tomorrow is rescheduled to a day after tomorrow : Classic Computer Humor

The Last but not Least Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand ~Archibald Putt. Ph.D


Copyright � 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain some broken links as it develops like a living tree...

You can use PayPal to to buy a cup of coffee for authors of this site

Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. You you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.

Last modified: March 05, 2020