You can think of a data frame as a structure similar to an Excel spreadsheet.
It is essentially a list of vectors. A data frame is more general than a matrix as different columns can contain different modes of data (numeric, character, etc.). Data frames are the most common data structure in R.
Like an Excel spreadsheet, an R data frame is a two-dimensional object consisting of rows and columns, and you can address columns by name. Rows are referred to by the first (left-hand) subscript, columns by the second (right-hand) subscript or by name. Each element can be addressed by two indexes provided in square brackets, for example
gld[2,3]
You can use intervals instead of a single index, for example
gld[1,4:5]
You can drop a row or a column from the data set with the minus operation, for example
gld[,-2] # drops the second column
gld[-(1:n),] # drops rows 1 to n from gld
You can select multiple columns with the c() function:
gld[, c("date", "open", "close", "volume")]
To select all the entries in a column, leave the row index empty and give the column number after the comma, for example
gld[,2]
A data frame is created with the data.frame() function:
mydata <- data.frame(vector1, vector2, ...)
Assign the result of combining three vectors to the metals variable:
metals <- data.frame(date, open, close)
Now, try printing metals to see its contents using the statement print(metals):
> print(metals)
  date open close
  ...  ...   ...
There's your new data frame, neatly organized into rows, with column names (derived from the variable names) across the top.
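For completeness, here is a minimal self-contained sketch; the three vectors and their values are invented just for illustration:
date  <- as.Date(c("2015-07-27", "2015-07-28", "2015-07-29"))
open  <- c(104.94, 105.09, 104.93)
close <- c(104.86, 105.02, 105.17)
metals <- data.frame(date, open, close)
print(metals)    # column names are taken from the variable names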
You can get individual columns by providing their index number in double-brackets. Try getting the second column (prices) of metals:
metals[[2]]
You could instead provide a column name as a string in double-brackets. (This is often more readable.) Retrieve the "close" column:
metals[["close"]]
There are numerous ways to load data into a data frame. The simplest and most common is downloading and then reading comma-separated values (CSV) files. You can use read.csv to do that; it actually calls read.table with some arguments preset. The result of read.table is a data frame.
The first argument to read.table is the full path of the file to be loaded, or a URL. If you specify just the name of the file, it is assumed to be in your project folder.
yahooUrl <- "http://real-chart.finance.yahoo.com/table.csv?s=GLD&d=6&e=30&f=2015&g=d&a=10&b=18&c=2004&ignore=.csv"
theGold <- read.table(file = yahooUrl, header = TRUE, sep = ",")
The result can now be seen using head.
> head(theGold)
The first argument is the file name in quotes (or as a character variable). Notice how we explicitly used the argument names file, header and sep. The second argument, header, indicates that the first row of data holds the column names. The third argument gives the delimiter separating data cells. Changing this to other values such as "\t" (tab delimited) or ";" (semicolon delimited) allows it to read other types of files.
Another little argument that is helpful to use is stringsAsFactors. Setting it to FALSE (the default is TRUE) prevents character columns from being converted to factor columns. This saves computation time, which can be substantial for a large dataset with many rows and several character columns with many unique values. Keeping the columns as character data also makes them easier to work with in many cases. Conversion to factors is often overkill unless the values come from a fixed set of levels and can benefit from factor-style operations.
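As an illustrative sketch, re-using the yahooUrl defined above (the Date column name comes from the Yahoo header):
theGold <- read.table(file = yahooUrl, header = TRUE, sep = ",", stringsAsFactors = FALSE)
class(theGold$Date)    # now "character" instead of "factor"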
BTW the stringsAsFactors argument can also be used in the data.frame() function to block conversion of strings into factors:
theGold <- data.frame(datet=d, close=cc, volume=v, stringsAsFactors=FALSE)
There are several other arguments to the read.table function. Among the most useful are quote and colClasses: the former specifies the character used for enclosing cells, the latter the data type of each column.
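A sketch of both arguments, assuming the seven-column Yahoo layout (one date column followed by six numeric columns) shown in the sample file below:
theGold <- read.table(file = yahooUrl, header = TRUE, sep = ",",
                      quote = "\"",                               # cells may be enclosed in double quotes
                      colClasses = c("Date", rep("numeric", 6)))  # one declared type per column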
When a comma-delimited file is poorly built, for example when the comma is also used inside cells (as a decimal mark in many European locales), read.csv will mis-parse it. In that case you can try the functions read.csv2 (or read.delim2) instead, which expect ";" (or tab) as the separator and "," as the decimal mark.
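For example, a semicolon-delimited file with comma decimal marks (the file name here is hypothetical) could be read with:
euroQuotes <- read.csv2("gld150730_eu.csv", stringsAsFactors = FALSE)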
Typing in all your data by hand only works up to a point, obviously, which is why R was given the capability to easily load data in from external files.
You can create a couple of data files to experiment with in your project directory. To check what files your project directory contains, use the function list.files():
> list.files()
Let's assume that there is a CSV (Comma Separated Values) file "gld150730.csv" in your project directory. You can export such a file from any spreadsheet program or download it from any web site that provides stock quotes, such as http://finance.yahoo.com . For example:
Date,Open,High,Low,Close,Volume,Adj Close
2015-07-29,104.93,105.629997,104.489998,105.169998,5613700,105.169998
2015-07-28,105.089996,105.330002,104.830002,105.019997,5522200,105.019997
2015-07-27,104.940002,105.68,104.660004,104.860001,9330000,104.860001
2015-07-24,103.610001,105.589996,103.43,105.349998,11442700,105.349998
2015-07-23,104.980003,105.300003,104.199997,104.330002,5691800,104.330002
2015-07-22,104.389999,105.089996,104.18,104.800003,8288700,104.800003
2015-07-21,105.809998,106.32,105.25,105.370003,9391100,105.370003
2015-07-20,106.599998,106.650002,105.620003,105.699997,15437900,105.699997
2015-07-17,109.110001,109.160004,108.400002,108.650002,13954500,108.650002
2015-07-16,109.669998,110.010002,109.599998,109.760002,4221900,109.760002
2015-07-15,110.00,110.190002,109.580002,110.160004,8157600,110.160004
2015-07-14,111.00,111.080002,110.629997,110.739998,2575900,110.739998
2015-07-13,110.43,111.139999,110.360001,110.989998,4268600,110.989998
2015-07-10,111.18,111.709999,111.029999,111.489998,3585200,111.489998
2015-07-09,111.800003,111.93,111.150002,111.360001,3793800,111.360001
2015-07-08,111.379997,111.650002,111.080002,111.089996,5655100,111.089996
2015-07-07,111.080002,111.139999,110.050003,110.760002,9062300,110.760002
2015-07-06,111.709999,112.580002,111.629997,112.059998,4228800,112.059998
2015-07-02,111.660004,111.839996,111.410004,111.760002,3828800,111.760002
2015-07-01,112.120003,112.510002,111.940002,111.980003,4368000,111.980003
You can load a CSV file's content into a data frame by passing the file name to the read.csv function. Try it with the "gld150730.csv" file:
theGold <- read.csv("gld150730.csv")
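A quick sanity check after loading (a sketch; the column names follow the file header):
dim(theGold)      # number of rows and columns
head(theGold, 3)  # the first three rows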
Fields in a file can be separated by tab characters rather than commas.
For files that use separator strings other than commas, you can use the read.table function. The sep argument defines the separator character, and you can specify a tab character with "\t".
theGold <- read.table("gld150730.csv", sep="\t", header=TRUE)
To get a single column of data from a data frame, specify the column and do not specify any rows. For example, to access the first column in the data frame theGold you can use the index of this column:
theDate=theGold[,1]
In general each index can be a vector. That means you can use ranges to select a set of consecutive columns (or non-consecutive ones, if the step in the sequence is larger than 1):
theDate=theGold[,3:5]
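For non-consecutive columns you can build the index vector with seq() or c(); this sketch assumes the seven-column layout of the sample file:
theGold[, seq(1, 7, by = 2)]   # columns 1, 3, 5 and 7
theGold[, c(1, 5)]             # columns 1 and 5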
Unlike most other programming languages, R also lets you use column names, which is more convenient than using numeric indexes. Column names are a character vector; you can get the list of names of the columns of a particular data frame using the function names(), for example
names(theGold)
To access a single column using this "column name" feature, just put the name of the column instead of a numeric index:
theDate=theGold[,"date"]
To access multiple columns by name, make the column argument a character vector of the names of the columns you want to be in the output.
goldSelectedCol <- theGold[, c("date", "open", "close", "volume")]
When you select a single column, R converts it into a vector and displays the values horizontally. If you want the values to be displayed vertically, as you are used to when viewing data frames, you need to ensure that the result is still a data frame despite having just a single column. That can be achieved using the argument drop=FALSE:
theDate=theGold[,"date",drop=FALSE]
You can check the class of the result: it will be a data frame, not a vector. For example:
class(theGold[,"date",drop=FALSE])
Typing all those brackets can get tedious and error prone, so in R there is a shorthand notation: the data frame name, a dollar sign, and the column name (without quotes). Try using it to get the "close" column:
theGold$close
The $ notation selects a particular column (vector) from a given data frame.
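The column extracted with $ can be used directly in computations, for example (the column name is assumed to be close; adjust it to match names(theGold)):
mean(theGold$close)      # average closing price
summary(theGold$close)   # quartiles, mean, min and max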
To get a single row you can use notation similar to getting a single column -- specify the row and do not specify any columns:
theRow2<-theGold[2, ]
To specify multiple rows, use a vector, for example
theRow2<-theGold[2:10, ]
If the rows are not adjacent, use the c() function to construct a vector, for example
theRow2<-theGold[c(2,5,10), ]
As the nrow() function provides the number of rows, you can compute row indexes and use them instead of constants. For example, to select the last 200 rows you can use the following:
rMax <- nrow(theGold)
rMin <- rMax - 199
theRow2 <- theGold[rMin:rMax, ]
R contains two very useful functions for operating on rows, called head and tail, which are similar to the Unix utilities with the same names:
head(theGold, n = 10) # n = 10 means that the first 10 lines are printed in the R console
Usually a data.frame has far too many rows to print them all to the screen, so thankfully the head function prints out only the first few rows.
Try the following commands on our example data frame:
head(theGold)
head(theGold, n = 7)
tail(theGold)
tail(theGold, n = 10)
There are various ways to inspect a data frame, such as:
RStudio has a nice data browser (View(mydata)). The data frame will be displayed in a nice spreadsheet-like format in the upper left pane.
You can also use the functions head() and tail() to display the rows that interest you in the command window.
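A few other command-line ways to inspect a data frame (a short sketch):
str(theGold)       # compact structure: dimensions plus the type of each column
summary(theGold)   # per-column summary statistics
nrow(theGold); ncol(theGold)   # number of rows and columns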
Most of the time when you work with data frames you are changing the data. One of the changes you can make to a data frame is adding a column or a row, thereby increasing the dimensions of the data frame.
There are a few different ways to do it, but the easiest ones are cbind() and rbind(), which are part of the base package:
mydata <- cbind(mydata, newVector)
mydata <- rbind(mydata, newVector)
Remember that the length of newVector must match the corresponding dimension of the data frame: for cbind() the new column needs one element per row, while for rbind() the new row needs one element per column. For example, for the cbind() command the following statement should be TRUE:
dim(mydata)[1]==length(newVector)
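As a sketch using the frame loaded earlier (the High and Low column names are assumed from the Yahoo CSV header):
dailyRange <- theGold$High - theGold$Low    # one value per row
length(dailyRange) == nrow(theGold)         # TRUE, so cbind() is safe
theGold <- cbind(theGold, dailyRange)       # dailyRange becomes the last column
names(theGold)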
To see more samples, you can always do ?base::cbind and ?base::rbind.
It can get tiresome typing the data frame name followed by $ at the beginning of every variable name, so shortcuts are available. You can use either the attach() and detach() functions or the with() function to simplify your code.
Attach, Detach, and With
The attach() function adds the data frame to the R search path. When a variable name is encountered, data frames in the search path are checked in order to locate the variable. Using the built-in mtcars data frame as an example, you could use the following code to obtain summary statistics for automobile mileage (mpg) and plot this variable against engine displacement (disp) and weight (wt):
summary(mtcars$mpg)
plot(mtcars$mpg, mtcars$disp)
plot(mtcars$mpg, mtcars$wt)
This could also be written as
attach(mtcars)
summary(mpg)
plot(mpg, disp)
plot(mpg, wt)
detach(mtcars)
The detach() function removes the data frame from the search path. Note that detach() does nothing to the data frame itself. The detach() call is optional, but it is good programming practice and should be included routinely.
The limitations with this approach are evident when more than one object can have the same name. Consider the following code:
> mpg <- c(25, 36, 47)
> attach(mtcars)
The following object(s) are masked _by_ '.GlobalEnv': mpg
> plot(mpg, wt)
Error in xy.coords(x, y, xlabel, ylabel, log) : 'x' and 'y' lengths differ
> mpg
[1] 25 36 47
Here we already have an object named mpg in our environment when the mtcars data frame is attached. In such cases, the original object takes precedence, which isn't what you want. The plot statement fails because mpg has 3 elements and wt has 32 elements. The attach() and detach() functions are best used when you're analyzing a single data frame and you're unlikely to have multiple objects with the same name. In any case, be vigilant for warnings that say that objects are being masked.
An alternative approach is to use the with() function. You could write the previous example as
with(mtcars, {
  summary(mpg, disp, wt)
  plot(mpg, disp)
  plot(mpg, wt)
})
In this case, the statements within the {} brackets are evaluated with reference to the mtcars data frame. You don't have to worry about name conflicts here. If there's only one statement (for example, summary(mpg)), the {} brackets are optional.
The limitation of the with() function is that assignments will only exist within the function brackets. Consider the following:
> with(mtcars, {
    stats <- summary(mpg)
    stats
  })
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.40   15.43   19.20   20.09   22.80   33.90
> stats
Error: object 'stats' not found
If you need to create objects that will exist outside of the with() construct, use the special assignment operator <<- instead of the standard one (<-). It will save the object to the global environment outside of the with() call. This can be demonstrated with the following code:
> with(mtcars, {
    nokeepstats <- summary(mpg)
    keepstats <<- summary(mpg)
  })
> nokeepstats
Error: object 'nokeepstats' not found
> keepstats
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.40   15.43   19.20   20.09   22.80   33.90
Most books on R recommend using with() over attach(). Ultimately the choice is a matter of preference and should be based on what you're trying to achieve and your understanding of the implications.