I'm looking for a way to print a bottom line under a data frame that I print out in R.
I want the printed bottom line to be exactly as long as the data frame output, but the width varies.
Any idea?
EDIT
I'm trying to get rid of
cat("---------------------------------------\n")
and want to make that dynamic, sized to the output of a given data frame. The "-----" line should be neither longer nor shorter than the data frame output.
Use getOption("width"):
> getOption("width")
[1] 80
You can see a description of this option via ?options which states
‘width’: controls the maximum number of columns on a line used in
printing vectors, matrices and arrays, and when filling by
‘cat’.
That doesn't mean that the entire 80 characters (in my case) are used, but R's printing shouldn't extend beyond that, so it serves as an upper limit.
You should probably also check this in IDEs or other front-ends to R. For example, RStudio might do something different depending on the width of the console widget in its app.
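For a quick-and-dirty rule that spans the whole console rather than the table, getOption("width") can be used directly (strrep() needs R >= 3.3):

```r
df <- head(mtcars)
print(df)
# draw a rule as wide as the console, i.e. R's printing upper bound
cat(strrep("-", getOption("width")), "\n")
```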
To actually format exactly the correct width for the data frame, you'll need to process the data frame into character strings for each line (much as print.data.frame does via its format method). Something like:
df <- data.frame(Price = round(runif(10), 2),
                 Date = Sys.Date() + 0:9,
                 Subject = rep(c("Foo", "Bar", "DJGHSJIBIBFUIBSFIUBFUIS"),
                               length.out = 10),
                 Category = rep("Media", 10))
class(df) <- c("MyDF", "data.frame")
print.MyDF <- function(x, ...) {
  fdf <- format(x)
  ## formatted columns are padded to a common width, so the first row
  ## carries each column's display width
  strings <- apply(x, 2, function(col) unlist(format(col)))[1, ]
  rowname <- format(rownames(fdf))[[1]]
  strings <- c(rowname, strings)
  names <- c("", colnames(x))
  ## each column is at least as wide as its header
  widths <- pmax(nchar(strings), nchar(names))
  csum <- sum(widths + 1) - 1  # one separating space between columns
  print.data.frame(x)          # use the argument, not the global df
  writeLines(paste(rep("-", csum), collapse = ""))
  writeLines("Balance: 48") ## FIXME !!
  invisible(x)
}
which gives:
> df
Price Date Subject Category
1 0.73 2015-06-29 Foo Media
2 0.11 2015-06-30 Bar Media
3 0.19 2015-07-01 DJGHSJIBIBFUIBSFIUBFUIS Media
4 0.54 2015-07-02 Foo Media
5 0.04 2015-07-03 Bar Media
6 0.37 2015-07-04 DJGHSJIBIBFUIBSFIUBFUIS Media
7 0.59 2015-07-05 Foo Media
8 0.85 2015-07-06 Bar Media
9 0.15 2015-07-07 DJGHSJIBIBFUIBSFIUBFUIS Media
10 0.05 2015-07-08 Foo Media
----------------------------------------------------
Balance: 48
See how this works below. Very simple counting of characters, no bells and whistles, but it should do the expected job:
EDIT: printing of the data.frame and the line are now done in the function.
# create a function that prints the data.frame with the line we want
lineLength <- function(testDF) {
  # start with the characters in the row names,
  # plus one empty space between columns
  dashes <- max(nchar(rownames(testDF))) + length(testDF)
  # loop over columns, finding the longest string in each, including the header
  for (i in 1:length(testDF)) {
    x <- nchar(colnames(testDF))[i]
    y <- max(nchar(testDF[, i]))
    dashes <- dashes + max(x, y)
  }
  myLine <- paste(rep("-", dashes), collapse = "")
  print(testDF)
  cat(myLine, "\n")
}
# sample data
data( mtcars )
# see how it works
lineLength( head( mtcars ) )
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
--------------------------------------------------------------------
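A shortcut that avoids counting column widths by hand is to measure the printed output itself; a minimal sketch using capture.output() (print_with_rule is a made-up name):

```r
print_with_rule <- function(df) {
  out <- capture.output(print(df))  # the exact lines print() would emit
  writeLines(out)
  # the rule is as wide as the widest printed line
  cat(strrep("-", max(nchar(out))), "\n")
}
print_with_rule(head(mtcars))
```

This sidesteps any mismatch between hand-counted widths and what print() actually produces.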
Related
I'm writing a function that takes a data.table as an argument. Some of the data.table's column names are specified as arguments, but not all of them, and all original columns need to be maintained. Inside the function, some columns need to be added to the data.table. Even if the data.table is copied inside the function, I want to add these columns in a way that is guaranteed not to overwrite existing columns. What's the best way to ensure I'm not overwriting columns, given that the column names are not all known?
Here's one approach:
library(data.table)
# x is a data.table and knownvar is a column name of that data.table
f <- function(x, knownvar) {
  x <- copy(x)
  tempcol <- "z"
  while (tempcol %in% names(x))
    tempcol <- paste0("i.", tempcol)
  tempcol2 <- "q"
  while (tempcol2 %in% names(x))
    tempcol2 <- paste0("i.", tempcol2)
  x[, (tempcol) := 3]
  eval(parse(text = paste0("x[,(tempcol2):=", tempcol, "+4]")))
  x
}
Note that even though I'm copying x here, I still need this process to be memory efficient. Is there an easier way of doing this? Possibly without using eval(parse(text = ...))?
Obviously I could just create a local variable (e.g. a vector) in the function environment (rather than adding it explicitly as column of the data.table), but this wouldn't work if I then need to sort/join the data.table. Plus I may want to explicitly return a data.table that contains both the original variables and the new column.
Here is one way to write the function using set and non-standard evaluation with substitute() + eval().
Note 1: if new columns are created based on the column names in newcols (instead of the column name in knownvar), the character names in newcols are converted to symbols with as.name() (or equivalently as.symbol()).
Note 2: new columns in newvals can only be added in a sensible order, i.e. if column q requires column z, column z should be added before column q.
library(data.table)
f <- function(x, knownvar) {
  ## remove if x should be modified in-place
  x <- copy(x)
  ## new column names
  newcols <- setdiff(make.unique(c(names(x), c("z", "q"))), names(x))
  ## new column values based on knownvar or new column names
  zcol <- as.name(newcols[1])
  newvals <- list(substitute(3 * knownvar), substitute(zcol + 4))
  for (i in seq_along(newvals)) {
    set(x, j = newcols[i], value = eval(newvals[[i]], envir = x))
  }
  return(x)
}
## example data
x <- as.data.table(mtcars)
x[, c("q", "q.1") := .(mpg, 2 * mpg)]
head(f(x, mpg))
#> mpg cyl disp hp drat wt qsec vs am gear carb q q.1 z q.2
#> 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 21.0 42.0 63.0 67.0
#> 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 21.0 42.0 63.0 67.0
#> 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 22.8 45.6 68.4 72.4
#> 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.4 42.8 64.2 68.2
#> 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 18.7 37.4 56.1 60.1
#> 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 18.1 36.2 54.3 58.3
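The collision-free names in newcols come from make.unique(), which appends .1, .2, ... to duplicates while skipping names that already exist; in isolation:

```r
existing <- c("a", "z", "q", "q.1")
# duplicates get suffixes that dodge every existing name:
# "q.1" is taken, so the new q becomes "q.2"
setdiff(make.unique(c(existing, "z", "q")), existing)  # "z.1" "q.2"
```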
I have a large csv file and would like to read only certain lines, defined by a vector of row numbers to be read. Is there any way to read these rows without reading the whole csv into memory?
The only solutions I've found seem to allow reading consecutive lines (e.g. lines 2-100).
A simple example of how you might combine the sed approach I linked to into an R function:
read_rows <- function(file, rows, ...) {
  tmp <- tempfile()
  row_cmd <- paste(paste(rows, "p", sep = ""), collapse = ";")
  cmd <- sprintf(paste0("sed -n '", row_cmd, "' %s > %s"), file, tmp)
  system(command = cmd)
  read.table(file = tmp, ...)
}
write.csv(x = mtcars,file = "~/Desktop/scratch/mtcars.csv")
> read_rows(file = "~/Desktop/scratch/mtcars.csv",rows = c(3,6,7),sep = ",",header = FALSE,row.names = 1)
V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> read_rows(file = "~/Desktop/scratch/mtcars.csv",rows = c(1,5,9),sep = ",",header = TRUE,row.names = 1)
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Note the difference with row 1 as the column headers.
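If sed isn't available (e.g. on a stock Windows setup), the same idea can be sketched in base R with a file connection, streaming one line at a time and keeping only the requested ones; read_rows_r is a hypothetical name, and this will be slower than sed on large files:

```r
read_rows_r <- function(file, rows, ...) {
  con <- file(file, open = "r")
  on.exit(close(con))
  kept <- character(0)
  i <- 0
  while (length(line <- readLines(con, n = 1)) > 0) {
    i <- i + 1
    if (i %in% rows) kept <- c(kept, line)
    if (i >= max(rows)) break  # stop early once the last wanted line is read
  }
  read.table(text = kept, ...)
}

write.csv(mtcars, tmp <- tempfile(fileext = ".csv"))
read_rows_r(tmp, rows = c(3, 6, 7), sep = ",", header = FALSE, row.names = 1)
```

As with the sed version, line numbers count the header as line 1, and at most the kept lines are held in memory.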
sqldf will read the file into a database (which it creates and then deletes for you) and then read only the rows you want into R. Assuming the csv file created in the Note at the end, define the desired Rows and then use read.csv.sql. We have used a temporary file for the database, but if the data is sufficiently small you can omit the dbname argument and it will use memory instead.
library(sqldf)
Rows <- c(3, 5, 10)
s <- toString(Rows)
fn$read.csv.sql("Letters.csv", "select * from file where rowid in ($s)",
                dbname = tempfile())
giving:
X Letters
1 "3" "c"
2 "5" "e"
3 "10" "j"
If the number of rows desired is very large then rather than embedding the row numbers in the SQL statement create a data frame from them and join it:
library(sqldf)
Rows <- c(3, 5, 10)
RowsDF <- data.frame(Rows)
fn$read.csv.sql("Letters.csv",
                "select file.* from file join RowsDF on file.rowid = RowsDF.Rows",
                dbname = tempfile())
Note
Letters <- data.frame(Letters = letters, stringsAsFactors = FALSE)
write.csv(Letters, "Letters.csv")
I'm wondering if there's a way to apply a function in a string variable to .SD cols in a data.table.
I can generalize all other parts of function calls using a data.table, including input and output columns, which I'm very happy about. But the final piece seems to be applying a variable function to a data.table, which is something I believe I've done before with dplyr and do.call.
library(data.table)
mtcars <- as.data.table(mtcars)
returnNames <- "calculatedColumn"
SDnames <- c("mpg", "hp")
myfunc <- function(data) {
  print(data)
  return(data[, 1] * data[, 2])
}
This obviously works:
mtcars[,eval(returnNames) := myfunc(.SD),.SDcols = SDnames,by = cyl]
But if I want to apply a dynamic function, something like this does not work:
functionCall <- "myfunc"
mtcars[,eval(returnNames) := lapply(.SD,eval(functionCall)),.SDcols = SDnames,by = cyl]
I get this error:
Error in `[.data.table`(mtcars, , `:=`(eval(returnNames), lapply(.SD, : attempt to apply non-function
Is using "apply" with "eval" the right idea, or am I on the wrong track entirely?
You don't want lapply. Since myfunc takes a data.table with multiple columns, you just want to feed such a data table into the function as one object.
To get the function, you need get instead of eval.
On the left-hand side of :=, you can just put the character vector in parentheses; eval isn't needed.
mtcars[, (returnNames) := get(functionCall)(.SD)
, .SDcols = SDnames
, by = cyl]
head(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear carb calculatedColumn
# 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2310.0
# 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2310.0
# 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 2120.4
# 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2354.0
# 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3272.5
# 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1900.5
The code above was run after the following setup:
library(data.table)
mtcars <- as.data.table(mtcars)
returnNames <- "calculatedColumn"
SDnames <- c("mpg", "hp")
myfunc <- function(data) {
  print(data)
  return(data[, 1] * data[, 2])
}
functionCall <- "myfunc"
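The root cause of the original error is worth isolating: eval() of a character string just returns the string (strings evaluate to themselves), while get() looks the name up and returns the function object:

```r
functionCall <- "sum"
eval(functionCall)       # "sum" -- still a character string, not a function
get(functionCall)(1:3)   # 6
```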
In this blog post, Paul Hiemstra shows how to sum up two columns using dplyr::mutate_. Copy/paste-ing relevant parts:
library(dplyr)
library(lazyeval)
f = function(col1, col2, new_col_name) {
  mutate_call = lazyeval::interp(~ a + b, a = as.name(col1), b = as.name(col2))
  mtcars %>% mutate_(.dots = setNames(list(mutate_call), new_col_name))
}
allows one to then do:
head(f('wt', 'mpg', 'hahaaa'))
Great!
I followed up with a question (see comments) as to how one could extend this to a 100 columns, since it wasn't quite clear (to me) how one could do it without having to type all the names using the above method. Paul was kind enough to indulge me and provided this answer (thanks!):
# data
df = data.frame(matrix(1:100, 10, 10))
names(df) = LETTERS[1:10]
# answer
sum_all_rows = function(list_of_cols) {
  summarise_calls = sapply(list_of_cols, function(col) {
    lazyeval::interp(~col_name, col_name = as.name(col))
  })
  df %>% select_(.dots = summarise_calls) %>% mutate(ans1 = rowSums(.))
}
sum_all_rows(LETTERS[sample(1:10, 5)])
I'd like to improve this answer on these points:
The other columns are gone. I'd like to keep them.
It uses rowSums() which has to coerce the data.frame to a matrix which I'd like to avoid.
Also, I'm not sure if the use of . within non-do() verbs is encouraged, because . within mutate() doesn't seem to adapt to just the group's rows when used with group_by().
And most importantly, how can I do the same using mutate_() instead of mutate()?
I found this answer, which addresses point 1, but unfortunately, both dplyr answers use rowSums() along with mutate().
PS: I just read Hadley's comment under that answer. IIUC, 'reshape to long form + group by + sum + reshape to wide form' is the recommended dplyr way for this type of operation?
Here's a different approach:
library(dplyr); library(lazyeval)
f <- function(df, list_of_cols, new_col) {
  df %>%
    mutate_(.dots = ~Reduce(`+`, .[list_of_cols])) %>%
    setNames(c(names(df), new_col))
}
head(f(mtcars, c("mpg", "cyl"), "x"))
# mpg cyl disp hp drat wt qsec vs am gear carb x
#1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 27.0
#2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 27.0
#3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 26.8
#4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 27.4
#5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 26.7
#6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 24.1
Regarding your points:
Other columns are kept
It doesn't use rowSums
You are specifically asking for a row-wise operation here so I'm not sure (yet) how a group_by could do any harm when using . inside mutate/mutate_
It makes use of mutate_
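The heavy lifting is done by Reduce(`+`, ...), which folds + over a list of columns element-wise, so no matrix coercion happens; stripped of the dplyr wrapping:

```r
d <- data.frame(a = 1:3, b = 4:6, c = 7:9)
# a data.frame is a list of columns, so Reduce sums them pairwise
Reduce(`+`, d[c("a", "b", "c")])  # 12 15 18
```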
I know I can plot a dendrogram as follows
library(cluster)
d <- mtcars
d[,8:11] <- lapply(d[,8:11], as.factor)
gdist <- daisy(d, metric = c("gower"), stand = FALSE)
dendro <- hclust(gdist, method = "average")
plot(as.dendrogram(dendro))
However I have some groups identified (eg. by an iterative classification method), given as the last column in d
G <- c(1,2,3,3,4,4,5,5,5,5,1,2,1,1,2,4,1,3,4,5,1,7,4,3,3,2,1,1,1,3,5,6)
d$Group <- G
head(d)
mpg cyl disp hp drat wt qsec vs am gear carb Group
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 4
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 4
I am trying to plot all the dendrograms together in the same plot with the same scale. The groups with only a single member (groups 6 and 7) also need to be plotted.
I am able to plot individual dendrograms for subsets of the data, except when the number of members in a group is only one. But I don't think this is the right approach.
layout(matrix(1:9, 3, 3, byrow = TRUE))
gdist <- as.matrix(gdist)
N <- max(G)
for (i in 1:N) {
  rc_tokeep <- row.names(subset(d, G == i))
  dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
  dend <- hclust(dis, method = "average")
  plot(as.dendrogram(dend))
}
The loop gives this error for the last two groups (6 and 7), which have only a single member:
Error in hclust(dis, method = "average") :
must have n >= 2 objects to cluster
Essentially I want to reproduce these types of plots. The clusters with single members are also plotted there.
If you want to mimic the last few graphs, you can do something like this:
N <- max(G)
layout(matrix(c(0, 1:N, 0), ncol = 1))
gdist <- as.matrix(gdist)
for (i in 1:N) {
  par(mar = c(0, 3, 0, 7))
  rc_tokeep <- row.names(subset(d, G == i))
  if (length(rc_tokeep) > 1) { # catch the groups with a single element to plot them differently
    dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
    dend <- hclust(dis, method = "average")
    plot(as.dendrogram(dend), horiz = TRUE,
         xlim = c(.8, 0), axes = FALSE) # the same xlim scales all panels; 0.8 fits this data, change as needed
  } else {
    plot(NA, xlim = c(.8, 0), ylim = c(0, 1), axes = FALSE, ann = FALSE)
    segments(0, .5, .1, .5) # the branch length for a one-element group is arbitrary; adjust as you see fit
    text(0, .5, pos = 4, rc_tokeep, xpd = TRUE)
  }
}
With your example it gives:
If you want to add the scale you can add a grid in all graphs and a scale in the last one:
N <- max(G)
layout(matrix(c(0, 1:N, 0), ncol = 1))
gdist <- as.matrix(gdist)
for (i in 1:N) {
  par(mar = c(0, 3, 0, 7))
  rc_tokeep <- row.names(subset(d, G == i))
  if (length(rc_tokeep) > 1) {
    dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
    dend <- hclust(dis, method = "average")
    plot(as.dendrogram(dend), horiz = TRUE, xlim = c(.8, 0), xaxt = "n", yaxt = "n")
    abline(v = seq(0, .8, .1), lty = 3) # here the grid
  } else {
    plot(NA, xlim = c(.8, 0), ylim = c(0, 1), axes = FALSE, ann = FALSE)
    segments(0, .5, .1, .5)
    text(0, .5, pos = 4, rc_tokeep, xpd = TRUE)
    abline(v = seq(0, .8, .1), lty = 3) # here the grid
  }
}
axis(1, at = seq(0, .8, .1)) # here the axis
And finally, if you want to even out the spacing between the different branches in the resulting plot, you can use table(d$Group) to get the number of members of each group and use these as the heights of the subplots:
N <- max(G)
layout(matrix(c(0, 1:N, 0), ncol = 1), heights = c(3, table(d$Group), 3)) # plus the heights of the empty spaces
gdist <- as.matrix(gdist)
for (i in 1:N) {
  par(mar = c(0, 3, 0, 7))
  rc_tokeep <- row.names(subset(d, G == i))
  if (length(rc_tokeep) > 1) {
    dis <- as.dist(gdist[rc_tokeep, rc_tokeep])
    dend <- hclust(dis, method = "average")
    plot(as.dendrogram(dend), horiz = TRUE, xlim = c(.8, 0), xaxt = "n", yaxt = "n")
    abline(v = seq(0, .8, .1), lty = 3)
  } else {
    plot(NA, xlim = c(.8, 0), ylim = c(0, 1), axes = FALSE, ann = FALSE)
    segments(0, .5, .1, .5)
    text(0, .5, pos = 4, rc_tokeep, xpd = TRUE)
    abline(v = seq(0, .8, .1), lty = 3)
  }
}
axis(1, at = seq(0, .8, .1))
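As an aside, the hard-coded 0.8 upper limit can be derived instead of guessed: the tallest merge in a clustering is max(h$height). A sketch with plain Euclidean distance standing in for the Gower dissimilarity used above:

```r
d2 <- dist(scale(mtcars[, 1:7]))
h <- hclust(d2, method = "average")
# tallest merge across the tree, rounded up to a tidy axis limit;
# use xlim = c(xmax, 0) in every panel instead of the hard-coded 0.8
xmax <- ceiling(max(h$height) * 10) / 10
xmax
```

For the grouped version, you would take the maximum height across all the per-group hclust objects so every panel shares one scale.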