How to make 'head' be applied automatically to output? - r

I have a bunch of large dataframes, so every time I want to display them, I have to use head:
head( blahblah(somedata) )
Typing head all the time gets old after the first few hundred times, so I'd like an easy way to do this if possible. One of the cool things about R compared to Java is that things like this are often really easy, if you know the secret incantation.
I searched in options, and found max.print, which almost works, except there is now a time delay.
head( blahblah(somedata) )
.... is instantaneous (to within the limits of my perception)
options(max.print=100)
blahblah(somedata)
.... takes about 3 seconds, so longer than typing head
Is there some way of making head be applied automatically when printing large data structures?
A piece of code that reproduces this behavior:
long_dataset = data.frame(a = runif(10e5),
b = runif(10e5),
c = runif(10e5))
system.time(head(long_dataset))
options(max.print = 6)
system.time(print(long_dataset))

Putting my comment into an answer: if you use the data.table package (with data.table rather than data.frame objects), printing automatically shows only the first 5 and last 5 rows once the data.table has more than 100 rows.
library(data.table)
DT <- data.table(long_dataset)
DT
1: 0.19613138 0.88714284 0.25715067
2: 0.25405787 0.76544909 0.75632468
3: 0.24841384 0.22095875 0.52588596
4: 0.72766161 0.79696771 0.88802759
5: 0.02448372 0.77885568 0.38199993
---
999996: 0.28230967 0.09410921 0.84420162
999997: 0.73598931 0.86043537 0.30147089
999998: 0.86314546 0.90334347 0.08545391
999999: 0.85507851 0.46621131 0.23892566
1000000: 0.33172155 0.43060483 0.44173400
The data.table FAQ 2.11 deals with this explicitly.
EDIT to deal with existing data.frame objects you don't want to convert.
If you are hesitant to convert existing data.frame objects to data.table objects, you could simply define print.data.frame as data.table:::print.data.table
print.data.frame <- data.table:::print.data.table
long_dataset
1: 0.19613138 0.88714284 0.25715067
2: 0.25405787 0.76544909 0.75632468
3: 0.24841384 0.22095875 0.52588596
4: 0.72766161 0.79696771 0.88802759
5: 0.02448372 0.77885568 0.38199993
---
999996: 0.28230967 0.09410921 0.84420162
999997: 0.73598931 0.86043537 0.30147089
999998: 0.86314546 0.90334347 0.08545391
999999: 0.85507851 0.46621131 0.23892566
1000000: 0.33172155 0.43060483 0.44173400
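The thresholds described above (truncation beyond 100 rows, 5 rows shown at each end) are defaults rather than fixed behaviour. As a rough sketch, assuming a reasonably recent data.table, they can be changed per call through the topn and nrows arguments of the print method, or for the whole session through the datatable.print.topn and datatable.print.nrows options. The table DT_demo is made up purely for illustration.

```r
library(data.table)

# Hypothetical demo table, only here to have something long to print
DT_demo <- data.table(a = 1:1000, b = 1000:1)

# Per call: show 3 rows from each end instead of the default 5
print(DT_demo, topn = 3)

# Session-wide: truncate tables longer than 50 rows, 3 rows per end
old <- options(datatable.print.nrows = 50, datatable.print.topn = 3)
print(DT_demo)
options(old)  # restore the previous settings
```

Setting the options in .Rprofile would make the shorter display the default without touching any calling code.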

I'd go along with @thelatemail's suggestion, i.e. redefine print.data.frame:
# Same signature as the generic, print(x, ...), so extra arguments pass through
print.data.frame <- function(x, ...) {
  if (nrow(x) > 10) {
    base::print.data.frame(head(x, 5), ...)
    cat("----\n")
    base::print.data.frame(tail(x, 5), ...)
  } else {
    base::print.data.frame(x, ...)
  }
  invisible(x)
}
data.frame(x=1:100, y=1:100)
# x y
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5
# ----
# x y
# 96 96 96
# 97 97 97
# 98 98 98
# 99 99 99
# 100 100 100
A more elaborate version could line everything up together and avoid the repeated header, but you get the idea.
You could put such a function in your .Rprofile or Rprofile.site file (see ?Startup) so it will be there every time you start an R session.
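One possible take on the "more elaborate version" mentioned above is sketched here. The helper name print_head_tail is hypothetical, and the approach assumes the data frame is narrow enough that the printed rows do not wrap across the console; for wide frames the line arithmetic would need more care.

```r
# Hypothetical helper: print the first and last n rows under a single header.
print_head_tail <- function(x, n = 5L, ...) {
  if (nrow(x) <= 2L * n) {
    base::print.data.frame(x, ...)
    return(invisible(x))
  }
  # Format head and tail together so both parts share the same column widths
  both <- rbind(head(x, n), tail(x, n))
  lines <- capture.output(base::print.data.frame(both, ...))
  # lines[1] is the header, followed by n head rows and then n tail rows
  # (assumption: no wrapping, so one output line per row)
  top <- seq_len(n + 1L)
  writeLines(c(lines[top], "----", lines[-top]))
  invisible(x)
}

print_head_tail(data.frame(x = 1:100, y = 1:100))
```

Assigning this function to print.data.frame would give the same automatic behaviour as the simpler version, without the repeated header.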

Related

Need for print() function in R

Sometimes I read posts where people use the print() function and I don't understand why it is used. Here for example in one answer the code is
print(fitted(m))
# 1 2 3 4 5 6 7 8
# 0.3668989 0.6083009 0.4677463 0.8685777 0.8047078 0.6116263 0.5688551 0.4909217
# 9 10
# 0.5583372 0.6540281
But using fitted(m) would give the same output. I know there are situations where we need print(), for example if we want to create plots inside of loops. But why is the print() function used in cases like the one above?
I guess that in many cases usage of print is just a bad/redundant habit, however print has a couple of interesting options:
Data:
x <- rnorm(5)
y <- rpois(5, exp(x))
m <- glm(y ~ x, family="poisson")
m2 <- fitted(m)
# 1 2 3 4 5
# 0.8268702 1.0523189 1.9105627 1.0776197 1.1326286
digits - sets the minimum number of significant digits to print
print(m2, digits = 3) # looks like round(m2, 3) here, but digits counts significant digits, not decimal places
# 1 2 3 4 5
# 0.827 1.052 1.911 1.078 1.133
na.print - turns NA values into a specified value (very similar to zero.print argument)
m2[1] <- NA
print(m2, na.print = "Failed")
# 1 2 3 4 5
# Failed 1.052319 1.910563 1.077620 1.132629
max - prints wanted number of values
print(m2, max = 2) # similar to head(m2, 2)
# 1 2
# NA 1.052319
I'm guessing, as I rarely use print myself:
using print() makes it obvious which lines of your code do printing and which ones do the actual work. It might make re-reading your code later easier.
using print() explicitly might make it easier to later refactor your code into a function: you just need to change the print into a return
programmers coming from a language with strict syntax might have a strong dislike of R's automatic printing feature
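The case the question alludes to, where print() is genuinely required, is code running inside a loop (or a function body): automatic printing only applies to values evaluated at top level. A minimal sketch, with made-up values, that captures the output of both variants:

```r
vals <- c(1.5, 2.5)

# Inside a loop, a bare expression is evaluated but never auto-printed
out_auto <- capture.output(for (v in vals) v)

# With an explicit print(), each value is written out
out_print <- capture.output(for (v in vals) print(v))

length(out_auto)   # nothing was printed
out_print          # one line per value
```

The same applies to lattice and ggplot2 objects, which are drawn by their print methods; that is why plots created in loops need an explicit print().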

R enumerate duplicates in a dataframe with unique value

I have a dataframe containing a set of parts and test results. The parts are tested on 3 sites (North, Centre and South). Sometimes those parts are re-tested. I want to eventually create some charts that compare the results from the first time that a part was tested with the second (or third, etc.) time that it was tested, e.g. to look at tester repeatability.
As an example, I've come up with the below code. I've explicitly removed the "Experiment" column from the morley data set, as this is the column I'm effectively trying to recreate. The code works, however it seems that there must be a more elegant way to approach this problem. Any thoughts?
Edit - I realise that the example given was overly simplistic for my actual needs (I was trying to generate a reproducible example as easily as possible).
New example:
part<-as.factor(c("A","A","A","B","B","B","A","A","A","C","C","C"))
site<-as.factor(c("N","C","S","C","N","S","N","C","S","N","S","C"))
result<-c(17,20,25,51,50,49,43,45,47,52,51,56)
data<-data.frame(part,site,result)
data$index<-1
repeat {
if(!anyDuplicated(data[,c("part","site","index")]))
{ break }
data$index<-ifelse(duplicated(data[,1:2]),data$index+1,data$index)
}
data
part site result index
1 A N 17 1
2 A C 20 1
3 A S 25 1
4 B C 51 1
5 B N 50 1
6 B S 49 1
7 A N 43 2
8 A C 45 2
9 A S 47 2
10 C N 52 1
11 C S 51 1
12 C C 56 1
Old example:
#Generate a trial data frame from the morley dataset
df<-morley[,c(2,3)]
#Set up an iterative variable
#Create the index column and initialise to 1
df$index<-1
# Loop through the dataframe looking for duplicate pairs of
# Runs and Indices and increment the index if it's a duplicate
repeat {
if(!anyDuplicated(df[,c(1,3)]))
{ break }
df$index<-ifelse(duplicated(df[,c(1,3)]),df$index+1,df$index)
}
# Check - The below vector should all be true
df$index==morley$Expt
We may use diff and cumsum on the 'Run' column to get the expected output. In this method, we are not creating a column of 1s (i.e. 'index'), and we are assuming that the sequence in 'Run' is ordered as shown in the OP's example.
indx <- cumsum(c(TRUE,diff(df$Run)<0))
identical(indx, morley$Expt)
#[1] TRUE
Or we can use ave
indx2 <- with(df, ave(Run, Run, FUN=seq_along))
identical(indx2, morley$Expt)
#[1] TRUE
Update
Using the new example
with(data, ave(seq_along(part), part, site, FUN=seq_along))
#[1] 1 1 1 1 1 1 2 2 2 1 1 1
Or we can use getanID from library(splitstackshape)
library(splitstackshape)
getanID(data, c('part', 'site'))[]
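As a self-contained check of the ave() approach on the new example, the sketch below rebuilds the part and site columns from the question (as plain character vectors, which is an assumption; the question uses factors) and numbers each repeated part/site pair in order of appearance. The name dat is used instead of data only to avoid masking the base function.

```r
part <- c("A", "A", "A", "B", "B", "B", "A", "A", "A", "C", "C", "C")
site <- c("N", "C", "S", "C", "N", "S", "N", "C", "S", "N", "S", "C")
dat <- data.frame(part, site)

# Within every part/site group, seq_along() counts 1, 2, ... in row order,
# and ave() puts those counts back in the original row positions
dat$index <- ave(seq_along(dat$part), dat$part, dat$site, FUN = seq_along)
dat
```

Rows 7 to 9 (the second round of tests on part A) get index 2, matching the index column produced by the repeat loop in the question.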
I think this is a job for make.unique, with some manipulation.
index <- 1L + as.integer(sub("\\d+(\\.)?","",make.unique(as.character(morley$Run))))
index <- ifelse(is.na(index),1L,index)
identical(index,morley$Expt)
#[1] TRUE
Details of your actual data.frame may matter. However, a couple of options working with your example:
#this works if each group starts with 1:
df$index<-cumsum(df$Run==1)
#this is maybe more general, with data.table
require(data.table)
dt<-as.data.table(df)
dt[,index:=seq_along(Speed),by=Run]

Programmatically generating a list of columns to be assigned to data.table with `:=` syntax

In data.table, I can generate a list of new columns that are immediately assigned to the table using the `:=` syntax, like so:
x <- data.table(x1=1:5, x2=1:5)
x[, `:=` (x3=x1+2, x4=x2*3)]
Alternatively, I could have done the following:
x[, c("x3","x4") := list(x1+2, x2*3)]
I would like to do something like the first method, but have the right hand side of the assignment statement be built up automatically using a custom function. For example, suppose I want a function that will accept a set of column names, then generate new columns that are the mean of the given columns, with the column name being equal to the original column plus some suffix. For example,
x[, `:=` MEAN(x1,x2)]
would yield the same result as
x[, `:=` (x1_mean=mean(x1), x2_mean=mean(x2))]
Is this possible in data.table? I realize this is possible if I'm willing to pass in a list of column names like in the c("x3","x4") := ... example, but I want to avoid this so I don't have to write as much code.
Just refer to the function by name:
myfun <- "mean"
x[,paste(names(x),myfun,sep="_"):=lapply(.SD,myfun)]
# x1 x2 x1_mean x2_mean
# 1: 1 1 3 3
# 2: 2 2 3 3
# 3: 3 3 3 3
# 4: 4 4 3 3
# 5: 5 5 3 3
Customization is straightforward:
divby2 <- function(x) x/2 # custom function
myfun <- "divby2"
mycols <- "x1" # custom columns
x[,paste(mycols,myfun,sep="_"):=lapply(.SD,myfun),.SDcols=mycols]
# x1 x2 x1_mean x2_mean x1_divby2
# 1: 1 1 3 3 0.5
# 2: 2 2 3 3 1.0
# 3: 3 3 3 3 1.5
# 4: 4 4 3 3 2.0
# 5: 5 5 3 3 2.5
We may some day have syntax like paste(.SDcols,myfun,sep="_"):=lapply(.SD,myfun), but .SDcols on the left-hand side is not supported currently.
Making a function. If you want a function to do this, there's
add_myfun <- function(DT,myfun,mycols){
DT[,paste(mycols,myfun,sep="_"):=lapply(.SD,myfun),.SDcols=mycols]
}
add_myfun(x,"median","x2")
Can a function be written that will work inside j of DT[i,j]? Maybe. But I think it's not a good idea.
Can you be sure your function will be robust to all the other uses of j, like by?
Can your function take advantage of data.table's optimization (e.g., of mean)?
Will anyone else be able to read your code?
Using [ can be slow. If you're doing this for many columns, you might be better off initializing the new columns and assigning with set.
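A rough sketch of the set() alternative mentioned in the last point, assuming the same toy table as in the question. The loop and the "_mean" naming simply mirror the examples above; it is one way to do it, not the only one.

```r
library(data.table)

x <- data.table(x1 = 1:5, x2 = 1:5)
mycols <- c("x1", "x2")

# set() adds or updates one column by reference without the overhead of
# calling [.data.table once per column
for (col in mycols) {
  set(x, j = paste(col, "mean", sep = "_"), value = mean(x[[col]]))
}
x
```

For a handful of columns the difference is negligible; the set() loop is mainly of interest when the number of columns is large.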

Creating combination of sequences

I am trying to solve following problem:
Consider 5 simple sequences: 0:100, 100:0, rep(0,101), rep(50,101), rep(100,101)
I need sets of 3 numeric variables, which have above sequences in all combinations. Since there are 5 sequences and 3 variables, there can be 5*5*5 combinations, hence total of 12625 (5*5*5*101) numbers in each variable (101 for each sequence).
These can be grouped in a data.frame of 12625 rows and 4 columns. First column (V) will simply have seq(1:12625) (rownumbers can be used in its place). Other 3 columns (A,B,C) will have above 5 sequences in different combinations. For example, the first 101 rows will have 0:100 in all 3 A,B and C. Next 101 rows will have 0:100 in A and B, and 100:0 in C. And so on...
I can create sequences as:
s = list()
s[[1]] = 0:100
s[[2]] = 100:0
s[[3]] = rep(0,101)
s[[4]] = rep(50,101)
s[[5]] = rep(100,101)
But how to proceed further? I do not really need the data frame but I need a function that returns a list containing the values of c(A,B,C) for the number (first or V column) sent to it. The number can obviously vary from 1 to 12625.
How can I create such a function. I will prefer a vector solution or one using apply family functions to optimize the speed.
You asked for a vectorized solution, so here's one using only data.table (similar to @SimonG's methodology)
library(data.table)
grd <- CJ(A = seq_len(5), B = seq_len(5), C = seq_len(5))
res <- grd[, lapply(.SD, function(x) unlist(s[x]))]
res
# A B C
# 1: 0 0 0
# 2: 1 1 1
# 3: 2 2 2
# 4: 3 3 3
# 5: 4 4 4
# ---
# 12621: 100 100 100
# 12622: 100 100 100
# 12623: 100 100 100
# 12624: 100 100 100
# 12625: 100 100 100
I came up with two solutions. I find this hard to do with apply and the likes since they tend to give an output that is not so nice to handle (maybe someone can "tame" them better than I can :D)
The first solution uses separate calls to lapply; the second one uses a for loop and some programming no-nos. Personally I prefer the second one, though the first one is faster...
grd <- expand.grid(a=1:5,b=1:5,c=1:5)
# apply-ish
A <- lapply(grd[,1], function(z){ s[[z]] })
B <- lapply(grd[,2], function(z){ s[[z]] })
C <- lapply(grd[,3], function(z){ s[[z]] })
dfr <- data.frame(A=do.call(c,A), B=do.call(c,B), C=do.call(c,C))
# for-ish
mat <- NULL
for(i in 1:nrow(grd)){
cur <- grd[i,]
tmp <- cbind(s[[cur[,1]]],s[[cur[,2]]],s[[cur[,3]]])
mat <- rbind(mat,tmp)
}
The output of both dfr and mat seems to be what you describe.
Cheers!
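Since the question ultimately asks for a function of the row number V rather than the full table, the lookup can also be sketched without building all 12625 rows. The function name get_abc is hypothetical, and the ordering assumed is the one described in the question (C changes fastest, A slowest).

```r
s <- list(0:100, 100:0, rep(0, 101), rep(50, 101), rep(100, 101))

# Hypothetical helper: values of A, B and C for a single row number v
get_abc <- function(v) {
  stopifnot(length(v) == 1L, v >= 1, v <= 12625)
  combo <- (v - 1) %/% 101       # which of the 125 combinations (0-based)
  pos   <- (v - 1) %% 101 + 1    # position within the 101-long sequence
  # Decode the combination number as three base-5 digits, A slowest
  idx <- c(A = combo %/% 25, B = (combo %/% 5) %% 5, C = combo %% 5) + 1
  vapply(idx, function(k) as.numeric(s[[k]][pos]), numeric(1))
}

get_abc(1)     # first row of the first block
get_abc(102)   # first row of the second block: C switches to 100:0
```

This trades the one-off cost of building the table for a little integer arithmetic on each call, which may or may not be faster depending on how many lookups are needed.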

Is a copy made when function returns a data.table?

I am updating a set of functions that previously only accepted data.frame objects to work with data.table arguments.
I decided to implement the function using R's method dispatch so that the old code using data.frames will still work with the updated functions. In one of my functions, I take in a data.frame as input, modify it, and then return the modified data.frame. I created a data.table implementation as well. For example:
# The functions
foo <- function(d) {
UseMethod("foo")
}
foo.data.frame <- function(d) {
# <Do Something>
return(d)
}
foo.data.table <- function(d) {
# <Do Something>
return(d)
}
I know that data.table works by making changes without copying, and I implemented foo.data.table while keeping that in mind. However, I return the data.table object at the end of the function because I want my old scripts to work with the new data.table objects. Will this make a copy of the data.table? How can I check? According to the documentation, one has to be very explicit to create a copy of a data.table, but I am not sure in this case.
The reason I want to return something when I do not have to with data.tables:
My old scripts look like this
someData <- read.table(...)
...
someData <- foo(someData)
I want the scripts to be able to run with data.tables by just changing the data ingest lines. In other words, I want the script to work by just changing someData <- read.table(...) to someData <- fread(...).
Thanks to Arun for his answer in the comments. I will be using his example in his comments to answer the question.
One can check if copies are being made by using the tracemem function to track an object in R. From the help file of the function, ?tracemem, the description says:
This function marks an object so that a message is printed whenever the internal code copies the object. It is a major cause of hard-to-predict memory use in R.
For example:
# Using a data.frame
df <- data.frame(x=1:5, y=6:10)
tracemem(df)
## [1] "<0x32618220>"
df$y[2L] <- 11L
## tracemem[0x32618220 -> 0x32661a98]:
## tracemem[0x32661a98 -> 0x32661b08]: $<-.data.frame $<-
## tracemem[0x32661b08 -> 0x32661268]: $<-.data.frame $<-
df
## x y
## 1 1 6
## 2 2 11
## 3 3 8
## 4 4 9
## 5 5 10
# Using a data.table
dt <- data.table(x=1:5, y=6:10)
tracemem(dt)
## [1] "<0x5fdab40>"
set(dt, i=2L, j=2L, value=11L) # No memory output!
address(dt) # Verify the address in memory is the same
## [1] "0x5fdab40"
dt
## x y
## 1: 1 6
## 2: 2 11
## 3: 3 8
## 4: 4 9
## 5: 5 10
It appears that the data.frame object is copied more than once when changing one element in the data.frame (tracemem reports three copies above), while the data.table is modified in place without making copies!
From my question, I can just track the data.table or data.frame object, d, before passing it on to the function, foo, to check if any copies were made.
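A minimal sketch of that check for the data.table method, comparing memory addresses before and after the call. The body of foo.data.table here is a hypothetical stand-in for the real work, and it only uses := so that nothing should trigger a copy.

```r
library(data.table)

foo <- function(d) UseMethod("foo")

# Hypothetical method body: adds one column by reference, then returns d
foo.data.table <- function(d) {
  d[, z := 1L]
  return(d)
}

someData <- data.table(x = 1:5, y = 6:10)
before <- address(someData)
someData <- foo(someData)
after <- address(someData)

identical(before, after)  # TRUE would mean the returned object is not a copy
```

If the method used a base-style assignment such as d$z <- 1L instead, the two addresses would be expected to differ.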
Not sure this adds anything, but as a cautionary tale note the following behavior:
library(data.table)
foo.data.table <- function(d) {
d[,A:=4]
d$B <- 1
d[,C:=1]
return(d)
}
set.seed(1)
dt <- data.table(A=rnorm(5),B=runif(5),C=rnorm(5))
dt
# A B C
# 1: -0.6264538 0.2059746 -0.005767173
# 2: 0.1836433 0.1765568 2.404653389
# 3: -0.8356286 0.6870228 0.763593461
# 4: 1.5952808 0.3841037 -0.799009249
# 5: 0.3295078 0.7698414 -1.147657009
result <- foo.data.table(dt)
dt
# A B C
# 1: 4 0.2059746 -0.005767173
# 2: 4 0.1765568 2.404653389
# 3: 4 0.6870228 0.763593461
# 4: 4 0.3841037 -0.799009249
# 5: 4 0.7698414 -1.147657009
result
# A B C
# 1: 4 1 1
# 2: 4 1 1
# 3: 4 1 1
# 4: 4 1 1
# 5: 4 1 1
So, evidently, dt is passed by reference to foo.data.table(...) and the first statement, d[,A:=4], modifies it by reference, changing column A in dt.
The second statement, d$B <- 1, forces the creation of a copy of d (also named d) scoped internal to the function. Then the third statement, d[,C:=1], modifies that copy by reference (but does not affect dt), and return(d) then returns the copy.
If you change the order of the second and third statements, the effect of the function call on dt is different.
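To make the last remark concrete, here is a sketch with the second and third statements swapped. The function name foo_swapped is made up; everything else follows the example above.

```r
library(data.table)

# Hypothetical variant: both := statements run before the copying assignment
foo_swapped <- function(d) {
  d[, A := 4]   # by reference: changes the caller's table
  d[, C := 1]   # by reference: still the caller's table
  d$B <- 1      # base-style assignment: from here on d is a copy
  return(d)
}

set.seed(1)
dt <- data.table(A = rnorm(5), B = runif(5), C = rnorm(5))
b_before <- dt$B

result <- foo_swapped(dt)
dt      # A and C are overwritten, B keeps its original values
result  # all three columns are overwritten
```

So mixing := with base-style assignment inside a function makes the side effects on the caller's table depend on statement order, which is a good reason to stick to one style.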

Resources