Creating special data objects in R

In some packages I see special types of objects. For example, I get the following message when I try to print a dataset from a package.
multitrait
This is an object of class "cross".
It is too complex to print, so we provide just this summary.
RI strains via selfing
No. individuals: 162
......................and other summary information
is (multitrait)
[1] "riself"
I wonder how we can create such an object. Are they special lists of data frames, matrices or vectors?
X <- c("A", "B", "C")
Y <- data.frame (A = 1:10, B = 21:30, C = 31:40)
myeq <- c("Y ~ X1 + Y1")
K <- 100
A <- 1:20
B <- 21:40
J <- as.matrix(A, B) # note: as.matrix() ignores B, so J is a 20 x 1 matrix of A
myl1 <- list(J, K)
Now my complex object:
mycomplexobject <- list(X, Y, myeq, K, J, myl1)
mycomplexobject
str(mycomplexobject)
List of 6
$ : chr [1:3] "A" "B" "C"
$ :'data.frame': 10 obs. of 3 variables:
..$ A: int [1:10] 1 2 3 4 5 6 7 8 9 10
..$ B: int [1:10] 21 22 23 24 25 26 27 28 29 30
..$ C: int [1:10] 31 32 33 34 35 36 37 38 39 40
$ : chr "Y ~ X1 + Y1"
$ : num 100
$ : int [1:20, 1] 1 2 3 4 5 6 7 8 9 10 ...
$ :List of 2
..$ : int [1:20, 1] 1 2 3 4 5 6 7 8 9 10 ...
..$ : num 100
is(mycomplexobject)
[1] "list" "vector"
Is there a way to make a special object that does not print the whole list, but instead shows a message like "it is too complex to print" followed by a summary?

Just set the class of your object and provide a print method.
class(mycomplexobject) <- c("too_complex", class(mycomplexobject))
print.too_complex <- function(x, ...) {
  cat("Complex object of length", length(x), "\n")
  invisible(x)
}
mycomplexobject
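A slightly fuller sketch of the same idea, closer to the "cross" message in the question. The constructor name `new_too_complex()` and the wording of the summary lines are assumptions made for illustration; they are not taken from the package that defines "cross".

```r
# Hypothetical constructor: wraps its arguments in a list and tags it
# with an S3 class so that print() can dispatch on it.
new_too_complex <- function(...) {
  structure(list(...), class = c("too_complex", "list"))
}

# print() dispatches here for objects of class "too_complex".
print.too_complex <- function(x, ...) {
  cat("This is an object of class \"too_complex\".\n")
  cat("It is too complex to print, so we provide just this summary.\n")
  cat("No. components:", length(x), "\n")
  invisible(x)  # print methods conventionally return their argument invisibly
}

obj <- new_too_complex(X = c("A", "B", "C"), K = 100, Y = data.frame(A = 1:3))
obj                # auto-printing calls print.too_complex()
str(unclass(obj))  # the full contents are still available
```

Because the object is still a list underneath, `obj$K` and `obj[["Y"]]` keep working; only the printed representation changes.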

Related

How do I write an S3 extract method for [ in R?

I have a data frame called gfe_obj with structure as follows:
And I want to write an extract function such that when I run the code below, I get the corresponding output:
Currently, I have:
str(gfe_obj)
'[.gfe_obj' <- function(x,i) {
class(x) <- "gfe"
as.gfe_obj(x[i])
}
sub_gfe_obj <- gfe_obj[1:3]
str(sub_gfe_obj)
But when I run this code, I get Error in as.gfe_obj(x[i]) : could not find function "as.gfe_obj".
I based the method on this question: How to implement extracting/subsetting ([, [<-, [[, [[<-) functions for custom S3 classes?
Thank you for your help.
I'm not sure what the exact structure of your gfe class is supposed to be, but assuming it is a list consisting of two objects (a 3D array called frames and a data frame called info with the same number of rows as the third dimension of frames), then your S3 method would be:
`[.gfe` <- function(x, i) {
  x$frames <- x$frames[,,i]
  x$info <- x$info[i,]
  x
}
To test this, I need a mock class constructor and some dummy data:
gfe <- function(frames, info) {
  structure(list(frames = frames, info = info), class = "gfe")
}
gfe_obj <- gfe(frames = array(1:90, dim = c(3, 3, 10)),
               info = data.frame(x = 1:10, y = letters[1:10]))
str(gfe_obj)
#> List of 2
#> $ frames: int [1:3, 1:3, 1:10] 1 2 3 4 5 6 7 8 9 10 ...
#> $ info :'data.frame': 10 obs. of 2 variables:
#> ..$ x: int [1:10] 1 2 3 4 5 6 7 8 9 10
#> ..$ y: chr [1:10] "a" "b" "c" "d" ...
#> - attr(*, "class")= chr "gfe"
Now we can see the extractor method works as expected:
sub_gfe_obj <- gfe_obj[2:3]
str(sub_gfe_obj)
#> List of 2
#> $ frames: int [1:3, 1:3, 1:2] 10 11 12 13 14 15 16 17 18 19 ...
#> $ info :'data.frame': 2 obs. of 2 variables:
#> ..$ x: int [1:2] 2 3
#> ..$ y: chr [1:2] "b" "c"
#> - attr(*, "class")= chr "gfe"
Created on 2022-09-25 with reprex v2.0.2
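One caveat worth checking: with a single index, `x$frames[,,i]` drops the third dimension and returns a plain 3 x 3 matrix. The variant below passes `drop = FALSE` so the subset keeps the same shape as the original. Whether the class should guarantee this is an assumption on my part, not something stated in the answer above.

```r
# Same mock constructor as in the answer.
gfe <- function(frames, info) {
  structure(list(frames = frames, info = info), class = "gfe")
}

# Variant of the extractor that keeps the 3D array and the data frame
# intact even when only one frame is selected.
`[.gfe` <- function(x, i) {
  x$frames <- x$frames[, , i, drop = FALSE]
  x$info <- x$info[i, , drop = FALSE]
  x
}

gfe_obj <- gfe(frames = array(1:90, dim = c(3, 3, 10)),
               info = data.frame(x = 1:10, y = letters[1:10]))

one <- gfe_obj[4]
dim(one$frames)  # still three dimensions
nrow(one$info)   # one row
```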

Why does mutate change the variable type?

activity <- mutate(
activity, steps = ifelse(is.na(steps), lookup_mean(interval), steps))
The "steps" variable changes from an int to a list. I want it to stay an "int" so I can aggregate it (aggregate is failing because it is a list type).
Before:
> str(activity)
'data.frame': 17568 obs. of 3 variables:
$ steps : int NA NA NA NA NA NA NA NA NA NA ...
$ date : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
After:
> str(activity)
'data.frame': 17568 obs. of 3 variables:
$ steps :List of 17568
..$ : num 1.72
..$ : num 1.72
Lookup mean is defined here:
lookup_mean <- function(i) {
  return(filter(daily_activity_pattern, interval == 0) %>% select(steps))
}
The problem is that lookup_mean returns a one-column data frame, which is a kind of list, so the result of ifelse() becomes a list as well. lookup_mean should return the numeric column instead:
lookup_mean <- function(i) {
  interval <- filter(daily_activity_pattern, interval == 0) %>% select(steps)
  return(interval$steps)
}
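The effect can be reproduced in base R without dplyr. This is a minimal sketch with made-up values; `looked_up` is a hypothetical stand-in for what `filter() %>% select()` returns.

```r
steps <- c(NA, 2, NA)

# A one-column data frame, similar in shape to the result of
# filter(...) %>% select(steps).
looked_up <- data.frame(steps = 1.72)

bad  <- ifelse(is.na(steps), looked_up, steps)        # "yes" argument is a list
good <- ifelse(is.na(steps), looked_up$steps, steps)  # "yes" argument is numeric

is.list(bad)     # the whole result has become a list
is.numeric(good) # the result stays an atomic vector
```

Extracting the column with `$` (or `[[`) before handing it to `ifelse()` is what keeps the column atomic.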

Get named variable from list of list

This is probably trivial, but can someone help me with this?
I have been using apply to call a function, manydo, that returns a list of the form list("a" = a, "b" = b):
l <- apply(df, 1, manydo)
The output l appears to be a list of lists, because str(l) returns
List of 5
$ 1:List of 2
..$ a: Named num [1:36] 3.29 3.25 3.36 3.26 3.34 ...
.. ..- attr(*, "names")= chr [1:36] "V1" "V2" "V3" "V4" ...
..$ b: Named num [1:36] 0.659 0.65 0.672 0.652 0.669 ...
I tried to access it in several ways, such as
l[1][1]
l[1]['a']
unlist(l[1][1]['a'])
but nothing works. What I want is to get, for example, the 'a' variable of the first element.
In addition, if I call the function directly, say:
l <- manydo(c(1:36)) # I can access this
l['a'] # this works, so I'm confused ;(
Thanks!
[ returns a list containing the selected elements. [[ returns a single element (not wrapped in a list), so that's what you want to use here.
l <- list(list(a=1:10, b=10:22), list(), list(), list(), list())
str(l)
## List of 5
## $ :List of 2
## ..$ a: int [1:10] 1 2 3 4 5 6 7 8 9 10
## ..$ b: int [1:13] 10 11 12 13 14 15 16 17 18 19 ...
...
Now to retrieve a:
l[[1]][['a']]
[1] 1 2 3 4 5 6 7 8 9 10
l[[1]] is the list containing a. l[[1]][['a']] is the value of a itself.
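The same `[[` operator can be passed to lapply or sapply to pull 'a' out of every inner list at once. A small sketch with made-up data:

```r
l <- list(list(a = 1:3, b = 4:6), list(a = 7:9, b = 10:12))

first_a <- l[[1]][["a"]]         # same as l[[1]]$a
all_a   <- lapply(l, `[[`, "a")  # 'a' from every inner list, as a list
a_mat   <- sapply(l, `[[`, "a")  # equal lengths, so sapply() returns a matrix
```

By contrast, `l[1]` is still a list of length one, which is why `l[1]['a']` finds nothing named 'a' at that level.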

Replace integer(0) by NA

I have a function that I apply to a column, putting the results in another column, and it sometimes gives me integer(0) as output. So my output column will be something like:
45
64
integer(0)
78
How can I detect these integer(0)'s and replace them by NA? Is there something like is.na() that will detect them ?
Edit: Ok I think I have a reproducible example:
df1 <-data.frame(c("267119002","257051033",NA,"267098003","267099020","267047006"))
names(df1)[1]<-"ID"
df2 <-data.frame(c("257051033","267098003","267119002","267047006","267099020"))
names(df2)[1]<-"ID"
df2$vals <-c(11,22,33,44,55)
fetcher <-function(x){
y <- df2$vals[which(match(df2$ID,x)==TRUE)]
return(y)
}
sapply(df1$ID,function(x) fetcher(x))
The output from this sapply is the source of the problem.
> str(sapply(df1$ID,function(x) fetcher(x)))
List of 6
$ : num 33
$ : num 11
$ : num(0)
$ : num 22
$ : num 55
$ : num 44
I don't want this to be a list. I want a vector, and instead of num(0) I want NA (note that this toy data gives num(0); my real data gives integer(0)).
Here's a way to (a) replace integer(0) with NA and (b) transform the list into a vector.
# a regular data frame
> dat <- data.frame(x = 1:4)
# add a list including integer(0) as a column
> dat$col <- list(45,
+ 64,
+ integer(0),
+ 78)
> str(dat)
'data.frame': 4 obs. of 2 variables:
$ x : int 1 2 3 4
$ col:List of 4
..$ : num 45
..$ : num 64
..$ : int(0)
..$ : num 78
# find zero-length values
> idx <- !(sapply(dat$col, length))
# replace these values with NA
> dat$col[idx] <- NA
# transform list to vector
> dat$col <- unlist(dat$col)
# now the data frame contains vector columns only
> str(dat)
'data.frame': 4 obs. of 2 variables:
$ x : int 1 2 3 4
$ col: num 45 64 NA 78
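The same steps on a bare list, using lengths() as a shorthand for sapply(x, length). This is only a compact restatement of the approach above:

```r
col <- list(45, 64, integer(0), 78)

col[lengths(col) == 0] <- NA  # lengths() gives the length of each element
col_vec <- unlist(col)        # now every element has length 1
col_vec
```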
It is best to do that inside your function. I'll call it myFunctionForApply here, but it stands for your current function. Before you return, check the length and, if it is 0, return NA:
myFunctionForApply <- function(x, ...) {
  # Do your processing
  # Let's say it ends up in variable 'ret':
  if (length(ret) == 0)
    return(NA)
  return(ret)
}
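Applied to the example from the question, that guard might look like the sketch below. It assumes each ID occurs at most once in df2 (otherwise vapply() stops with an error), and it replaces the original which(match(...) == TRUE) test with a plain equality test. The last line shows a vectorised alternative with match(), which returns NA for unmatched IDs on its own.

```r
df1 <- data.frame(ID = c("267119002", "257051033", NA, "267098003",
                         "267099020", "267047006"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("257051033", "267098003", "267119002",
                         "267047006", "267099020"),
                  vals = c(11, 22, 33, 44, 55),
                  stringsAsFactors = FALSE)

# Guarded lookup: return NA when nothing matches.
fetcher <- function(x) {
  y <- df2$vals[which(df2$ID == x)]
  if (length(y) == 0) NA_real_ else y
}

# vapply() requires exactly one numeric value per ID, so the result
# is a plain numeric vector rather than a list.
res <- vapply(df1$ID, fetcher, numeric(1), USE.NAMES = FALSE)

# Vectorised alternative: match() yields NA for IDs not found in df2.
res2 <- df2$vals[match(df1$ID, df2$ID)]
```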

Generate sets for cross-validation

How can I automatically split a matrix in R for 5-fold cross-validation?
I actually want to generate the 5 sets of (test_matrix_indices, train_matrix_indices).
I suppose you want the matrix rows to be the cases to split. Then all you need is sample and split:
X <- matrix(rnorm(1000),ncol=5)
id <- sample(1:5,nrow(X),replace=TRUE)
ListX <- lapply(split(seq_len(nrow(X)), id), function(i) X[i, , drop = FALSE]) # gives you a list with the 5 matrices
X[id==2,] # gives you the second matrix
I'd work with the list, as it allows you to do something like :
names(ListX) <- c("Train1","Train2","Train3","Test1","Test2")
mean(ListX$Train3)
which makes for code that's easier to read, and keeps you from creating tons of matrices in your workspace. You're bound to mess up if you put the matrices individually in your workspace. Use lists!
In case you want the test matrix to be smaller or larger than the other ones, use the prob argument of sample :
id <- sample(1:5,nrow(X),replace=TRUE,prob=c(0.15,0.15,0.15,0.15,0.3))
gives you a test matrix that is, on average, double the size of the train matrices (sample normalises prob, so the weights need not sum to 1).
In case you want to determine the exact number of cases, sample and prob aren't the best options. You could use a trick like :
indices <- rep(1:5,c(100,20,20,20,40))
id <- sample(indices)
to get matrices with respectively 100, 20, ... and 40 cases.
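Putting the exact-size trick together with a row-wise split, as a sketch: the seed is arbitrary, and splitting row numbers first (rather than the matrix itself) is my choice so that each piece stays a matrix.

```r
set.seed(1)                          # arbitrary, only for reproducibility
X <- matrix(rnorm(1000), ncol = 5)   # 200 rows

sizes <- c(100, 20, 20, 20, 40)      # must sum to nrow(X)
id <- sample(rep(1:5, times = sizes))

# Split the row numbers by group, then pull those rows out of X.
ListX <- lapply(split(seq_len(nrow(X)), id),
                function(rows) X[rows, , drop = FALSE])

sapply(ListX, nrow)  # group sizes, in the order given by 'sizes'
```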
f_K_fold <- function(Nobs, K = 5) {
  rs <- runif(Nobs)
  id <- seq(Nobs)[order(rs)]
  k <- as.integer(Nobs * seq(1, K - 1) / K)
  k <- matrix(c(0, rep(k, each = 2), Nobs), ncol = 2, byrow = TRUE)
  k[, 1] <- k[, 1] + 1
  l <- lapply(seq.int(K), function(x, k, d)
    list(train = d[!(seq(d) %in% seq(k[x, 1], k[x, 2]))],
         test = d[seq(k[x, 1], k[x, 2])]), k = k, d = id)
  return(l)
}
Solution without split:
set.seed(7402313)
X <- matrix(rnorm(999), ncol=3)
k <- 5 # number of folds
# Generating random indices
id <- sample(rep(seq_len(k), length.out=nrow(X)))
table(id)
# 1 2 3 4 5
# 67 67 67 66 66
# lapply over them:
indicies <- lapply(seq_len(k), function(a) list(
test_matrix_indices = which(id==a),
train_matrix_indices = which(id!=a)
))
str(indicies)
# List of 5
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 12 13 14 17 18 20 23 28 41 45 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 4 5 6 7 8 9 10 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 4 19 31 36 47 53 58 67 83 89 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 5 6 7 8 9 10 11 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 5 8 9 30 32 35 37 56 59 60 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 4 6 7 10 11 12 13 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:66] 1 2 3 6 21 24 27 29 33 34 ...
# ..$ train_matrix_indices: int [1:267] 4 5 7 8 9 10 11 12 13 14 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:66] 7 10 11 15 16 22 25 26 40 42 ...
# ..$ train_matrix_indices: int [1:267] 1 2 3 4 5 6 8 9 12 13 ...
But you could return matrices too:
matrices <- lapply(seq_len(k), function(a) list(
test_matrix = X[id==a, ],
train_matrix = X[id!=a, ]
))
str(matrices)
# List of 5
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] -1.0132 -1.3657 -0.3495 0.6664 0.0762 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.484 0.682 ...
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] 0.484 0.418 -0.622 0.996 0.414 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.682 0.186 ...
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] 0.682 0.812 -1.111 -0.467 0.37 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.484 0.186 ...
# $ :List of 2
# ..$ test_matrix : num [1:66, 1:3] -0.65 0.797 0.689 0.186 -1.398 ...
# ..$ train_matrix: num [1:267, 1:3] 0.484 0.682 0.473 0.812 -1.111 ...
# $ :List of 2
# ..$ test_matrix : num [1:66, 1:3] 0.473 0.212 -2.175 -0.746 1.707 ...
# ..$ train_matrix: num [1:267, 1:3] -0.65 0.797 0.689 0.484 0.682 ...
Then you could use lapply to get results:
lapply(matrices, function(x) {
m <- build_model(x$train_matrix)
performance(m, x$test_matrix)
})
Edit: compare to Wojciech's solution:
f_K_fold <- function(Nobs, K=5){
id <- sample(rep(seq.int(K), length.out=Nobs))
l <- lapply(seq.int(K), function(x) list(
train = which(x!=id),
test = which(x==id)
))
return(l)
}
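A quick sanity check for a fold generator like this one is to confirm that the test sets partition the observations and never overlap with their own training sets. The sketch below repeats the function so it runs on its own; Nobs = 23 and the seed are arbitrary choices.

```r
f_K_fold <- function(Nobs, K = 5) {
  id <- sample(rep(seq.int(K), length.out = Nobs))
  lapply(seq.int(K), function(x) list(
    train = which(x != id),
    test  = which(x == id)
  ))
}

set.seed(42)  # arbitrary
folds <- f_K_fold(23, K = 5)

# Every observation should appear in exactly one test set.
all_test <- sort(unlist(lapply(folds, `[[`, "test")))
identical(all_test, 1:23)

# Train and test indices of the same fold should never overlap.
overlap <- sapply(folds, function(f) length(intersect(f$train, f$test)))
overlap
```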
Edit : Thanks for your answers.
I have found the following solution (http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/fr_Tanagra_Validation_Croisee_Suite.pdf) :
n <- nrow(mydata)
K <- 5
size <- n %/% K
set.seed(5)
rdm <- runif(n)
ranked <- rank(rdm)
block <- (ranked-1) %/% size+1
block <- as.factor(block)
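Then I use:
for (k in 1:K) {
  matrix_train <- mydata[block != k, ]
  matrix_test <- mydata[block == k, ]
  # [Algorithm sequence]
}
to generate the appropriate sets for each iteration.
However, this solution can leave individuals out of every test set: when n is not a multiple of K, the leftover individuals get a block number above K and are never used for testing. I do not recommend it.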
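A small check of the block formula makes the issue visible. The values n = 11 and K = 5 are chosen only for illustration, and `block2` is a hypothetical name for a balanced alternative.

```r
n <- 11
K <- 5
size <- n %/% K                      # integer division: 2

set.seed(5)
ranked <- rank(runif(n))
block <- (ranked - 1) %/% size + 1   # block labels run past K here

table(block)
sum(block > K)  # observations that never land in a test set

# A balanced assignment that always uses exactly the labels 1..K:
block2 <- sample(rep(seq_len(K), length.out = n))
table(block2)
```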
The code below does the trick without creating separate data frames or matrices; all you need to keep is an integer sequence, id, that stores the shuffled fold index of each row.
X <- read.csv('data.csv')
k <- 5 # number of folds
# one fold label per row; length.out also copes with nrow(X) not divisible by k
id <- sample(rep(seq_len(k), length.out = nrow(X))) # shuffled, without replacement
log_models <- new.env(hash = TRUE, parent = emptyenv())
for (i in 1:k){
train <- X[id != i,]
test <- X[id == i,]
# run algorithm, e.g. logistic regression
log_models[[as.character(i)]] <- glm(outcome~., family="binomial", data=train)
}
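To try the loop without a data.csv, here is a sketch on simulated data. The predictors `x1` and `x2`, the simulated `outcome`, and the 0.5 classification threshold are all assumptions for illustration. Fold labels are built with `rep(..., length.out = nrow(X))`, so the row count does not have to be a multiple of k.

```r
set.seed(123)  # arbitrary
n <- 103       # deliberately not a multiple of k
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
X$outcome <- rbinom(n, 1, plogis(0.8 * X$x1 - 0.5 * X$x2))

k <- 5
id <- sample(rep(seq_len(k), length.out = nrow(X)))  # fold label per row

acc <- numeric(k)
for (i in seq_len(k)) {
  train <- X[id != i, ]
  test  <- X[id == i, ]
  fit   <- glm(outcome ~ ., family = "binomial", data = train)
  p     <- predict(fit, newdata = test, type = "response")
  acc[i] <- mean((p > 0.5) == (test$outcome == 1))  # accuracy on held-out fold
}
mean(acc)  # cross-validated accuracy
```

Storing one number per fold in a plain vector is enough here; an environment or list is only needed if the fitted models themselves have to be kept.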
The sperrorest package provides this ability. You can choose between a random split (partition.cv()), a spatial split (partition.kmeans()), or a split based on factor levels (partition.factor.cv()). The latter is currently only available in the GitHub version.
Example:
library(sperrorest)
data(ecuador)
## non-spatial cross-validation:
resamp <- partition.cv(ecuador, nfold = 5, repetition = 1:1)
# first repetition, second fold, test set indices:
idx <- resamp[['1']][[2]]$test
# test sample used in this particular repetition and fold:
ecuador[idx , ]
If you have a spatial data set (with coords), you can also visualize your generated folds
# this may take some time...
plot(resamp, ecuador)
Cross-validation can then be performed using sperrorest() (sequential) or parsperrorest() (parallel).
