Retrieving the path to all data.frame class objects in rData files - r

I have multiple .rData files whose top level Global Environment variables are a mix of data.frames, lists, deeply nested lists. I know that many of the nested lists have within them data.frame types, but I'm having trouble retrieving the path to them.
I had a faced a similar problem before with another type of class using the following code
names(rapply(mget(ls(.GlobalEnv), envir=.GlobalEnv), length, classes="fluor.spectral.data", how="unlist"))
and while not the most elegant solution, it achieved what I needed and quickly. returning names like "Fluor.Spec.WA.M12.SC.13" which then allows me to manipulate the object after formating the '.' into '$'.
Can someone help me retrieve the path to all data.frame class types, nested or otherwise in highly variable .rData files? Thanks in advance

If you want to return all data.frames loaded in the global environment, either present as individual object or as element of a nested list, use rrapply in the rrapply-package (extension of base rrapply).
library(rrapply)
w <- data.frame(1)
x <- list(1, 2, 3)
y <- 5
z <- list(1, 2, list(1, df = data.frame(a = 1, b = 2)))
rrapply(as.list(.GlobalEnv), classes = "data.frame", how = "flatten")
#> $w
#> X1
#> 1 1
#>
#> $df
#> a b
#> 1 1 2
Setting classes = "data.frame" avoids recursion into data.frame columns (as base rapply would do), and how = "flatten" will return the collected data.frames as a flattened list.
NB: If you want to return the complete object paths to the found data.frames, set how = "prune" instead of how ="flatten":
rrapply(as.list(.GlobalEnv), classes = "data.frame", how = "prune")
#> $w
#> X1
#> 1 1
#>
#> $z
#> $z[[1]]
#> $z[[1]]$df
#> a b
#> 1 1 2
Edit: In order to also return data.frames present in slots of some S4-class, a possible way to extend the above call would be:
## define S4-class with a data.frame in "df" slot
userClass <- setClass("user", slots = c(df = "data.frame"))
v <- userClass(df = data.frame(user = 1))
rrapply(as.list(.GlobalEnv),
classes = c("data.frame", "user"),
f = function(x) {
if(class(x) == "user") {
slot(x, "df")
} else {
x
}
},
how = "flatten")
#> $v
#> user
#> 1 1
#>
#> $w
#> X1
#> 1 1
#>
#> $df
#> a b
#> 1 1 2
In this case, classes = c("data.frame", "user") will check for data.frames and S4-objects of class "user". The f function applied to the object, returns the object itself if it is a data.frame or the "df" slot if it is an S4-object.
Note that this code assumes that the S4-class name is known as well as the slot(s) which contain the data.frame objects.

Related

R - rbind a dataframe with a NULL

I am trying to understand an example from a textbook.
The example code is like this:
x <- cbind(x1,x2,x3)
z <- NULL
y <- rbind(z,x)
My question is, why did it rbind to a Null when the output seems same as just x?
This example may be to illustrate that rbind()ing an object to NULL just returns the non-NULL object. This is something you can make use of if you have code that may or may not return a data.frame/matrix/vector (and otherwise returns NULL). The case you show is fairly trivial, but consider this example:
results <- lapply(1:4, function(i) {
if (i %% 2 == 0) {
return(data.frame(a = i, b = i / 2))
} else {
return(NULL)
}
})
# a list of mixed results (some may be NULL)
results
#> [[1]]
#> NULL
#>
#> [[2]]
#> a b
#> 1 2 1
#>
#> [[3]]
#> NULL
#>
#> [[4]]
#> a b
#> 1 4 2
# get a data.frame of just the non-NULL rows
do.call('rbind', results)
#> a b
#> 1 2 1
#> 2 4 2
Here, we are going to iteratively apply rbind() to each element in the list resulting from a call to lapply().
The function we apply is arbitrary, but it has some internal logic that leads to a data.frame result, but in other cases returns NULL. Consider for example an API request that may or may not return data, or may or may not be successful if you are not connected to internet.
Since rbind() of an object with NULL just returns the original object, there is no additional handling or need to return dummy values for the cases that do not meet the condition.

why write.csv is not updating the changes on test$number_s

I am changing the values of a column of a data frame. Then, I am saving the file, supposedly with the changes, but not. What am I missing? Thanks,
test <- data.frame(name_s = c("x","y","z"), number_s = c(1,2,3))
lapply(1:length(test$number_s), function(x) {
test$number_s[x] <- test$number_s[[x]] + 1
})
write.csv(test,paste0("test ",format(Sys.time(),"%Y%m%d"),".csv"),
row.names = F)
that was oversimplified, the real deal is this one:
date_format_1 = "[0-9]-[:alpha:][:alpha:][:alpha:]"
date_format_2 = "[:alpha:][:alpha:][:alpha:]-[0-9][0-9]"
test <- data.frame(name_s = c("v","w","x","y","z"), event_text = c("Aug-89","7-May","9-Jun","4-Dec-2021","Feb-99"))
lapply(1:length(test$event_text), function(x) {
if (str_detect(test$event_text[[x]], paste0("\\b",date_format_1,"\\b")) == T){
test$event_text[x] <- paste0(str_sub(test$event_text[[x]],1,1), "/F",
which(month.abb %in% str_sub(
test$event_text[[x]], 3,5)))
} else if(str_detect(test$event_text[[x]], paste0("\\b", date_format_2,"\\b"))
== T) {
test$event_text[[x]] = paste0(which(month.abb %in% str_sub(
test$event_text[x],1,3)),"/F",str_sub(test$event_text[[x]],-2))
} else {
test$event_text[x] <- test$event_text[[x]]
}
})
write.csv(test,paste0("test ",format(Sys.time(),"%Y%m%d"),".csv"),
row.names = F)
Below I have written two calls to lapply that fix the issue you were having. The problem stems from the fact that R has scoped variables and so the value is changed within the function but the result is never returned or extracted from the function. As such I have demonstrated this by printing the dataframe after each of the lapply() calls below.
We can fix this in two ways. The first more correct version is to let lapply modify the exact vector directly by adding one to each value and returning x+1. (Note I have skipped curly braces and this will return the value from the next ppiece of code run, in this case x+1 alternatively you could write function(x) {return(x+1)} in that argument).
An alternate approach that will run slower but still use the indexing method is to use global assignment. <<- assigns the variable to the global scope/environment rather than the local scope of the function. (Note this code is run sequentially so the written call to this function is adding + 1 for the second time to the dataframe when shown below).
test <- data.frame(name_s = c("x","y","z"), number_s = c(1,2,3))
# Original Behaviour, doesn't work due to scoping issues
lapply(1:length(test$number_s), function(x) {
test$number_s[x] <- test$number_s[[x]] + 1
})
#> [[1]]
#> [1] 2
#>
#> [[2]]
#> [1] 3
#>
#> [[3]]
#> [1] 4
print(test)
#> name_s number_s
#> 1 x 1
#> 2 y 2
#> 3 z 3
# function that is syntactically and functionally correct
# instead of modifying the vector in the function scope the function returns the
# mutated vector which we then assign to the dataframe's vector
test$number_s <- lapply(test$number_s, function(x) x + 1)
print(test)
#> name_s number_s
#> 1 x 2
#> 2 y 3
#> 3 z 4
# function that is syntactically odd but functionally correct
# the function affects the values in the global scope, this works but is slower
# and is not best practice as it would be difficult to read
lapply(1:length(test$number_s), function(x) {
test$number_s[x] <<- test$number_s[[x]] + 1
})
#> [[1]]
#> [1] 3
#>
#> [[2]]
#> [1] 4
#>
#> [[3]]
#> [1] 5
print(test)
#> name_s number_s
#> 1 x 3
#> 2 y 4
#> 3 z 5
Created on 2021-07-23 by the reprex package (v2.0.0)

Add tag/label/attr/attribute to dataframe columns (variables)

I find quite strange the confusion that has been made in the field of attributes/tags/labels or whatever it has been called to columns in dataframe.
The question is very easy: I have a dataframe (let's call it StackedDataStd) with cases (rows) and variables (columns). Each column have a name and you can access it through many of different ways (imho having 10 ways to make the same task make everything more confused...)
colnames(stackedDataStd)
names(stackedDataStd)
labels(stackedDataStd)2
attr(stackedDataStd,'names')
attributes(stackedDataStd)$names
[1] "segN" "patN" "Elongation" "Flatness" "LeastAxisLength"
[6] "MajorAxisLength" "Maximum2DDiameterColumn" "Maximum2DDiameterRow" "Maximum2DDiameterSlice" "Maximum3DDiameter"
[11] "MeshVolume" "MinorAxisLength" "Sphericity" "SurfaceArea" "SurfaceVolumeRatio"
[16] "VoxelVolume"
when I remove a column the corresponding name is removed from the attributes.
Now I have many other tags to identify the types of used variables...let's make the example of bw tag (an integer number, in my case 1,2,4,8 or 16). Each column has his own bw and what I want is to select, let's say, all the columns with bw==1.
At the moment I made what I called a selector (selectorBw): a vector of length dim(stackedDataStd)2 with the bw values of each column.
By using stackedDataStd[,selectorBw == 1] I select the ones with bw==1.
But, every time I remove a column I have to remember to remove the corresponding position in the selector (and having ten selectors it start being a mess).
stackedDataStd[,-4]
selectorBw[-4]
I tried to add attributes in several ways:
attr(stackedDataStd,'bw') <- selectorBw
adds the attribute but it's not linked to the column like colnames. If I remove one column (stackedDataStd$segN <- c()) the attribute is not removed. If I remove with stackedDataStd <- stackedDataStd[,-1] the attribute disappears... (magic).
Here it has been suggested to assign the attribute to each element of the list:
for (i in seq_along(stackedDataStd)) { attr(stackedDataStd[[i]], "bw") <- selectorBw[i] }
but it can't be used with the whole dataframe, since it's not a dataframe attribute (i should cycle for each variable for every selection).
I have tried other ways, but i don't want to bother you with my attempts....
Do you have any suggestion? Maybe with Hmisc package? I don't want to have non-standard dataframe, if possibile.
If you want to build extra functionality into data frames using attributes but without specifying a new S3 class, you have to define that functionality somewhere else. It's really pretty easy to do by adding an attribute setter, an attribute getter, and a little subsetting function:
# Subsetter
ss <- function(df, bw) df[sapply(df, function(x) attr(x, "bw") == bw)]
# Gets attributes as vector
get_col_attr <- function(df) sapply(df, attr, "bw")
# Sets attributes to columns from a single vector
set_col_attr <- function(df, attrs)
{
as.data.frame(mapply(function(col, bw) {
attr(col, "bw") <- bw
col
}, df, attrs, SIMPLIFY = FALSE))
}
I think this works quite nicely. Suppose we create a little data frame with two numeric columns and two character columns, and we wish to give the columns the attributes c(1, 1, 2, 2). We can just do:
df <- data.frame(A = 1:5, B = 6:10, C = LETTERS[1:5], D = letters[1:5])
df <- set_col_attr(df, c(1, 1, 2, 2))
This still looks and behaves like a normal data frame:
df
#> A B C D
#> 1 1 6 A a
#> 2 2 7 B b
#> 3 3 8 C c
#> 4 4 9 D d
#> 5 5 10 E e
But we can see each column has a bw attribute:
get_col_attr(df)
#> A B C D
#> 1 1 2 2
And we can use this attribute to subset very easily:
ss(df, bw = 2)
#> C D
#> 1 A a
#> 2 B b
#> 3 C c
#> 4 D d
#> 5 E e
ss(df, bw = 1)
#> A B
#> 1 1 6
#> 2 2 7
#> 3 3 8
#> 4 4 9
#> 5 5 10
Crucially, if we subset the data frame, the attributes are also subsetted appropriately:
df2 <- df[, 2:3]
get_col_attr(df2)
#> B C
#> 1 2
Created on 2020-07-03 by the reprex package (v0.3.0)
As suggested by #Allan Cameron I wrote my own functions to manage the tags.
With a non negligible pain I did a small edit to his code to allow the explicit declaration of more tags and I added a field in the main dataframe with all the names of the tags.
If you have further suggestions on how to improve it, please feel free to answer me.
NB: subsetting removes the attribute 'tags' in the main df.
# Subsetter
ss <- function(df, tag, val) {
if(!is.data.frame(df) | dim(df)[2] < 1 | is.null(attr(df[[1]], tag))) {
data.frame()
} else {
df[sapply(df, function(x) attr(x, tag) == val)]
}
}
# Gets attributes as vector
get_col_attr <- function(df,tag = 'all') {
if(tag!='all') {
sapply(df, function(x) attr(x, tag))
} else {
as.data.frame(mapply(function(z) sapply(attr(df,'tags'), function(x) attr(z,x)), df, SIMPLIFY =F))
}
}
# Sets attributes to columns from a single vector
set_col_attr <- function(df,tag, attrs)
{
outdf <- as.data.frame(mapply(function(col, tagName, bw) {
attr(col, tagName) <- bw
col
}, df, tag, attrs, SIMPLIFY = FALSE))
attr(outdf,'tags') <- unique(c(attr(df,'tags'),tag))
outdf
}

Hidden objects in list

I am creating a custom object for a package and I want to have a list of two objects, but for one of those elements to be 'hidden'
For example:
l = list(data = data.frame(a = 1:3, b = 4:6), hidden = list(obj1 = 1, obj2 = 2))
When I interact with the list I want to only interact with the data element and the other be only accessed specifically.
So, if i typed l
> l
a b
1 1 4
2 2 5
3 3 6
Which I can manage with a custom print method. But I also want to be able to do
> l[,1]
[1] 1 2 3
Which I don't think is possible with a custom print method.
I don't have any specific requirements for how the other element should be accessed, but something 'R friendly' I guess.
Is there a different class I should be using or creating a new class? Any advice would be appreciated.
You could indeed define a custom class for your object. Let
class(l) <- "myclass"
Then you may define custom-specific methods for your functions of interest. For instance, in the case of l[, 1] we have
`[.myclass` <- function(x, ...) `[`(x[[1]], ...)
which takes this double list and then calls the usual [ function on the first list element:
l[, 1]
# [1] 1 2 3
The same can be done with other functions, such as print:
fun.myclass <- function(x, ...) fun(x[[1]], ...)
And you still can always access the second object in the usual way,
l$hidden
# $obj1
# [1] 1
#
# $obj2
# [1] 2
I think it would be cleaner for you to use attributes :
l <- list(data = data.frame(a = 1:3, b = 4:6),
hidden = list(obj1 = 1, obj2 = 2))
foo <- function(x){
attr(x$data,"hidden") <- x$hidden
x$data
}
l <- foo(l)
l
# a b
# 1 1 4
# 2 2 5
# 3 3 6
l[,1]
# [1] 1 2 3
attr(l,"hidden")
#
# [1] 1
#
#
# [1] 2
#

How to use custom function to add multiple new columns to dataframe?

I have a dataframe and I would like to use a custom function to add multiple new columns to that dataframe. These new columns will be some function of an existing column, but they require the use of a custom function.
I am currently trying to have my custom function return the results in a list, which I then parse into separate columns. This sometimes works by returning a vector of lists, but sometimes this returns a matrix, in which case I get an error like
Error in $<-.data.frame(*tmp*, "z", value = list(1, 2, 2, 3, 3, 4)) : replacement has 2 rows, data has 3
Below is a sample of what I am trying to do.
sample_func <- function(number)
{
list(w = number + 1, u = number + 2)
}
data = data.frame(x = c(1,2,3), y= c(5,6,7))
data$z = sapply(c(1,2,3),sample_func)
data$w = sapply(data$z,"[[","w")
data$u = sapply(data$z,"[[","u")
The function sapply automatically simplifies the result. In this case, you obtain a matrix. You can avoid this behaviour with the argument simplify = FALSE. But it's easier to use lapply because this function doesn't try to simplify the result.
The command
tmp <- lapply(c(1,2,3), sample_func)
returns a list of lists:
[[1]]
[[1]]$w
[1] 2
[[1]]$u
[1] 3
[[2]]
[[2]]$w
[1] 3
[[2]]$u
[1] 4
[[3]]
[[3]]$w
[1] 4
[[3]]$u
[1] 5
You can use the following command to add the new columns to your data frame:
cbind(data, do.call(rbind, tmp))
# x y w u
# 1 1 5 2 3
# 2 2 6 3 4
# 3 3 7 4 5
Update to address comment:
If possible, you can modify the function to return a data frame.
sample_func <- function(number)
{
data.frame(w = number + 1, u = number + 2)
}
tmp <- lapply(c(1,2,3), sample_func)
cbind(data, do.call(rbind, tmp))
The result will be a data frame with numeric columns.

Resources