How to generalize union() to take N arguments? - r

How can I append/ push data into union dynamically?
For instance, I have 4 data sets to merge,
mydata <- union(data1, data2, data3, data4)
But sometimes I have fewer than 4 and sometimes more.
Any ideas on how I can solve this problem?

Make some reproducible data:
#dummy data
data1 <- data.frame(x=letters[1:3])
data2 <- data.frame(x=letters[2:4])
data3 <- data.frame(x=letters[5:7])
We can use rbind with unique in a string then evaluate:
#get list of data frames to merge, update pattern as needed
data_names <- ls()[grepl("data\\d",ls())]
data_names <- paste(data_names,collapse=",")
#make command string
myUnion <- paste0("unique(rbind(",data_names,"))")
#evaluate
eval(parse(text=myUnion))
EDIT:
Here is a simpler, better way using do.call:
unique(do.call("rbind",lapply(objects(pattern="data\\d"),get)))

You could roll your own function like vunion defined below. Not sure if this actually works, my [R] got a bit stale ;)
Basically, you accept any number of arguments (hence ...) and treat them as if they were packed in a list. Take the first 2 items off that list, calculate their union, put the result back into the list, and repeat until only one union is left to compute.
vunion <- function(...){
  data <- list(...)
  n <- length(data)
  if(n == 0) return(NULL)               # nothing to merge
  if(n == 1) return(unique(data[[1]]))  # a single input is its own union
  if(n > 2){
    # merge the first two items, then recurse on the shortened list
    u <- list(union(data[[1]], data[[2]]))
    return(do.call(vunion, c(u, tail(data, -2))))
  }
  union(data[[1]], data[[2]])
}
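The same pairwise folding can also be written with base R's Reduce(), which applies a two-argument function across a list. This is only a minimal sketch, assuming the inputs are plain vectors (base union() treats its arguments as vectors; for data frames, the unique(do.call("rbind", ...)) approach above is probably the closer fit). The helper name union_all_of is made up for illustration.

```r
# Hypothetical helper: fold union() over any number of vectors.
union_all_of <- function(...) {
  sets <- list(...)
  if (length(sets) == 0) return(NULL)  # nothing to merge
  # Reduce(f, list(a, b, c)) computes f(f(a, b), c); the outer unique()
  # covers the case where only one vector was supplied.
  unique(Reduce(union, sets))
}

v1 <- c("a", "b", "c")
v2 <- c("b", "c", "d")
v3 <- c("e", "f", "g")

merged <- union_all_of(v1, v2, v3)
single <- union_all_of(c("a", "a", "b"))
```

Because the inputs arrive through ..., the call works unchanged whether there are two vectors or twenty; a list that already exists can be passed with do.call(union_all_of, my_list).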

Related

In R, how do I save space when running a function on 192 dataframes?

I have around 192 CSVs that I have converted to dataframes. I would like to be able to put the names of each dataframe in a vector and then run a for loop through the vector like so:
for (i in seq_along(vector)){
vector[i] <- f1(vector[i])
}
or just pass through the vector into the function like so: f1(vector).
If the vector is full of integers or strings, I can put the vector through a function and it will work fine. For example:
squared <- function(x) {
return(x*x)
}
This will work with a vector c(1,2,3,4,5) and return c(1,4,9,16,25). Otherwise, I have to make 124 lines of code for each function I want to do.
Your advice would be greatly appreciated, please.
I think the most Rtistic way of doing it would be to have all your dataframes in a list to start with. For instance,
df1 <- mtcars
df2 <- mtcars
df3 <- mtcars
frames <- grep('^df[0-9]+$', ls(), value = TRUE)
frame_list <- lapply(frames, get)
gets you there. Now you can apply whatever function you want to each dataframe in a lapply call. So, for instance, if you wanted to get all the squared values of mpg, you could write
frame_adj <- lapply(frame_list, function(x) x$mpg * x$mpg )
The above gives you all the squared values of mpg from the original dataframes, but does not keep the other columns. If you prefer to keep the other values, simply adjust your function to return the dataframe, e.g.
frames_with_squared_mpg <-
lapply(frame_list, function(x) {
x$mpg_sq <- x$mpg * x$mpg
return(x)})
will get you there.
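A variant of the same idea, sketched here under the assumption that the data frames follow a df1, df2, ... naming pattern as in the example above: mget() fetches several objects at once and returns a named list, so each element keeps the name of the variable it came from.

```r
df1 <- mtcars
df2 <- mtcars
df3 <- mtcars

# mget() looks up several objects by name and returns a *named* list.
# The pattern is an assumption about how the data frames are named.
frame_list <- mget(ls(pattern = "^df[0-9]+$"))

# Add a squared-mpg column to every data frame in the list.
frames_with_squared_mpg <- lapply(frame_list, function(x) {
  x$mpg_sq <- x$mpg^2
  x
})

names(frames_with_squared_mpg)
```

Keeping the names makes it easier to tell the results apart later, for example when writing each data frame back to its own file.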

How to do a complex edit of columns of all data frames in a list?

I have a list of 185 data frames called WaFramesNumeric. Each dataframe has several hundred columns and thousands of rows. I want to edit every data frame, so that it leaves all numeric columns as well as any non-numeric columns that I specify.
Using:
for(i in seq_along(WaFramesNumeric)) {
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][,sapply(WaFramesNumeric[[i]],is.numeric)]
}
successfully makes each dataframe contain only its numeric columns.
I've tried to amend this with lines to add specific columns. I have tried:
for (i in seq_along(WaFramesNumeric)) {
a <- WaFramesNumeric[[i]]$Device_Name
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][,sapply(WaFramesNumeric[[i]],is.numeric)]
cbind(WaFramesNumeric[[i]],a)
}
and in an attempt to call the column numbers of all integer columns as well as the specific ones and then combine based on that:
for (i in seq_along(WaFramesNumeric)) {
f <- which(sapply(WaFramesNumeric[[i]],is.numeric))
m <- match("Cost_Center",colnames(WaFramesNumeric[[i]]))
n <- match("Device_Name",colnames(WaFramesNumeric[[i]]))
combine <- c(f,m,n)
WaFramesNumeric[[i]][,i,combine]
}
These all return errors and I am stumped as to how I could do this. WaFramesNumeric is a copy of another list of dataframes (WaFramesNumeric <- WaFramesAll) and so I also tried adding the specific columns from the WaFramesAll but this was not successful.
I appreciate any advice you can give and I apologize if any of this is unclear.
You are mistakenly assuming that the value of the last command in a for loop is returned. It is not: since you never assigned the result of the cbind (or of the indexing of WaFramesNumeric...) to anything, it is silently discarded.
Additionally, you are over-indexing your data.frame in the third code block. First, it's using i within the data.frame, even though i is an index within the list of data.frames, not the frame itself. Second (perhaps caused by this), you are trying to index three dimensions of a 2D frame. Just change the last indexing from [,i,combine] to either [,combine] or [combine].
Third problem (though perhaps not seen yet) is that match will return NA if nothing is found. Indexing a frame with an NA returns an error (try mtcars[,NA] to see). I suggest that you replace match with grep: it returns integer(0) when nothing is found, which is what you want in this case. Note that grep does pattern matching, so anchor the pattern (e.g. "^Cost_Center$") if other column names contain the same text.
for (i in seq_along(WaFramesNumeric)) {
f <- which(sapply(WaFramesNumeric[[i]], is.numeric))
m <- grep("Cost_Center", colnames(WaFramesNumeric[[i]]))
n <- grep("Device_Name", colnames(WaFramesNumeric[[i]]))
combine <- c(f,m,n)
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][combine]
}
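The same selection can be sketched by column name rather than by position, which sidesteps both the NA from match and the partial matching of grep. This is an illustration on made-up data, not the asker's WaFramesNumeric; the frame below deliberately has no Cost_Center column.

```r
# Stand-in data: two identical frames; "Cost_Center" is deliberately absent.
df <- data.frame(a = rnorm(5), b = rnorm(5),
                 Device_Name = letters[1:5],
                 Other = LETTERS[1:5],
                 stringsAsFactors = FALSE)
frames <- list(df, df)

keep_names <- c("Cost_Center", "Device_Name")

kept <- lapply(frames, function(d) {
  numeric_cols <- names(d)[sapply(d, is.numeric)]
  # intersect() drops any requested name the frame does not have,
  # so a missing column is skipped instead of raising an error
  extra_cols <- intersect(keep_names, names(d))
  d[, union(numeric_cols, extra_cols), drop = FALSE]
})

names(kept[[1]])
```

lapply returns the modified frames as a new list, so nothing depends on remembering to assign inside a loop body.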
I'm not sure what you mean by "an attempt to call the column numbers of all integer columns...", but if you want to go through a list of data frames, select the columns that satisfy some function, and also keep columns with given names, you can do it like this:
df <- data.frame(a=rnorm(20), b=rnorm(20), c=letters[1:20], d=letters[1:20], stringsAsFactors = FALSE)
WaFramesNumeric <- rep(list(df), 2)
Selector <- function(data, select_func, select_names) {
select_func <- match.fun(select_func)
idx_names <- match(select_names, colnames(data))
idx_names <- idx_names[!is.na(idx_names)]
idx_func <- which(sapply(data, select_func))
idx <- unique(c(idx_func, idx_names))
return(data[, idx, drop = FALSE])
}
res <- lapply(X = WaFramesNumeric, FUN = Selector, select_names=c("c"), select_func = is.numeric)

R: Address objects deep inside lists with filter commands inside functions/loops (ExtremeBounds package)

I am using the ExtremeBounds package, which returns a multi-level list with (amongst others) dataframes at the lowest level. I run this package over several specifications and I would like to collect some columns of selected dataframes in these results. These should be collected by specification (spec1 and spec2 in the example below) and arranged in a list of dataframes. This list of dataframes can then be used for all kinds of things, for example to export the results of different specifications into different Excel sheets.
Here is some code which creates the problematic object (just run this code blindly, my problem only concerns how to deal with the kind of list it creates: eba_results):
library("ExtremeBounds")
Data <- data.frame(var1=rbinom(30,1,0.2),var2=rbinom(30,2,0.2),
var3=rnorm(30),var4=rnorm(30),var5=rnorm(30))
spec1 <- list(y=c("var1"),
freevars=c("var2"),
doubtvars=c("var3","var4"))
spec2 <- list(y=c("var1"),
freevars=c("var2"),
doubtvars=c("var3","var4","var5"))
indicators <- c("spec1","spec2")
ebaFun <- function(x){
eba <- eba(data=Data, y=x$y,
free=x$freevars,
doubtful=x$doubtvars,
reg.fun=glm, k=1, vif=7, draws=50, weights = "lri", family = binomial(logit))}
eba_results <- lapply(mget(indicators),ebaFun) #eba_results is the object in question
Manually I know how to access each element, for example:
eba_results$spec1$bounds$type #look at str(eba_results) to see the different levels
So "bounds" is a dataframe with identical column names for both spec1 and spec2. I would like to collect the following 5 columns from "bounds":
type, cdf.mu.normal, cdf.above.mu.normal, cdf.mu.generic, cdf.above.mu.generic
into one dataframe per spec. Manually this is simple but ugly:
collectedManually <-list(
manual_spec1 = data.frame(
type=eba_results$spec1$bounds$type,
cdf.mu.normal=eba_results$spec1$bounds$cdf.mu.normal,
cdf.above.mu.normal=eba_results$spec1$bounds$cdf.above.mu.normal,
cdf.mu.generic=eba_results$spec1$bounds$cdf.mu.generic,
cdf.above.mu.generic=eba_results$spec1$bounds$cdf.above.mu.generic),
manual_spec2= data.frame(
type=eba_results$spec2$bounds$type,
cdf.mu.normal=eba_results$spec2$bounds$cdf.mu.normal,
cdf.above.mu.normal=eba_results$spec2$bounds$cdf.above.mu.normal,
cdf.mu.generic=eba_results$spec2$bounds$cdf.mu.generic,
cdf.above.mu.generic=eba_results$spec2$bounds$cdf.above.mu.generic))
But I have more than 2 specifications and I think this should be possible with lapply functions in a prettier way. Any help would be appreciated!
p.s.: A generic example to which hrbrmstr's answer applies but which turned out to be too simplistic:
exampleList = list(a=list(aa=data.frame(A=rnorm(10),B=rnorm(10)),bb=data.frame(A=rnorm(10),B=rnorm(10))),
b=list(aa=data.frame(A=rnorm(10),B=rnorm(10)),bb=data.frame(A=rnorm(10),B=rnorm(10))))
and I want to have an object which collects, for example, all the A and B vectors into two data frames (each with its respective A and B) which are then a list of data frames. Manually this would look like:
dfa <- data.frame(A=exampleList$a$aa$A,B=exampleList$a$aa$B)
dfb <- data.frame(A=exampleList$b$aa$A,B=exampleList$b$aa$B)
collectedResults <- list(a=dfa, b=dfb)
There's probably a less brute-force way to do this.
If you want lists of individual columns this is one way:
get_col <- function(my_list, col_name) {
unlist(lapply(my_list, function(x) {
lapply(x, function(y) { y[, col_name] })
}), recursive=FALSE)
}
get_col(exampleList, "A")
get_col(exampleList, "B")
If you want a consolidated data.frame of indicator columns this is one way:
collect_indicators <- function(my_list, indicators) {
lapply(my_list, function(x) {
do.call(rbind, c(lapply(x, function(y) { y[, indicators] }), make.row.names=FALSE))
})[[1]]
}
collect_indicators(exampleList, c("A", "B"))
If you just want to bring the individual data.frames up a level to make it easier to iterate over to write to a file:
unlist(exampleList, recursive=FALSE)
This makes several assumptions about the desired output format (the question was a bit vague).
There is a brute force way which works but is dependent on several named objects:
collectEBA <- function(x){
df <- paste0("eba_results$",x,"$bounds")
df <- eval(parse(text=df))[,c("type",
"cdf.mu.normal","cdf.above.mu.normal",
"cdf.mu.generic","cdf.above.mu.generic")]
df[is.na(df)] <- "NA"
df
}
eba_export <- lapply(indicators,collectEBA)
names(eba_export) <- indicators
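The eval(parse()) step can probably be avoided by iterating over the results list itself. Below is a minimal sketch on a stand-in object: fake_results and make_fake_result are made up here to mimic the shape of eba_results (a named list whose elements each contain a bounds data frame), and the values in them are placeholders, not real ExtremeBounds output.

```r
# Stand-in for eba_results: each element holds a "bounds" data frame.
# The real object would come from ExtremeBounds::eba().
make_fake_result <- function(n) {
  list(bounds = data.frame(
    type = rep("doubtful", n),          # placeholder values
    cdf.mu.normal = runif(n),
    cdf.above.mu.normal = runif(n),
    cdf.mu.generic = runif(n),
    cdf.above.mu.generic = runif(n),
    unused = seq_len(n)                 # a column we do not want to keep
  ))
}
fake_results <- list(spec1 = make_fake_result(2),
                     spec2 = make_fake_result(3))

wanted <- c("type", "cdf.mu.normal", "cdf.above.mu.normal",
            "cdf.mu.generic", "cdf.above.mu.generic")

# lapply() over a named list keeps the names, so the result is already
# a list of data frames named by specification.
eba_export <- lapply(fake_results, function(res) res$bounds[, wanted])
```

With the real object, the last line would be applied to eba_results instead of fake_results; the helper no longer depends on a global object name being pasted into a string.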

How to convert a List to Data Frame, but removing the internal List structure?

It sounds simple, but I am having many problems trying to convert a list to a data frame.
I did it with the as.data.frame function and it works, but when I inspect the result with str, the columns are still lists. I would like to select a specific column and work with it.
Is there an easy way to convert a list to a data frame with a proper data frame structure?
I have also tried unlisting my list into a matrix, but I lose the colnames and rownames and have to set them again manually.
For example, this is my list, and I would like to use and plot the mystats$p.value column:
library(gtools)
x <- rnorm(100, sd=1)
y <- rnorm(100, sd =2)
mystats <- t(running(x, y, fun = cor.test, width=5, by=5))
Thanks
If and only if it's a list of data.frames, you can use do.call:
al <- split(airquality, airquality$Month)
sapply(al, class)
same.airquality <- do.call(rbind, al)
Here the list elements all have the same columns. For a list that instead splits different variables across its elements (each with the same number of rows), you can use
do.call(cbind, another.list)
Finally (not tested), you could try the abind package with this approach.
EDIT
After the example provided, I understand your setting a little better: you should sanitize the call to cor.test a bit, because running makes a mess of its output (currently you are trying to put a list, a complex data structure, into a matrix-like object)
foobar <- function(x,y) {
my.test <- cor.test(x,y)
## look at names(my.test) or ?cor.test for
## the components you can extract
c(my.test$statistic, my.test$p.value, my.test$conf.int)
}
## running() returns a matrix; convert it to a data frame
mystats <- as.data.frame(t(running(x, y, fun = foobar, width=5, by=5)))
names(mystats) <- c("statistic", "p.value", "low.ci", "up.ci")
mystats$p.value
If you have multiple objects like this one, e.g.
mystats$row <- row.names(mystats)
mystats$rep <- 1
row.names(mystats) <- NULL
mystats2 <- mystats
mystats2$rep <- 2
asd <- list(mystats, mystats2)
foo <- do.call("rbind", asd )
foo
foo$p.value
HTH
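For comparison, the extract-then-stack pattern can be sketched in base R alone, without gtools. This assumes that width=5, by=5 corresponds to non-overlapping windows of five observations (1-5, 6-10, ...), which is my reading of the call above rather than something checked against gtools; the start column is added only for illustration.

```r
set.seed(1)
x <- rnorm(100, sd = 1)
y <- rnorm(100, sd = 2)

# Assumed window layout: non-overlapping blocks of 5 observations.
starts <- seq(1, length(x) - 4, by = 5)

rows <- lapply(starts, function(s) {
  idx <- s:(s + 4)
  ct <- cor.test(x[idx], y[idx])
  # Pull out plain numbers so every column is an ordinary atomic vector.
  data.frame(start = s,
             statistic = unname(ct$statistic),
             p.value = ct$p.value,
             low.ci = ct$conf.int[1],
             up.ci = ct$conf.int[2])
})

# Stack the one-row data frames into a single data frame.
mystats <- do.call(rbind, rows)
str(mystats$p.value)
```

Because each window already yields a one-row data frame, str(mystats) shows numeric columns rather than list columns, and mystats$p.value can be plotted directly.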

Using R: colnames() in for loop to change sequential datasets

I have multiple data frames named y1 to y13 - one column each. They all have a column name that I would like to change to "Date.Code". I've tried the following in a for loop:
for(i in 1:13){
colnames(get(paste("y", i, sep=""))) <- c("Date.Code")
}
That didn't work.
I also tried:
for(i in 1:13){
assign(("Date.Code"), colnames(get(paste("y", i, sep=""))))
}
Also with no luck.
Any suggestions?
Thanks,
E
The difficulty here is that you cannot use get with an assignment operator directly,
e.g. get(nm) <- value will not work. You can use assign, as you're trying, but in a slightly different fashion.
Assuming cn is the column number whose name you would like to change:
for(i in 1:13){
nm <- paste0("y", i)
tmp <- get(nm)
colnames(tmp)[cn] <- c("Date.Code")
assign(nm, tmp)
}
That being said, a cleaner way of approaching this would be to collect all of your data frames into a single list; then you can easily use lapply to operate on them. For example:
# collect all of your data.frames into a single list.
df.list <- lapply(seq(13), function(i) get(paste0("y", i)))
# now you can easily change column names. Note the `x` at the end of the function which serves
# as a return of the value. It then gets assigned back to an object `df.list`
df.list <-
lapply(df.list, function(x) {colnames(x)[cn] <- "Date.Code"; x})
Lastly, search these boards for [r] data.table and you will see many options for changing values by reference and setting attributes more directly.
Here is a one-liner solution:
list2env(lapply(mget(ls(pattern='y[0-9]+')),
function(x) setNames(x,"Date.Code")),.GlobalEnv)
Of course, it is better to keep all your variables in the same list.
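A sketch of what keeping everything in one list could look like, using three made-up single-column data frames in place of y1 to y13:

```r
# Made-up stand-ins for the single-column data frames y1..y13.
y1 <- data.frame(a = 1:3)
y2 <- data.frame(b = 4:6)
y3 <- data.frame(c = 7:9)

# Gather the frames into one named list.
y_list <- mget(paste0("y", 1:3))

# setNames(x, "Date.Code") returns x with its (only) column renamed.
y_list <- lapply(y_list, setNames, "Date.Code")

sapply(y_list, names)
```

From here on the frames are addressed as y_list$y1, y_list[["y2"]], and so on, and no get/assign round trip is needed for later changes.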
