Interweave two data.frames in R

I would like to interweave two data.frames in R. For example:
a = data.frame(x=1:5, y=5:1)
b = data.frame(x=2:6, y=4:0)
I would like the result to look like:
> x y
1 5
2 4
2 4
3 3
3 3
...
obtained by rbinding a[1,] with b[1,], a[2,] with b[2,], etc.
What is the cleanest way to do this? Right now my solution involves spitting everything out to a list and merging. This is pretty ugly:
lst = lapply(seq_len(nrow(a)), function(i) rbind(a[i,], b[i,]))
res = do.call(rbind, lst)

There is, of course, the interleave function in the "gdata" package:
library(gdata)
interleave(a, b)
# x y
# 1 1 5
# 6 2 4
# 2 2 4
# 7 3 3
# 3 3 3
# 8 4 2
# 4 4 2
# 9 5 1
# 5 5 1
# 10 6 0

You can do this by giving a and b an index column, rbinding them, and sorting by the index.
a = data.frame(x=1:5, y=5:1)
b = data.frame(x=2:6, y=4:0)
df <- rbind(data.frame(a, index = 1:nrow(a)), data.frame(b, index = 1:nrow(b)))
df <- df[order(df$index), c("x", "y")]
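The same idea can be wrapped in a small helper. This is only a minimal sketch and not part of the original answer: the name interleave_rows is hypothetical, and it assumes both data frames have the same columns. It relies on order() leaving tied values in their original order, so each row of a lands just before the matching row of b.

```r
# Hypothetical helper: interleave the rows of two data frames with identical columns.
interleave_rows <- function(a, b) {
  combined <- rbind(a, b)
  # Row i of `a` and row i of `b` share index i; ties keep their rbind order,
  # so rows of `a` come before the matching rows of `b`.
  idx <- c(seq_len(nrow(a)), seq_len(nrow(b)))
  out <- combined[order(idx), , drop = FALSE]
  rownames(out) <- NULL  # reset row names after reordering
  out
}

a <- data.frame(x = 1:5, y = 5:1)
b <- data.frame(x = 2:6, y = 4:0)
res <- interleave_rows(a, b)
head(res, 4)
```

If the two data frames have different numbers of rows, the surplus rows of the longer one simply end up at the bottom.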

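This is how I'd approach it:
dat <- do.call(rbind.data.frame, list(a, b))
dat[order(dat$x), ]
do.call was unnecessary in the first step but makes the solution more extendable. Note that sorting by x reproduces the desired output here only because of how the x values in a and b overlap; for arbitrary data, sort on an explicit row index instead.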

Perhaps this is cheating a bit, but the (non-exported) function interleave from ggplot2 is something I've stolen for my own uses before:
as.data.frame(mapply(FUN=ggplot2:::interleave,a,b))

Related

Add new column to data.frame through loop in R

I have n data.frames and I would like to add a column to each of them.
a <- data.frame(1:4,5:8)
b <- data.frame(1:4, 5:8)
test=ls()
for (j in test){
j = cbind(get(j),IssueType=j)
}
The problem I'm running into is this line:
j = cbind(get(j),IssueType=j)
because it assigns the result to j instead of to a and b.
As commented, it's mostly better to keep related data in a list structure. If you already have the data.frames in your global environment and you want to get them into a list, you can use:
dflist <- Filter(is.data.frame, as.list(.GlobalEnv))
This approach (borrowed from another answer) makes sure that you only get data.frame objects from your global environment.
You will notice that you now already have a named list:
> dflist
# $a
# X1.4 X5.8
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
#
# $b
# X1.4 X5.8
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
So you can easily select the data you want by typing for example
dflist[["a"]]
If you still want to create extra columns, you could do it like this:
dflist <- Map(function(df, x) {df$IssueType <- x; df}, dflist, names(dflist))
Now, each data.frame in dflist has a new column called IssueType:
> dflist
# $a
# X1.4 X5.8 IssueType
# 1 1 5 a
# 2 2 6 a
# 3 3 7 a
# 4 4 8 a
#
# $b
# X1.4 X5.8 IssueType
# 1 1 5 b
# 2 2 6 b
# 3 3 7 b
# 4 4 8 b
In the future, you can create the data inside a list from the beginning, i.e.
dflist <- list(
a = data.frame(1:4,5:8),
b = data.frame(1:4, 5:8)
)
To create a named list of your data.frames do this:
a <- data.frame(1:4,5:8); b <- data.frame(1:4, 5:8); test <- list(a = a, b = b)
This allows you to use the lapply function to perform whatever you like on each of the data frames, e.g.:
out <- lapply(names(test), function(j) cbind(test[[j]], IssueType = j))
For most data.frame operations I recommend using the packages dplyr and tidyr.
Here is the answer to the issue, with help from @docendo discimus.
Create the data frames:
a <- data.frame(1:4,5:8)
b <- data.frame(1:4, 5:8)
Group the data.frames into a list:
dflist <- Filter(is.data.frame, as.list(.GlobalEnv))
Add the extra column:
dflist <- Map(function(df, x) {df$IssueType <- x; df}, dflist, names(dflist))
Unlist the data frames back into the global environment:
list2env(dflist ,.GlobalEnv)
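For completeness, the loop from the question can also be made to work directly. It fails because `j = ...` only rebinds the loop variable j; to write back to the object whose name is stored in j, you need assign(). The sketch below is an illustration rather than a recommendation (the list-based approach above is usually cleaner), and it lists the object names explicitly instead of using ls(), which is an assumption made here so that unrelated objects are not touched.

```r
a <- data.frame(1:4, 5:8)
b <- data.frame(1:4, 5:8)

# Names of the data frames to modify (listed by hand for this sketch).
df_names <- c("a", "b")

for (nm in df_names) {
  # get() fetches the object called `nm`; assign() stores the result
  # back under the same name instead of overwriting the loop variable.
  assign(nm, cbind(get(nm), IssueType = nm))
}

names(a)
```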

R Dataframe comparison which, scaling bad

The idea is to extract the position of each character value in df within a reference vector, for example:
L<-LETTERS[1:25]
A<-c(1:25)
df<-data.frame(L,A)
Compare<-c(LETTERS[sample(1:25, 25)])
df[] <- lapply(df, as.character)
for (i in 1:nrow(df)){
df[i,1]<-which(df[i,1]==Compare)
}
head(df)
L A
1 14 1
2 12 2
3 2 3
This works well but scales very badly, like all for loops. Any ideas with apply or dplyr?
Thanks
Just use match
Your data (use set.seed when providing data using sample)
df <- data.frame(L = LETTERS[1:25], A = 1:25)
set.seed(1)
Compare <- LETTERS[sample(1:25, 25)]
Solution
df$L <- match(df$L, Compare)
head(df)
# L A
# 1 10 1
# 2 23 2
# 3 12 3
# 4 11 4
# 5 5 5
# 6 21 6
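To see that match does the same job as the loop, here is a small check. It is a sketch with its own seed (42 is an arbitrary choice, not from the answer) and it assumes, as in the question, that every value appears exactly once in Compare.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible
L <- LETTERS[1:25]
Compare <- sample(L)  # a random permutation of the same letters

# Element-by-element lookup, equivalent to the for loop in the question.
loop_result <- vapply(L, function(x) which(x == Compare),
                      integer(1), USE.NAMES = FALSE)

# Vectorised lookup: position of each element of L in Compare.
match_result <- match(L, Compare)

identical(loop_result, match_result)
```

Unlike which, match returns NA for values that are missing from Compare, so it also behaves sensibly when the reference vector is incomplete.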

R:How to get name of element in lapply function?

Suppose I have a list of data.frames:
list <- list(A=data.frame(x=c(1,2),y=c(3,4)), B=data.frame(x=c(1,2),y=c(7,8)))
I want to combine them into one data.frame like this:
data.frame(x=c(1,2,1,2), y=c(3,4,7,8), group=c("A","A","B","B"))
x y group
1 1 3 A
2 2 4 A
3 1 7 B
4 2 8 B
I can do in this way:
add_group_name <- function(df, group) {
df$group <- group
df
}
Reduce(rbind, mapply(add_group_name, list, names(list), SIMPLIFY=FALSE))
But I want to know if it's possible to get the name inside the lapply loop without the use of names(list), just like this:
add_group_name <- function(df) {
df$group <- ? #How to get the name of df in the list here?
}
Reduce(rbind, lapply(list, add_group_name))
I renamed list to listy to remove the clash with the base function. This is a variation on Señor O's answer in essence:
do.call(rbind, Map("[<-", listy, TRUE, "group", names(listy) ) )
# x y group
#A.1 1 3 A
#A.2 2 4 A
#B.1 1 7 B
#B.2 2 8 B
This is also very similar to a previous question and answer here: r function/loop to add column and value to multiple dataframes
The inner Map part gives this result:
Map("[<-", listy, TRUE, "group", names(listy) )
#$A
# x y group
#1 1 3 A
#2 2 4 A
#
#$B
# x y group
#1 1 7 B
#2 2 8 B
...which in long form, for explanation's sake, could be written like:
Map(function(data, nms) {data[TRUE,"group"] <- nms; data;}, listy, names(listy) )
As @flodel suggests, you could also use R's built-in transform function for updating dataframes, which may be simpler again:
do.call(rbind, Map(transform, listy, group = names(listy)) )
I think a much easier approach is:
> do.call(rbind, lapply(names(list), function(x) data.frame(list[[x]], group = x)))
x y group
1 1 3 A
2 2 4 A
3 1 7 B
4 2 8 B
Using plyr (with the list of data.frames stored as ll):
library(plyr)
ldply(ll)
.id x y
1 A 1 3
2 A 2 4
3 B 1 7
4 B 2 8
Or in 2 steps :
xx <- do.call(rbind,ll)
xx$group <- sub('([A-Z]).*','\\1',rownames(xx))
xx
x y group
A.1 1 3 A
A.2 2 4 A
B.1 1 7 B
B.2 2 8 B
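The two-step version depends on parsing the row names, which breaks down if the list names are not single capital letters. A possible variation, sketched here under the assumption that every list element is a data frame with the same columns, is to build the group column from the list names and the row counts instead:

```r
listy <- list(A = data.frame(x = c(1, 2), y = c(3, 4)),
              B = data.frame(x = c(1, 2), y = c(7, 8)))

combined <- do.call(rbind, listy)

# Repeat each list name once per row of the corresponding data frame.
combined$group <- rep(names(listy), times = vapply(listy, nrow, integer(1)))

rownames(combined) <- NULL  # drop the "A.1", "A.2", ... row names
combined
```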

Efficient way to apply function to each row of data frame and return list of data frames

I have a function that takes a number of arguments and returns a data frame. I also have a data frame where each row contains the arguments that I'd like to pass to my function, and I'd like to store the resulting set of data frames in a list. What's an efficient way to do this? (I'm assuming it's some apply like method.)
For example, suppose you have the (meaningless) function
myfunc<-function(dfRow){
return(data.frame(x=dfRow$x:dfRow$y,y=mean(dfRow$x,dfRow$y)))
}
and the data frame
df<-data.frame(x=1:3,y=4:6)
df
x y
1 1 4
2 2 5
3 3 6
You can run
myfunc(df[1,])
x y
1 1 1
2 2 1
3 3 1
4 4 1
but how would you run myfunc for each row of the data frame and store the results in a list? I know how to do a basic for loop for this, but I'm looking for something that will run faster - something vectorized.
Your "meaningless" function needs to have some meaning for apply to be able to work. For starters, you won't be able to use $ since apply will see each row as a basic named vector.
Keeping that in mind, here's a re-write (along with a more *mean*ingful mean):
myfunc <- function(dfRow) {
data.frame(x = dfRow[1]:dfRow[2], y = mean(c(dfRow[1], dfRow[2])))
}
or even:
myfunc <- function(dfRow) {
data.frame(x = dfRow["x"]:dfRow["y"], y = mean(c(dfRow["x"], dfRow["y"])))
}
Here's what we get from apply with MARGIN = 1 (which is to apply the function by row):
apply(df, 1, myfunc)
# [[1]]
# x y
# 1 1 2.5
# 2 2 2.5
# 3 3 2.5
# 4 4 2.5
#
# [[2]]
# x y
# 1 2 3.5
# 2 3 3.5
# 3 4 3.5
# 4 5 3.5
#
# [[3]]
# x y
# 1 3 4.5
# 2 4 4.5
# 3 5 4.5
# 4 6 4.5
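The reason $ does not work inside apply is that apply first converts the data frame to a matrix, so each row reaches the function as a plain vector. The following sketch (the data frame mixed is made up for illustration) shows one consequence: with mixed column types, every value in the row is coerced to character.

```r
# A made-up data frame with one numeric and one character column.
mixed <- data.frame(x = 1:2, label = c("a", "b"), stringsAsFactors = FALSE)

# Each row is handed to the function as an atomic vector, not a data frame.
row_is_df <- apply(mixed, 1, is.data.frame)

# Because a matrix holds a single type, the numeric column becomes character too.
row_classes <- apply(mixed, 1, class)

row_is_df
row_classes
```

This is worth keeping in mind before using apply on rows that contain anything other than numbers.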
Also, don't always be too quick to write off for loops. apply is optimized, but basically hides a for loop somewhere in there.
Here are some speed comparisons:
## Function to use with `apply`
myfunc <- function(dfRow) {
data.frame(x = dfRow["x"]:dfRow["y"], y = mean(c(dfRow["x"], dfRow["y"])))
}
## Function to use with `lapply`
myfunc1<-function(dfRow){
return(data.frame(x=dfRow$x:dfRow$y,y=mean(dfRow$x,dfRow$y)))
}
## Sample data
set.seed(1)
df <- data.frame(x = sample(100, 100, TRUE),
y = sample(100, 100, TRUE))
Here are the functions to evaluate:
fun1 <- function() apply(df, 1, myfunc)
fun2 <- function() {
listargs <- split(df,1:nrow(df))
lapply(listargs, myfunc1)
}
fun3 <- function() {
out <- vector("list", nrow(df))
for (i in 1:nrow(df)) {
out[[i]] <- data.frame(x = df$x[i]:df$y[i], y = mean(c(df$x[i], df$y[i])))
}
out
}
And here are the results:
library(microbenchmark)
microbenchmark(fun1(), fun2(), fun3(), times = 20)
# Unit: milliseconds
# expr min lq median uq max neval
# fun1() 39.72704 39.99255 40.84243 43.77641 48.16284 20
# fun2() 74.92324 79.20913 82.15130 83.12488 100.51695 20
# fun3() 48.61772 49.59304 50.16654 56.17891 88.65290 20
If you want a list of answers, why not pass a list of arguments? First split up your dataframe into a list, then lapply your function:
listargs <- split(df,1:nrow(df))
lapply(listargs,myfunc)
$`1`
x y
1 1 1
2 2 1
3 3 1
4 4 1
$`2`
x y
1 2 2
2 3 2
3 4 2
4 5 2
$`3`
x y
1 3 3
2 4 3
3 5 3
4 6 3
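Another option, not mentioned in the answers above, is to skip the row extraction entirely and iterate over the columns in parallel with Map. This is a minimal sketch assuming the function can be rewritten to take x and y as separate arguments; the name make_range_df is hypothetical, and it uses mean(c(x, y)) rather than the question's mean(x, y).

```r
df <- data.frame(x = 1:3, y = 4:6)

# Hypothetical rewrite of the question's function taking scalars instead of a row.
make_range_df <- function(x, y) {
  data.frame(x = x:y, y = mean(c(x, y)))
}

# Map walks along df$x and df$y together and returns a list of data frames.
res <- Map(make_range_df, df$x, df$y)

res[[1]]
```

Since no one-row data frames are built along the way, this avoids the overhead of split or repeated df[i, ] subsetting.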
If you're willing to use an external package, then here's one using data.table:
Here's a version by simplifying your function:
require(data.table) ## 1.9.2+
fA <- function(x, y) {
data.frame(x = x:y, y = y:x)
}
dt = as.data.table(df)
result1 = dt[, list(ans = list(fA(x, y))), by=seq_len(nrow(dt))]
# seq_len ans
# 1: 1 <data.frame>
# 2: 2 <data.frame>
# 3: 3 <data.frame>
We create a data.table first, then aggregate dt on each row using by=. On each row, we pass the corresponding x and y to the fA function and wrap the result in a list. Now just doing result1$ans gives the desired result.
If you insist on not passing individual objects, then you can do:
require(data.table) ## 1.9.2+
fB <- function(dat) {
data.frame(x = dat$x:dat$y, y = dat$y:dat$x)
}
dt = as.data.table(df)
result2 = dt[, list(ans = list(fB(.SD))), by=seq_len(nrow(dt))]
# seq_len ans
# 1: 1 <data.frame>
# 2: 2 <data.frame>
# 3: 3 <data.frame>
Here, we pass Subset of Data, .SD - a special variable, which carries the data that belongs to each group, to function fB instead. Once again doing result2$ans should get your answer.
HTH
Oh and BTW, it's okay to use spaces in your code; doesn't cost much :).

Complete.cases used on list of data frames

I'm trying to remove all the NA values from a list of data frames. The only way I have got it to work is by cleaning the data with complete.cases in a for loop. Is there another way of doing this with lapply as I had been trying for a while to no avail. Here is the code that works.
I start with
data_in <- lapply (file_name,read.csv)
Then have:
clean_data <- list()
for (i in seq_along(id)) {
clean_data[[i]] <- data_in[[i]][complete.cases(data_in[[i]]), ]
}
But what I tried to get to work was using lapply all the way like this.
comp <- lapply(data_in, complete.cases)
clean_data <- lapply(data_in, data_in[[id]][comp,])
Which returns this error "Error in [.default(xj, i) : invalid subscript type 'list' "
What I'd like to know is some alternatives, or whether I was going about this the right way. And why didn't the last example work?
Thank you so much for your time. Have a nice day.
I'm not sure what you expected with
clean_data <- lapply(data_in, data_in[[id]][comp,])
The second parameter to lapply should be a proper function to which each member of the data_in list will be passed one at a time. Your expression data_in[[id]][comp,] is not a function. I'm not sure where you expected id to come from, but lapply does not create magic variables for you like that. Also, at this point comp is now a list itself of indices. You are making no attempt to iterate over this list in sync with your data_in list. If you wanted to do it in two separate steps, a more appropriate approach would be
comp <- lapply(data_in, complete.cases)
clean_data <- Map(function(d,c) {d[c,]}, data_in, comp)
Here we use Map to iterate over the data_in and comp lists simultaneously. They each get passed in to the function as a parameter and we can do the proper extraction that way. Otherwise, if we wanted to do it in one step, we could do
clean_data <- lapply(data_in, function(x) x[complete.cases(x),])
welcome to SO, please provide some working code next time
here is how i would do it with na.omit (since complete.cases only returns a logical)
(dat.l <- list(dat1 = data.frame(x = 1:2, y = c(1, NA)),
dat2 = data.frame(x = 1:3, y = c(1, NA, 3))))
# $dat1
# x y
# 1 1 1
# 2 2 NA
#
# $dat2
# x y
# 1 1 1
# 2 2 NA
# 3 3 3
Map(na.omit, dat.l)
# $dat1
# x y
# 1 1 1
#
# $dat2
# x y
# 1 1 1
# 3 3 3
Do you mean like the below?
> lst
$a
a
1 1
2 2
3 NA
4 3
5 4
$b
b
1 1
2 NA
3 2
4 3
5 4
$d
d e
1 NA 1
2 NA 2
3 3 3
4 4 NA
5 5 NA
> f <- function(x) x[complete.cases(x),]
> lapply(lst, f)
$a
[1] 1 2 3 4
$b
[1] 1 2 3 4
$d
d e
3 3 3
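Note that in the output above the single-column data frames a and b come back as plain vectors, because `[` drops the data frame structure when only one column is left. If the results should stay data frames, adding drop = FALSE is one way to do it. The sketch below uses a shortened, made-up version of lst:

```r
# Made-up input: one single-column and one two-column data frame.
lst <- list(
  a = data.frame(a = c(1, 2, NA, 3, 4)),
  d = data.frame(d = c(NA, NA, 3, 4, 5), e = c(1, 2, 3, NA, NA))
)

# drop = FALSE keeps the result a data frame even when it has a single column.
clean <- lapply(lst, function(x) x[complete.cases(x), , drop = FALSE])

str(clean)
```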
file_name[complete.cases(file_name), ]
complete.cases() only returns a logical vector. Used as a row index like this, it keeps only the rows with no NA values (here file_name stands for a single data frame that has already been read in).

Resources