interweave two data.frames in R

I would like to interweave two data.frames in R. For example:
a = data.frame(x=1:5, y=5:1)
b = data.frame(x=2:6, y=4:0)
I would like the result to look like:
> x y
1 5
2 4
2 4
3 3
3 3
...
obtained by rbinding a[1,] with b[1,], a[2,] with b[2,], etc.
What is the cleanest way to do this? Right now my solution involves spitting everything out to a list and merging. This is pretty ugly:
lst = lapply(seq_len(nrow(a)), function(i) rbind(a[i,], b[i,]))
res = do.call(rbind, lst)

There is, of course, the interleave function in the "gdata" package:
library(gdata)
interleave(a, b)
# x y
# 1 1 5
# 6 2 4
# 2 2 4
# 7 3 3
# 3 3 3
# 8 4 2
# 4 4 2
# 9 5 1
# 5 5 1
# 10 6 0

You can do this by giving a and b an index, rbinding them, and sorting by the index.
a = data.frame(x=1:5, y=5:1)
b = data.frame(x=2:6, y=4:0)
df <- rbind(data.frame(a, index = 1:nrow(a)), data.frame(b, index = 1:nrow(b)))
df <- df[order(df$index), c("x", "y")]
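The index idea above can be wrapped in a small helper. This is only a sketch: the name interleave_rows is hypothetical, and it assumes both data frames have the same columns and the same number of rows. It relies on order() leaving ties in their original order.

```r
# Hypothetical helper: interleave the rows of two data frames that share
# the same columns and the same number of rows.
interleave_rows <- function(a, b) {
  stopifnot(nrow(a) == nrow(b), identical(names(a), names(b)))
  combined <- rbind(a, b)
  # Each row carries its position within its own data frame. Ordering on
  # that position (ties keep their original order) gives rows
  # 1, n + 1, 2, n + 2, ... of the combined data frame.
  idx <- order(c(seq_len(nrow(a)), seq_len(nrow(b))))
  out <- combined[idx, , drop = FALSE]
  rownames(out) <- NULL
  out
}

a <- data.frame(x = 1:5, y = 5:1)
b <- data.frame(x = 2:6, y = 4:0)
res <- interleave_rows(a, b)
```

Resetting the row names is a design choice; drop that line if you want to keep track of which source row each result row came from.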

This is how I'd approach it:
dat <- do.call(rbind.data.frame, list(a, b))
dat[order(dat$x), ]
do.call was unnecessary in the first step but makes the solution more extendable.
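One caveat worth checking before using this: ordering by x sorts on the values, not on row position, so it matches a positional interleave only when the x values happen to sort that way (as they do in the question's example). A small sketch with made-up data (a2 and b2 are hypothetical) shows the difference:

```r
# Made-up data where x is not already in interleaving order.
a2 <- data.frame(x = c(3, 1, 2), y = c("a1", "a2", "a3"), stringsAsFactors = FALSE)
b2 <- data.frame(x = c(9, 8, 7), y = c("b1", "b2", "b3"), stringsAsFactors = FALSE)

dat <- do.call(rbind.data.frame, list(a2, b2))

# Sorting by the values of x reorders the rows by value.
by_x <- dat[order(dat$x), ]

# Sorting by within-frame row position interleaves a2 and b2 row by row.
by_pos <- dat[order(c(seq_len(nrow(a2)), seq_len(nrow(b2)))), ]
```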

Perhaps this is cheating a bit, but the (non-exported) function interleave from ggplot2 is something I've stolen for my own uses before:
as.data.frame(mapply(FUN=ggplot2:::interleave,a,b))

Related

Add new column to data.frame through loop in R

I have n data.frames and I would like to add a column to each of them:
a <- data.frame(1:4,5:8)
b <- data.frame(1:4, 5:8)
test=ls()
for (j in test){
j = cbind(get(j),IssueType=j)
}
The problem I'm running into is the line
j = cbind(get(j),IssueType=j)
because it assigns all the data to j instead of to a and b.

As commented, it's mostly better to keep related data in a list structure. If you already have the data.frames in your global environment and you want to get them into a list, you can use:
dflist <- Filter(is.data.frame, as.list(.GlobalEnv))
This makes sure that you only get data.frame objects from your global environment.
You will notice that you now already have a named list:
> dflist
# $a
# X1.4 X5.8
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
#
# $b
# X1.4 X5.8
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
So you can easily select the data you want by typing for example
dflist[["a"]]
If you still want to create extra columns, you could do it like this:
dflist <- Map(function(df, x) {df$IssueType <- x; df}, dflist, names(dflist))
Now, each data.frame in dflist has a new column called IssueType:
> dflist
# $a
# X1.4 X5.8 IssueType
# 1 1 5 a
# 2 2 6 a
# 3 3 7 a
# 4 4 8 a
#
# $b
# X1.4 X5.8 IssueType
# 1 1 5 b
# 2 2 6 b
# 3 3 7 b
# 4 4 8 b
In the future, you can create the data inside a list from the beginning, i.e.
dflist <- list(
a = data.frame(1:4,5:8),
b = data.frame(1:4, 5:8)
)

To create a list of your data.frames do this:
a <- data.frame(1:4,5:8); b <- data.frame(1:4, 5:8); test <- list(a = a, b = b)
This allows you to use the lapply function to do whatever you like with each of the data frames, e.g.:
out <- lapply(names(test), function(nm) cbind(test[[nm]], IssueType = nm))
For most data.frame operations I recommend using the packages dplyr and tidyr.

Here is the answer to the issue, with help from @docendo discimus.
Create the data frames:
a <- data.frame(1:4,5:8)
b <- data.frame(1:4, 5:8)
Group the data.frames into a list:
dflist <- Filter(is.data.frame, as.list(.GlobalEnv))
Add the extra column:
dflist <- Map(function(df, x) {df$IssueType <- x; df}, dflist, names(dflist))
Unpack the list back into data frames in the global environment:
list2env(dflist ,.GlobalEnv)
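The same round trip can be tried out without touching the global environment. The sketch below assumes a scratch environment (env, created here only for illustration) in place of .GlobalEnv; the non-data.frame object is there to show that Filter leaves it alone.

```r
# Scratch environment standing in for .GlobalEnv (an assumption for this sketch).
env <- new.env()
assign("a", data.frame(1:4, 5:8), envir = env)
assign("b", data.frame(1:4, 5:8), envir = env)
assign("not_a_df", 42, envir = env)

# Collect only the data frames into a named list.
dflist <- Filter(is.data.frame, as.list(env))

# Add a column holding each data frame's own name.
dflist <- Map(function(df, nm) { df$IssueType <- nm; df }, dflist, names(dflist))

# Write the modified data frames back under their original names.
list2env(dflist, envir = env)
```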

R Dataframe comparison which, scaling bad

The idea is to extract the position of each character in a df by reference to another vector, for example:
L<-LETTERS[1:25]
A<-c(1:25)
df<-data.frame(L,A)
Compare<-c(LETTERS[sample(1:25, 25)])
df[] <- lapply(df, as.character)
for (i in 1:nrow(df)){
df[i,1]<-which(df[i,1]==Compare)
}
head(df)
L A
1 14 1
2 12 2
3 2 3
This works well but scales very badly, like all for loops. Any ideas using apply or dplyr?
Thanks

Just use match
Your data (use set.seed when providing data using sample)
df <- data.frame(L = LETTERS[1:25], A = 1:25)
set.seed(1)
Compare <- LETTERS[sample(1:25, 25)]
Solution
df$L <- match(df$L, Compare)
head(df)
# L A
# 1 10 1
# 2 23 2
# 3 12 3
# 4 11 4
# 5 5 5
# 6 21 6
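To see that match does the same job as the loop, here is a small check under the assumption that every letter appears exactly once in Compare (the seed is arbitrary). Note that match returns NA for values that are not found and only the first position for duplicates, whereas which would return zero or several positions in those cases.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
Compare <- sample(LETTERS[1:25])
L <- LETTERS[1:25]

# Loop version, as in the question: one which() call per element.
loop_pos <- integer(length(L))
for (i in seq_along(L)) {
  loop_pos[i] <- which(L[i] == Compare)
}

# Vectorised version: a single call to match().
match_pos <- match(L, Compare)
```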

R:How to get name of element in lapply function?

Suppose I have a list of data.frames:
list <- list(A=data.frame(x=c(1,2),y=c(3,4)), B=data.frame(x=c(1,2),y=c(7,8)))
I want to combine them into one data.frame like this:
data.frame(x=c(1,2,1,2), y=c(3,4,7,8), group=c("A","A","B","B"))
x y group
1 1 3 A
2 2 4 A
3 1 7 B
4 2 8 B
I can do in this way:
add_group_name <- function(df, group) {
df$group <- group
df
}
Reduce(rbind, mapply(add_group_name, list, names(list), SIMPLIFY=FALSE))
But I want to know if it's possible to get the name inside the lapply loop without the use of names(list), just like this:
add_group_name <- function(df) {
df$group <- ? #How to get the name of df in the list here?
}
Reduce(rbind, lapply(list, add_group_name))

I renamed list to listy to remove the clash with the base function. This is a variation on Señor O's answer in essence:
do.call(rbind, Map("[<-", listy, TRUE, "group", names(listy) ) )
# x y group
#A.1 1 3 A
#A.2 2 4 A
#B.1 1 7 B
#B.2 2 8 B
This is also very similar to a previous question and answer here: r function/loop to add column and value to multiple dataframes
The inner Map part gives this result:
Map("[<-", listy, TRUE, "group", names(listy) )
#$A
# x y group
#1 1 3 A
#2 2 4 A
#
#$B
# x y group
#1 1 7 B
#2 2 8 B
...which in long form, for explanation's sake, could be written like:
Map(function(data, nms) {data[TRUE,"group"] <- nms; data;}, listy, names(listy) )
As @flodel suggests, you could also use R's built-in transform function for updating data frames, which may be simpler again:
do.call(rbind, Map(transform, listy, group = names(listy)) )

I think a much easier approach is:
> do.call(rbind, lapply(names(list), function(x) data.frame(list[[x]], group = x)))
x y group
1 1 3 A
2 2 4 A
3 1 7 B
4 2 8 B
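lapply only passes each element's value to the function, not its name, which is why the approaches here iterate over names or pass the names as a second argument. Another variant, sketched below with the list renamed to listy (as in the other answer) to avoid masking base::list, iterates over positions instead:

```r
listy <- list(A = data.frame(x = c(1, 2), y = c(3, 4)),
              B = data.frame(x = c(1, 2), y = c(7, 8)))

# Iterate over positions so that both the element and its name are available.
tagged <- lapply(seq_along(listy), function(i) {
  df <- listy[[i]]
  df$group <- names(listy)[i]
  df
})

combined <- do.call(rbind, tagged)
```

Because tagged is an unnamed list, the combined row names are plain 1 to 4 rather than A.1, A.2, and so on.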

Using plyr (here ll is the list of data.frames from the question):
library(plyr)
ldply(ll)
.id x y
1 A 1 3
2 A 2 4
3 B 1 7
4 B 2 8
Or in 2 steps :
xx <- do.call(rbind,ll)
xx$group <- sub('([A-Z]).*','\\1',rownames(xx))
xx
x y group
A.1 1 3 A
A.2 2 4 A
B.1 1 7 B
B.2 2 8 B
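The regular expression in the two-step version assumes single capital-letter names. A sketch that avoids parsing row names, assuming only that ll is a named list of data frames, repeats each name once per row:

```r
ll <- list(A = data.frame(x = c(1, 2), y = c(3, 4)),
           B = data.frame(x = c(1, 2), y = c(7, 8)))

xx <- do.call(rbind, ll)

# Repeat each list name as many times as its data frame has rows.
xx$group <- rep(names(ll), times = vapply(ll, nrow, integer(1)))
rownames(xx) <- NULL
```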

Efficient way to apply function to each row of data frame and return list of data frames

I have a function that takes a number of arguments and returns a data frame. I also have a data frame where each row contains the arguments that I'd like to pass to my function, and I'd like to store the resulting set of data frames in a list. What's an efficient way to do this? (I'm assuming it's some apply like method.)
For example, suppose you have the (meaningless) function
myfunc<-function(dfRow){
return(data.frame(x=dfRow$x:dfRow$y,y=mean(dfRow$x,dfRow$y)))
}
and the data frame
df<-data.frame(x=1:3,y=4:6)
df
x y
1 1 4
2 2 5
3 3 6
You can run
myfunc(df[1,])
x y
1 1 1
2 2 1
3 3 1
4 4 1
but how would you run myfunc for each row of the data frame and store the results in a list? I know how to do a basic for loop for this, but I'm looking for something that will run faster - something vectorized.

Your "meaningless" function needs to have some meaning for apply to be able to work. For starters, you won't be able to use $ since apply will see each row as a basic named vector.
Keeping that in mind, here's a re-write (along with a more *mean*ingful mean):
myfunc <- function(dfRow) {
data.frame(x = dfRow[1]:dfRow[2], y = mean(c(dfRow[1], dfRow[2])))
}
or even:
myfunc <- function(dfRow) {
data.frame(x = dfRow["x"]:dfRow["y"], y = mean(c(dfRow["x"], dfRow["y"])))
}
Here's what we get from apply with MARGIN = 1 (which is to apply the function by row):
apply(df, 1, myfunc)
# [[1]]
# x y
# 1 1 2.5
# 2 2 2.5
# 3 3 2.5
# 4 4 2.5
#
# [[2]]
# x y
# 1 2 3.5
# 2 3 3.5
# 3 4 3.5
# 4 5 3.5
#
# [[3]]
# x y
# 1 3 4.5
# 2 4 4.5
# 3 5 4.5
# 4 6 4.5
Also, don't always be too quick to write off for loops. apply is optimized, but basically hides a for loop somewhere in there.
Here are some speed comparisons:
## Function to use with `apply`
myfunc <- function(dfRow) {
data.frame(x = dfRow["x"]:dfRow["y"], y = mean(c(dfRow["x"], dfRow["y"])))
}
## Function to use with `lapply`
myfunc1<-function(dfRow){
return(data.frame(x=dfRow$x:dfRow$y,y=mean(dfRow$x,dfRow$y)))
}
## Sample data
set.seed(1)
df <- data.frame(x = sample(100, 100, TRUE),
y = sample(100, 100, TRUE))
Here are the functions to evaluate:
fun1 <- function() apply(df, 1, myfunc)
fun2 <- function() {
listargs <- split(df,1:nrow(df))
lapply(listargs, myfunc1)
}
fun3 <- function() {
out <- vector("list", nrow(df))
for (i in 1:nrow(df)) {
out[[i]] <- data.frame(x = df$x[i]:df$y[i], y = mean(c(df$x[i], df$y[i])))
}
out
}
And here are the results:
library(microbenchmark)
microbenchmark(fun1(), fun2(), fun3(), times = 20)
# Unit: milliseconds
# expr min lq median uq max neval
# fun1() 39.72704 39.99255 40.84243 43.77641 48.16284 20
# fun2() 74.92324 79.20913 82.15130 83.12488 100.51695 20
# fun3() 48.61772 49.59304 50.16654 56.17891 88.65290 20
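Another base R option, not benchmarked here, is to walk over the columns in parallel with Map. Unlike apply, this does not convert the data frame to a matrix first, so columns of different types are not coerced to a common type. A minimal sketch, where make_range_df is a hypothetical name for the corrected function:

```r
df <- data.frame(x = 1:3, y = 4:6)

# Hypothetical helper: takes the two values of one row as separate arguments.
make_range_df <- function(x, y) {
  data.frame(x = x:y, y = mean(c(x, y)))
}

# Map pairs up df$x[i] with df$y[i] and returns a list with one
# data frame per row of df.
res <- Map(make_range_df, df$x, df$y)
```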

If you want a list of answers, why not pass a list of arguments? First split your data frame into a list, then lapply your function:
listargs <- split(df,1:nrow(df))
lapply(listargs,myfunc)
$`1`
x y
1 1 1
2 2 1
3 3 1
4 4 1
$`2`
x y
1 2 2
2 3 2
3 4 2
4 5 2
$`3`
x y
1 3 3
2 4 3
3 5 3
4 6 3
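The y column in this output equals x from each row rather than the average of x and y. That comes from the original myfunc calling mean with two separate arguments: the second one is matched to mean's trim parameter instead of being averaged. A short illustration:

```r
# The second positional argument of mean() is `trim`, not more data.
wrong <- mean(1, 4)

# Combine the values into one vector to average them.
right <- mean(c(1, 4))
```

Here wrong is 1 and right is 2.5.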

If you're willing to use an external package, here's an approach using data.table, with a simplified version of your function:
require(data.table) ## 1.9.2+
fA <- function(x, y) {
data.frame(x = x:y, y = y:x)
}
dt = as.data.table(df)
result1 = dt[, list(ans = list(fA(x, y))), by=seq_len(nrow(dt))]
# seq_len ans
# 1: 1 <data.frame>
# 2: 2 <data.frame>
# 3: 3 <data.frame>
We create a data.table first, then group dt by row using by=. For each row, we pass the corresponding x and y to the fA function and wrap the result in a list. result1$ans then gives the desired result.
If you insist on not passing individual objects, then you can do:
require(data.table) ## 1.9.2+
fB <- function(dat) {
data.frame(x = dat$x:dat$y, y = dat$y:dat$x)
}
dt = as.data.table(df)
result2 = dt[, list(ans = list(fB(.SD))), by=seq_len(nrow(dt))]
# seq_len ans
# 1: 1 <data.frame>
# 2: 2 <data.frame>
# 3: 3 <data.frame>
Here, we instead pass .SD (Subset of Data), a special variable that carries the data belonging to each group, to the function fB. Once again, result2$ans should give you your answer.
HTH
Oh and BTW, it's okay to use spaces in your code; doesn't cost much :).

Complete.cases used on list of data frames

I'm trying to remove all the NA values from a list of data frames. The only way I have got it to work is by cleaning the data with complete.cases in a for loop. Is there another way of doing this with lapply? I have been trying for a while to no avail. Here is the code that works.
I start with
data_in <- lapply (file_name,read.csv)
Then have:
clean_data <- list()
for (i in seq_along(id)) {
clean_data[[i]] <- data_in[[i]][complete.cases(data_in[[i]]), ]
}
But what I tried to get to work was using lapply all the way like this.
comp <- lapply(data_in, complete.cases)
clean_data <- lapply(data_in, data_in[[id]][comp,])
Which returns this error: "Error in [.default(xj, i) : invalid subscript type 'list'"
What I'd like to know is some alternatives, or whether I was going about this right. And why didn't the last example work?
Thank you so much for your time. Have a nice day.

I'm not sure what you expected with
clean_data <- lapply(data_in, data_in[[id]][comp,])
The second parameter to lapply should be a proper function to which each member of the data_in list will be passed one at a time. Your expression data_in[[id]][comp,] is not a function. I'm not sure where you expected id to come from, but lapply does not create magic variables for you like that. Also, at this point comp is now a list itself of indices. You are making no attempt to iterate over this list in sync with your data_in list. If you wanted to do it in two separate steps, a more appropriate approach would be
comp <- lapply(data_in, complete.cases)
clean_data <- Map(function(d,c) {d[c,]}, data_in, comp)
Here we use Map to iterate over the data_in and comp lists simultaneously. They each get passed in to the function as a parameter and we can do the proper extraction that way. Otherwise, if we wanted to do it in one step, we could do
clean_data <- lapply(data_in, function(x) x[complete.cases(x),])
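Both variants can be checked on a small in-memory list. The data below is made up for illustration and stands in for the result of reading the CSV files:

```r
# Made-up stand-in for the list produced by lapply(file_name, read.csv).
data_in <- list(
  first  = data.frame(id = 1:3, value = c(10, NA, 30)),
  second = data.frame(id = c(1, NA), value = c(5, 6))
)

# One step: the anonymous function receives one data frame at a time.
clean_one_step <- lapply(data_in, function(x) x[complete.cases(x), ])

# Two steps: compute the logical vectors first, then walk both lists in parallel.
comp <- lapply(data_in, complete.cases)
clean_two_step <- Map(function(d, keep) d[keep, ], data_in, comp)
```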

Welcome to SO; please provide some working code next time.
Here is how I would do it with na.omit (since complete.cases only returns a logical vector):
(dat.l <- list(dat1 = data.frame(x = 1:2, y = c(1, NA)),
dat2 = data.frame(x = 1:3, y = c(1, NA, 3))))
# $dat1
# x y
# 1 1 1
# 2 2 NA
#
# $dat2
# x y
# 1 1 1
# 2 2 NA
# 3 3 3
Map(na.omit, dat.l)
# $dat1
# x y
# 1 1 1
#
# $dat2
# x y
# 1 1 1
# 3 3 3

Do you mean like the below?
> lst
$a
a
1 1
2 2
3 NA
4 3
5 4
$b
b
1 1
2 NA
3 2
4 3
5 4
$d
d e
1 NA 1
2 NA 2
3 3 3
4 4 NA
5 5 NA
> f <- function(x) x[complete.cases(x),]
> lapply(lst, f)
$a
[1] 1 2 3 4
$b
[1] 1 2 3 4
$d
d e
3 3 3
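In the output above, $a and $b come back as plain vectors rather than data frames, because subsetting a single-column data frame by rows drops it to a vector by default. If data frames are wanted throughout, one option is to pass drop = FALSE. A sketch with made-up data:

```r
lst <- list(a = data.frame(a = c(1, 2, NA, 3, 4)))

# Default subsetting: a single-column result is dropped to a vector.
dropped <- lapply(lst, function(x) x[complete.cases(x), ])

# drop = FALSE keeps the data frame structure.
kept <- lapply(lst, function(x) x[complete.cases(x), , drop = FALSE])
```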

file_name[complete.cases(file_name), ]
complete.cases() returns only a logical vector. For a single data frame (here called file_name), the line above should do the job, returning only the rows with no NA values.
