I have the following two data.frames:
df1 <- data.frame(Var1=c(3,4,8,9),
Var2=c(11,32,1,7))
> df1
Var1 Var2
1 3 11
2 4 32
3 8 1
4 9 7
df2 <- data.frame(ID=c('A', 'B', 'C'),
ball=I(list(c("3","11", "12"), c("4","1"), c("9","32"))))
> df2
ID ball
1 A 3, 11, 12
2 B 4, 1
3 C 9, 32
Note that column ball in df2 is a list.
I want to select the ID in df2 with elements in column ball that match a row in df1.
The ideal output would look like this:
> df3
ID ball1 ball2
1 A 3 11
Does anyone have an idea how to do this efficiently? The original data consists of millions of rows in both data.frames.
A data.table solution would likely be much faster than this base R solution, but here is one possibility.
your data:
df1 <- data.frame(Var1=c(3,4,8,9),
Var2=c(11,32,1,7))
df2 <- data.frame(ID=c('A', 'B', 'C'),
ball=I(list(c("3","11", "12"), c("4","1"), c("9","32"))))
the process:
df2$ID <- as.character(df2$ID) # just in case ID is a factor
n <- nrow(df1) * nrow(df2) # upper bound on the number of matches, so df3 is big enough
df3 <- data.frame(ID = character(n),
Var1 = numeric(n), Var2 = numeric(n),
stringsAsFactors = FALSE) # to make sure we get the ID as a string
count = 0 # counter
for(i in 1:nrow(df1)){
for(j in 1:nrow(df2)){
if(all(df1[i,] %in% df2$ball[[j]])){
count = count + 1
df3$ID[count] <- df2$ID[j]
df3$Var1[count] <- df1$Var1[i]
df3$Var2[count] <- df1$Var2[i]
}
}
}
df3_final <- df3[df3$ID != "", ] # drop the unused rows, since we overestimated the size of df3
df3_final
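The nested loop above compares every row of df1 with every row of df2, which will not scale to millions of rows. As a rough sketch of a join-based alternative (this is plain base R merge, not the data.table solution mentioned above, and the names long, m1 and m2 are only illustrative), one could unnest the ball column and join twice:

```r
df1 <- data.frame(Var1 = c(3, 4, 8, 9),
                  Var2 = c(11, 32, 1, 7))
df2 <- data.frame(ID = c("A", "B", "C"),
                  ball = I(list(c("3", "11", "12"), c("4", "1"), c("9", "32"))),
                  stringsAsFactors = FALSE)

# One row per (ID, ball element), with ball converted to numeric
long <- data.frame(ID = rep(df2$ID, lengths(df2$ball)),
                   ball = as.numeric(unlist(df2$ball)),
                   stringsAsFactors = FALSE)

# Step 1: keep df1 rows whose Var1 occurs in some ID's ball
m1 <- merge(df1, long, by.x = "Var1", by.y = "ball")

# Step 2: of those, keep rows whose Var2 occurs in the same ID's ball
m2 <- merge(m1, long, by.x = c("ID", "Var2"), by.y = c("ID", "ball"))

df3 <- m2[, c("ID", "Var1", "Var2")]
df3
```

This assumes the elements of ball are numeric strings and that no value is repeated within a single ball vector; repeated values would produce duplicate rows, which unique() could remove.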
I have a dataset with thousands of rows and almost a hundred columns. Each row only contains unique elements, however, these elements may also be found in other rows.
Basically, I want to create two new columns in my data frame, one to store how many Unique and another to store how many Ambiguous elements there are in a given row but compared to the whole dataset.
Note there are NAs in the dataframe that should not be considered when counting unique and ambiguous elements.
df <- data.frame(
col1 = c('Ab', 'Cd', 'Ef', 'Gh', 'Ij'),
col2 = c('Ac', 'Ce', 'Eg', 'Gi', 'Ik'),
col3 = c('Acc', NA, 'Ab', 'Gef', 'Il'),
col4 = c(NA, NA, NA, 'Ce', 'Im')
)
In the dataframe created above, Ab is not unique, so in row 1 there are 2 unique and 1 ambiguous elements when compared to the whole dataset.
In my expected output, Unique in row 1 would be equal to 2, and Ambiguous = 1. In row five, it would be 4 and 0, respectively.
I've searched for possible solutions, but most only deal with unique or repeated elements within a particular row, or across multiple rows for a particular column. Any help would be greatly appreciated.
Another method avoiding some recalculations.
# First we get the duplicates to avoid recounting every time.
freqs <- table(as.matrix(df))
dupes <- names(freqs[freqs > 1])
# Check the values for (non-)duplication.
is_dupe <- rowSums(apply(df, 2, "%in%", dupes))
not_dupe <- rowSums(apply(df, 2, function(x) {!(x %in% dupes | is.na(x))}))
# Add the columns after we calculated the counts to avoid including them.
df$ambiguous <- is_dupe
df$unique <- not_dupe
df
# col1 col2 col3 col4 ambiguous unique
# 1 Ab Ac Acc <NA> 1 2
# 2 Cd Ce <NA> <NA> 1 1
# 3 Ef Eg Ab <NA> 1 2
# 4 Gh Gi Gef Ce 1 3
# 5 Ij Ik Il Im 0 4
How about something like this:
library(dplyr) # for %>% and mutate()
df <- data.frame(
col1 = c('Ab', 'Cd', 'Ef', 'Gh', 'Ij'),
col2 = c('Ac', 'Ce', 'Eg', 'Gi', 'Ik'),
col3 = c('Acc', NA, 'Ab', 'Gef', 'Il'),
col4 = c(NA, NA, NA, 'Ce', 'Im')
)
uvals <- avals <- rep(NA, nrow(df))
for(i in 1:nrow(df)){
other_vals <- na.omit(c(unique(as.matrix(df[-i,]))))
tmp <- na.omit(as.matrix(df)[i,]) %in% other_vals
uvals[i] <- sum(tmp == 0, na.rm=TRUE)
avals[i] <- sum(tmp == 1, na.rm=TRUE)
}
df <- df %>%
mutate(unique = uvals,
ambiguous = avals)
df
# col1 col2 col3 col4 unique ambiguous
# 1 Ab Ac Acc <NA> 2 1
# 2 Cd Ce <NA> <NA> 1 1
# 3 Ef Eg Ab <NA> 2 1
# 4 Gh Gi Gef Ce 3 1
# 5 Ij Ik Il Im 4 0
How do I swap one value with another in a column within a dataframe?
For example swap the 2's and 4's around in df1 to give df2:
df1 <- data.frame(col1 = c(1,2,1,4))
df2 <- data.frame(col1 = c(1,4,1,2))
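Simple solution using replace in base R. Note that replace indexes by position, not by value, so this works here only because the values 2 and 4 happen to sit at positions 2 and 4: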
df2 <- data.frame(col1 = replace(df1$col1, c(4,2), c(2,4)))
Output
col1
1 1
2 4
3 1
4 2
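If the values to swap are not located at matching positions, a value-based swap is needed instead. A minimal sketch in base R, where swap_values is a hypothetical helper name and not an existing function:

```r
df1 <- data.frame(col1 = c(1, 2, 1, 4))

# Swap every occurrence of `a` with `b` and vice versa.
swap_values <- function(x, a, b) {
  idx_a <- which(x == a)  # positions currently holding a
  idx_b <- which(x == b)  # positions currently holding b
  x[idx_a] <- b
  x[idx_b] <- a
  x
}

df2 <- data.frame(col1 = swap_values(df1$col1, 2, 4))
df2
```

Both index vectors are computed before any assignment, so values that were just swapped are not swapped back.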
We can try using case_when from the dplyr package for some switch functionality:
library(dplyr) # provides case_when
df2 <- df1
df2$col1 <- case_when(
df2$col1 == 2 ~ 4,
df2$col1 == 4 ~ 2,
TRUE ~ df2$col1
)
df2
col1
1 1
2 4
3 1
4 2
Data:
df1 <- data.frame(col1 = c(1,2,1,4))
You can also swap by position, by reassigning the row index for that column.
With the dataframe:
df1 <- data.frame(col1 = c("a","b","c","d"))
> df1
col1
1 a
2 b
3 c
4 d
we can:
df1[,1] <- df1[c(1,4,3,2),1]
to get
> df1
col1
1 a
2 d
3 c
4 b
I have several data frames that I want to combine by row. In the resulting single data frame, I want to create a new variable identifying which data set the observation came from.
# original data frames
df1 <- data.frame(x = c(1, 3), y = c(2, 4))
df2 <- data.frame(x = c(5, 7), y = c(6, 8))
# desired, combined data frame
df3 <- data.frame(x = c(1, 3, 5, 7), y = c(2, 4, 6, 8),
source = c("df1", "df1", "df2", "df2"))
# x y source
# 1 2 df1
# 3 4 df1
# 5 6 df2
# 7 8 df2
How can I achieve this?
Thanks in advance!
It's not exactly what you asked for, but it's pretty close. Put your objects in a named list and use do.call(rbind...)
> do.call(rbind, list(df1 = df1, df2 = df2))
x y
df1.1 1 2
df1.2 3 4
df2.1 5 6
df2.2 7 8
Notice that the row names now reflect the source data.frames.
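Since the row names carry the source, they can be turned into a regular column afterwards. A small sketch, assuming the row names have the name.number form shown above (the variable name combined is only illustrative):

```r
df1 <- data.frame(x = c(1, 3), y = c(2, 4))
df2 <- data.frame(x = c(5, 7), y = c(6, 8))

combined <- do.call(rbind, list(df1 = df1, df2 = df2))

# Strip the trailing ".<row number>" to recover the list name
combined$source <- sub("\\.[0-9]+$", "", rownames(combined))
rownames(combined) <- NULL  # optional: reset to 1..n
combined
```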
Update: Use cbind and rbind
Another option is to make a basic function like the following:
AppendMe <- function(dfNames) {
do.call(rbind, lapply(dfNames, function(x) {
cbind(get(x), source = x)
}))
}
This function then takes a character vector of the data.frame names that you want to "stack", as follows:
> AppendMe(c("df1", "df2"))
x y source
1 1 2 df1
2 3 4 df1
3 5 6 df2
4 7 8 df2
Update 2: Use combine from the "gdata" package
> library(gdata)
> combine(df1, df2)
x y source
1 1 2 df1
2 3 4 df1
3 5 6 df2
4 7 8 df2
Update 3: Use rbindlist from "data.table"
Another approach that can be used now is to use rbindlist from "data.table" and its idcol argument. With that, the approach could be:
> rbindlist(mget(ls(pattern = "df\\d+")), idcol = TRUE)
.id x y
1: df1 1 2
2: df1 3 4
3: df2 5 6
4: df2 7 8
Update 4: use map_df from "purrr"
Similar to rbindlist, you can also use map_df from "purrr" with I or c as the function to apply to each list element.
> mget(ls(pattern = "df\\d+")) %>% map_df(I, .id = "src")
Source: local data frame [4 x 3]
src x y
(chr) (int) (int)
1 df1 1 2
2 df1 3 4
3 df2 5 6
4 df2 7 8
Another approach using dplyr:
df1 <- data.frame(x = c(1,3), y = c(2,4))
df2 <- data.frame(x = c(5,7), y = c(6,8))
df3 <- dplyr::bind_rows(list(df1=df1, df2=df2), .id = 'source')
df3
Source: local data frame [4 x 3]
source x y
(chr) (dbl) (dbl)
1 df1 1 2
2 df1 3 4
3 df2 5 6
4 df2 7 8
I'm not sure if such a function already exists, but this seems to do the trick:
bindAndSource <- function(df1, df2) {
df1$source <- as.character(match.call())[[2]]
df2$source <- as.character(match.call())[[3]]
rbind(df1, df2)
}
results:
bindAndSource(df1, df2)
x y source
1 1 2 df1
2 3 4 df1
3 5 6 df2
4 7 8 df2
Caveat: This will not work in *apply-like calls
A blend of the other two answers:
df1 <- data.frame(x = 1:3,y = 1:3)
df2 <- data.frame(x = 4:6,y = 4:6)
> foo <- function(...){
args <- list(...)
result <- do.call(rbind,args)
result$source <- rep(as.character(match.call()[-1]),times = sapply(args,nrow))
result
}
> foo(df1,df2,df1)
x y source
1 1 1 df1
2 2 2 df1
3 3 3 df1
4 4 4 df2
5 5 5 df2
6 6 6 df2
7 1 1 df1
8 2 2 df1
9 3 3 df1
If you want to avoid the match.call business, you can always limit yourself to naming the function arguments (i.e. df1 = df1, df2 = df2) and using names(args) to access the names.
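For illustration, the names(args) variant might look roughly like this. It is a sketch only; bind_named is a made-up name, and the function assumes every argument is passed with a name:

```r
df1 <- data.frame(x = 1:3, y = 1:3)
df2 <- data.frame(x = 4:6, y = 4:6)

bind_named <- function(...) {
  args <- list(...)
  # Require every argument to be named, e.g. bind_named(df1 = df1)
  stopifnot(!is.null(names(args)), all(names(args) != ""))
  result <- do.call(rbind, unname(args))  # unname() keeps plain row names
  result$source <- rep(names(args), times = vapply(args, nrow, integer(1)))
  result
}

res <- bind_named(df1 = df1, df2 = df2)
res
```

Unlike the match.call version, this also works when the data frames are passed as expressions or from inside other functions, because the label comes from the argument name rather than from the call text.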
Another workaround for this one is using ldply in the plyr package...
df1 <- data.frame(x = c(1,3), y = c(2,4))
df2 <- data.frame(x = c(5,7), y = c(6,8))
library(plyr) # provides ldply
df_list <- list(df1 = df1, df2 = df2) # avoid calling a variable `list`, which masks base::list
df3 <- ldply(df_list)
df3
.id x y
df1 1 2
df1 3 4
df2 5 6
df2 7 8
Even though there are already some great answers here, I just wanted to add the one I have been using. It is base R, so it might be less limiting if you want to use it in a package, and it is a little faster than some of the other base R solutions.
dfs <- list(df1 = data.frame("x"=c(1,2), "y"=2),
df2 = data.frame("x"=c(2,4), "y"=4),
df3 = data.frame("x"=2, "y"=c(4,5,7)))
> microbenchmark(cbind(do.call(rbind,dfs),
rep(names(dfs), vapply(dfs, nrow, numeric(1)))), times = 1001)
Unit: microseconds
min lq mean median uq max neval
393.541 409.083 454.9913 433.422 453.657 6157.649 1001
The first part, do.call(rbind, dfs) binds the rows of data frames into a single data frame. The vapply(dfs, nrow, numeric(1)) finds how many rows each data frame has which is passed to rep in rep(names(dfs), vapply(dfs, nrow, numeric(1))) to repeat the name of the data frame once for each row of the data frame. cbind puts them all together.
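Outside of the benchmark, the same steps could be written out as follows. This is a sketch of the approach described above; the column name source, the helper n_rows and the row-name reset are my additions:

```r
dfs <- list(df1 = data.frame(x = c(1, 2), y = 2),
            df2 = data.frame(x = c(2, 4), y = 4),
            df3 = data.frame(x = 2, y = c(4, 5, 7)))

# Number of rows in each data frame
n_rows <- vapply(dfs, nrow, numeric(1))

# Bind all rows once, then attach the repeated names in a single cbind call
combined <- cbind(do.call(rbind, dfs),
                  source = rep(names(dfs), n_rows))
rownames(combined) <- NULL  # optional: drop the df1.1, df1.2, ... row names
combined
```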
This is similar to a previously posted solution, but about 2x faster.
> microbenchmark(do.call(rbind,
lapply(names(dfs), function(x) cbind(dfs[[x]], source = x))),
times = 1001)
Unit: microseconds
min lq mean median uq max neval
844.558 870.071 1034.182 896.464 1210.533 8867.858 1001
I am not 100% certain, but I believe the speedup is due to making a single call to cbind rather than one per data frame.
Here is one option using Map. First, I create a named list of dataframes. Then, I cbind the corresponding name to each dataframe as a source column. Next, I use unname to drop the list names, so that rbind does not prefix the row names with them. Finally, I rbind all the dataframes together.
# original data frames
df1 <- data.frame(x = c(1, 3), y = c(2, 4))
df2 <- data.frame(x = c(5, 7), y = c(6, 8))
df.list <- Hmisc::llist(df1, df2)
do.call(rbind, unname(Map(cbind, source = names(df.list), df.list)))
Output
source x y
1 df1 1 2
2 df1 3 4
3 df2 5 6
4 df2 7 8
I have a two-column data frame. The first column is a timestamp and the second column is some value. For example:
library(tidyverse)
set.seed(123)
data_df <- tibble(t = 1:15,
value = sample(letters, 15))
I have another data frame that specifies the ranges of timestamps that need to be updated and their corresponding values. For example:
criteria_df <- tibble(start = c(1, 3, 7),
end = c(2, 5, 10),
value = c('a', 'b', 'c')
)
This means that I need to mutate the value column in data_df so that its value from t=1 to t=2 is 'a', from t=3 to t=5 is 'b' and from t=7 to t=10 is 'c'.
What is the recommended way to do this in R?
The only way I could think of is to loop over each row in criteria_df and mutate the value column in data_df after filtering on the t column, like so:
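library(iterators)
library(foreach)
library(magrittr) # for the %<>% assignment pipe used below
row_iter <- iter(criteria_df, by = 'row') # assumed definition; the original post does not show it
foreach(row = row_iter, .combine = c) %do% {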
seg_start = row$start
seg_end = row$end
new_value = row$value
data_df %<>%
mutate(value = if_else(between(t, seg_start, seg_end),
new_value,
value))
NULL
}
We can use a two-step base R solution. First, for each t in data_df, we find the criteria_df value whose start to end range contains it. Then we replace the data_df value with that criteria_df value where there is a match, and keep it as it is otherwise.
inds <- sapply(data_df$t, function(x) criteria_df$value[x >= criteria_df$start
& x <= criteria_df$end])
data_df$value <- unlist(ifelse(lengths(inds) > 0, inds, data_df$value))
data_df
# t value
# <int> <chr>
# 1 1 a
# 2 2 a
# 3 3 b
# 4 4 b
# 5 5 b
# 6 6 a
# 7 7 c
# 8 8 c
# 9 9 c
#10 10 c
#11 11 p
#12 12 g
#13 13 r
#14 14 s
#15 15 b
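Because criteria_df usually has far fewer rows than data_df, a plain loop over the criteria rows with vectorised assignment is another option. A minimal sketch, using fixed uppercase letters instead of sample() so the result is reproducible (the variable in_range is only illustrative):

```r
data_df <- data.frame(t = 1:15,
                      value = LETTERS[1:15],
                      stringsAsFactors = FALSE)
criteria_df <- data.frame(start = c(1, 3, 7),
                          end = c(2, 5, 10),
                          value = c("a", "b", "c"),
                          stringsAsFactors = FALSE)

for (k in seq_len(nrow(criteria_df))) {
  # Logical mask of the timestamps that fall inside the k-th range
  in_range <- data_df$t >= criteria_df$start[k] & data_df$t <= criteria_df$end[k]
  data_df$value[in_range] <- criteria_df$value[k]
}
data_df$value
```

If ranges overlap, later rows of criteria_df overwrite earlier ones, so the order of the criteria matters in that case.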