pulling up matching rows from a matrix using dplyr - r

Suppose I have the following:
myDF <- cbind.data.frame("Id" = rep(1:5, each = 4), values = c(rnorm(4,0,1), rnorm(4, 10, 1), rnorm(4, 20,1 ), rnorm(4, 30,1), rnorm(4, 40,1)))
idVector <- sample(1:5, size = 5, replace = TRUE)
If idVector is c(4, 4, 3, 2, 1), I want to pull all the rows with Id 4, then the rows with Id 4 again, then those with Id 3, 2 and 1, in that order.
I can do it using the following:
do.call("rbind", lapply(idVector, function(x, currentDF) {
  currentDF[currentDF$Id == x, ]
}, myDF))
Is there a neater way to do it using dplyr or plyr?

With dplyr
library(dplyr)
left_join(data.frame(Id = idVector), myDF, by = "Id")
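As a quick sanity check, here is a small sketch with made-up, deterministic data (the values and the names expected and joined are only for illustration) suggesting that the join returns the same rows, in the same order, as the do.call/rbind approach from the question:

```r
library(dplyr)

# Hypothetical data: three ids with two rows each
myDF <- data.frame(Id = rep(1:3, each = 2),
                   values = c(10, 11, 20, 21, 30, 31))
idVector <- c(3L, 3L, 1L)

# Approach from the question: one subset per requested id, stacked with rbind
expected <- do.call("rbind", lapply(idVector, function(x) myDF[myDF$Id == x, ]))

# Join approach: each row of the left table pulls in all of its matches, in order
joined <- left_join(data.frame(Id = idVector), myDF, by = "Id")

# Compare the two results column by column
identical(joined$Id, expected$Id)
identical(joined$values, expected$values)
```

Newer dplyr releases may warn about a many-to-many relationship here, because idVector repeats ids and myDF has several rows per id. In this case that duplication is exactly what is wanted.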

Related

Remove columns with certain column name patterns in multiple dataframes in R

I have >100 dataframes loaded into R. I want to remove all the columns from all data frames containing a certain pattern, in the example case below "abc".
df1 <- data.frame(`abc_1` = rep(3, 5), `b` = seq(1, 5, 1), `c` = letters[1:5])
df2 <- data.frame(`d` = rep(5, 5), `e_abc` = seq(2, 6, 1), `f` = letters[6:10])
df3 <- data.frame(`g` = rep(5, 5), `h` = seq(2, 6, 1), `i_a_abc` = letters[6:10])
I would thus like to remove the column abc_1 in df1, e_abc in df2 and i_a_abc in df3. How could this be done?
Do all of your dataframes start with or contain a shared string (e.g., df)? If yes, then it might be easier to put all your dataframes in a list by using that shared string and then apply the function to remove the abc columns in every dataframe in that list.
You can then read your dataframes back into your environment with list2env(), but it probably is in your interest to keep everything in a list for convenience.
library(dplyr)
df1 <- data.frame(`abc_1` = rep(3, 5), `b` = seq(1, 5, 1), `c` = letters[1:5])
df2 <- data.frame(`d` = rep(5, 5), `e_abc` = seq(2, 6, 1), `f` = letters[6:10])
df3 <- data.frame(`g` = rep(5, 5), `h` = seq(2, 6, 1), `i_a_abc` = letters[6:10])
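# names of objects called df1, df2, ... (anchored so that dfpattern and dflist themselves are not matched)
dfpattern <- grep("^df[0-9]+$", ls(envir = .GlobalEnv), value = TRUE)
dflist <- mget(dfpattern, envir = .GlobalEnv)
# drop every column whose name contains "abc"
dflist <- lapply(dflist, function(x) select(x, !contains("abc")))
list2env(dflist, envir = .GlobalEnv)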
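If you would rather not depend on dplyr, roughly the same thing can be sketched in base R. This assumes the data frames are already in a named list; the names dfs, drop_abc and cleaned are made up for the example:

```r
# Hypothetical named list standing in for the >100 data frames
dfs <- list(
  df1 = data.frame(abc_1 = rep(3, 5), b = 1:5, c = letters[1:5]),
  df2 = data.frame(d = rep(5, 5), e_abc = 2:6, f = letters[6:10])
)

# Keep only the columns whose names do not contain the literal string "abc";
# drop = FALSE keeps the result a data frame even if one column remains
drop_abc <- function(d) d[, !grepl("abc", names(d), fixed = TRUE), drop = FALSE]
cleaned <- lapply(dfs, drop_abc)

lapply(cleaned, names)
```

Using fixed = TRUE treats "abc" as a plain string rather than a regular expression, which matches what contains() does by default.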

How to efficiently find the overlap between two data tables of sequence coordinates in R?

I have two large data tables with the coordinates of different sequences. For example:
library(data.table)
dt1 <- data.table(cat = c(rep("A", 2), rep("B", 2)),
start = c(1, 4, 2, 15),
end = c(6, 9, 5, 20))
dt2 <- data.table(cat = c(rep("A", 2), rep("B", 2)),
start = c(2, 1, 10, 17),
end = c(7, 3, 12, 20))
I need to create a data table of the coordinates for the overlapping sequences (i.e., the integers that occur in the sequences given in both data tables, for each category). I can currently do this using a for loop. For example:
seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
out_list <- list()
for(i in 1:length(unique(dt1$cat))){
  sub1 <- dt1[cat == unique(dt1$cat)[i]]
  sub2 <- dt2[cat == unique(dt1$cat)[i]]
  vec1 <- unique(unlist(c(seq2(from = sub1$start, to = sub1$end))))
  vec2 <- unique(unlist(c(seq2(from = sub2$start, to = sub2$end))))
  vec <- Reduce(intersect, list(vec1, vec2))
  vec_dt <- data.table(V1 = vec)
  output <- vec_dt[order(V1),
                   .(start = min(V1),
                     end = max(V1)),
                   by = .(grp = rleid(c(0, cumsum(diff(V1) > 1))))
                   ]
  output$grp <- NULL
  output$cat <- unique(dt1$cat)[i]
  out_list[[i]] <- output
  print(i)
}
output_dt <- do.call("rbind", out_list)
However, the data sets I need to apply this to are very large (both in the number of rows and the size of the vectors). Is anyone able to suggest a way to improve performance?
Thanks
You could (a) convert your start/end variables to a sequence, (b) do an inner join, (c) convert back to start/end.
library(data.table)
dt1 <- data.table(cat = c(rep("A", 2), rep("B", 2)),
start = c(1, 4, 2, 15),
end = c(6, 9, 5, 20))
dt2 <- data.table(cat = c(rep("A", 2), rep("B", 2)),
start = c(2, 1, 10, 17),
end = c(7, 3, 12, 20))
# convert to sequence
dt1 = dt1[, .(sequence = start:end), by=.(cat, 1:nrow(dt1))][
, nrow := NULL]
dt2 = dt2[, .(sequence = start:end), by=.(cat, 1:nrow(dt2))][
, nrow := NULL]
# inner join + unique
overlap = merge(dt1, dt2)
overlap = unique(overlap)
# convert to start/end
overlap = overlap[, .(start=min(sequence), end=max(sequence)), by=.(cat)]
# result
overlap
#>    cat start end
#> 1:   A     1   7
#> 2:   B    17  20
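Note that the last step returns a single start/end pair per category, so it implicitly assumes the overlap within a category is one contiguous run. If the overlap can have gaps, one option is to number the consecutive runs before collapsing them. This is only a sketch: tab1, tab2, s1, s2, both, run and res are made-up names, and the input is invented for illustration.

```r
library(data.table)

# Hypothetical input: within category "A" the tables overlap on 1-3 and on 8-9
tab1 <- data.table(cat = "A", start = c(1, 8), end = c(3, 12))
tab2 <- data.table(cat = "A", start = c(1, 7), end = c(5, 9))

# Expand to one row per integer; unique() removes repeats before the join
s1 <- unique(tab1[, .(sequence = start:end), by = .(cat, id = seq_len(nrow(tab1)))][, id := NULL])
s2 <- unique(tab2[, .(sequence = start:end), by = .(cat, id = seq_len(nrow(tab2)))][, id := NULL])

# Inner join on both columns; merge() returns the rows sorted by cat, sequence
both <- merge(s1, s2, by = c("cat", "sequence"))

# Start a new run wherever the gap to the previous integer exceeds 1
both[, run := cumsum(c(TRUE, diff(sequence) > 1)), by = cat]

# One start/end pair per run, then drop the helper column
res <- both[, .(start = min(sequence), end = max(sequence)), by = .(cat, run)][, run := NULL]
res
```

Calling unique() before the merge also keeps the join from multiplying rows when several intervals in the same table cover the same integer.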

Is there a way in R to compute a new column on a df based on another df?

Is it possible to do something like this in R (assuming both df1 and df2 have the same number of rows)?
if (df1$var1 = 8) df2$var1 = 1.
if (df1$var2 = 9) df2$var2 = 1.
This can be done in two lines with base R's ifelse():
df1 <- data.frame(var1 = c(1:10), var2 = c(1:10))
df2 <- data.frame(var1 = c(1:10), var2 = c(1:10))
df2$var1 <- ifelse(df1$var1 == 8, 1,df2$var1)
df2$var2 <- ifelse(df1$var2 == 9, 1,df2$var2)
Here is one option in base R: repeat the target values 8 and 9 so that each lines up with its column, compare them with the selected columns of 'df1' to get a logical matrix, then use that matrix to index the same columns of 'df2' and assign 1 to the matching cells.
nm1 <- c('var1', 'var2')
df2[nm1][df1[nm1] == c(8, 9)[col(df1[nm1])]] <- 1
df2
#  var1 var2 var3
#1    5    1    1
#2    3    1    2
#3    1    3    3
#4    1    4    4
#5    4    2    5
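To see what the comparison inside that one-liner produces, here is a small sketch using the same df1 values as in the data at the end of this answer (the names targets, idx and mask are only for illustration):

```r
df1 <- data.frame(var1 = c(1, 3, 8, 5, 2), var2 = c(9, 3, 1, 8, 4))
nm1 <- c("var1", "var2")
targets <- c(8, 9)

# col() returns the column index of every cell, so targets[idx] repeats
# 8 for each cell of column 1 and 9 for each cell of column 2
idx <- col(df1[nm1])

# Comparing the data frame with that vector gives a logical matrix
mask <- df1[nm1] == targets[idx]

# Row/column positions of the cells that matched their target value
which(mask, arr.ind = TRUE)
```

The TRUE cells of mask are the positions that get overwritten when the same matrix is used to index df2[nm1].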
Or this can be done in two steps
df2$var1[df1$var1 == 8] <- 1
df2$var2[df1$var2 == 9] <- 1
Or using Map
df2[nm1] <- Map(function(x, y, z) replace(x, y == z, 1),
df2[nm1], df1[nm1], c(8, 9))
An if/else version is also possible, but if is not vectorized, i.e. it expects a condition of length 1. It therefore has to run inside an explicit loop over rows and columns, which works but is inefficient in R:
vals <- c(8, 9)
for(i in seq_len(nrow(df1))) {
for(j in seq_along(nm1)) {
if(df1[[nm1[j]]][i] == vals[j]) df2[[nm1[j]]][i] <- 1
}
}
data
df1 <- data.frame(var1 = c(1, 3, 8, 5, 2), var2 = c(9, 3, 1, 8, 4),
var3 = 1:5)
df2 <- data.frame(var1 = c(5, 3, 2, 1, 4), var2 = c(3, 1, 3, 4, 2),
var3 = 1:5)

Compare element by element from two data frames

I'd like to compare two data frames, df1 and df2, element by element and build a new data frame called out from them. Where the elements are equal, the corresponding element of out should be 1; otherwise it should be 0.
MWE
set.seed(1)
df1 <- data.frame(Q1 = sample(letters[1:5], 2, replace = TRUE),
Q2 = sample(letters[1:5], 2, replace = TRUE))
set.seed(2)
df2 <- data.frame(Q1 = sample(letters[1:5], 2, replace = TRUE),
Q2 = sample(letters[1:5], 2, replace = TRUE))
Expected output
out <- data.frame(Q1 = c(0, 0), Q2 = c(1, 0))
If the data frames are created with stringsAsFactors = FALSE, the columns are character and can be compared directly (factor columns make this harder, because their level sets can differ between the two data frames). The unary + turns the resulting logical matrix into 0/1:
+(df1 == df2)
Or, if the columns are factors, convert them to character with type.convert first:
+(type.convert(df1, as.is = TRUE) == type.convert(df2, as.is = TRUE))
Or convert both data frames to character matrices with as.matrix:
+(as.matrix(df1) == as.matrix(df2))
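A minimal sketch of the first variant with hand-written data, so the result does not depend on the random seed. The data frames a and b and their values are invented for this example:

```r
# Hypothetical character data frames with the same shape
a <- data.frame(Q1 = c("a", "d"), Q2 = c("e", "b"), stringsAsFactors = FALSE)
b <- data.frame(Q1 = c("e", "c"), Q2 = c("e", "a"), stringsAsFactors = FALSE)

# Element-wise comparison gives a logical matrix
eq <- a == b

# Unary + coerces TRUE/FALSE to 1/0; as.data.frame() turns the matrix back into a data frame
out <- as.data.frame(+eq)
out
```

Only the first element of Q2 matches here, so that is the only cell expected to be 1.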

R add column for sparkline with value from each row vector

Start with a dataframe
library(dplyr)
library(sparkline)
df <- data.frame(matrix(1:9, nrow = 3, ncol = 3))
  X1 X2 X3
1  1  4  7
2  2  5  8
3  3  6  9
I would like to add a column 'spark' for use with sparkline:
df <- df %>% mutate(spark = spk_chr(values = ?, type = "bar", elementId = X1))
So the question mark (?) would be replaced by a vector made up of each row of df.
For the first row, ? = c(1, 4, 7), the values from the first row, spark = spk(values = c(1, 4, 7)...)
I know how to extract a vector from any row (the first row vector is unlist(df[1,])), but I do not understand whether this can be used in mutate.
I used Ronak's suggestion to create an intermediate column:
cols = names(df)
df$y <- apply(df[,cols], 1, paste, collapse = "-")
Then I created a vectorized spk_chr:
sparky <- Vectorize(sparkline::spk_chr)
and used it to make the spark column:
df <- df %>% mutate(spark = sparky(strsplit(y, split="-"), type = "bar", elementId = X1))
