Check for string in dataframe containing list of strings

I'm trying to find a way to look up multiple values in a data frame and return a value. Simplified example:
df1 <- read.table(text="chk1 chk2 chk3 value
xx aa;bb;cc jj 1
xx;yy dd;ee;ff kk 2
zz gg;hh;ii ll;nn 3", header=T)
df2 <- read.table(text="val1 val2 val3
xx bb jj
xx dd kk
yy ee kk
zz hh jj
", header=T)
Look up the values val1, val2, and val3 from each row of df2 in df1, and return value from df1.
Desired result:
df2 <- read.table(text="
val1 val2 val3 value
xx bb jj 1
xx dd kk 2
yy ee kk 2
zz hh jj NA
", header=T)
I tried match, x %in% y, and looping over the rows, but I can't get it to work.

Here is one possibility:
library(tidyverse)
df3 <- df2 %>% rowwise %>%
mutate(rowmatch=which(grepl(val1, df1$chk1) &
grepl(val2, df1$chk2) &
grepl(val3, df1$chk3))[1],
value=df1$value[rowmatch])
Result:
# A tibble: 4 x 5
val1 val2 val3 rowmatch value
<chr> <chr> <chr> <int> <int>
1 xx bb jj 1 1
2 xx dd kk 2 2
3 yy ee kk 2 2
4 zz hh jj NA NA
Notes:
The [1] ensures that only the first of the matching rows is used.
rowmatch and value are identical in this example only because df1$value happens to equal the row number.
A tibble behaves like a data.frame, but if you really prefer a data frame, add %>% as.data.frame
The same can be done with base R and apply:
df2$rowmatch <- with(df1, apply(df2, 1, function(x)
which(grepl(x["val1"], chk1) &
grepl(x["val2"], chk2) &
grepl(x["val3"], chk3))[1]))
df2$value <- df1$value[df2$rowmatch]
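One thing to keep in mind with both versions: grepl treats the lookup value as a regular expression and matches substrings, so a value such as x would also match a cell containing xx;yy. If exact matching against the semicolon-separated entries is needed, something along these lines might work. This is only a sketch; has_token is a made-up helper name, not part of any package.

```r
# Hypothetical helper: for each cell, TRUE if `val` is exactly one of its
# ';'-separated tokens (no regular expressions, no substring matches).
has_token <- function(val, cells) {
  vapply(strsplit(as.character(cells), ";", fixed = TRUE),
         function(tokens) val %in% tokens,
         logical(1))
}

cells <- c("xx;yy", "zz")
sub_match   <- grepl("x", cells)       # substring match
exact_match <- has_token("x", cells)   # whole-token match
yy_match    <- has_token("yy", cells)
```

In the code above, grepl(val1, df1$chk1) could then be swapped for has_token(val1, df1$chk1), and likewise for the other two columns.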

Another option would be splitting the values first:
df1 <- df1 %>%
splitstackshape::cSplit("chk1", ";", fixed = TRUE, direction = "long", drop = FALSE, type.convert = FALSE) %>%
splitstackshape::cSplit("chk2", ";", fixed = TRUE, direction = "long", drop = FALSE, type.convert = FALSE) %>%
splitstackshape::cSplit("chk3", ";", fixed = TRUE, direction = "long", drop = FALSE, type.convert = FALSE)
and then using a join on the split columns.
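The split-then-join idea can be sketched in base R as well. This is a rough sketch that swaps in strsplit, expand.grid and merge instead of splitstackshape; the names long, row_id and res are made up for illustration.

```r
df1 <- read.table(text = "chk1 chk2 chk3 value
xx aa;bb;cc jj 1
xx;yy dd;ee;ff kk 2
zz gg;hh;ii ll;nn 3", header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = "val1 val2 val3
xx bb jj
xx dd kk
yy ee kk
zz hh jj", header = TRUE, stringsAsFactors = FALSE)

# Expand every row of df1 into all combinations of its ';'-separated tokens.
long <- do.call(rbind, lapply(seq_len(nrow(df1)), function(i) {
  parts <- lapply(df1[i, c("chk1", "chk2", "chk3")],
                  function(s) strsplit(s, ";", fixed = TRUE)[[1]])
  cbind(expand.grid(parts, stringsAsFactors = FALSE), value = df1$value[i])
}))

# Left join df2 onto the expanded table; row_id restores the original order,
# because merge() sorts its result by the key columns.
df2$row_id <- seq_len(nrow(df2))
res <- merge(df2, long,
             by.x = c("val1", "val2", "val3"),
             by.y = c("chk1", "chk2", "chk3"),
             all.x = TRUE)
res <- res[order(res$row_id), c("val1", "val2", "val3", "value")]
res
```

Unlike the grepl approach, a row of df2 that matches several rows of df1 would come back several times here, so duplicates may need to be dropped afterwards.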

You can also do it using two nested for loops. The logic is to take the first row of df2 and then go through the rows of df1 to see if df2$val1 matches df1$chk1, df2$val2 matches df1$chk2 and df2$val3 matches df1$chk3. I consider a row a match if there is at least one match per column. The caveat here is that if several rows of df1 match a row of df2, the last matching row from df1 will be written to df2. But this can be changed by breaking out of the loop as soon as a match is found.
for (i in 1:nrow(df2)) {
for (j in 1:nrow(df1)) {
# Take the j-th row of df1 (without the value column) and split each cell by ;.
# The result is a list of character vectors against which we'll match.
i.split <- strsplit(as.character(unlist(df1[j, , drop = TRUE][-4])), ";")
# Pairwise check columns from df1 and df2.
all.ok <- all(mapply(FUN = function(x, y) {
any(x %in% y)
}, x = i.split, y = as.list(df2[i, 1:3])
))
if (all.ok) {
# If a match is found, write the value to df2.
df2[i, "value"] <- df1[j, "value"]
}
}
}
Output:
val1 val2 val3 value
1 xx bb jj 1
2 xx dd kk 2
3 yy ee kk 2
4 zz hh jj NA
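For completeness, here is roughly what the early-exit variant mentioned above could look like. It is a sketch under the same assumptions as the loop above (columns read as character, value column in df1); the names tokens and wanted are only illustrative.

```r
df1 <- read.table(text = "chk1 chk2 chk3 value
xx aa;bb;cc jj 1
xx;yy dd;ee;ff kk 2
zz gg;hh;ii ll;nn 3", header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = "val1 val2 val3
xx bb jj
xx dd kk
yy ee kk
zz hh jj", header = TRUE, stringsAsFactors = FALSE)

df2$value <- NA_integer_
for (i in seq_len(nrow(df2))) {
  wanted <- unlist(df2[i, c("val1", "val2", "val3")])
  for (j in seq_len(nrow(df1))) {
    # Split the three chk cells of row j into their tokens.
    tokens <- strsplit(unlist(df1[j, c("chk1", "chk2", "chk3")]), ";", fixed = TRUE)
    # Column-wise check: each wanted value must be among that column's tokens.
    if (all(mapply(function(tok, w) w %in% tok, tokens, wanted))) {
      df2$value[i] <- df1$value[j]
      break  # keep the first matching row of df1 and stop searching
    }
  }
}
df2
```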

Related

the number of variables that are repeated 2 or more times in R

I have the same data in two forms: a list (i.e., r) and a data.frame (i.e., df). For each form, how can I find the variable names that occur 2 or more times, together with how often they occur (in the example below, my desired output is: AA 3 times, BB 2 times, CC 2 times)?
NOTE: the answer should be the same regardless of the form of the data.
r <- list( data.frame( AA = c(2,2,1,1,NA, NA), BB = c(1,1,1,2,2,NA), CC = c(1:5, NA)), # LIST
data.frame( AA = c(1,NA,3,1,NA,NA), DD = c(1,1,1,2,NA,NA)),
data.frame( AA = c(1,NA,3,1,NA,NA), BB = c(1,1,1,2,2,NA), CC = c(0:4, NA)) )
df <- do.call(cbind, r) ## DATA.FRAME

We can build a frequency table of the column names of the dataset and keep the names that occur 2 or more times:
tbl <- table(names(df))
tbl1 <- tbl[tbl >=2]
tbl1
# AA BB CC
# 3 2 2
lapply(r, function(x) table(names(x)[names(x) %in% names(tbl1)]))
If we need to get it from the list r directly, without building df first:
vec <- names(unlist(r, recursive = FALSE))
nm1 <- unique(vec[duplicated(vec)])
lapply(r, function(x) table(names(x)[names(x) %in% nm1]))

Transpose whole dataframe into one row dataframe- (or transposing each row of data.table and column binding)

I have tried to transform my_dataset with the reshape and data.table libraries to get result.dataset, but haven't been successful yet.
I have a data table my_dataset that looks like this:
A X Count
id1 b 1
id1 c 2
And I want result.dataset to look like this:
A X1 Count1 X2 Count2
id1 b 1 c 2
It would be great if anyone could help me get result.dataset as above, preferably using reshape or data.table (or both).

Here's a solution that is using only reshape2 (trying to stick to the suggested packages). It starts by adding a column rep, that allows one to call dcast.
require(reshape2)
#adding rep
my_dataset$rep = unlist(tapply(my_dataset$A, my_dataset$A, function(x)1:length(x)))
#cast at work
C1 = dcast(my_dataset, A ~ paste('X',rep, sep=''), value.var='X')
C2 = dcast(my_dataset, A ~ paste('Count',rep, sep=''), value.var='Count')
result.dataset = cbind(C1, C2[,-1])
The columns will not be in the same order as your example though.
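If the column order matters, base R's reshape() might be an alternative, since it emits the columns grouped by repetition. A minimal sketch, assuming my_dataset looks like the sample in the question; the helper column rep plays the same role as above.

```r
my_dataset <- read.table(text = "A X Count
id1 b 1
id1 c 2", header = TRUE, stringsAsFactors = FALSE)

# Running index within each value of A (1, 2, ... per group).
my_dataset$rep <- ave(seq_along(my_dataset$A), my_dataset$A, FUN = seq_along)

# Long to wide: one row per A, one set of X/Count columns per rep.
# sep = "" gives names like X1 rather than X.1.
result.dataset <- reshape(my_dataset, idvar = "A", timevar = "rep",
                          direction = "wide", sep = "")
result.dataset
```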

Try this:
dt <- read.table(text = 'A X Count
id1 b 1
id1 c 2',header=T)
a <- aggregate(.~A, dt, paste, collapse=",")
library(splitstackshape)
result <- concat.split.multiple(data = a, split.cols = c("X","Count"), seps = ",")
output:
> result
A X_1 X_2 Count_1 Count_2
1: id1 b c 1 2

We can aggregate the rows and use cSplit to split them.
library(data.table)
library(splitstackshape)
dat2 <- setDT(dat)[, lapply(.SD, paste, collapse = ","), by = A]
cols <- c(names(dat[, 1]), paste(names(dat[, -1]),
rep(1:nrow(dat), each = nrow(dat)),
sep = "_"))
cSplit(dat2, splitCols = names(dat[, -1]))[, cols, with = FALSE]
# A X_1 Count_1 X_2 Count_2
# 1: id1 b 1 c 2
DATA
dat <- read.table(text = "A X Count
id1 b 1
id1 c 2",
header = TRUE, stringsAsFactors = FALSE)

column of unique values between of two other columns

sample data:
col1 col2
<NA> cc
a a
ab a
z a
I want to add a column unique with these values: any characters that aren't shared between col1 and col2.
col1 col2 unique
<NA> cc cc
a a
ab a b
z a za
I tried using setdiff, but it fails.
(Data for replication purposes:)
df <- read.table(header=TRUE, stringsAsFactors = FALSE, text =
"col1 col2
NA cc
a a
ab a
z a
")
This is what I tried:
df$unique <- paste0(setdiff(df$col1, df$col2), setdiff(df$col2, df$col1))
But it returns
Error in `$<-.data.frame`(`*tmp*`, "unique", value = c("<NA>cc", "abcc" :
replacement has 2 rows, data has 3
From the error it looks like it's generating a vector of the differences between the columns, instead of the differences between the elements...
Edit: Added z and a sample data in last row.

You could do this using setdiff and Reduce in base R:
cols <- c(1,2)
df$unique <- unlist(lapply(apply(df[cols], 1, function(x)
Reduce(setdiff, strsplit(na.omit(x), split = ""))), paste0, collapse=""))
# col1 col2 unique
# 1 <NA> cc cc
# 2 a a
# 3 ab a b

Here is a length method with apply.
apply(df, 1, function(i) {
i <- i[!is.na(i)] # remove NAs
if(length(i[!is.na(i)]) == 1) i # check length and return singletons untouched
else { # for non-singletons
i <- unlist(strsplit(i, split="")) # strsplit and turn into a vector
i <- i[!(duplicated(i) | duplicated(i, fromLast=TRUE))] # drop duplicates
paste(i, collapse="")}}) # return collapsed singleton set of characters
[1] "cc" "" "b"
Note that for c("cc", "a", "c"), this will return "a" because "cc" and "c" will be marked as duplicates.

We need to split the string first:
df$unique <- mapply(function(x, y){
u <- setdiff(union(x, y), intersect(x, y))
paste0(u[!is.na(u)], collapse = '')
}, strsplit(df$col1, ''), strsplit(df$col2, ''))
# >df
# col1 col2 unique
# 1 <NA> cc c
# 2 a a
# 3 ab a b
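Note that the outputs shown above cover only the first three sample rows, and the last one gives c rather than the requested cc for the first row, because union() drops repeated characters. A possible variant that keeps repeats and also handles the z / a row is sketched below; sym_diff_chars is a made-up helper name.

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
"col1 col2
NA cc
a a
ab a
z a
")

# Hypothetical helper: characters of `a` not found in `b`, followed by
# characters of `b` not found in `a`. NA is treated as an empty string.
sym_diff_chars <- function(a, b) {
  x <- if (is.na(a)) character(0) else strsplit(a, "")[[1]]
  y <- if (is.na(b)) character(0) else strsplit(b, "")[[1]]
  paste0(c(x[!x %in% y], y[!y %in% x]), collapse = "")
}

df$unique <- mapply(sym_diff_chars, df$col1, df$col2, USE.NAMES = FALSE)
df
```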

Specific Ordering in R

I am generating all possible combinations for a specific input, but the output also has to be ordered according to the order of the input. Since the combinations have different lengths, I'm struggling with the answers previously posted.
I would like to know if this is possible.
Input:
D N A 3
This means I need to output all combinations as strings of up to 3 characters:
D
DD
DDD
DDN
DDA
DND
DNA
.
.
This is basically ascending order if we consider D<N<A.
So far my output looks like this:
A
AA
AAA
AAD
AAN
AD
ADA
ADD
ADN
AN
.
.
I have tried converting the input to a factor with levels c("D","N","A") and sorting my output, but then every string longer than 1 character disappears.

Here's one possible solution:
generateCombs <- function(x, n){
if (n == 1) return(data.frame(collapsedPermutations = x)) # Base case: the single-character strings
# Create a grid with all possible permutations of 0:length(x). 0 == "", and 1:length(x) correspond to elements of x
permutations = expand.grid(replicate(n, 0:length(x), simplify = F))
# Order permutations
orderedPermutations = permutations[do.call(order, as.list(permutations)),]
# Map permutations now such that 0 == "", and 1:length(x) correspond to elements of x
mappedPermutations = sapply(orderedPermutations, function(y) c("", x)[y + 1])
# Collapse each row into a single string
collapsedPermutations = apply(mappedPermutations, 1, function(x) paste0(x, collapse = ""))
# Due to the 0's, there will be duplicates. We remove the duplicates in reverse order
collapsedPermutations = rev(unique(rev(collapsedPermutations)))[-1] # -1 removes blank
# Return as data frame
return (as.data.frame(collapsedPermutations))
}
x = c("D", "N", "A")
n = 3
generateCombs(x, n)
The output is:
collapsedPermutations
1 D
2 DD
3 DDD
4 DDN
5 DDA
6 DN
7 DND
8 DNN
9 DNA
10 DA
11 DAD
...
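The same ordering can also be produced without building and de-duplicating a grid, by generating the strings depth-first: each string is emitted before its extensions, and the extensions follow the input order. A minimal sketch; ordered_strings is a made-up function name.

```r
# Depth-first generation: emit the prefix extended by each symbol, then
# recurse to extend it further, until strings reach length n.
ordered_strings <- function(x, n, prefix = "") {
  if (n == 0) return(character(0))
  unlist(lapply(x, function(s) {
    p <- paste0(prefix, s)
    c(p, ordered_strings(x, n - 1, p))
  }))
}

res <- ordered_strings(c("D", "N", "A"), 3)
head(res, 11)
```

Because the order comes from the traversal itself, no sorting or factor conversion is needed afterwards.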

A solution using a random library I just found (so I might be using it wrong) called iterpc.
Generate all the combinations, factor the elements, sort, then hack into a string.
ordered_combn = function(elems) {
require(data.table)
require(iterpc)
I = lapply(seq_along(elems), function(i) iterpc::iterpc(table(elems), i, replace=TRUE, ordered=TRUE))
I = lapply(I, iterpc::getall)
I = lapply(I, as.data.table)
dt = rbindlist(I, fill = TRUE)
dt[is.na(dt)] = ""
cols = paste0("V", 1:length(elems))
dt[, (cols) := lapply(.SD, factor, levels = c("", elems)), .SDcols = cols]
setkey(dt)
dt[, ID := 1:.N]
dt[, (cols) := lapply(.SD, as.character), .SDcols = cols]
dt[, ord := paste0(.SD, collapse = ""), ID, .SDcols = cols]
# return dt[, ord] as an ordered factor for neatness
dt
}
elems = c("D", "N", "A")
combs = ordered_combn(elems)
combs
Output
V1 V2 V3 ID ord
1: D 1 D
2: D D 2 DD
3: D D D 3 DDD
4: D D N 4 DDN
5: D D A 5 DDA
6: D N 6 DN
7: D N D 7 DND
8: D N N 8 DNN
...

R extract matching values from list of data frames

I have a relatively large amount of data stored in a list of data frames with several columns.
For each element of the list I wish to check one column against a reference and if present extract the value held in another column of the same element and place in a new summary matrix.
e.g. with the following example code:
add1 = c("N1","N1","N1")
coords1 = c(1,2,3)
vals1 = c("a","b","c")
extra1 = c("x","y","x")
add2 = c("N2","N2","N2","N2")
coords2 = c(2,3,4,5)
vals2 = c("b","c","d","e")
extra2 = c("z","y","x","x")
add3 = c("N3","N3","N3")
coords3 = c(1,3,5)
vals3 = c("a","c","e")
extra3 = c("z","z","x")
df1 <- data.frame(add1, coords1, vals1, extra1)
df2 <- data.frame(add2, coords2, vals2, extra2)
df3 <- data.frame(add3, coords3, vals3, extra3)
list_all <- list(df1, df2, df3)
coordinate.extract <- unique(unlist(lapply(list_all, "[", 1)))
my_matrix <- matrix(0, ncol = length(list_all)
, nrow = (length(coordinate.extract)))
my_matrix_new <- cbind(as.character(coordinate.extract)
, my_matrix)
I would like to end up with:
my_matrix_new = V1 V2 V3 V4
1 a a
2 b b
3 c c c
4 d
5 e e
i.e. the 3rd column of each list element is chosen based on the value of the second column.
I hope this is clear.
Thanks,
Matt

I would use a data.frame as there are mixed classes. You may try merge with Reduce to get the expected output: select the 2nd and 3rd columns in each list element, change the name of the 2nd column to be the same across all the list elements, merge, and, if needed, replace the NA elements with ''
lst1 <- lapply(list_all, function(x) {names(x)[2] <- 'V1';x[2:3] })
res <- Reduce(function(...) merge(..., by='V1', all=TRUE), lst1)
res[-1] <- lapply(res[-1], as.character)
res[is.na(res)] <- ''
res
# V1 vals1 vals2 vals3
#1 1 a a
#2 2 b b
#3 3 c c c
#4 4 d
#5 5 e e
We can change the column names
names(res) <- paste0('V', seq_along(res))
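If a lookup per list element is preferred over repeated merges, match() may be enough. A minimal sketch on a simplified version of the question's data (the extra columns are left out, and coords, out and result are illustrative names); columns are addressed by position, as in the question.

```r
list_all <- list(
  data.frame(add = "N1", coords = c(1, 2, 3),    vals = c("a", "b", "c")),
  data.frame(add = "N2", coords = c(2, 3, 4, 5), vals = c("b", "c", "d", "e")),
  data.frame(add = "N3", coords = c(1, 3, 5),    vals = c("a", "c", "e"))
)

# All coordinates that occur in column 2 of any list element.
coords <- sort(unique(unlist(lapply(list_all, `[[`, 2))))

# For each element, pick the column-3 value at every coordinate ("" if absent).
out <- sapply(list_all, function(d) {
  v <- as.character(d[[3]])[match(coords, d[[2]])]
  ifelse(is.na(v), "", v)
})

result <- data.frame(V1 = coords, out, stringsAsFactors = FALSE)
names(result) <- paste0("V", seq_along(result))
result
```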
