My dataframe looks something like the following, where there are >100 columns that starts with "i10_" and many other columns with other data. I would like to create a new variable that tells me whether the values C7931 and C7932 are in each row within only the columns that start with "i10_". I would like to create a new variable that states TRUE or FALSE depending on whether the value exists in that row or not.
So the output would be c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)
Create a vector with the columns of interest and use rowSums(), i.e.
i1 <- grep('i10_', names(d1))
rowSums(d1[i1] == 'C7931' | d1[i1] == 'C7932', na.rm = TRUE) > 0
where,
d1 <- structure(list(v1 = c("A", "B", "C", "D", "E", "F"), i10_a = c(NA,
"C7931", NA, NA, "S272XXA", "R55"), i10_1 = c("C7931", "C7931",
"R079", "S272XXA", "S234sfs", "N179")), class = "data.frame", row.names = c(NA,
-6L))
Ideally, you would give us a reproducible example with dput(). Assuming your dataframe is called df, you can do something like this with only base.
df$present <- apply(
df[, c(substr(names(df), 1, 3) == "i10")],
MARGIN = 1,
FUN = function(x){"C7931" %in% x & "C7932" %in% x})
This will go row by row and check columns that start with i10 if they contain "C7931" and "C7932".
Similar approach with dplyr::across()
my_eval<-c("C7932","C7931")
d1%>%
mutate(is_it_here=
rowSums(across(starts_with("i10_"),
~. %in% my_eval))!=0)
Related
Problem
I have two data frames that I want to join using a conditional statement on three non-numeric variables. Here is a pseudo-code version of what I want to achieve.
Join DF1 and DF2 on DF1$A == DF2$A | DF1$A == DF2$B
Dataset
Here's some code to create the two data frames. variant_index is the data frame that will be used to annotate input using a left_join:
library(dplyr)
options(stringsAsFactors = FALSE)
set.seed(5)
variant_index <- data.frame(
rsid = rep(sapply(1:5, function(x) paste0(c("rs", sample(0:9, 8, replace = TRUE)), collapse = "")), each = 2),
chrom = rep(sample(1:22, 5), each = 2),
ref = rep(sample(c("A", "T", "C", "G"), 5, replace = TRUE), each = 2),
alt = sample(c("A", "T", "C", "G"), 10, replace = TRUE),
eaf = runif(10),
stringAsFactors = FALSE
)
variant_index[1, "alt"] <- "T"
variant_index[8, "alt"] <- "A"
input <- variant_index[seq(1, 10, 2), ] %>%
select(rsid, chrom)
input$assessed <- c("G", "C", "T", "A", "T")
What I tried
I would like to perform a left_join on input to annotate with the eaf column from variant_index. As you can see from the input data frame, its assessed column can match either with input$ref or with input$alt. The rsid and chrom column will always match.
I know I can specify multiple column in the by argument of left_join, but if I understand correctly, the condition will always be
input$assessed == variant_index$ref & input$assessed == variant_index$alt
whereas I want to achieve
input$assessed == variant_index$ref | input$assessed == variant_index$alt
Possible solution
The desired output can be obtained like so:
input %>%
left_join(variant_index) %>%
filter(assessed == ref | assessed == alt)
But it doesn't seem like the best solution to me, since I am possibly generating double the lines, and would like to apply this join to data frames containing 100M+ lines. Is there a better solution?
Complex joins are straight forward in SQL:
library(sqldf)
sqldf("select *
from variant_index v
join input i on i.assessed = v.ref or i.assessed = v.alt")
Try this
library(dbplyr)
x1 <- memdb_frame(x = 1:5)
x2 <- memdb_frame(x1 = 1:3,x2 = letters[1:3])
x1 <- x1 %>% left_join(b, sql_on = "a.x=b.x1 or a.x=b.x2")
we can use show_query to see the code
I have a dataframe that I read from the XLSX file. Every column name looks like this: CODE___DESCRIPTION so for example A1___Some funky column here. It is easier to use the codes as colnames but I want to use description when needed so it must be stored in the dataframe. This is why I am using sjlabelled package later on.
Make yourself some random data and save it as some_data.xlsx.
library(dplyr) #to play with tibbles
library(stringi) #to play with strings
library(writexl) #name speaks for itself
tibble(col1 = sample(c("a", "b", "c", NA, "N/A"), 50, replace = T),
col2 = sample(c("d", "e", "f", NA, "N/A"), 50, replace = T),
col3 = sample(c("g", "h", "i", NA, "N/A"), 50, replace = T),
col4 = sample(c("j", "k", "l", NA, "N/A"), 50, replace = T)) %>%
setNames(stri_c("A", 1:4, "___", stri_rand_strings(4, 10))) %>%
write_xlsx(path = "some_data.xlsx", col_names = T, format_headers = F)
I've created simple function to prepare my data the way I want it.
library(sjlabelled) #to play with labelled data
label_it <- function(data = NULL, split = "___"){
#This basically makes an array of two columns (of codes and descriptions respectively)
k.n <- data %>%
names() %>%
stri_split_fixed(pattern = split, simplify = T)
data%>%
set_label(k.n[,2]) %>% #set description as each column's label
setNames(k.n[,1]) #set code as each column's name
}
First I read the data from XLSX file. Then I label it.
library(readxl) #name speaks for itself again
data <- read_xlsx("some_data.xlsx", na = c("", "N/A")) %>%
label_it()
Now each of my dataframe's column is character vector (in fact it's a structure) with two attributes:
label being description part
names being the original dataframe column name (CODE___DESCRIPTION style) and is not to be mistaken for output of names(data) which would be the codes part
Let's say I would like to change first and third column to factor.
To do this I have tried two things:
data[,1] <- factor(data[,1], levels = c("c", "a", "b"))
data[,3] <- factor(data[,3], levels = c("h", "g", "i"))
this changes all of those two columns values to NA_integer_.
data <- data %>%
mutate(A1 = factor(A1, levels = c("c", "a", "b")),
A3 = factor(A3, levels = c("h", "g", "i")))
this changes character vectors to factors as intended, but it drops both column attributes (label and names) which I need to be preserved.
I also tried quite a lot of functions from sjlabelled, labelled and haven packages. Nothing worked as I intended. Finally, I have found a solution, but it isn't perfect and I would love to find an easier way of doing this.
The solution is to lose those attributes but then regain ('copy' in fact) them.
data <- data %>%
mutate(A1 = factor(A1, levels = c("c", "a", "b")),
A3 = factor(A3, levels = c("h", "g", "i"))) %>%
copy_labels(data)
copy_labels is function from sjlabelled package which is used when labels are lost due to e.g. data subsetting as in this example.
P.S.
I would love to add r-sjlabelled and r-labelled tags because those packages are considered in this problem but am under 1500 reputation required to do this.
I have the following code that it taking forever to run on my 80k rows CBP table. Anyone could help me optimize my loop. Trying simply to find duplicates sharing the same values in certain (not all) columns, getting the number of duplicates there is and then returning the ids for each of the duplicates:
for (row in 1:nrow(CBP)){
subs <- subset(CBP, CBP$Lower_Bound__c == CBP[row,"Lower_Bound__c"] & CBP$Price_Book__c == CBP[row,"Price_Book__c"] & CBP$Price__c == CBP[row,"Price__c"] & CBP$Product__c == CBP[row,"Product__c"] & CBP$Department__c == CBP[row,"Department__c"] & CBP$UOM__c == CBP[row,"UOM__c"] & CBP$Upper_Bound__c == CBP[row,"Upper_Bound__c"])
if (nrow(subs)>1){
CBP[row,]$dup <- nrow(subs)
CBP[row,]$dupids <- paste(subs[,"Id"], collapse = ",")
}
print(row)
}
I'm having a hard time understanding your example. However, here's a simple approach with data.table that might work for your situation. You can create a variable (nsame in the example) that counts if the something is a duplicate by multiple variables (var1 and var2 in the example). Then just grab the row index.
library(data.table)
# generate some example data
dt <- data.table(
var1 = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
var2 = c("a", "a", "z", "b", "y", "b", "c", "c", "c"),
var3 = 1:9
)
# counter for each combination of var1-var2
dt[ , nsame := 1:.N, by=.(var1, var2)]
# duplicates are where the counter is > 1
which(dt$nsame > 1)
## 2 6 8 9
Using base R:
dupe_columns = c(
"Lower_Bound__c", "Price_Book__c", "Price__c", "Product__c",
"Department__c", "UOM__c", "Upper_Bound__c"
)
# which rows are duplicated
dupes = which(duplicated(CBP[, dupe_columns]) | duplicated(CBP[, dupe_columns], fromLast = TRUE))
# how many are there
length(dupes)
# IDs that are duplicated
CBP[dupes, "Id"]
# collapse Ids with duplicates by group:
aggregate(CBP$Id, by = CBP[dupe_columns], FUN = paste, collapse = ",")
If any of this doesn't work or you need more help, post 10-20 rows of sample data (use dput() so it is copy/pasteable!!!) so we can test and verify.
Subtle point, but I use CBP[, dupe_columns] in the duplicated() line because duplicated() will work the same whether we give it a data frame or a vector. CBP[, dupe_columns] will be a data frame if you have more than one column to check for dupes, but will be a vector if you give it a single column. However, when we get down to aggregate we need the by argument to be a list (like a data frame). So I use CBP[dupe_columns] (no comma) which will guarantee a data frame even if we are only checking a single column.
I will try to explain what I am doing the best I can it is kind of confusing but I'll give it a shot. Essentially I start with 2 data frames. Each one containing a unique row per person and two items per user as columns. My goal is to turn this into 1 data frame with one unique row per user and the first item from each of the two data frames upon the condition that the items do not repeat. For example if for customer 1 in the first data frame his items are "a" and "d" and in the second data frame his items are "a" and "c", I would want the final data frame to be "a" and "c" for this customer. I have written an apply that does this however when I perform this on roughly 160,000 rows it takes quite a bit of time. I was hoping someone would be able to come up with a more efficient solution to my problem.
d1 <- data.frame(id = c("1", "2", "3"), stringsAsFactors = F)
r1 <- data.frame(i1 = c("a", "b", "c"), i2 = c("d", "e", "f"), stringsAsFactors = F)
rownames(r1) = d1$id
r2 <- data.frame(i1 = c("a", "c", "f"), i2 = c("c", "t", "l"), stringsAsFactors = F)
rownames(r2) = d1$id
dFinal <- data.frame(id = d1$id, r1 = "", r2 = "", stringsAsFactors = F)
dFinal$r1 = apply(dFinal, 1, function(x){r1[rownames(r1) == x["id"], "i1"]})
dFinal$r2 = apply(dFinal, 1, function(x){r2[rownames(r2) == x["id"], which(!r2[rownames(r2) == x["id"],c("i1","i2")] %in% x["r1"])[1]]})
Would the following do what you're looking for:
# Keep only first column of first data.frame
df <- cbind(d1,r1,r2)[,-3]
names(df) <- c("id","r1_final","r2_i1","r2_i2")
df$r2_final <- df$r2_i1
# Keep only second column of second data.frame
# if the value in the first column is found in first data.frame
df[df$r1_final == df$r2_i1,"r2_final"] <- df[df$r1_final == df$r2_i1,"r2_i2"]
df_final <- df[,c("id","r1_final","r2_final")]
print(df_final)
id r1_final r2_final
1 1 a c
2 2 b c
3 3 c f
Edit:
OP asked for a solution if there were four data.frames instead of 2 like in the example, here is some code that I haven't tested but it should work with two additional columns
df$r2_final <- df$r2_i1
df$r3_final <- df$r3_i1
df$r4_final <- df$r4_i1
df[df$r1_final == df$r2_i1,"r2_final"] <- df[df$r1_final == df$r2_i1,"r2_i2"]
df[df$r3_i1 %in% c(df$r1_final,df$r2_final),"r3_final"] <- df[df$r3_i1 %in% c(df$r1_final,df$r2_final),"r3_i2"]
df[df$r4_i1 %in% c(df$r1_final,df$r2_final,df$r3_final),"r4_final"] <- df[df$r4_i1 %in% c(df$r1_final,df$r2_final,df$r3_final),"r4_i2"]
df_final <- df[,c("id","r1_final","r2_final","r3_final","r4_final")]
Thanks for the accepted answer as it worked perfectly! However it gave me an idea to use ifelse. While it doesn't work any better or worse than the accepted answer it was a little easier for me to wrap my head around when adding more columns or data frames.
dfInt <- cbind(df1, df2, df3, df4)
dfInt$R1_Final <- dfInt$R1_1
dfInt$R2_Final <- ifelse(dfInt$R1_Final == dfInt$R2_1,
dfInt$R2_2,
dfInt$R2_1)
dfInt$R3_Final <- ifelse(dfInt$R1_Final != dfInt$R3_1 & dfInt$R2_Final != dfInt$R3_1,
dfInt$R3_1,
ifelse(dfInt$R2_Final != dfInt$R3_2,
dfInt$R3_2,
dfInt$R3_3))
dfInt$R4_Final <- ifelse(dfInt$R1_Final != dfInt$R4_1 & dfInt$R2_Final != dfInt$R4_1 & dfInt$R3_Final != dfInt$R4_1,
dfInt$R4_1,
ifelse(dfInt$R2_Final != dfInt$R4_2 & dfInt$R3_Final != dfInt$R4_2,
dfInt$R4_2,
ifelse(dfInt$R3_Final != dfInt$R4_3,
dfInt$R4_3,
dfInt$R4_4)))
I have a data.table with multiple columns. One of these columns currently works as a 'key' (keyb for the example). Another column (let's say A), may or may not have data in it. I would like to supply a vector that randomly sample two rows per key, -if this key appears in the vector, where 1 row contains data in A, while the other does not.
MRE:
#data.table
trys <- structure(list(keyb = c("x", "x", "x", "x", "x", "y", "y", "y",
"y", "y"), A = c("1", "", "1", "", "", "1", "", "", "1", "")), .Names = c("keyb",
"A"), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))
setkey(trys,keyb)
#list with keys
list_try <- structure(list(a = "x", b = c("r", "y","x")), .Names = c("a", "b"))
I could, for instance subset the data.table based on the elements that appear in list_try:
trys[keyb %in% list_try[[2]]]
My original (and probably inefficient idea), was to try to chain a sample of two rows per key, where the A column has data or no data, and then merge. But it does not work:
#here I was trying to sample rows based on whether A has data or not
#here for rows where A has no data
trys[keyb %in% list_try[[2]]][nchar(A)==0][sample(.N, 2), ,by = keyb]
#here for rows where A has data
trys[keyb %in% list_try[[2]]][nchar(A)==1][sample(.N, 2), ,by = keyb]
In this case, my expected output would be two data.tables (one for a and one for b in list_try), of two rows per appearing element: So the data.table from a would have two rows (one with and without data in A), and the one from b, four rows (two with and two without data in A).
Please let me know if I can make this post any clearer
You could add A to the by statement too, while converting it to a binary vector by modifying to A != "", combine with a binary join (while adding nomatch = 0L in order to remove non-matches) you could then sample from the row index .I by those two aggregators and then subset from the original data set
For a single subset case
trys[trys[list_try[[2]], nomatch = 0L, sample(.I, 1L), by = .(keyb, A != "")]$V1]
# keyb A
# 1: y 1
# 2: y
# 3: x 1
# 4: x
For a more general case, when you want to create separate data sets according to a list of keys, you could easily embed this into lapply
lapply(list_try,
function(x) trys[trys[x, nomatch = 0L, sample(.I, 1L), by = .(keyb, A != "")]$V1])
# $a
# keyb A
# 1: x 1
# 2: x
#
# $b
# keyb A
# 1: y 1
# 2: y
# 3: x 1
# 4: x