in R, concatenate a conditional vector of strings [closed] - r

I am struggling to gain insight from two tables of information I have in R. I want to check whether a string of characters in one data frame is present in another data frame. If it is, I want to record the name for that string and append it to a new data frame.
Here's what I am working with:
df_repeats
sequence promoter_numbers promotors
1 AAAAAAAAAAAA 715 NA
2 AAAAAAAAAAAC 61 NA
3 AAAAAAAAAAAG 184 NA
df_promotors
gene promotor_coordinates sequence
1 Xkr4_1 range=chr1:3671549-36… GAGCTAGTTCTCTTTTCCCTGGTTACTAGCCATGTCCCTCCTCCCA…
2 Rp1_2 range=chr1:4360255-43… CACACACACACACACACACACACACATGTAACAATGAAACAAAAAG…
3 Rp1_1 range=chr1:4409254-44… AGGTATAACTTGGTAAAGACTTTGAAGTAAACAAGAACAAACAGCT…
I am trying to see which gene repeat sequences in df_repeats are present in the sequence column in df_promotors. My goal is to create a new data frame to be able to perform some visualizations. So I've been struggling to create something like the below (just as an example)
df_repeat_occurances
sequence promotor_numbers in_genes
1 AAAAAAAAAAAA 715 Rp1_2
2 AAAAAAAAAAAC 61 Xkr4_1, Rp1_2
3 AAAAAAAAAAAG 184 Xkr4_1
I tried to write a nested loop to search through and, if there's a match, append it to df_repeats in place of the NA, and then change the row names later, but I am completely lost on how to do this, or whether it's an ideal way to combine the information from the two tables into one. Here's what I tried and could not work through.
for (i in 1:nrow(df_repeats)) {
x = df_repeats$sequence[i]
for (j in 1:nrow(df_promotors)) {
if (grepl(x, df_promotors$sequence[j])) {
y = df_promotors$gene[j]
df_repeats$sequence[i] = c(df_repeats$sequence[i], " ", y)
}
}
}
First time ever posting and asking for help, so any guidance or pointers would be greatly appreciated!!!

Welcome to SO. In the future, please include a reproducible example as I did below, with meaningful names such as "result".
Also mark any acceptable answer as "accepted".
The best approach is to separate the different computation steps.
#First, define a reproducible example
sequences <- c("AAA", "BBB", "CCC", 'DDD')
promnb <- 1:4
result <- data.frame(sequences, promnb)
genes_names <- paste0("gene_", letters[1:4])
sequence <- c('BBB', 'ABC', 'AAA', 'AAA')
df_proms <- data.frame(genes_names, sequence)
# genes_names sequence
# 1 gene_a BBB
# 2 gene_b ABC
# 3 gene_c AAA
# 4 gene_d AAA
# 1: check in which genes each sequence is present using grepl
# sapply applies the function to each element of result$sequences and returns
# a logical matrix: one column per sequence, one row per gene
in_genes <- sapply(result$sequences, function(x) grepl(x, df_proms$sequence))
# AAA BBB CCC DDD
# [1,] FALSE TRUE FALSE FALSE
# [2,] FALSE FALSE FALSE FALSE
# [3,] TRUE FALSE FALSE FALSE
# [4,] TRUE FALSE FALSE FALSE
#2: replace TRUE or FALSE by the names of the genes
in_genes_names <- data.frame(ifelse(in_genes, paste0(genes_names), ""))
#3: finally, paste each column of the last df to get all the names of the genes
# that contain this sequence
result$in_genes <- sapply(in_genes_names, paste, collapse = " ")
result$in_genes <- trimws(result$in_genes)
# By the way, you'd probably want to keep a list of the matches
# you can also include this list as a column of the result df
result$in_genes_list <- sapply(in_genes_names, list)
result
# sequences promnb in_genes in_genes_list
# 1 AAA 1 gene_c gene_d , , gene_c, gene_d
# 2 BBB 2 gene_a gene_a, , ,
# 3 CCC 3 , , ,
# 4 DDD 4 , , ,

You may try the following sapply loop -
df_repeats$in_genes <- sapply(df_repeats$sequence, function(x)
toString(df_promotors$gene[grepl(x, df_promotors$sequence)]))
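To see this approach run end to end, here is a small sketch on made-up data. The short sequences and promoter strings below are invented stand-ins, not the real data from the question. Two details are my own additions rather than part of the answer above: fixed = TRUE, so each repeat is matched as a literal substring instead of a regular expression, and vapply in place of sapply, so the result is guaranteed to be a character vector.

```r
# Toy stand-ins for the question's data frames (all values are invented)
df_repeats <- data.frame(
  sequence = c("AAA", "CAC", "GGG"),
  promoter_numbers = c(715, 61, 184),
  stringsAsFactors = FALSE
)
df_promotors <- data.frame(
  gene = c("Xkr4_1", "Rp1_2", "Rp1_1"),
  sequence = c("GAGCTAAAGT", "CACACAAAGT", "AGGTATAACT"),
  stringsAsFactors = FALSE
)

# For each repeat, collect the genes whose promoter contains it and
# collapse them into one comma-separated string ("" when nothing matches)
df_repeats$in_genes <- vapply(
  df_repeats$sequence,
  function(x) toString(df_promotors$gene[grepl(x, df_promotors$sequence, fixed = TRUE)]),
  character(1),
  USE.NAMES = FALSE
)

df_repeats$in_genes
# [1] "Xkr4_1, Rp1_2" "Rp1_2"         ""
```

If a list of matches per repeat is more useful than a pasted string, dropping toString and using lapply instead would keep each set of gene names as its own vector.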

Related

How to identify picked values from a list in R

Quite new to R...
I loaded a file with 13458 observations containing a time and a value. I ran it through a program which detects homologue series. The output is a large list with 6 elements, including values IDed by the row number in the original file.
I would like to export the original file with values detected by the program marked somehow so I can easily identify them in Excel. Hopefully that makes some sense.
My dataframe looks like this and I'm using the m.z and RT values:
m.z dummy RT
1 151.0092 255975.8 15.043
2 151.0092 110111.7 15.456
3 151.0092 108958.1 15.243
4 151.0093 3258343.0 14.620
5 151.0127 107255.9 6.336
My output contains a list of related series and looks like this:
[359] "3518,4779,5929,6975,8032,9051,9825"
[360] "5927,6977,8036,9052,9824,10507,11043"
I would like a data frame that tells me whether a value has been identified, like this:
m.z dummy RT homologue
3518 459.2006 255975.8 15.043 TRUE
3519 459.2120 110111.7 15.456 FALSE
3520 459.2159 108958.1 15.243 FALSE
Thanks!
Here is an attempt
your MS data:
DF <- read.table(text="m.z dummy RT
1 151.0092 255975.8 15.043
2 151.0092 110111.7 15.456
3 151.0092 108958.1 15.243
4 151.0093 3258343.0 14.620
5 151.0127 107255.9 6.336", header = TRUE)
the script output:
vec <- c("1,3,5", "3,5") #from your example looks like a vector of strings with numbers separated by a comma
As I understand you would like to label rows in df with TRUE/FALSE depending on appearance anywhere in vec?
DF$homologue <- row.names(DF) %in% as.numeric(unlist(strsplit(unlist(vec), ",")))
explanation:
unlist(vec) #in case it is a list and not a vector
strsplit(unlist(vec), ",") #split strings at "," returning a list
unlist(str... #convert that list into a vector
as.numeric(unlist(str... #convert to numeric
if any row names of DF are in vec they will be labeled T and if not F
DF
m.z dummy RT homologue
1 151.0092 255975.8 15.043 TRUE
2 151.0092 110111.7 15.456 FALSE
3 151.0092 108958.1 15.243 TRUE
4 151.0093 3258343.0 14.620 FALSE
5 151.0127 107255.9 6.336 TRUE
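The steps explained above can also be written out one at a time, which makes each intermediate result easy to inspect. This is only a sketch under two assumptions of mine: the program's output is a list or vector of comma-separated row numbers, and those numbers refer to row positions in the original file (so seq_len(nrow(DF)) is used instead of the row names, in case the row names are not the default 1, 2, 3, ...).

```r
DF <- data.frame(
  m.z   = c(151.0092, 151.0092, 151.0092, 151.0093, 151.0127),
  dummy = c(255975.8, 110111.7, 108958.1, 3258343.0, 107255.9),
  RT    = c(15.043, 15.456, 15.243, 14.620, 6.336)
)
vec <- c("1,3,5", "3,5")  # stand-in for the program's output

# Split every series string at the commas; the result is a list of character vectors
pieces <- strsplit(unlist(vec), ",", fixed = TRUE)

# Flatten to one integer vector of row numbers, dropping repeats
hit_rows <- unique(as.integer(unlist(pieces)))

# TRUE for every row whose position appears in at least one series
DF$homologue <- seq_len(nrow(DF)) %in% hit_rows
```

The resulting data frame can then be written out with write.csv(DF, "flagged.csv") (the file name is just a placeholder) and filtered on the homologue column in Excel.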

strsplit a character array and convert to dataframe simultaneously

I have what feels like a difficult data manipulation problem, and am hoping to get some guidance. Here is a test version of what my current array looks like, as well as what dataframe I hope to obtain:
dput(test)
c("<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>", "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>")
test
[1] "<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>"
[2] "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>"
desired_df
quarter oncourt-id time-minutes time-seconds id
1 1 NA 12 0 1
2 2 NA 10 NA 1
There are a few problems I am dealing with:
the character array "test" has backslashes where there should be nothing, but I was having difficulty using gsub in this format: gsub("\", "", test).
not every element in test has the same number of entries; note in the example that the 2nd element doesn't have time-seconds, so for the dataframe I would prefer it to return NA.
I have tried using strsplit(test, " ") to first split on spaces, which only exist between different column entries, but then I get back a list of lists that is just as difficult to deal with.
You've got XML there. You could parse it, then run rbindlist on the result. This will probably be a lot less hassle than trying to split the name-value pairs as strings.
dflist <- lapply(test, function(x) {
df <- as.data.frame.list(XML::xmlToList(x))
is.na(df) <- df == ""
df
})
data.table::rbindlist(dflist, fill = TRUE)
# quarter oncourt.id time.minutes time.seconds id
# 1: 1 NA 12 0 1
# 2: 2 NA 10 NA 1
Note: You will need the XML and data.table packages for this solution.
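On the backslash problem from the question: the backslashes are only how R prints a double quote inside a string. They are not characters in the data (cat(test[1]) shows the tag without them), so there is nothing to remove with gsub. If installing the XML package is not an option, the name-value pairs can also be pulled out with a regular expression in base R. The sketch below is an alternative to the answer above, not a replacement for a real parser: parse_attrs is a hypothetical helper name, and the pattern assumes simple tags whose attribute values never contain a double quote.

```r
test <- c(
  "<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>",
  "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>"
)

# Hypothetical helper: named character vector of the attributes in one tag
parse_attrs <- function(tag) {
  pairs <- regmatches(tag, gregexpr('[A-Za-z-]+="[^"]*"', tag))[[1]]
  keys  <- sub('=.*$', "", pairs)                    # text before the "="
  vals  <- sub('^[^=]+="([^"]*)"$', "\\1", pairs)    # text between the quotes
  vals[vals == ""] <- NA                             # empty attribute -> NA
  setNames(vals, keys)
}

attr_list <- lapply(test, parse_attrs)
all_keys  <- unique(unlist(lapply(attr_list, names)))

# Indexing a named vector by a name it lacks yields NA, which fills the gaps
rows <- lapply(attr_list, function(a) unname(a[all_keys]))
parsed_df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(parsed_df) <- all_keys
```

All columns come back as character here; type.convert(parsed_df, as.is = TRUE) is one way to turn the numeric-looking ones into numbers.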

Calling & creating new columns based on string

I have searched quite a bit and not found a question that addresses this issue, but if this has been answered, forgive me; I am still quite green when it comes to coding in general. I have a data frame with a large number of variables that I would like to combine to create new variables, in a loop, based on names I've put in a 2nd data frame. The data frame formulas should create & call columns from the main data frame data:
USDb = c(1,2,3)
USDc = c(4,5,6)
EURb = c(7,8,9)
EURc = c(10,11,12)
data = data.frame(USDb, USDc, EURb, EURc)
Now I'd like to create a new column data$USDa as defined by
data$USDa = data$USDb - data$USDc
and so on for EUR and other variables. This is easy enough to do manually, but I'd like to create a loop that pulls the names from formulas, something like this:
a = c("USDa", "EURa")
b = c("USDb", "EURb")
c = c("USDc", "EURc")
formulas = data.frame(a,b,c)
for (i in 1:length(formulas[,a])){
data$formulas[i,a] = data$formulas[i,b] - data$formulas[i,c]
}
Obviously data$formulas[i,a] returns NULL, so I tried data$paste0(formulas[i,a]), which returns Error: attempt to apply non-function
How can I get these strings to be recognized as variables in this way? Thanks.
There are simpler ways to do this, but I'll stick to most of your code as a means of explanation. Your code should work so long as you edit your for loop to the following:
for (i in 1:length(formulas[,"a"])){
data[formulas[i,"a"]] = data[formulas[i,"b"]] - data[formulas[i,"c"]]
}
formulas[,a] won't work because a is already defined as a variable (a character vector), and its contents are not column names of formulas. Use formulas[, "a"] instead if you want all rows from column "a" in data.frame formulas.
data$formulas literally searches for a column called "formulas" in the data.frame data. Instead you want to write data[formulas] (indexing formulas, of course, so that what you pass is a proper string).
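Because square-bracket indexing on a data frame accepts a whole vector of column names, the loop can also be dropped altogether. A minimal sketch, with stringsAsFactors = FALSE set explicitly so that the name columns are character vectors on any R version (on older versions they would otherwise become factors, which do not index by name):

```r
data <- data.frame(USDb = c(1, 2, 3), USDc = c(4, 5, 6),
                   EURb = c(7, 8, 9), EURc = c(10, 11, 12))
formulas <- data.frame(a = c("USDa", "EURa"),
                       b = c("USDb", "EURb"),
                       c = c("USDc", "EURc"),
                       stringsAsFactors = FALSE)

# Subtract the "c" columns from the "b" columns pairwise (matched by position)
# and store the results under the names listed in formulas$a
data[formulas$a] <- data[formulas$b] - data[formulas$c]

data$USDa
# [1] -3 -3 -3
```

This relies on the rows of formulas lining up, i.e. the n-th entries of a, b and c belonging to the same formula.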
Logic: iterate over each row of formulas using apply (which is a for loop internally) and compute b - c for that row. apply converts formulas to a character matrix, so each row x can be indexed by column name.
x = apply(formulas, 1, function(x) data[[x["b"]]] - data[[x["c"]]])
colnames(x) = formulas$a
x
#      USDa EURa
#[1,]   -3   -3
#[2,]   -3   -3
#[3,]   -3   -3
cbind(data, x)
#  USDb USDc EURb EURc USDa EURa
#1    1    4    7   10   -3   -3
#2    2    5    8   11   -3   -3
#3    3    6    9   12   -3   -3
Another option is split with sapply
sapply(setNames(split.default(as.matrix(formulas[-1]),
row(formulas[-1])), formulas$a), function(x) Reduce(`-`, data[x]))
#      USDa EURa
#[1,]   -3   -3
#[2,]   -3   -3
#[3,]   -3   -3

A non index-oriented way to search for unique values between two lists in R

Let's say I have a section of a list that looks like this:
aaa[[1]]
# [1] "A5-5,73" "B3-4,73" "E3-8,73" "A1-8,73" "C1-7,73" "A1-2,73" "C3-2,73" "C1-1,73"
Let's say I have another list with a section that looks like this:
bbb[[1]]
# [1] "B3-4,73" "C3-2,73" "A5-5,73" "A1-8,73" "A1-2,73" "A1-5,73" "B1-1,73" "C1-4,73"
Consider that I now run
which(aaa[[1]]!= bbb[[1]])
which returns
# [1] 1 2 3 5 6 7 8
This is technically true, because the index [4] is the same in both aaa and bbb
What I would like to have returned is:
# [1] "E3-8,73" "C1-7,73" "C1-1,73"
because these are the values of aaa that are not in bbb regardless of position. I would also be open to a solution that would just provide an index number, such as:
# [1] 3 5 8
Here is a reproducible example:
aaa <- vector("list")
aaa[[1]] <- c("A5-5,73", "B3-4,73", "E3-8,73", "A1-8,73",
"C1-7,73", "A1-2,73", "C3-2,73", "C1-1,73")
bbb <- vector("list")
bbb[[1]] <- c("B3-4,73", "C3-2,73", "A5-5,73", "A1-8,73",
"A1-2,73", "A1-5,73", "B1-1,73", "C1-4,73")
If you want to compare every element of the list aaa with the corresponding element of bbb, simply do:
mapply(FUN = setdiff, x = aaa, y = bbb)
While writing this I saw setdiff was mentioned already in the comments by @RichardScriven.
As @Richard Scriven mentioned in his comment
setdiff(aaa[[1]], bbb[[1]])
is the way to get the elements. As you also asked for an index number, you could use which()
which(aaa[[1]] %in% setdiff(aaa[[1]], bbb[[1]]))
While %in% produces logical values, which() tells where they are TRUE.
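Putting both answers together on the reproducible example, as a sketch: Map is used here instead of mapply (my choice, not the answerers') because mapply simplifies equal-length results into a matrix, while Map always returns a list with one element per pair, which is easier to work with when the lists have several elements.

```r
aaa <- list(c("A5-5,73", "B3-4,73", "E3-8,73", "A1-8,73",
              "C1-7,73", "A1-2,73", "C3-2,73", "C1-1,73"))
bbb <- list(c("B3-4,73", "C3-2,73", "A5-5,73", "A1-8,73",
              "A1-2,73", "A1-5,73", "B1-1,73", "C1-4,73"))

# Values of aaa[[k]] that are absent from bbb[[k]], regardless of position
only_in_aaa <- Map(setdiff, aaa, bbb)
only_in_aaa[[1]]
# [1] "E3-8,73" "C1-7,73" "C1-1,73"

# Positions of those values within aaa[[k]]
idx_in_aaa <- Map(function(a, b) which(!a %in% b), aaa, bbb)
idx_in_aaa[[1]]
# [1] 3 5 8
```

Note that setdiff also drops duplicates, so if aaa[[k]] can contain repeated values, the index version keeps every occurrence while the setdiff version reports each value once.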

Finding matching character strings in 2 sets of data in R

I have 2 data sets; one contains information on patients, and the other is a list of medical codes
patient <- data.table(ID = rep(1:5, each = 3),
codes = c("13H42", "1B1U", "Eu410", "Je450", "Fg65", "Eu411", "Eu402", "B110", "Eu410", "Eu50",
"1B1U", "Eu513", "Eu531", "Eu411", "Eu608")
)
code <- data.table(codes = c("BG689", "13H42", "BG689", "Ju34K", "Eu402", "Eu410", "Eu50", "JE541", "1B1U",
"Eu411", "Fg605", "GT6TU"),
term = c(NA))
The code$term has values, but for this example they're omitted.
What I want is an indicator column in patient that shows 1 if a code in code occurs in patient$codes.
patient
ID codes mh
1: 1 13H42 TRUE
2: 1 1B1U TRUE
3: 1 Eu410 TRUE
4: 2 Je450 FALSE
5: 2 Fg65 FALSE
6: 2 Eu411 TRUE
7: 3 Eu402 TRUE
8: 3 B110 FALSE
9: 3 Eu410 TRUE
10: 4 Eu50 TRUE
11: 4 1B1U TRUE
12: 4 Eu513 FALSE
13: 5 Eu531 FALSE
14: 5 Eu411 TRUE
15: 5 Eu608 FALSE
My solution was to use grepl:
patient$mh <- mapply(grepl, pattern=code$codes, x=patient$codes)
However, this didn't work because code isn't the same length as patient, and I got the warning
Warning message:
In mapply(grepl, pattern = code$codes, x = patient$codes) :
longer argument not a multiple of length of shorter
Any solutions for an exact match?
You can do this:
patient[,mh := codes %in% code$codes]
Update:
As rightly suggested by Pasqui, for getting 0s and 1s,
you can further do:
patient[,mh := as.numeric(mh)]
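For context on why the mapply attempt failed: mapply pairs the n-th pattern with the n-th patient code and recycles the shorter vector, so it never compares each patient code against the whole lookup table, and grepl tests for a substring or regex match rather than an exact one. %in% does the exact membership test directly. The same idea in base R, as a sketch that assumes plain data frames instead of data.table:

```r
patient <- data.frame(
  ID = rep(1:5, each = 3),
  codes = c("13H42", "1B1U", "Eu410", "Je450", "Fg65", "Eu411", "Eu402", "B110",
            "Eu410", "Eu50", "1B1U", "Eu513", "Eu531", "Eu411", "Eu608"),
  stringsAsFactors = FALSE
)
code <- data.frame(
  codes = c("BG689", "13H42", "BG689", "Ju34K", "Eu402", "Eu410", "Eu50",
            "JE541", "1B1U", "Eu411", "Fg605", "GT6TU"),
  term = NA,
  stringsAsFactors = FALSE
)

# %in% checks each patient code for exact membership in the lookup vector,
# so the two vectors may have different lengths
patient$mh <- patient$codes %in% code$codes

# Optional 0/1 indicator instead of TRUE/FALSE
patient$mh_num <- as.integer(patient$mh)
```

The comparison is case-sensitive, so "Je450" in patient does not match "JE541" or any other differently cased entry in code.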
EDIT: others have posted better answers. I like the %in% one from @moto myself. Much more concise, and much more efficient. Stick with those :)
This should do it. I've used a for loop, so you might figure something out that would be more efficient. I've also split the loop up into a few lines, rather than squeezing it into one. That's just so you can see what's happening:
for( row in 1:nrow(patient) ) {
codecheck <- patient$codes[row]
output <- ifelse( sum( grepl( codecheck, code$codes ) ) > 0L, 1, 0 )
patient$new[row] <- output
}
So this just goes through the patient list one by one, checks for a match using grepl, then puts the result (1 for match, 0 for no match) back into the patient frame, as a new column.
Is that what you're after?

Resources