Finding strings in columns of dataframe in R [duplicate] - r

This question already has answers here:
R grep: is there an AND operator?
(4 answers)
Closed 8 years ago.
So if I have the following
list <- c("catdog","cat","dog")
names <- c("Fabio","John","Jack")
df <- data.frame(names, list, stringsAsFactors=FALSE)
  names   list
1 Fabio catdog
2  John    cat
3  Jack    dog
Assume list is a column of strings. How can I return the rows that contain both "cat" AND "dog"? Once they have appeared as a pair, either may appear more times. I tried:
want <- c("cat","dog")
df[grepl(paste(want,collapse="&"),df$list),]
I know this works with "|", but for some reason it's not working with "&". Let me know if anyone can help with this. Thanks!

This is an option, if neither 'cat' nor 'dog' can repeat within a single string:
df[grepl('(cat|dog).*(cat|dog)', df$list), ]
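Regular expressions have no AND operator, so "&" in a pattern is matched as a literal character. One way to require every pattern, sketched below, is to test each one separately and combine the logical vectors. The extra row "Ann" with "catcat" is a hypothetical addition to show that a repeated word alone does not match.

```r
# Example data; the "Ann"/"catcat" row is a hypothetical addition.
df <- data.frame(
  names = c("Fabio", "John", "Jack", "Ann"),
  list  = c("catdog", "cat", "dog", "catcat"),
  stringsAsFactors = FALSE
)
want <- c("cat", "dog")

# One logical vector per pattern; fixed = TRUE treats each as a literal string.
hits <- lapply(want, grepl, x = df$list, fixed = TRUE)

# Element-wise AND across all patterns.
keep <- Reduce(`&`, hits)

result <- df[keep, ]
print(result)
```

This approach works for any number of patterns in want and does not depend on the order in which they appear in the string.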

Related

R: Count frequency of string (inside parentheses) in a dataframe column using regexp [duplicate]

This question already has answers here:
R frequency count by matching strings
(5 answers)
Closed 2 years ago.
The column funders of dataframe research lists names of funders in parentheses, like this:
Funder 1 (FWF)
Another Funding Organization (FWF)
Funder 2 (ERC) supported this research.
Yet another one (Leverhulme Trust), and another (ERC). They helped us!
We thank this funder (FWF) for their support
I want to extract all the funder names within parentheses and rank them based on their frequency counts.
I could not do it with this code:
df <- data.frame(table(research$funders))
funder <- "(?<=\\().*?(?=\\))"
sapply(df, function(x) {
  sapply(funder, function(y) {
    sum(grepl(y, x, perl=TRUE))
  })
})
The code shows this unwelcome result:
Var1.(?<=\\().*?(?=\\)) Freq.(?<=\\().*?(?=\\))
97 0
But I instead would like to have:
FWF 3
ERC 2
Leverhulme Trust 1
How can I do that? Thank you for your help.
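You can extract everything inside round brackets using regmatches and gregexpr, then use table to count its frequencies. Here string stands for the character vector holding the funders column (research$funders in the question).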
table(unlist(regmatches(string, gregexpr('\\(.*?\\)', string))))
#              (ERC)              (FWF) (Leverhulme Trust)
#                  2                  3                  1
The same thing can be done using stringr::str_extract_all:
table(unlist(stringr::str_extract_all(string, '\\(.*?\\)')))
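The patterns above keep the parentheses in the extracted names, whereas the question asks for bare names ranked by frequency. A minimal sketch of one way to get that, assuming the funders column is a character vector like the sample in the question (the variable name funders is chosen here for illustration):

```r
# Sample data copied from the question.
funders <- c(
  "Funder 1 (FWF)",
  "Another Funding Organization (FWF)",
  "Funder 2 (ERC) supported this research.",
  "Yet another one (Leverhulme Trust), and another (ERC). They helped us!",
  "We thank this funder (FWF) for their support"
)

# Lookbehind/lookahead match the text between "(" and ")" without the
# parentheses themselves; this needs perl = TRUE.
m <- regmatches(funders, gregexpr("(?<=\\()[^)]*(?=\\))", funders, perl = TRUE))

# Count every extracted name and rank by frequency.
counts <- sort(table(unlist(m)), decreasing = TRUE)
print(counts)
```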

How to get list of all combinations of pairs of character strings in R [duplicate]

This question already has answers here:
How to generate all possible combinations of vectors without caring for order?
(2 answers)
Closed 3 years ago.
I have multiple character strings, let's say 'pred_1', 'pred_2' and 'pred_3'. Now I want to get a list with all pairs of the strings. So, the resulting list should contain 'pred_1 pred_2', 'pred_1 pred_3' and 'pred_2 pred_3'. Does anyone know how to automate this for more than three character strings?
One option is combn:
combn(v1, 2, simplify = FALSE)
Data:
v1 <- paste0("pred_", 1:3)
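With simplify = FALSE, combn returns a list of length-two character vectors. If single strings such as 'pred_1 pred_2' are wanted, as in the question, one possibility is to pass paste as the FUN argument. The sketch below uses four strings to show that it scales beyond three.

```r
# Four input strings instead of three, to show the general case.
v1 <- paste0("pred_", 1:4)

# FUN is applied to each combination; collapse = " " is forwarded to paste.
pairs <- combn(v1, 2, FUN = paste, collapse = " ")
print(pairs)

# The number of pairs equals choose(n, 2).
n_pairs <- length(pairs)
```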

Remove part of column name post the second "_" [duplicate]

This question already has answers here:
Exclude everything after the second occurrence of a certain string
(2 answers)
Closed 3 years ago.
I have a vector which has names of the columns
group <- c("amount_bin_group", "fico_bin_group", "cltv_bin_group", "p_region_bin")
I want to remove the part after the second "_" from each element, i.e. I want it to be
group <- c("amount_bin", "fico_bin", "cltv_bin", "p_region")
I could split this into two vectors and try gsub or substr. However, it would be nice to do it in a single vectorized call. Any thoughts?
I checked other posts on the same question, but none of them matches this setup.
> sub("(.*)_.*$", "\\1", group)
[1] "amount_bin" "fico_bin" "cltv_bin" "p_region"
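Note that the greedy (.*) in the pattern above strips everything after the last underscore. That coincides with the second underscore here only because every name has exactly two. A sketch of a pattern anchored to the second underscore follows; the elements "a_b_c_d" and "single_part" are hypothetical additions to show the difference.

```r
# The last two elements are hypothetical test cases.
group <- c("amount_bin_group", "fico_bin_group", "cltv_bin_group",
           "p_region_bin", "a_b_c_d", "single_part")

# Capture "text_text" up to (not including) the second underscore,
# then drop the rest. Names with fewer than two underscores do not
# match and are returned unchanged.
trimmed <- sub("^([^_]*_[^_]*)_.*$", "\\1", group)
print(trimmed)
```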

Edit character length of row names in R [duplicate]

This question already has answers here:
Getting and removing the first character of a string
(7 answers)
Closed 5 years ago.
I have recently started working in bioinformatics, and I need to edit the row.names of my variable. Here is my situation:
I have clinical data and gene expression values downloaded from The Cancer Genome Atlas. I need to match row names, but in the clinical data they look like "TCGA-6D-AA2E", while in the gene expression data they look like "TCGA-6D-AA2E-01A-11R-A38B-07".
Normally I use the match command to match row names, but here the character lengths differ. So my question is: is there an easy way to edit the character length of row names?
You could use the grep function instead:
gene.names <- c("TCGA-6D-AA2E-01A-11R-A38B-07", "TCGC-6D-AA2E-01A-11R-A38B-07", "TAGA-6D-AA2E-01A-11R-07", "TCGA-6D-AA2E-A38B-07")
pick <- "TCGA-6D-AA2E"
grep(pick, gene.names)
# [1] 1 4
Edit based on the comment: use substr to pick the first 12 characters:
substr(gene.names, 1, 12)
#[1] "TCGA-6D-AA2E" "TCGC-6D-AA2E" "TAGA-6D-AA2E" "TCGA-6D-AA2E"
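Once the long identifiers are truncated, the usual match call from the question works again. A minimal sketch, assuming the first 12 characters always hold the identifier to match on; apart from "TCGA-6D-AA2E", the identifiers below are made up for illustration.

```r
# Hypothetical identifiers, except "TCGA-6D-AA2E" from the question.
clinical.ids <- c("TCGA-6D-AA2E", "TCGA-XX-0000")
expr.ids <- c("TCGA-AB-1234-01A-11R-A38B-07", "TCGA-6D-AA2E-01A-11R-A38B-07")

# Truncate the expression identifiers to the first 12 characters.
short.ids <- substr(expr.ids, 1, 12)

# For each clinical id, the position of the first matching expression row
# (NA when there is none).
idx <- match(clinical.ids, short.ids)
print(idx)
```

If several expression rows share the same 12-character prefix, match returns only the first of them, so duplicates may need separate handling.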

very simple subset selection in r [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to remove rows of a matrix by row name, rather than numerical index?
removing elements in one vector from another in R
I have two vectors:
a<-c(1,2,3,4,5,6,7,8)
b<-c(7,3,6,4,8,1)
I would like to select those elements of a which are not in b
I tried subset(a, a!=b) but I get the warning:
longer object length is not a multiple of shorter object length
Try setdiff for vectors:
R> setdiff(a,b)
[1] 2 5
Try this:
a[!(a%in%b)]
Look at ?"%in%".
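The warning in the question arises because a != b compares the vectors element by element, recycling the shorter one, rather than testing membership. The sketch below compares the two suggested approaches; the vector dup is a hypothetical addition to show that they differ when the input has repeated values.

```r
a <- c(1, 2, 3, 4, 5, 6, 7, 8)
b <- c(7, 3, 6, 4, 8, 1)

# Membership test: keep elements of a that do not occur anywhere in b.
by_in <- a[!(a %in% b)]

# Set difference: same elements here.
by_setdiff <- setdiff(a, b)

# Hypothetical vector with a repeated value.
dup <- c(2, 2, 5)
dup_in <- dup[!(dup %in% b)]   # keeps the repeated 2
dup_setdiff <- setdiff(dup, b) # returns unique values only

print(by_in)
print(dup_in)
print(dup_setdiff)
```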
