R - Identifying only strings ending with A and B in a column - r

I have a column in a data frame in R that contains sample names. Some names are identical except for a trailing A or B, and some samples are repeated, like this:
df <- data.frame(Samples = c("S_026A", "S_026B", "S_028A", "S_028B", "S_038A", "S_040_B", "S_026B", "S_38A"))
What I am trying to do is isolate all sample names that occur with both an A and a B at the end, excluding the sample names that occur with only an A or only a B.
The end result of what I'm looking for would look like this:
"S_026" and "S_028" as these are the only ones that have A and B at the end.
All I seem to find is how to remove duplicates, and removing duplicates would only give me "S_026B" and "S_38A" in this case.
Alternatively, I have tried stripping the A's and B's at the end, counting how many times each stripped name occurs, and checking for a count > 2, but this does not give me the desired result either.
Any suggestions?

We could group by the sample name without its last character (using substr), then use substring to get the last character and check whether both 'A' and 'B' occur within each group
library(dplyr)
df %>%
group_by(grp = substr(Samples, 1, nchar(Samples)-1)) %>%
filter(all(c("A", "B") %in% substring(Samples, nchar(Samples)))) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 5 x 1
Samples
<chr>
1 S_026A
2 S_026B
3 S_028A
4 S_028B
5 S_026B

You can extract the last character of Samples into a separate column, keep only the groups that have both 'A' and 'B', and then keep only the unique values.
library(dplyr)
library(tidyr)
df %>%
extract(Samples, c('value', 'last'), '(.*)(.)') %>%
group_by(value) %>%
filter(all(c('A', 'B') %in% last)) %>%
ungroup %>%
distinct(value)
# value
# <chr>
#1 S_026
#2 S_028
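For comparison, the same grouping idea can be sketched in base R without dplyr or tidyr. This is only an illustration, not one of the original answers; the names stem, suffix, has_both and both_stems are made up for the example.

```r
# Example data from the question
df <- data.frame(Samples = c("S_026A", "S_026B", "S_028A", "S_028B",
                             "S_038A", "S_040_B", "S_026B", "S_38A"))

# Split each name into a stem (everything but the last character) and a suffix
stem   <- sub(".$", "", df$Samples)
suffix <- sub("^.*(.)$", "\\1", df$Samples)

# For each stem, check whether both "A" and "B" occur among its suffixes
has_both <- vapply(split(suffix, stem),
                   function(s) all(c("A", "B") %in% s),
                   logical(1))

# Keep the stems where the check is TRUE
both_stems <- names(has_both)[has_both]
both_stems  # "S_026" "S_028"
```

Note that "S_040_B" ends up with the stem "S_040_" and "S_38A" with the stem "S_38", so neither is matched with another sample.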

Related

How can I find the unique combinations based on two columns? [duplicate]

I need to find the unique entries in my dataframe using column ID and Genus. I do not need to find unique values from column Count. My dataframe is structured like this:
ID Genus Count
A Genus1 4
A Genus18 265
A Genus28 1
A Genus2 900
B Genus1 85
B Genus18 9
B Genus28 24
B Genus2 6
B Genus3000 152
The resulting dataframe would have only
ID Genus Count
B Genus3000 152
In it because this row is unique by ID and Genus.
I have tidyverse loaded but have had trouble trying to get the result I need. I tried using distinct() but continue to get back all data from the input as output.
I have tried the following:
uniquedata <- mydata %>% distinct(.keep_all = TRUE)
uniquedata <- mydata %>% group_by(ID, Genus) %>% distinct(.keep_all = TRUE)
uniquedata <- mydata %>% distinct(ID, Genus, .keep_all = TRUE)
uniquedata <- mydata %>% distinct()
What should I use to achieve my desired output?
We could use add_count in combination with filter:
library(dplyr)
df %>%
add_count(Genus) %>%
filter(n == 1) %>%
select(ID, Genus, Count)
Output:
ID Genus Count
<chr> <chr> <dbl>
1 B Genus3000 152
For the given data set, it is enough to count how often each value of the column "Genus" appears and then keep only the rows whose value appears exactly once.
df %>% count(Genus) -> countGenus
filter(df, Genus %in% filter(countGenus,n==1)$Genus)
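The distinct() attempts in the question return every row because distinct() keeps the first row of each combination rather than dropping combinations that repeat. A base R sketch of "drop every value that occurs more than once" is shown below. It assumes, as the answers above do, that uniqueness is judged on Genus alone; mydata is rebuilt here from the table in the question, and is_repeated and unique_rows are names made up for the example.

```r
# Data frame rebuilt from the table in the question
mydata <- data.frame(
  ID    = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
  Genus = c("Genus1", "Genus18", "Genus28", "Genus2",
            "Genus1", "Genus18", "Genus28", "Genus2", "Genus3000"),
  Count = c(4, 265, 1, 900, 85, 9, 24, 6, 152)
)

# duplicated() flags only the second and later occurrences, so scan from
# both ends to flag every row whose Genus occurs more than once
is_repeated <- duplicated(mydata$Genus) | duplicated(mydata$Genus, fromLast = TRUE)

# Keep the rows whose Genus occurs exactly once
unique_rows <- mydata[!is_repeated, ]
unique_rows
```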

Paste column content by group into a new group

Here is my data frame:
a <- data.frame(x=c(rep("A",2),rep("B",4)),
y=c("AA","BB","CC","AA","DD","AA"))
What I want is group the data frame by x and for each member of the group (here A or B), I would like to paste the content of column y into a single element, separated by _. I would like to sort it by alphabetical order and remove identical characters. Here is the desired result:
out <- data.frame(x=c(rep("A",1),rep("B",1)),
y=c("AA_BB","AA_CC_DD"))
I tried the following code, which produces an error message:
library(dplyr)
a %>% group_by(x) %>% mutate(y_comb=paste(as.character(sort(unique(y))))) %>%
slice(1) %>% ungroup()
We take the distinct rows of the 'x' and 'y' columns (as there are only two columns, simply use distinct on the entire data), arrange the rows by 'x' and 'y', group by 'x', and paste (str_c) the 'y' elements into a single string by collapsing with _
library(dplyr)
library(stringr)
a %>%
distinct %>%
arrange(x, y) %>%
group_by(x) %>%
summarise(y = str_c(y, collapse="_"))
-output
# A tibble: 2 x 2
# x y
#* <chr> <chr>
#1 A AA_BB
#2 B AA_CC_DD
The error in the OP's code arises because unique changes the length of y, and paste by itself does not join the elements into one string. We need to specify either collapse or sep (in this case, collapse). mutate requires a result with the same length as the number of rows in the group (or length 1), while summarise does not.
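The length mismatch can be seen in a small base R sketch using the y values of group "B" from the question; y_b, no_collapse and collapsed are names made up for the example.

```r
# y values of group "B" from the question (4 rows)
y_b <- c("CC", "AA", "DD", "AA")

# Without collapse, paste returns one string per unique value
no_collapse <- paste(sort(unique(y_b)))
length(no_collapse)  # 3, which does not match the 4 rows of the group

# With collapse, the values are joined into a single string
collapsed <- paste(sort(unique(y_b)), collapse = "_")
collapsed  # "AA_CC_DD"
```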
Perhaps we can do like this
a %>%
group_by(x) %>%
summarise(y = paste0(sort(unique(y)), collapse = "_"))
which gives
# A tibble: 2 x 2
x y
<chr> <chr>
1 A AA_BB
2 B AA_CC_DD
Base R option with aggregate :
aggregate(y~x, unique(a), function(x) paste0(sort(x), collapse = '_'))
# x y
#1 A AA_BB
#2 B AA_CC_DD

Determine the size of string in a particular cell in dataframe: R

In a data frame, I have a column (type: chr) that contains answers separated by a comma. I want to create another column based on the size of the string and award points. For example, some of the entries in a column are:
Column1
word1,word2,word3
word1,word2
word1
Now, for the first cell, I want the size of the cell to be evaluated as 3 (as it contains three distinct word and there are no duplicates in the cell values). I'm not sure how do I achieve this.
An option is to split the column with strsplit into a list of vectors, get the unique elements by looping over the list with lapply and get the lengths
df1$Size <- lengths(lapply(strsplit(df1$Column1, ",\\s*"), unique))
Another option is separate_rows from tidyr
library(dplyr)
library(tidyr)
df1 %>%
mutate(rn = row_number()) %>%
separate_rows(Column1) %>%
group_by(rn) %>%
summarise(Size = n_distinct(Column1), .groups = 'drop') %>%
select(Size) %>%
bind_cols(df1, .)
-output
# Column1 Size
#1 word1,word2,word3 3
#2 word1,word2 2
#3 word1 1
data
df1 <- data.frame(Column1 = c('word1,word2,word3', 'word1,word2', 'word1'))
Original Answer:
Another option:
library(dplyr)
library(stringr)
df %>%
mutate(Lengths = str_count(Column1, ",") + 1)
Edit:
I hadn't read the OP's requirements properly (about non-duplicates). As @Onyambu pointed out in the comments, this chunk only works if there are no duplicated words in the data.
It basically counts how many words there are.
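The difference between counting words and counting distinct words only shows up when a cell repeats a word. The sketch below uses a made-up vector x with one such cell, and a base R comma count as a stand-in for str_count(Column1, ",") + 1.

```r
# Made-up input: the first cell repeats "word1"
x <- c("word1,word2,word1", "word1,word2", "word1")

# Number of words: count the commas and add 1
comma_count <- lengths(regmatches(x, gregexpr(",", x, fixed = TRUE))) + 1
comma_count  # 3 2 1

# Number of distinct words: split, drop duplicates, then count
distinct_count <- lengths(lapply(strsplit(x, ",\\s*"), unique))
distinct_count  # 2 2 1
```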

R: Remove duplicated rows in a dataframe which contains in a second column a value

I have a data.frame() in R which contains 3 columns:
id<-c(12312, 12312, 12312, 48373, 345632, 223452)
id2<-c(1928277, 17665363, 8282922, 82827722, 1231233,12312333)
description<-c("Positive", "Negative", "Indetermined", "Positive", "Negative", "Positive")
I want to delete the duplicated rows by id which in description have the value of Indetermined.
This seems like a problem for filter(), so:
library(dplyr)
# df is assumed to be data.frame(id, id2, description)
df %>%
mutate(count = 1) %>% # give every row a count of 1
group_by(id) %>%
mutate(count = sum(count), Duplicate = count > 1) %>% # count how often each id occurs and mark duplicates
ungroup() %>%
filter(!(Duplicate & description == "Indetermined")) # drop rows whose id is duplicated and whose description is "Indetermined"
Not the best approach, but this should do the trick.
library(tibble)
(d <- tibble(id, id2, description))
# drops every row whose id has an "Indetermined" entry
d[!d$id %in% (d$id[d$description == "Indetermined"]), ]
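The question can be read in two ways: drop only the "Indetermined" rows of ids that occur more than once, or drop every row of an id that has an "Indetermined" entry. The base R sketch below shows both readings side by side. The data frame d is rebuilt from the vectors in the question with the descriptions quoted; is_dup_id, drop_indetermined_rows and drop_whole_id are names made up for the example.

```r
# Data rebuilt from the question, with the descriptions as strings
d <- data.frame(
  id  = c(12312, 12312, 12312, 48373, 345632, 223452),
  id2 = c(1928277, 17665363, 8282922, 82827722, 1231233, 12312333),
  description = c("Positive", "Negative", "Indetermined",
                  "Positive", "Negative", "Positive")
)

# Flag every row whose id occurs more than once
is_dup_id <- d$id %in% d$id[duplicated(d$id)]

# Reading 1: remove only the "Indetermined" rows of duplicated ids
drop_indetermined_rows <- d[!(is_dup_id & d$description == "Indetermined"), ]
nrow(drop_indetermined_rows)  # 5

# Reading 2: remove all rows of any id that has an "Indetermined" row
drop_whole_id <- d[!d$id %in% d$id[d$description == "Indetermined"], ]
nrow(drop_whole_id)  # 3
```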

Iterating (Looping) through columns of a dataframe in R

I am struggling in R and hope that someone can help me out. I am trying to write a for loop to iterate over the columns of a data frame, but unfortunately, I am not successful.
So here is my Problem:
I have 10 data frames (dt1, dt2 ,dt3,…,dt10). For example, dt1 looks like this:
dt1<-data.frame(Topic1=c(1,2,3,4,5,6,7,8,9),Topic2=c(9,8,7,6,5,4,3,2,1), Topic3=c(1,9,2,8,3,7,4,6,5), Name=c("A","A","A","A","A","B","B","B","B"))
I want to check whether the Name variable still contains "A" and "B" when I filter Topic1 (then Topic2, Topic3, …) to values greater than 5. At the moment, I do the following:
library(dplyr)
dt.new<-dt1 %>% filter(Topic1>5)
isTRUE("A" %in% dt.new$Name && "B" %in% dt.new$Name)
At the end of the day, for each data frame, I want to have a new table (data frame) that looks like this:
result<-data.frame(Topic=c("Topic1","Topic2","Topic3"),Return=c("FALSE","FALSE","TRUE"))
Now the problem is, that I have several data frames (dt1, dt2…) each of them has more than 50 variables (Topic1,…, Topic50).
I've written some loops so far and tried it out. But unfortunately without success. Therefore I would be happy to receive any hint or tip.
Thank you very much!
An option would be to group by 'Name', summarise the columns whose names start with 'Topic' by checking whether any value is greater than 5, then use gather (superseded in newer versions of tidyr by pivot_longer) to convert from 'wide' to 'long' format, group by the 'Topic' column, and summarise by checking whether all the 'val' elements are TRUE
library(dplyr)
library(tidyr)
dt1 %>%
group_by(Name) %>%
summarise_at(vars(starts_with('Topic')), ~ any(. > 5)) %>%
gather(Topic, val, -Name) %>%
group_by(Topic) %>%
summarise(Return = all(val))
# A tibble: 3 x 2
# Topic Return
# <chr> <lgl>
#1 Topic1 FALSE
#2 Topic2 FALSE
#3 Topic3 TRUE
Or reshape it to 'long' format first and then do the summarisation
dt1 %>%
pivot_longer(cols = -Name, names_to = "Topic") %>%
filter(value > 5) %>%
group_by(Topic) %>%
summarise(result = n_distinct(Name) == 2)
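Since the question mentions several data frames (dt1, dt2, …), one way to reuse the check is to wrap it in a function and apply it to a list. The sketch below is a base R illustration rather than part of the answers above; check_topics is a hypothetical helper, and its threshold and names_needed arguments are assumptions that mirror the values used in the question.

```r
# Example data frame from the question
dt1 <- data.frame(Topic1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                  Topic2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1),
                  Topic3 = c(1, 9, 2, 8, 3, 7, 4, 6, 5),
                  Name   = c("A", "A", "A", "A", "A", "B", "B", "B", "B"))

# Hypothetical helper: for each Topic column, check whether every name in
# names_needed still occurs after keeping only rows above the threshold
check_topics <- function(dt, threshold = 5, names_needed = c("A", "B")) {
  topic_cols <- grep("^Topic", names(dt), value = TRUE)
  keeps_all <- vapply(topic_cols, function(col) {
    all(names_needed %in% dt$Name[dt[[col]] > threshold])
  }, logical(1))
  data.frame(Topic = topic_cols, Return = unname(keeps_all))
}

result <- check_topics(dt1)
result$Return  # FALSE FALSE TRUE

# With more data frames, collect them in a named list and apply the helper
results <- lapply(list(dt1 = dt1), check_topics)
```

Unlike the character values in the question's desired result, Return is logical here, which is usually easier to filter on.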
