Find duplicates in comma-separated string data using R

I am running several data sets through a loop and ended up with the results below. Each line is the output for one filtered data set.
I am using this code to get the results below:
output <- print(paste(data.final$Peptide, collapse = ','))
The data were originally a table with a column named "Peptide", so I paste the peptides into a single comma-separated string, as shown here:
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY"
I would like to count, for each comma-separated peptide (e.g. LPPAYTNSF), how many times it is duplicated across the lines.
Is there any way to do this?

If I understand you correctly, there is no reason to collapse them into a string first. Why not get the counts straight from your data.frame with dplyr's count()? Any peptide with n > 1 is duplicated.
data.final %>% count(Peptide)
Solution based on your string:
table(unlist(strsplit(v, ",")))
Results:
ASFSTFKCY CVADYSVLY KIYSKHTPI LPFNDGVYF LPPAYTNSF LPSAYTNSF QSYGFQPTY RLFRKSNLK SANNCTFEY SAYTNSFTR TSNQVAVLY WMESEFRVY
8 8 8 8 6 2 6 8 8 2 8 8
WTAGAAAYY YLQPRTFLL YNSASFSTF YSSANNCTF
8 8 8 8
Data:
v <- "LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY"
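If the goal is to count in how many lines each peptide occurs, rather than the total number of occurrences, the lines can be kept as separate strings and de-duplicated per line before tabulating. A minimal base R sketch, assuming a hypothetical character vector `lines` with one comma-separated string per loop iteration (the toy peptide names are made up for illustration):

```r
# Hypothetical toy input: one comma-separated string per loop iteration
lines <- c("AAA,BBB,CCC",
           "AAA,BBB,DDD",
           "AAA,CCC")

# Split every line into its individual peptides
peptides <- strsplit(lines, ",", fixed = TRUE)

# unique() per line guards against a peptide repeated within one line,
# so the table counts lines, not occurrences
line_counts <- table(unlist(lapply(peptides, unique)))

# Keep only peptides that occur in more than one line
shared <- line_counts[line_counts > 1]
print(shared)
```

Here `shared` keeps the peptide names as its names, so it can be subset or sorted like any named vector.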

Not quite sure what your data are like so I've scribbled together some toy data to show how a regex and a non-regex tidyverse solution would work on your task of "find[ing] the number of duplicates from each comma-separated strings":
Data:
x <- c("a,c,x,a,f,s,w,s,b,n,x,q",
"A,B,B,X,B,Q")
A regex tidyverse solution:
library(tidyverse)
data.frame(x) %>%
  # create new column with duplicated values:
  mutate(dups = str_extract_all(x, "\\b([A-Za-z]+)\\b(?=.*\\b\\1\\b)"))
x dups
1 a,c,x,a,f,s,w,s,b,n,x,q a, x, s
2 A,B,B,X,B,Q B, B
NB: the + sits inside the capture group, so the backreference \\1 matches the whole token rather than only its last character; together with the \\b word boundaries this lets the pattern work for longer-than-one-character alphabetic comma-separated strings as well.
Alternatively, a non-regex tidyverse solution:
library(tidyverse)
data.frame(x) %>%
# make column with unique ID for each string:
mutate(stringID = row_number()) %>%
# separate values into rows:
separate_rows(x) %>%
# for each combination of `stringID` and `x`...
group_by(stringID, x) %>%
# ...count the number of tokens:
summarise(N = n()) %>%
# show only the duplicated values:
filter(N > 1)
# A tibble: 4 × 3
# Groups: stringID [2]
stringID x N
<int> <chr> <int>
1 1 a 2
2 1 s 2
3 1 x 2
4 2 B 3
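For comparison, the same within-string check can probably be done without any packages. This is only a sketch on the toy data above, using strsplit() and duplicated(); the names `tokens` and `dups` are made up for the example:

```r
# Same toy data as above
x <- c("a,c,x,a,f,s,w,s,b,n,x,q",
       "A,B,B,X,B,Q")

# Split each string into its comma-separated tokens
tokens <- strsplit(x, ",", fixed = TRUE)

# duplicated() flags the second and later occurrences of a token;
# unique() reports each duplicated token once per string
dups <- lapply(tokens, function(tok) unique(tok[duplicated(tok)]))
print(dups)
```

Unlike the regex version, each duplicated value is listed once per string, in the order in which its first repeat appears.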

Related

How to validate whether a word in a dataframe is a name in R

I have written an R script that filters data and saves it to a new data frame. Among other things, it filters the most frequently used words as follows:
Word Times
Oliver 3
Great 8
Jacob 2
Fantastic 6
Is there a way in R to collapse that data frame so it looks like this, given a list of names?
(That is, take the names, sum their counts, and add a new row called Names holding the total number of times any name appeared.)
Word Times
Names 5 # Oliver [3] + Jacob [2]
Great 8
Fantastic 6
I use x instead of names as the name of the vector. The output below has the words in df in lowercase.
Base R way
x <- c('oliver','jacob','harry', 'jack')
y <- sum(df$Times[df$Word %in% x])
rbind(c("names", y), df[!(df$Word %in% x), ])
Word Times
1 names 5
2 great 8
4 fantastic 6
A dplyr solution
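library(dplyr)
names <- c('oliver','jacob','harry', 'jack')
# summarize_each() and funs() are deprecated in current dplyr;
# mutate() does the relabelling before the grouped sum
df %>%
  mutate(Word = ifelse(Word %in% names, "names", Word)) %>%
  group_by(Word) %>%
  summarize(sum(Times))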
# A tibble: 3 × 2
Word `sum(Times)`
<chr> <dbl>
1 fantastic 6
2 great 8
3 names 5
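The relabel-then-sum idea can also be sketched in base R with aggregate(). The data frame below is rebuilt from the question with lowercased words, which is an assumption based on the outputs above, and `name_list` is a made-up variable name:

```r
# Toy data rebuilt from the question (lowercase words are an assumption)
df <- data.frame(Word  = c("oliver", "great", "jacob", "fantastic"),
                 Times = c(3, 8, 2, 6),
                 stringsAsFactors = FALSE)
name_list <- c("oliver", "jacob", "harry", "jack")

# Relabel every word found in the name list, then sum Times per label
df$Word[df$Word %in% name_list] <- "names"
result <- aggregate(Times ~ Word, data = df, FUN = sum)
print(result)
```

Because Times stays numeric throughout, the result can be sorted or plotted without converting it back from character.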

R - Creating DFs (tibbles) in a loop. How to rename them and columns inside, to include date? (I do it with eval(..), but is there a better solution?)

I have a loop that creates a tibble, tbl, at the end of each iteration. Each iteration uses a different date, date.
Assume:
tbl <- tibble(colA=1:5,colB=5:9)
date <- as.Date("2017-02-28")
> tbl
# A tibble: 5 x 2
colA colB
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
(contents are changing every loop, but tbl, date and all columns (colA, colB) names remain the same)
The output that I want needs to start with output - outputdate1, outputdate2 etc.
With columns inside it as colAdate1, colBdate1, and colAdate2, colBdate2 and so on.
At the moment I am using this piece of code, which works, but is not easy to read:
library(lubridate)  # assumed source of year(); months() is base R
eval(parse(text = (
  paste0("output", year(date), months(date), " <- tbl %>% rename(colA", year(date), months(date), " = 'colA', colB", year(date), months(date), " = 'colB')")
)))
It produces this code for eval(parse(...)) to evaluate:
"output2017February <- tbl %>% rename(colA2017February = 'colA', colB2017February = 'colB')"
Which gives me the output that I want:
> output2017February
# A tibble: 5 x 2
colA2017February colB2017February
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
Is there a better way of doing this? (Preferably with dplyr)
Thanks!
This avoids eval and is easier to read:
ym <- "2017February"
assign(paste0("output", ym), setNames(tbl, paste0(names(tbl), ym)))
Partial rename
If you only wanted to replace the names in the character vector old with the corresponding names in the character vector new then use the following:
assign(paste0("output", ym),
setNames(tbl, replace(names(tbl), match(old, names(tbl)), new)))
Variation
You might consider putting your data frames in a list instead of having a bunch of loose objects in your workspace:
L <- list()
L[[paste0("output", ym)]] <- setNames(tbl, paste0(names(tbl), ym))
.GlobalEnv could also be used in place of L (omitting the L <- list() line) if you like this style but still want the objects stored separately in the global environment.
dplyr
Here it is using dplyr and rlang but it does involve increased complexity:
library(dplyr)
library(rlang)
.GlobalEnv[[paste0("output", ym)]] <- tbl %>%
rename(!!!setNames(names(tbl), paste0(names(tbl), ym)))
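Putting the list variation into the loop itself might look roughly like this. It is a sketch under a few assumptions: a plain data.frame stands in for the tibble, the `dates` vector is invented, and format(d, "%Y%m") is swapped in for year(date)/months(date) so the suffix does not depend on the locale's month names:

```r
# Hypothetical dates to loop over
dates <- as.Date(c("2017-02-28", "2017-03-31"))

outputs <- list()
for (i in seq_along(dates)) {
  d <- dates[i]  # index into dates: `for (d in dates)` would drop the Date class

  # Stand-in for the table built in each iteration
  tbl <- data.frame(colA = 1:5, colB = 5:9)

  # Suffix such as "201702" (assumption: numeric month instead of month name)
  ym <- format(d, "%Y%m")

  # Store the renamed table under a date-specific name
  outputs[[paste0("output", ym)]] <- setNames(tbl, paste0(names(tbl), ym))
}
names(outputs)
```

Each table is then available as, for example, outputs$output201702, and the whole set can be processed with lapply() instead of looking up loose objects by name.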

Sorting Interior of a column in R

I am trying to sort the interior of a column in R. For example I have this:
ID HoursAvailable
1 a,b,c,k,d
2 e,g,h
3 a,b,c,h,d
And I am trying to sort the values within each cell of the column, like this:
ID HoursAvailable
1 a,b,c,d,k
2 e,g,h,,
3 a,b,c,d,h
I have tried to use the separate function like this:
cdMCd<- cdMf %>% separate(HoursAvailable, c("a","b","c","d","e","f","g","h","i","j"))
But I cannot get it to sort correctly. For this example e in ID 2 would be sorted into the a column, but I need it sorted into the e column. I was planning to separate all the hours into separate columns, order, then recombine, but I cannot get them to separate correctly.
library(dplyr)
dt = read.table(text="
ID HoursAvailable
1 a,b,c,k,d
2 e,g,h
3 a,b,c,h,d
", header=T, stringsAsFactors=F)
# split one string on commas, sort the pieces, and paste them back together
SortString = function(x) {
  paste0(sort(unlist(strsplit(x, split = ","))), collapse = ",")
}
dt %>%
  rowwise() %>%
  mutate(Updated = SortString(HoursAvailable)) %>%
  ungroup()
# # A tibble: 3 x 3
# ID HoursAvailable Updated
# <int> <chr> <chr>
# 1 1 a,b,c,k,d a,b,c,d,k
# 2 2 e,g,h e,g,h
# 3 3 a,b,c,h,d a,b,c,d,h
Here is what I would do:
First create a function that sorts a single string, and then a function that applies it to a vector of strings:
library(stringr)
library(plyr)
split_and_sort <- function(x){
  x_split <- sort(unlist(str_split(x, ",")))
  return(paste(x_split, collapse = ","))
}
split_and_sort_column <- function(x){
  laply(x, split_and_sort)
}
df$HoursAvailable <- split_and_sort_column(df$HoursAvailable)
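The same split, sort, and paste steps can also be written without extra packages. A minimal sketch using the sample values from the question (the variable names `hours` and `sorted` are made up):

```r
# Sample values from the question
hours <- c("a,b,c,k,d", "e,g,h", "a,b,c,h,d")

# Split each string on commas, sort the pieces, and paste them back together;
# vapply() guarantees exactly one character string per input element
sorted <- vapply(
  strsplit(hours, ",", fixed = TRUE),
  function(parts) paste(sort(parts), collapse = ","),
  character(1)
)
print(sorted)
```

The result is a plain character vector, so it can be assigned straight back to the column.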

Summarize Data String in R

I have a large dataset where some of the information needed is stored in the first column as a string separated by semicolons. For example:
TestData <- data.frame("Information" = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves", "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown", "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves", "Forrest;Shrubs;NA"), "Data" = c(5,1,3,4,2,1,3))
Giving:
Information Data
1 Forrest;Trees;Unknown 5
2 Forrest;Trees;Leaves 1
3 Forrest;Trees;Trunks 3
4 Forrest;Shrubs;Unknown 4
5 Forrest;Shrubs;Branches 2
6 Forrest;Shrubs;Leaves 1
7 Forrest;Shrubs;NA 3
I need to simplify the names so that I only have the last unique name that isn't "Unknown" or "NA" such that my dataframe becomes:
Information Data
1 Trees;Unknown 5
2 Trees;Leaves 1
3 Trunks 3
4 Shrubs;Unknown 4
5 Branches 2
6 Shrubs;Leaves 1
7 Shrubs;NA 3
Maybe it's not the most efficient or elegant solution, but it works on the sample data. Hope it's also adequate for your needs:
library(stringr)
library(dplyr)
TestData <- data.frame("Information" = c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves", "Forrest;Trees;Trunks", "Forrest;Shrubs;Unknown", "Forrest;Shrubs;Branches", "Forrest;Shrubs;Leaves", "Forrest;Shrubs;NA"), "Data" = c(5,1,3,4,2,1,3))
# split text into 3 columns
TestData[3:5] <- str_split_fixed(TestData$Information, ";", 3)
# filter Unknown and NA values, count frequencies to determine unique values
a <- TestData %>%
  filter(!V5 %in% c("Unknown", "NA")) %>%
  group_by(V5) %>%
  summarise(count = n())
# join back to original data
TestData <- TestData %>%
  left_join(a, by = "V5")
TestData$Clean <- ifelse(TestData$count > 1 | is.na(TestData$count), paste0(TestData$V4, ";", TestData$V5), TestData$V5)
Generally it is not recommended to put multiple variables in the same column, but dplyr together with tidyr's separate() should give you what you want:
library(dplyr)
library(tidyr)
TestData_filtered <- TestData %>%
  separate(Information, into = c("common", "TS", "BL"), remove = FALSE) %>%
  filter(!grepl("Unknown|NA", BL)) %>%
  mutate(wanted = paste(TS, BL, sep = ";"))
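The rule described in the question (keep only the last name when it is unique and informative, otherwise keep the last two) can also be sketched in base R. This is one reading of the requirement, checked only against the sample data; `last`, `parent`, `ambiguous`, and `clean` are made-up names:

```r
info <- c("Forrest;Trees;Unknown", "Forrest;Trees;Leaves", "Forrest;Trees;Trunks",
          "Forrest;Shrubs;Unknown", "Forrest;Shrubs;Branches",
          "Forrest;Shrubs;Leaves", "Forrest;Shrubs;NA")

parts <- strsplit(info, ";", fixed = TRUE)

# Last level and the level just before it, one value per row
last   <- vapply(parts, function(p) p[length(p)], character(1))
parent <- vapply(parts, function(p) p[length(p) - 1], character(1))

# A last level is ambiguous if it is a placeholder or occurs more than once
ambiguous <- last %in% c("Unknown", "NA") | last %in% last[duplicated(last)]

# Prefix ambiguous levels with their parent, keep unique ones as they are
clean <- ifelse(ambiguous, paste(parent, last, sep = ";"), last)
print(clean)
```

Note that "NA" here is the literal string from the data, not a missing value, which is why it is compared with %in% rather than is.na().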

R sum of rows for different group of columns that start with similar string

I'm quite new to R and this is the first time I dare to ask a question here.
I'm working with a dataset of Likert scales and I want to compute row sums over different groups of columns that share the first part of their names.
Below I constructed a data frame of only 2 rows to illustrate the approach I followed, though I would like to receive feedback on how I can write a more efficient way of doing it.
df <- as.data.frame(rbind(rep(sample(1:5),4),rep(sample(1:5),4)))
var.names <- c("emp_1","emp_2","emp_3","emp_4","sat_1","sat_2"
,"sat_3","res_1","res_2","res_3","res_4","com_1",
"com_2","com_3","com_4","com_5","cap_1","cap_2",
"cap_3","cap_4")
names(df) <- var.names
What I did was use the grep function to select the columns whose names start with a given string, sum them by row, and store the result in a new variable. But I have to write a new line of code for each group of variables.
df$emp_t <- rowSums(df[, grep("\\bemp.", names(df))])
df$sat_t <- rowSums(df[, grep("\\bsat.", names(df))])
df$res_t <- rowSums(df[, grep("\\bres.", names(df))])
df$com_t <- rowSums(df[, grep("\\bcom.", names(df))])
df$cap_t <- rowSums(df[, grep("\\bcap.", names(df))])
But there are many more variables in the dataset and I would like to know if there is a way to do this with only one line of code. For example, some way to group the variables that start with the same string and then apply the row-sum function to each group.
Thanks in advance!
One possible solution is to transpose df and sum the rows per group with the base R rowsum function (the output below was generated after set.seed(123)):
cbind(df, t(rowsum(t(df), sub("_.*", "_t", names(df)))))
# emp_1 emp_2 emp_3 emp_4 sat_1 sat_2 sat_3 res_1 res_2 res_3 res_4 com_1 com_2 com_3 com_4 com_5 cap_1 cap_2 cap_3 cap_4 cap_t
# 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 13
# 2 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 14
# com_t emp_t res_t sat_t
# 1 15 14 11 7
# 2 15 10 12 9
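To see what the transpose-and-rowsum step does, here is a small deterministic sketch. The two-group toy data frame is made up for illustration and is not the data from the question:

```r
# Made-up toy data: two groups (emp, sat) with two items each
df <- data.frame(emp_1 = c(1, 2), emp_2 = c(3, 4),
                 sat_1 = c(5, 6), sat_2 = c(7, 8))

# Map every column to its group label: "emp_1" -> "emp_t", "sat_2" -> "sat_t"
groups <- sub("_.*", "_t", names(df))

# t(df) puts the variables in rows, rowsum() adds up the rows sharing a label,
# and the outer t() turns the result back into one row per respondent
totals <- t(rowsum(t(df), groups))
print(totals)
```

The first respondent gets emp_t = 1 + 3 and sat_t = 5 + 7, so each column of `totals` is the per-row sum for one prefix.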
Agree with MrFlick that you may want to put your data in long format (see reshape2, tidyr), but to answer your question:
cbind(
df,
sapply(split.default(df, sub("_.*$", "_t", names(df))), rowSums)
)
This will do the trick.
You'll be better off in the long run if you put your data into tidy format. The problem is that the data is in a wide rather than a long format. And the variable names, e.g., emp_1, are actually two separate pieces of data: the class of the person, and the person's ID number (or something like that). Here is a solution to your problem with dplyr and tidyr.
library(dplyr)
library(tidyr)
df %>%
gather(key, value) %>%
extract(key, c("class", "id"), "([[:alnum:]]+)_([[:alnum:]]+)") %>%
group_by(class) %>%
summarize(class_sum = sum(value))
First we convert the data frame from wide to long format with gather(). Then we split the values emp_1 into separate columns class and id with extract(). Finally we group by the class and sum the values in each class. Result:
Source: local data frame [5 x 2]
class class_sum
1 cap 26
2 com 30
3 emp 23
4 res 22
5 sat 19
Another potential solution is to use dplyr's rowwise() function together with c_across() (see https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-rowwise/):
df %>%
rowwise() %>%
mutate(emp_sum = sum(c_across(starts_with("emp"))),
sat_sum = sum(c_across(starts_with("sat"))),
res_sum = sum(c_across(starts_with("res"))),
com_sum = sum(c_across(starts_with("com"))),
cap_sum = sum(c_across(starts_with("cap"))))
