Can I split a character vector based on position in R?

This is my first post so all tips for posting are helpful :)
I want to merge two data frames with the same person-IDs, but the identifier is slightly (but systematically) different from one another.
In df A the ID is: <letter><3 digits>
In df B the ID is: <letter>-<3 digits>
See example below:
A_ID <- c("A123", "B213", "C421", "C312")
A_score <- c(8,10,9,10)
A <- data.frame(A_ID, A_score)
colnames(A) <- c("ID", "A_score")
B_ID <- c("A-123", "B-213", "C-421", "C-312")
B_score <- c(2,10,9,10)
B <- data.frame(B_ID, B_score)
colnames(B) <- c("ID", "B_score")
The problem is that because of the - in the middle of the df B identifiers, these dfs won't merge.
What I want to achieve is to merge (full join) the dfs to form the columns: ID | A_score | B_score.
I tried converting the character vector to strings, splitting the ID (A) at character position 2 after the letter, adding a -, and then pasting and converting back to a character vector. But I feel this is probably not the most efficient or easiest way.
Thank you!

Try using gsub() to clean the second ID variable and then merge the data frames in a single pipeline. Here is the code using tidyverse functions:
library(tidyverse)
#Code
NewA <- A %>% full_join(B %>% mutate(ID=gsub('-','',ID)))
Output:
ID A_score B_score
1 A123 8 2
2 B213 10 10
3 C421 9 9
4 C312 10 10

Use sub to get rid of - and then merge:
B$ID <- sub("-", "", B$ID)
merge(A, B, "ID")
ID A_score B_score
1 A123 8 2
2 B213 10 10
3 C312 10 10
4 C421 9 9
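Going the other way, as the question title asks, the IDs in A can be split at a fixed position and the hyphen inserted. The following is only a minimal base R sketch; it assumes every ID has exactly one leading letter, and it uses all = TRUE so that merge() behaves like a full join rather than its default inner join.

```r
# Toy data mirroring the question
A <- data.frame(ID = c("A123", "B213", "C421", "C312"),
                A_score = c(8, 10, 9, 10))
B <- data.frame(ID = c("A-123", "B-213", "C-421", "C-312"),
                B_score = c(2, 10, 9, 10))

# Split each ID by position: first character, then everything after it
# (assumption: the prefix is always exactly one character long)
prefix <- substr(A$ID, 1, 1)
digits <- substring(A$ID, 2)
A$ID <- paste0(prefix, "-", digits)

# all = TRUE keeps unmatched rows from both sides (a full join)
merged <- merge(A, B, by = "ID", all = TRUE)
print(merged)
```

If the prefix length varies, a regular expression such as the sub() call above is probably the safer choice.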

Related

Find duplicates in comma-separated string data using R

I am currently running several data sets through a loop and ended up with the results below. Each line corresponds to the output from one filtered data set.
I am using this code to get the results below.
output <- print(paste(data.final$Peptide, collapse = ','))
The data was originally a table with a column named "Peptide", so I am pasting the peptides into a single comma-separated string, as shown here:
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY"
[1]"LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY"
I would like to find the number of duplicates of each comma-separated string (e.g. LPPAYTNSF) between the lines.
Is there any way to do this?
If I understand you well, there is no reason to first collapse them to a string. Why not just get the counts from your data.frame? Any n > 1 has duplicates.
data.final %>% count(Peptide)
solution based on your string
table(unlist(strsplit(v, ",")))
results
ASFSTFKCY CVADYSVLY KIYSKHTPI LPFNDGVYF LPPAYTNSF LPSAYTNSF QSYGFQPTY RLFRKSNLK SANNCTFEY SAYTNSFTR TSNQVAVLY WMESEFRVY
        8         8         8         8         6         2         6         8         8         2         8         8
WTAGAAAYY YLQPRTFLL YNSASFSTF YSSANNCTF
        8         8         8         8
data
v <- "LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPSAYTNSF,SAYTNSFTR,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,QSYGFQPTY,TSNQVAVLY,LPPAYTNSF,LPFNDGVYF,WMESEFRVY,YSSANNCTF,SANNCTFEY,KIYSKHTPI,WTAGAAAYY,YLQPRTFLL,CVADYSVLY,YNSASFSTF,ASFSTFKCY,RLFRKSNLK,TSNQVAVLY"
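If "between the lines" means the number of lines in which each peptide occurs, one option is to keep one string per line instead of a single long string. The sketch below uses shortened, hypothetical lines (not the full data from the question) and counts each peptide at most once per line.

```r
# Hypothetical, shortened input: each element stands for one printed line
lines <- c("LPPAYTNSF,LPFNDGVYF,TSNQVAVLY",
           "LPPAYTNSF,LPFNDGVYF,QSYGFQPTY,TSNQVAVLY",
           "LPSAYTNSF,LPFNDGVYF,QSYGFQPTY,TSNQVAVLY")

# Split every line on commas; unique() so a peptide counts once per line
tokens <- lapply(strsplit(lines, ",", fixed = TRUE), unique)

# Number of lines in which each peptide occurs
line_counts <- table(unlist(tokens))

# Peptides that occur in more than one line
shared <- names(line_counts)[line_counts > 1]

print(line_counts)
print(shared)
```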
Not quite sure what your data are like so I've scribbled together some toy data to show how a regex and a non-regex tidyverse solution would work on your task of "find[ing] the number of duplicates from each comma-separated strings":
Data:
x <- c("a,c,x,a,f,s,w,s,b,n,x,q",
"A,B,B,X,B,Q")
A regex tidyverse solution:
library(tidyverse)
data.frame(x) %>%
# create new column with duplicated values:
mutate(dups = str_extract_all(x, "\\b([A-Za-z]+)\\b(?=.*\\b\\1\\b)"))
x dups
1 a,c,x,a,f,s,w,s,b,n,x,q a, x, s
2 A,B,B,X,B,Q B, B
NB: the + inside the capture group lets the pattern match whole alphabetic tokens longer than one character, and the \\b word boundaries keep a token from matching inside a longer one
Alternatively, a non-regex tidyverse solution:
library(tidyverse)
data.frame(x) %>%
# make column with unique ID for each string:
mutate(stringID = row_number()) %>%
# separate values into rows:
separate_rows(x) %>%
# for each combination of `stringID` and `x`...
group_by(stringID, x) %>%
# ...count the number of tokens:
summarise(N = n()) %>%
# show only the duplicated values:
filter(N > 1)
# A tibble: 4 × 3
# Groups: stringID [2]
stringID x N
<int> <chr> <int>
1 1 a 2
2 1 s 2
3 1 x 2
4 2 B 3
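For comparison, the same within-string check can be sketched in base R with duplicated(). This is an illustrative alternative, not part of the answer above; it reuses the toy vector x.

```r
x <- c("a,c,x,a,f,s,w,s,b,n,x,q",
       "A,B,B,X,B,Q")

# Split each string on commas, then keep the tokens that occur more than once
dups <- lapply(strsplit(x, ",", fixed = TRUE), function(tok) {
  # duplicated() flags every repeat after the first occurrence;
  # unique() reports each repeated token only once
  unique(tok[duplicated(tok)])
})

print(dups)
```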

Filtering/subsetting R dataframe based on each rows n'th position value

I have a 'df' with 2 columns:
Combinations <- c(0011111111, 0011113111, 0013113112, 0022223114)
Values <- c(1,2,3,4)
df <- cbind.data.frame(Combinations, Values)
I am trying to find a way to subset or filter the dataframe where the 'Combinations' column's 7th, 8th, and 9th digits equal 311. For the example given, I would expect Combinations 0011113111, 0013113112, and 0022223114.
There are also instances where I would need to find different combinations, in different nth positions.
I know substring() can find these values for single rows but I'm not sure how to apply it to an entire dataframe.
substring will work with vectors as well.
subset(df, substring(Combinations, 7, 9) == 311)
# Combinations Values
#2 0011113111 2
#3 0013113112 3
#4 0022223114 4
data
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- data.frame(Combinations, Values)
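Since the question also mentions other patterns at other positions, the substring() idea could be wrapped in a small helper. The function name match_at and its arguments are hypothetical, introduced here only for illustration; the column is assumed to hold character strings so that leading zeros are preserved.

```r
df <- data.frame(
  Combinations = c("0011111111", "0011113111", "0013113112", "0022223114"),
  Values = c(1, 2, 3, 4)
)

# Hypothetical helper: keep rows where `col` contains `pattern`
# starting exactly at character position `start`
match_at <- function(data, col, start, pattern) {
  stop_pos <- start + nchar(pattern) - 1
  keep <- substring(data[[col]], start, stop_pos) == pattern
  data[keep, , drop = FALSE]
}

res_311 <- match_at(df, "Combinations", 7, "311")  # characters 7 to 9
res_22  <- match_at(df, "Combinations", 3, "22")   # characters 3 to 4

print(res_311)
print(res_22)
```

Passing the pattern as a string avoids relying on the implicit number-to-character coercion in a comparison such as == 311.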
Another base R idea:
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- data.frame(Combinations, Values)
df[grep(pattern = "^[0-9]{6}311.$", df$Combinations), ]
Output:
Combinations Values
2 0011113111 2
3 0013113112 3
4 0022223114 4
As a tip, if you want to know more about regular expressions, this website helps me a lot: https://regexr.com/3elkd
Would this work?
library(dplyr)
library(stringr)
df %>% filter(str_sub(Combinations, 7,9) == 311)
Combinations Values
1 0011113111 2
2 0013113112 3
3 0022223114 4
Not pretty but works:
df[which(lapply(strsplit(df$Combinations, ""), function(x) which(x[7]==3 & x[8]==1 & x[9]==1))==1),]
Combinations Values
2 0011113111 2
3 0013113112 3
4 0022223114 4
Data:
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- cbind.data.frame(Combinations, Values)

How to get x rows from each category in R?

I have a matrix that contains many rows, let's say more than 5000 rows from each category and I would like to get 4500 rows from each category. How to do it in R?
I know that there is unique, but this is getting just one element per category, but I need N elements per category.
Here is my data:
cat f1 f2 f3
1 a 15 20 sdr
2 b 8 6 zrf
3 a 54 6 sf
4 c 32 8 azr
5 b 65 98 arfg
....
One 'brute-force' kind of approach would be to split your data by group and then simply take the head of N rows. Then simply bind them all together for your new data.frame. This is the essence of 'split-apply-combine'.
df <- data.frame(group=rep(c("A","B"), each=10), var=rnorm(20))
# Number of Rows
N <- 5
# the split, apply(i.e. head), combine approach
do.call("rbind", lapply(split(df, f=df$group), function(x) head(x, n=N)))
The same approach will work if your data is in a matrix with a column containing some sort of unique group identifier and you call split.data.frame directly. It will still split your matrix into a list of 'sub' matrices.
mat <- matrix(c(rep(c(0,1), each=10), rnorm(20)),20,2)
do.call("rbind", lapply(split.data.frame(mat, f=mat[,1]), function(x) head(x, n=N)))
EDIT
As suggested by @akrun below, you could also use dplyr if your object is a data.frame
library(dplyr)
df %>%
group_by(group) %>%
slice(seq(N))
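A base R variant that avoids splitting and re-binding is to compute a running index within each group and keep the rows where it does not exceed N. This is a sketch on the same kind of toy data; set.seed() is there only to make the random values reproducible.

```r
set.seed(1)  # only so the toy data is reproducible
df <- data.frame(group = rep(c("A", "B"), each = 10), var = rnorm(20))

# Number of rows to keep per group
N <- 5

# Running index within each group: 1, 2, 3, ... restarting per group
idx <- ave(seq_len(nrow(df)), df$group, FUN = seq_along)

# Keep the first N rows of every group, in their original order
first_n <- df[idx <= N, ]
print(first_n)
```

Groups with fewer than N rows are simply kept whole.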

R sum of rows for different group of columns that start with similar string

I'm quite new to R and this is the first time I dare to ask a question here.
I'm working with a dataset with Likert scales and I want to compute row sums over different groups of columns whose names share the same prefix.
Below I constructed a data frame of only 2 rows to illustrate the approach I followed, though I would like to receive feedback on how I can write a more efficient way of doing it.
df <- as.data.frame(rbind(rep(sample(1:5),4),rep(sample(1:5),4)))
var.names <- c("emp_1","emp_2","emp_3","emp_4","sat_1","sat_2"
,"sat_3","res_1","res_2","res_3","res_4","com_1",
"com_2","com_3","com_4","com_5","cap_1","cap_2",
"cap_3","cap_4")
names(df) <- var.names
So, what I did was to use the grep function to sum the rows of the variables that start with certain strings and store the result in a new variable. But I have to write a new line of code for each variable.
df$emp_t <- rowSums(df[, grep("\\bemp.", names(df))])
df$sat_t <- rowSums(df[, grep("\\bsat.", names(df))])
df$res_t <- rowSums(df[, grep("\\bres.", names(df))])
df$com_t <- rowSums(df[, grep("\\bcom.", names(df))])
df$cap_t <- rowSums(df[, grep("\\bcap.", names(df))])
But there are a lot more variables in the dataset and I would like to know if there is a way to do this with only one line of code. For example, some way to group the variables that start with the same strings together and then apply the row function.
Thanks in advance!
One possible solution is to transpose df and calculate sums for the correct columns using base R rowsum function (using set.seed(123))
cbind(df, t(rowsum(t(df), sub("_.*", "_t", names(df)))))
# emp_1 emp_2 emp_3 emp_4 sat_1 sat_2 sat_3 res_1 res_2 res_3 res_4 com_1 com_2 com_3 com_4 com_5 cap_1 cap_2 cap_3 cap_4 cap_t
# 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 13
# 2 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 14
# com_t emp_t res_t sat_t
# 1 15 14 11 7
# 2 15 10 12 9
Agree with MrFlick that you may want to put your data in long format (see reshape2, tidyr), but to answer your question:
cbind(
df,
sapply(split.default(df, sub("_.*$", "_t", names(df))), rowSums)
)
Will do the trick
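To see what the split.default() approach produces, here is a small worked example with fixed numbers rather than random ones. The data frame is made up for illustration; only the column-naming scheme follows the question.

```r
# Small deterministic data set with two column groups (emp_*, sat_*)
df <- data.frame(emp_1 = c(1, 2), emp_2 = c(3, 4),
                 sat_1 = c(5, 1), sat_2 = c(2, 2), sat_3 = c(1, 3))

# Map every column name to its group label: "emp_1" -> "emp_t", and so on
groups <- sub("_.*$", "_t", names(df))

# split.default() splits the data frame by columns; rowSums() runs per group
totals <- sapply(split.default(df, groups), rowSums)

result <- cbind(df, totals)
print(result)
```

For the first row, emp_t is 1 + 3 = 4 and sat_t is 5 + 2 + 1 = 8.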
You'll be better off in the long run if you put your data into tidy format. The problem is that the data is in a wide rather than a long format. And the variable names, e.g., emp_1, are actually two separate pieces of data: the class of the person, and the person's ID number (or something like that). Here is a solution to your problem with dplyr and tidyr.
library(dplyr)
library(tidyr)
df %>%
gather(key, value) %>%
extract(key, c("class", "id"), "([[:alnum:]]+)_([[:alnum:]]+)") %>%
group_by(class) %>%
summarize(class_sum = sum(value))
First we convert the data frame from wide to long format with gather(). Then we split the values emp_1 into separate columns class and id with extract(). Finally we group by the class and sum the values in each class. Result:
Source: local data frame [5 x 2]
class class_sum
1 cap 26
2 com 30
3 emp 23
4 res 22
5 sat 19
Another potential solution is to use dplyr's rowwise() function: https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-rowwise/
df %>%
rowwise() %>%
mutate(emp_sum = sum(c_across(starts_with("emp"))),
sat_sum = sum(c_across(starts_with("sat"))),
res_sum = sum(c_across(starts_with("res"))),
com_sum = sum(c_across(starts_with("com"))),
cap_sum = sum(c_across(starts_with("cap"))))

Random row selection in R

I have this dataframe
id <- c(1,1,1,2,2,3)
name <- c("A","A","A","B","B","C")
value <- c(7:12)
df<- data.frame(id=id, name=name, value=value)
df
This function selects a random row from it:
randomRows = function(df,n){
return(df[sample(nrow(df),n),])
}
i.e.
randomRows(df,1)
But I want to randomly select one row per 'name' (or per 'id', which is the same) and concatenate that entire row into a new table, so in this case three rows. This has to loop through a 2000+ row dataframe. Please show me how.
I think you can do this with the plyr package:
library("plyr")
ddply(df,.(name),randomRows,1)
which gives you for example:
id name value
1 1 A 8
2 2 B 11
3 3 C 12
Is this what you are looking for?
Here's one way of doing it in base R.
> df.split <- split(df, df$name)
> df.sample <- lapply(df.split, randomRows, 1)
> df.final <- do.call("rbind", df.sample)
> df.final
id name value
A 1 A 7
B 2 B 11
C 3 C 12
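Another base R sketch samples row indices per group instead of splitting the whole data frame. It draws with sample.int() on the group length, because sample(i, 1) would draw from 1:i when a group holds a single row index (as group C does here). The seed is set only to make the illustration reproducible.

```r
df <- data.frame(id = c(1, 1, 1, 2, 2, 3),
                 name = c("A", "A", "A", "B", "B", "C"),
                 value = 7:12)

set.seed(42)  # only for a reproducible illustration

# Row indices belonging to each name
rows_by_name <- split(seq_len(nrow(df)), df$name)

# Pick one index per group by position within the group
picked <- vapply(rows_by_name,
                 function(i) i[sample.int(length(i), 1)],
                 integer(1))

one_per_name <- df[picked, ]
print(one_per_name)
```

Working on indices keeps the approach cheap for a data frame with a few thousand rows.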
