arranging strings from one data frame based on another one - r

I have a data frame like this one
df1<- structure(list(V1 = structure(c(8L, 4L, 5L, 7L, 6L, 3L, 9L, 1L,
2L), .Label = c("A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4", "A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920",
"C1P641;C1P640;A0A061AD21;G5EEV6", "O16276", "O16520-2", "O17323-2",
"O17395", "O17403", "Q22501;A0A061AE05"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-9L))
My second data frame looks like this
df2<- structure(list(From = structure(c(12L, 10L, 11L, 8L, 7L, 1L,
9L, 15L, 2L, 5L, 13L, 3L, 16L, 6L, 4L, 14L), .Label = c("A0A061AD21",
"A0A061AE05", "A0A061AJ82", "A0A061AJK8", "A0A061AKW6", "A0A061AL89",
"C1P640", "C1P641", "G5EEV6", "O16276", "O17395", "O17403", "Q19219",
"Q21920", "Q22501", "Q7JLR4"), class = "factor"), To = structure(c(4L,
8L, 1L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 3L, 3L, 7L), .Label = c("aat-3",
"CELE_F08G5.3", "CELE_R11A8.7", "cpsf-2", "epi-1", "pps-1", "R11A8.7",
"ugt-61"), class = "factor")), .Names = c("From", "To"), class = "data.frame", row.names = c(NA,
-16L))
df2 is derived from df1, but some information has been added and some removed. I want to reconstruct df2 so it matches df1 and arrange the column named To accordingly.
So the output should look like this
From To
O17403 cpsf-2
O16276 ugt-61
O16520-2 -
O17395 aat-3
O17323-2 -
C1P641;C1P640;A0A061AD21;G5EEV6 epi-1
Q22501;A0A061AE05 pps-1
A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4 CELE_F08G5.3
A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920 CELE_R11A8.7;R11A8.7
This means O17403 is in df2 and was a single string in df1, so it stays the same. O16276 was also a single string in its row of df1, so it stays the same too.
O16520-2 was in df1 but not in df2, so the To column gets a hyphen.
The same applies to the rest. C1P641;C1P640;A0A061AD21;G5EEV6 are all in the same row of df1 and their To is the same, so we keep the row as in df1 and just add a single epi-1.
Probably the best approach is to use df1 as a template and then fill in To against it: for the entries that are in df2, take their To values; for those that are not, use only a hyphen.
It is very complicated and I could not even think of how to do it. I will appreciate any help.

To solve this I split the semicolon-delimited strings and created a nested for-for-if-if loop.
Here's the logic behind the loop, which runs against the data.frame of split strings (tmp):
Fix the data classes (i.e. change factor to character to avoid conflicting level sets) and append a temporary To column to tmp.
For each column and row of tmp, start by checking whether the cell contains a valid string for matching and has a matched value in df2$To; if not, go to the next iteration.
If it does, look at the matching value in To from df2, checking whether we already have the matched value in tmp$To (if so, go to the next iteration).
If there's a new matched value in df2$To, put it in the corresponding cell of tmp$To, prepending any preceding matches and semicolons if it is not the first match for that row.
df1$V1 <- as.character(df1$V1)
df2$From <- as.character(df2$From)
df2$To <- as.character(df2$To)

library(stringr)
# Split the semicolon-delimited IDs into separate columns
tmp <- as.data.frame(str_split_fixed(df1$V1, ";", n = 5), stringsAsFactors = F)
tmp$To <- as.character(NA)

for(j in 1:nrow(tmp)){
  for(i in 1:ncol(tmp)){
    if(length(df2$To[df2$From == tmp[j,i]]) == 0 | is.null(tmp[j,i])){
      # No match in df2 (or nothing to match): move on
      next
    } else if(length(df2$To[df2$From == tmp[j,i]]) == 1 & !is.na(tmp[j,i])){
      if(is.na(tmp$To[j]) | tmp$To[j] == df2$To[df2$From == tmp[j,i]]){
        # First match for this row, or the same value as before: just assign it
        tmp$To[j] <- df2$To[df2$From == tmp[j,i]]
      } else{
        # A different match for this row: append it with a semicolon
        tmp$To[j] <- paste(tmp$To[j], ";", df2$To[df2$From == tmp[j,i]], sep = "")
      }
    } else{
      next
    }
  }
}
df1 <- data.frame(From=df1$V1, To=tmp$To)
df1
From To
1 O17403 cpsf-2
2 O16276 ugt-61
3 O16520-2 <NA>
4 O17395 aat-3
5 O17323-2 <NA>
6 C1P641;C1P640;A0A061AD21;G5EEV6 epi-1
7 Q22501;A0A061AE05 pps-1
8 A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4 CELE_F08G5.3
9 A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920 CELE_R11A8.7;R11A8.7
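One small note: the loop leaves NA rather than the hyphen shown in the desired output. If the hyphen matters, a quick touch-up after the loop (my addition, not part of the original logic) would be:
df1$To <- as.character(df1$To)   # 'To' may be a factor after data.frame()
df1$To[is.na(df1$To)] <- "-"     # replace NA with the requested hyphen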

One way of doing this is to use the splitstackshape package (use cSplit). I converted the factors to character strings to simplify (and get rid of warnings).
library(dplyr)
library(data.table) # cSplit from 'splitstackshape' returns a 'data.table'.
library(splitstackshape)

### Remove the factors for convenience of manipulation
df1 <- df1 %>% mutate(From = as.character(V1))
df2 <- df2 %>% mutate(From = as.character(From), To = as.character(To))

### 'cSplit' will split on ';' and create a new row for each item. The
### original 'From' column is kept around as cSplit removes the split column.
### 'rn' (row number) is used for ordering later.
cSplit(df1 %>% mutate(rn = row_number(), From_temp = From),
       "From_temp", sep = ";", direction = "long", drop = FALSE, type.convert = FALSE) %>%
  left_join(df2, by = c(From_temp = 'From')) %>%                    # Join to 'df2' to get the 'To' column.
  group_by(From, rn) %>%                                            # Group by original 'From' column.
  summarise(To = paste(sort(unique(na.omit(To))), collapse = ';'),  # Create 'To' by joining 'To' values,
            To = ifelse(To == '', '-', To)) %>%                     # setting empty values to '-'.
  ungroup %>%
  arrange(rn) %>%                                                   # Sort by original row number and
  select(-rn)                                                       # remove 'rn' column.
## From To
## <chr> <chr>
## 1 O17403 cpsf-2
## 2 O16276 ugt-61
## 3 O16520-2 -
## 4 O17395 aat-3
## 5 O17323-2 -
## 6 C1P641;C1P640;A0A061AD21;G5EEV6 epi-1
## 7 Q22501;A0A061AE05 pps-1
## 8 A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4 CELE_F08G5.3
## 9 A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920 CELE_R11A8.7;R11A8.7
There may be a cleaner way to do this with dplyr/tidyr that doesn't require splitstackshape.
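For instance, a rough sketch along the same lines using tidyr::separate_rows() instead of cSplit (a sketch, not a tested drop-in replacement for the answer above):
library(dplyr)
library(tidyr)
df1 %>%
  mutate(rn = row_number(), From_temp = From) %>%
  separate_rows(From_temp, sep = ";") %>%                  # one row per semicolon-separated ID
  left_join(df2, by = c("From_temp" = "From")) %>%         # pull in the 'To' values
  group_by(From, rn) %>%
  summarise(To = paste(sort(unique(na.omit(To))), collapse = ";")) %>%
  mutate(To = ifelse(To == "", "-", To)) %>%               # '-' where nothing matched
  ungroup() %>%
  arrange(rn) %>%
  select(From, To)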

Related

How to separate a dataframe based on specific string in column name [duplicate]

I have a huge data set that I have not been able to split into two sets
df<- structure(list(name = structure(1:3, .Label = c("a", "b", "c"
), class = "factor"), X3C_AALI_01A = c(651L, 2L, 1877L), X3C_AALJ_01B = c(419L,
2L, 1825L), X3C_AALK_01A = c(1310L, 52L, 1286L), X4H_AAAK_11B = c(2978L,
4L, 1389L), X5L_AAT0_01B = c(2576L, 15L, 1441L), X5L_AAT1_01A = c(2886L,
5L, 921L), X5T_A9QA_03A = c(929L, 3L, 935L), A1_A0SI_10A = c(1578L,
1L, 2217L), A1_A0SK_07C = c(3003L, 6L, 2984L), A1_A0SO_01A = c(6413L,
0L, 3577L), A1_A0SP_05B = c(5157L, 5L, 4596L), A2_A04P_01A = c(4283L,
6L, 2508L), X5L_AAh1_10A = c(2886L, 5L, 921L), X5T_A0QA_03A = c(929L,
3L, 935L), A1_A0Sm_10A = c(1578L, 1L, 2217L), A1_ArSK_01A = c(3003L,
6L, 2984L), A1_AfSO_01A = c(6413L, 0L, 3577L), A1_AuSP_05A = c(5157L,
5L, 4596L), A2_Ap4P_11A = c(4283L, 6L, 2508L)), class = "data.frame", row.names = c(NA,
-3L))
Basically, I want to split the data based on the last characters of the column names. For example, in the above data the second column is named 3C_AALI_01A, and I want to generate two data sets based on the _01A part.
Columns whose two-digit code is 01 to 09 should be in one data frame, and those with 10 or higher should be in the second data frame. For example, in the above example data,
the columns with the following names should be in one data frame
3C_AALI_01A
3C_AALJ_01B
3C_AALK_01A
5L_AAT0_01B
5L_AAT1_01A
5T_A9QA_03A
A1_A0SK_07C
A1_A0SO_01A
A1_A0SP_05B
A2_A04P_01A
5T_A0QA_03A
A1_ArSK_01A
A1_AfSO_01A
A1_AuSP_05A
and the columns with the following names should be in another data frame
4H_AAAK_11B
A1_A0SI_10A
5L_AAh1_10A
A1_A0Sm_10A
A2_Ap4P_11A
# Columns whose names end in a 01-09 code followed by one more character (e.g. "_01A")
df1 <- df[, grep('0[1-9].$', colnames(df))]
# Everything else (the 10-and-up columns, plus the 'name' column)
df2 <- df[, -grep('0[1-9].$', colnames(df))]
You could use the tidyr::separate(..., sep = -1) approach,
which uses negative string indexing, which is what you really want here.
Also, your data frame is transposed: it would be more normal to have a single column holding the names, and numerical columns a, b, c, i.e. like t(df) but without the unwanted coercion to character.
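A rough sketch of that idea (assuming every sample column name ends with a two-digit code followed by a single letter, e.g. "_01A"; the names nm, is_low, df_low and df_high are mine):
library(tidyr)
nm <- separate(data.frame(name = colnames(df)[-1], stringsAsFactors = FALSE),
               name, into = c("stem", "code"), sep = -3)   # negative sep counts from the right
is_low <- as.integer(substr(nm$code, 1, 2)) <= 9
df_low  <- df[, c("name", colnames(df)[-1][is_low])]       # the 01-09 samples
df_high <- df[, c("name", colnames(df)[-1][!is_low])]      # the 10-and-up samples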

Create function to count values across list of columns

R folks:
I have a dataframe with many sets of columns. Each set is a bank of survey items. I would like to count the number of columns in each set having a certain value. I wrote a function to do this but it results in a list of repeated values that is appended to my dataframe.
df<- structure(list(RespondentID = c(6764279930, 6779986023, 6760279439,
6759243066),
q1 = c(3L, 3L, 4L, 1L),
q2 = c(2L, 2L, 4L, 4L),
q3 = c(4L, 2L, 4L, 5L),
q0010_0004 = c(1L, 2L, 3L, 1L)),
.Names = c("RespondentID", "q1", "q2", "q3", "q4"),
row.names = c(NA, 4L), class = "data.frame")
group1<-c("q1","q2","q3","q4")
# Objective: Count number of ratings==4 for each row
# Make function that receives list of columns &
# then returns ONE column in dataframe with total # columns
# having certain value (in this case, 4)
countcol <- function(colgroup) {
  s <- subset(df, select = c(colgroup))  # select only the columns designated by the list
  s$sum <- Reduce("+", apply(X = s, 1, FUN = function(x) (sum(x == 4, na.rm = TRUE))))  # count instances of value == 4
  s2 <- subset(s, select = c(sum))  # return ONE column with the result for each row
  return(s2$sum)
}
countcol(group1)
My function, countcol runs without errors but as stated above results in what appears to be a transposed list of results for each row. I would like to have ONE number for each row that indicates the count of values.
I attempted various apply functions here but could not prevail. Anyone have a tip?
Thanks!
rowSums can give the results the OP is looking for. This returns the count of ratings == 4 for each row.
rowSums(df[2:5]==4)
#1 2 3 4
#1 0 3 1
Or just part of the function from the OP can give the answer.
apply(df[2:5], 1, function(x)(sum(x==4)))
#1 2 3 4
#1 0 3 1
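If you want to keep the function interface from the question, one possible rewrite built on rowSums (the extra data and value arguments are my additions, not part of the answer above):
countcol <- function(colgroup, data = df, value = 4) {
  # one count per row: how many of the chosen columns equal 'value'
  rowSums(data[colgroup] == value, na.rm = TRUE)
}
countcol(group1)
#1 2 3 4
#1 0 3 1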

Can I use %in% to search and match two columns?

I have a large data frame and a vector to pull out terms of interest. For a previous project I was using:
a=data[data$rn %in% y, "Gene"]
To pull out information into a new vector. Now I have another job I'd like to do.
I have a large data frame of 15 columns and >100000 rows. I want to search columns 3 and 9 for the content of the vector and print the result as a new data frame.
To make this extra annoying, the hit could be in v3 and not in v9, and vice versa.
Working example
I have stripped the data frame down to 3 columns and a few rows.
data <- structure(list(Gene = structure(c(1L, 5L, 3L, 2L, 4L), .Label = c("ibp","leuA", "pLeuDn_02", "repA", "repA1"), class = "factor"), LocusTag = structure(c(1L,2L, 5L, 3L, 4L), .Label = c("pBPS1_01", "pBPS1_02", "pleuBTgp4","pleuBTgp5", "pLeuDn_02"), class = "factor"), hit = structure(c(2L,4L, 3L, 1L, 5L), .Label = c("2-isopropylmalate synthase", "Ibp protein","ORF1", "repA1 protein", "replication-associated protein"), class = "factor")), .Names = c("Gene","LocusTag", "hit"), row.names = c(NA, 5L), class = "data.frame")
y <- c("ibp", "orf1")
First of all, R is case sensitive, so your example will not collect the third line, but I guess you want that extracted. So you would have to change your y to
y <- c("ibp", "ORF1")
OK, from your example I have tried to see what you want to achieve. I am not sure if this is really what you want, but R knows the operator | as "or", so you could try something like:
new.data<-data[data$Gene %in% y|data$hit %in% y,]
If you only want to extract certain columns of your data set, you can specify them after the comma, e.g.:
new.data<-data[data$Gene %in% y|data$hit %in% y, c("LocusTag","Gene")]
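If you would rather not edit y by hand, a case-insensitive variant of the same idea (simply lower-casing both sides before comparing; my addition, not part of the answer above) could be:
y_low <- tolower(y)
new.data <- data[tolower(data$Gene) %in% y_low | tolower(data$hit) %in% y_low, ]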

Find and remove matching substrings from two data frames

I have two data frames: df1 and df2
df1<- structure(list(sample_1 = structure(c(7L, 6L, 5L, 1L, 2L, 4L,
3L), .Label = c("P41182;Q9HCP0", "Q09472", "Q9Y6H1;Q5T1J5", "Q9Y6I3",
"Q9Y6Q9", "Q9Y6U3", "Q9Y6W5"), class = "factor"), sample_2 = structure(c(7L,
6L, 4L, 3L, 2L, 5L, 1L), .Label = c("O15143", "P31908", "P3R117",
"P41356;P54612;A41PH2", "P54112", "P61809;Q92831", "Q16835"), class = "factor")), .Names = c("sample_1",
"sample_2"), class = "data.frame", row.names = c(NA, -7L))
df2<- structure(list(subunits..UniProt.IDs. = structure(c(4L, 6L, 5L,
12L, 3L, 9L, 14L, 16L, 15L, 11L, 13L, 8L, 1L, 2L, 10L, 7L), .Label = c("O55102,Q9CWG9,Q5U5M8,Q8VED2,Q91WZ8,Q8R015,Q9R0C0,Q9Z266",
"P30561,O08915,P07901,P11499", "P30561,P53762", "P41182,P56524",
"P41182,Q8WUI4", "P41182,Q9UQL6", "P61160,P61158,O15143,O15144,O15145,P59998,O15511",
"P78537,Q6QNY1,Q6QNY0,Q9NUP1,Q96EV8,Q8TDH9,Q9UL45,O95295", "Q15021,Q9BPX3,Q15003,O95347,Q9NTJ3",
"Q8WMR7,(P67776,P11493),(P54612,P54613)", "Q91VB4,P59438,Q8BLY7",
"Q92793,Q09472,Q9Y6Q9,Q92831", "Q92828,Q13227,O15379,O75376,O60907,Q9BZK7",
"Q92902,Q9NQG7", "Q92903,Q96NY9", "Q969F9,Q9UPZ3,Q86YV9"), class = "factor")), .Names = "subunits..UniProt.IDs.", class = "data.frame", row.names = c(NA,
-16L))
I want to look at each semicolon-separated string in df1 and if it contains a match to one of the comma-separated strings in df2, then remove it. So, my output will look like below:
sample_1 sample_2
1 Q9Y6W5 Q16835
2 Q9Y6U3 P61809
3 P41356;A41PH2
4 Q9HCP0 P3R117
5 P31908
6 Q9Y6I3 P54112
7 Q9Y6H1;Q5T1J5
sample_1 has strings in rows 3, 4 and 5 that match one of the strings in df2, and those matching strings are removed.
sample_2 has strings in rows 2, 3 and 7 that match strings in df2, and those matching strings are removed.
First, you could gather all the possible strings to remove:
toRmv <- unique(unlist(strsplit(as.character(df2[,1]), ",", fixed = TRUE)))
toRmv <- gsub("\\W", "", toRmv, perl = TRUE)
Then remove them. I like the stringi package here for its ability to replace multiple strings with an empty string using the handy vectorize_all argument set to FALSE.
library(stringi)
df1[] <- lapply(df1, stri_replace_all_fixed,
                pattern = toRmv, replacement = "", vectorize_all = FALSE)
df1
# sample_1 sample_2
#1 Q9Y6W5 Q16835
#2 Q9Y6U3 P61809;
#3 P41356;;A41PH2
#4 ;Q9HCP0 P3R117
#5 P31908
#6 Q9Y6I3 P54112
#7 Q9Y6H1;Q5T1J5
Now, it's just a matter of getting rid of leading semicolons (^;), trailing semicolons (;$), and multiple semicolons ((?<=;);):
df1[] <- lapply(df1, gsub, pattern = "^;|;$|(?<=;);", replacement = "", perl = TRUE)
df1
# sample_1 sample_2
#1 Q9Y6W5 Q16835
#2 Q9Y6U3 P61809
#3 P41356;A41PH2
#4 Q9HCP0 P3R117
#5 P31908
#6 Q9Y6I3 P54112
#7 Q9Y6H1;Q5T1J5
As requested in the comment, here it is in function form. I didn't test this part. Feel free to test and adjust as you see fit:
stringRemove <- function(removeFrom, toRemove) {
  library(stringi)
  toRemove <- unique(unlist(strsplit(as.character(toRemove), ",", fixed = TRUE)))
  toRemove <- gsub("\\W", "", toRemove, perl = TRUE)
  removeFrom[] <- lapply(removeFrom, stri_replace_all_fixed,
                         pattern = toRemove, replacement = "", vectorize_all = FALSE)
  removeFrom[] <- lapply(removeFrom, gsub,
                         pattern = "^;|;$|(?<=;);", replacement = "", perl = TRUE)
  removeFrom
}
# use it
stringRemove(removeFrom = df1, toRemove = df2[,1])
Firstly, you should almost definitely rearrange your data so it's tidy, i.e. has a column for each variable and a row for each observation, but not knowing what it is or how it's related, I can't do that for you. Thus, the only way left is to hack through what are effectively list columns:
library(dplyr)
# For each column,
df1 %>% mutate_each(funs(
  # convert to character,
  as.character(.) %>%
    # split each string into a list of strings to evaluate,
    strsplit(';') %>%
    # loop over the items in each list,
    lapply(function(x){
      # replacing any in a similarly split and unlisted df2 with NA,
      ifelse(x %in% unlist(strsplit(as.character(df2[,1]), '[(),]+')),
             NA_character_, x)
    }) %>%
    # then loop over them again,
    sapply(function(x){
      # removing NAs where there are non-NA strings.
      ifelse(all(is.na(x)), list(NA_character_), list(x[!is.na(x)]))
    })))
# sample_1 sample_2
# 1 Q9Y6W5 Q16835
# 2 Q9Y6U3 P61809
# 3 NA P41356, A41PH2
# 4 Q9HCP0 P3R117
# 5 NA P31908
# 6 Q9Y6I3 P54112
# 7 Q9Y6H1, Q5T1J5 NA
If you want to collapse the actual list columns you end with back into strings, you can do so with paste, but really, list columns are more useful.
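For completeness, a small sketch of that collapse step (assuming the result of the chain above was stored in res, which is my own placeholder name; NA entries come back as the literal string "NA"):
res[] <- lapply(res, function(col) sapply(col, paste, collapse = ";"))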
Edit
If your data is big enough that it's worth the annoyance to make it faster, take the munging of df2 out of the chain and store it separately so you don't calculate it for every iteration. Here's a version that does so, built in purrr, which works with lists instead of data.frames and can be faster than mutate_each for non-trivial functions. Edit as you like.
library(purrr)
df2_unlisted <- df2 %>% map(as.character) %>%  # convert; unnecessary if stringsAsFactors = FALSE
  map(strsplit, '[(),]') %>%                   # split
  unlist()                                     # unlist to vector
df1 %>% map(as.character) %>%                  # convert; unnecessary if stringsAsFactors = FALSE
  map(strsplit, ';') %>%                       # split
  at_depth(2, ~.x[!.x %in% df2_unlisted]) %>%  # subset out unwanted
  at_depth(2, ~if(is_empty(.x)) NA_character_ else .x) %>%  # insert NA for chr(0)
  as_data_frame() %>% data.frame()             # for printing
Results are identical.

Dynamic use of match function

I would like to match two data frames based on a certain column. My data frames are attached below.
df <- structure(list(Read = structure(1:3, .Label = c("CC", "CG", "GC"
), class = "factor"), index = c(6L, 7L, 10L)), .Names = c("Read",
"index"), row.names = c(NA, -3L), class = "data.frame")
df1 <- structure(list(Ref_base = structure(c(1L, 6L, 4L, 2L, 3L, 4L,
3L, 5L), .Label = c("AT", "CC", "CG", "GC", "GT", "TG"), class = "factor"),
index = c(4L, 15L, 10L, 6L, 7L, 10L, 7L, 12L)), .Names = c("Ref_base",
"index"), row.names = c(NA, -8L), class = "data.frame")
I use match to find the match between the two data frames
match(df$index,df1$index)
and it gives me the correct result 4 5 3 as the indices of the matches. But I would like to lock down position 4, which is the index of the first match, and perform the following matches only after 4, or whatever the first index is. I don't want the search for later elements to go back before the index of the first match. For example, I am interested in returning the indices as 4, 5, 6, including repetitions if any.
The first solution is basically not much more than a loop. It loops through all search elements from df$index and returns the match indices in tmp. The variable search_start is used to let the next search begin from the most recent position. Since search_start was defined outside of the anonymous function in sapply, you have to use <<- instead of = or <- to modify it. There is also some code for handling NAs (this was missing in the first version of my answer).
match_sapply = function(a, b) {
  search_start = 1
  tmp2 = sapply(a, function(x) {
    tmp = match(x, b[search_start:length(b)])  # length(b) rather than nrow(df1), so the function does not depend on the global df1
    search_start <<- search_start + ifelse(is.na(tmp), 0, tmp)
    tmp
  })
  # the following line updates all non-NA elements of tmp2 with its cumulative sum
  `[<-`(tmp2, !is.na(tmp2), cumsum(tmp2[!is.na(tmp2)]))
}
match_sapply(c(50,df$index,20),df1$index)
#[1] NA 4 5 6 NA
And another version using Recall. This is a recursive approach: Recall calls the function from which it was called (in our case match_recall) again, but you can provide different arguments. The arguments of match_recall are: x, the search terms; y, the target vector; n, the recursion level (it also selects the specific element of x); and si, the start index (the same role as search_start in the previous solution). Again, there is some code that handles NAs.
match_recall = function(x, y, n = 1, si = 1) {
  tmp = match(x[n], y[si:length(y)])
  tmp1 = tmp
  if (is.na(tmp1)) tmp1 = 0
  if (length(x) == n) {
    return(tmp)
  } else {
    c(tmp, tmp1 + Recall(x, y, n + 1, si + tmp1))
  }
}
match_recall(c(50,df$index,20),df1$index)
#[1] NA 4 5 6 NA
