Can I use %in% to search and match two columns? - r

I have a large data frame and a vector of terms of interest. For a previous project I was using:
a=data[data$rn %in% y, "Gene"]
to pull the matching information into a new vector. Now I have another job I'd like to do.
I have a large data frame of 15 columns and >100000 rows. I want to search columns 3 and 9 for the contents of the vector and return the matching rows as a new data frame.
To make this extra annoying, the hit could be in column 3 and not in column 9, and vice versa.
Working example
I have stripped the data frame down to 3 columns and a few rows.
data <- structure(list(Gene = structure(c(1L, 5L, 3L, 2L, 4L), .Label = c("ibp","leuA", "pLeuDn_02", "repA", "repA1"), class = "factor"), LocusTag = structure(c(1L,2L, 5L, 3L, 4L), .Label = c("pBPS1_01", "pBPS1_02", "pleuBTgp4","pleuBTgp5", "pLeuDn_02"), class = "factor"), hit = structure(c(2L,4L, 3L, 1L, 5L), .Label = c("2-isopropylmalate synthase", "Ibp protein","ORF1", "repA1 protein", "replication-associated protein"), class = "factor")), .Names = c("Gene","LocusTag", "hit"), row.names = c(NA, 5L), class = "data.frame")
y <- c("ibp", "orf1")

First of all, R is case sensitive, so your example will not pick up the third row, which I guess you want extracted. You would have to change your y to
y <- c("ibp", "ORF1")
From your example I am not sure this is exactly what you want, but R has the operator | for element-wise "or", so you could try something like:
new.data<-data[data$Gene %in% y|data$hit %in% y,]
If you only want to extract certain columns of your data set, you can specify them after the comma, e.g.:
new.data<-data[data$Gene %in% y|data$hit %in% y, c("LocusTag","Gene")]
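If the case mismatch is in the search terms rather than something you can fix by hand, one option is to lower-case both sides before comparing. This is only a sketch: it rebuilds the example data with plain character columns, and the names cols and keep are made up for illustration. Reduce() combines one logical vector per searched column with |, so the same code works for two columns or ten.

```r
# Example data from the question, rebuilt with character columns
data <- data.frame(
  Gene     = c("ibp", "repA1", "pLeuDn_02", "leuA", "repA"),
  LocusTag = c("pBPS1_01", "pBPS1_02", "pLeuDn_02", "pleuBTgp4", "pleuBTgp5"),
  hit      = c("Ibp protein", "repA1 protein", "ORF1",
               "2-isopropylmalate synthase", "replication-associated protein"),
  stringsAsFactors = FALSE
)
y <- c("ibp", "orf1")

# Columns to search (hypothetical name; use your own column 3 and 9 here)
cols <- c("Gene", "hit")

# One logical vector per column, compared case-insensitively, then OR-ed together
keep <- Reduce(`|`, lapply(data[cols], function(col) tolower(col) %in% tolower(y)))

new.data <- data[keep, ]
print(new.data)
```

Note that %in% only finds exact (whole-cell) matches, so "ibp" matches the Gene value "ibp" but not the hit value "Ibp protein".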


How to combine multiple text entries for a variable once dplyr has grouped by another variable [duplicate]

This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 2 years ago.
For hundreds of matters, my data frame has daily text entries by dozens of timekeepers. Not every timekeeper enters time each day for each matter. Text entries can be any length. Each entry for a matter is for work done on a different day (but for my purposes, figuring out readability measures for the text, dates don't matter). What I would like to do is to combine for each matter all of its text entries.
Here is a toy data set and what it looks like:
> dput(df)
structure(list(Matter = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
3L, 4L, 4L), .Label = c("MatterA", "MatterB", "MatterC", "MatterD"
), class = "factor"), Timekeeper = structure(c(1L, 2L, 3L, 4L,
2L, 3L, 1L, 1L, 3L, 4L), .Label = c("Alpha", "Baker", "Charlie",
"Delta"), class = "factor"), Text = structure(c(5L, 8L, 1L, 3L,
7L, 6L, 9L, 2L, 10L, 4L), .Label = c("all", "all we have", "good men to come to",
"in these times that try men's souls", "Now is", "of", "the aid",
"the time for", "their country since", "to fear is fear itself"
), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
Dplyr groups the time records by matter, but I am stumped as to how to combine the text entries for each matter so that the result is along these lines -- all text gathered for a matter:
1 MatterA Now is the time for all good men to come to
5 MatterB the aid of their country since
8 MatterC all we have
9 MatterD to fear is fear itself in these times that try men's souls
dplyr::mutate() does not work with various concatenation functions:
textCombined <- df %>% group_by(Matter) %>% mutate(ComboText = str_c(Text))
textCombined2 <- df %>% group_by(Matter) %>% mutate(ComboText = paste(Text))
textCombined3 <- df %>% group_by(Matter) %>% mutate(ComboText = c(Text)) # creates numbers
Maybe a loop will do the job, as in "while the matter stays the same, combine the text" but I don't know how to write that. Or maybe dplyr has a conditional mutate, as in "mutate(while the matter stays the same, combine the text)."
Thank you for your help.
Hi, you can use group_by and summarise with paste and its collapse argument:
> df %>% group_by(Matter) %>% summarise(line= paste(Text, collapse = " "))
# A tibble: 4 x 2
# Matter line
# <fct> <chr>
#1 MatterA Now is the time for all good men to come to
#2 MatterB the aid of their country since
#3 MatterC all we have
#4 MatterD to fear is fear itself in these times that try men's souls
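The attempts with mutate() in the question fail because paste() without collapse is vectorised: it returns one string per input element, so nothing gets joined. The collapse argument is what reduces a vector to a single string. Here is a minimal sketch of the same idea in base R, assuming the toy data is rebuilt with character columns; aggregate() stands in for group_by() plus summarise().

```r
# Toy data from the question, rebuilt with character columns
df <- data.frame(
  Matter = rep(c("MatterA", "MatterB", "MatterC", "MatterD"), times = c(4, 3, 1, 2)),
  Text = c("Now is", "the time for", "all", "good men to come to",
           "the aid", "of", "their country since",
           "all we have",
           "to fear is fear itself", "in these times that try men's souls"),
  stringsAsFactors = FALSE
)

# Without collapse: one string per element, nothing is combined
n_plain <- length(paste(df$Text))

# With collapse: the whole vector becomes a single string
n_collapsed <- length(paste(df$Text, collapse = " "))

# One row per matter, with all of its text entries joined by a space
combined <- aggregate(Text ~ Matter, data = df, FUN = paste, collapse = " ")
print(combined)
```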

Return a single row out of multiple rows with partially matching entries

I am reposting this question with a bit more clarity. Unfortunately, I didn't get any solutions from my previous posting. Please help me with this.
Below is what I want to do:
I have a dataset named Proteome. It has 14 columns and thousands of rows.
Row 1, column 5: GHFCLKPGCNFHAESTRGYR
Row 2, column 5: FCLKPGCNFHAESTRGYR
Row 3, column 5: GHFCLKPGCNFHAESTR
Row 4, column 5: GCNFHAESTR
Please click on this link to see a screenshot of part of the original data frame: i67.tinypic.com/2wd0ap3.png
So, in row 2, the first two letters of row 1 are missing; in row 3, the last three letters of row 1 are missing; in row 4, the first seven and last three letters of row 1 are missing.
Rows 2, 3, and 4 reflect the artifacts of the scientific method I have been using to generate the data, and therefore I want to remove these entries.
I want R to return only one of the four rows, ideally row 1, and remove the rest. I imagine R could do this by first finding all rows that share a matching string of letters and then eliminating all but one of them. For example, in the above data set, GCNFHAESTR matches in all four rows, so I want R to return only one row, ideally the top one. But I don't know how to do this.
Hope this makes better sense this time. I look forward to hearing from the experts.
Thanks!
In response to Julian_Hn's suggestion, here is the dput of my dataset:
dput(Proteome)
structure(list(Protein.name = structure(c(1L, 1L, 1L, 1L, 2L,
3L), .Label = c("HCTF", "IFT", "ROSF"), class = "factor"), X..Proteins = c(5L,
5L, 5L, 5L, 3L, 7L), X..PSMs = c(3L, 1L, 6L, 2L, 2L, 4L), Previous.5.amino.acids = structure(c(4L,
5L, 4L, 2L, 3L, 1L), .Label = c("CWYAT", "FCLKP", "MGCPT", "NCTMY",
"TMYFC"), class = "factor"), Sequence = structure(c(5L, 1L, 4L,
2L, 3L, 6L), .Label = c("FCLKPGCNFHAESTRGYR", "GCNFHAESTR", "GFGFNWPHAVR",
"GHFCLKPGCNFHAESTR", "GHFCLKPGCNFHAESTRGYR", "GNFSVKLMNR"), class = "factor")), .Names = c("Protein.name",
"X..Proteins", "X..PSMs", "Previous.5.amino.acids", "Sequence"
), class = "data.frame", row.names = c(NA, -6L))
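One possible reading of the requirement is "drop every row whose Sequence is fully contained in another row's Sequence". Here is a rough sketch of that reading in base R. It assumes the full-length peptide is always present in the data and that Sequence is a character column; the names is_fragment and kept are made up for illustration. It compares every sequence against every other one, so it may be slow for very large tables.

```r
# Subset of the dput above, rebuilt with character columns
proteome <- data.frame(
  Protein.name = c("HCTF", "HCTF", "HCTF", "HCTF", "IFT", "ROSF"),
  Sequence = c("GHFCLKPGCNFHAESTRGYR", "FCLKPGCNFHAESTRGYR", "GHFCLKPGCNFHAESTR",
               "GCNFHAESTR", "GFGFNWPHAVR", "GNFSVKLMNR"),
  stringsAsFactors = FALSE
)

# Drop exact duplicates first, otherwise two identical rows would remove each other
proteome <- proteome[!duplicated(proteome$Sequence), ]
seqs <- proteome$Sequence

# A row is a fragment if its sequence occurs literally inside any other sequence
is_fragment <- vapply(seq_along(seqs), function(i) {
  any(grepl(seqs[i], seqs[-i], fixed = TRUE))
}, logical(1))

kept <- proteome[!is_fragment, ]
print(kept)
```

If fragments should only be compared within the same protein, the same check could be run separately for each Protein.name.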

How to separate a dataframe based on specific string in column name [duplicate]

This question already has answers here:
Split string by last two characters in R? (/negative string indices)
(5 answers)
Closed 3 years ago.
I have a huge dataset that I cannot work out how to split into two sets.
df<- structure(list(name = structure(1:3, .Label = c("a", "b", "c"
), class = "factor"), X3C_AALI_01A = c(651L, 2L, 1877L), X3C_AALJ_01B = c(419L,
2L, 1825L), X3C_AALK_01A = c(1310L, 52L, 1286L), X4H_AAAK_11B = c(2978L,
4L, 1389L), X5L_AAT0_01B = c(2576L, 15L, 1441L), X5L_AAT1_01A = c(2886L,
5L, 921L), X5T_A9QA_03A = c(929L, 3L, 935L), A1_A0SI_10A = c(1578L,
1L, 2217L), A1_A0SK_07C = c(3003L, 6L, 2984L), A1_A0SO_01A = c(6413L,
0L, 3577L), A1_A0SP_05B = c(5157L, 5L, 4596L), A2_A04P_01A = c(4283L,
6L, 2508L), X5L_AAh1_10A = c(2886L, 5L, 921L), X5T_A0QA_03A = c(929L,
3L, 935L), A1_A0Sm_10A = c(1578L, 1L, 2217L), A1_ArSK_01A = c(3003L,
6L, 2984L), A1_AfSO_01A = c(6413L, 0L, 3577L), A1_AuSP_05A = c(5157L,
5L, 4596L), A2_Ap4P_11A = c(4283L, 6L, 2508L)), class = "data.frame", row.names = c(NA,
-3L))
Basically, I want to split the data based on the end of each column name. For example, in the data above the second column is named 3C_AALI_01A, and I want to generate two data sets based on the _01A part.
Columns whose code is 01 to 09 should go into one data frame, and columns whose code is 10 or higher should go into a second data frame.
For the example data above, the columns with the following names should be in one data frame:
3C_AALI_01A
3C_AALJ_01B
3C_AALK_01A
5L_AAT0_01B
5L_AAT1_01A
5T_A9QA_03A
A1_A0SK_07C
A1_A0SO_01A
A1_A0SP_05B
A2_A04P_01A
5T_A0QA_03A
A1_ArSK_01A
A1_AfSO_01A
A1_AuSP_05A
and the columns with the following names should be in another data frame
4H_AAAK_11B
A1_A0SI_10A
5L_AAh1_10A
A1_A0Sm_10A
A2_Ap4P_11A
df1 <- df[,grep('0[1-9].$',colnames(df))]
df2 <- df[,-grep('0[1-9].$',colnames(df))]
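A variation on the grep approach, in case it helps: extract the two-digit code as a number and compare it directly. This is a sketch under the assumption that every sample column ends in an underscore, two digits and one letter; it uses a cut-down version of the example data, and sample_cols, code, low and high are illustrative names. Unlike negative grep indices, logical indexing still behaves sensibly when no column matches, and the name column can be kept in both results.

```r
# Cut-down version of the example data
df <- data.frame(
  name = c("a", "b", "c"),
  X3C_AALI_01A = c(651L, 2L, 1877L),
  X4H_AAAK_11B = c(2978L, 4L, 1389L),
  A1_A0SI_10A  = c(1578L, 1L, 2217L),
  A1_A0SK_07C  = c(3003L, 6L, 2984L)
)

# Every column except the identifier column
sample_cols <- setdiff(names(df), "name")

# Pull out the two digits that sit between the last underscore and the final letter
code <- as.integer(sub("^.*_([0-9]{2})[A-Za-z]$", "\\1", sample_cols))

# Codes 01-09 go to one data frame, 10 and above to the other
low  <- df[, c("name", sample_cols[code < 10]), drop = FALSE]
high <- df[, c("name", sample_cols[code >= 10]), drop = FALSE]

print(names(low))
print(names(high))
```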
You could use a tidyr::separate(..., sep = -1) approach,
which uses negative string indexing, which is what you really want here.
Also, your data frame is transposed: it would be more usual to have a single column holding the sample names, and numeric columns a, b, c. Like t(df), but without the unwanted coercion to string.

Dynamic use of match function

I would like to match two data frames based on a certain column. My data frames are given below:
df <- structure(list(Read = structure(1:3, .Label = c("CC", "CG", "GC"
), class = "factor"), index = c(6L, 7L, 10L)), .Names = c("Read",
"index"), row.names = c(NA, -3L), class = "data.frame")
df1 <- structure(list(Ref_base = structure(c(1L, 6L, 4L, 2L, 3L, 4L,
3L, 5L), .Label = c("AT", "CC", "CG", "GC", "GT", "TG"), class = "factor"),
index = c(4L, 15L, 10L, 6L, 7L, 10L, 7L, 12L)), .Names = c("Ref_base",
"index"), row.names = c(NA, -8L), class = "data.frame")
I use match to find the matches between the two data frames:
match(df$index,df1$index)
and it gives me the correct result 4 5 3 as the indices of the matches. But I would like to lock down position 4, which is the index of the first match, and perform the remaining matches after position 4 (or whatever the first index is). I don't want the search to go back before the index of the first match. For example, I would like the indices returned to be 4, 5, 6, including repetitions if any.
The first solution is basically nothing more than a loop. It loops through all search elements from df$index and returns the match indices in tmp. The variable search_start is used to let the next search begin just after the most recent match. Since search_start is defined outside of the anonymous function passed to sapply, you have to use <<- instead of = or <- to modify it. There is also some code for handling NAs (this was missing in the first version of my answer).
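match_sapply <- function(a, b) {
  search_start <- 1
  tmp2 <- sapply(a, function(x) {
    # search only the part of b that has not been consumed yet
    tmp <- match(x, b[search_start:length(b)])
    # advance the start position in the enclosing function (unchanged on NA)
    search_start <<- search_start + ifelse(is.na(tmp), 0, tmp)
    tmp
  })
  # the following line replaces all non-NA elements of tmp2 with their cumulative sum
  `[<-`(tmp2, !is.na(tmp2), cumsum(tmp2[!is.na(tmp2)]))
}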
match_sapply(c(50,df$index,20),df1$index)
#[1] NA 4 5 6 NA
And another version using Recall. This is a recursive approach. Recall calls the function from which it was called (in our case match_recall) again, but you can provide different arguments. The arguments of match_recall are: x, the search terms; y, the target vector; n, the recursion level (which also selects the element of x to look up); si, the start index (same as search_start in the previous solution). Again, there is some code that handles NAs.
match_recall <- function(x, y, n = 1, si = 1) {
  tmp <- match(x[n], y[si:length(y)])
  tmp1 <- tmp
  if (is.na(tmp1)) tmp1 <- 0
  if (length(x) == n) {
    return(tmp)
  } else {
    c(tmp, tmp1 + Recall(x, y, n + 1, si + tmp1))
  }
}
match_recall(c(50,df$index,20),df1$index)
#[1] NA 4 5 6 NA
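For comparison, the same forward-only search can be written as a plain for loop. This is just a sketch, and match_forward is a made-up name. It adds a guard for the case where the search position runs past the end of the target vector, because start:length(b) would otherwise count backwards.

```r
match_forward <- function(a, b) {
  out <- rep(NA_integer_, length(a))
  start <- 1L                       # first position of b still open for matching
  for (i in seq_along(a)) {
    if (start > length(b)) break    # nothing left to search
    hit <- match(a[i], b[start:length(b)])
    if (!is.na(hit)) {
      out[i] <- start + hit - 1L    # convert the relative hit to a position in b
      start <- out[i] + 1L          # the next search begins after this match
    }
  }
  out
}

target <- c(4L, 15L, 10L, 6L, 7L, 10L, 7L, 12L)   # df1$index from the question
res <- match_forward(c(50, 6, 7, 10, 20), target)
print(res)
```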

Constructing All Possible Pairs within Groups

I have a large amount of graph data in the following form. Suppose a person has multiple interests.
person,interest
1,1
1,2
1,3
2,1
2,5
2,2
3,2
3,5
...
I want to construct all pairs of interests for each user. I would like to convert this into an edgelist like the following. I want the data in this format so that I can convert it into an adjacency matrix for graphing etc.
person,x_interest,y_interest
1,1,2
1,1,3
1,2,3
2,1,5
2,1,2
2,5,2
3,2,5
There is one solution here: Pairs of Observations within Groups but it works only for small datasets as the call to table wants to generate more than 2^31 elements. Is there another way that I can do this without having to rely on table?
We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'person', get the unique pairwise combinations of 'interest' to create two columns ('x_interest' and 'y_interest').
library(data.table)
setDT(df1)[,{tmp <- combn(unique(interest),2)
list(x_interest=tmp[c(TRUE, FALSE)], y_interest= tmp[c(FALSE, TRUE)])} , by = person]
NOTE: To speed up, combnPrim from library(gRbase) could be used in place of combn.
data
df1 <- structure(list(person = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
interest = c(1L,
2L, 3L, 1L, 5L, 2L, 2L, 5L)), .Names = c("person", "interest"
), class = "data.frame", row.names = c(NA, -8L))
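If data.table is not available, the same per-group pairing can be sketched in base R with split() and combn(). The helper name pairs_for is made up for illustration. It calls combn() on positions rather than on the values themselves, because combn(x, 2) treats a single number x as seq_len(x), and it skips people with fewer than two interests. It has not been tuned for very large inputs.

```r
df1 <- data.frame(person   = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
                  interest = c(1L, 2L, 3L, 1L, 5L, 2L, 2L, 5L))

# All unordered pairs of one person's interests, as a small data frame
pairs_for <- function(p, ints) {
  ints <- unique(ints)
  if (length(ints) < 2) return(NULL)      # no pair can be formed
  idx <- combn(length(ints), 2)           # 2 x k matrix of position pairs
  data.frame(person = p,
             x_interest = ints[idx[1, ]],
             y_interest = ints[idx[2, ]])
}

by_person <- split(df1$interest, df1$person)
pieces <- Map(pairs_for, as.integer(names(by_person)), by_person)
edges <- do.call(rbind, Filter(Negate(is.null), pieces))
rownames(edges) <- NULL
print(edges)
```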
