Find pairs of rows with identical values in different columns - r

I'm trying to subset some data but got stuck at this part. My data looks like this:
structure(list(sym_id = structure(c(1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 4L, 5L, 5L), .Label = c("AOL.HH", "ARCH.GA", "ARCH.GK",
"T.GJ", "T.GK"), class = "factor"), comp = structure(c(1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("AOL", "ARCH",
"T"), class = "factor"), seq_nb = c(18327L, 9952L, 39808L,
56601L, 44974L, 55302L, 20023L, 24403L, 15529L, 46202L, 57269L
), orig_seq_nb = c(81261L, 72161L, 9952L,
1276L, 98216L, 16423L, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_)), .Names = c("bond_sym_id",
"company_symbol", "seq_nb", "orig_seq_nb"), row.names = c(NA,
-11L), class = c("tbl_df", "tbl", "data.frame"))
I'm looking for code that returns pairs of rows that have identical values in some columns and whose values also match across two different columns.
The output should be:
Row1 ARCH.GA ARCH 9952 72161
Row2 ARCH.GA ARCH 39808 9952
As you can see, the columns "sym_id" and "comp" are equal in both rows of my desired output, and the "seq_nb" of the first row matches the "orig_seq_nb" of the second.
Appreciate your help!

We take the 3rd and 4th columns and, for each row, sort the two values and keep the first one (the smaller value; NA is sorted last). We cbind that with the first two columns and use duplicated, in both directions, to get a logical index of the rows that share the same key. This index can be used for subsetting the rows of 'df1'.
d2 <- cbind(df1[1:2], apply(df1[3:4],1, function(x) x[order(x)][1]))
df1[duplicated(d2)|duplicated(d2, fromLast=TRUE),]
# bond_sym_id company_symbol seq_nb orig_seq_nb
# <fctr> <fctr> <int> <int>
#1 ARCH.GA ARCH 9952 72161
#2 ARCH.GA ARCH 39808 9952
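The approach above compares the per-row minimum of the two columns. As a hedged alternative sketch, the cross-column match can also be tested directly: a row is kept if its seq_nb appears as the orig_seq_nb of another row with the same identifiers, or the other way round. The data frame below is rebuilt from the dput in the question; the helper names (key_seq, key_orig, is_referenced, refers_to_row) and the "|" separator are my own assumptions, not part of the original answer.

```r
# Data rebuilt from the question's dput (plain character columns for simplicity)
df1 <- data.frame(
  bond_sym_id = c("AOL.HH", rep("ARCH.GA", 5), rep("ARCH.GK", 2), "T.GJ", rep("T.GK", 2)),
  company_symbol = c("AOL", rep("ARCH", 7), rep("T", 3)),
  seq_nb = c(18327L, 9952L, 39808L, 56601L, 44974L, 55302L,
             20023L, 24403L, 15529L, 46202L, 57269L),
  orig_seq_nb = c(81261L, 72161L, 9952L, 1276L, 98216L, 16423L,
                  NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Build one key per row for each of the two columns
# (assumes "|" never occurs inside the identifiers)
key_seq  <- paste(df1$bond_sym_id, df1$company_symbol, df1$seq_nb, sep = "|")
key_orig <- paste(df1$bond_sym_id, df1$company_symbol, df1$orig_seq_nb, sep = "|")
key_orig[is.na(df1$orig_seq_nb)] <- NA  # rows without orig_seq_nb cannot match

is_referenced <- key_seq %in% key_orig   # seq_nb is some other row's orig_seq_nb
refers_to_row <- key_orig %in% key_seq   # orig_seq_nb is some other row's seq_nb

result <- df1[is_referenced | refers_to_row, ]
result
```

Unlike the minimum-based key, this only flags rows where a seq_nb really equals an orig_seq_nb, so two rows that merely share the same seq_nb are not reported.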

Related

Return one out of multiple rows with partially matching entries

I have a dataset named Proteome. It has 14 columns and thousands of rows.
dput(Proteome)
structure(list(Protein.name = structure(c(1L, 1L, 1L, 1L, 2L,
3L), .Label = c("HCTF", "IFT", "ROSF"), class = "factor"), X..Proteins = c(5L,
5L, 5L, 5L, 3L, 7L), X..PSMs = c(3L, 1L, 6L, 2L, 2L, 4L), Previous.5.amino.acids = structure(c(4L,
5L, 4L, 2L, 3L, 1L), .Label = c("CWYAT", "FCLKP", "MGCPT", "NCTMY",
"TMYFC"), class = "factor"), Sequence = structure(c(5L, 1L, 4L,
2L, 3L, 6L), .Label = c("FCLKPGCNFHAESTRGYR", "GCNFHAESTR", "GFGFNWPHAVR",
"GHFCLKPGCNFHAESTR", "GHFCLKPGCNFHAESTRGYR", "GNFSVKLMNR"), class = "factor")), .Names = c("Protein.name",
"X..Proteins", "X..PSMs", "Previous.5.amino.acids", "Sequence"
), class = "data.frame", row.names = c(NA, -6L))
The column of interest in this dataset is "Sequence". In row 2 of this column, the first two letters of row 1 are missing; in row 3, the last three letters of row 1 are missing; in row 4, the first seven and the last three letters of row 1 are missing.
Rows 2, 3, and 4 are artifacts of the scientific method I used to generate the data, and therefore I want to remove these entries.
I want R to return only one of the four rows, ideally row 1, and remove the rest. R could do this by first finding all rows that share a matching string of letters and then eliminating all of them but one. For example, in the above data set, GCNFHAESTR matches in all four rows, so I want R to return only one row, ideally the top one. But I don't know how to do this.
To further clarify, "Sequence" has hundreds of rows with partially matching entries, but the matching strings in those rows differ from the one shown in the example above. For example, rows 35 and 36 might have the following entries (Row 35: GNYTCAGCWPFK, and Row 36: YTCAGCWPFK). As the matching string in these rows is completely different from the one in the example above, I cannot declare the string beforehand. So I want a mechanism that detects all rows with partially matching entries and then keeps only one of them while deleting the others.
I look forward to hearing from the experts.
Thanks!

If I understood correctly, you just need to subset your data according to the presence of the string you want. Use grepl for that.
aa <- structure(list(Protein.name = structure(c(1L, 1L, 1L, 1L, 2L, 3L),
.Label = c("HCTF", "IFT", "ROSF"),
class = "factor"),
X..Proteins = c(5L, 5L, 5L, 5L, 3L, 7L),
X..PSMs = c(3L, 1L, 6L, 2L, 2L, 4L),
Previous.5.amino.acids = structure(c(4L, 5L, 4L, 2L, 3L, 1L),
.Label = c("CWYAT", "FCLKP", "MGCPT", "NCTMY", "TMYFC"),
class = "factor"),
Sequence = structure(c(5L, 1L, 4L, 2L, 3L, 6L),
.Label = c("FCLKPGCNFHAESTRGYR", "GCNFHAESTR", "GFGFNWPHAVR",
"GHFCLKPGCNFHAESTR", "GHFCLKPGCNFHAESTRGYR", "GNFSVKLMNR"),
class = "factor")),
.Names = c("Protein.name", "X..Proteins", "X..PSMs", "Previous.5.amino.acids", "Sequence"),
class = "data.frame", row.names = c(NA, -6L))
Declare the string to detect beforehand:
myStrToDetect <- 'GCNFHAESTR'
# The following line keeps only the rows where "Sequence" contains the pattern you provided (4 rows)
matching_df <- aa[grepl(myStrToDetect, aa$Sequence), ]
matching_df
Protein.name X..Proteins X..PSMs Previous.5.amino.acids Sequence
1 HCTF 5 3 NCTMY GHFCLKPGCNFHAESTRGYR
2 HCTF 5 1 TMYFC FCLKPGCNFHAESTRGYR
3 HCTF 5 6 NCTMY GHFCLKPGCNFHAESTR
4 HCTF 5 2 FCLKP GCNFHAESTR
# This next command chooses only the first line, if there are multiple occurrences
head(matching_df, 1)
Protein.name X..Proteins X..PSMs Previous.5.amino.acids Sequence
1 HCTF 5 3 NCTMY GHFCLKPGCNFHAESTRGYR
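The grepl answer relies on knowing the pattern in advance, which the question says is not possible. A minimal sketch of one way around that, under the assumption that "partially matching" means one sequence is contained in a longer one: drop every row whose sequence is a substring of some longer sequence in the same column. The function name contained_in_longer is hypothetical, and the sketch ignores Protein.name when comparing.

```r
# Reduced version of the question's data (only the columns needed here)
aa <- data.frame(
  Protein.name = c("HCTF", "HCTF", "HCTF", "HCTF", "IFT", "ROSF"),
  Sequence = c("GHFCLKPGCNFHAESTRGYR", "FCLKPGCNFHAESTRGYR",
               "GHFCLKPGCNFHAESTR", "GCNFHAESTR",
               "GFGFNWPHAVR", "GNFSVKLMNR"),
  stringsAsFactors = FALSE
)

# TRUE for each element that occurs inside a strictly longer element
contained_in_longer <- function(s) {
  vapply(seq_along(s), function(i) {
    others <- s[-i]
    any(nchar(others) > nchar(s[i]) & grepl(s[i], others, fixed = TRUE))
  }, logical(1))
}

# Keep sequences that are not fragments of a longer one; also drop exact repeats
keep <- !contained_in_longer(aa$Sequence) & !duplicated(aa$Sequence)
aa[keep, ]
```

This compares every sequence with every other one, so the cost grows quadratically with the number of rows; for a few thousand rows that should still be manageable. It keeps the longest sequence of each group, which in the example is row 1.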

R : Finding the corresponding row value

I'm trying to get the values in column one for the rows where column two is "B", and I need to collect those values in a list.
This needs to run over 50,000 rows; around 37,000 of them match.
I'm incredibly new to this so any help would be nice.
Data <- data.frame(
  X = sample(1:10),
  Y = sample(c("B", "W"), 10, replace = TRUE)
)
count <- 1
if (Data[count, 2] == "B") {
  List <- list(Data[count, 1])
  count <- count + 1
  # I'm not sure what to use to repeat, I just put "repeat" here
} else {
  count <- count + 1
  # repeat
}
The end result should be a list() containing only column one data.
For example, if rows 1-5 had "B", I would want the column one numbers from those rows.

Not sure if I understood correctly what you're looking for, but from the comments I would assume that this might help:
setNames(data.frame(Data[1][Data[2]=="B"]), "selected")
# selected
#1 2
#2 5
#3 7
#4 6
No loop needed.
data
Data <- structure(list(X = c(10L, 4L, 9L, 8L, 3L, 2L, 5L, 1L, 7L, 6L),
Y = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L),
.Label = c("B", "W"), class = "factor")),
.Names = c("X", "Y"), row.names = c(NA, -10L),
class = "data.frame")
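Since the question asks for a list, here is a short sketch of the same logical-indexing idea that returns a vector first and then converts it with as.list(). The data frame repeats the values from the dput above as plain characters; the names selected and selected_list are my own.

```r
# Same values as in the answer's data
Data <- data.frame(
  X = c(10L, 4L, 9L, 8L, 3L, 2L, 5L, 1L, 7L, 6L),
  Y = c("W", "W", "W", "W", "W", "B", "B", "W", "B", "B"),
  stringsAsFactors = FALSE
)

# Vectorised selection: X values of the rows where Y equals "B"
selected <- Data$X[Data$Y == "B"]

# Convert to a list only if a list is really required
selected_list <- as.list(selected)
```

The comparison runs once over the whole column, so the same two lines should work unchanged on 50,000 rows.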

Replacing loop in dplyr R

I am trying to write a function with dplyr without a loop, and here is something I do not know how to do.
Say we have TV stations (x, y, z) and months (2, 3). If I group by these and summarise a numeric value, I get this output:
TV months value
x 2 52
y 2 87
z 2 65
x 3 180
y 3 36
z 3 99
This is for the evaluated brand.
Then I have many brands, and I need to filter them to keep only those whose value is >= 0.8 * the value of the evaluated brand and <= 1.2 * the value of the evaluated brand.
For example, from the table below I would only want to keep the first two rows, and this should be done for all month and TV combinations.
brand TV MONTH value
sdg x 2 60
sdfg x 2 55
shs x 2 120
sdg x 2 11
sdga x 2 5000

As @akrun said, you need to use a combination of merging and subsetting. Here's a base R solution.
m <- merge(df, data, by.x=c("TV", "MONTH"), by.y=c("TV", "months"))
m[m$value.x >= m$value.y*0.8 & m$value.x <= m$value.y*1.2,][,-5]
# TV MONTH brand value.x
#1 x 2 sdg 60
#2 x 2 sdfg 55
Data
data <- structure(list(TV = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("x",
"y", "z"), class = "factor"), months = c(2L, 2L, 2L, 3L, 3L,
3L), value = c(52L, 87L, 65L, 180L, 36L, 99L)), .Names = c("TV",
"months", "value"), class = "data.frame", row.names = c(NA, -6L
))
df <- structure(list(brand = structure(c(2L, 1L, 4L, 2L, 3L), .Label = c("sdfg",
"sdg", "sdga", "shs"), class = "factor"), TV = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "x", class = "factor"), MONTH = c(2L,
2L, 2L, 2L, 2L), value = c(60L, 55L, 120L, 11L, 5000L)), .Names = c("brand",
"TV", "MONTH", "value"), class = "data.frame", row.names = c(NA,
-5L))
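Because the question asks for dplyr, the merge-and-subset idea can be sketched with a join followed by a filter. This assumes the dplyr package is installed; the suffixes ".brand" and ".ref" and the name result are my own choices, and the data repeats the values above as plain character columns.

```r
library(dplyr)

# Reference values of the evaluated brand per TV station and month
data <- data.frame(
  TV = c("x", "y", "z", "x", "y", "z"),
  months = c(2L, 2L, 2L, 3L, 3L, 3L),
  value = c(52L, 87L, 65L, 180L, 36L, 99L),
  stringsAsFactors = FALSE
)

# Candidate brands to be filtered
df <- data.frame(
  brand = c("sdg", "sdfg", "shs", "sdg", "sdga"),
  TV = "x",
  MONTH = 2L,
  value = c(60L, 55L, 120L, 11L, 5000L),
  stringsAsFactors = FALSE
)

result <- df %>%
  # attach the reference value for the same TV station and month
  inner_join(data, by = c("TV", "MONTH" = "months"),
             suffix = c(".brand", ".ref")) %>%
  # keep brands within 80% to 120% of the reference value
  filter(value.brand >= 0.8 * value.ref,
         value.brand <= 1.2 * value.ref) %>%
  select(-value.ref)

result
```

The join handles all TV and month combinations at once, so no loop over groups is needed.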

R program, ?count, rename "freq" to something else

I am studying this webpage, and cannot figure out how to rename freq to something else, say "number of times imbibed".
Here is the dput output of bevs:
structure(list(name = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L), .Label = c("Bill", "Llib"), class = "factor"), drink = structure(c(2L,
3L, 1L, 4L, 2L, 3L, 1L, 4L), .Label = c("cocoa", "coffee", "tea",
"water"), class = "factor"), cost = 1:8), .Names = c("name",
"drink", "cost"), row.names = c(NA, -8L), class = "data.frame")
And this is working code with output. Again, I'd like to rename the freq column. Thanks!
library(plyr)
bevs$cost <- as.integer(bevs$cost)
count(bevs, "name")
Output
name freq
1 Bill 4
2 Llib 4

Are you trying to do this?
counts <- count(bevs, "name")
names(counts) <- c("name", "number of times imbibed")
counts

The count() function returns a data.frame. Just rename it like any other data.frame:
counts <- count(bevs, "name")
names(counts)[which(names(counts) == "freq")] <- "number of times imbibed"
print(counts)
# name number of times imbibed
# 1 Bill 4
# 2 Llib 4
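If plyr is not a requirement, a hedged base R sketch gets the same counts and names the count column in one step, through the responseName argument of as.data.frame() for tables. The column name times_imbibed is my own choice; a name without spaces is easier to use with $ later.

```r
# Same data as in the question's dput
bevs <- data.frame(
  name = c("Bill", "Llib", "Bill", "Llib", "Bill", "Llib", "Bill", "Llib"),
  drink = c("coffee", "tea", "cocoa", "water", "coffee", "tea", "cocoa", "water"),
  cost = 1:8,
  stringsAsFactors = FALSE
)

# table() counts the occurrences; naming the argument sets the first column name,
# and responseName sets the name of the count column
counts <- as.data.frame(table(name = bevs$name),
                        responseName = "times_imbibed")
counts
```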

Combining dataframe rows based on a value in a range [duplicate]

I'm trying to bring together (it's not really a merge or join) data contained in two dataframes based on whether a value in one falls within a range on the second.
data is at the end of the post for convenience.
One data frame (df1) looks like this:
Chromosome Position P.value start.range end.range name
2 4553493 8.23e-05 4453493 4653493 A
3 24548810 1.04e-04 24448810 24648810 B
1 9952003 2.09e-04 9852003 10052003 C
The second df is much longer, but head(df2) looks like this:
ensembl_gene_id chromosome_name start_position end_position
OS01G0281600 1 10048273 10050309
OS01G0281400 1 10021423 10027120
OS01G0281301 1 10019633 10020376
OS01G0281200 1 10011875 10015468
OS01G0281100 1 10008075 10011595
OS01G0281000 1 10003952 10007742
I need to match the rows from each if df1$Position is within 100,000 of either df2$start_position or df2$end_position (i.e. abs(df1$Position - df2$start_position) < 100000 | abs(df1$Position - df2$end_position) < 100000).
I need, as output, a list or dataframe of the rows that match. There will be multiple df2 values that match df1, and there are multiple entries per chromosome, though df1$name is unique. I've been trying various applications of ddply and custom functions, but am coming up short. Any ideas?
data:
df1 <- structure(list(Chromosome = c(2L, 3L, 1L), Position = c(4553493L,
24548810L, 9952003L), P.value = c(8.23e-05, 0.000104, 0.000209
), start.range = c(4453493, 24448810, 9852003), end.range = c(4653493,
24648810, 10052003), name = c("A", "B", "C")), .Names = c("Chromosome",
"Position", "P.value", "start.range", "end.range", "name"), class = "data.frame", row.names = c(NA,
3L))
df2 <- structure(list(ensembl_gene_id = c("OS01G0281600", "OS01G0281400",
"OS01G0281301", "OS01G0281200", "OS01G0281100", "OS01G0281000",
"OS01G0280500", "OS01G0280400", "OS01G0280000", "OS01G0279900",
"OS01G0279800", "OS01G0279700", "OS01G0279400", "OS01G0279300",
"OS01G0279200", "OS01G0279100", "OS01G0279000", "OS01G0278900",
"OS01G0278950", "OS02G0183000", "OS02G0182850", "OS02G0182900",
"OS02G0182700", "OS02G0182800", "OS02G0182500", "OS02G0182300",
"OS02G0181900", "OS02G0182100", "OS02G0181800", "OS02G0181400",
"OS02G0180900", "OS02G0180700", "OS02G0180500", "OS02G0180200",
"OS02G0180400", "OS02G0180100", "OS03G0640300", "OS03G0640400",
"OS03G0640000", "OS03G0640100", "OS03G0639700", "OS03G0639800",
"OS03G0639600", "OS03G0639400", "OS03G0639300", "OS03G0638900",
"OS03G0639100", "OS03G0638400", "OS03G0638800", "OS03G0638300",
"OS03G0638200"), chromosome_name = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), start_position = c(10048273L,
10021423L, 10019633L, 10011875L, 10008075L, 10003952L, 9967185L,
9962807L, 9936850L, 9928971L, 9917593L, 9913390L, 9889550L, 9887657L,
9878384L, 9874379L, 9866730L, 9859354L, 9863216L, 4639932L, 4629617L,
4630446L, 4616832L, 4625425L, 4598883L, 4594375L, 4567630L, 4573831L,
4563073L, 4551426L, 4521670L, 4497115L, 4486531L, 4460342L, 4481872L,
4455016L, 24630180L, 24638186L, 24616417L, 24621460L, 24591421L,
24596843L, 24574540L, 24564913L, 24544511L, 24487877L, 24514494L,
24466606L, 24476060L, 24454477L, 24449135L), end_position = c(10050309L,
10027120L, 10020376L, 10015468L, 10011595L, 10007742L, 9969073L,
9966715L, 9947933L, 9935981L, 9921565L, 9917318L, 9902737L, 9889123L,
9885517L, 9876678L, 9870864L, 9860677L, 9866617L, 4641686L, 4630180L,
4634616L, 4621974L, 4628750L, 4601382L, 4595386L, 4573049L, 4578257L,
4566597L, 4552860L, 4523668L, 4500124L, 4489409L, 4463571L, 4483470L,
4457715L, 24634746L, 24641449L, 24617859L, 24629502L, 24596437L,
24600376L, 24579212L, 24565726L, 24549550L, 24489307L, 24515219L,
24473558L, 24480927L, 24457481L, 24453890L)), .Names = c("ensembl_gene_id",
"chromosome_name", "start_position", "end_position"), class = "data.frame", row.names = c(NA,
-51L))

Is this what you want?
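library(plyr)
ddply(df1, .(name), function(x) {
  # same chromosome (assumed to be required, since positions are per chromosome)
  df2[df2$chromosome_name == x$Chromosome &
        (abs(x$Position - df2$start_position) < 100000 |
           abs(x$Position - df2$end_position) < 100000), ]
})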
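The same range match can be sketched in base R without plyr: merge the two data frames on chromosome, then keep the pairs whose distance is below the threshold. The data below is a small subset of the posted df1 and df2 plus one made-up far-away gene (labelled HYPOTHETICAL_FAR) to show a non-match; max_dist, pairs, near and hits are my own names, and matching on chromosome is an assumption about what the question intends.

```r
df1 <- data.frame(
  Chromosome = c(2L, 3L, 1L),
  Position = c(4553493L, 24548810L, 9952003L),
  name = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# Three genes taken from the posted df2 and one hypothetical gene far away
df2 <- data.frame(
  ensembl_gene_id = c("OS01G0281600", "OS02G0183000", "OS03G0640300", "HYPOTHETICAL_FAR"),
  chromosome_name = c(1L, 2L, 3L, 1L),
  start_position = c(10048273L, 4639932L, 24630180L, 20000000L),
  end_position = c(10050309L, 4641686L, 24634746L, 20001000L),
  stringsAsFactors = FALSE
)

max_dist <- 100000

# All df1/df2 combinations that share a chromosome
pairs <- merge(df1, df2, by.x = "Chromosome", by.y = "chromosome_name")

# abs() makes the distance test symmetric around Position
near <- abs(pairs$Position - pairs$start_position) < max_dist |
  abs(pairs$Position - pairs$end_position) < max_dist

hits <- pairs[near, c("name", "ensembl_gene_id", "Position",
                      "start_position", "end_position")]
hits
```

Merging first creates one row per position and gene on the same chromosome, which can get large for long gene lists; filtering per chromosome would be a possible refinement.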
