Add values of columns based on condition of another variable in R

I want to create a variable that adds the values from other columns based on the value of the variable YEAR. That is, if YEAR = 2013, then add columns YR_2006, YR_2007, YR_2008, YR_2009, YR_2010, YR_2011. So for group A the sum would be 12,793.
GROUP YEAR YR_2006 YR_2007 YR_2008 YR_2009 YR_2010 YR_2011
A 2013 NA 636 3653 4759 3745 NA
B 2019 1417 2176 3005 2045 2088 1849
C 2007 4218 3622 4651 4574 4122 4711
E 2017 5956 6031 6032 4885 5400 5828

Here is an option with apply and MARGIN = 1 to loop over the rows: find the index where 'YEAR' matches one of the column names, build a sequence from the 2nd element to that index (or to the last element when there is no match), subset the values, and take the sum.
df1$Sum <- apply(df1[-1], 1, function(x)
  sum(x[2:c(grep(as.character(x[1]), names(x)[-1]) + 1,
            length(x))[1]], na.rm = TRUE))
df1$Sum
#[1] 12793 12580 7840 28886
Or we can use a vectorized option with rowSums after replacing some of the elements in each row with NA, based on matching the 'YEAR' column against the column names that start with 'YR_'
i1 <- startsWith(names(df1), "YR_")
i2 <- match(df1$YEAR, sub("YR_", "", names(df1)[i1]), nomatch = sum(i1))
rowSums(replace(df1[i1], col(df1[i1]) > i2[row(df1[i1])], NA), na.rm = TRUE)
#[1] 12793 12580 7840 28886
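To make the rule behind both snippets explicit, here is a minimal base R sketch that sums, for each row, every YR_ column whose year is not later than YEAR. The helper name sum_up_to_year is hypothetical. Note one assumed difference: a YEAR earlier than the first YR_ column gives 0 here, whereas the snippets above fall back to summing all columns when there is no match.

```r
# Same data as in the 'data' section of this answer
df1 <- data.frame(
  GROUP = c("A", "B", "C", "E"),
  YEAR = c(2013L, 2019L, 2007L, 2017L),
  YR_2006 = c(NA, 1417L, 4218L, 5956L),
  YR_2007 = c(636L, 2176L, 3622L, 6031L),
  YR_2008 = c(3653L, 3005L, 4651L, 6032L),
  YR_2009 = c(4759L, 2045L, 4574L, 4885L),
  YR_2010 = c(3745L, 2088L, 4122L, 5400L),
  YR_2011 = c(NA, 1849L, 4711L, 582L)
)

# Year encoded in each YR_ column name
yr_cols  <- grep("^YR_", names(df1), value = TRUE)
col_year <- as.integer(sub("^YR_", "", yr_cols))

# Hypothetical helper: sum row i over the columns up to and including YEAR
sum_up_to_year <- function(i) {
  keep <- col_year <= df1$YEAR[i]
  sum(unlist(df1[i, yr_cols[keep]]), na.rm = TRUE)
}

Sum <- vapply(seq_len(nrow(df1)), sum_up_to_year, numeric(1))
Sum
```

For the sample data this reproduces the sums shown above, since every YEAR is at or beyond 2006.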
Or using tidyverse to reshape to 'long' format with pivot_longer and then do a group_by sum after slicing the rows based on the match
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(cols = starts_with("YR_"), values_drop_na = TRUE) %>%
  group_by(GROUP) %>%
  slice(seq(match(first(YEAR), readr::parse_number(name), nomatch = n()))) %>%
  summarise(Sum = sum(value)) %>%
  left_join(df1, .)
GROUP YEAR YR_2006 YR_2007 YR_2008 YR_2009 YR_2010 YR_2011 Sum
1 A 2013 NA 636 3653 4759 3745 NA 12793
2 B 2019 1417 2176 3005 2045 2088 1849 12580
3 C 2007 4218 3622 4651 4574 4122 4711 7840
4 E 2017 5956 6031 6032 4885 5400 582 28886
data
df1 <- structure(list(GROUP = c("A", "B", "C", "E"), YEAR = c(2013L,
2019L, 2007L, 2017L), YR_2006 = c(NA, 1417L, 4218L, 5956L), YR_2007 = c(636L,
2176L, 3622L, 6031L), YR_2008 = c(3653L, 3005L, 4651L, 6032L),
YR_2009 = c(4759L, 2045L, 4574L, 4885L), YR_2010 = c(3745L,
2088L, 4122L, 5400L), YR_2011 = c(NA, 1849L, 4711L, 582L)),
class = "data.frame", row.names = c(NA,
-4L))

Related

Complex dataframe values selection based on both rows and columns

I need to select some values on each row of the dataset below and compute a sum.
This is a part of my dataset.
> prova
key_duration1 key_duration2 key_duration3 KeyPress1RESP KeyPress2RESP KeyPress3RESP
18 3483 364 3509 b n m
19 2367 818 3924 b n m
20 3775 1591 802 b m n
21 929 3059 744 n b n
22 3732 530 1769 b n m
23 3503 2011 2932 b n b
24 3684 1424 1688 b n m
Rows are trials of the experiment and columns are the keys pressed, in temporal sequence (keypressRESP) and the amount of time of the key until the next one (key_duration).
So for example in the first trial (first row) I pressed "b" and after 3483 ms I pressed "n" and so on.
This is my dataframe
structure(list(key_duration1 = c(3483L, 2367L, 3775L, 929L, 3732L,
3503L, 3684L), key_duration2 = c(364L, 818L, 1591L, 3059L, 530L,
2011L, 1424L), key_duration3 = c(3509, 3924, 802, 744, 1769,
2932, 1688), KeyPress1RESP = structure(c(2L, 2L, 2L, 4L, 2L,
2L, 2L), .Label = c("", "b", "m", "n"), class = "factor"), KeyPress2RESP = structure(c(4L,
4L, 3L, 2L, 4L, 4L, 4L), .Label = c("", "b", "m", "n"), class = "factor"),
KeyPress3RESP = structure(c(3L, 3L, 4L, 4L, 3L, 2L, 3L), .Label = c("",
"b", "m", "n"), class = "factor")), row.names = 18:24, class = "data.frame")
I need a method to select, in each row (trial), all "b" values, compute sum(key_duration) for them, and put the result in a new column; then the same for "m".
How can I do this?
I think I need a function similar to 'apply()' that operates only on selected values rather than on every value in the row.
apply(prova[,1:3],1,sum)
Thanks
Here is a way using data.table.
library(data.table)
setDT(prova)
# melt
prova_long <-
  melt(
    prova[, idx := 1:.N],
    id.vars = "idx",
    measure.vars = patterns("^key_duration", "^KeyPress"),
    variable.name = "key",
    value.name = c("duration", "RESP")
  )
# aggregate
prova_aggr <- prova_long[RESP != "n", .(duration_sum = sum(duration)), by = .(idx, RESP)]
# spread and join
prova[dcast(prova_aggr, idx ~ paste0("sum_", RESP)), c("sum_b", "sum_m") := .(sum_b, sum_m), on = "idx"]
prova
Result
# key_duration1 key_duration2 key_duration3 KeyPress1RESP KeyPress2RESP KeyPress3RESP idx sum_b sum_m
#1: 3483 364 3509 b n m 1 3483 3509
#2: 2367 818 3924 b n m 2 2367 3924
#3: 3775 1591 802 b m n 3 3775 1591
#4: 929 3059 744 n b n 4 3059 NA
#5: 3732 530 1769 b n m 5 3732 1769
#6: 3503 2011 2932 b n b 6 6435 NA
#7: 3684 1424 1688 b n m 7 3684 1688
The idea is to reshape your data to long format, aggregate by "RESP" per row. Spread the result and join back to your initial data.
With tidyverse you can do:
bind_cols(df %>%
            select_at(vars(starts_with("KeyPress"))) %>%
            rowid_to_column() %>%
            gather(var, val, -rowid),
          df %>%
            select_at(vars(starts_with("key_"))) %>%
            rowid_to_column() %>%
            gather(var, val, -rowid)) %>%
  group_by(rowid) %>%
  summarise(b_values = sum(val1[val == "b"]),
            m_values = sum(val1[val == "m"])) %>%
  left_join(df %>%
              rowid_to_column(), by = c("rowid" = "rowid")) %>%
  ungroup() %>%
  select(-rowid)
b_values m_values key_duration1 key_duration2 key_duration3 KeyPress1RESP KeyPress2RESP KeyPress3RESP
<dbl> <dbl> <int> <int> <dbl> <fct> <fct> <fct>
1 3483. 3509. 3483 364 3509. b n m
2 2367. 3924. 2367 818 3924. b n m
3 3775. 1591. 3775 1591 802. b m n
4 3059. 0. 929 3059 744. n b n
5 3732. 1769. 3732 530 1769. b n m
6 6435. 0. 3503 2011 2932. b n b
7 3684. 1688. 3684 1424 1688. b n m
First, it splits the df into two: one with variables starting with "KeyPress" and one with variables starting with "key_". Second, it transforms the two dfs from wide to long format and combines them by columns. Third, it creates a summary for "b" and "m" values according to row ID. Finally, it merges the results with the original df.
You can make a logical matrix from the KeyPress columns, multiply it by the key_duration subset and then take their rowSums.
prova$b_values <- rowSums((prova[, 4:6] == "b") * prova[, 1:3])
prova$n_values <- rowSums((prova[, 4:6] == "n") * prova[, 1:3])
key_duration1 key_duration2 key_duration3 KeyPress1RESP KeyPress2RESP KeyPress3RESP b_values n_values
18 3483 364 3509 b n m 3483 364
19 2367 818 3924 b n m 2367 818
20 3775 1591 802 b m n 3775 802
21 929 3059 744 n b n 3059 1673
22 3732 530 1769 b n m 3732 530
23 3503 2011 2932 b n b 6435 2011
24 3684 1424 1688 b n m 3684 1424
It works because the logical values are coerced to numeric 1s or 0s, and only the values for individual keys are retained.
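The coercion can be seen on a tiny example. This is only an illustrative sketch: the two small matrices below are copied from the first and fourth trials of the data, and the names keys, durations and mask are made up for the demonstration.

```r
# Keys pressed and their durations for two trials (rows 18 and 21 of prova)
keys <- matrix(c("b", "n", "m",
                 "n", "b", "n"), nrow = 2, byrow = TRUE)
durations <- matrix(c(3483, 364, 3509,
                      929, 3059, 744), nrow = 2, byrow = TRUE)

# TRUE where the key is "b"; multiplying treats TRUE as 1 and FALSE as 0,
# so every duration that does not belong to "b" becomes 0
mask <- keys == "b"
masked <- mask * durations

b_sum <- rowSums(masked)                    # per-trial total for "b"
n_sum <- rowSums((keys == "n") * durations) # same idea for "n"
b_sum
n_sum
```

One caveat worth keeping in mind: a missing duration would propagate as NA through the multiplication, so na.rm = TRUE in rowSums may be needed on real data.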
Extra: to generalise, you could instead use a function and tidyverse/purrr to map it:
get_sums <- function(key) rowSums((prova[, 4:6] == key) * prova[, 1:3])
keylist <- list(b_values = "b", n_values = "n", m_values = "m")
library(tidyverse)
bind_cols(prova, map_dfr(keylist, get_sums))

How to merge two dataframes based on range value of one table

DF1
SIC Value
350 100
460 500
140 200
290 400
506 450
DF2
SIC1 AREA
100-200 Forest
201-280 Hospital
281-350 Education
351-450 Government
451-550 Land
Note: SIC1 is of class character, so it needs to be converted to a numeric range.
I am trying to get the output below.
Desired output:
DF3
SIC Value AREA
350 100 Education
460 500 Land
140 200 Forest
290 400 Education
506 450 Land
I first tried to convert SIC1 from character to numeric
and then to merge, but had no luck. Can someone guide me on this?
An option can be to use tidyr::separate along with sqldf to join both tables on range of values.
library(sqldf)
library(tidyr)
DF2 <- separate(DF2, "SIC1", c("Start", "End"), sep = "-", convert = TRUE)
sqldf("select DF1.*, DF2.AREA from DF1, DF2
WHERE DF1.SIC between DF2.Start AND DF2.End")
# SIC Value AREA
# 1 350 100 Education
# 2 460 500 Land
# 3 140 200 Forest
# 4 290 400 Education
# 5 506 450 Land
Data:
DF1 <- read.table(text =
"SIC Value
350 100
460 500
140 200
290 400
506 450",
header = TRUE, stringsAsFactors = FALSE)
DF2 <- read.table(text =
"SIC1 AREA
100-200 Forest
201-280 Hospital
281-350 Education
351-450 Government
451-550 Land",
header = TRUE, stringsAsFactors = FALSE)
We could do a non-equi join. Split (tstrsplit) the 'SIC1' column in 'DF2' to numeric columns and then do a non-equi join with the first dataset.
library(data.table)
setDT(DF2)[, c('start', 'end') := tstrsplit(SIC1, '-', type.convert = TRUE)]
DF2[, -1, with = FALSE][DF1, on = .(start <= SIC, end >= SIC),
mult = 'last'][, .(SIC = start, Value, AREA)]
# SIC Value AREA
#1: 350 100 Education
#2: 460 500 Land
#3: 140 200 Forest
#4: 290 400 Education
#5: 506 450 Land
Or, as @Frank mentioned, we can do a rolling join to extract the 'AREA' and add it to the first dataset
setDT(DF1)[, AREA := DF2[DF1, on=.(start = SIC), roll=TRUE, x.AREA]]
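The rolling join pairs each SIC with the closest range start at or below it. A similar lookup can be sketched in base R with findInterval, assuming the ranges are sorted by their start and do not overlap. The variable names bounds, idx and in_range are illustrative only, and the explicit check against the range end is an addition so that values falling in a gap between ranges get NA.

```r
DF1 <- data.frame(SIC = c(350L, 460L, 140L, 290L, 506L),
                  Value = c(100L, 500L, 200L, 400L, 450L))
DF2 <- data.frame(SIC1 = c("100-200", "201-280", "281-350", "351-450", "451-550"),
                  AREA = c("Forest", "Hospital", "Education", "Government", "Land"),
                  stringsAsFactors = FALSE)

# Split "100-200" into numeric start and end columns
bounds <- do.call(rbind, strsplit(DF2$SIC1, "-", fixed = TRUE))
start  <- as.integer(bounds[, 1])
end    <- as.integer(bounds[, 2])

# Index of the last range whose start is <= SIC (0 when SIC is below all ranges)
idx <- findInterval(DF1$SIC, start)
idx[idx == 0] <- NA

# Keep the match only when SIC also lies at or below that range's end
in_range <- !is.na(idx) & DF1$SIC <= end[idx]
DF1$AREA <- ifelse(in_range, DF2$AREA[idx], NA_character_)
DF1
```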
data
DF1 <- structure(list(SIC = c(350L, 460L, 140L, 290L, 506L), Value = c(100L,
500L, 200L, 400L, 450L)), .Names = c("SIC", "Value"),
class = "data.frame", row.names = c(NA, -5L))
DF2 <- structure(list(SIC1 = c("100-200", "201-280", "281-350", "351-450",
"451-550"), AREA = c("Forest", "Hospital", "Education", "Government",
"Land")), .Names = c("SIC1", "AREA"), class = "data.frame",
row.names = c(NA, -5L))

Combine list of data frames of differing length by row names in R

I have a list (df) of data frames of differing lengths, indexed by years such that a proxy of the data looks like:
df
$df1
X..i..
1999 10
1998 13
1997 14
$df2
X..i..
1999 20
1998 11
$df3
X..i..
1999 17
1998 8
1997 9
1996 19
I would like to combine these data frames to a single data frame using and preserving the index/rownames
So that:
df_all
Index df1 df2 df3
1999 10 20 17
1998 13 11 8
1997 14 n/a 9
1996 n/a n/a 19
Edit:
smalldflist <- lapply(bai_df, function(i) head(i, 10))
dput(smalldflist)
Produces the following output:
structure(list(IN_DonaldsonWoods_QUAL.txt = structure(list(X..i.. = c(4.5528243479162,
32.6474339976978, 52.7116018957456, 170.932582874866, 227.0430440174,
191.462399206825, 226.94053541991, 274.854835798233, 336.457600434571,
409.132933511232)), .Names = "X..i..", row.names = c("1725",
"1726", "1727", "1728", "1729", "1730", "1731", "1732", "1733",
"1734"), class = "data.frame"), IN_DonaldsonWoods_QURU.txt = structure(list(
X..i.. = c(4.33729067152776, 5.72878688080428, 13.0247658962315,
22.0205798005054, 25.9885943197615, 18.9273551074104, 43.5197887382031,
58.2775710248884, 72.9225976242458, 73.0466756114972)), .Names = "X..i..", row.names = c("1827",
"1828", "1829", "1830", "1831", "1832", "1833", "1834", "1835",
"1836"), class = "data.frame"), IN_DonaldsonWoods_QUVE.txt = structure(list(
X..i.. = c(7.87253273859391, 18.9481296742303, 42.5055176960097,
62.9980951594496, 88.906442207264, 74.2523230533691, 106.911242713809,
152.445167763284, 192.399603839633, 221.263660216113)), .Names = "X..i..", row.names = c("1731",
"1732", "1733", "1734", "1735", "1736", "1737", "1738", "1739",
"1740"), class = "data.frame"), IN_LillyDickey_QUAL.txt = structure(list(
X..i.. = c(8.29576810088555, 17.2934968058816, 31.2091720401804,
33.8966066349882, 47.6496887415004, 32.9921546763907, 82.2281435044324,
108.068226885475, 103.894002151431, 110.255812097949)), .Names = "X..i..", row.names = c("1863",
"1864", "1865", "1866", "1867", "1868", "1869", "1870", "1871",
"1872"), class = "data.frame"), IN_LillyDickey_QUMO.txt = structure(list(
X..i.. = c(3.42413493048312, 8.0847630303073, 19.6833503197648,
13.791136218324, 21.4638165402601, 30.6707376168741, 30.8789937938806,
26.8661212585221, 24.0732956549621, 29.7872997715364)), .Names = "X..i..", row.names = c("1867",
"1868", "1869", "1870", "1871", "1872", "1873", "1874", "1875",
"1876"), class = "data.frame"), IN_Pioneers_QUAL.txt = structure(list(
X..i.. = c(9.14340435634345, 23.5108626053757, 33.8507393822465,
46.1027716604662, 57.5247983011993, 50.5892015892391, 92.2448163706925,
225.832932372368, 278.367628044195, 193.931508820174)), .Names = "X..i..", row.names = c("1817",
"1818", "1819", "1820", "1821", "1822", "1823", "1824", "1825",
"1826"), class = "data.frame"), IN_Pioneers_QURU.txt = structure(list(
X..i.. = c(122.443727611702, 658.649900930018, 830.471777578934,
843.357139228152, 1725.6495913006, 1244.38668477703, 973.00892131628,
1294.7441782001, 1717.18570086886, 1676.63841798444)), .Names = "X..i..", row.names = c("1861",
"1862", "1863", "1864", "1865", "1866", "1867", "1868", "1869",
"1870"), class = "data.frame"), OH_JohnsonWoods_QUAL.txt = structure(list(
X..i.. = c(1.9113449704439, 3.39794661412248, 5.32688450342693,
6.41921626908008, 11.0307601252838, 13.0825342873437, 15.843680070585,
16.885746353779, 20.1011664347289, 19.853294774361)), .Names = "X..i..", row.names = c("1626",
"1627", "1628", "1629", "1630", "1631", "1632", "1633", "1634",
"1635"), class = "data.frame")), .Names = c("IN_DonaldsonWoods_QUAL.txt",
"IN_DonaldsonWoods_QURU.txt", "IN_DonaldsonWoods_QUVE.txt", "IN_LillyDickey_QUAL.txt",
"IN_LillyDickey_QUMO.txt", "IN_Pioneers_QUAL.txt", "IN_Pioneers_QURU.txt",
"OH_JohnsonWoods_QUAL.txt"))
You can use Reduce to merge multiple data frames. Set all = TRUE, which adds NAs when no match occurs. Note that df is the list of data frames as you have set it up, and by indicates the column used for merging. Therefore, in your list of data frames, "Index" should be the name of the year column in each data frame.
Reduce(function(...) merge(..., by="Index", all=TRUE), df)
And thanks to @jazzuro for providing sample data, here is the equivalent solution using Reduce in base R. In this sample set, the column used for merging is by="year":
df1 <- data.frame(year = c(1999, 1998, 1997),
value = c(10, 13, 14))
df2 <- data.frame(year = c(1999, 1998),
value = c(20, 11))
df3 <- data.frame(year = c(1999, 1998, 1997, 1996),
value = c(17, 8, 9, 19))
df <- list(df1=df1, df2=df2, df3=df3)
df_merge <- Reduce(function(...) merge(..., by="year", all=TRUE), df)
colnames(df_merge) <- c("Index", names(df))
# Index df1 df2 df3
# 1 1996 NA NA 19
# 2 1997 14 NA 9
# 3 1998 13 11 8
# 4 1999 10 20 17
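The sample above merges on an existing year column. In the question's list the years are stored as row names instead, so a possible adaptation is sketched below: it first copies the row names into an Index column and renames the value column after the list element. The list is rebuilt from the proxy data in the question, and with_index is a hypothetical intermediate name.

```r
# Proxy data from the question: years are row names, one value column each
df <- list(
  df1 = data.frame(X..i.. = c(10, 13, 14), row.names = c("1999", "1998", "1997")),
  df2 = data.frame(X..i.. = c(20, 11), row.names = c("1999", "1998")),
  df3 = data.frame(X..i.. = c(17, 8, 9, 19),
                   row.names = c("1999", "1998", "1997", "1996"))
)

# Turn row names into an Index column and name the value column after the element
with_index <- Map(function(d, nm) {
  out <- data.frame(Index = as.integer(rownames(d)), d[[1]])
  names(out)[2] <- nm
  out
}, df, names(df))

# Chain the full outer merges, then sort newest year first as in the question
df_all <- Reduce(function(x, y) merge(x, y, by = "Index", all = TRUE), with_index)
df_all <- df_all[order(df_all$Index, decreasing = TRUE), ]
df_all
```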
If the global environment contains only the data frames you need, you could try the following. First, you collect the unique years across all data frames and create a master data frame that contains only those years. Then, you put all data frames in a list and merge each of them with master. Since the master data frame itself ends up in temp, you remove it. Finally, you bind all data frames and change the long format to a wide format.
library(tidyverse)
# Create a data frame with all unique years
master <- data.frame(year = mget(ls()) %>%
sapply(`[`, 1) %>%
as_vector %>%
unique)
# Merge each data frame with the master df
temp <- mget(ls()) %>%
lapply(function(x){full_join(x, master, by = "year")})
# Remove the master df in the list
temp[["master"]] <- NULL
# Bind all dfs and make it wide.
bind_rows(temp, .id = "data") %>%
spread(key = data, value = value)
# year df1 df2 df3
#1 1996 NA NA 19
#2 1997 14 NA 9
#3 1998 13 11 8
#4 1999 10 20 17
DATA
df1 <- data.frame(year = c(1999, 1998, 1997),
value = c(10, 13, 14))
df2 <- data.frame(year = c(1999, 1998),
value = c(20, 11))
df3 <- data.frame(year = c(1999, 1998, 1997, 1996),
value = c(17, 8, 9, 19))
Reconsider the chain merge as @Djork shows, but make sure you create an actual column named Index equal to rownames(). Also, rename the X..i.. column according to df#, which also avoids the duplicate column warning during merges. Below, dfs is equivalent to the posted smalldflist:
dfs <- lapply(seq_along(dfs), function(i){
  dfs[[i]]$Index = rownames(dfs[[i]]) # CREATE INDEX
  colnames(dfs[[i]])[1] <- paste0("df", i) # RENAME X..i.. COLUMN
  return(dfs[[i]])
})
dfs[[1]]
# df1 Index
# 1725 4.552824 1725
# 1726 32.647434 1726
# 1727 52.711602 1727
# 1728 170.932583 1728
# 1729 227.043044 1729
# 1730 191.462399 1730
# 1731 226.940535 1731
# 1732 274.854836 1732
# 1733 336.457600 1733
# 1734 409.132934 1734
finaldf <- Reduce(function(...) merge(..., by="Index", all=TRUE), dfs)
finaldf
# Index df1 df2 df3 df4 df5 df6 df7 df8
# 1 1626 NA NA NA NA NA NA NA 1.911345
# 2 1627 NA NA NA NA NA NA NA 3.397947
# 3 1628 NA NA NA NA NA NA NA 5.326885
# 4 1629 NA NA NA NA NA NA NA 6.419216
# 5 1630 NA NA NA NA NA NA NA 11.030760
# ...

R comparing 2 dfs to sum data between values

I have 2 dataframes in R, one with start (column 1) and end (column 2) coordinates...
df1
2500 3499
3500 4499
4500 5499
5500 6499
And one with point coordinates (column 1) and associated values (column 2)...
df2
2657 17
2895 33
3875 12
4448 42
5122 3
5633 65
5781 12
I would like to find a vectorized approach to sum the values from df2 column 2 where df2 column 1 coordinates are between the start and stop coordinates for df1. with this data the result should look like this...
df3
2500 3499 50
3500 4499 54
4500 5499 3
5500 6499 77
The dfs contain 100,000+ rows. I can achieve this easily using loops, but as we are in R that is slow and not the best approach.
What is the best way to do this? Also a flexible solution that can be adapted to other functions, other than simply summing data would be good to know.
Here's a possible data.table::foverlaps solution. As you haven't specified column names, I'm assuming that they are called V1 and V2 in both data sets
Solution
library(data.table)
setDT(df1)[, `:=`(start = V1, end = V2)]
setDT(df2)[, `:=`(start = V1, end = V1)]
setkey(df1, start, end)
foverlaps(df2, df1)[, list(SumV2 = sum(i.V2)), by = list(V1, V2)]
# V1 V2 SumV2
# 1: 2500 3499 50
# 2: 3500 4499 54
# 3: 4500 5499 3
# 4: 5500 6499 77
Explanation
Here we converted both data sets to data.table objects and specified the start/end values to overlap on. Then, we keyed the data set that we want to join against. Finally we ran the foverlaps function and then aggregated the matched values of V2 from df2 by the desired columns in df1
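Since the question also asks for something that adapts to functions other than sum, here is a base R sketch of the same idea using findInterval and tapply. It assumes the intervals in df1 are sorted by start and do not overlap, which holds for the sample data but is not stated in the question. The names bin, ok and agg are illustrative.

```r
df1 <- data.frame(V1 = c(2500L, 3500L, 4500L, 5500L),
                  V2 = c(3499L, 4499L, 5499L, 6499L))
df2 <- data.frame(V1 = c(2657L, 2895L, 3875L, 4448L, 5122L, 5633L, 5781L),
                  V2 = c(17L, 33L, 12L, 42L, 3L, 65L, 12L))

# For each point, index of the last interval whose start is <= the point
bin <- findInterval(df2$V1, df1$V1)   # 0 means before the first start

# Keep only points that also fall at or before the end of that interval
ok <- bin > 0
ok[ok] <- df2$V1[ok] <= df1$V2[bin[ok]]

# Aggregate per interval; swap `sum` for mean, max, length, ... as needed.
# Using all interval indices as factor levels keeps empty intervals as NA.
agg <- tapply(df2$V2[ok], factor(bin[ok], levels = seq_len(nrow(df1))), sum)
df3 <- cbind(df1, Sum = as.vector(agg))
df3
```

Both findInterval and tapply are vectorized, so no explicit loop over the rows is needed.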
Data
df1 <- structure(list(V1 = c(2500L, 3500L, 4500L, 5500L), V2 = c(3499L,
4499L, 5499L, 6499L)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(V1 = c(2657L, 2895L, 3875L, 4448L, 5122L, 5633L,
5781L), V2 = c(17L, 33L, 12L, 42L, 3L, 65L, 12L)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -7L))

Combining (pasting) columns

I have the following data.frame
Tipo Start End Strand Accesion1 Accesion2
1 gene 197 1558 + <NA> SP_0001
2 CDS 197 1558 + NP_344554 <NA>
3 gene 1717 2853 + <NA> SP_0002
4 CDS 1717 2853 + NP_344555 <NA>
5 gene 2864 3112 + <NA> SP_0003
6 CDS 2864 3112 + NP_344556 <NA>
There are more "Tipo" values, such as tRNA, region , exon, or rRNA, but I am only interested in combining these two, gene and CDS
And I would like to get the following
Start End Accesion1 Accesion2
1 197 1558 NP_344554 SP_0001
but only when the start and End values of gene and CDS coincide. I've tried to use select, arrange and mutate with dplyr, but it is sort of complicated for me to get rid of the NAs
A dplyr version with summarize_each:
DF %>%
group_by(Start, End) %>%
summarise_each(funs(max), Accesion1, Accesion2)
Produces:
Source: local data frame [3 x 4]
Groups: Start
Start End Accesion1 Accesion2
1 197 1558 NP_344554 SP_0001
2 1717 2853 NP_344555 SP_0002
3 2864 3112 NP_344556 SP_0003
This assumes the AccesionX variables are character (it does not work with factors), and that each Start/End pair contains exactly two rows, one gene and one CDS, as in your data set.
You could try
library(data.table)
setDT(df1)[, id := cumsum(Tipo == 'gene')][,
  list(Accesion1 = na.omit(Accesion1), Accesion2 = na.omit(Accesion2)),
  list(id, Start, End)]
Here's a solution using aggregate():
df <- data.frame(Tipo=c('gene','CDS','gene','CDS','gene','CDS'), Start=c(197,197,1717,1717,2864,2864), End=c(1558,1558,2853,2853,3112,3112), Strand=c('+','+','+','+','+','+'), Accesion1=c(NA,'NP_344554',NA,'NP_344555',NA,'NP_344556'), Accesion2=c('SP_0001',NA,'SP_0002',NA,'SP_0003',NA) );
df2 <- df[df$Tipo%in%c('gene','CDS'),c('Start','End','Accesion1','Accesion2')];
aggregate(df2[,c('Accesion1','Accesion2')], df2[,c('Start','End')], function(x) x[!is.na(x)] );
## Start End Accesion1 Accesion2
## 1 197 1558 NP_344554 SP_0001
## 2 1717 2853 NP_344555 SP_0002
## 3 2864 3112 NP_344556 SP_0003
Precomputing df2 is necessary in case there are non-gene non-CDS rows in the original data.frame; in order to properly aggregate just the gene and CDS rows, the non-gene non-CDS rows must be excluded from both x and by. (Of course, your example data has only gene and CDS rows, so it's not technically necessary for the example data.)
This solution makes the assumption that whenever two rows have the same Start and End values, then they must be gene/CDS pairs (as opposed to gene/gene or CDS/CDS).
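If that assumption is a concern, one alternative is to split the gene and CDS rows and merge them on Start and End, so a pair is only formed when one row of each type shares the same coordinates. This is a sketch on the example data, not a drop-in replacement; the names genes, cds and paired are made up here.

```r
df <- data.frame(
  Tipo = c("gene", "CDS", "gene", "CDS", "gene", "CDS"),
  Start = c(197, 197, 1717, 1717, 2864, 2864),
  End = c(1558, 1558, 2853, 2853, 3112, 3112),
  Strand = "+",
  Accesion1 = c(NA, "NP_344554", NA, "NP_344555", NA, "NP_344556"),
  Accesion2 = c("SP_0001", NA, "SP_0002", NA, "SP_0003", NA),
  stringsAsFactors = FALSE
)

# gene rows carry Accesion2, CDS rows carry Accesion1
genes <- df[df$Tipo == "gene", c("Start", "End", "Accesion2")]
cds   <- df[df$Tipo == "CDS",  c("Start", "End", "Accesion1")]

# Inner join: only coordinates present in both subsets survive
paired <- merge(cds, genes, by = c("Start", "End"))
paired
```

Rows of any other Tipo are dropped by the subsetting. If several gene or CDS rows share the same coordinates, merge returns every combination, so duplicates would need handling separately.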
Here is one potential way. You choose rows with gene and CDS. Then, you group your data by Start and End. There may be Start/End groups with 1 or 3+ rows, so you want to make sure that you choose only Start/End groups with exactly two rows. In addition, you want to make sure that you have both gene and CDS (length(unique(Tipo)) == 2). Finally, you take the non-NA element in Accesion1 and Accesion2.
filter(df, Tipo %in% c("gene", "CDS")) %>%
  group_by(Start, End) %>%
  filter(n() == 2 & length(unique(Tipo)) == 2) %>%
  summarise(Accesion1 = Accesion1[!is.na(Accesion1)],
            Accesion2 = Accesion2[!is.na(Accesion2)])
Here is a pseudo example.
mydf <- structure(list(Tipo = structure(c(2L, 1L, 2L, 1L, 2L, 2L), .Label = c("CDS",
"gene"), class = "factor"), Start = c(197, 197, 1717, 1717, 2864,
2864), End = c(1558, 1558, 2853, 2853, 3112, 3112), Strand = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "+", class = "factor"), Accesion1 = structure(c(NA,
1L, NA, 2L, NA, 3L), .Label = c("NP_344554", "NP_344555", "NP_344556"
), class = "factor"), Accesion2 = structure(c(1L, NA, 2L, NA,
3L, NA), .Label = c("SP_0001", "SP_0002", "SP_0003"), class = "factor")), .Names = c("Tipo",
"Start", "End", "Strand", "Accesion1", "Accesion2"), row.names = c(NA,
-6L), class = "data.frame")
Tipo Start End Strand Accesion1 Accesion2
1 gene 197 1558 + <NA> SP_0001
2 CDS 197 1558 + NP_344554 <NA>
3 gene 1717 2853 + <NA> SP_0002
4 CDS 1717 2853 + NP_344555 <NA>
5 gene 2864 3112 + <NA> SP_0003
6 gene 2864 3112 + NP_344556 <NA>
filter(mydf, Tipo %in% c("gene", "CDS")) %>%
  group_by(Start, End) %>%
  filter(n() == 2 & length(unique(Tipo)) == 2) %>%
  summarise(Accesion1 = Accesion1[!is.na(Accesion1)],
            Accesion2 = Accesion2[!is.na(Accesion2)])
# Start End Accesion1 Accesion2
#1 197 1558 NP_344554 SP_0001
#2 1717 2853 NP_344555 SP_0002
