Complex dataframe values selection based on both rows and columns - r

I need to select some values on each row of the dataset below and compute a sum.
This is a part of my dataset.
> prova
key_duration1 key_duration2 key_duration3 KeyPress1RESP KeyPress2RESP KeyPress3RESP
18 3483 364 3509 b n m
19 2367 818 3924 b n m
20 3775 1591 802 b m n
21 929 3059 744 n b n
22 3732 530 1769 b n m
23 3503 2011 2932 b n b
24 3684 1424 1688 b n m
Rows are trials of the experiment; the columns give the keys pressed in temporal sequence (KeyPressRESP) and the time each key was held until the next press (key_duration).
So, for example, in the first trial (first row) I pressed "b", then after 3483 ms I pressed "n", and so on.
This is my dataframe
structure(list(key_duration1 = c(3483L, 2367L, 3775L, 929L, 3732L,
3503L, 3684L), key_duration2 = c(364L, 818L, 1591L, 3059L, 530L,
2011L, 1424L), key_duration3 = c(3509, 3924, 802, 744, 1769,
2932, 1688), KeyPress1RESP = structure(c(2L, 2L, 2L, 4L, 2L,
2L, 2L), .Label = c("", "b", "m", "n"), class = "factor"), KeyPress2RESP = structure(c(4L,
4L, 3L, 2L, 4L, 4L, 4L), .Label = c("", "b", "m", "n"), class = "factor"),
KeyPress3RESP = structure(c(3L, 3L, 4L, 4L, 3L, 2L, 3L), .Label = c("",
"b", "m", "n"), class = "factor")), row.names = 18:24, class = "data.frame")
I need a method to select, within each row (trial), all the "b" responses, compute sum(key_duration) over them and write the result to a new column, and the same for "m".
How can I do this?
I think I need a function similar to apply(), but one that doesn't use every value in the row, only the selected ones.
apply(prova[,1:3],1,sum)
Thanks

Here is a way using data.table.
library(data.table)
setDT(prova)
# melt
prova_long <-
  melt(
    prova[, idx := 1:.N],
    id.vars = "idx",
    measure.vars = patterns("^key_duration", "^KeyPress"),
    variable.name = "key",
    value.name = c("duration", "RESP")
  )
# aggregate
prova_aggr <- prova_long[RESP != "n", .(duration_sum = sum(duration)), by = .(idx, RESP)]
# spread and join
prova[dcast(prova_aggr, idx ~ paste0("sum_", RESP)), c("sum_b", "sum_m") := .(sum_b, sum_m), on = "idx"]
prova
Result
# key_duration1 key_duration2 key_duration3 KeyPress1RESP KeyPress2RESP KeyPress3RESP idx sum_b sum_m
#1: 3483 364 3509 b n m 1 3483 3509
#2: 2367 818 3924 b n m 2 2367 3924
#3: 3775 1591 802 b m n 3 3775 1591
#4: 929 3059 744 n b n 4 3059 NA
#5: 3732 530 1769 b n m 5 3732 1769
#6: 3503 2011 2932 b n b 6 6435 NA
#7: 3684 1424 1688 b n m 7 3684 1688
The idea is to reshape your data to long format and aggregate the durations by "RESP" within each row, then spread the result and join it back to your initial data.

With tidyverse you can do:
bind_cols(df %>%
            select_at(vars(starts_with("KeyPress"))) %>%
            rowid_to_column() %>%
            gather(var, val, -rowid),
          df %>%
            select_at(vars(starts_with("key_"))) %>%
            rowid_to_column() %>%
            gather(var, val, -rowid)) %>%
  group_by(rowid) %>%
  summarise(b_values = sum(val1[val == "b"]),
            m_values = sum(val1[val == "m"])) %>%
  left_join(df %>%
              rowid_to_column(), by = c("rowid" = "rowid")) %>%
  ungroup() %>%
  select(-rowid)
b_values m_values key_duration1 key_duration2 key_duration3 KeyPress1RESP KeyPress2RESP KeyPress3RESP
<dbl> <dbl> <int> <int> <dbl> <fct> <fct> <fct>
1 3483. 3509. 3483 364 3509. b n m
2 2367. 3924. 2367 818 3924. b n m
3 3775. 1591. 3775 1591 802. b m n
4 3059. 0. 929 3059 744. n b n
5 3732. 1769. 3732 530 1769. b n m
6 6435. 0. 3503 2011 2932. b n b
7 3684. 1688. 3684 1424 1688. b n m
First, it splits the df into two: one with variables starting with "KeyPress" and one with variables starting with "key_". Second, it transforms the two dfs from wide to long format and combines them by columns. Third, it creates a summary of the "b" and "m" values according to row ID. Finally, it merges the results with the original df.
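For newer tidyverse releases (where gather() and select_at() are superseded), a roughly equivalent sketch uses pivot_longer() with a ".value" spec; this assumes tidyr >= 1.0 and the column names from the question:
library(dplyr)
library(tidyr)
library(tibble)
df %>%
  rowid_to_column() %>%
  # one duration column and one response column per key number
  pivot_longer(-rowid,
               names_to = c(".value", "key"),
               names_pattern = "(key_duration|KeyPress)(\\d)") %>%
  group_by(rowid) %>%
  summarise(b_values = sum(key_duration[KeyPress == "b"]),
            m_values = sum(key_duration[KeyPress == "m"])) %>%
  left_join(df %>% rowid_to_column(), by = "rowid") %>%
  select(-rowid)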

You can make a logical matrix from the KeyPress columns, multiply it by the key_duration subset and then take their rowSums.
prova$b_values <- rowSums((prova[, 4:6] == "b") * prova[, 1:3])
prova$n_values <- rowSums((prova[, 4:6] == "n") * prova[, 1:3])
key_duration1 key_duration2 key_duration3 KeyPress1RESP KeyPress2RESP KeyPress3RESP b_values n_values
18 3483 364 3509 b n m 3483 364
19 2367 818 3924 b n m 2367 818
20 3775 1591 802 b m n 3775 802
21 929 3059 744 n b n 3059 1673
22 3732 530 1769 b n m 3732 530
23 3503 2011 2932 b n b 6435 2011
24 3684 1424 1688 b n m 3684 1424
It works because the logical values are coerced to numeric 1s or 0s, and only the values for individual keys are retained.
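To see the coercion at work, the masking step can be inspected for the first row on its own (a quick illustrative check with the prova data above):
# the comparison gives a logical matrix: TRUE FALSE FALSE for row 18
prova[1, 4:6] == "b"
# multiplying coerces TRUE/FALSE to 1/0, so only the "b" duration (3483) survives
(prova[1, 4:6] == "b") * prova[1, 1:3]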
Extra: to generalise, you could instead use a function and tidyverse/purrr to map it:
get_sums <- function(key) rowSums((prova[, 4:6] == key) * prova[, 1:3])
keylist <- list(b_values = "b", n_values = "n", m_values = "m")
library(tidyverse)
bind_cols(prova, map_dfc(keylist, get_sums))
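map_dfc() binds the three per-key vectors as columns, with the names of keylist becoming the new column names, so the result lines up row-wise with prova for bind_cols().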

Related

Add values of columns based on condition of another variable in R

I want to create a variable that adds the values from other columns based on the condition of variable YEAR. That is, if variable YEAR = 2013, then add columns YR_2006, YR_2007, YR_2008, YR_2009, YR_2010 and YR_2011. So for group A the sum would be 12,793.
GROUP YEAR YR_2006 YR_2007 YR_2008 YR_2009 YR_2010 YR_2011
A     2013      NA     636    3653    4759    3745      NA
B     2019    1417    2176    3005    2045    2088    1849
C     2007    4218    3622    4651    4574    4122    4711
E     2017    5956    6031    6032    4885    5400    5828
Here is an option with apply and MARGIN = 1 to loop over the rows: get the index where the 'YEAR' matches the column names, build a sequence from the 2nd element up to that index, subset the values and get the sum.
df1$Sum <- apply(df1[-1], 1, function(x)
  sum(x[2:c(grep(as.character(x[1]), names(x)[-1]) + 1,
            length(x))[1]], na.rm = TRUE))
df1$Sum
#[1] 12793 12580 7840 28886
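To see how that index expression works, here it is unpacked for the 'C' row (an illustrative sketch; x stands for the row vector that apply passes to the anonymous function):
x <- unlist(df1[3, -1])   # named numeric vector: YEAR, YR_2006, ..., YR_2011
grep(as.character(x[1]), names(x)[-1])                        # 2: "2007" matches YR_2007
c(grep(as.character(x[1]), names(x)[-1]) + 1, length(x))[1]   # 3: upper bound of the index range
sum(x[2:3], na.rm = TRUE)                                     # 4218 + 3622 = 7840
# when YEAR matches no column name (e.g. 2013), grep() returns integer(0) and the
# fallback length(x) makes the whole row get summed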
Or we can use a vectorized option with rowSums, after replacing some of the elements in each row with NA based on matching the 'YEAR' column against the column names that startsWith 'YR_':
i1 <- startsWith(names(df1), "YR_")
i2 <- match(df1$YEAR, sub("YR_", "", names(df1)[i1]), nomatch = sum(i1))
rowSums(replace(df1[i1], col(df1[i1]) > i2[row(df1[i1])], NA), na.rm = TRUE)
#[1] 12793 12580 7840 28886
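To inspect the intermediate step, the replace() call can be printed on its own (illustrative only); for GROUP "C" (YEAR 2007) everything after YR_2007 is set to NA, so only 4218 and 3622 enter the row sum:
# per-row masking: columns past the matching YEAR are set to NA before rowSums
replace(df1[i1], col(df1[i1]) > i2[row(df1[i1])], NA)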
Or, using tidyverse, reshape to 'long' format with pivot_longer and then do a group_by sum after slicing the rows based on the match:
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(cols = starts_with("YR_"), values_drop_na = TRUE) %>%
  group_by(GROUP) %>%
  slice(seq(match(first(YEAR), readr::parse_number(name), nomatch = n()))) %>%
  summarise(Sum = sum(value)) %>%
  left_join(df1, .)
GROUP YEAR YR_2006 YR_2007 YR_2008 YR_2009 YR_2010 YR_2011 Sum
1 A 2013 NA 636 3653 4759 3745 NA 12793
2 B 2019 1417 2176 3005 2045 2088 1849 12580
3 C 2007 4218 3622 4651 4574 4122 4711 7840
4 E 2017 5956 6031 6032 4885 5400 582 28886
data
df1 <- structure(list(GROUP = c("A", "B", "C", "E"), YEAR = c(2013L,
2019L, 2007L, 2017L), YR_2006 = c(NA, 1417L, 4218L, 5956L), YR_2007 = c(636L,
2176L, 3622L, 6031L), YR_2008 = c(3653L, 3005L, 4651L, 6032L),
YR_2009 = c(4759L, 2045L, 4574L, 4885L), YR_2010 = c(3745L,
2088L, 4122L, 5400L), YR_2011 = c(NA, 1849L, 4711L, 582L)),
class = "data.frame", row.names = c(NA,
-4L))

Extraction of characters and symbols using R

I have a column with these kind of values
id count total SEXO EDAD IDENTIF_AFILIADO
1: 952815090_12_06_Q643 4 133.34 M 39 952815090
2: 952443257_10_17_C64 9 64.32 F 5 952443257
3: 931131767_9_10_C716 2 21.88 M 1 931131767
4: 931131767_8_13_C716 15 173.70 M 1 931131767
5: 931131767_1_09_C716 1 10.94 M 0 931131767
.....
The id column has a code after the third "_". For instance, the first row has "952815090_12_06_Q643",
and I need to extract the code Q643.
More specifically, I need the group of characters after the third "_" in every row. How can I do this in R?
Using regular expressions:
gsub("^.*_.*_.*_(.*)$", "\\1", id)
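Applied to the ids from the question (assuming they live in a column df$id, as in the dput further down), this gives:
gsub("^.*_.*_.*_(.*)$", "\\1", df$id)
#[1] "Q643" "C64"  "C716" "C716" "C716"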
This should do it:
your.ids <- sapply(dat$id, function(id) {
  strsplit(id, "_")[[1]][4]
})
Or if this is a data.table, perhaps something like this:
dat[, idstring := tstrsplit( id, "_", fixed=T )[4] ]
Applied to your code it looks like this:
library(data.table)
library(magrittr)  # provides the %>% pipe used below
dat <- read.table(text=
" id count total SEXO EDAD IDENTIF_AFILIADO
1: 952815090_12_06_Q643 4 133.34 M 39 952815090
2: 952443257_10_17_C64 9 64.32 F 5 952443257
3: 931131767_9_10_C716 2 21.88 M 1 931131767
4: 931131767_8_13_C716 15 173.70 M 1 931131767
5: 931131767_1_09_C716 1 10.94 M 0 931131767
") %>% as.data.table
dat[, idstring := tstrsplit( id, "_", fixed=T )[4] ]
print( dat )
Output:
id count total SEXO EDAD IDENTIF_AFILIADO idstring
1: 952815090_12_06_Q643 4 133.34 M 39 952815090 Q643
2: 952443257_10_17_C64 9 64.32 F 5 952443257 C64
3: 931131767_9_10_C716 2 21.88 M 1 931131767 C716
4: 931131767_8_13_C716 15 173.70 M 1 931131767 C716
5: 931131767_1_09_C716 1 10.94 M 0 931131767 C716
You can delete everything up to and including the last underscore.
sub('.*_', '', df$id)
#[1] "Q643" "C64" "C716" "C716" "C716"
data
df <- structure(list(id = c("952815090_12_06_Q643", "952443257_10_17_C64",
"931131767_9_10_C716", "931131767_8_13_C716", "931131767_1_09_C716"
), count = c(4L, 9L, 2L, 15L, 1L), total = c(133.34, 64.32, 21.88,
173.7, 10.94), SEXO = c("M", "F", "M", "M", "M"), EDAD = c(39L,
5L, 1L, 1L, 0L), IDENTIF_AFILIADO = c(952815090L, 952443257L,
931131767L, 931131767L, 931131767L)),
class = "data.frame", row.names = c(NA, -5L))

Subsetting rows based on multiple columns using data.table - fastest way

I was wondering if there was a more elegant, less clunky and faster way to do this. I have millions of rows with ICD coding for clinical data; a short example is provided below. I want to subset the dataset based on any of the columns meeting a specific set of diagnosis codes. The code below works but takes ages in R, so I was wondering if there is a faster way.
structure(list(eid = 1:10, mc1 = structure(c(4L, 3L, 5L, 2L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("345", "410", "413.9", "I20.1",
"I23.4"), class = "factor"), oc1 = c(350, 323, 12, 35, 413.1,
345, 345, 345, 345, 345), oc2 = structure(c(5L, 6L, 4L, 1L, 1L,
2L, 2L, 2L, 3L, 2L), .Label = c("", "345", "I20.3", "J23.6",
"K50.1", "K51.4"), class = "factor")), .Names = c("eid", "mc1",
"oc1", "oc2"), class = c("data.table", "data.frame"), row.names = c(NA,
-10L))
The code below subsets all rows that contain a code starting with either "I20" or "413" (this would include, for example, codes recorded as "I20.4" or "413.9").
dat2 <- dat [substr(dat$mc1,1,3)== "413"|
substr(dat$oc1,1,3)== "413"|
substr(dat$oc2,1,3)== "413"|
substr(dat$mc1,1,3)== "I20"|
substr(dat$oc1,1,3)== "I20"|
substr(dat$oc2,1,3)== "I20"]
Is there a faster way to do this? For example, can I loop through each of the columns looking for the specific codes "I20" or "413" and subset those rows?
We can specify the columns of interest in .SDcols, loop through the Subset of Data.table (.SD), get the first 3 characters with substr, check whether they are %chin% a vector of values, and Reduce the results to a single logical vector for subsetting the rows.
dat[dat[,Reduce(`|`, lapply(.SD, function(x)
substr(x, 1, 3) %chin% c('413', 'I20'))), .SDcols = 2:4]]
# eid mc1 oc1 oc2
#1: 1 I20.1 350.0 K50.1
#2: 2 413.9 323.0 K51.4
#3: 5 345 413.1
#4: 9 345 345.0 I20.3
For larger data it could help if we don't check all rows:
minem <- function(dt, colsID = 2:4) {
  cols <- colnames(dt)[colsID]
  x <- c('413', 'I20')
  set(dt, j = "inn", value = F)
  for (i in cols) {
    dt[inn == F, inn := substr(get(i), 1, 3) %chin% x]
  }
  dt[inn == T][, inn := NULL][]
}
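A hedged usage sketch on the small example from the question (copy() keeps the original untouched, since set() and := modify by reference; dat is assumed to be the data.table from the dput above):
minem(copy(dat))
# returns the same four rows as the first approach (eid 1, 2, 5 and 9)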
n <- 1e7
set.seed(13)
dts <- dat  # dts is assumed to be the example data.table from the question
dt <- dts[sample(.N, n, replace = T)]
dt <- cbind(dt, dts[sample(.N, n, replace = T), 2:4])
setnames(dt, make.names(colnames(dt), unique = T))
dt
# eid mc1 oc1 oc2 mc1.1 oc1.1 oc2.1
# 1: 8 345 345.0 345 345 345 345
# 2: 3 I23.4 12.0 J23.6 413.9 323 K51.4
# 3: 4 410 35.0 413.9 323 K51.4
# 4: 1 I20.1 350.0 K50.1 I23.4 12 J23.6
# 5: 10 345 345.0 345 345 345 345
# ---
# 9999996: 3 I23.4 12.0 J23.6 I20.1 350 K50.1
# 9999997: 5 345 413.1 I20.1 350 K50.1
# 9999998: 4 410 35.0 345 345 345
# 9999999: 4 410 35.0 410 35
# 10000000: 10 345 345.0 345 345 345 I20.3
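The akrun() call in the timings below refers to the first approach; for the benchmark to be reproducible it would be wrapped as a function along these lines (my sketch, not part of the original code):
akrun <- function(dt, colsID = 2:4) {
  dt[dt[, Reduce(`|`, lapply(.SD, function(x)
    substr(x, 1, 3) %chin% c('413', 'I20'))), .SDcols = colsID]]
}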
system.time(r1 <- akrun(dt, 2:ncol(dt))) # 22.88 sec
system.time(r2 <- minem(dt, 2:ncol(dt))) # 17.72 sec
all.equal(r1, r2)
# [1] TRUE

New data.table columnS based on grouping and function of multiple columns

Let's say I have a data.frame
sample_df = structure(list(AE = c(148, 1789, 1223, 260, 1825, 37, 1442, 484,
10, 163, 1834, 254, 445, 837, 721, 1904, 1261, 382, 139, 213),
FW = structure(c(1L, 3L, 2L, 3L, 3L, 1L, 2L, 3L, 2L, 2L,
3L, 2L, 3L, 2L, 1L, 3L, 1L, 1L, 1L, 3L), .Label = c("LYLR",
"OCXG", "BIYX"), class = "factor"), CP = c("WYB/NXO", "HUK/NXO",
"HUK/WYB", "HUK/NXO", "WYB/NXO", "HUK/WYB", "HUK/NXO", "HUK/NXO",
"WYB/NXO", "HUK/NXO", "WYB/NXO", "HUK/NXO", "HUK/WYB", "WYB/NXO",
"HUK/WYB", "WYB/NXO", "WYB/NXO", "HUK/WYB", "WYB/NXO", "WYB/NXO"
), SD = c(1, 1, -1, 1, 1, 1, 1, -1, 1, 1, -1, -1, 1, -1,
-1, 1, -1, 1, 1, 1)), .Names = c("AE", "FW", "CP", "SD"), row.names = c(NA, -20L), class = "data.frame")
Or in human readable format:
AE FW CP SD
1 148 LYLR WYB/NXO 1
2 1789 BIYX HUK/NXO 1
3 1223 OCXG HUK/WYB -1
4 260 BIYX HUK/NXO 1
5 1825 BIYX WYB/NXO 1
6 37 LYLR HUK/WYB 1
7 1442 OCXG HUK/NXO 1
8 484 BIYX HUK/NXO -1
9 10 OCXG WYB/NXO 1
10 163 OCXG HUK/NXO 1
11 1834 BIYX WYB/NXO -1
12 254 OCXG HUK/NXO -1
13 445 BIYX HUK/WYB 1
14 837 OCXG WYB/NXO -1
15 721 LYLR HUK/WYB -1
16 1904 BIYX WYB/NXO 1
17 1261 LYLR WYB/NXO -1
18 382 LYLR HUK/WYB 1
19 139 LYLR WYB/NXO 1
20 213 BIYX WYB/NXO 1
Now suppose that, for each unique value (fw, cp) of (FW, CP), I would like to get:
the sum of all values of AE for which (FW, CP) = (fw, cp)
the mean of all values of SD for which (FW, CP) = (fw, cp)
In R, one could do something like:
unique_keys <- unique(sample_df[,c('FW','CP')])
slow_version <- function(ind, sample_df, unique_keys){
index <- which(sample_df$FW == unique_keys$FW[ind] & sample_df$CP == unique_keys$CP[ind])
c(ind = ind,
sum_ae = sum(sample_df$AE[index]),
min_ae = mean(sample_df$SD[index]))
}
intermed_result <- t(sapply(1:nrow(unique_keys), slow_version,
sample_df = sample_df,
unique_keys = unique_keys))
colnames(intermed_result) <- c('ind','sum','mean')
result <- data.frame(unique_keys[intermed_result[, 'ind'], ],
'sum' = intermed_result[,'sum'],
'mean' = intermed_result[,'mean'])
but this gets pretty slow as the size of sample_df grows.
Thanks to this answer, I suspect it is possible to use data.table magic to get the same result much faster. But doing:
library(data.table)
sample_dt = data.table(sample_df)
setkey(sample_dt, FW, CP)
f <- function(AE, SD) {list('sum' = sum(AE), 'mean' = mean(SD))}
sample_dt[,c("col1","col2"):=f(AE, SD), by=.(FW, CP)][]
does not yield the desired result. What is the correct way?
I would try:
library(data.table)
sample_dt = data.table(sample_df)
setkey(sample_dt, FW, CP)
f <- function(AE, SD) {list('sum' = sum(AE), 'mean' = mean(SD))}
sample_dt[, f(AE, SD), by=.(FW, CP)]
# FW CP sum mean
# 1: LYLR HUK/WYB 1140 0.3333333
# 2: LYLR WYB/NXO 1548 0.3333333
# 3: OCXG HUK/NXO 1859 0.3333333
# 4: OCXG HUK/WYB 1223 -1.0000000
# 5: OCXG WYB/NXO 847 0.0000000
# 6: BIYX HUK/NXO 2533 0.3333333
# 7: BIYX HUK/WYB 445 1.0000000
# 8: BIYX WYB/NXO 5776 0.5000000
You didn't get the desired output because you assigned the resulting sum and mean columns by group back to the original data.table with :=. However, I also prefer the syntax suggested by Frank, which should be the right way to go. For our current named-list approach, adding verbose = T gives:
Making each group and running j (GForce FALSE) ... The result of j is
a named list. It's very inefficient to create the same names over and
over again for each group. When j=list(...), any names are detected,
removed and put back after grouping has completed, for efficiency.
Using j=transform(), for example, prevents that speedup (consider
changing to :=). This message may be upgraded to warning in future.
When we have many groups and the functions in j are basic functions like mean and sum, using
sample_dt[, .(sum.AE = sum(AE), mean.SD = mean(SD)), by = .(FW, CP)]
would be very fast, because those functions are replaced internally with GForce functions like gmean. See ?GForce and Frank's benchmark for more information.
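You can check that GForce kicks in by adding verbose = TRUE to the call (a quick check; the exact message wording depends on the data.table version):
sample_dt[, .(sum.AE = sum(AE), mean.SD = mean(SD)), by = .(FW, CP), verbose = TRUE]
# the verbose output reports something like "GForce optimized j to ..." with gsum/gmean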

Combining (pasting) columns

I have the following data.frame
Tipo Start End Strand Accesion1 Accesion2
1 gene 197 1558 + <NA> SP_0001
2 CDS 197 1558 + NP_344554 <NA>
3 gene 1717 2853 + <NA> SP_0002
4 CDS 1717 2853 + NP_344555 <NA>
5 gene 2864 3112 + <NA> SP_0003
6 CDS 2864 3112 + NP_344556 <NA>
There are more "Tipo" values, such as tRNA, region, exon or rRNA, but I am only interested in combining these two: gene and CDS.
And I would like to get the following
Start End Accesion1 Accesion2
1 197 1558 NP_344554 SP_0001
but only when the Start and End values of gene and CDS coincide. I've tried to use select, arrange and mutate with dplyr, but it is sort of complicated for me to get rid of the NAs.
A dplyr version with summarize_each:
DF %>%
group_by(Start, End) %>%
summarise_each(funs(max), Accesion1, Accesion2)
Produces:
Source: local data frame [3 x 4]
Groups: Start
Start End Accesion1 Accesion2
1 197 1558 NP_344554 SP_0001
2 1717 2853 NP_344555 SP_0002
3 2864 3112 NP_344556 SP_0003
This assumes the Accesion variables are character (it does not work with factors), and that each Start/End pair contains exactly two rows, one gene and one CDS, as in your data set.
You could try
library(data.table)
setDT(df1)[, id:=cumsum(Tipo == 'gene')][,
list(Accesion1=na.omit(Accesion1), Accesion2=na.omit(Accesion2)) ,
list(id, Start, End)]
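For the example data this should give one row per gene/CDS pair, roughly (a sketch, assuming df1 holds the question's data frame):
#    id Start  End Accesion1 Accesion2
# 1:  1   197 1558 NP_344554   SP_0001
# 2:  2  1717 2853 NP_344555   SP_0002
# 3:  3  2864 3112 NP_344556   SP_0003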
Here's a solution using aggregate():
df <- data.frame(Tipo=c('gene','CDS','gene','CDS','gene','CDS'), Start=c(197,197,1717,1717,2864,2864), End=c(1558,1558,2853,2853,3112,3112), Strand=c('+','+','+','+','+','+'), Accesion1=c(NA,'NP_344554',NA,'NP_344555',NA,'NP_344556'), Accesion2=c('SP_0001',NA,'SP_0002',NA,'SP_0003',NA) );
df2 <- df[df$Tipo%in%c('gene','CDS'),c('Start','End','Accesion1','Accesion2')];
aggregate(df2[,c('Accesion1','Accesion2')], df2[,c('Start','End')], function(x) x[!is.na(x)] );
## Start End Accesion1 Accesion2
## 1 197 1558 NP_344554 SP_0001
## 2 1717 2853 NP_344555 SP_0002
## 3 2864 3112 NP_344556 SP_0003
Precomputing df2 is necessary in case there are non-gene non-CDS rows in the original data.frame; in order to properly aggregate just the gene and CDS rows, the non-gene non-CDS rows must be excluded from both x and by. (Of course, your example data has only gene and CDS rows, so it's not technically necessary for the example data.)
This solution makes the assumption that whenever two rows have the same Start and End values, then they must be gene/CDS pairs (as opposed to gene/gene or CDS/CDS).
Here is one potential way. You choose rows with gene and CDS. Then you group your data by Start and End. There may be Start/End groups with 1 or 3+ rows, so you want to make sure that you choose Start/End groups with exactly two rows. In addition, you want to make sure that you have both gene and CDS (length(unique(Tipo)) == 2). Finally, you take the non-NA element in Accesion1 and Accesion2.
filter(df, Tipo %in% c("gene", "CDS")) %>%
group_by(Start, End) %>%
filter(n() == 2 & length(unique(Tipo)) == 2) %>%
summarise(Accesion1 = Accesion1[!is.na(Accesion1)],
Accesion2 = Accesion2[!is.na(Accesion2)])
Here is a made-up example; note that the last Start/End group contains two gene rows, so it is dropped by the filter.
mydf <- structure(list(Tipo = structure(c(2L, 1L, 2L, 1L, 2L, 2L), .Label = c("CDS",
"gene"), class = "factor"), Start = c(197, 197, 1717, 1717, 2864,
2864), End = c(1558, 1558, 2853, 2853, 3112, 3112), Strand = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "+", class = "factor"), Accesion1 = structure(c(NA,
1L, NA, 2L, NA, 3L), .Label = c("NP_344554", "NP_344555", "NP_344556"
), class = "factor"), Accesion2 = structure(c(1L, NA, 2L, NA,
3L, NA), .Label = c("SP_0001", "SP_0002", "SP_0003"), class = "factor")), .Names = c("Tipo",
"Start", "End", "Strand", "Accesion1", "Accesion2"), row.names = c(NA,
-6L), class = "data.frame")
Tipo Start End Strand Accesion1 Accesion2
1 gene 197 1558 + <NA> SP_0001
2 CDS 197 1558 + NP_344554 <NA>
3 gene 1717 2853 + <NA> SP_0002
4 CDS 1717 2853 + NP_344555 <NA>
5 gene 2864 3112 + <NA> SP_0003
6 gene 2864 3112 + NP_344556 <NA>
filter(mydf, Tipo %in% c("gene", "CDS")) %>%
group_by(Start, End) %>%
filter(n() == 2 & length(unique(Tipo)) == 2) %>%
summarise(Accesion1 = Accesion1[!is.na(Accesion1)],
Accesion2 = Accesion2[!is.na(Accesion2)])
# Start End Accesion1 Accesion2
#1 197 1558 NP_344554 SP_0001
#2 1717 2853 NP_344555 SP_0002
