R: comparing 2 dfs to sum data between values

I have 2 dataframes in R, one with start (column 1) and end (column 2) coordinates...
df1
2500 3499
3500 4499
4500 5499
5500 6499
And one with point coordinates (column 1) and associated values (column 2)...
df2
2657 17
2895 33
3875 12
4448 42
5122 3
5633 65
5781 12
I would like to find a vectorized approach to sum the values from df2 column 2 where the df2 column 1 coordinates fall between the start and end coordinates in df1. With this data the result should look like this...
df3
2500 3499 50
3500 4499 54
4500 5499 3
5500 6499 77
The dfs contain 100,000+ rows. I can achieve this easily using loops, but as we are in R that is slow and not the best approach.
What is the best way to do this? A flexible solution that can be adapted to functions other than simply summing the data would also be good to know.

Here's a possible data.table::foverlaps solution. As you haven't specified column names, I'm assuming that they are called V1 and V2 in both data sets.
Solution
library(data.table)
setDT(df1)[, `:=`(start = V1, end = V2)]
setDT(df2)[, `:=`(start = V1, end = V1)]
setkey(df1, start, end)
foverlaps(df2, df1)[, list(SumV2 = sum(i.V2)), by = list(V1, V2)]
# V1 V2 SumV2
# 1: 2500 3499 50
# 2: 3500 4499 54
# 3: 4500 5499 3
# 4: 5500 6499 77
Explanation
Here we converted both data sets to data.table objects and specified the start/end values to overlap on. Then, we keyed the data set that we want to join against. Finally, we ran the foverlaps function and aggregated the matched values of V2 from df2 by the desired columns in df1.
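As for adapting this to other functions: the j expression is where the aggregation happens, so any summary function can be dropped in there. A small sketch under the same setup as above (not part of the original solution), returning the mean value and the number of matching points per interval:
# assumes df1/df2 have been prepared and keyed exactly as in the Solution block
foverlaps(df2, df1)[, list(MeanV2 = mean(i.V2), Npoints = .N), by = list(V1, V2)]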
Data
df1 <- structure(list(V1 = c(2500L, 3500L, 4500L, 5500L), V2 = c(3499L,
4499L, 5499L, 6499L)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(V1 = c(2657L, 2895L, 3875L, 4448L, 5122L, 5633L,
5781L), V2 = c(17L, 33L, 12L, 42L, 3L, 65L, 12L)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -7L))

Related

How to match two column values in df1 and extract corresponding values in R

Table 1:
Pos  Samples
129  ERR5678
460  ERR7890
568  ERR7689
Table 2:
Pos  ERR5678  ERR7890  ERR7689
129  67890    76879    67894
460  56782    123478   678390
568  78926    890765   345678
Result Table
Pos  Samples  Dp_value
129  ERR5678  67890
460  ERR7890  123478
568  ERR7689  345678
Table 1 contains the list of positions and their corresponding samples, and the other table contains the position and depth values for each sample. Using R, I read the two tables into data.table objects and then used: df1[(df1$Pos %in% df2$Pos),]
That extracted the positions, but please could someone tell me how to match both Pos and Samples in df2 to get the result table.
Reshape the second dataset to 'long' format and do an inner_join:
library(dplyr)
library(tidyr)
df2 %>%
pivot_longer(cols = starts_with("ERR"),
names_to = "Samples", values_to = "Dp_value") %>%
inner_join(df1)
Output
# A tibble: 3 × 3
Pos Samples Dp_value
<int> <chr> <int>
1 129 ERR5678 67890
2 460 ERR7890 123478
3 568 ERR7689 345678
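By default inner_join(df1) matches on every column the two tables share (here Pos and Samples, once df2 is in long form). If you prefer to spell the key out explicitly, a small variant (not part of the original answer, same libraries loaded as above):
df2 %>%
  pivot_longer(cols = starts_with("ERR"),
               names_to = "Samples", values_to = "Dp_value") %>%
  inner_join(df1, by = c("Pos", "Samples"))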
data
df1 <- structure(list(Pos = c(129L, 460L, 568L), Samples = c("ERR5678",
"ERR7890", "ERR7689")), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(Pos = c(129L, 460L, 568L), ERR5678 = c(67890L,
56782L, 78926L), ERR7890 = c(76879L, 123478L, 890765L), ERR7689 = c(67894L,
678390L, 345678L)), class = "data.frame", row.names = c(NA, -3L
))

How to merge two dataframes based on range value of one table

DF1
SIC Value
350 100
460 500
140 200
290 400
506 450
DF2
SIC1 AREA
100-200 Forest
201-280 Hospital
281-350 Education
351-450 Government
451-550 Land
Note: SIC1 is a character column, so it needs to be converted to a numeric range.
I am trying to get output like the one below.
Desired output:
DF3
SIC Value AREA
350 100 Education
460 500 Land
140 200 Forest
290 400 Education
506 450 Land
I first tried to convert SIC1 from character to numeric and then tried to merge, but had no luck. Can someone guide me on this?
One option is to use tidyr::separate along with sqldf to join both tables on a range of values.
library(sqldf)
library(tidyr)
DF2 <- separate(DF2, "SIC1",c("Start","End"), sep = "-")
sqldf("select DF1.*, DF2.AREA from DF1, DF2
WHERE DF1.SIC between DF2.Start AND DF2.End")
# SIC Value AREA
# 1 350 100 Education
# 2 460 500 Land
# 3 140 200 Forest
# 4 290 400 Education
# 5 506 450 Land
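Since the question notes that SIC1 is character, adding convert = TRUE makes separate() return the new Start/End columns as integers rather than text (a small tweak, not in the original answer, that avoids relying on SQLite's text-to-number coercion):
DF2 <- separate(DF2, "SIC1", c("Start", "End"), sep = "-", convert = TRUE)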
Data:
DF1 <- read.table(text =
"SIC Value
350 100
460 500
140 200
290 400
506 450",
header = TRUE, stringsAsFactors = FALSE)
DF2 <- read.table(text =
"SIC1 AREA
100-200 Forest
201-280 Hospital
281-350 Education
351-450 Government
451-550 Land",
header = TRUE, stringsAsFactors = FALSE)
We could do a non-equi join. Split (tstrsplit) the 'SIC1' column in 'DF2' to numeric columns and then do a non-equi join with the first dataset.
library(data.table)
setDT(DF2)[, c('start', 'end') := tstrsplit(SIC1, '-', type.convert = TRUE)]
DF2[, -1, with = FALSE][DF1, on = .(start <= SIC, end >= SIC),
mult = 'last'][, .(SIC = start, Value, AREA)]
# SIC Value AREA
#1: 350 100 Education
#2: 460 500 Land
#3: 140 200 Forest
#4: 290 400 Education
#5: 506 450 Land
Or, as @Frank mentioned, we can do a rolling join to extract the 'AREA' and update it on the first dataset (roll = TRUE carries the AREA of the closest preceding start value forward to each SIC):
setDT(DF1)[, AREA := DF2[DF1, on=.(start = SIC), roll=TRUE, x.AREA]]
data
DF1 <- structure(list(SIC = c(350L, 460L, 140L, 290L, 506L), Value = c(100L,
500L, 200L, 400L, 450L)), .Names = c("SIC", "Value"),
class = "data.frame", row.names = c(NA, -5L))
DF2 <- structure(list(SIC1 = c("100-200", "201-280", "281-350", "351-450",
"451-550"), AREA = c("Forest", "Hospital", "Education", "Government",
"Land")), .Names = c("SIC1", "AREA"), class = "data.frame",
row.names = c(NA, -5L))

Subset multiple observations by ID then select first observation by time [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(9 answers)
Closed 6 years ago.
I have a large dataset of observations, with several observations in rows and several different variables for each ID.
e.g.
Data
ID V1 V2 V3 time
1 35 100 5.2 2015-07-03 07:49
2 25 111 6.2 2015-04-01 11:52
3 41 120 NA 2015-04-01 14:17
1 25 NA NA 2015-07-03 07:51
2 NA 122 6.2 2015-04-01 11:50
3 40 110 4.1 2015-04-01 14:25
I would like to extract the earliest (first) observation for each variable independently based on the time column, for each unique ID. i.e. I would like to combine multiple rows of the same ID together so that I have one row of the first observation for each variable (time variable will not be equal for all).
The min() function will return the earliest time for a set of observations, but the problem is I need to do this for each variable. To do this I have tried using the tapply function with minimum time
tapply(Data, ID, min(time)
but get an error saying:
Error in match.fun(FUN) :
'min(Data$time)' is not a function, character or symbol
I suspect that there is also a problem because many of the rows of observations have missing data.
Alternatively I have tried to just do each variable one at a time using aggregate, and select the min(time) this way:
firstV1 <-aggregate(V1[min(time)]~ID, data=Data, na.rm=T)
From the example dataset, what I would like to see is:
Data
ID V1 V2 V3
1 35 100 5.2
2 25 122 6.2
3 41 120 4.1
Note the '25' for ID2 V1 was from the later observation because the first observation was missing. Same for ID3 V3.
Input data
structure(list(ID = c(1L, 2L, 3L, 1L, 2L, 3L), V1 = c(35L, 25L,
41L, 25L, NA, 40L), V2 = c(100L, 111L, 120L, NA, 122L, 110L),
V3 = c(5.2, 6.2, 4.2, NA, 6.2, 4.1), time = structure(c(1435906140,
1427885520, 1427894220, 1435906260, 1427885400, 1427894700
), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("ID",
"V1", "V2", "V3", "time"), row.names = c(NA, -6L), class = "data.frame")
This should do what you need.
library(data.table)
Data <- rbind(cbind(1,35,100,5.2,"2015-07-03 07:49"),
cbind(2,25,111,6.2,"2015-04-01 11:52"),
cbind(3,41,120,4.2,"2015-04-01 14:17"),
cbind(1,25,NA,NA,"2015-07-03 07:51"),
cbind(2,NA,122,6.2,"2015-04-01 11:50"),
cbind(3,40,110,4.1,"2015-04-01 14:25"))
colnames(Data) <- c("ID","V1","V2","V3","time")
Data <- data.table(Data)
class(Data[,time])
Data[,time:=as.POSIXct(time)]
minTime.Data <- Data[,lapply(.SD, function(x) x[time==min(time)]),by=ID]
minTime.Data
The outcome will be
ID V1 V2 V3 time
1: 1 35 100 5.2 2015-07-03 07:49:00
2: 2 NA 122 6.2 2015-04-01 11:50:00
3: 3 41 120 4.2 2015-04-01 14:17:00
Let me know if this is what you were looking for, because there is a little ambiguity in your question.
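If the intended reading is instead the per-variable one — for each ID, take the first non-missing value of each variable in time order — a data.table sketch based on the question's dput data (my interpretation, not part of the answer above) could look like this:
library(data.table)
setDT(Data)               # Data as defined in the question's structure() call
setorder(Data, ID, time)  # order rows by ID, then by time
Data[, lapply(.SD, function(x) x[!is.na(x)][1]),
     by = ID, .SDcols = c("V1", "V2", "V3")]
#    ID V1  V2  V3
# 1:  1 35 100 5.2
# 2:  2 25 122 6.2
# 3:  3 41 120 4.2
Note that in the dput data ID 3's earliest V3 is 4.2 (not NA as in the printed example), so that one cell differs from the desired table.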

Combining (pasting) columns

I have the following data.frame
Tipo Start End Strand Accesion1 Accesion2
1 gene 197 1558 + <NA> SP_0001
2 CDS 197 1558 + NP_344554 <NA>
3 gene 1717 2853 + <NA> SP_0002
4 CDS 1717 2853 + NP_344555 <NA>
5 gene 2864 3112 + <NA> SP_0003
6 CDS 2864 3112 + NP_344556 <NA>
There are more "Tipo" values, such as tRNA, region, exon, or rRNA, but I am only interested in combining these two: gene and CDS.
And I would like to get the following
Start End Accesion1 Accesion2
1 197 1558 NP_344554 SP_0001
but only when the Start and End values of gene and CDS coincide. I've tried to use select, arrange, and mutate with dplyr, but it is complicated for me to get rid of the NAs.
A dplyr version with summarize_each:
DF %>%
group_by(Start, End) %>%
summarise_each(funs(max), Accesion1, Accesion2)
Produces:
Source: local data frame [3 x 4]
Groups: Start
Start End Accesion1 Accesion2
1 197 1558 NP_344554 SP_0001
2 1717 2853 NP_344555 SP_0002
3 2864 3112 NP_344556 SP_0003
Assumes the AccesionX variables are character (this does not work with factor), and that each Start/End pair contains exactly two rows, one gene and one CDS, as in your data set.
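summarise_each() and funs() have since been superseded in dplyr; a rough modern equivalent under the same character-column assumption (my adaptation, not part of the original answer) would be:
library(dplyr)
DF %>%
  group_by(Start, End) %>%
  summarise(across(c(Accesion1, Accesion2), ~ max(.x, na.rm = TRUE)),
            .groups = "drop")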
You could try creating a group id with cumsum() (each 'gene' row starts a new gene/CDS pair) and collapsing the non-NA Accesion values within each group:
library(data.table)
setDT(df1)[, id:=cumsum(Tipo == 'gene')][,
list(Accesion1=na.omit(Accesion1), Accesion2=na.omit(Accesion2)) ,
list(id, Start, End)]
Here's a solution using aggregate():
df <- data.frame(Tipo=c('gene','CDS','gene','CDS','gene','CDS'), Start=c(197,197,1717,1717,2864,2864), End=c(1558,1558,2853,2853,3112,3112), Strand=c('+','+','+','+','+','+'), Accesion1=c(NA,'NP_344554',NA,'NP_344555',NA,'NP_344556'), Accesion2=c('SP_0001',NA,'SP_0002',NA,'SP_0003',NA) );
df2 <- df[df$Tipo%in%c('gene','CDS'),c('Start','End','Accesion1','Accesion2')];
aggregate(df2[,c('Accesion1','Accesion2')], df2[,c('Start','End')], function(x) x[!is.na(x)] );
## Start End Accesion1 Accesion2
## 1 197 1558 NP_344554 SP_0001
## 2 1717 2853 NP_344555 SP_0002
## 3 2864 3112 NP_344556 SP_0003
Precomputing df2 is necessary in case there are non-gene non-CDS rows in the original data.frame; in order to properly aggregate just the gene and CDS rows, the non-gene non-CDS rows must be excluded from both x and by. (Of course, your example data has only gene and CDS rows, so it's not technically necessary for the example data.)
This solution makes the assumption that whenever two rows have the same Start and End values, then they must be gene/CDS pairs (as opposed to gene/gene or CDS/CDS).
Here is one potential way. First, choose the rows with gene and CDS. Then, group the data by Start and End. There may be Start/End groups with 1 or 3+ rows, so make sure you keep only the Start/End groups with exactly two rows, and that those rows contain both gene and CDS (length(unique(Tipo)) == 2). Finally, take the non-NA element in Accesion1 and Accesion2.
library(dplyr)
filter(df, Tipo %in% c("gene", "CDS")) %>%
group_by(Start, End) %>%
filter(n() == 2 & length(unique(Tipo)) == 2) %>%
summarise(Accesion1 = Accesion1[!is.na(Accesion1)],
Accesion2 = Accesion2[!is.na(Accesion2)])
Here is a pseudo example.
mydf <- structure(list(Tipo = structure(c(2L, 1L, 2L, 1L, 2L, 2L), .Label = c("CDS",
"gene"), class = "factor"), Start = c(197, 197, 1717, 1717, 2864,
2864), End = c(1558, 1558, 2853, 2853, 3112, 3112), Strand = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "+", class = "factor"), Accesion1 = structure(c(NA,
1L, NA, 2L, NA, 3L), .Label = c("NP_344554", "NP_344555", "NP_344556"
), class = "factor"), Accesion2 = structure(c(1L, NA, 2L, NA,
3L, NA), .Label = c("SP_0001", "SP_0002", "SP_0003"), class = "factor")), .Names = c("Tipo",
"Start", "End", "Strand", "Accesion1", "Accesion2"), row.names = c(NA,
-6L), class = "data.frame")
Tipo Start End Strand Accesion1 Accesion2
1 gene 197 1558 + <NA> SP_0001
2 CDS 197 1558 + NP_344554 <NA>
3 gene 1717 2853 + <NA> SP_0002
4 CDS 1717 2853 + NP_344555 <NA>
5 gene 2864 3112 + <NA> SP_0003
6 gene 2864 3112 + NP_344556 <NA>
filter(mydf, Tipo %in% c("gene", "CDS")) %>%
group_by(Start, End) %>%
filter(n() == 2 & length(unique(Tipo)) == 2) %>%
summarise(Accesion1 = Accesion1[!is.na(Accesion1)],
Accesion2 = Accesion2[!is.na(Accesion2)])
# Start End Accesion1 Accesion2
#1 197 1558 NP_344554 SP_0001
#2 1717 2853 NP_344555 SP_0002

How to merge specific rows that match a grep pattern

I have a dataframe as follows:
Jen Rptname freq
AKT bilb1 23
AKT bilb1 234
DFF bilb22 987
DFF bilf34 7
DFF jhs23 623
AKT j45 53
JFG jhs98 65
I know how to group the whole dataframe based on individual columns, but how do I merge individual rows based on a grep (in this case bilb.* and jhs.*)?
I want to be able to merge the rows (and therefore also add the frequencies together) with bilb* and separately the rows with jhs* so that I end up with
AKT bilb 257
DFF bilb 987
DFF bilf34 7
DFF jhs 623
AKT j45 53
JFG jhs 65
This is so that the aggregation is by Jen and Rptname, so I can see how many of the same Rptnames are in each Jen.
We can use grep to get the index of 'Rptname' elements that have 'bilb' or 'jhs', remove the numeric part with sub, and use aggregate to get the sum of 'freq' by 'Rptname'.
indx <- grep('bilb|jhs', df1$Rptname)
df1$Rptname[indx] <- sub('\\d+', '', df1$Rptname[indx])
aggregate(freq~Rptname, df1, FUN=sum)
# Rptname freq
#1 bilb 1244
#2 bilf34 7
#3 j45 53
#4 jhs 688
Update
Suppose your dataset is 'df2'
df2$grp <- gsub("([A-Z]+|[a-z]+)[^A-Z]+", "\\1", df2$Rptname)
aggregate(freq~grp+Jen, df2, FUN=sum)
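The regex keeps only the leading run of letters (upper- or lower-case) and drops what follows, so for example (a quick check, not from the original answer):
gsub("([A-Z]+|[a-z]+)[^A-Z]+", "\\1", c("bilb22", "BTBy", "jhs98"))
# [1] "bilb" "BTB"  "jhs"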
data
df1 <- structure(list(Rptname = c("bilb1", "bilb1", "bilb22",
"bilf34",
"jhs23", "j45", "jhs98"), freq = c(23L, 234L, 987L, 7L, 623L,
53L, 65L)), .Names = c("Rptname", "freq"), class = "data.frame",
row.names = c(NA, -7L))
df2 <- structure(list(Jen = c("AKT", "AKT", "AKT", "DFF", "DFF",
"DFF",
"DFF", "DFF", "DFF", "AKT", "JFG", "JFG", "JFG"), Rptname = c("bilb1",
"bilb1", "bilb22", "bilb22", "bilb1", "BTBy", "bilf34", "BTBx",
"jhs23", "j45", "jhs98", "BTBfd", "BTBx"), freq = c(23L, 234L,
22L, 987L, 18L, 18L, 7L, 9L, 623L, 53L, 65L, 19L, 14L)),
.Names = c("Jen",
"Rptname", "freq"), class = "data.frame", row.names = c(NA, -13L))
Similar to akrun's and I like his use of aggregate better than my creation of an intermediate vector:
> inter <- tapply(dat$freq, sub("^(bilb|jhs)(.+)$", "\\1", dat$Rptname) ,sum)
> final <- data.frame( nams = names(inter), sums = inter)
> final
nams sums
bilb bilb 1244
bilf34 bilf34 7
j45 j45 53
jhs jhs 688
My pattern would require that the 'bilb' and 'jhs' be at the beginning of the value. Remove the "^" if that was not intended, but if so, add a "(.*)" and switch to "\\2" in the replacement.
