Subset of a table that contains at least one element of another table - r

I have two tables of bp intervals: Table1 has large intervals and Table2 has short intervals (just 2 bp). I want to make a new table that contains only the Table1 ranges that contain at least one element of Table2 within their "large" ranges. If no element of Table2 falls inside a Table1 range, that range should not be included.
In this example, row 2 (1, 600, 1500) of Table1 (df) should not be included:
df <- "Chromosome start end
1 1 450
1 600 1500
2 3500 3585
2 7850 10000"
df <- read.table(text=df, header=T)
Table2 (df2)
df2 <- "Chromosome start end
1 5 6
1 598 599
2 3580 3581
2 7851 7852
2 7859 7860"
df2 <- read.table(text=df2, header=T)
NewTable (dfout):
dfout <- "Chromosome start end
1 1 450
2 3500 3585
2 7850 10000"
dfout <- read.table(text=dfout, header=T)

Try foverlaps from data.table (using df and df2 as defined in the question):
library(data.table)
setkey(setDT(df), Chromosome, start, end)
setkey(setDT(df2), Chromosome, start, end)
setnames(unique(foverlaps(df, df2, nomatch=0)[, c(1, 4:5), with=FALSE]), names(df))[]
# Chromosome start end
#1: 1 1 450
#2: 2 3500 3585
#3: 2 7850 10000
Or, as @Arun commented, we can use which=TRUE (to extract the indices) and subset df using the yid column.
df[unique(foverlaps(df2, df, nomatch=0L, which=TRUE)$yid)]
# Chromosome start end
#1: 1 1 450
#2: 2 3500 3585
#3: 2 7850 10000

This merge-based approach also seems to solve your problem:
ranges <- merge(df,df2,by="Chromosome",suffixes=c("A","B"))
ranges <- ranges[with(ranges, startA <= startB & endA >= endB),]
ranges <- ranges[,1:3]
dfout <- unique(ranges)
dfout
# Chromosome startA endA
# 1 1 450
# 2 3500 3585
# 2 7850 10000
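For completeness, here is a plain base R sketch (no packages; it assumes the question's df and df2 and checks containment row by row):
# keep a df row if at least one df2 interval on the same chromosome
# lies entirely inside [start, end]
keep <- mapply(function(chr, s, e) {
  any(df2$Chromosome == chr & df2$start >= s & df2$end <= e)
}, df$Chromosome, df$start, df$end)
dfout <- df[keep, ]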

Related

creating a variable based on other factors using R [duplicate]

This question is a duplicate of "How to sum a variable by group".
My data looks like this:
hh_id  indl  ind_salary  hh_income
1      1     200         NA
1      2     450         NA
1      3     0           NA
2      4     1232        NA
2      5     423         NA
Individuals with the same hh_id live in the same household, so they have the same household income: hh_income should equal the sum of the salaries of all persons with the same hh_id.
So my data would look like:
hh_id  indl  ind_salary  hh_income
1      1     200         650
1      2     450         650
1      3     0           650
2      4     1232        1655
2      5     423         1655
Any ideas, please?
Using dplyr:
data %>% group_by(hh_id) %>% mutate(hh_income = sum(ind_salary))
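If you will keep working with the result, it is usually safer to drop the grouping afterwards; a usage note, not part of the original answer:
data %>%
  group_by(hh_id) %>%
  mutate(hh_income = sum(ind_salary)) %>%
  ungroup() # remove the grouping so later operations work on the whole table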
You can use the base R function ave to generate the sum of ind_salary grouped by hh_id; it returns a vector of the same length as ind_salary:
> df$hh_income <- ave(df$ind_salary, df$hh_id, FUN=sum)
> df
hh_id indl ind_salary hh_income
1 1 1 200 650
2 1 2 450 650
3 1 3 0 650
4 2 4 1232 1655
5 2 5 423 1655
Using only base R:
hh_id <- c(1, 1 ,1, 2, 2)
indl <- c(1, 2, 3, 4, 5)
ind_salary <- c(200, 450, 0, 1232, 423)
hh_df <- data.frame(hh_id, indl, ind_salary)
hh_income <- tapply(hh_df$ind_salary, hh_df$hh_id, sum)
hh_income <- as.data.frame(hh_income)
hh_income$hh_id <- rownames(hh_income)
hh_df <- merge(hh_df, hh_income, by = 'hh_id')
View(hh_df)
Just to add more explanation to KacZdr's answer, which would have helped me immensely as a beginner. This is also more in line with standard tidyverse pipe style.
new_data <- data %>% # create a new dataset so the original is left unaltered
  group_by(hh_id) %>% # group by the variable whose duplicated values define the groups
  mutate(income = sum(ind_salary)) # new column "income" holding the sum of ind_salary within each hh_id; this is what the question calls hh_income
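For larger data, a data.table one-liner does the same grouped sum by reference; a sketch, assuming the hh_df data frame as first built in the base R answer above:
library(data.table)
setDT(hh_df)[, hh_income := sum(ind_salary), by = hh_id][] # trailing [] prints the result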

RStudio: Summation of two datasets of different lengths

I have two datasets A and B:
Dataset A (called Sales) has the following data:
ID Person Sales
1 1 100
2 2 300
3 3 400
4 4 200
5 5 50
Dataset B (called Account_Scenarios) has the following data (note: there are many more rows in dataset B; I have just included the first 6):
ID Scenario Person Upkeep
1 1 1 -10
2 1 2 -200
3 2 1 -150
4 3 4 -50
5 3 3 -100
6 4 5 -500
I want to add a column called 'Profit' in dataset B such that I am able to see the profit per person per scenario (Profit = Sales + Upkeep). For example as below:
ID Scenario Person Upkeep Profit
1 1 1 -10 90
2 1 2 -200 100
3 2 1 -150 -50
4 3 4 -50 150
5 3 3 -100 300
6 4 5 -500 -450
What is the best way to do this? I am new to R and tried to use an aggregate function, but it requires the arguments to be the same length:
Account_Scenarios$Profit <- aggregate(Sales[,c('Sales')], Account_Scenarios[,c('Upkeep')], by=list(Sales$Person), 'sum')
Assuming that Sales$Person contains only unique values, you can:
Account_Scenarios$Profit <- Account_Scenarios$Upkeep + Sales$Sales[sapply(Account_Scenarios$Person, function(x) which(Sales$Person == x))]
(Note the +: Upkeep is already negative, and Profit = Sales + Upkeep.)
I would left_join the two datasets on the Person variable, then calculate the profit:
library(tidyverse)
A <- A %>% select(Person, Sales) # only the two variables needed for the join
df <- left_join(B, A, by = "Person") %>%
  mutate(Profit = Sales + Upkeep)
Another solution uses the sqldf library (a SQL-style join):
library(sqldf)
A <- data.frame(Person=1:5, Sales=c(100,300,400,200,50))
B <- data.frame(Scenario=c(1,1,2,3,3,4), Person=c(1,2,1,4,3,5), Upkeep=c(-10,-200,-150,-50,-100,-500))
B <- sqldf("SELECT B.*, A.Sales + B.Upkeep as Profit FROM B JOIN A on B.Person = A.Person")
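For completeness, base R can do the same lookup without packages: match() replaces the sapply/which loop from the earlier answer (a sketch, using the A and B data frames defined just above):
# match() returns, for each person in B, the row index of that person in A
B$Profit <- B$Upkeep + A$Sales[match(B$Person, A$Person)]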

Replace value in data frame with value from other data frame based on set of conditions

In df1 I need to replace values for msec with corresponding values in df2.
df1 <- data.frame(ID=c('rs', 'rs', 'rs', 'tr','tr','tr'), cond=c(1,1,2,1,1,2),
block=c(2,2,4,2,2,4), correct=c(1,0,1,1,1,0), msec=c(456,678,756,654,625,645))
df2 <- data.frame(ID=c('rs', 'rs', 'tr','tr'), cond=c(1,2,1,2),
block=c(2,4,2,4), mean=c(545,664,703,765))
In df1, if correct==0, then reference df2 with the matching values of ID, cond, and block. Replace the value for msec in df1 with the corresponding value for mean in df2.
For example, the second row in df1 has correct==0. So, in df2 find the corresponding row where ID=='rs', cond==1, block==2 and use the value for mean (mean=545) to replace the value for msec (msec=678). Note that in df1 combinations of ID, block, and cond can repeat, but each combination occurs only once in df2.
Using the data.table package:
# load the 'data.table' package
library(data.table)
# convert the data.frame's to data.table's
setDT(df1)
setDT(df2)
# update df1 by reference with a join with df2
df1[df2[, correct := 0], on = .(ID, cond, block, correct), msec := i.mean]
which gives:
> df1
ID cond block correct msec
1: rs 1 2 1 456
2: rs 1 2 0 545
3: rs 2 4 1 756
4: tr 1 2 1 654
5: tr 1 2 1 625
6: tr 2 4 0 765
Note: The above code updates df1 by reference instead of creating a new data frame, which is more memory-efficient. Adding correct := 0 to df2 makes correct part of the join condition, so only the df1 rows with correct == 0 can match and be updated.
One option would be to use base R with an interaction() and a match(). How about:
idx <- which(df1$correct == 0)
df1[idx, "msec"] <- df2[match(interaction(df1[idx, c("ID", "cond", "block")]),
                              interaction(df2[, c("ID", "cond", "block")])), "mean"]
df1
# ID cond block correct msec
#1 rs 1 2 1 456
#2 rs 1 2 0 545
#3 rs 2 4 1 756
#4 tr 1 2 1 654
#5 tr 1 2 1 625
#6 tr 2 4 0 765
We overwrite msec in the rows where correct == 0 with the matching values from df2$mean.
Edit: Another option would be a SQL merge; this could look like:
library(sqldf)
merged <- sqldf('SELECT l.ID, l.cond, l.block, l.correct,
case when l.correct == 0 then r.mean else l.msec end as msec
FROM df1 as l
LEFT JOIN df2 as r
ON l.ID = r.ID AND l.cond = r.cond AND l.block = r.block')
merged
ID cond block correct msec
1 rs 1 2 1 456
2 rs 1 2 0 545
3 rs 2 4 1 756
4 tr 1 2 1 654
5 tr 1 2 1 625
6 tr 2 4 0 765
With dplyr: this solution left_joins the two data frames on all shared columns and mutates msec where correct is 0.
library(dplyr)
left_join(df1, df2) %>%
  mutate(msec = ifelse(correct == 0, mean, msec)) %>%
  select(-mean)
ID cond block correct msec
1 rs 1 2 1 456
2 rs 1 2 0 545
3 rs 2 4 1 756
4 tr 1 2 1 654
5 tr 1 2 1 625
6 tr 2 4 0 765
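A variant of the data.table answer that leaves df2 untouched, restricting the lookup to the correct == 0 rows explicitly; a sketch under the same example data, not one of the original answers:
library(data.table)
setDT(df1); setDT(df2)
idx <- which(df1$correct == 0) # rows of df1 that need replacing
# join just those rows to df2 on the three keys and pull df2's mean (x.mean)
df1[idx, msec := df2[df1[idx], on = .(ID, cond, block), x.mean]]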

select rows in one data frame that partially match rows in another data frame

I wish to select rows in one data frame, data.1, that partially match rows in a second data frame, keep.these, to obtain the desired.result. I have found several questions here that match based on one column, but I want to match on three columns: STATE, COUNTY and CITY. I have come up with three solutions so far, but none seem ideal.
Note that each row contains a unique combination of STATE, COUNTY and CITY in my real data.
When I use merge I must re-order. The function match seems to work, but I am not familiar with it and do not know if my use of this function is as intended. The apply solution below is clearly too complex.
The merge approach would be ideal if I did not have to reorder the result. Reordering can be time consuming with large data sets. The match approach seems okay if someone can confirm this is a reasonable approach.
Is there a better solution, ideally in base R?
data.1 <- read.table(text= "
CITY COUNTY STATE AA
1 1 1 2
2 1 1 4
1 2 1 6
2 2 1 8
1 1 2 20
2 1 2 40
1 2 2 60
2 2 2 80
1 1 3 200
2 1 3 400
1 2 3 600
2 2 3 800
1 1 4 2000
2 1 4 4000
1 2 4 6000
2 2 4 8000
1 1 5 20000
2 1 5 40000
1 2 5 60000
2 2 5 80000
", header=TRUE, na.strings=NA)
keep.these <- read.table(text= "
CITY COUNTY STATE BB
1 1 2 -10
2 1 2 -11
1 2 2 -12
2 2 2 -13
1 1 4 -14
2 1 4 -15
1 2 4 -16
2 2 4 -17
", header=TRUE, na.strings=NA)
desired.result <- read.table(text= "
CITY COUNTY STATE AA
1 1 2 20
2 1 2 40
1 2 2 60
2 2 2 80
1 1 4 2000
2 1 4 4000
1 2 4 6000
2 2 4 8000
", header=TRUE, na.strings=NA)
##########
# this works, but I need to reorder
new.data.a <- merge(keep.these[,1:3], data.1, by=c('CITY', 'COUNTY', 'STATE'))
new.data.a <- new.data.a[order(new.data.a$STATE, new.data.a$COUNTY, new.data.a$CITY),]
rownames(desired.result) <- NULL
rownames(new.data.a) <- NULL
all.equal(desired.result, new.data.a)
##########
# this seems to work, but match is unfamiliar
new.data.2 <- data.1[match(data.1$CITY , keep.these$CITY , nomatch=0) &
match(data.1$STATE , keep.these$STATE , nomatch=0) &
match(data.1$COUNTY, keep.these$COUNTY, nomatch=0),]
rownames(desired.result) <- NULL
rownames(new.data.2) <- NULL
all.equal(desired.result, new.data.2)
##########
# this works, but is too complex
data.1b <- data.frame(my.group = apply( data.1[,1:3], 1, paste, collapse = "."), data.1)
keep.these.b <- data.frame(my.group = apply(keep.these[,1:3], 1, paste, collapse = "."), keep.these)
data.1b <- data.1b[apply(data.1b, 1, function(x) {x[1] %in% keep.these.b$my.group}),]
data.1b <- data.1b[,-1]
rownames(desired.result) <- NULL
rownames(data.1b) <- NULL
all.equal(desired.result, data.1b)
##########
Here is a generic solution for this type of problem which is very efficient:
data.1.ID <- paste(data.1[,1],data.1[,2],data.1[,3])
keep.these.ID <- paste(keep.these[,1],keep.these[,2],keep.these[,3])
desired.result <- data.1[data.1.ID %in% keep.these.ID,]
I have simply created a unique ID for each record and then searched for it. Unlike the match() attempt in the question, which tests each column separately, the pasted ID matches the complete CITY/COUNTY/STATE combination, so a row cannot pass by matching different rows of keep.these column by column.
Note: This will change the row names, and you may want to add the following:
row.names(desired.result) <- 1:nrow(desired.result)
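One caveat with pasted IDs, not raised in the original answer: with the default space separator, different combinations can collide, e.g. paste("a", "b c") and paste("a b", "c") both give "a b c". A separator that cannot occur in the data avoids this:
data.1.ID <- paste(data.1[,1], data.1[,2], data.1[,3], sep = "\x1f") # ASCII unit separator
keep.these.ID <- paste(keep.these[,1], keep.these[,2], keep.these[,3], sep = "\x1f")
desired.result <- data.1[data.1.ID %in% keep.these.ID, ]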
EDIT:
Here is another way to solve the same problem.
If you have a very large data set, say millions of rows, another very efficient option is the data.table package. It can run around 50-100 times faster than merge, depending on how much data you have.
All you have to do is the following:
library(data.table)
Step1: Convert data.frame to data.table, with first three columns as keys.
d1 <- data.table(data.1, key=names(data.1)[1:3])
kt <- data.table(keep.these, key=names(keep.these)[1:3])
Step2: A merge using data.table's binary search:
d1[kt]
Note1: The simplicity of execution.
Note2: This will sort the data by key. To avoid that, carry along a row index:
data.1$index <- 1:nrow(data.1) # Add index to original data
d1 <- data.table(data.1,key=names(data.1)[1:3]) # Step1 as above
kt <- data.table(keep.these,key=names(keep.these)[1:3]) # Step1 as above
d1[kt][order(index)] # Step2 as above
If you want to remove the last two columns (index, BB), that's straightforward too:
d1[kt][order(index)][, -(5:6), with = FALSE] # remove index and BB
Try this with large data sets and compare the timing with merge.
To learn more about data.table, try:
vignette("datatable-intro")
vignette("datatable-faq")
vignette("datatable-timings")
Or see it in action:
example(data.table)
Hope this helps!!
I'm not sure how this compares in speed to reordering, but you can simply pass sort = FALSE to merge so that it does not re-sort the result.
new.data.a <- merge(keep.these[,1:3], data.1, by=c('CITY', 'COUNTY', 'STATE'), sort = FALSE)
rownames(desired.result) <- NULL
rownames(new.data.a) <- NULL
all.equal(desired.result, new.data.a)
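If packages are acceptable after all, dplyr's semi_join() expresses "keep rows of x that have a match in y" directly and preserves the row order of data.1, so no reordering is needed (a sketch, not from the original answers):
library(dplyr)
new.data <- semi_join(data.1, keep.these, by = c("CITY", "COUNTY", "STATE"))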

Counting Records Based on Unique Date

I have a data frame with a column of dates (some dates have multiple records) and a numeric column. I want a data frame that lists each date once, together with the sum of the numbers for that date and the number of records for that date.
Starting frame:
SomeDate SomeNum
10/1/2013 2
10/1/2013 3
10/2/2013 5
10/3/2013 4
10/3/2013 1
10/3/2013 1
I can get the sum of SomeNum per unique date with the following:
newDF<-unique(within(df, {
SumOfSomeNums <- ave(SomeNum, SomeDate, FUN = sum)
}))
But I can't figure out how to get a count of the number of times each unique SomeDate occurs.
I want:
SomeDate SumOfSomeNums CountOfSomeDate
10/1/2013 5 2
10/2/2013 5 1
10/3/2013 6 3
What would get me the CountOfSomeDate data?
Thx
Continuing with your approach, use length as your aggregation function:
unique(within(mydf, {
SumOfSomeNums <- ave(SomeNum, SomeDate, FUN = sum)
CountOfSomeDate <- ave(SomeNum, SomeDate, FUN = length) # count rows per date; using the numeric column keeps the result numeric
rm(SomeNum)
}))
# SomeDate CountOfSomeDate SumOfSomeNums
# 1 10/1/2013 2 5
# 3 10/2/2013 1 5
# 4 10/3/2013 3 6
However, there are many alternative ways to get here.
Here's an aggregate approach:
do.call(data.frame, aggregate(SomeNum ~ SomeDate, mydf, function(x) c(sum(x), length(x))))
# SomeDate SomeNum.1 SomeNum.2
# 1 10/1/2013 5 2
# 2 10/2/2013 5 1
# 3 10/3/2013 6 3
And a data.table approach:
library(data.table)
DT <- data.table(mydf)
DT[, list(Count = length(SomeNum), Sum = sum(SomeNum)), by = SomeDate]
# SomeDate Count Sum
# 1: 10/1/2013 2 5
# 2: 10/2/2013 1 5
# 3: 10/3/2013 3 6
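On current data.table versions the same aggregation is usually written with .N, the built-in per-group row count; a minor modernization of the answer above, reusing its DT:
DT[, .(Sum = sum(SomeNum), Count = .N), by = SomeDate]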
