Create dataframe containing only matching data from 2 dataframes in R - r

I've seen several posts on similar topics, but I can't seem to make them work for my needs. I have 2 data frames, df1 and df2. df1 is quite large, df2 is small.
df1
Chr start end Count
1 0 50 20
1 51 100 40
2 0 50 100
2 51 100 30
2 101 150 7
df2
Chr coord Name
1 25 X
2 75 Y
What I would like is to return only the rows where Chr matches exactly (df1$Chr == df2$Chr) and where df2$coord falls within the range of df1 start and end (df2$coord >= df1$start & df2$coord <= df1$end).
The end result (ideally) should look like this:
Chr start end Count coord Name
1 0 50 20 25 X
2 51 100 30 75 Y
I know this is probably a basic problem but any help would be greatly appreciated.

The question linked by thelatemail, "Comparing multiple columns in different data sets to find values within range R", gives the solution, although that question is somewhat muddled and unclear.
This question is a duplicate of it, but it is clearer and much more readable.
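# merge on the shared column (Chr), pairing each df2 row with every df1 row on that Chr
x <- merge(df1, df2)
# keep only the pairs where coord falls inside [start, end]
with(x, x[coord >= start & coord <= end,])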
## Chr start end Count coord Name
## 1 1 0 50 20 25 X
## 4 2 51 100 30 75 Y
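For anyone who wants to run the answer end to end, here is a minimal self-contained sketch. The data frames are rebuilt by hand from the tables in the question, and `res` is just an illustrative name. Since merge() pairs every df2 row with every df1 row on the same Chr before filtering, the intermediate table can get large when df1 is big, so treat this as a starting point rather than a tuned solution.

```r
# Example data copied from the question
df1 <- data.frame(Chr   = c(1, 1, 2, 2, 2),
                  start = c(0, 51, 0, 51, 101),
                  end   = c(50, 100, 50, 100, 150),
                  Count = c(20, 40, 100, 30, 7))
df2 <- data.frame(Chr   = c(1, 2),
                  coord = c(25, 75),
                  Name  = c("X", "Y"))

# Join on the shared column: each df2 row is paired with
# every df1 row that has the same Chr
x <- merge(df1, df2, by = "Chr")

# Keep only the pairs where coord lies inside [start, end]
res <- x[x$coord >= x$start & x$coord <= x$end, ]
res
```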

Related

r - lapply divides a column by an integer value from different dataset, unexpected result

I have two data.frames, one with genotype counts and one with a number that I need to normalize my counts from the first dataset.
countsdata=data.frame(genotype1=rep(c(10,20,30,40),each=1),
genotype2=rep(c(100,200,300,400),each=1),
genotype3=rep(c(40,50,60,70),each=1),
genotype4=rep(c(40,50,60,70),each=1)
)
coldata = data.frame(Group =c('genotype1', 'genotype2', 'genotype3', 'genotype4'),
Treatment = rep(c("control","treated"),each = 2),
Norm=rep(c(1,2,5,5)))
I made sure my variables don't have factors
factorsCharacter <- function(d) modifyList(d, lapply(d[, sapply(d, is.factor)],
as.character))
coldata=factorsCharacter(coldata)
Then I see that lapply loops through my counts, one column at a time, and through my coldata, which contains the normalization value (Norm). Everything looks good until I combine the two actions in the same step:
> lapply(coldata['Group'],function(group_i){group_i})
$Group
[1] "genotype1" "genotype2" "genotype3" "genotype4"
> lapply(coldata['Group'],function(group_i){countsdata[,group_i]})
$Group
genotype1 genotype2 genotype3 genotype4
1 10 100 40 40
2 20 200 50 50
3 30 300 60 60
4 40 400 70 70
> lapply(coldata['Group'],function(group_i){as.integer(coldata[coldata$Group==group_i,'Norm'])})
$Group
[1] 1 2 5 5
> lapply(coldata['Group'],function(group_i){
+ countsdata[,group_i]/as.integer(coldata[coldata$Group==group_i,'Norm'])
+ })
$Group
genotype1 genotype2 genotype3 genotype4
1 10 100 40 40
2 10 100 25 25
3 6 60 12 12
4 8 80 14 14
Here the result is not what I was expecting (each column divided by its normalization number). After further inspection I noticed that it's normalizing by rows; in other words, it's normalizing across different columns, which shouldn't be the case as I am looping through one column at a time. I am probably missing a basic concept, but looking through other SO posts I didn't find anything I could use. My goal is to fix the code so it makes the right calculation, but I would also like to understand why the code above is not working. Thanks so much.
The problem is in using [ and not [[. So, instead of looping through each of the elements in 'Group' column, we have a list of length 1 with all the elements. So, either use coldata[, 'Group'] or coldata[['Group']] or coldata$Group for looping.
countsdataNew <- countsdata
countsdataNew[] <- lapply(coldata[['Group']],function(group_i)
countsdata[,group_i]/coldata$Norm[coldata$Group==group_i])
countsdataNew
# genotype1 genotype2 genotype3 genotype4
#1 10 50 8 8
#2 20 100 10 10
#3 30 150 12 12
#4 40 200 14 14
If the column names in 'countsdata' and the 'Group' column from 'coldata' are in the same order, we can do this easily with Map
Map(`/`, countsdata, coldata$Norm)
Or just replicate the 'Norm' and do a simple division
countsdata/coldata$Norm[col(countsdata)]
Or with sweep
sweep(countsdata, 2, coldata$Norm, "/")
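To see why the original call normalized by rows, it may help to compare what lapply actually iterates over in each case. The sketch below rebuilds the question's data (without the Treatment column, which is not needed here); `by_row` and `by_col` are illustrative names only.

```r
coldata <- data.frame(Group = c("genotype1", "genotype2", "genotype3", "genotype4"),
                      Norm  = c(1, 2, 5, 5),
                      stringsAsFactors = FALSE)
countsdata <- data.frame(genotype1 = c(10, 20, 30, 40),
                         genotype2 = c(100, 200, 300, 400),
                         genotype3 = c(40, 50, 60, 70),
                         genotype4 = c(40, 50, 60, 70))

# `[` returns a one-column data frame, so lapply() sees a single element
length(coldata["Group"])
# `[[` returns the character vector, so lapply() sees four elements
length(coldata[["Group"]])

# With `[`, group_i holds all four names at once, so the whole data frame
# is divided by c(1, 2, 5, 5); R recycles that vector down each column,
# which means row i is divided by Norm[i]
by_row <- countsdata / coldata$Norm

# Dividing column j by Norm[j] gives the intended result
by_col <- sweep(countsdata, 2, coldata$Norm, "/")
```

The row-wise behaviour only looks plausible here because the data happens to have four rows and four columns; with a different number of rows the recycling would line up differently.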

Counting Attempts of an event in R

I'm relatively new in R and learning. I have the following data frame = data
ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016
I am looking to count the number of people (in this case only two unique individuals) who passed their tests after multiple attempts (passing is defined as 65 or over). So the final product would return a list of unique IDs that needed multiple attempts before their test score hit 65. This would tell me that approx. 66% of the clients in this data frame require multiple test sessions before getting a passing grade.
Below is my idea or concept more or less, I've framed it as an if statement
If ID appears twice
count how often it appears, until TEST GRADE >= 65
ifelse(duplicated(data$ID), count(ID), NA)
I'm struggling with the second piece where I want to say, count the occurrence of ID until grade >=65.
The other option I see is some sort of loop. Below is my attempt
for (i in data$ID) {
duplicated(data$ID)
count(data$ID)
# here is where something would say "until grade >= 65"
}
Again the struggle comes in how to tell R to stop counting when grade hits 65.
Appreciate the help!
You can use data.table:
library(data.table)
dt <- fread(" ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016")
# count the number of attempts per ID, then keep only the rows with a passing grade
dt <- dt[, N:=.N, by=ID][grade>=65]
# proportion of passing test takers who needed more than one attempt
length(dt[N>1]$ID)/length(dt$ID)
[1] 0.6666667
Another option, though the other two work just fine:
library(dplyr)
dat2 <- dat %>%
group_by(ID) %>%
summarize(
multiattempts = n() > 1 & any(grade < 65),
maxgrade = max(grade)
)
dat2
# Source: local data frame [3 x 3]
# ID multiattempts maxgrade
# <int> <lgl> <int>
# 1 1 TRUE 73
# 2 2 TRUE 76
# 3 3 FALSE 66
sum(dat2$multiattempts) / nrow(dat2)
# [1] 0.6666667
Here is a method using the aggregate function and subsetting that returns the maximum score for testers who took the test more than once, starting from their second test.
multiTestMax <- aggregate(grade~ID, data=df[duplicated(df$ID),], FUN=max)
multiTestMax
ID grade
1 1 73
2 2 76
To get the number of rows, you can use nrow:
nrow(multiTestMax)
2
or the proportion of all test takers
nrow(multiTestMax) / length(unique(df$ID))
data
df <- read.table(header=T, text="ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016")
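The question also asks how to count attempts per ID until the grade first reaches 65. A minimal base R sketch of that idea follows; it assumes the rows are already in chronological order within each ID, and `attempts` is an illustrative name, not something from the answers above.

```r
df <- data.frame(ID    = c(1, 1, 1, 2, 2, 3),
                 grade = c(56, 63, 73, 41, 76, 66))

# For each ID, the position of the first grade >= 65
# (NA if that ID never reached a passing grade)
attempts <- sapply(split(df$grade, df$ID),
                   function(g) which(g >= 65)[1])
attempts

# Share of passing IDs that needed more than one attempt
mean(attempts > 1, na.rm = TRUE)
```

If the data is not sorted, ordering by ID and a properly parsed Test_Date first would be needed; note that one date in the sample (02-31-2016) is not a valid calendar date.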

Creating a subset of unique entries for a recursive list in R

I have the following data set df
name draught nav_status date
A 22 0 24/12/2014
A 22 0 25/12/2014
A 11 5 26/12/2014
A 11 1 27/12/2014
B 22 0 24/12/2014
B 22 0 25/12/2014
B 22 0 26/12/2014
B 22 5 27/12/2014
B 9 0 28/12/2014
B 22 0 29/12/2014
from this data set, I need to extract the unique draught values for each object of the list.
I am fairly new to R and have made the following attempts
y <- subset(df,!duplicated(df[,draught]),)
and
Dup <- function(x){
x <- x[!duplicated[x$draught],]
y <- lapply(df, Dup)
But this deletes the draught entries for the entire data. I went through some literature on split-apply-combine techniques and also tried those options.
Please provide some guidance, literature so as to solve this problem.
The result should be
name draught nav_status date
A 22 0 24/12/2014
A 11 5 26/12/2014
A 11 1 27/12/2014
B 22 0 25/12/2014
B 9 0 28/12/2014
I even tried to subset the data based on first and last entries by arranging them sequentially and deleting the duplicate entries, but there was loss of data. Thank you!!
Using data.table library you can arrive at the result by:
library(data.table)
dt <- as.data.table(df)
unique(dt, by = c('name', 'draught'))
One thing though: why do you have two entries for the pair A 11 in your desired result?
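If data.table is not available, the same de-duplication can be sketched in base R with duplicated() on the two key columns. The data frame below is typed in from the question, and `res` is an illustrative name. Like the data.table version, this keeps the first row for each name/draught pair, so it returns a single A 11 row rather than the two shown in the desired output.

```r
df <- data.frame(
  name       = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  draught    = c(22, 22, 11, 11, 22, 22, 22, 22, 9, 22),
  nav_status = c(0, 0, 5, 1, 0, 0, 0, 5, 0, 0),
  date       = c("24/12/2014", "25/12/2014", "26/12/2014", "27/12/2014",
                 "24/12/2014", "25/12/2014", "26/12/2014", "27/12/2014",
                 "28/12/2014", "29/12/2014"),
  stringsAsFactors = FALSE
)

# Keep the first row seen for each (name, draught) combination
res <- df[!duplicated(df[c("name", "draught")]), ]
res
```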

Print column values to rows of new data frame only if it matches second dataframe range in R

I have two data frames and I'd like to print from a column in df1 to separate rows of a new df/matrix if values match a range around df2. Please see example below.
df1
Chr Coord Value
1 25 10
1 75 20
1 125 15
1 175 30
2 25 16
2 75 25
2 125 50
2 175 100
2 225 150
df2
Chr Coord
1 75
2 125
What I need is if:
(df1$Chr == df2$Chr & df1$Coord <= df2$Coord + 50 & df1$Coord >= df2$Coord - 50)
then print
df1$Value to its own row of a new data frame or matrix.
Final output that I need is:
df3
10 20 15
25 50 100
Any help would be greatly appreciated.
I may be wrong but it looks like you're working with genetic ranges (assuming Chr = Chromosome). If so, you should look at the GenomicRanges package from Bioconductor. It provides classes for representing ranged data with biological metadata and includes methods for subsetting one set of ranges based on their overlap with another set of ranges.
First you need to convert your data.frames to GRanges objects:
library(GenomicRanges)
gr1 <- GRanges(seqnames = df1$Chr,
IRanges(start = df1$Coord, width = 1),
Value = df1$Value)
# df2 has no Value column, so gr2 carries only the positions
gr2 <- GRanges(seqnames = df2$Chr,
IRanges(start = df2$Coord, width = 1))
Then use subsetByOverlaps() with the maxgap argument to indicate we're looking for ranges within 50:
df3 <- subsetByOverlaps(gr1, gr2, maxgap = 50)
Extract the Value column of df3:
df3$Value
Final output:
## 10 20 15 25 50 100
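If installing Bioconductor is not an option, the same selection can be approximated in base R. This is only a sketch under two assumptions that hold for the sample data: df2 has one row per Chr, and every target has the same number of hits (otherwise rbind() would recycle values with a warning, and keeping the list from split() would be safer). The `.df1`/`.df2` suffixes and the names `m` and `hits` are illustrative.

```r
df1 <- data.frame(Chr   = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
                  Coord = c(25, 75, 125, 175, 25, 75, 125, 175, 225),
                  Value = c(10, 20, 15, 30, 16, 25, 50, 100, 150))
df2 <- data.frame(Chr = c(1, 2), Coord = c(75, 125))

# Pair every df1 row with the df2 row on the same chromosome
m <- merge(df1, df2, by = "Chr", suffixes = c(".df1", ".df2"))

# Keep the df1 positions within 50 of the df2 position
hits <- m[abs(m$Coord.df1 - m$Coord.df2) <= 50, ]
hits <- hits[order(hits$Chr, hits$Coord.df1), ]

# One row of values per chromosome
df3 <- do.call(rbind, split(hits$Value, hits$Chr))
df3
```

Unlike the flat vector returned by df3$Value in the GRanges version, this gives the two-row layout the question asked for.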

Row aggregation when values are close enough in a column

I have a dataframe with 2 columns
time x
1306247226 5
1306247236 10
1306248127 20
1306248187 36
1306249248 28
1306249258 24
1306249259 20
...
I'd like to aggregate the rows whose values in the 'time' column are close enough (e.g. let's say their difference is less than 60) and sum their 'x' values in the aggregated row. The 'time' value in the aggregated row will be that of the first row of the group. ('time' is a Unix timestamp.)
The goal is to have as output of this example:
time x
1306247226 15
1306248127 20
1306248187 36
1306249248 72
...
The dataset is quite big, a 'for' loop will take a long time... but if it is the only option I can deal with it and wait.
Any idea?
Thanks a lot!
You can use something like this :
First I create a new column for aggregation
dat$gg <- cumsum(c(0,diff(dat$time)) > 60)
Then I use the plyr package to aggregate by that grouping column
library(plyr)
ddply(dat,.(gg),summarise,time = head(time,1),res = sum(x))
gg time res
1 0 1306247226 15
2 1 1306248127 56
3 2 1306249248 72
Edit after comment
The OP wanted rows grouped only when their difference is less than 60, so a gap of exactly 60 must also start a new group. So I need to change the > to >=
dat$gg <- cumsum(c(0,diff(dat$time)) >= 60)
ddply(dat,.(gg),summarise,time = head(time,1),res = sum(x))
gg time res
1 0 1306247226 15
2 1 1306248127 20
3 2 1306248187 36
4 3 1306249248 72
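The same grouping trick also works without plyr. Below is a minimal base R sketch using tapply(); the data frame is typed in from the question's sample (the rows elided by "..." are not included), and `res` is an illustrative name.

```r
dat <- data.frame(
  time = c(1306247226, 1306247236, 1306248127, 1306248187,
           1306249248, 1306249258, 1306249259),
  x    = c(5, 10, 20, 36, 28, 24, 20)
)

# A new group starts whenever the gap to the previous row is 60 or more;
# cumsum() turns those break points into a running group id
dat$gg <- cumsum(c(0, diff(dat$time)) >= 60)

# First timestamp and summed x per group
res <- data.frame(
  time = as.vector(tapply(dat$time, dat$gg, function(t) t[1])),
  x    = as.vector(tapply(dat$x, dat$gg, sum))
)
res
```

Both versions compare each row only with the row directly before it and assume the data is sorted by time, so a long chain of closely spaced rows ends up in one group even if its first and last timestamps are more than 60 apart.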
