Calculated field based on Ranges in a second data frame in R - r

I have found similar posts regarding this task, but all of which have a common ID joining the two tables.
I have one data frame which contains sale records (sales_df). For this example I have simplified the data table so that it contains only 5 records. I would like to create a new column in the sales_df that calculates what the fee would be given a sale price amount as defined in the fee table (pricing_fees). Please note that the number of actual pricing fee ranges that I have to account for are around 30, so writing this into a mutate statement is something that I would like to try and avoid.
The two data frames are coded as follows
sales_df <- data.frame(invoice_id = 1:5,
sale_price = c(100, 275, 350, 500, 675))
pricing_fees <- data.frame(min_range = c(0, 50, 100, 200, 300, 400, 500), # >=
max_range = c(50, 100, 200, 300, 400, 500, 1000), # <
buyer_fee = c(1, 1, 25, 50, 75, 110, 125))
In the end I would like the resulting sales_df to look something like this.
invoice_id sale_price buyer_fee
1 1 100 25
2 2 275 50
3 3 350 75
4 4 500 125
5 5 675 125
Thanks in advance

You can use findInterval function which is supposed to be efficient in splitting values over ranges (since it uses binary search) :
# build consecutive increasing ranges of fees
# (in order to use findInterval, since it works on ranges defined in a single vector)
pricing_fees <- pricing_fees[order(pricing_fees$min_range),]
consecFees <- data.frame(ranges=c(pricing_fees$min_range[1], pricing_fees$max_range),
fees=c(pricing_fees$buyer_fee,NA))
# consecFees now is :
#
# ranges fees
# 1 0 1 ---> it means for price in [0,50) -> 1
# 2 50 1 ---> it means for price in [50,100) -> 1
# 3 100 25 ---> it means for price in [100,200) -> 25
# 4 200 50 ... and so on
# 5 300 75
# 6 400 110
# 7 500 125
# 8 1000 NA ---> NA because for values >= 1000 we set NA
# add the column to sales_df using findInterval
sales_df$buyer_fee <- consecFees$fees[findInterval(sales_df$sale_price,consecFees$ranges)]
Result :
> sales_df
invoice_id sale_price buyer_fee
1 1 100 25
2 2 275 50
3 3 350 75
4 4 500 125
5 5 675 125

You can also use cut to "bin" sales_df$sale_price values and label bins with corresponding buyer_fee values.
# Make pricing_fee table with unique buyer_fee
brks <- do.call(rbind, by(pricing_fees, pricing_fees$buyer_fee, FUN = function(x)
data.frame(min_range = min(x$min_range), max_range = max(x$max_range), buyer_fee = unique(x$buyer_fee))))
sales_df$buyer_fee = as.numeric(as.character(cut(
sales_df$sale_price,
breaks = c(0, brks$max_range),
labels = brks$buyer_fee,
right = F)))
# invoice_id sale_price buyer_fee
#1 1 100 25
#2 2 275 50
#3 3 350 75
#4 4 500 125
#5 5 675 125

Related

Duplicate rows in R N times and adding new count column

Ciao: Here is the data I have "have"
have = data.frame(c(1,2,3),
c(90,87,71),
c(600,601,602))
colnames(have) <- c("STUDENT","SCORE","TYPE")
Here is the data I want "want"
want = data.frame(c(1,1,2,2,3,3),
c(90,90,87,87,71,71),
c(600,600,601,601,602,602),
c(100,101,100,101,100,101))
colnames(want) <- c("STUDENT","SCORE","TYPE","CLASS")
As shown above here starting from "have" data I want to copy the row for every STUDENT; add new column "CLASS" which is equals to 100 for the STUDENT's first row and 101 for the STUDENT's second row.
Cheers!
I am creating a additional key for merge
have$key=1
mergedf=data.frame('CLASS'=c(100,101),'key'=1)
merge(have,mergedf,all.x=T)
key STUDENT SCORE TYPE CLASS
1 1 1 90 600 100
2 1 1 90 600 101
3 1 2 87 601 100
4 1 2 87 601 101
5 1 3 71 602 100
6 1 3 71 602 101
李哲源 plus Axeman provided the answers
## R core
data.frame(have[rep(1:nrow(have), each = 2), ], CLASS = c(100, 101),
row.names = seq_len(2 * nrow(have)))
## dplyr
dplyr::bind_rows('100' = have, '101' = have, .id = 'CLASS')
classes <- as.matrix(seq(100,101, by=1))
classes_rep <-matrix(classes, nrow=nrow(have)*nrow(classes))
want <- cbind(rbind(have, have), classes_rep)

Find the values between the range of 100's and their count

In excel I can make the group the column and count but iam unable to do it in R.
For doing in R i am using cut function with some breaks.
cut(elapsed, breaks=seq(min(elapsed),max(elapsed)+100,50), include.lowest=T)
here i attached the png of the data and required output.
but above code not give my require output.
this is the my data
and my required output:
400 9
500 4
600 2
700 5
800 3
900 3
This should work:
data.frame(table(elapsed %/% 100))
For example:
elapsed <- c(400, 423, 423, 534, 534, 639, 602, 812, 703)
data.frame(table(elapsed %/% 100))
Var1 Freq
1 4 3
2 5 2
3 6 2
4 7 1
5 8 1
For desired result in hundreds use this:
res <- data.frame(table(elapsed %/% 100))
res$Var1 <- as.numeric(res$Var1) * 100
you can try:
require(magrittr)
elapsed <- runif(100, 400, 1000) %>% round
cut(elapsed, breaks = seq(400,1000,100),
labels = as.character(seq(400,900,100)),
include.lowest=TRUE) %>% table
gives you:
400 500 600 700 800 900
15 22 16 9 20 18

how to subtract a value from one column from a value from a previous row, different column in r

I have a dataframe composed of 3 columns and ~2000 rows.
ID DistA DistB
1 100 200
2 239 390
3 392 550
4 700 760
5 770 900
The first column (ID) is a unique identifier for each row. I'd like my script to read each row, and subtract/compare the value from column "DistA" in each row from the value of column "DistB" from the previous row. If the difference of the distance of any subsequent pairs is <40, to output that they are in the same area.
For example: In the above example comparing row 2 and 1, '239' from row 2 and '200' from row 1 is <40 and therefore in the same area. The same way 2 and 3, are in the same area ie the difference is 2 and 2<40. But rows 3 and 4 are not as the difference is 150.
I have not been able to go far, as I am stuck in the comparison (subtraction/difference) step. I have tried to write a loop to iterate in all the rows, but I keep getting errors. Should I even use a loop, or can I do this without a loop?
I am a new R learner, and this is the 'rookie' code that I have so far. Where am I going wrong. Thanks in advance:
#the function to compare the two columns
funct <- function(x){
for(i in 1:(nrow(dat)))
(as.numeric(dat$DistA[i-1])) - (as.numeric(dat$DistB[i]))}
#creating a new column 'new2' with the differences
dat$new2 <- apply(dat[,c('DistB','DistA')]),1, funct
When I run this, I get the following error:
Error: unexpected ',' in "dat$new2 <- apply(dat[,c('DistB','DistA')]),"
I'll appreciate all the comments/suggestions.
I believe dplyr can help you here.
library(dplyr)
dfData <- data.frame(ID = c(1, 2, 3, 4, 5),
DistA = c(100, 239, 392, 700, 770),
DistB = c(200, 390, 550, 760, 900))
dfData <- mutate(dfData, comparison = DistA - lag(DistB))
This results in...
dfData
ID DistA DistB comparison
1 1 100 200 NA
2 2 239 390 39
3 3 392 550 2
4 4 700 760 150
5 5 770 900 10
You could then check to see if a row is within the same "area" as your previous row.
We could also try data.table (similar to the approach as suggested in the comments by #David Arenburg). shift is a new function introduced in the devel version with type='lag' as the default option. It can be installed from here
library(data.table)#data.table_1.9.5
setDT(df1)[, Categ := c('Diff', 'Same')[
(abs(DistA-shift(DistB)) < 40 )+1L]][]
# ID DistA DistB Categ
#1: 1 100 200 NA
#2: 2 239 390 Same
#3: 3 392 550 Same
#4: 4 700 760 Diff
#5: 5 770 900 Same
If we need both the 'difference' and 'category' columns
setDT(df1)[,c('Dist', 'Categ'):={tmp= abs(DistA-shift(DistB))
list(tmp, c('Diff', 'Same')[(tmp <40)+1L])}]
df1
# ID DistA DistB Dist Categ
#1: 1 100 200 NA NA
#2: 2 239 390 39 Same
#3: 3 392 550 2 Same
#4: 4 700 760 150 Diff
#5: 5 770 900 10 Same

Sort data by row based on a range of values

My data is:
phone colour length weight rating
100 5 3 3 0
200 1 4
303 3 30 9
302 2 43 0 2
106 43
203 23 3 1 7
I want my data to look like this:
Variable A (sort_by_model_100):
phone colour length weight rating
100 5 3 3 0
106 43
Variable B (sort_by_model_200):
phone colour length weight rating
200 4 20 1 4
203 23 3 1 7
Variable C (sort_by_model_300):
phone colour length weight rating
303 3 30 0 9
302 2 43 0 2
My R code:
data <- read.csv(file.choose(),header=TRUE)
sort_by_model_100 <- split (data, data$phone[100:200])
sort_by_model_200 <- split (data, data$phone[200:300])
sort_by_model_300 <- split (data, data$phone[300:400])
I get this error and my code doesn't work :
Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
Please help.
You can use subset:
var_a = subset(data, phone >= 100 & phone < 200)
var_b = subset(data, phone >= 200 & phone < 300)
And so on. Maybe you can improve the code to avoid hard-coding the ranges.
With this data
data<-data.frame(
phone=c(100,200,303,302,106,203),
colour=c(5,NA,3,2,43,23),
length=c(3,NA,30,43,NA,3),
weight=c(3,1,NA,0,NA,1),
rating=c(0,4,9,2,NA,7)
)
I'd use cut to create a factor to indicated model class
model<-cut(data$phone, breaks=c(100,200,300,400), include.lowest=T, right=F)
Then you can use split to create a list of sub-data.frames
split(data, model)
This is likely to be easier to work with than a bunch of different data.frame variables.

Finding overlapping ranges between two interval data

I have one table with coordinates (start, end) of ca. 500000 fragments and another table with 60000 single coordinates that I would like to match with the former fragments. I.e., for each record from dtCoords table I need to search a record in dtFrags table having the same chr and start<=coord<=end (and retrieve the type from this record of dtFrags). Is it good idea at all to use R for this, or I should rather look to other languages?
Here is my example:
require(data.table)
dtFrags <- fread(
"id,chr,start,end,type
1,1,100,200,exon
2,2,300,500,intron
3,X,400,600,intron
4,2,250,600,exon
")
dtCoords <- fread(
"id,chr,coord
10,1,150
20,2,300
30,Y,500
")
At the end, I would like to have something like this:
"idC,chr,coord,idF,type
10, 1, 150, 1, exon
20, 2, 300, 2, intron
20, 2, 300, 4, exon
30, Y, 500, NA, NA
"
I can simplify a bit the task by splitting the table to subtables by chr, so I would concentrate only on coordinates
setkey(dtCoords, 'chr')
setkey(dtFrags, 'chr')
for (chr in unique(dtCoords$chr)) {
dtCoordsSub <- dtCoords[chr];
dtFragsSub <- dtFrags[chr];
dtCoordsSub[, {
# ????
}, by=id]
}
but it's still not clear for me how should I work inside... I would be very grateful for any hints.
UPD. just in case, I put my real table in the archive here. After unpacking to your working directory, tables can be loaded with the following code:
dtCoords <- fread("dtCoords.txt", sep="\t", header=TRUE)
dtFrags <- fread("dtFrags.txt", sep="\t", header=TRUE)
In general, it's very appropriate to use the bioconductor package IRanges to deal with problems related to intervals. It does so efficiently by implementing interval tree. GenomicRanges is another package that builds on top of IRanges, specifically for handling, well, "Genomic Ranges".
require(GenomicRanges)
gr1 = with(dtFrags, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
gr2 = with(dtCoords, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
olaps = findOverlaps(gr2, gr1)
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
setkey(dtCoords, grp)
setkey(dtFrags, grp)
dtFrags[, list(grp, id, type)][dtCoords]
grp id type id.1 chr coord
1: 1 1 exon 10 1 150
2: 2 2 intron 20 2 300
3: 2 4 exon 20 2 300
4: 3 NA NA 30 Y 500
Does this work?
You can use merge first and then subset
kk<-merge(dtFrags,dtCoords,by="chr",all.x=TRUE)
> kk
chr id.x start end type id.y coord
1: 1 1 100 200 exon 10 150
2: 2 2 300 500 intron 20 300
3: 2 4 250 600 exon 20 300
4: X 3 400 600 intron NA NA
kk[coord>=start & coord<=end]
chr id.x start end type id.y coord
1: 1 1 100 200 exon 10 150
2: 2 4 250 600 exon 20 300

Resources