My data looks like this:
FlightID FareClass FareClassRank FareValue Bookings
1 YULCDG215135 Q 1 100 5
2 YULCDG215135 X 2 150 7
3 YULCDG215135 V 3 200 4
4 YULCDG215135 Y 4 1000 2
5 YULCDG215136 Q 1 120 1
6 YULCDG215136 X 2 200 4
7 YULCDG215136 V 3 270 5
8 YULCDG215136 Y 4 900 15
Question: I need to write some simple R code that computes, for each flight, the ratio of the value of each fare class to the value of its next fare class.
For example, X is the next fare class after Q, V is the next fare class after X, and so on.
The ratios for flight YULCDG215135 are 100/150 = 0.6667 for X, 150/200 = 0.75 for V, and so on.
Here is a base R solution using by.
by(df, df$FlightID, function(x) c(NA, exp(-diff(log(x$FareValue)))))
#df$FlightID: YULCDG215135
#[1] NA 0.6666667 0.7500000 0.2000000
#------------------------------------------------------------
#df$FlightID: YULCDG215136
#[1] NA 0.6000000 0.7407407 0.3000000
Or alternatively using ave
transform(df, ratio = ave(FareValue, FlightID, FUN = function(x) c(NA, exp(-diff(log(x))))))
# FlightID FareClass FareClassRank FareValue Bookings ratio
#1 YULCDG215135 Q 1 100 5 NA
#2 YULCDG215135 X 2 150 7 0.6666667
#3 YULCDG215135 V 3 200 4 0.7500000
#4 YULCDG215135 Y 4 1000 2 0.2000000
#5 YULCDG215136 Q 1 120 1 NA
#6 YULCDG215136 X 2 200 4 0.6000000
#7 YULCDG215136 V 3 270 5 0.7407407
#8 YULCDG215136 Y 4 900 15 0.3000000
The trick in both cases is to log-transform FareValue so that we can use diff, and then to invert the transformation (using exp) to give the ratio.
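To see why the round trip works: diff(log(x)) is log(x[i+1]) - log(x[i]) = log(x[i+1]/x[i]), so negating it and applying exp gives x[i]/x[i+1]. A minimal sketch that checks this against plain division of shifted vectors, using the fares of the first flight (the variable names are only illustrative):

```r
# Fare values of flight YULCDG215135, ordered by FareClassRank
fare <- c(100, 150, 200, 1000)

# Ratio of each fare to the next one via the log/diff/exp trick
ratio_log <- exp(-diff(log(fare)))

# The same ratio by dividing the vector by a shifted copy of itself
ratio_direct <- head(fare, -1) / tail(fare, -1)

# Both approaches should agree up to floating point error
all.equal(ratio_log, ratio_direct)
```

The direct form is arguably easier to read and also works for negative values, where log() would return NaN.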
library(dplyr)
df %>%
group_by(FlightID) %>%
arrange(FareClassRank) %>%
# lead() looks at the next row within each flight, so the ratio sits on the lower fare class
mutate(ratio=FareValue/lead(FareValue))
Try this:
df$FareOverNext <- unlist(lapply(split(df$FareValue, df$FlightID), function(x) {
# 1 is a placeholder for the first fare class, which has no previous class
c(1, x[1:(length(x) - 1)] / x[2:length(x)])
}))
EDIT:
Added lapply and split thanks to @thelatemail's comment.
My data frame looks like this.
The "rank" variable has to be incremented whenever the difference between the [i]th row of "start" and the [i-1]th row of "end" exceeds 14 (and also whenever a different "ID" is encountered).
I tried the code below and it works correctly.
The problem is that it is far too slow, because I have over 700,000 rows.
Is there any way to make it run much faster?
df$rank <- 1
for(i in 2:nrow(df)){
df[i,"rank"] <- ifelse((df[i,"ID"]==df[i-1,"ID"])&
(df[i-1,"diff"]<=14),
df[i,"rank"] <- df[i-1,"rank"],
df[i,"rank"] <- df[i-1,"rank"] + 1)
}
You can try :
library(dplyr)
df %>% mutate(rank = cumsum(diff > 14 | row_number() == 1 | ID != lag(ID)))
Same logic using base R :
df$rank <- with(df, cumsum(diff > 14 | c(TRUE, tail(ID, -1) != head(ID, -1))))
You can use cumsum to increase the rank whenever the condition (df[i,"ID"]==df[i-1,"ID"]) & (df[i-1,"diff"]<=14) is not met.
df$rank <- cumsum(c(1,(df$ID != c(df$ID[-1], NA) | df$diff>14)[-nrow(df)]))
df
# ID diff rank
#1 a 4 1
#2 a 6 1
#3 a 8 1
#4 a 870 1
#5 a 34 2
#6 a NA 3
#7 b 4 4
#8 b 6 4
#9 b 8 4
#10 b 870 4
#11 b 34 5
#12 b NA 6
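The idea behind the cumsum approach can be broken into steps. This is only a sketch on a smaller toy data set (the vectors id and gap are made up for illustration): build a logical vector that is TRUE on every row where a new rank starts, then take its running sum.

```r
# Toy data: two IDs and one gap column (hypothetical values)
id  <- rep(c("a", "b"), each = 3)
gap <- c(4, 870, 34, 4, 6, 8)

# For rows 2..n: did the ID change compared with the previous row?
id_changed <- id[-1] != id[-length(id)]

# For rows 2..n: was the previous row's gap larger than 14?
gap_too_big <- gap[-length(gap)] > 14

# The first row always starts a rank; later rows start one if either test is TRUE
new_rank <- c(TRUE, id_changed | gap_too_big)

# TRUE counts as 1, so the running sum numbers the ranks
rank <- cumsum(new_rank)
rank
# [1] 1 1 2 3 3 3
```

Because every step is vectorised, this avoids the row-by-row assignment that makes the loop slow.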
Using your code:
df$rank <- 1
for(i in 2:nrow(df)){
df[i,"rank"] <- ifelse((df[i,"ID"]==df[i-1,"ID"]) & (df[i-1,"diff"]<=14),
df[i,"rank"] <- df[i-1,"rank"], df[i,"rank"] <- df[i-1,"rank"] + 1)
}
df
# ID diff rank
#1 a 4 1
#2 a 6 1
#3 a 8 1
#4 a 870 1
#5 a 34 2
#6 a NA 3
#7 b 4 4
#8 b 6 4
#9 b 8 4
#10 b 870 4
#11 b 34 5
#12 b NA 6
Data:
df <- data.frame(ID=rep(c("a","b"), each=6), diff=c(4,6,8,870,34,NA)
, stringsAsFactors = FALSE)
df
# ID diff
#1 a 4
#2 a 6
#3 a 8
#4 a 870
#5 a 34
#6 a NA
#7 b 4
#8 b 6
#9 b 8
#10 b 870
#11 b 34
#12 b NA
Here is a base R solution using ave + ifelse
df <- within(df,rank <- ave(diff>14, diff>14,ID,FUN = function(x) ifelse(x,seq(x),+!x)))
I have a list of data frames and I am trying to normalize the data in a few of the columns by dividing each value in a column by that column's sum. The issue is that all of the sums are different.
I started by summing each column of every data frame using lapply, which gave me a list of colSums. However, I'm not sure how to divide each data frame by its corresponding sums.
Here is an example using some code I made up
df1=data.frame("total"=c(50,100,75),
"a"=c(15,50,30),
"b"=c(15,10,5),
"c"=c(20,40,40))
df2=data.frame("total"=c(100,200,400,100),
"a"=c(10,40,100,50),
"b"=c(50,100,200,30),
"c"=c(40,60,100,20))
df3=data.frame("total"=c(40,60,80),
"a"=c(15,30,50),
"b"=c(25,20,20),
"c"=c(0,10,10))
listex=list(df1=df1,df2=df2,df3=df3)
listtotal=lapply(listex,function(x) {x=colSums(x,na.rm=T)})
listex
$df1
total a b c
1 50 15 15 20
2 100 50 10 40
3 75 30 5 40
$df2
total a b c
1 100 10 50 40
2 200 40 100 60
3 400 100 200 100
4 100 50 30 20
$df3
total a b c
1 40 15 25 0
2 60 30 20 10
3 80 50 20 10
listtotal
$df1
total a b c
225 95 30 100
$df2
total a b c
800 200 380 220
$df3
total a b c
180 95 65 20
What I want to have happen is the following, but, you know, without having to write it all over again
df1n=data.frame("total"=c(50/225,100/225,75/225),"a"=c(15/95,50/95,30/95),
"b"=c(15/30,10/30,5/30),"c"=c(20/100,40/100,40/100))
df2n=data.frame("total"=c(100/800,200/800,400/800,100/800),
"a"=c(10/200,40/200,100/200,50/200),
"b"=c(50/380,100/380,200/380,30/380),
"c"=c(40/220,60/220,100/220,20/220))
df3n=data.frame('total'=c(40/180,60/180,80/180),
'a'=c(15/95,30/95,50/95),
'b'=c(25/65,20/65,20/65),
'c'=c(0/20,10/20,10/20))
listn=list(df1=df1n,df2=df2n,df3=df3n)
listn
$df1
total a b c
1 0.2222222 0.1578947 0.5000000 0.2
2 0.4444444 0.5263158 0.3333333 0.4
3 0.3333333 0.3157895 0.1666667 0.4
$df2
total a b c
1 0.125 0.05 0.13157895 0.18181818
2 0.250 0.20 0.26315789 0.27272727
3 0.500 0.50 0.52631579 0.45454545
4 0.125 0.25 0.07894737 0.09090909
$df3
total a b c
1 0.2222222 0.1578947 0.3846154 0.0
2 0.3333333 0.3157895 0.3076923 0.5
3 0.4444444 0.5263158 0.3076923 0.5
I'm thinking it has something to do with a nested apply function, but I'm not sure how exactly I would go about doing that. Any help is much appreciated!
Arithmetic works element-wise on vectors, matrices, and data frames of equal size. One option is to divide each data frame in the list by its colSums, replicated with col(x) so that the dimensions match:
lapply(listex, function(x) x/colSums(x)[col(x)])
Alternatively, if 'listtotal' already exists as a separate object, use Map to divide each element of one list by the corresponding element of the other:
Map(function(x, y) x/y[col(x)], listex, listtotal)
You could use the sweep function
lapply(listex, function(x) sweep(x, 2, colSums(x), '/'))
Or convert the colsums to a list to use / directly
lapply(listex, function(x) x/as.list(colSums(x)))
Output for both methods:
# $`df1`
# total a b c
# 1 0.2222222 0.1578947 0.5000000 0.2
# 2 0.4444444 0.5263158 0.3333333 0.4
# 3 0.3333333 0.3157895 0.1666667 0.4
#
# $df2
# total a b c
# 1 0.125 0.05 0.13157895 0.18181818
# 2 0.250 0.20 0.26315789 0.27272727
# 3 0.500 0.50 0.52631579 0.45454545
# 4 0.125 0.25 0.07894737 0.09090909
#
# $df3
# total a b c
# 1 0.2222222 0.1578947 0.3846154 0.0
# 2 0.3333333 0.3157895 0.3076923 0.5
# 3 0.4444444 0.5263158 0.3076923 0.5
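Another base R option that is not used in the answers above is prop.table with margin = 2, which divides every column of a matrix by its own sum. A minimal sketch on a cut-down version of the example list (only two data frames and two columns are kept for brevity), with a check that every normalised column sums to 1:

```r
# Reduced version of the example list
listex <- list(
  df1 = data.frame(total = c(50, 100, 75), a = c(15, 50, 30)),
  df3 = data.frame(total = c(40, 60, 80),  a = c(15, 30, 50))
)

# prop.table() needs a matrix, so convert, normalise by column, and convert back
listn <- lapply(listex, function(x) {
  as.data.frame(prop.table(as.matrix(x), margin = 2))
})

# Sanity check: each column of each normalised data frame should sum to 1
col_totals <- sapply(listn, colSums)
col_totals
```

The round trip through as.matrix assumes all columns are numeric, as they are in the example.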
I can perform a rolling division on vectors: I take my data, lag the same data, and perform the division:
# Dummy Data
sample <- c(4,5,6,7,8,4,2,6,5,4,3,2,1,2,3,4,5,6)
lagSam <- lag(sample) # Lag by 1
output <- sample / lagSam # Perform division
sample.df <- data.frame(sample, desired = output)
with the desired output:
sample desired
1 4 NA
2 5 1.2500000
3 6 1.2000000
4 7 1.1666667
5 8 1.1428571
6 4 0.5000000
7 2 0.5000000
8 6 3.0000000
9 5 0.8333333
10 4 0.8000000
11 3 0.7500000
12 2 0.6666667
13 1 0.5000000
14 2 2.0000000
15 3 1.5000000
16 4 1.3333333
17 5 1.2500000
18 6 1.2000000
My question is, how do I do the same on a data frame? I have over 100 columns and need to take a rolling division of each. I am trying to write a function and use roll apply:
# My attempt
division <- function(x) {
#tail(x,1) / head(x,1)
x / lag(x)
}
rollapplyr(sample.df$sample, 1, division, fill = NA)
I tried to lag it with head and tail and then again with x / lag of x.
Both results produce NA's.
I frequently find that I want the dplyr::lag behavior but am startled a little when stats::lag returns a tsp-like object.
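The difference between the two lag functions is easy to miss. As a small sketch (the vector x is made up): stats::lag leaves the values where they are and only attaches a shifted "tsp" time attribute, so dividing a plain vector by its stats::lag divides every element by itself.

```r
x <- c(4, 5, 6, 7)

# stats::lag() keeps the values in place and only records a shifted time index
lagged_stats <- stats::lag(x)
attr(lagged_stats, "tsp")

# Dividing by it therefore gives 1 everywhere
ratio_stats <- as.numeric(x / lagged_stats)

# Shifting the values by hand gives the intended "current / previous" ratio
ratio_shifted <- x / c(NA, head(x, -1))
```

Here ratio_stats is all ones, while ratio_shifted starts with NA followed by 5/4, 6/5 and 7/6.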
You're close with tail/head functionality. Using negative n:
n: a single integer. If positive, size for the resulting object:
number of elements for a vector (including lists), rows for a
matrix or data frame or lines for a function. If negative,
all but the 'n' last/first number of elements of 'x'.
c(NA, tail(sample.df$sample,n=-1) / head(sample.df$sample,n=-1))
# [1] NA 1.2500000 1.2000000 1.1666667 1.1428571 0.5000000 0.5000000
# [8] 3.0000000 0.8333333 0.8000000 0.7500000 0.6666667 0.5000000 2.0000000
# [15] 1.5000000 1.3333333 1.2500000 1.2000000
Note that your thought to do a rolling apply is slightly problematic in that it will divide a number by the result of the previous division, not the previous value. That is, c(2,3,4) in a rolling divide should start with an initial value (say, 1, the division-identity), and go with something like c(2/1, 3/(2/1), 4/(3/(2/1))), not what (I think) you requested.
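To make the distinction concrete, here is a hedged sketch of both interpretations for c(2,3,4). The cumulative version uses Reduce with accumulate = TRUE and an initial value of 1; that is just one way to express it, not something taken from the question.

```r
x <- c(2, 3, 4)

# Cumulative division: each element is divided by the previous *result*
# (init = 1 is the starting value; [-1] drops it from the output)
cumulative <- Reduce(function(prev, cur) cur / prev, x,
                     init = 1, accumulate = TRUE)[-1]
cumulative
# 2/1, 3/(2/1), 4/(3/(2/1))

# Pairwise division: each element is divided by the previous *value*
pairwise <- c(NA, tail(x, -1) / head(x, -1))
pairwise
# NA, 3/2, 4/3
```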
This drives the function to be:
division <- function(x) c(NA, tail(x,n=-1) / head(x,n=-1))
So then you can do
lapply(sample.df, division)
If you only want to run this on select columns, I'd do
ind <- 1
lapply(sample.df[ind], division)
# $sample
# [1] NA 1.2500000 1.2000000 1.1666667 1.1428571 0.5000000 0.5000000
# [8] 3.0000000 0.8333333 0.8000000 0.7500000 0.6666667 0.5000000 2.0000000
# [15] 1.5000000 1.3333333 1.2500000 1.2000000
cbind(sample.df, lapply(sample.df[ind], division))
# sample desired sample
# 1 4 1 NA
# 2 5 1 1.2500000
# 3 6 1 1.2000000
# 4 7 1 1.1666667
# 5 8 1 1.1428571
# 6 4 1 0.5000000
# 7 2 1 0.5000000
# 8 6 1 3.0000000
# 9 5 1 0.8333333
# 10 4 1 0.8000000
# 11 3 1 0.7500000
# 12 2 1 0.6666667
# 13 1 1 0.5000000
# 14 2 1 2.0000000
# 15 3 1 1.5000000
# 16 4 1 1.3333333
# 17 5 1 1.2500000
# 18 6 1 1.2000000
This of course generated a duplicate name, but it's a start.
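One possible way around the duplicate name is to rename the new columns before binding them. A small sketch, assuming a two-column data frame and a "_ratio" suffix (both are made up for illustration):

```r
# Hypothetical two-column data frame
sample.df <- data.frame(a = c(4, 5, 6, 7), b = c(10, 5, 20, 10))

# Same helper as above: each value divided by the previous value
division <- function(x) c(NA, tail(x, n = -1) / head(x, n = -1))

# Apply it to every column and give the results distinct names
ratios <- as.data.frame(lapply(sample.df, division))
names(ratios) <- paste0(names(sample.df), "_ratio")

# Original columns followed by their ratio columns
result <- cbind(sample.df, ratios)
names(result)
# [1] "a"       "b"       "a_ratio" "b_ratio"
```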
BTW: rolling generally has to do with a cumulative process on a vector. What you are looking for is applying one function to each vector individually and capturing the response.
Here are a few ways:
1) diff
transform(sample.df, desired = c(NA, exp(diff(log(sample)))))
## sample desired
## 1 4 NA
## 2 5 1.2500000
## 3 6 1.2000000
## 4 7 1.1666667
## 5 8 1.1428571
## ... etc ...
To apply this to multiple columns using the built in data frame anscombe:
rbind(NA, exp(diff(log(as.matrix(anscombe)))))
2) diff.zoo
library(zoo)
z <- zoo(sample.df$sample)
merge(z, desired = diff(z, arith = FALSE), all = TRUE)
## z desired
## 1 4 NA
## 2 5 1.2500000
## 3 6 1.2000000
## 4 7 1.1666667
## 5 8 1.1428571
## ... etc ...
To apply it to all columns of anscombe:
z <- zoo(rbind(NA, anscombe))
diff(z, arith = FALSE)
3) dplyr
library(dplyr)
sample.df %>% mutate(desired = sample/lag(sample))
## sample desired
## 1 4 NA
## 2 5 1.2500000
## 3 6 1.2000000
## 4 7 1.1666667
## 5 8 1.1428571
## ... etc ...
To apply this to all columns of anscombe:
anscombe %>% mutate(across(everything(), ~ .x / lag(.x)))  # funs() is deprecated in current dplyr
4) rollapplyr
library(zoo)
transform(sample.df, desired = rollapplyr(sample, 2, function(x) x[2]/x[1], fill = NA))
## sample desired
## 1 4 NA
## 2 5 1.2500000
## 3 6 1.2000000
## 4 7 1.1666667
## 5 8 1.1428571
To apply it to all columns of anscombe:
rollapplyr(anscombe, 2, function(x) x[2]/x[1], fill = NA)
5) lag.ts
transform(sample.df, desired = c(NA, lag(ts(sample)) / ts(sample)))
## sample desired
## 1 4 NA
## 2 5 1.2500000
## 3 6 1.2000000
## 4 7 1.1666667
## 5 8 1.1428571
To apply it to all columns of anscombe use the following. Note that dplyr should NOT be loaded since it annoyingly clobbers lag with its own lag. Alternately use stats::lag:
lag(ts(anscombe)) / ts(anscombe)
If I have data that looks like this:
A B C
1 GM1 100
2 DOX 10
3 GM2 3
4 GM3 99
5 MY 62
6 GMPN 30
How can I use R to keep only the rows that include "GM", so that the data looks like this:
A B C
1 GM1 100
3 GM2 3
4 GM3 99
You can use grep
df1[grep('GM\\d+', df1$B),]
# A B C
#1 1 GM1 100
#3 3 GM2 3
#4 4 GM3 99
Or, as @ColonelBeauvel mentioned
subset(df1, grepl('GM\\d+', B))
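The digit part of the pattern matters here. The sketch below rebuilds the example data and compares three patterns; the anchored variant is an extra suggestion that is not part of the answers above.

```r
df1 <- data.frame(A = 1:6,
                  B = c("GM1", "DOX", "GM2", "GM3", "MY", "GMPN"),
                  C = c(100, 10, 3, 99, 62, 30))

# Plain "GM" also matches "GMPN", which the desired output excludes
loose <- grepl("GM", df1$B)

# "GM\\d+" requires at least one digit directly after "GM"
digits <- grepl("GM\\d+", df1$B)

# Anchoring with ^ and $ additionally rejects values such as "XGM1" or "GM1A"
strict <- grepl("^GM\\d+$", df1$B)

df1[strict, ]
```

For the example data, digits and strict select the same three rows; they only differ on values with extra characters around the GM-plus-digits part.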
I am attempting to append a sequence number to a data frame grouped by individuals and date. For example, to turn this:
x y
1 A 2012-01-02
2 A 2012-02-03
3 A 2012-02-25
4 A 2012-03-04
5 B 2012-01-02
6 B 2012-02-03
7 C 2013-01-02
8 C 2012-02-03
9 C 2012-03-04
10 C 2012-04-05
in to this:
x y v
1 A 2012-01-02 1
2 A 2012-02-03 2
3 A 2012-02-25 3
4 A 2012-03-04 4
5 B 2012-01-02 1
6 B 2012-02-03 2
7 C 2013-01-02 1
8 C 2012-02-03 2
9 C 2012-03-04 3
10 C 2012-04-05 4
where "x" is the individual, "y" is the date, and "v" is the appended sequence number
I have had success on a small data frame using a for loop in this code:
x=c("A","A","A","A","B","B","C","C","C","C")
y=as.Date(c("1/2/2012","2/3/2012","2/25/2012","3/4/2012","1/2/2012","2/3/2012",
"1/2/2013","2/3/2012","3/4/2012","4/5/2012"),"%m/%d/%Y")
x
y
z=data.frame(x,y)
z$v=rep(1,nrow(z))
for(i in 2:nrow(z)){
if(z$x[i]==z$x[i-1]){
z$v[i]=(z$v[i-1]+1)
} else {
z$v[i]=1
}
}
but when I expand this to a much larger data frame (250K+ rows) the process takes forever.
Any thoughts on how I can make this more efficient?
This seems to work. May be overkill though.
## code needed revision - this is old code
## > d$v <- unlist(sapply(sapply(split(d, d$x), nrow), seq))
EDIT
I can't believe I got away with that ugly mess for so long. Here's a revision. Much simpler.
## revised 04/24/2014
> d$v <- unlist(sapply(table(d$x), seq))
> d
## x y v
## 1 A 2012-01-02 1
## 2 A 2012-02-03 2
## 3 A 2012-02-25 3
## 4 A 2012-03-04 4
## 5 B 2012-01-02 1
## 6 B 2012-02-03 2
## 7 C 2013-01-02 1
## 8 C 2012-02-03 2
## 9 C 2012-03-04 3
## 10 C 2012-04-05 4
Also, an interesting one is stack. Take a look.
> stack(sapply(table(d$x), seq))
## values ind
## 1 1 A
## 2 2 A
## 3 3 A
## 4 4 A
## 5 1 B
## 6 2 B
## 7 1 C
## 8 2 C
## 9 3 C
## 10 4 C
I'm removing my previous post and replacing it with this solution. Extremely efficient for my purposes.
# order data
z=z[order(z$x,z$y),]
#convert to data table
library(data.table)
dt.z=data.table(z)
# obtain vector of sequence numbers
z$seq=dt.z[,1:.N,"x"]$V1
The above can be accomplished in fewer steps but I wanted to illustrate what I did. This is appending sequence numbers to my data sets of over 250k records in under a second. Thanks again to Henrik and Richard.
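For comparison, a base R sketch of the same idea using ave with seq_along. This is an alternative, not the data.table method described above, and the small data frame is made up to show that the rows do not need to be sorted by group first.

```r
# Hypothetical data with the groups deliberately interleaved
z <- data.frame(x = c("A", "A", "B", "A", "C", "B"))

# ave() splits the row positions by group, numbers each group 1..n,
# and puts the results back in the original row order
z$v <- ave(seq_along(z$x), z$x, FUN = seq_along)
z$v
# [1] 1 2 1 3 1 2
```

If the sequence should follow the dates, order the data frame by x and y first, as in the data.table version.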