I can perform a rolling division on a vector: I take my data, lag the same data, and perform the division:
# Dummy Data
sample <- c(4,5,6,7,8,4,2,6,5,4,3,2,1,2,3,4,5,6)
lagSam <- lag(sample) # Lag by 1 (assumes dplyr::lag; stats::lag would leave the values unshifted)
output <- sample / lagSam # Perform division
sample.df <- data.frame(sample, desired = output)
with the desired output:
sample desired
1 4 NA
2 5 1.2500000
3 6 1.2000000
4 7 1.1666667
5 8 1.1428571
6 4 0.5000000
7 2 0.5000000
8 6 3.0000000
9 5 0.8333333
10 4 0.8000000
11 3 0.7500000
12 2 0.6666667
13 1 0.5000000
14 2 2.0000000
15 3 1.5000000
16 4 1.3333333
17 5 1.2500000
18 6 1.2000000
My question is, how do I do the same on a data frame? I have over 100 columns and need to take a rolling division of each. I am trying to write a function and use it with rollapplyr:
# My attempt
division <- function(x) {
#tail(x,1) / head(x,1)
x / lag(x)
}
rollapplyr(sample.df$sample, 1, division, fill = NA)
I tried lagging with head and tail, and then again with x / lag(x).
Both attempts produce NAs.
I frequently find that I want the dplyr::lag behavior but am startled a little when stats::lag returns a tsp-like object.
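To make that difference concrete, here is a minimal base-R sketch (the vector x and the helper name shift1 are made up for illustration). On a plain vector, stats::lag keeps the values in place and only attaches a shifted "tsp" attribute, so dividing by it gives all ones; prepending NA and dropping the last element gives the shifted copy the question appears to expect.

```r
x <- c(4, 5, 6, 7)

# stats::lag leaves the values where they are and only records a shifted
# time base in the "tsp" attribute.
lagged_ts <- stats::lag(x)
same_values <- as.numeric(lagged_ts)   # attributes dropped, values unchanged
ratio_ts <- x / same_values            # every element divided by itself

# Hypothetical helper: shift values down by one position, padding with NA.
shift1 <- function(v) c(NA, head(v, -1))
ratio_shifted <- x / shift1(x)         # x[i] / x[i - 1], NA in the first slot
```

This would also explain a "desired" column that is all 1s when dplyr is not loaded.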
You're close with tail/head functionality. Using negative n:
n: a single integer. If positive, size for the resulting object:
number of elements for a vector (including lists), rows for a
matrix or data frame or lines for a function. If negative,
all but the 'n' last/first number of elements of 'x'.
c(NA, tail(sample.df$sample,n=-1) / head(sample.df$sample,n=-1))
# [1] NA 1.2500000 1.2000000 1.1666667 1.1428571 0.5000000 0.5000000
# [8] 3.0000000 0.8333333 0.8000000 0.7500000 0.6666667 0.5000000 2.0000000
# [15] 1.5000000 1.3333333 1.2500000 1.2000000
Note that your thought to do a rolling apply is slightly problematic in that it will divide a number by the result of the previous division, not the previous value. That is, c(2,3,4) in a rolling divide should start with an initial value (say, 1, the division-identity), and go with something like c(2/1, 3/(2/1), 4/(3/(2/1))), not what (I think) you requested.
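The two computations being contrasted here can be sketched side by side in base R. Using Reduce for the cumulative variant is my own choice for illustration, not something from the question, and the starting value of 1 is the assumption stated above.

```r
x <- c(2, 3, 4)

# Cumulative division: each value is divided by the previous *result*,
# starting from 1. With accumulate = TRUE, Reduce() returns the initial
# value first, so it is dropped.
running <- Reduce(function(acc, v) v / acc, x, init = 1, accumulate = TRUE)[-1]

# Pairwise division: each value is divided by the previous *value*.
pairwise <- c(NA, tail(x, -1) / head(x, -1))
```

Only the pairwise version reproduces the desired column in the question.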
This drives the function to be:
division <- function(x) c(NA, tail(x,n=-1) / head(x,n=-1))
So then you can do
lapply(sample.df, division)
If you only want to run this on select columns, I'd do
ind <- 1
lapply(sample.df[ind], division)
# $sample
# [1] NA 1.2500000 1.2000000 1.1666667 1.1428571 0.5000000 0.5000000
# [8] 3.0000000 0.8333333 0.8000000 0.7500000 0.6666667 0.5000000 2.0000000
# [15] 1.5000000 1.3333333 1.2500000 1.2000000
cbind(sample.df, lapply(sample.df[ind], division))
# sample desired sample
# 1 4 1 NA
# 2 5 1 1.2500000
# 3 6 1 1.2000000
# 4 7 1 1.1666667
# 5 8 1 1.1428571
# 6 4 1 0.5000000
# 7 2 1 0.5000000
# 8 6 1 3.0000000
# 9 5 1 0.8333333
# 10 4 1 0.8000000
# 11 3 1 0.7500000
# 12 2 1 0.6666667
# 13 1 1 0.5000000
# 14 2 1 2.0000000
# 15 3 1 1.5000000
# 16 4 1 1.3333333
# 17 5 1 1.2500000
# 18 6 1 1.2000000
This of course generated a duplicate name, but it's a start.
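One way to avoid the duplicate names, sketched under the assumption that every column is numeric: apply the function to all columns, rename the results with a suffix, then bind. The two-column data frame and the "_ratio" suffix are invented for the example.

```r
# Small stand-in for a wide data frame (column names are made up).
sample.df <- data.frame(a = c(4, 5, 6, 7), b = c(10, 20, 10, 5))

division <- function(x) c(NA, tail(x, n = -1) / head(x, n = -1))

# Apply to every column, then give the new columns distinct names.
ratios <- as.data.frame(lapply(sample.df, division))
names(ratios) <- paste0(names(sample.df), "_ratio")

result <- cbind(sample.df, ratios)
```

The same pattern should scale to 100+ columns, since lapply simply visits each column in turn.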
BTW: rolling generally has to do with a cumulative process on a vector. What you are looking for is applying one function to each vector individually and capturing the response.
Here are a few ways:
1) diff
transform(sample.df, desired = c(NA, exp(diff(log(sample)))))
## sample desired
## 1 4 NA
## 2 5 1.2500000
## 3 6 1.2000000
## 4 7 1.1666667
## 5 8 1.1428571
## ... etc ...
To apply this to multiple columns using the built in data frame anscombe:
rbind(NA, exp(diff(log(as.matrix(anscombe)))))
2) diff.zoo
library(zoo)
z <- zoo(sample.df$sample)
merge(z, desired = diff(z, arith = FALSE), all = TRUE)
## z desired
## 1 4 NA
## 2 5 1.2500000
## 3 6 1.2000000
## 4 7 1.1666667
## 5 8 1.1428571
## ... etc ...
To apply it to all columns of anscombe:
z <- zoo(rbind(NA, anscombe))
diff(z, arith = FALSE)
3) dplyr
library(dplyr)
sample.df %>% mutate(desired = sample/lag(sample))
## sample desired
## 1 4 NA
## 2 5 1.2500000
## 3 6 1.2000000
## 4 7 1.1666667
## 5 8 1.1428571
## ... etc ...
To apply this to all columns of anscombe:
anscombe %>% mutate(across(everything(), ~ .x / lag(.x))) # funs() and mutate_all() are superseded in current dplyr
4) rollapplyr
library(zoo)
transform(sample.df, desired = rollapplyr(sample, 2, function(x) x[2]/x[1], fill = NA))
## sample desired
## 1 4 NA
## 2 5 1.2500000
## 3 6 1.2000000
## 4 7 1.1666667
## 5 8 1.1428571
To apply it to all columns of anscombe:
rollapplyr(anscombe, 2, function(x) x[2]/x[1], fill = NA)
5) lag.ts
transform(sample.df, desired = c(NA, lag(ts(sample)) / ts(sample)))
## sample desired
## 1 4 NA
## 2 5 1.2500000
## 3 6 1.2000000
## 4 7 1.1666667
## 5 8 1.1428571
To apply it to all columns of anscombe use the following. Note that dplyr should NOT be loaded since it annoyingly clobbers lag with its own lag. Alternately use stats::lag:
lag(ts(anscombe)) / ts(anscombe)
Related
I'd like to divide a set of columns by another set of columns based on the common suffixes in their column names. To be more specific, in the following data frame I would like to divide each column with prefix1 by the corresponding column with prefix2 (i.e. "prefix1 column1" by "prefix2 column1", "prefix1 column2" by "prefix2 column2", etc.).
dt <- data.frame(replicate(6,sample(1:15,10,rep=TRUE)))
colnames(dt) <- c("prefix1 column1","prefix1 column2","prefix1 column3","prefix2 column1","prefix2 column2","prefix2 column3")
View(dt)
The desired output is a data frame with 3 additional columns holding the results of the divisions. I am badly stuck on this task and would appreciate any suggestions.
We can loop across the 'prefix1' columns with across(), replace the substring 'prefix1' in the current column name (cur_column()) with 'prefix2', get() that column's values and divide, and create new columns by setting .names:
library(dplyr)
library(stringr)
dt <- dt %>%
mutate(across(starts_with('prefix1'), ~ ./get(str_replace(cur_column(),
'prefix1', 'prefix2')), .names = '{.col}_new'))
Or use base R
dt[paste0(names(dt)[1:3], "_new")] <- dt[1:3]/dt[4:6]
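The positional version above relies on the prefix1 columns being 1:3 and the prefix2 columns being 4:6 in matching order. A name-based sketch in base R, assuming every "prefix1 ..." column has a "prefix2 ..." partner (the variable names num and den are mine):

```r
set.seed(1)
dt <- data.frame(replicate(6, sample(1:15, 10, replace = TRUE)))
colnames(dt) <- c("prefix1 column1", "prefix1 column2", "prefix1 column3",
                  "prefix2 column1", "prefix2 column2", "prefix2 column3")

# Numerator columns, and their partners found by swapping the prefix.
num <- grep("^prefix1 ", names(dt), value = TRUE)
den <- sub("^prefix1", "prefix2", num)
stopifnot(all(den %in% names(dt)))   # fail early if a partner is missing

# Divide matched columns and append the results under new names.
dt[paste0(num, "_new")] <- dt[num] / dt[den]
```

Because the pairing is done by name, reordering the columns should not change the result.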
Another solution, based on purrr::reduce:
library(tidyverse)
dt <- data.frame(replicate(6,sample(1:15,10,rep=TRUE)))
colnames(dt) <- c("prefix1 column1","prefix1 column2","prefix1 column3","prefix2 column1","prefix2 column2","prefix2 column3")
reduce(paste0("column",1:3), function(x,y)
{ z <- x[,paste("prefix1",y)] / x[,paste("prefix2",y)];
bind_cols(x, z %>% data.frame %>% setNames(. ,paste0("d",y))) }, .init=dt)
#> prefix1 column1 prefix1 column2 prefix1 column3 prefix2 column1
#> 1 2 12 4 13
#> 2 13 14 12 11
#> 3 6 3 4 6
#> 4 1 9 5 6
#> 5 15 1 2 8
#> 6 4 7 15 1
#> 7 5 9 1 10
#> 8 4 10 1 5
#> 9 6 8 3 12
#> 10 13 13 9 2
#> prefix2 column2 prefix2 column3 dcolumn1 dcolumn2 dcolumn3
#> 1 15 15 0.1538462 0.80000000 0.2666667
#> 2 1 11 1.1818182 14.00000000 1.0909091
#> 3 5 1 1.0000000 0.60000000 4.0000000
#> 4 13 13 0.1666667 0.69230769 0.3846154
#> 5 15 4 1.8750000 0.06666667 0.5000000
#> 6 9 4 4.0000000 0.77777778 3.7500000
#> 7 7 8 0.5000000 1.28571429 0.1250000
#> 8 14 5 0.8000000 0.71428571 0.2000000
#> 9 13 1 0.5000000 0.61538462 3.0000000
#> 10 14 14 6.5000000 0.92857143 0.6428571
I've got a model with 9 covariates, and below is an example of one of the tables used to count the "yes" (1) and "no" (0) responses in a dataset:
table(wbca1$y,wbca1$Adhes)
And the output appears as follows
How can I code this so that I get the sample proportions for each covariate so I have a new table with 10 columns each representing "yes"(1)?
Thank you in advance
Something like this:
set.seed(111)
x = sample(1:9,100,replace=TRUE)
y = sample(0:1,100,replace=TRUE)
prop.table(table(y,x),margin=2)
x
y 1 2 3 4 5 6 7
0 0.4444444 0.2857143 0.6923077 0.4666667 0.5000000 0.4615385 0.6666667
1 0.5555556 0.7142857 0.3076923 0.5333333 0.5000000 0.5384615 0.3333333
x
y 8 9
0 0.3636364 0.4615385
1 0.6363636 0.5384615
Or you can simply do:
tab = table(y,x)
tab[2,]/colSums(tab)
1 2 3 4 5 6 7 8
0.5555556 0.7142857 0.3076923 0.5333333 0.5000000 0.5384615 0.3333333 0.6363636
9
0.5384615
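To get the "yes" proportions for several covariates at once, the same prop.table call can be wrapped in lapply. This is only a sketch: the data frame dat and the second covariate Thick are invented stand-ins for the real data, and the response is assumed to be coded 0/1.

```r
set.seed(111)
# Hypothetical stand-in for the real data: a 0/1 response and two covariates.
dat <- data.frame(
  y     = sample(0:1, 100, replace = TRUE),
  Adhes = sample(1:9, 100, replace = TRUE),
  Thick = sample(1:9, 100, replace = TRUE)   # invented second covariate
)

covariates <- setdiff(names(dat), "y")

# For each covariate: column-wise proportions, keeping only the y == 1 row.
prop_yes <- lapply(covariates, function(v) {
  tab <- table(dat$y, dat[[v]])
  prop.table(tab, margin = 2)["1", ]
})
names(prop_yes) <- covariates
```

prop_yes is then a named list with one vector of proportions per covariate, one entry per observed level.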
Using tidyverse
library(dplyr)
tibble(x, y) %>%
count(x, y) %>%
group_by(x) %>% # proportions within each level of x, matching margin = 2 above
mutate(prop = n/sum(n))
I have a list of dataframes and I am trying to normalize the data in a few of the columns by dividing each row in a column by the sum. The issue is all of the sums are different.
I started by summing each column in each data frame using lapply, which gave me a list of colSums. However, I'm not sure how to divide each data frame by the corresponding sums.
Here is an example using some code I made up
df1=data.frame("total"=c(50,100,75),
"a"=c(15,50,30),
"b"=c(15,10,5),
"c"=c(20,40,40))
df2=data.frame("total"=c(100,200,400,100),
"a"=c(10,40,100,50),
"b"=c(50,100,200,30),
"c"=c(40,60,100,20))
df3=data.frame("total"=c(40,60,80),
"a"=c(15,30,50),
"b"=c(25,20,20),
"c"=c(0,10,10))
listex=list(df1=df1,df2=df2,df3=df3)
listtotal=lapply(listex, colSums, na.rm=TRUE)
listex
$df1
total a b c
1 50 15 15 20
2 100 50 10 40
3 75 30 5 40
$df2
total a b c
1 100 10 50 40
2 200 40 100 60
3 400 100 200 100
4 100 50 30 20
$df3
total a b c
1 40 15 25 0
2 60 30 20 10
3 80 50 20 10
listtotal
$df1
total a b c
225 95 30 100
$df2
total a b c
800 200 380 220
$df3
total a b c
180 95 65 20
What I want to have happen is the following, but, you know, without having to write it all over again
df1n=data.frame("total"=c(50/225,100/225,75/225),"a"=c(15/95,50/95,30/95),
"b"=c(15/30,10/30,5/30),"c"=c(20/100,40/100,40/100))
df2n=data.frame("total"=c(100/800,200/800,400/800,100/800),
"a"=c(10/200,40/200,100/200,50/200),
"b"=c(50/380,100/380,200/380,30/380),
"c"=c(40/220,60/220,100/220,20/220))
df3n=data.frame('total'=c(40/180,60/180,80/180),
'a'=c(15/95,30/95,50/95),
'b'=c(25/65,20/65,20/65),
'c'=c(0/20,10/20,10/20))
listn=list(df1=df1n,df2=df2n,df3=df3n)
listn
$df1
total a b c
1 0.2222222 0.1578947 0.5000000 0.2
2 0.4444444 0.5263158 0.3333333 0.4
3 0.3333333 0.3157895 0.1666667 0.4
$df2
total a b c
1 0.125 0.05 0.13157895 0.18181818
2 0.250 0.20 0.26315789 0.27272727
3 0.500 0.50 0.52631579 0.45454545
4 0.125 0.25 0.07894737 0.09090909
$df3
total a b c
1 0.2222222 0.1578947 0.3846154 0.0
2 0.3333333 0.3157895 0.3076923 0.5
3 0.4444444 0.5263158 0.3076923 0.5
I'm thinking it has something to do with a nested apply function, but I'm not sure how exactly I would go about doing that. Any help is much appreciated!
We can do arithmetic on equal-length vectors, matrices, or data frames. One option is to divide each element of the list by its colSums, replicated so that the dimensions match:
lapply(listex, function(x) x/colSums(x)[col(x)])
Also, if the 'listtotal' is another object, use Map to divide the corresponding elements of one object with the other
Map(function(x, y) x/y[col(x)], listex, listtotal)
You could use the sweep function
lapply(listex, function(x) sweep(x, 2, colSums(x), '/'))
Or convert the colsums to a list to use / directly
lapply(listex, function(x) x/as.list(colSums(x)))
Output for both methods:
# $`df1`
# total a b c
# 1 0.2222222 0.1578947 0.5000000 0.2
# 2 0.4444444 0.5263158 0.3333333 0.4
# 3 0.3333333 0.3157895 0.1666667 0.4
#
# $df2
# total a b c
# 1 0.125 0.05 0.13157895 0.18181818
# 2 0.250 0.20 0.26315789 0.27272727
# 3 0.500 0.50 0.52631579 0.45454545
# 4 0.125 0.25 0.07894737 0.09090909
#
# $df3
# total a b c
# 1 0.2222222 0.1578947 0.3846154 0.0
# 2 0.3333333 0.3157895 0.3076923 0.5
# 3 0.4444444 0.5263158 0.3076923 0.5
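Since the question mentions normalizing only a few of the columns, here is a hedged variant of the sweep approach restricted to a subset. The helper name normalise_cols and the choice of columns are assumptions for illustration, and the data frames are cut-down versions of the ones in the question.

```r
df1 <- data.frame(total = c(50, 100, 75), a = c(15, 50, 30), b = c(15, 10, 5))
df2 <- data.frame(total = c(100, 200, 400, 100), a = c(10, 40, 100, 50),
                  b = c(50, 100, 200, 30))
listex <- list(df1 = df1, df2 = df2)

cols <- c("a", "b")   # hypothetical subset of columns to normalise

# Divide only the chosen columns by their own column sums; leave the rest.
normalise_cols <- function(x, cols) {
  x[cols] <- sweep(x[cols], 2, colSums(x[cols], na.rm = TRUE), "/")
  x
}

listn <- lapply(listex, normalise_cols, cols = cols)
```

A quick sanity check is that each normalised column sums to 1 while the untouched columns keep their original values.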
My data looks like this:
FlightID FareClass FareClassRank FareValue Bookings
1 YULCDG215135 Q 1 100 5
2 YULCDG215135 X 2 150 7
3 YULCDG215135 V 3 200 4
4 YULCDG215135 Y 4 1000 2
5 YULCDG215136 Q 1 120 1
6 YULCDG215136 X 2 200 4
7 YULCDG215136 V 3 270 5
8 YULCDG215136 Y 4 900 15
Question: I need to write simple code in R:
For each flight, the ratio of the value of the fare class to the value of its next fare class.
For example, X is the next fare class of Q, V is the next fare class of X, and so on.
The ratio for flight YULCDG215135 is 100/150 = 0.6667 for X, 150/200=0.75 for V and so on.
Here is a base R solution using by.
by(df, df$FlightID, function(x) c(NA, exp(-diff(log(x$FareValue)))))
#df$FlightID: YULCDG215135
#[1] NA 0.6666667 0.7500000 0.2000000
#------------------------------------------------------------
#df$FlightID: YULCDG215136
#[1] NA 0.6000000 0.7407407 0.3000000
Or alternatively using ave
transform(df, ratio = ave(FareValue, FlightID, FUN = function(x) c(NA, exp(-diff(log(x))))))
# FlightID FareClass FareClassRank FareValue Bookings ratio
#1 YULCDG215135 Q 1 100 5 NA
#2 YULCDG215135 X 2 150 7 0.6666667
#3 YULCDG215135 V 3 200 4 0.7500000
#4 YULCDG215135 Y 4 1000 2 0.2000000
#5 YULCDG215136 Q 1 120 1 NA
#6 YULCDG215136 X 2 200 4 0.6000000
#7 YULCDG215136 V 3 270 5 0.7407407
#8 YULCDG215136 Y 4 900 15 0.3000000
The trick in both cases is to log-transform FareValue so that we can use diff, and then to invert the transformation (using exp) to give the ratio.
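The identity behind this is log(a) - log(b) = log(a / b), so negating the differences of the logs and exponentiating gives each value divided by the next. A small check using the fare values of the first flight, assuming all values are strictly positive (log of zero or a negative number would not give a usable result):

```r
x <- c(100, 150, 200, 1000)

# diff() gives log(x[i + 1]) - log(x[i]); negating and exponentiating
# turns that into x[i] / x[i + 1].
via_logs <- exp(-diff(log(x)))

# The same ratios computed directly, for comparison.
direct <- head(x, -1) / tail(x, -1)
```

Both vectors hold the three ratios 100/150, 150/200 and 200/1000; the answers above prepend NA to line them up with the rows.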
require(dplyr)
df %>%
group_by(FlightID) %>%
arrange(FareClassRank) %>%
mutate(ratio=FareValue/lead(FareValue))
Try this:
df$FareOverNext <- unlist(lapply(split(df$FareValue, df$FlightID), function(x) {
# ratio of each fare to the next one within a flight; the first slot is padded with 1
c(1, x[1:(length(x) - 1)] / x[2:length(x)])
}))
EDIT:
Added lapply and split thanks to @thelatemail's comment.
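One caveat with unlist(lapply(split(...))) is that the pieces come back grouped by flight, so the assignment only lines up if the rows are already sorted by FlightID. A sketch using unsplit, which puts each piece back in its original row position; the toy data and the helper name ratio_to_next are made up, and NA is used for the first slot instead of 1:

```r
df <- data.frame(
  FlightID  = c("B", "A", "B", "A"),   # deliberately not sorted by flight
  FareValue = c(120, 100, 200, 150)
)

# Hypothetical helper: each fare divided by the next, shifted down one row.
ratio_to_next <- function(x) c(NA, head(x, -1) / tail(x, -1))

pieces <- lapply(split(df$FareValue, df$FlightID), ratio_to_next)

# unsplit() reverses split(), restoring the original row order.
df$ratio <- unsplit(pieces, df$FlightID)
```

Here the first row of each flight gets NA and the second row gets the previous fare divided by the current one, regardless of how the flights are interleaved.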
I have data (prop_attack), a portion of which looks like this:
attack.type proportion class
4 0.8400000 high
5 0.9733333 high
6 0.9385151 high
7 0.9228659 high
8 0.6187500 high
9 0.9219331 high
1 0.8364853 mid
2 0.9896870 mid
3 0.9529760 mid
4 0.6666667 mid
5 0.9965636 mid
6 0.9687825 mid
attack.type is actually categorical, they are just being assigned numbers 1-9. I want to create a table that rearranges the data such that
weap.type high mid
1 corresponding proportion value corresponding proportion value
2 corresponding proportion value corresponding proportion value
3 corresponding proportion value corresponding proportion value
4 corresponding proportion value corresponding proportion value
5 corresponding proportion value corresponding proportion value
6 etc.
7
8
9
Any suggestions on how to do this?
This is a straightforward "reshape" problem. In base R, you could do:
reshape(prop_attack, direction = "wide", idvar="attack.type", timevar="class")
# attack.type proportion.high proportion.mid
# 1 4 0.8400000 0.6666667
# 2 5 0.9733333 0.9965636
# 3 6 0.9385151 0.9687825
# 4 7 0.9228659 NA
# 5 8 0.6187500 NA
# 6 9 0.9219331 NA
# 7 1 NA 0.8364853
# 8 2 NA 0.9896870
# 9 3 NA 0.9529760
Or even use xtabs:
xtabs(proportion ~ attack.type + class, prop_attack)
# class
# attack.type high mid
# 1 0.0000000 0.8364853
# 2 0.0000000 0.9896870
# 3 0.0000000 0.9529760
# 4 0.8400000 0.6666667
# 5 0.9733333 0.9965636
# 6 0.9385151 0.9687825
# 7 0.9228659 0.0000000
# 8 0.6187500 0.0000000
# 9 0.9219331 0.0000000
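Note that xtabs reports combinations that do not occur as 0 rather than NA, as the output above shows. If missing cells should stay NA, a base-R sketch with tapply is one option. It assumes at most one row per attack.type/class pair, so that sum simply returns that row's value; the data frame is rebuilt from the sample shown in the question.

```r
prop_attack <- data.frame(
  attack.type = c(4:9, 1:6),
  proportion  = c(0.8400000, 0.9733333, 0.9385151, 0.9228659, 0.6187500,
                  0.9219331, 0.8364853, 0.9896870, 0.9529760, 0.6666667,
                  0.9965636, 0.9687825),
  class       = rep(c("high", "mid"), each = 6)
)

# One cell per (attack.type, class); combinations with no rows stay NA.
wide <- with(prop_attack,
             tapply(proportion, list(attack.type, class), sum))
```

The result is a matrix with attack types as row names and the classes as column names.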
Using a package, many would suggest dcast from "reshape2" for its convenient syntax:
library(reshape2)
dcast(prop_attack, attack.type ~ class, value.var="proportion")
# attack.type high mid
# 1 1 NA 0.8364853
# 2 2 NA 0.9896870
# 3 3 NA 0.9529760
# 4 4 0.8400000 0.6666667
# 5 5 0.9733333 0.9965636
# 6 6 0.9385151 0.9687825
# 7 7 0.9228659 NA
# 8 8 0.6187500 NA
# 9 9 0.9219331 NA