Using lag to create an x+1 column in R

I'm trying to implement a lag function, but it seems I need an existing column for it to work.
Let's say I have this data frame:
df <- data.frame(AgeGroup=c("0-4", "5-39", "40-59","60-69","70+"),
px=c(.99, .97, .95, .96,.94))
I want a column Ix that is lag(Ix) * lag(px), starting from 1000.
The data I want is:
df2 <- data.frame(AgeGroup=c("0-4", "5-39", "40-59","60-69","70+"),
px=c(.99, .97, .95, .96, .94),
Ix=c(1000, 990, 960.3, 912.285, 875.7936))
I've tried:
library(dplyr)
df2 <- mutate(df, Ix = lag(Ix, default = 1000) * lag(px))
An ifelse statement after creating a reference value doesn't work either:
df$Ix2 <- NA
df[1, 3] <- 1000
df$Ix <- ifelse(df[, 3] == 1000, 1000,
                lag(df$Ix, default = 1000) * lag(px, default = 1))
I have also been playing around with creating a separate Ix column with Ix = 1000 and then running the above, but it doesn't seem to work. Does anyone have any ideas how I can create the x+1 column?

You could use cumprod() combined with dplyr::lag() for this:
> df$Ix <- 1000*dplyr::lag(cumprod(df$px), default = 1)
> df
AgeGroup px Ix
1 0-4 0.99 1000.0000
2 5-39 0.97 990.0000
3 40-59 0.95 960.3000
4 60-69 0.96 912.2850
5 70+ 0.94 875.7936
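For reference, the same cumulative-product idea also works in base R without dplyr; a minimal sketch, not part of the original answer:
# Shift the cumulative product of px down one row, seeding the first row with 1,
# so Ix starts at 1000 and each later row is the previous Ix times the previous px.
df$Ix <- 1000 * c(1, head(cumprod(df$px), -1))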

You can also use accumulate from purrr. Using head(px, -1) includes all values in px except the last one, and the initial Ix is set to 1000.
library(tidyverse)
df %>%
mutate(Ix = accumulate(head(px, -1), prod, .init = 1000))
Output
AgeGroup px Ix
1 0-4 0.99 1000.0000
2 5-39 0.97 990.0000
3 40-59 0.95 960.3000
4 60-69 0.96 912.2850
5 70+ 0.94 875.7936

Related

Creating a ranking column based on values in two other columns using mutate and min_rank

I'm attempting to revisit some older code in which I used a for loop to calculate a combined ranking of genes based on two columns. My end goal is to get out a column that lists the proportion of genes that any given gene in the dataset performs better than.
I have a data.frame that I'm calling scores which contains two columns of relevant scores for my genes. To calculate the combined ranking I use the following for loop and I calculate the proportional score by dividing the resulting rank by the total number of observations.
scores <- data.frame(x = c(0.128, 0.279, 0.501, 0.755, 0.613), y = c(1.49, 1.43, 0.744, 0.647, 0.380))
#Calculate ranking
comb.score = matrix(0, nrow = nrow(scores), ncol = 1)
for(i in 1:nrow(scores)){
comb.score[i] = length(which(scores[ , 1] < scores[i, 1] & scores[ , 2] < scores[i, 2]))
}
comb.score <- comb.score/length(comb.score) #Calculate proportion
Now that I've become more familiar and comfortable with the tidyverse I want to convert this code to use tidyverse functions but I haven't been able to figure it out on my own, nor with SO or RStudio community answers.
The idea I had in mind was to use mutate() along with min_rank() but I'm not entirely sure of the syntax. Additionally the behavior of min_rank() appears to assess rank using a logical test like scores[ , 1] <= scores[i, 1] as opposed to just using < like I did in my original test.
My expected outcome is an additional column in the scores table with the same values as the comb.score output in the above code: a score that tells me the proportion of genes in the whole dataset that the gene on a given row performs better than.
Any help would be much appreciated! If I need to clarify anything or add more information please let me know!
Interesting question. I propose this approach:
scores %>%
rowwise() %>%
mutate(comb_score = sum(x > .$x & y > .$y)) %>%
ungroup() %>%
mutate(comb_score = comb_score/n())
which gives
# A tibble: 5 x 3
x y comb_score
<dbl> <dbl> <dbl>
1 0.128 1.49 0
2 0.279 1.43 0
3 0.501 0.744 0
4 0.755 0.647 0.2
5 0.613 0.38 0
A bit similar to Martin's answer, but using pmap instead.
library(tidyverse)
scores <- data.frame(
x = c(0.128, 0.279, 0.501, 0.755, 0.613),
y = c(1.49, 1.43, 0.744, 0.647, 0.380)
)
scores %>%
  mutate(
    score = pmap_dbl(list(x, y), ~ sum(..1 > x & ..2 > y)) / n()
  )
#> x y score
#> 1 0.128 1.490 0
#> 2 0.279 1.430 0
#> 3 0.501 0.744 0
#> 4 0.755 0.647 0.2
#> 5 0.613 0.380 0
Created on 2020-06-18 by the reprex package (v0.3.0)
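For larger tables, a fully vectorised base R alternative avoids the row-wise work entirely; this is a sketch of the same dominance count, not taken from either answer:
# outer() builds the pairwise "strictly smaller" comparisons for x and y;
# colSums() then counts, for each row, how many other rows it beats on both.
scores$comb_score <- colSums(outer(scores$x, scores$x, "<") &
                             outer(scores$y, scores$y, "<")) / nrow(scores)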

R: Aggregating over several variables and observations (depending on values) and creating a new variable

The data set has the following structure
Key Date Mat Amount
<int> <date> <chr> <dbl>
1 1001056 2014-12-12 10025 0.10
2 1001056 2014-12-23 10025 0.20
3 1001056 2015-01-08 10025 0.10
4 1001056 2015-04-07 10025 0.20
5 1001056 2015-05-08 10025 0.20
6 1001076 2013-10-29 10026 3.00
7 1001140 2013-01-18 10026 0.72
8 1001140 2013-04-11 10026 2.40
9 1001140 2014-10-08 10026 0.24
10 1001237 2015-02-17 10025 2.40
11 1001237 2015-02-17 10026 3.40
Mat takes values in {10001,...,11000}, hence A:=|Mat|=1000.
I would like to accomplish the following goals:
1) (Intermediate step) For each Key-Date combination I would like to calculate, for all materials available at that combination (which may vary from key to key), the pairwise differences in Amount,
e.g. for combination "1001237 2015-02-17" this would be 2.40 - 3.40 = -1 for materials 10025 and 10026 (but there may be more combinations). (How can those values be stored efficiently?)
This step might be skipped.
2) Finally, I would like to construct a new matrix of dimension A = 1000 where each entry (i, j) (material combination i and j) contains the average of the values calculated in the step before.
More formally, entry (i, j) is given by
\frac{1}{|C_{ij}|} \sum_{c \in C_{ij}} (Amount_i - Amount_j),
where C_{ij} denotes the set of all Key-Date combinations containing both Mat i and Mat j.
As the table is quite large, efficiency of the computation is very important.
Thank you very much in advance for your help!
I can do it with list columns in tidyverse; the trick is to use group_by to get distinct combinations of Key and Date. Here's the code:
materials <- unique(x$Mat)
n <- length(materials)
x <- x %>%
  group_by(Key, Date) %>%
  nest() %>%
  # Create an n by n matrix for each combination of Key and Date
  mutate(matrices = lapply(data, function(y) {
    out <- matrix(nrow = n, ncol = n,
                  dimnames = list(materials, materials))
    # Only fill in when the pair of materials is present
    # for the date of interest
    mat_present <- as.character(unique(y$Mat))
    for (i in mat_present) {
      for (j in mat_present) {
        # You may want to take an absolute value
        out[i, j] <- y$Amount[y$Mat == i] - y$Amount[y$Mat == j]
      }
    }
    out
  }))
If you really want speed, you can implement the function in lapply with Rcpp. You can use RcppParallel to further speed it up. Now one of the columns of the data frame is a list of matrices. Then, for each element of the matrices, take an average while ignoring NAs:
x_arr <- array(unlist(x$matrices), dim = c(2,2,10))
results <- apply(x_arr, 2, rowMeans, na.rm = TRUE)
I stacked the list of matrices into a 3D array and found row means slice by slice. For performance, you can also do it in RcppArmadillo, with sum(x_arr, 2), but it's hard to deal with missing values when not all types of materials are represented in a combination of Key and Date.
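The array dimensions above are hard-coded for the example (2 materials, 10 Key-Date combinations). A sketch of the general form, reusing the n and materials objects defined earlier:
# Stack the per-combination matrices into an n x n x (number of combinations) array,
# then average each (i, j) cell across combinations, ignoring the NAs left where a
# pair of materials never co-occurs.
x_arr <- array(unlist(x$matrices), dim = c(n, n, nrow(x)),
               dimnames = list(materials, materials, NULL))
results <- apply(x_arr, c(1, 2), mean, na.rm = TRUE)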

How to remove one-off increases in value

I'm working with time series data on a variable that generally increases slowly over time. Very simplified example:
df <- data.frame(index=1:8, value = c(rep(0.25, 3),1.95,0.25,rep(0.5,3)))
index value
1 0.25
2 0.25
3 0.25
4 1.95
5 0.25
6 0.50
7 0.50
8 0.50
A recurring feature of the dataset is what happens at index 4: the value spikes upward then immediately comes back down again. I want to remove those values. (There are also points in my dataset where the value makes a small increase followed by a small decrease some time later, but I want to keep those.)
I have found a way of removing the values, by using diff to calculate the change from the previous value, then turning the data frame upside down, using diff again to calculate the change from the next value and removing rows where the two diffs are the same, but that seems like the least efficient process ever:
library(dplyr)
df %>%
  mutate(diffprev = diff(value) %>% c(0, .)) %>%
  arrange(desc(index)) %>%
  mutate(diffnext = diff(value) %>% c(0, .)) %>%
  filter(diffprev == 0 | diffprev != diffnext)
I realise that if the spike in value happened at index 5 rather than 4 this wouldn't work but in the full dataset this is so unlikely that unless there's a simple fix I'm not going to worry about it. But what would be a better way of going about this?
You could try:
df %>% filter(lag(value) != lead(value) | (value - lag(value)) %in% c(0, NA))
You might also be interested in the lag and lead functions from dplyr.
Edit: thanks @Frank for a couple of modifications.
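To see why this filter isolates the one-off spike, it can help to look at the lag and lead values next to each row; this is purely illustrative and not part of the original answer:
# A spike is a row whose neighbours agree with each other but not with the row itself.
df %>%
  mutate(prev  = lag(value),
         nxt   = lead(value),
         spike = prev == nxt & value != prev)   # TRUE only at index 4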
You don't need to rearrange. The first diff column you make contains all the info you need:
df %>%
mutate(diffprev = diff(value) %>% c(0, .)) %>%
filter(diffprev == 0 | diffprev != -lead(diffprev) ) %>%
select(-diffprev)
which gives
index value
1 1 0.25
2 2 0.25
3 3 0.25
4 5 0.25
5 6 0.50
6 7 0.50
7 8 0.50
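The same idea in base R, as a sketch: keep a row unless its two neighbours agree with each other while differing from the row itself.
# prev/nxt are the previous and next values; a one-off spike is flagged and dropped.
prev  <- c(NA, head(df$value, -1))
nxt   <- c(df$value[-1], NA)
spike <- !is.na(prev) & !is.na(nxt) & prev == nxt & df$value != prev
df[!spike, ]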

Subtract data frames with different numbers of rows?

I have two data frames with two columns each. The first column is timestamps and the second contains some values.
One of the data frames is much bigger than the other one, but both of them contain data in the same timestamp range.
If I plot these two on top of each other, I will get a nice plot showing how they differ in time.
Now I would like to get the absolute difference by time of these two data frames to make another plot showing how much they differ (or to create a boxplot with this information), even though they do not have the same length and exactly matching timestamps.
Check this example:
df1:
timestamp | data
1334103075| 1.2
1334103085| 1.5
1334103095| 0.9
1334103105| 0.7
1334103115| 1.1
1334103125| 0.8
df2:
timestamp | data
1334103078| 1.2
1334103099| 1.5
1334103123| 0.8
1334103125| 0.9
How would I achieve something like this:
df3 <- abs(df1-df2)
As you see df2 might not have the same timestamps as df1, but they both have timestamps in the same time range.
Of course, the subtraction should try to match timestamps, or otherwise use values from nearby timestamps (e.g. averaged or interpolated).
I would suggest using two linear interpolators and evaluate both of them on the union of your two sets of timestamps:
df1 <- data.frame(timestamp = c(1334103075, 1334103085, 1334103095,
1334103105, 1334103115, 1334103125),
data = c(1.2, 1.5, 0.9, 0.7, 1.1, 0.8))
df2 <- data.frame(timestamp = c(1334103078, 1334103099, 1334103123,
1334103125),
data = c(1.2, 1.5, 0.8, 0.9))
library(Hmisc)
all.timestamps <- sort(unique(c(df1$timestamp, df2$timestamp)))
data1 <- approxExtrap(df1$timestamp, df1$data, all.timestamps)$y
data2 <- approxExtrap(df2$timestamp, df2$data, all.timestamps)$y
df3 <- data.frame(timestamp = all.timestamps,
data1 = data1,
data2 = data2,
abs.diff = abs(data1 - data2))
df3
# timestamp data1 data2 abs.diff
# 1 1334103075 1.20 1.157143 0.04285714
# 2 1334103078 1.29 1.200000 0.09000000
# 3 1334103085 1.50 1.300000 0.20000000
# 4 1334103095 0.90 1.442857 0.54285714
# 5 1334103099 0.82 1.500000 0.68000000
# 6 1334103105 0.70 1.325000 0.62500000
# 7 1334103115 1.10 1.033333 0.06666667
# 8 1334103123 0.86 0.800000 0.06000000
# 9 1334103125 0.80 0.900000 0.10000000
Then you could consider fitting splines if you are not quite happy with linear approximations.
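If you do not need to extrapolate beyond the ends of the shorter series, base R's approx() can stand in for Hmisc::approxExtrap(); a sketch under that assumption (rule = 2 simply repeats the boundary values outside the observed range instead of extrapolating):
# Linear interpolation onto the union of timestamps with stats::approx()
data1 <- approx(df1$timestamp, df1$data, xout = all.timestamps, rule = 2)$y
data2 <- approx(df2$timestamp, df2$data, xout = all.timestamps, rule = 2)$y
abs.diff <- abs(data1 - data2)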

Plotting only a subset of the points?

I am trying to plot the CDF curve for a large dataset containing about 29 million values using ggplot. I am computing it like this:
mycounts = ddply(idata.frame(newdata), .(Type), transform, ecd = ecdf(Value)(Value))
plot = ggplot(mycounts, aes(x=Value, y=ecd))
This is taking ages to plot. I was wondering if there is a clean way to plot only a sample of this dataset (say, every 10th point or 50th point) without compromising on the actual result?
I am not sure about your data structure, but a simple sample call might be enough:
n <- nrow(mycounts) # number of cases in data frame
mycounts <- mycounts[sample(n, round(n/10)), ] # get an n/10 sample to the same data frame
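If you specifically want every 10th or 50th point rather than a random sample, a regular stride works too; a sketch, assuming mycounts is already sorted the way you want to plot it:
# Keep every 10th row; increase 'by' for a sparser plot
mycounts_thinned <- mycounts[seq(1, nrow(mycounts), by = 10), ]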
Instead of taking every n-th point, can you quantize your data set down to a sufficient resolution before plotting it? That way, you won't have to plot resolution you don't need (or can't see).
Here's one way you can do it. (The function I've written below is generic, but the example uses names from your question.)
library(ggplot2)
library(plyr)
## A data set containing two ramps up to 100, one by 1, one by 10
tens <- data.frame(Type = factor(c(rep(10, 10), rep(1, 100))),
Value = c(1:10 * 10, 1:100))
## Given a data frame and ddply-style arguments, partition the frame
## using ddply and summarize the values in each partition with a
## quantized ecdf. The resulting data frame for each partition has
## two columns: value and value_ecdf.
dd_ecdf <- function(df, ..., .quantizer = identity, .value = value) {
  value_colname <- deparse(substitute(.value))
  ddply(df, ..., .fun = function(rdf) {
    xs <- rdf[[value_colname]]
    qxs <- sort(unique(.quantizer(xs)))
    data.frame(value = qxs, value_ecdf = ecdf(xs)(qxs))
  })
}
## Plot each type's ECDF (w/o quantization)
tens_cdf <- dd_ecdf(tens, .(Type), .value = Value)
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdf)
## Plot each type's ECDF (quantizing to nearest 25)
rounder <- function(...) function(x) round_any(x, ...)
tens_cdfq <- dd_ecdf(tens, .(Type), .value = Value, .quantizer = rounder(25))
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdfq)
While the original data set and the ecdf set had 110 rows, the quantized-ecdf set is much reduced:
> dim(tens)
[1] 110 2
> dim(tens_cdf)
[1] 110 3
> dim(tens_cdfq)
[1] 10 3
> tens_cdfq
Type value value_ecdf
1 1 0 0.00
2 1 25 0.25
3 1 50 0.50
4 1 75 0.75
5 1 100 1.00
6 10 0 0.00
7 10 25 0.20
8 10 50 0.50
9 10 75 0.70
10 10 100 1.00
I hope this helps! :-)
