Compute increase between rows for each same ID - r

I have a sorted data frame and I would like to compute the percentage increase of x2 between consecutive rows that share the same ID.
The input is already sorted in a certain manner:
ID x2 x3 x4
1 10 11 2
2 100 12 4
1 20 13 10
7 24 3 1
1 30 14 0
3 6 15 1
2 90 15 1
I would like to get:
ID x2 increase x3 x4
1 10 11 2
2 100 12 4
1 20 +100% 13 10
7 24 3 1
1 30 +50% 14 0
3 6 15 1
2 90 -10% 15 1

You could do
df <- read.table(header=T, text="
ID x2 x3 x4
1 10 11 2
2 100 12 4
1 20 13 10
7 24 3 1
1 30 14 0
3 6 15 1
2 90 15 1")
df$increase <- ave(df$x2, df$ID, FUN = function(x) c(NA, diff(x)/head(x, -1))*100)
df$increase <- ifelse(is.na(df$increase), "", sprintf("%+.0f%%", df$increase))
df
# ID x2 x3 x4 increase
# 1 1 10 11 2
# 2 2 100 12 4
# 3 1 20 13 10 +100%
# 4 7 24 3 1
# 5 1 30 14 0 +50%
# 6 3 6 15 1
# 7 2 90 15 1 -10%
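To see what the function passed to ave does, it can help to run it by hand on a single group. The sketch below uses the x2 values of ID 1 from the question; the names x, d, prev, pct and labels are only illustrative and do not appear in the answer.

```r
# x2 values for ID 1, in their original row order
x <- c(10, 20, 30)

# Row-to-row differences and the previous value each one is measured against
d    <- diff(x)        # 10 10
prev <- head(x, -1)    # 10 20

# Percentage change; the first row of a group has no predecessor, hence NA
pct <- c(NA, d / prev) * 100
pct                    # NA 100 50

# Formatting step: empty string for NA, signed percentage otherwise
labels <- ifelse(is.na(pct), "", sprintf("%+.0f%%", pct))
labels                 # "" "+100%" "+50%"
```

ave applies the same computation to every ID separately and puts the results back in the original row order, which is why the input does not need to be grouped first.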

Related

Add rows to dataframe in R based on values in column

I have a dataframe with 2 columns: time and day. There are 3 days, and for each day time runs from 1 to 12. I want to add new rows for each day with times: -2, 1 and 0. How do I do this?
I have tried using add_row and specifying the row number to add to, but the row numbers shift each time a new row is added, which makes the process tedious. Thanks in advance.
picture of the dataframe
We could use add_row, then slice the desired sequence, and bind everything into one dataframe:
library(tibble)
library(dplyr)
df1 <- df %>%
add_row(time = -2:0, Day = c(1,1,1), .before = 1) %>%
slice(1:15)
df2 <- bind_rows(df1, df1, df1) %>%
mutate(Day = rep(row_number(), each=15, length.out = n()))
Output:
# A tibble: 45 x 2
time Day
<dbl> <int>
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
Here's a fast way to create the desired dataframe from scratch using expand.grid(), rather than adding individual rows:
df <- expand.grid(-2:12,1:3)
colnames(df) <- c("time","day")
Results:
df
time day
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
You can use tidyr::crossing
library(dplyr)
library(tidyr)
add_values <- c(-2, 1, 0)
crossing(time = add_values, Day = unique(day$Day)) %>%
bind_rows(day) %>%
arrange(Day, time)
# A tibble: 45 x 2
# time Day
# <dbl> <int>
# 1 -2 1
# 2 0 1
# 3 1 1
# 4 1 1
# 5 2 1
# 6 3 1
# 7 4 1
# 8 5 1
# 9 6 1
#10 7 1
# … with 35 more rows
If you meant -2, -1 and 0 you can also use complete.
tidyr::complete(day, Day, time = -2:0)
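For comparison, the same result can be sketched in base R without extra packages. The data frame day below is an assumed reconstruction of the question's data (the original was only shown as a picture), and the names extra and out are illustrative. This version adds -2, -1 and 0 for every day.

```r
# Assumed reconstruction of the question's data: 3 days, time 1..12 each
day <- data.frame(time = rep(1:12, times = 3), Day = rep(1:3, each = 12))

# One block of extra rows (-2, -1, 0) for every day
extra <- expand.grid(time = -2:0, Day = unique(day$Day))

# Append the new rows and restore the Day/time ordering
out <- rbind(day, extra)
out <- out[order(out$Day, out$time), ]
rownames(out) <- NULL

head(out, 4)
```

Because all new rows are built in one step and sorted afterwards, there is no need to track row positions as with repeated add_row calls.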

Lagged function within group

I would like to write code that computes, within each group, the sum of lagged differences as shown in the table below:
ID x rank U R Required Output Value
1 1 1 U1 R1 -
1 1 2 U2 R2 R2-U1
1 1 3 U3 R3 (R3-U2) + (R3-U1)
1 1 4 U4 R4 (R4-U3) + (R4-U2) + (R4-U1)
1 0 5 U5 R5 R5
1 0 6 U6 R6 R6
1 0 7 U7 R7 R7
2 1 1 U8 R8 -
2 1 2 U9 R9 R9-U8
2 1 3 U10 R10 (R10-U9) + (R10 - U8)
2 1 4 U11 R11 (R11-U10) + (R11 - U9) + (R11 - U8)
3 1 1 U12 R12 -
3 0 2 U13 R13 R13
3 0 3 U14 R14 R14
ID is the unique group identifier. x is a boolean flag: depending on its value, the required output is either the sum of differences with the previous values or the same-period value of R. "rank" is a rank-ordering column, and the maximum rank can vary between groups. "U" and "R" are the main columns of interest.
To give a numerical example, I need the following:
ID x rank U R Required Output Value
1 1 1 10 7 -
1 1 2 9 11 1
1 1 3 10 10 1 + 0 = 1
1 1 4 11 13 3+4+3 = 10
1 0 5 7 8 8
1 0 6 8 8 8
1 0 7 5 7 7
2 1 1 3 2 -
2 1 2 9 15 12
2 1 3 13 14 16
2 1 4 1 14 17
3 1 1 12 1 -
3 0 2 14 9 9
3 0 3 1 11 11
R code to generate this table:
ID = c(rep(1,7),rep(2,4),rep(3,3))
x = c(rep(1,4),rep(0,3),rep(1,5),rep(0,2))
rank = c(1:7,1:4,1:3)
U = c(10,9,10,11,7,8,5,3,9,13,1,12,14,1)
R = c(7,11,10,13,8,8,7,2,15,14,14,1,9,11)
dat = cbind(ID,x,rank,U,R)
colnames(dat)=c("ID","x","rank","U","R")
Here's a tidyverse solution:
library(dplyr)
library(tidyr)
dat %>%
as_tibble() %>%
group_by(ID) %>%
mutate(output = ifelse(x, lag(rank) * R - lag(cumsum(U)), R))
Result:
# A tibble: 14 x 6
# Groups: ID [3]
ID x rank U R output
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 10 7 NA
2 1 1 2 9 11 1
3 1 1 3 10 10 1
4 1 1 4 11 13 10
5 1 0 5 7 8 8
6 1 0 6 8 8 8
7 1 0 7 5 7 7
8 2 1 1 3 2 NA
9 2 1 2 9 15 12
10 2 1 3 13 14 16
11 2 1 4 1 14 17
12 3 1 1 12 1 NA
13 3 0 2 14 9 9
14 3 0 3 1 11 11
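The mutate call works because, for the row with rank k, the sum of (R_k - U_j) over all earlier rows j equals (k - 1) * R_k minus the cumulative sum of U up to row k - 1. A minimal base R check of that identity, using the x == 1 rows of ID 1 from the question (the names explicit and closed are illustrative):

```r
# U and R for ID 1, rows with x == 1
U <- c(10, 9, 10, 11)
R <- c(7, 11, 10, 13)
k <- seq_along(R)

# Explicit definition: sum of (R_k - U_j) over all earlier rows j
explicit <- sapply(k, function(i) {
  if (i == 1) NA else sum(R[i] - U[seq_len(i - 1)])
})

# Closed form: (k - 1) * R_k minus the lagged cumulative sum of U
closed <- (k - 1) * R - c(NA, cumsum(U)[-length(U)])

explicit   # NA 1 1 10
closed     # NA 1 1 10
```

The closed form avoids a nested loop over earlier rows, so it can be written as a single vectorised expression per group.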
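Here is a base R solution using ave. Note that within() needs a data frame, while cbind() in the question produces a matrix, so convert it first:
dat <- as.data.frame(dat)
# seq_along(v) is used instead of seq(v), which misbehaves when a group has a single row
dat <- within(dat, output <- ave(R, ID, x, FUN = function(v) v * (seq_along(v) - 1)) - ave(U, ID, x, FUN = function(v) c(NA, cumsum(v)[-length(v)])))
dat <- within(dat, output <- ifelse(x == 0, R, output))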
such that
> dat
ID x rank U R output
1 1 1 1 10 7 NA
2 1 1 2 9 11 1
3 1 1 3 10 10 1
4 1 1 4 11 13 10
5 1 0 5 7 8 8
6 1 0 6 8 8 8
7 1 0 7 5 7 7
8 2 1 1 3 2 NA
9 2 1 2 9 15 12
10 2 1 3 13 14 16
11 2 1 4 1 14 17
12 3 1 1 12 1 NA
13 3 0 2 14 9 9
14 3 0 3 1 11 11

R: Separate data into combinations of two columns

I have some data where each id is measured by different types, each of which can have different values type_val. The measured value is val. A small dummy data set looks like this:
df <- data.frame(id=rep(letters[1:2],6),
type=c(rep('t1',6), rep('t2',6)),
type_val=rep(c(1,1,2,2,3,3),2),
val=1:12)
Then df is:
id type type_val val
1 a t1 1 1
2 b t1 1 2
3 a t1 2 3
4 b t1 2 4
5 a t1 3 5
6 b t1 3 6
7 a t2 1 7
8 b t2 1 8
9 a t2 2 9
10 b t2 2 10
11 a t2 3 11
12 b t2 3 12
I need to spread/cast the data so that all combinations of type and type_val for each id are row-wise. I think this must be a job for the packages reshape2 or tidyr, but I have completely failed to generate anything other than errors.
The outcome data structure (somewhat redundant) would be something like this (hope I got it right!), where pairs of type (as given by combinations of the type_val) are columns type_t1 and type_t2, and their associated values (val in df) are val_t1 and val_t2. Column names are of course arbitrary:
id type_t1 type_t2 val_t1 val_t2
1 a 1 1 1 7
2 a 1 2 1 9
3 a 1 3 1 11
4 a 2 1 3 7
5 a 2 2 3 9
6 a 2 3 3 11
7 a 3 1 5 7
8 a 3 2 5 9
9 a 3 3 5 11
10 b 1 1 2 8
11 b 1 2 2 10
12 b 1 3 2 12
13 b 2 1 4 8
14 b 2 2 4 10
15 b 2 3 4 12
16 b 3 1 6 8
17 b 3 2 6 10
18 b 3 3 6 12
UPDATE
Note that (@Sotos)
> spread(df, type, val)
id type_val t1 t2
1 a 1 1 7
2 a 2 3 9
3 a 3 5 11
4 b 1 2 8
5 b 2 4 10
6 b 3 6 12
is not the desired output - it fails to deliver the wide format defined by combinations of type and type_val in df.
How about this:
df1=df[df$type=="t1",]
df2=df[df$type=="t2",]
DF=merge(df1,df2,by="id")
DF=DF[,-c(2,5)]
colnames(DF)<-c("id", "type_t1", "val_t1","type_t2", "val_t2")
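A variant of the same idea is sketched below: dropping the type column before merging and letting merge's suffixes argument name the columns. It relies on merge producing every pairing of rows that share an id, which is what yields the 3 x 3 combinations per id. The names t1, t2 and wide are illustrative.

```r
df <- data.frame(id = rep(letters[1:2], 6),
                 type = c(rep("t1", 6), rep("t2", 6)),
                 type_val = rep(c(1, 1, 2, 2, 3, 3), 2),
                 val = 1:12)

# Keep only the columns to be paired, one data frame per type
t1 <- df[df$type == "t1", c("id", "type_val", "val")]
t2 <- df[df$type == "t2", c("id", "type_val", "val")]

# Rows with the same id are paired in all combinations;
# clashing column names receive the given suffixes
wide <- merge(t1, t2, by = "id", suffixes = c("_t1", "_t2"))

nrow(wide)    # 18
names(wide)   # "id" "type_val_t1" "val_t1" "type_val_t2" "val_t2"
```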
Here is something more generic that will work with an arbitrary number of unique type:
library(dplyr)
# This function takes a list of dataframes (.data) and merges them by ID
reduce_merge <- function(.data, ID) {
return(Reduce(function(x, y) merge(x, y, by = ID), .data))
}
# This function renames the cols columns in .data by appending _identifier
batch_rename <- function(.data, cols, identifier, sep = '_') {
return(plyr::rename(.data, sapply(cols, function(x){
x = paste(x, .data[1, identifier], sep = sep)
})))
}
# This function creates a list of subsetted dataframes
# (subsetted by values of key),
# uses batch_rename() to give each dataframe more informative column names,
# merges them together, and returns the columns you'd like in a sensible order
multi_spread <- function(.data, grp, key, vals) {
.data %>%
plyr::dlply(key, subset) %>%
lapply(batch_rename, vals, key) %>%
reduce_merge(grp) %>%
select(-starts_with(paste0(key, '.'))) %>%
select(all_of(grp), all_of(sort(setdiff(colnames(.), c(grp, key, vals)))))
}
# Your example
df <- data.frame(id=rep(letters[1:2],6),
type=c(rep('t1',6), rep('t2',6)),
type_val=rep(c(1,1,2,2,3,3),2),
val=1:12)
df %>% multi_spread('id', 'type', c('type_val', 'val'))
id type_val_t1 type_val_t2 val_t1 val_t2
1 a 1 1 1 7
2 a 1 2 1 9
3 a 1 3 1 11
4 a 2 1 3 7
5 a 2 2 3 9
6 a 2 3 3 11
7 a 3 1 5 7
8 a 3 2 5 9
9 a 3 3 5 11
10 b 1 1 2 8
11 b 1 2 2 10
12 b 1 3 2 12
13 b 2 1 4 8
14 b 2 2 4 10
15 b 2 3 4 12
16 b 3 1 6 8
17 b 3 2 6 10
18 b 3 3 6 12
# An example with three unique values of 'type'
df <- data.frame(id = rep(letters[1:2], 9),
type = c(rep('t1', 6), rep('t2', 6), rep('t3', 6)),
type_val = rep(c(1, 1, 2, 2, 3, 3), 3),
val = 1:18)
df %>% multi_spread('id', 'type', c('type_val', 'val'))
id type_val_t1 type_val_t2 type_val_t3 val_t1 val_t2 val_t3
1 a 1 1 1 1 7 13
2 a 1 1 2 1 7 15
3 a 1 1 3 1 7 17
4 a 1 2 1 1 9 13
5 a 1 2 2 1 9 15
6 a 1 2 3 1 9 17
7 a 1 3 1 1 11 13
8 a 1 3 2 1 11 15
9 a 1 3 3 1 11 17
10 a 2 1 1 3 7 13
11 a 2 1 2 3 7 15
12 a 2 1 3 3 7 17
13 a 2 2 1 3 9 13
14 a 2 2 2 3 9 15
15 a 2 2 3 3 9 17
16 a 2 3 1 3 11 13
17 a 2 3 2 3 11 15
18 a 2 3 3 3 11 17
19 a 3 1 1 5 7 13
20 a 3 1 2 5 7 15
21 a 3 1 3 5 7 17
22 a 3 2 1 5 9 13
23 a 3 2 2 5 9 15
24 a 3 2 3 5 9 17
25 a 3 3 1 5 11 13
26 a 3 3 2 5 11 15
27 a 3 3 3 5 11 17
28 b 1 1 1 2 8 14
29 b 1 1 2 2 8 16
30 b 1 1 3 2 8 18
31 b 1 2 1 2 10 14
32 b 1 2 2 2 10 16
33 b 1 2 3 2 10 18
34 b 1 3 1 2 12 14
35 b 1 3 2 2 12 16
36 b 1 3 3 2 12 18
37 b 2 1 1 4 8 14
38 b 2 1 2 4 8 16
39 b 2 1 3 4 8 18
40 b 2 2 1 4 10 14
41 b 2 2 2 4 10 16
42 b 2 2 3 4 10 18
43 b 2 3 1 4 12 14
44 b 2 3 2 4 12 16
45 b 2 3 3 4 12 18
46 b 3 1 1 6 8 14
47 b 3 1 2 6 8 16
48 b 3 1 3 6 8 18
49 b 3 2 1 6 10 14
50 b 3 2 2 6 10 16
51 b 3 2 3 6 10 18
52 b 3 3 1 6 12 14
53 b 3 3 2 6 12 16
54 b 3 3 3 6 12 18

Data Frame Filter Values

Suppose I have the following data frame.
table <- data.frame(
  group = rep(seq(0, 40, by = 5), 3),
  plan = rep(c(1, 2, 3), each = 9),
  price = c(1, 4, 5, 6, 8, 9, 12, 12, 12,
            3, 5, 6, 7, 10, 12, 20, 20, 20,
            5, 6, 8, 12, 15, 20, 22, 28, 28))
group plan price
1 0 1 1
2 5 1 4
3 10 1 5
4 15 1 6
5 20 1 8
6 25 1 9
7 30 1 12
8 35 1 12
9 40 1 12
10 0 2 3
11 5 2 5
12 10 2 6
13 15 2 7
14 20 2 10
15 25 2 12
16 30 2 20
17 35 2 20
18 40 2 20
How can I get the rows of the table, for each plan, up to the first occurrence of the maximum price, without the duplicates that follow?
So the result would be:
group plan price
1 0 1 1
2 5 1 4
3 10 1 5
4 15 1 6
5 20 1 8
6 25 1 9
7 30 1 12
10 0 2 3
11 5 2 5
12 10 2 6
13 15 2 7
14 20 2 10
15 25 2 12
16 30 2 20
You can use slice in dplyr:
library(dplyr)
table %>%
group_by(plan) %>%
slice(1:which.max(price == max(price)))
which.max gives the index of the first occurrence of price == max(price). Using that, I can slice the data.frame to only keep rows for each plan up to the maximum price.
Result:
# A tibble: 22 x 3
# Groups: plan [3]
group plan price
<dbl> <dbl> <dbl>
1 0 1 1
2 5 1 4
3 10 1 5
4 15 1 6
5 20 1 8
6 25 1 9
7 30 1 12
8 0 2 3
9 5 2 5
10 10 2 6
# ... with 12 more rows
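The which.max step can be checked on a single plan, and the same rule can be applied per group in base R. This is only a sketch: the names tbl, first_max, kept and keep are illustrative, and ave stands in for the group_by/slice combination.

```r
# Prices of plan 1 from the question
price <- c(1, 4, 5, 6, 8, 9, 12, 12, 12)

# which.max on a logical vector returns the position of the first TRUE
first_max <- which.max(price == max(price))   # 7
kept <- price[seq_len(first_max)]             # 1 4 5 6 8 9 12

# Same rule for every plan, using ave to work group by group
tbl <- data.frame(
  group = rep(seq(0, 40, by = 5), 3),
  plan  = rep(c(1, 2, 3), each = 9),
  price = c(1, 4, 5, 6, 8, 9, 12, 12, 12,
            3, 5, 6, 7, 10, 12, 20, 20, 20,
            5, 6, 8, 12, 15, 20, 22, 28, 28))
keep <- ave(tbl$price, tbl$plan,
            FUN = function(p) seq_along(p) <= which.max(p == max(p))) == 1
result <- tbl[keep, ]
nrow(result)                                  # 22
```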

zoo - Round coredata to integer

I've got a list of 69 zoo objects. I used na.approx to fill a few gaps, but since my time series deal with counts, I need the imputed values to be integers.
This code
list_int <- lapply(list_dec, round(coredata(list_dec), digits=0))
gives me the following error message
Error in round(coredata(list_dec), digits=0) :
non-numeric argument to mathematical function
I thought it was a problem with applying the function to a list instead of a vector, but the function
coredata(list_dec)
correctly shows all 69 time series (without need for lapply).
So why can't round apply to coredata?
EDITED
As suggested here's a minimal data set
A1 <- runif(20, min=-5, max=13)
A2 <- runif(20, min=-1, max=5)
A3 <- runif(20, min=-3, max=10)
A4 <- runif(20, min=0, max=2)
ls <- list(A1, A2, A3, A4)
list_dec <- lapply(ls, as.zoo)
As discussed in the comments, you can accomplish what you want by the following:
> library(zoo)
> A1 <- runif(20, min=-5, max=13)
> A2 <- runif(20, min=-1, max=5)
> A3 <- runif(20, min=-3, max=10)
> A4 <- runif(20, min=0, max=2)
> ls <- list(A1, A2, A3, A4)
> list_dec <- lapply(ls, as.zoo)
Now list_dec looks as follows:
> list_dec
[[1]]
1 2 3 4 5 6 7 8 9 10 11 12 13
9.20889929 8.03050882 1.52621137 9.91528049 12.71637959 11.93573340 3.34967427 9.75224030 7.90654714 0.08199464 -2.84403691 11.57990103 4.74868873
14 15 16 17 18 19 20
2.94023319 10.71812525 -2.05394366 -1.07669056 7.17503613 4.84871327 4.58929978
[[2]]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1.0756646 0.5615212 0.5697795 0.9629726 2.5962021 3.1932062 0.6894849 1.9844943 1.3351256 4.0043998 0.4756172 0.4573920 0.6009208 4.4963877 4.4149804
16 17 18 19 20
3.7762369 2.9670795 -0.8241576 2.1796402 2.5504061
[[3]]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1.5765136 1.7310402 0.7273943 4.0838831 -0.9946958 -2.0222258 7.5756159 3.9105252 3.9006369 -0.9939739 4.7603811 8.5079521 3.3653795 0.8546201 3.8143874
16 17 18 19 20
5.0847501 -2.6324485 2.0860695 5.7202315 9.5304238
[[4]]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1.36751418 1.44009472 1.41155170 1.55018689 1.31378442 1.09746739 0.09224919 0.66425731 0.61047787 1.63552109 1.56096710 1.59775494 1.69658733 1.08939868 1.96183397
16 17 18 19 20
1.20476936 0.94640977 0.73820689 0.65899943 1.54647028
Now you can directly call lapply like this:
lapply(list_dec,round)
which gives you the desired output:
[[1]]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
9 8 2 10 13 12 3 10 8 0 -3 12 5 3 11 -2 -1 7 5 5
[[2]]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 3 3 1 2 1 4 0 0 1 4 4 4 3 -1 2 3
[[3]]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
2 2 1 4 -1 -2 8 4 4 -1 5 9 3 1 4 5 -3 2 6 10
[[4]]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 2 1 1 0 1 1 2 2 2 2 1 2 1 1 1 1 2
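As for why the original call failed: the second argument of lapply must be a function, but round(coredata(list_dec), digits=0) is evaluated first, on the whole list, and round cannot handle a list. The sketch below reproduces this with plain numeric vectors standing in for the zoo objects (an assumption made so that the example needs no packages); the names vals, err and rounded are illustrative.

```r
# Plain numeric vectors standing in for the zoo series
vals <- list(c(1.4, 2.6), c(-0.7, 3.2))

# Calling round on the list itself fails; capture the message instead of stopping
err <- tryCatch(round(vals, digits = 0), error = function(e) conditionMessage(e))
err

# Passing the function itself lets lapply call it once per element;
# extra arguments such as digits are given after the function
rounded <- lapply(vals, round, digits = 0)
rounded
```

The rounded values are still stored as doubles; if integer storage matters, a conversion such as as.integer would be needed on top of the rounding.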

Resources