Shift up rows in R

This is a simplified example of what my data looks like.
Suppose I have the following data:
>x
Year a b c
1962 1 2 3
1963 4 5 6
. . . .
. . . .
2001 7 8 9
I need to build a time series from x with 7 columns containing the following variables:
Year a lag(a) b lag(b) c lag(c)
What I did is the following:
> x<-ts(x) # converting x to a time series
> x<-cbind(x,x[,-1]) # adding the same variables to the time series without repeating the year column
> x
Year a b c a b c
1962 1 2 3 1 2 3
1963 4 5 6 4 5 6
. . . . . . .
. . . . . . .
2001 7 8 9 7 8 9
I need to shift the last three columns up so that they give the lags of a, b, and c. Then I will rearrange the columns.

Here's an approach using dplyr
df <- data.frame(
a=1:10,
b=21:30,
c=31:40)
library(dplyr)
df %>% mutate(across(everything(), ~ lead(.x, 1))) %>% cbind(df, .)
# a b c a b c
#1 1 21 31 2 22 32
#2 2 22 32 3 23 33
#3 3 23 33 4 24 34
#4 4 24 34 5 25 35
#5 5 25 35 6 26 36
#6 6 26 36 7 27 37
#7 7 27 37 8 28 38
#8 8 28 38 9 29 39
#9 9 29 39 10 30 40
#10 10 30 40 NA NA NA
You can change the names afterwards using colnames(df) <- c("a", "b", ...)
As @nrussel noted in his answer, what you described is a leading variable. If you want a lagging variable, change lead in my answer to lag.
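The renaming and reordering steps can also be folded into the same call. The following is only a sketch under assumptions: the toy values in x are made up, the lag_ prefix is a naming choice of this example, and it relies on the .names argument of across() (dplyr 1.0 or later). dplyr::lag is written out in full because stats::lag behaves differently on plain vectors.

```r
library(dplyr)

# Toy data standing in for the asker's table (values are made up)
x <- data.frame(Year = 1962:1966, a = 1:5, b = 11:15, c = 21:25)

lagged <- x %>%
  # add lag_a, lag_b, lag_c next to the original columns
  mutate(across(c(a, b, c), ~ dplyr::lag(.x, 1), .names = "lag_{.col}")) %>%
  # interleave each variable with its lag
  select(Year, a, lag_a, b, lag_b, c, lag_c)

lagged
```

Swapping dplyr::lag for dplyr::lead gives the "shifted up" version described in the question.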

X <- data.frame(
a=1:100,
b=2*(1:100),
c=3*(1:100),
laga=1:100,
lagb=2*(1:100),
lagc=3*(1:100),
stringsAsFactors=FALSE)
##
Xts <- ts(X)
Xts[1:(nrow(Xts)-1),c(4,5,6)] <- Xts[2:nrow(Xts),c(4,5,6)]
Xts[nrow(Xts),c(4,5,6)] <- c(NA,NA,NA)
> head(Xts)
a b c laga lagb lagc
[1,] 1 2 3 2 4 6
[2,] 2 4 6 3 6 9
[3,] 3 6 9 4 8 12
[4,] 4 8 12 5 10 15
[5,] 5 10 15 6 12 18
[6,] 6 12 18 7 14 21
##
> tail(Xts)
a b c laga lagb lagc
[95,] 95 190 285 96 192 288
[96,] 96 192 288 97 194 291
[97,] 97 194 291 98 196 294
[98,] 98 196 294 99 198 297
[99,] 99 198 297 100 200 300
[100,] 100 200 300 NA NA NA
I'm not sure whether by "shift up" you literally mean shifting the rows up one place as above (that would mean you are using leading values, not lagging values), so here is the other direction ("true" lagged values):
X2 <- data.frame(
a=1:100,
b=2*(1:100),
c=3*(1:100),
laga=1:100,
lagb=2*(1:100),
lagc=3*(1:100),
stringsAsFactors=FALSE)
##
Xts2 <- ts(X2)
Xts2[2:nrow(Xts2),c(4,5,6)] <- Xts2[1:(nrow(Xts2)-1),c(4,5,6)]
Xts2[1,c(4,5,6)] <- c(NA,NA,NA)
##
> head(Xts2)
a b c laga lagb lagc
[1,] 1 2 3 NA NA NA
[2,] 2 4 6 1 2 3
[3,] 3 6 9 2 4 6
[4,] 4 8 12 3 6 9
[5,] 5 10 15 4 8 12
[6,] 6 12 18 5 10 15
##
> tail(Xts2)
a b c laga lagb lagc
[95,] 95 190 285 94 188 282
[96,] 96 192 288 95 190 285
[97,] 97 194 291 96 192 288
[98,] 98 196 294 97 194 291
[99,] 99 198 297 98 196 294
[100,] 100 200 300 99 198 297
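Both directions above can be wrapped in one small base R helper. This is a sketch, not part of the original answer: shift() is a hypothetical name, and the sign convention (positive n for a lag, negative n for a lead) is an assumption of this example.

```r
# Hypothetical helper: shift a vector by n positions, padding with NA.
# n > 0 moves values down (lag); n < 0 moves values up (lead).
shift <- function(v, n = 1) {
  if (n == 0) return(v)
  pad <- rep(NA, min(abs(n), length(v)))
  if (n > 0) c(pad, head(v, -n)) else c(tail(v, n), pad)
}

X <- data.frame(a = 1:5, b = 2 * (1:5))
X$lag_a  <- shift(X$a, 1)    # NA 1 2 3 4
X$lead_a <- shift(X$a, -1)   # 2 3 4 5 NA
X
```

Because the helper works on plain vectors, it can be applied to several columns at once with lapply(X[c("a", "b")], shift, n = 1).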


Join data frame into one in r

I have 4 data frames that all look like this:
Product 2018  Number  Minimum  Maximum
1             56      1        5
2             42      12       16
3             6523    23       56
4             123     23       102
5             56      23       64
6             245623  56       87
7             546     25       540
8             54566   253      560
Product 2019  Number  Minimum  Maximum
1             56      32       53
2             642     423      620
3             56423   432      560
4             3       431      802
5             2       2        6
6             4523    43       68
7             555     23       54
8             55646   3        6
Product 2020  Number  Minimum  Maximum
1             23      2        5
2             342     4        16
3             223     3        5
4             13      4        12
5             2       4        7
6             223     7        8
7             5       34       50
8             46      3        6
Product 2021  Number  Minimum  Maximum
1             234     3        5
2             3242    4        16
3             2423    43       56
4             123     43       102
5             24      4        6
6             2423    4        18
7             565     234      540
8             5646    23       56
I want to join all the tables so I get a table that looks like this:
Products  Number 2021  Min-Max 2021  Number 2020  Min-Max 2020  Number 2019  Min-Max 2019  Number 2018  Min-Max 2018
1         234          3 to 5        23           2 to 5        ...          ...           ...          ...
2         3242         4 to 16       342          4 to 16       ...          ...           ...          ...
3         2423         43 to 56      223          3 to 5        ...          ...           ...          ...
4         123          43 to 102     13           4 to 12       ...          ...           ...          ...
5         24           4 to 6        2            4 to 7        ...          ...           ...          ...
6         2423         4 to 18       223          7 to 8        ...          ...           ...          ...
7         565          234 to 540    5            34 to 50      ...          ...           ...          ...
8         5646         23 to 56      46           3 to 6        ...          ...           ...          ...
The products are the same for all years, so I would like a data frame that contains the number for each year as a column and combines the minimum and maximum columns into one.
Any help is welcome!
How about something like this? You are joining several data frames by a single column, which is relatively straightforward with full_join. The difficulty is that you also need to extract information from the column names and combine several columns at the same time. I would map everything you want to do over a list of the data frames and then reduce the list at the end. Here is an example with two data frames, but you can add as many as you want to the list at the beginning.
library(tidyverse)
#test data
set.seed(23)
df1 <- tibble("Product 2018" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
set.seed(46)
df2 <- tibble("Product 2019" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
list(df1, df2) |>
map(\(x){
year <- str_extract(colnames(x)[1], "\\d+?$")
mutate(x, !!quo_name(paste0("Min-Max ", year)) := paste(Minimum, "to", Maximum))|>
rename(!!quo_name(paste0("Number ", year)) := Number)|>
rename_with(~gsub("\\s\\d+?$", "", .), 1) |>
select(-c(Minimum, Maximum))
}) |>
reduce(full_join, by = "Product")
#> # A tibble: 8 x 5
#> Product `Number 2018` `Min-Max 2018` `Number 2019` `Min-Max 2019`
#> <int> <int> <chr> <int> <chr>
#> 1 1 29 21 to 481 50 93 to 416
#> 2 2 28 17 to 314 78 7 to 313
#> 3 3 72 40 to 787 1 91 to 205
#> 4 4 43 36 to 557 47 55 to 542
#> 5 5 45 70 to 926 52 76 to 830
#> 6 6 34 96 to 645 70 20 to 922
#> 7 7 48 31 to 197 84 6 to 716
#> 8 8 17 86 to 951 99 75 to 768
This is a similar answer, but includes bind_rows to combine the data.frames, then pivot_wider to end in a wide format.
The first steps strip the year from the Product XXXX column name, as this carries relevant information on year for that data.frame. If that column is renamed as Product they are easily combined (with a separate column containing the Year). If this step can be taken earlier in the data collection or processing timeline, it is helpful.
library(tidyverse)
list(df1, df2, df3, df4) %>%
map(~.x %>%
mutate(Year = gsub("Product ", "", names(.x)[1])) %>%
rename(Product = !!names(.[1]))) %>%
bind_rows() %>%
mutate(Min_Max = paste(Minimum, Maximum, sep = " to ")) %>%
pivot_wider(id_cols = Product, names_from = Year, values_from = c(Number, Min_Max), names_vary = "slowest")
Output
Product Number_2018 Min_Max_2018 Number_2019 Min_Max_2019 Number_2020 Min_Max_2020 Number_2021 Min_Max_2021
<int> <int> <chr> <int> <chr> <int> <chr> <int> <chr>
1 1 56 1 to 5 56 32 to 53 23 2 to 5 234 3 to 5
2 2 42 12 to 16 642 423 to 620 342 4 to 16 3242 4 to 16
3 3 6523 23 to 56 56423 432 to 560 223 3 to 5 2423 43 to 56
4 4 123 23 to 102 3 431 to 802 13 4 to 12 123 43 to 102
5 5 56 23 to 64 2 2 to 6 2 4 to 7 24 4 to 6
6 6 245623 56 to 87 4523 43 to 68 223 7 to 8 2423 4 to 18
7 7 546 25 to 540 555 23 to 54 5 34 to 50 565 234 to 540
8 8 54566 253 to 560 55646 3 to 6 46 3 to 6 5646 23 to 56
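For readers without the tidyverse, the same idea (take the year from the first column name, build the "to" string, then join on Product) can be sketched in base R. The names prep, df2018 and df2019 are hypothetical, the two tables hold only the first three rows of the question's 2018 and 2019 data, and underscores are used in the new column names purely as a choice of this example.

```r
# First three rows of the question's 2018 and 2019 tables
df2018 <- data.frame(`Product 2018` = 1:3, Number = c(56, 42, 6523),
                     Minimum = c(1, 12, 23), Maximum = c(5, 16, 56),
                     check.names = FALSE)
df2019 <- data.frame(`Product 2019` = 1:3, Number = c(56, 642, 56423),
                     Minimum = c(32, 423, 432), Maximum = c(53, 620, 560),
                     check.names = FALSE)

# Hypothetical helper: read the year from the first column name and return
# Product plus year-suffixed Number and Min_Max columns
prep <- function(d) {
  year <- sub("^Product\\s*", "", names(d)[1])
  out <- data.frame(Product = d[[1]],
                    Number  = d$Number,
                    Min_Max = paste(d$Minimum, "to", d$Maximum))
  names(out)[-1] <- paste0(c("Number_", "Min_Max_"), year)
  out
}

# all = TRUE keeps products missing from some years, like full_join
wide <- Reduce(function(a, b) merge(a, b, by = "Product", all = TRUE),
               lapply(list(df2018, df2019), prep))
wide
```

Adding more years only means adding more data frames to the list passed to lapply().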

R- Subtract values of all rows in a group from previous row in different group and filter out rows

In R say I had the data frame:
data
frame object x y
1 6 150 100
2 6 149 99
3 6 148 98
3 6 140 90
4 6 148.5 97
4 6 142 93
5 6 147 96
5 6 138 92
5 6 135 90
6 6 146.5 99
1 7 125 200
2 7 126 197
3 7 127 202
3 7 119 185
4 7 117 183
4 7 123 199
5 7 115 190
5 7 124 202
5 7 118 192
6 7 124.5 199
I want to output the object which is the closest in the previous frame based on the (x,y) coordinates and filter out the other objects. I want to find the difference in the x and y between all the objects in a given frame and the single object in the previous frame and keep the closest object while removing the rest. The object that is kept would then serve as reference for the next frame. The frames with only one object would be left as is. The output should be one object per frame:
data
frame object x y
1 6 150 100
2 6 149 99
3 6 148 98
4 6 148.5 97
5 6 147 96
6 6 146.5 99
1 7 125 200
2 7 126 197
3 7 127 202
4 7 123 199
5 7 124 202
6 7 124.5 199
This is a cumulative operation, so it'll take an iterative approach. Here's a simple function to do one operation, assuming it's for only one object.
fun <- function(Z, fr) {
prevZ <- head(subset(Z, frame == (fr-1)), 1)
thisZ <- subset(Z, frame == fr)
if (nrow(prevZ) < 1 || nrow(thisZ) < 2) return(Z)
ind <- which.min( abs(thisZ$x - prevZ$x) + abs(thisZ$y - prevZ$y) )
rbind(subset(Z, frame != fr), thisZ[ind,])
}
fun(subset(dat, object == 6), 3)
# frame object x y
# 1 1 6 150.0 100
# 2 2 6 149.0 99
# 5 4 6 148.5 97
# 6 4 6 142.0 93
# 7 5 6 147.0 96
# 8 5 6 138.0 92
# 9 5 6 135.0 90
# 10 6 6 146.5 99
# 3 3 6 148.0 98
(The order is not maintained, it can easily be sorted back into place as needed.)
Now we can Reduce this for each object within the data.
out <- do.call(rbind,
lapply(split(dat, dat$object),
function(X) Reduce(fun, seq(min(X$frame)+1, max(X$frame)), init=X)))
out <- out[order(out$object, out$frame),]
out
# frame object x y
# 6.1 1 6 150.0 100
# 6.2 2 6 149.0 99
# 6.3 3 6 148.0 98
# 6.5 4 6 148.5 97
# 6.7 5 6 147.0 96
# 6.10 6 6 146.5 99
# 7.11 1 7 125.0 200
# 7.12 2 7 126.0 197
# 7.13 3 7 127.0 202
# 7.16 4 7 123.0 199
# 7.18 5 7 124.0 202
# 7.20 6 7 124.5 199
We can create a for loop that applies the criteria to a single object, and then use group_by %>% summarize to apply it to every object:
library(dplyr)
keep_closest_frame = function(data) {
frames = split(data, data$frame)
for(i in seq_along(frames)) {
if(nrow(frames[[i]]) != 1 && i == 1) {
stop("First frame must have exactly 1 row")
}
if(nrow(frames[[i]]) == 1) next
dists = with(frames[[i]], abs(x - frames[[i - 1]][["x"]]) + abs(y - frames[[i - 1]][["y"]]))
frames[[i]] = frames[[i]][which.min(dists), ]
}
bind_rows(frames)
}
data %>%
group_by(object) %>%
summarize(keep_closest_frame(across()))
# # A tibble: 12 × 4
# # Groups: object [2]
# object frame x y
# <int> <int> <dbl> <int>
# 1 6 1 150 100
# 2 6 2 149 99
# 3 6 3 148 98
# 4 6 4 148. 97
# 5 6 5 147 96
# 6 6 6 146. 99
# 7 7 1 125 200
# 8 7 2 126 197
# 9 7 3 127 202
# 10 7 4 123 199
# 11 7 5 124 202
# 12 7 6 124. 199
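Both answers measure closeness as the sum of absolute differences in x and y (Manhattan distance). If straight-line distance is what "closest" should mean, the same iteration can be sketched with Euclidean distance instead. This is an assumption about the intended metric, not something the question states; closest_per_frame is a hypothetical name, and the data are the first four frames of object 6 from the question.

```r
# First four frames of object 6 from the question
dat <- data.frame(
  frame  = c(1, 2, 3, 3, 4, 4),
  object = 6,
  x = c(150, 149, 148, 140, 148.5, 142),
  y = c(100, 99, 98, 90, 97, 93)
)

closest_per_frame <- function(d) {
  frames <- split(d, d$frame)        # pieces come back ordered by frame
  ref  <- frames[[1]][1, ]           # assumes the first frame has one row
  kept <- list(ref)
  for (f in frames[-1]) {
    dist <- sqrt((f$x - ref$x)^2 + (f$y - ref$y)^2)  # Euclidean distance
    ref  <- f[which.min(dist), ]     # the winner becomes the next reference
    kept[[length(kept) + 1]] <- ref
  }
  do.call(rbind, kept)
}

res <- closest_per_frame(dat)
res
```

For several objects, the function can be applied per group with do.call(rbind, lapply(split(dat, dat$object), closest_per_frame)).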

How to use mutate_at() with two sets of variables, in R

Using dplyr, I want to divide a column by another one, where the two columns have a similar pattern.
I have the following data frame:
My_data = data.frame(
var_a = 101:110,
var_b = 201:210,
number_a = 1:10,
number_b = 21:30)
I would like to create a new variable: var_a_new = var_a/number_a, var_b_new = var_b/number_b and so on if I have c, d etc.
My_data %>%
mutate_at(
.vars = c('var_a', 'var_b'),
.funs = list( new = function(x) x/(.[,paste0('number_a', names(x))]) ))
I did not get an error, but I got a wrong result. I think the problem is that I don't understand what 'x' is. Is it one of the strings in .vars? Is it a column in My_data? Something else?
One option could be:
bind_cols(My_data,
My_data %>%
transmute(across(starts_with("var"))/across(starts_with("number"))) %>%
rename_all(~ paste0(., "_new")))
var_a var_b number_a number_b var_a_new var_b_new
1 101 201 1 21 101.00000 9.571429
2 102 202 2 22 51.00000 9.181818
3 103 203 3 23 34.33333 8.826087
4 104 204 4 24 26.00000 8.500000
5 105 205 5 25 21.00000 8.200000
6 106 206 6 26 17.66667 7.923077
7 107 207 7 27 15.28571 7.666667
8 108 208 8 28 13.50000 7.428571
9 109 209 9 29 12.11111 7.206897
10 110 210 10 30 11.00000 7.000000
You can do this directly, provided the columns are correctly ordered: "var_a" must be the first column in the "var" group and "number_a" the first column in the "number" group, and so on for the other pairs.
var_cols <- grep('var', names(My_data), value = TRUE)
number_cols <- grep('number', names(My_data), value = TRUE)
My_data[paste0(var_cols, '_new')] <- My_data[var_cols]/My_data[number_cols]
My_data
# var_a var_b number_a number_b var_a_new var_b_new
#1 101 201 1 21 101.00000 9.571429
#2 102 202 2 22 51.00000 9.181818
#3 103 203 3 23 34.33333 8.826087
#4 104 204 4 24 26.00000 8.500000
#5 105 205 5 25 21.00000 8.200000
#6 106 206 6 26 17.66667 7.923077
#7 107 207 7 27 15.28571 7.666667
#8 108 208 8 28 13.50000 7.428571
#9 109 209 9 29 12.11111 7.206897
#10 110 210 10 30 11.00000 7.000000
The function across() has replaced scope variants such as mutate_at(), summarize_at() and others. For more details, see vignette("colwise") or https://cran.r-project.org/web/packages/dplyr/vignettes/colwise.html. Based on tmfmnk's answer, the following works well:
My_data %>%
mutate(
new = across(starts_with("var"))/across(starts_with("number")))
The prefix "new." will be added to the names of the new variables.
var_a var_b number_a number_b new.var_a new.var_b
1 101 201 1 21 101.00000 9.571429
2 102 202 2 22 51.00000 9.181818
3 103 203 3 23 34.33333 8.826087
4 104 204 4 24 26.00000 8.500000
5 105 205 5 25 21.00000 8.200000
6 106 206 6 26 17.66667 7.923077
7 107 207 7 27 15.28571 7.666667
8 108 208 8 28 13.50000 7.428571
9 109 209 9 29 12.11111 7.206897
10 110 210 10 30 11.00000 7.000000
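The approaches above all pair columns by position. If the column order cannot be trusted, the pairs can be matched by their shared suffix instead. The following base R sketch assumes the naming scheme from the question (var_<suffix> and number_<suffix>); the variable names suffixes and ratios are made up for this example.

```r
My_data <- data.frame(
  var_a = 101:110,
  var_b = 201:210,
  number_a = 1:10,
  number_b = 21:30)

# Suffixes shared by each var_/number_ pair: "a", "b", ...
suffixes <- sub("^var_", "", grep("^var_", names(My_data), value = TRUE))

# Divide each var_<suffix> column by the number_<suffix> column of the same name
ratios <- lapply(suffixes, function(s) {
  My_data[[paste0("var_", s)]] / My_data[[paste0("number_", s)]]
})

My_data[paste0("var_", suffixes, "_new")] <- ratios
head(My_data, 3)
```

Here each column is looked up by name, so reordering the data frame does not change which columns are divided by which.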

Change the value of a column if it meets one of the requirements. WITHOUT using loops or IF statements

I would like to change the value of a column if it meets one of the requirements. The dataframe is:
> dat
V1 V2 V3
1 0.0582597361 13 1.6147
2 0.0188402085 23 1.5917
3 0.0362384206 64 6.2791
4 0.0690792261 110 20.2906
5 0.0443102834 57 11.3775
6 0.0654932712 137 49.7685
7 0.0388503030 5 0.0397
8 0.0591058288 22 3.4062
9 0.0838927581 569 218.2068
10 0.0749128048 17 1.0305
11 0.0523715810 56 0.5930
12 0.0328815149 0 0.0092
13 0.0246113928 1327 201.1935
14 0.0595625342 181 76.8364
15 0.0879960297 25 4.2614
16 0.0388291615 22 4.3269
17 0.0746654630 40 19.3294
18 0.0003277829 140 43.4176
19 0.0624188329 22 4.0448
20 0.0417003184 157 28.4765
I want to change the values of V1 to 1 if a row meets either of these requirements:
dat$V3>dat$V2 or dat$V1<=0.05
Without using loops or IF statements, thanks.
Basically do it in a vectorized way, which works faster:
dat$V1[ dat$V3>dat$V2 | dat$V1<=0.05] <- 1
Output:
> dat
V1 V2 V3
1 0.05825974 13 1.6147
2 1.00000000 23 1.5917
3 1.00000000 64 6.2791
4 0.06907923 110 20.2906
5 1.00000000 57 11.3775
6 0.06549327 137 49.7685
7 1.00000000 5 0.0397
8 0.05910583 22 3.4062
9 0.08389276 569 218.2068
10 0.07491280 17 1.0305
11 0.05237158 56 0.5930
12 1.00000000 0 0.0092
13 1.00000000 1327 201.1935
14 0.05956253 181 76.8364
15 0.08799603 25 4.2614
16 1.00000000 22 4.3269
17 0.07466546 40 19.3294
18 1.00000000 140 43.4176
19 0.06241883 22 4.0448
20 1.00000000 157 28.4765
Edited, thanks for the feedback.
Using ifelse:
dat$V1=ifelse(dat$V3>dat$V2 | dat$V1<=0.05,1,dat$V1)
Output:
V1 V2 V3
1 0.05825974 13 1.6147
2 1.00000000 23 1.5917
3 1.00000000 64 6.2791
4 0.06907923 110 20.2906
5 1.00000000 57 11.3775
6 0.06549327 137 49.7685
7 1.00000000 5 0.0397
8 0.05910583 22 3.4062
9 0.08389276 569 218.2068
10 0.07491280 17 1.0305
11 0.05237158 56 0.5930
12 1.00000000 0 0.0092
13 1.00000000 1327 201.1935
14 0.05956253 181 76.8364
15 0.08799603 25 4.2614
16 1.00000000 22 4.3269
17 0.07466546 40 19.3294
18 1.00000000 140 43.4176
19 0.06241883 22 4.0448
20 1.00000000 157 28.4765
You could use an ifelse that would not require any loops
dat$V1 <- ifelse(dat$V3>dat$V2 | dat$V1<=0.05, 1, dat$V1)
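It can help to store the condition in a named logical vector first, so it can be inspected before anything is overwritten. The sketch below uses a made-up four-row table (including one NA, which the question's data does not contain) and a hypothetical variable name hit; wrapping the mask in which() drops rows where the condition evaluates to NA, so those rows are left unchanged.

```r
# Made-up data; the NA in V1 is added only to show how missing values behave
dat <- data.frame(V1 = c(0.058, 0.019, 0.069, NA),
                  V2 = c(13, 23, 1, 5),
                  V3 = c(1.6, 1.6, 20.3, 0.04))

# One logical value per row: TRUE where either requirement is met
hit <- dat$V3 > dat$V2 | dat$V1 <= 0.05

# which() keeps only the positions that are TRUE and ignores NA
dat$V1[which(hit)] <- 1
dat
```

In this toy table, sum(hit, na.rm = TRUE) gives the number of rows that were changed.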

correct way to add columns to data frame without loop

I have this "d" data frame that has 2 groups. In real life I have 20 groups.
d= data.frame(group = c(rep("A",10),rep("B",10),"A"), value = c(seq(1,10,1),seq(101,110,1),10000))
d
group value
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 A 7
8 A 8
9 A 9
10 A 10
11 B 101
12 B 102
13 B 103
14 B 104
15 B 105
16 B 106
17 B 107
18 B 108
19 B 109
20 B 110
21 A 10000
I'd like to add 2 columns, "Upper" and "Lower", that are calculated at the group level. Since there are only 2 groups, I can add the columns manually like this:
d$upper = ifelse(d$group=="A", quantile(d$value[d$group=="A"])[4]+ 2.5*IQR(d$value[d$group=="A"]), quantile(d$value[d$group=="B"])[4]+ 2.5*IQR(d$value[d$group=="B"]) )
d$lower = ifelse(d$group=="A", quantile(d$value[d$group=="A"])[4]- 2.5*IQR(d$value[d$group=="A"]), quantile(d$value[d$group=="B"])[4]- 2.5*IQR(d$value[d$group=="B"]) )
group value upper lower
1 A 1 21 -4.0
2 A 2 21 -4.0
3 A 3 21 -4.0
4 A 4 21 -4.0
5 A 5 21 -4.0
6 A 6 21 -4.0
7 A 7 21 -4.0
8 A 8 21 -4.0
9 A 9 21 -4.0
10 A 10 21 -4.0
11 B 101 119 96.5
12 B 102 119 96.5
13 B 103 119 96.5
14 B 104 119 96.5
15 B 105 119 96.5
16 B 106 119 96.5
17 B 107 119 96.5
18 B 108 119 96.5
19 B 109 119 96.5
20 B 110 119 96.5
21 A 10000 21 -4.0
But when I have 20 or 30 groups, what's the best way to add these columns without writing a loop?
Groupwise operations can easily be done using dplyr's group_by function:
library(dplyr)
d <- data.frame(group = c(rep("A",10),rep("B",10),"A"), value = c(seq(1,10,1),seq(101,110,1),10000))
d %>%
group_by(group) %>%
mutate(upper=quantile(value, 0.75) + 2.5*IQR(value),
lower=quantile(value, 0.75) - 2.5*IQR(value))
This splits the data frame by the "group" variable and then computes the "upper" and "lower" columns separately for each group.
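The same group-wise calculation can be sketched in base R with ave(), which applies a function within each group and returns a vector aligned with the original rows. The intermediate names q3 and iqr are made up for this example; the formula follows the question, which uses the 75% quantile for both bounds.

```r
d <- data.frame(group = c(rep("A", 10), rep("B", 10), "A"),
                value = c(1:10, 101:110, 10000))

# Per-group third quartile and interquartile range, repeated for every row
q3  <- ave(d$value, d$group, FUN = function(v) quantile(v, 0.75))
iqr <- ave(d$value, d$group, FUN = IQR)

d$upper <- q3 + 2.5 * iqr
d$lower <- q3 - 2.5 * iqr
d
```

This works for any number of groups, since ave() does the splitting and recombining itself.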
