Creating a starting value variable with longitudinal data (conditional) - r

I am trying to create a new variable that is basically the starting value of another variable in my dataframe. Example data:
id <- rep(c(1, 2), each = 8)
outcome <- rep(1:5, length.out = 16)
time <- rep(c(0, 1, 3, 4),4)
Attitude <- rep(c('A1', 'A2', 'A1', 'A2'), each = 4)
df <- data.frame(id, Attitude, outcome, time)
What I'd like to get is a new column named new_var (or whatever) that is equal to the value of outcome at time == 0 for id = id and also depends on Attitude. Thus what I'd like to extend the dataframe to is:
df$new_var <- c(1,1,1,1,5,5,5,5,4,4,4,4,3,3,3,3)
Only I would like to get there with some decent code rather than typing the values by hand. In SAS I know I can do this with the lag function. I would really appreciate a proper R solution rather than a workaround that mimics SAS. In the end I want to get stronger in R too.
Related: Retain and lag function in R as SAS
However, I would prefer a solution based on indices or the 'usual' R way. The answer there also does not depend on other conditions.
So, important here is that the coding works for the different ids, attitude levels / variables (A1, A2, ...) and that the outcome value at time == 0 is basically copied to new_var.
I hope I am clear in conveying my message. If not I think the small piece of example code and how I'd like to extend it should be clear enough. Looking forward to suggestions.
EDIT: Another example for @jogo's answer.
ID <- rep(1, 36)
Attitude <- rep(c('A1', 'A2','A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9'),
length.out =36)
Answer_a <- rep(1:5, length.out = 36)
time <- as.character(rep(c(0, 1, 3, 4), each = 9))
df <- data.frame(ID, Attitude, Answer_a, time)
df$time <- as.character(df$time)

I think this is what you mean - assuming the data is always in the correct order?
EDIT Added an arrange step to ensure the data is always correctly ordered.
library(tidyverse)
df %>% group_by(id, Attitude) %>%
arrange(time, .by_group = TRUE) %>%
mutate(new_var2 = first(outcome[!is.na(outcome)]))
# A tibble: 16 x 6
# Groups: id, Attitude [4]
id Attitude outcome time new_var new_var2
<dbl> <fct> <int> <dbl> <dbl> <int>
1 1.00 A1 1 0 1.00 1
2 1.00 A1 2 1.00 1.00 1
3 1.00 A1 3 3.00 1.00 1
4 1.00 A1 4 4.00 1.00 1
5 1.00 A2 5 0 5.00 5
6 1.00 A2 1 1.00 5.00 5
7 1.00 A2 2 3.00 5.00 5
8 1.00 A2 3 4.00 5.00 5
9 2.00 A1 4 0 4.00 4
10 2.00 A1 5 1.00 4.00 4
11 2.00 A1 1 3.00 4.00 4
12 2.00 A1 2 4.00 4.00 4
13 2.00 A2 3 0 3.00 3
14 2.00 A2 4 1.00 3.00 3
15 2.00 A2 5 3.00 3.00 3
16 2.00 A2 1 4.00 3.00 3

Here is a solution with data.table:
library("data.table")
setDT(df)
df[, new_var:=outcome[1], rleid(Attitude)][] # or
# df[, new_var:=outcome[time==0], rleid(Attitude)][]
For testing I named the new column new_var2:
id <- rep(c(1, 2), each = 8)
outcome <- rep(1:5, length.out = 16)
time <- rep(c(0, 1, 3, 4),4)
Attitude <- rep(c('A1', 'A2', 'A1', 'A2'), each = 4)
df <- data.frame(id, Attitude, outcome, time)
df$new_var <- c(1,1,1,1,5,5,5,5,4,4,4,4,3,3,3,3)
library("data.table")
setDT(df)
df[, new_var2:=outcome[1], rleid(Attitude)][]
# > df[, new_var2:=outcome[1], rleid(Attitude)][]
# id Attitude outcome time new_var new_var2
# 1: 1 A1 1 0 1 1
# 2: 1 A1 2 1 1 1
# 3: 1 A1 3 3 1 1
# 4: 1 A1 4 4 1 1
# 5: 1 A2 5 0 5 5
# 6: 1 A2 1 1 5 5
# 7: 1 A2 2 3 5 5
# 8: 1 A2 3 4 5 5
# 9: 2 A1 4 0 4 4
# 10: 2 A1 5 1 4 4
# 11: 2 A1 1 3 4 4
# 12: 2 A1 2 4 4 4
# 13: 2 A2 3 0 3 3
# 14: 2 A2 4 1 3 3
# 15: 2 A2 5 3 3 3
# 16: 2 A2 1 4 3 3
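A note on why row order matters for this approach: rleid() numbers consecutive runs of equal values, so the grouping above is only right while each id/Attitude block is contiguous and starts with its time == 0 row. The sketch below uses a made-up vector to show what rleid() returns, and then groups by both keys with by = .(id, Attitude) on deliberately shuffled rows. That grouping is an alternative suggested here, not what the answer above does.

```r
library(data.table)

# rleid() starts a new group id every time the value changes
runs <- rleid(c("A1", "A1", "A2", "A2", "A1", "A1"))
print(runs)

# Same example data as in the question, but with the rows shuffled
set.seed(1)
dt <- data.table(
  id       = rep(c(1, 2), each = 8),
  Attitude = rep(c("A1", "A2", "A1", "A2"), each = 4),
  outcome  = rep(1:5, length.out = 16),
  time     = rep(c(0, 1, 3, 4), 4)
)
dt <- dt[sample(.N)]

# Grouping by the actual keys and picking the time == 0 row
# does not depend on the order of the rows
dt[, new_var := outcome[time == 0], by = .(id, Attitude)]
setorder(dt, id, Attitude, time)
print(dt)
```

This assumes every id/Attitude combination has exactly one row with time == 0; with zero or several such rows the assignment would need extra handling.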
Your second example shows that you have to reorder the rows of the data. Using data.table, this can be done with setkey():
ID <- rep(1, 36)
Attitude <- rep(c('A1', 'A2','A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9'),
length.out =36)
Answer_a <- rep(1:5, length.out = 36)
time <- as.character(rep(c(0, 1, 3, 4), each = 9))
df <- data.frame(ID, Attitude, Answer_a, time)
df$time <- as.character(df$time)
library("data.table")
setDT(df)
setkey(df, ID, Attitude, time)
df[, new_var:=Answer_a[1], rleid(Attitude)]
df
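Since the question asks for something based on indices, here is a minimal base R sketch of the same idea, using the first example data. It builds a combined key from id and Attitude and looks up the time == 0 row for each key with match(). The helper names key and base_rows are made up for illustration, and the sketch assumes exactly one time == 0 row per id/Attitude combination.

```r
id <- rep(c(1, 2), each = 8)
outcome <- rep(1:5, length.out = 16)
time <- rep(c(0, 1, 3, 4), 4)
Attitude <- rep(c("A1", "A2", "A1", "A2"), each = 4)
df <- data.frame(id, Attitude, outcome, time)

# One key per id/Attitude combination
key <- interaction(df$id, df$Attitude, drop = TRUE)

# Logical index of the baseline rows
base_rows <- df$time == 0

# For every row, find the baseline row with the same key and copy its outcome
df$new_var <- df$outcome[base_rows][match(key, key[base_rows])]
print(df)
```

Because the lookup goes through the key rather than through row position, it gives the same result whatever order the rows are in.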


How to sample across a dataset with two factors in it?

I have a dataframe with two species A and B and certain variables a b associated with the total of 100 rows.
I want to create a sampler that, in one set, randomly picks 6 rows (reps) from the df dataset. However, the samples for A must only come from rows associated with sp A in df, and similarly for B. I want to do this 500 times for each of species A and B.
I attempted a for loop and when I ran sampling it shows a single row with 6 columns. I would appreciate any guidance
a <- rnorm(100, 2,1)
b <- rnorm(100, 2,1)
sp <- rep(c("A","B"), each = 50)
df <- data.frame(a,b,sp)
df.sample <- for(i in 1:1000){
sampling <- sample(df[i,],6,replace = TRUE)
}
#Output in a single row
a a.1 sp b sp.1 a.2
1000 1.68951 1.68951 B 1.395995 B 1.68951
#Expected dataframe
df.sample
set rep a b sp
1 1 1 9 A
1 2 3 2 A
1 3 0 2 A
1 4 1 2 A
1 5 1 6 A
1 6 4 2 A
2 1 1 2 B
2 2 5 2 B
2 3 1 2 B
2 4 1 6 B
2 5 1 8 B
2 6 9 2 B
....
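A short sketch of why the attempt above ends up with a single row of 6 columns, as far as I can tell: a for loop itself evaluates to NULL, and sample() applied to a one-row data frame treats it as a list of columns and resamples those columns. Sampling row indices within one species is shown at the end for contrast. The names res, one, idx and picked are only for illustration.

```r
set.seed(1)
df <- data.frame(a = rnorm(100, 2, 1), b = rnorm(100, 2, 1),
                 sp = rep(c("A", "B"), each = 50))

# A for loop evaluates to NULL, so nothing is collected in `res`
res <- for (i in 1:3) sample(df[i, ], 6, replace = TRUE)
print(is.null(res))

# sample() sees the one-row data frame as a list of 3 columns
# and draws 6 columns with replacement
one <- sample(df[1, ], 6, replace = TRUE)
print(dim(one))

# Sampling row indices within one species instead
idx <- sample(which(df$sp == "A"), 6, replace = TRUE)
picked <- df[idx, ]
print(picked)
```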
Here's how I would do it (using tidyverse):
data:
a <- rnorm(100, 2,1)
b <- rnorm(100, 2,1)
sp <- rep(c("A","B"), each = 50)
df <- data.frame(a,b,sp)
# create an empty table with desired columns
library(tidyverse)
output <- tibble(a = numeric(),
b = numeric(),
sp = character(),
set = numeric())
# sampling in a loop
set.seed(42)
for(i in 1:500){
samp1 <- df %>% filter(sp == 'A') %>% sample_n(6, replace = TRUE) %>% mutate(set = i)
samp2 <- df %>% filter(sp == 'B') %>% sample_n(6, replace = TRUE) %>% mutate(set = i)
output %>% add_row(bind_rows(samp1, samp2)) -> output
}
Result
> head(output, 20)
# A tibble: 20 × 4
a b sp set
<dbl> <dbl> <chr> <dbl>
1 2.59 3.31 A 1
2 1.84 1.66 A 1
3 2.35 1.17 A 1
4 2.33 1.95 A 1
5 0.418 1.11 A 1
6 1.19 2.54 A 1
7 2.35 0.899 B 1
8 1.19 1.63 B 1
9 0.901 0.986 B 1
10 3.12 1.75 B 1
11 2.28 2.61 B 1
12 1.37 3.47 B 1
13 2.33 1.95 A 2
14 1.84 1.66 A 2
15 3.76 1.26 A 2
16 2.96 3.10 A 2
17 1.03 1.81 A 2
18 1.42 2.00 A 2
19 0.901 0.986 B 2
20 2.37 1.39 B 2
You could first split df by species. Random rows within each species can be drawn with x[sample(nrow(x), 6), ]. Wrapping that in replicate() repeats the sampling as many times as needed (3 times here, to keep the output short). dplyr::bind_rows() then combines the samples and adds a new column set indicating the sampling index.
lapply(split(df, df$sp), function(x) {
dplyr::bind_rows(
replicate(3, x[sample(nrow(x), 6), ], FALSE),
.id = "set"
)
})
Output
$A
set a b sp
1 1 1.52480034 3.41257975 A
2 1 1.82542370 2.08511584 A
3 1 1.80019901 1.39279162 A
4 1 2.20765154 2.11879412 A
5 1 1.61295185 2.04035172 A
6 1 1.92936567 2.90362816 A
7 2 0.88903679 2.46948106 A
8 2 3.19223788 2.81329767 A
9 2 1.28629416 2.69275525 A
10 2 2.61044815 0.82495427 A
11 2 2.30928735 1.67421328 A
12 2 -0.09789704 2.62434719 A
13 3 2.10386603 1.78157862 A
14 3 2.17542841 0.84016203 A
15 3 3.22202227 3.49863423 A
16 3 1.07929909 -0.02032945 A
17 3 2.95271838 2.34460193 A
18 3 1.90414536 1.54089645 A
$B
set a b sp
1 1 3.5130317 -0.4704879 B
2 1 3.0053072 1.6021795 B
3 1 4.1167657 1.1123342 B
4 1 1.5460589 3.2915979 B
5 1 0.8742753 0.9132530 B
6 1 2.0882660 1.5588471 B
7 2 1.2444645 1.8199525 B
8 2 2.7960117 2.6657735 B
9 2 2.5970774 0.9984187 B
10 2 1.1977317 3.7360884 B
11 2 2.2830643 1.0452440 B
12 2 3.1047150 1.5609482 B
13 3 2.9309124 1.5679255 B
14 3 0.8631965 1.3501631 B
15 3 1.5460589 3.2915979 B
16 3 2.7960117 2.6657735 B
17 3 3.1047150 1.5609482 B
18 3 2.8735390 0.6329279 B
If I understood well what you want, it could be done following this code
# Create the initial data frame
a <- rnorm(100, 2,1)
b <- rnorm(100, 2,1)
sp <- rep(c("A","B"), each = 50)
df <- data.frame(a,b,sp)
# Rows with sp=A
row.A <- which(df$sp=="A")
row.B <- which(df$sp=="B")
# Sampling data.frame
sampling <- data.frame(matrix(ncol = 5, nrow = 0))
# "rep" column for each iteration
rep1 <- rep(1:6,2)
# Build the data.frame
for(i in 1:500){
# Sampling row.A
s.A <- sample(row.A,6,replace = T)
# Sampling row.B
s.B <- sample(row.B,6,replace = T)
# Data frame with the subset of df and "set" and "rep" values
sampling <- rbind(sampling, set=cbind(rep(i,12),rep=rep1,df[c(s.A,s.B),]))
}
# Delete row.names of sampling and redefine sampling's column names
row.names(sampling) <- NULL
colnames(sampling) <- c("set", "rep", "a", "b", "sp")
And the output looks like this:
set rep a b sp
1 1 3.713663 2.717456 A
1 2 2.456070 2.803443 A
1 3 2.166655 1.395556 A
1 4 1.453738 5.662969 A
1 5 2.692518 2.971156 A
1 6 2.699634 3.016791 A
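If growing a data frame inside the loop becomes slow, one possible alternative is to draw all row indices per species in one call and attach the set and rep columns afterwards. This is only a sketch under my own assumptions: sampling is with replacement (as in the loop above), each set holds 6 rows of A and 6 rows of B, and the names n_sets, n_rep, rows_by_sp and out are made up.

```r
set.seed(42)
df <- data.frame(a = rnorm(100, 2, 1), b = rnorm(100, 2, 1),
                 sp = rep(c("A", "B"), each = 50))

n_sets <- 500
n_rep  <- 6

# Row numbers belonging to each species
rows_by_sp <- split(seq_len(nrow(df)), df$sp)

# For each species, draw n_sets * n_rep rows at once and label them
out <- do.call(rbind, lapply(rows_by_sp, function(rows) {
  idx <- sample(rows, n_sets * n_rep, replace = TRUE)
  data.frame(set = rep(seq_len(n_sets), each = n_rep),
             rep = rep(seq_len(n_rep), times = n_sets),
             df[idx, ])
}))

# Put the rows of each set together: A first, then B
out <- out[order(out$set, out$sp, out$rep), ]
rownames(out) <- NULL
print(head(out, 12))
```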

Gathering columns with unequal length with same name in R

I want to gather data from experiments from data frame into columns. Data is in the following form;
and I want to arrange data in the format given below;
Is there a simple way to do this in R/RStudio? I tried tidyr, rbind and cbind as suggested in different examples, but I could not get them to work and found those examples not very relevant to my case.
It would be great if someone could help me understand.
Thanks
Using data.table function melt with patterns in measure.vars, you can try the following:
library(data.table)
#Some data in the same format as you have
df <- read.table(text="
1 0.8525099 0.5598105 0.4242143 0 0.06016425 0.678719492 0.4852765 0.4970301 0.1657070
2 0.1237982 0.2853534 0.8281460 0.42586728 0.31214568 0.647306659 0.5445816 0.4250520 0.9975251
3 0.4907858 0.4925835 0.6689135 0.06042183 0.47391134 0.002571686 0.5267215 0.4291427 NA
4 0.8524778 0.1091856 0.6529887 0.24606198 0.44869099 0.540201766 0.6263992 0.1448730 NA
")
#assigning names to columns
colnames(df) <- c("Run", rep(c("Logging", "Salinity","surface"),3))
setDT(df) #converting df into a data.table
df #Similar as your initial data frame
Run Logging Salinity surface Logging Salinity surface Logging Salinity surface
1: 1 0.8525099 0.5598105 0.4242143 0.00000000 0.06016425 0.678719492 0.4852765 0.4970301 0.1657070
2: 2 0.1237982 0.2853534 0.8281460 0.42586728 0.31214568 0.647306659 0.5445816 0.4250520 0.9975251
3: 3 0.4907858 0.4925835 0.6689135 0.06042183 0.47391134 0.002571686 0.5267215 0.4291427 NA
4: 4 0.8524778 0.1091856 0.6529887 0.24606198 0.44869099 0.540201766 0.6263992 0.1448730 NA
df2 <- melt(df, #melting data, converting from wide to long
id.vars = 1, # here we attempt to fix the first column "Runs"
measure.vars = patterns(Logging="Logging", # here we look up for a pattern of column names to convert into measure
Salinity= "Salinity",
surface="surface")
)
#Output
df2
Run variable Logging Salinity surface
1: 1 1 0.85250990 0.55981050 0.424214300
2: 2 1 0.12379820 0.28535340 0.828146000
3: 3 1 0.49078580 0.49258350 0.668913500
4: 4 1 0.85247780 0.10918560 0.652988700
5: 1 2 0.00000000 0.06016425 0.678719492
6: 2 2 0.42586728 0.31214568 0.647306659
7: 3 2 0.06042183 0.47391134 0.002571686
8: 4 2 0.24606198 0.44869099 0.540201766
9: 1 3 0.48527650 0.49703010 0.165707000
10: 2 3 0.54458160 0.42505200 0.997525100
11: 3 3 0.52672150 0.42914270 NA
12: 4 3 0.62639920 0.14487300 NA
And finally remove column variable
#Removing column variable (second column in df2) you get your result
df2[, -2]
Run Logging Salinity surface
1: 1 0.85250990 0.55981050 0.424214300
2: 2 0.12379820 0.28535340 0.828146000
3: 3 0.49078580 0.49258350 0.668913500
4: 4 0.85247780 0.10918560 0.652988700
5: 1 0.00000000 0.06016425 0.678719492
6: 2 0.42586728 0.31214568 0.647306659
7: 3 0.06042183 0.47391134 0.002571686
8: 4 0.24606198 0.44869099 0.540201766
9: 1 0.48527650 0.49703010 0.165707000
10: 2 0.54458160 0.42505200 0.997525100
11: 3 0.52672150 0.42914270 NA
12: 4 0.62639920 0.14487300 NA
You can use tidyr::pivot_longer, using @Chriss Paul's data.
tidyr::pivot_longer(df, cols = -Run, names_to = '.value')
# Run Logging Salinity surface
# <int> <dbl> <dbl> <dbl>
# 1 1 0.853 0.560 0.424
# 2 1 0 0.0602 0.679
# 3 1 0.485 0.497 0.166
# 4 2 0.124 0.285 0.828
# 5 2 0.426 0.312 0.647
# 6 2 0.545 0.425 0.998
# 7 3 0.491 0.493 0.669
# 8 3 0.0604 0.474 0.00257
# 9 3 0.527 0.429 NA
#10 4 0.852 0.109 0.653
#11 4 0.246 0.449 0.540
#12 4 0.626 0.145 NA
PS - It is not advised to have data with duplicate column names.
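To illustrate the point about duplicate names, here is a small sketch with hypothetical data in which every repeated block gets a numeric suffix. With unique names, pivot_longer() can split each name into the measurement (the ".value" part) and a block identifier via names_sep. The column names, the values and the name block are invented for the example.

```r
library(tidyr)

# Hypothetical wide data with unique, suffixed column names
wide <- data.frame(
  Run        = 1:2,
  Logging_1  = c(0.85, 0.12),
  Salinity_1 = c(0.56, 0.29),
  Logging_2  = c(0.00, 0.43),
  Salinity_2 = c(0.06, 0.31)
)

# ".value" keeps Logging/Salinity as columns; the suffix goes to `block`
long <- pivot_longer(
  wide,
  cols      = -Run,
  names_to  = c(".value", "block"),
  names_sep = "_"
)
print(long)
```

Keeping the block column also records which repetition each row came from, which is lost when the duplicated names are simply stacked.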
I created a similar data set and applied the following code, which is based on binding since you mentioned it in your question. It may look verbose, but it gets you to the desired output:
library(dplyr)
df <- tibble(
runs = c(1, 2, 3, 4),
col1 = c(3, 4, 5, 5),
col2 = c(5, 3, 1, 4),
col3 = c(6, 4, 9, 2),
col1 = c(0, 2, 2, 1),
col2 = c(2, 3, 1, 7),
col3 = c(2, 4, 9, 9),
col1 = c(3, 4, 5, 7),
col2 = c(3, 3, 1, 4),
col3 = c(3, 2, NA, NA), .name_repair = "minimal")
df %>%
select(1:4) %>%
bind_rows(df %>%
select(1, 5:7)) %>%
bind_rows(df %>%
select(1, 8:10)) %>%
select(runs, col1:col3)
There are two other ways I thought you might be interested in, since it was your question. This code is not entirely mine (I got help with it), but these are great alternatives for the same problem:
library(tidyr)
df %>%
pivot_longer(cols = starts_with("col"), names_to = c(".value")) %>% # Pay attention to the `.value` sentinel it indicates that component of the name defines the name of the column containing the cell values, overriding values_to.
group_by(runs) %>%
mutate(id = row_number()) %>%
ungroup() %>%
arrange(id) %>%
select(-id)
# A tibble: 12 x 4
runs col1 col2 col3
<dbl> <dbl> <dbl> <dbl>
1 1 3 5 6
2 2 4 3 4
3 3 5 1 9
4 4 5 4 2
5 1 0 2 2
6 2 2 3 4
7 3 2 1 9
8 4 1 7 9
9 1 3 3 3
10 2 4 3 2
11 3 5 1 NA
12 4 7 4 NA
The above code was proposed by @Ashley G, which was amazing.
And the base R alternative:
data.frame(df$runs,
sapply(split.default(df[-1], names(df)[-1]), unlist),
row.names = NULL)
# Here `split.default` splits the columns of a data frame, whereas `split` splits the rows.
df.runs col1 col2 col3
1 1 3 5 6
2 2 4 3 4
3 3 5 1 9
4 4 5 4 2
5 1 0 2 2
6 2 2 3 4
7 3 2 1 9
8 4 1 7 9
9 1 3 3 3
10 2 4 3 2
11 3 5 1 NA
12 4 7 4 NA
The base R code was written by @Ronak Shah, for which I'm very grateful.
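To make the split.default step easier to follow, here is a tiny sketch on made-up data with two repeated column names, x and y. split.default() groups the columns by name, and unlist() then stacks the columns of each group into one long vector.

```r
# Toy data with duplicated column names (check.names = FALSE keeps them as typed)
toy <- data.frame(runs = 1:2,
                  x = c(1, 2), y = c(3, 4),
                  x = c(5, 6), y = c(7, 8),
                  check.names = FALSE)

# Group the non-id columns by their original names
cols <- split.default(toy[-1], names(toy)[-1])
print(names(cols))

# Stack each group of columns into a single vector, one matrix column per name
stacked <- sapply(cols, unlist)
print(stacked)
```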

Calculate standard deviation across multiple rows grouped by ID

I want to calculate the standard deviation across multiple rows (not per row) and then save the results into a new data frame. Best to explain using an example.
Data:
ID <- c("a","a","a","a","b","b","b","b","c","c","c","c")
y1 <- c(8,9,3,6,6,4,5,8,7,5,8,1)
y2 <- c(3,6,6,1,7,3,8,7,5,8,1,7)
y3 <- c(9,3,1,8,4,6,3,8,4,6,5,7)
df <- data.frame(ID, y1, y2, y3)
ID y1 y2 y3
1 a 8 3 9
2 a 9 6 3
3 a 3 6 1
4 a 6 1 8
5 b 6 7 4
6 b 4 3 6
7 b 5 8 3
8 b 8 7 8
9 c 7 5 4
10 c 5 8 6
11 c 8 1 5
12 c 1 7 7
I want to calculate the standard deviation of ID$a, ID$b and ID$c and store in a new data frame. I know I can do this:
sd_a <- sd(as.matrix(subset(df, ID == "a")), na.rm = TRUE)
sd_b <- sd(as.matrix(subset(df, ID == "b")), na.rm = TRUE)
sd_c <- sd(as.matrix(subset(df, ID == "c")), na.rm = TRUE)
ID <- c("a","b","c")
sd <- c(sd_a,sd_b,sd_c)
df2 <- data.frame(ID, sd)
ID sd
1 a 2.958040
2 b 1.912875
3 c 2.386833
But is there a more straightforward way of achieving this?
You can use pivot_longer() to stack y1 to y3 and then calculate the sd.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(y1:y3) %>%
group_by(ID) %>%
summarise(sd = sd(value))
# # A tibble: 3 x 2
# ID sd
# <chr> <dbl>
# 1 a 2.96
# 2 b 1.91
# 3 c 2.39
One dplyr solution could be:
df %>%
group_by(ID) %>%
summarise(sd = sd(unlist(cur_data())))
ID sd
<fct> <dbl>
1 a 2.96
2 b 1.91
3 c 2.39
In base R you can do:
aggregate(values ~ ID, cbind(df[1], stack(df[-1])), sd)
ID values
1 a 2.958040
2 b 1.912875
3 c 2.386833
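Another base R variant, sketched here with the data from the question: split the value columns by ID, flatten each piece with unlist(), and take the standard deviation of the resulting vector. The names sds and df2 are just for illustration.

```r
ID <- c("a","a","a","a","b","b","b","b","c","c","c","c")
y1 <- c(8,9,3,6,6,4,5,8,7,5,8,1)
y2 <- c(3,6,6,1,7,3,8,7,5,8,1,7)
y3 <- c(9,3,1,8,4,6,3,8,4,6,5,7)
df <- data.frame(ID, y1, y2, y3)

# Split only the numeric columns by ID, then pool all values per group
sds <- sapply(split(df[-1], df$ID), function(g) sd(unlist(g)))

# Store the result in a new data frame
df2 <- data.frame(ID = names(sds), sd = unname(sds))
print(df2)
```

Dropping the ID column before the split avoids the character coercion that as.matrix() performs on the whole subset in the original attempt.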

bind_rows to each group of tibble

Consider the following two tibbles:
library(tidyverse)
a <- tibble(time = c(-1, 0), value = c(100, 200))
b <- tibble(id = rep(letters[1:2], each = 3), time = rep(1:3, 2), value = 1:6)
So a and b have the same columns and b has an additional column called id.
I want to do the following: group b by id and then add tibble a on top of each group.
So the output should look like this:
# A tibble: 10 x 3
id time value
<chr> <int> <int>
1 a -1 100
2 a 0 200
3 a 1 1
4 a 2 2
5 a 3 3
6 b -1 100
7 b 0 200
8 b 1 4
9 b 2 5
10 b 3 6
Of course there are multiple workarounds to achieve this (like loops for example). But in my case I have a large number of IDs and a very large number of columns.
I would be thankful if anyone could point me towards the direction of a solution within the tidyverse.
Thank you
We can expand the data frame a with id from b and then bind_rows them together.
library(tidyverse)
a2 <- expand(a, id = b$id, nesting(time, value))
b2 <- bind_rows(a2, b) %>% arrange(id, time)
b2
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a -1 100
# 2 a 0 200
# 3 a 1 1
# 4 a 2 2
# 5 a 3 3
# 6 b -1 100
# 7 b 0 200
# 8 b 1 4
# 9 b 2 5
# 10 b 3 6
split from base R will divide a data frame into a list of subsets based on an index.
b %>%
split(b[["id"]]) %>%
lapply(bind_rows, a) %>%
lapply(select, -"id") %>%
bind_rows(.id = "id")
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a 1 1
# 2 a 2 2
# 3 a 3 3
# 4 a -1 100
# 5 a 0 200
# 6 b 1 4
# 7 b 2 5
# 8 b 3 6
# 9 b -1 100
# 10 b 0 200
An idea (via base R) is to split your data frame and create a new one with id + the other data frame and rbind, i.e.
df = do.call(rbind, lapply(split(b, b$id), function(i)rbind(data.frame(id = i$id[1], a), i)))
which gives
id time value
a.1 a -1 100
a.2 a 0 200
a.3 a 1 1
a.4 a 2 2
a.5 a 3 3
b.1 b -1 100
b.2 b 0 200
b.3 b 1 4
b.4 b 2 5
b.5 b 3 6
NOTE: You can remove the rownames by simply calling rownames(df) <- NULL
We can nest and add the relevant rows to each nested item :
library(tidyverse)
b %>%
nest(data = -id) %>%
mutate(data= map(data,~bind_rows(a,.x))) %>%
unnest(data)
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a -1 100
# 2 a 0 200
# 3 a 1 1
# 4 a 2 2
# 5 a 3 3
# 6 b -1 100
# 7 b 0 200
# 8 b 1 4
# 9 b 2 5
# 10 b 3 6
Maybe not the most efficient way, but easy to follow:
library(tidyverse)
a <- tibble(time = c(-1, 0), value = c(100, 200))
b <- tibble(id = rep(letters[1:2], each = 3), time = rep(1:3, 2), value =
1:6)
a.a <- a %>% add_column(id = rep("a",nrow(a)))
a.b <- a %>% add_column(id = rep("b",nrow(a)))
joint <- bind_rows(b,a.a,a.b)
(joint <- arrange(joint,id))

Using lapply to transpose part of a column and add it as new columns to a data frame

I've been searching for some clarity on this one but cannot find anything that applies to my case. I constructed a DF very similar to this one (but with considerably more data, over a million rows in total):
Key1 <- c("A", "B", "C", "A", "C", "B", "B", "C", "A", "C")
Key2 <- c("A1", "B1", "C1", "A2", "C2", "B2", "B3", "C3", "A3", "C4")
NumVal <- c(2, 3, 1, 4, 6, 8, 2, 3, 1, 0)
DF1 <- as.data.frame(cbind(Key1, Key2, NumVal), stringsAsFactors = FALSE) %>% arrange(Key2)
ConsId <- c(1:10)
DF1 <- cbind(DF1, ConsId)
Now, what I want to do is add, let's say, 3 new columns to the data frame (in real life I need 12, but 3 keeps this toy example readable). Each row should hold the values of $NumVal that share the same $Key1 and have a $ConsId greater than or equal to that row's, with the remaining cells filled with NA. Here is the expected result in case that wasn't clear:
Key1 Key2 NumVal ConsId V1 V2 V3
A A1 2 1 2 4 1
A A2 4 2 4 1 NA
A A3 1 3 1 NA NA
B B1 3 4 3 8 2
B B2 8 5 8 2 NA
B B3 2 6 2 NA NA
C C1 1 7 1 6 3
C C2 6 8 6 3 0
C C3 3 9 3 0 NA
C C4 0 10 0 NA NA
Right now I'm using do.call(rbind), and even though it works fine, it takes far too long on my real data with a bit over 1 million rows (around 6 hours). I also tried dplyr's bind_rows function, but it took a bit longer, so I stuck with the do.call option. Here's an example of the code I'm using:
# Function
TranspNumVal <- function(i){
Id <- DF1[i, "Key1"]
IdCons <- DF1[i, "ConsId"]
myvect <- as.matrix(filter(DF1, Id == Key1, ConsId >= IdCons) %>% select(NumVal))
Result <- as.data.frame(t(myvect[1:3]))
return(Result)
}
# Applying the function to the entire data frame
DF2 <- do.call(rbind, lapply(1:NROW(DF1), function(i) TranspNumVal(i)))
DF3 <- cbind(DF1, DF2)
Maybe changing the class is causing the code to be so inefficient, or maybe I'm just not finding a better way to vectorize my problem (you don't want to know how long it took with a nested loop), I'm fairly new to R and have just started fooling around with dplyr, so I'm open to any suggestion about how to optimize my code
We can use dplyr::lead
DF1 %>%
group_by(Key1) %>%
mutate(
V1 = NumVal,
V2 = lead(NumVal, n = 1),
V3 = lead(NumVal, n = 2))
## A tibble: 10 x 7
## Groups: Key1 [3]
# Key1 Key2 NumVal ConsId V1 V2 V3
# <chr> <chr> <chr> <int> <chr> <chr> <chr>
# 1 A A1 2 1 2 4 1
# 2 A A2 4 2 4 1 NA
# 3 A A3 1 3 1 NA NA
# 4 B B1 3 4 3 8 2
# 5 B B2 8 5 8 2 NA
# 6 B B3 2 6 2 NA NA
# 7 C C1 1 7 1 6 3
# 8 C C2 6 8 6 3 0
# 9 C C3 3 9 3 0 NA
#10 C C4 0 10 0 NA NA
Explanation: We group entries by Key1 and then use lead to shift NumVal values for columns V2 and V3. V1 is simply a copy of NumVal.
A dplyr pipeline.
First utility function will filter a (NumVal) based on the values of b (ConsId):
myfunc1 <- function(a,b) {
n <- length(b)
lapply(seq_along(b), function(i) a[ b >= b[i] ])
}
Second utility function converts a ragged list into a data.frame. It works with arbitrary number of columns to append, but we've limited it to 3 based on your requirements:
myfunc2 <- function(x, ncols = 3) {
n <- min(ncols, max(lengths(x)))
as.data.frame(do.call(rbind, lapply(x, `length<-`, n)))
}
Now the pipeline:
DF1 %>%
group_by(Key1) %>%
mutate(lst = myfunc1(NumVal, ConsId)) %>%
ungroup() %>%
bind_cols(myfunc2(.$lst)) %>%
select(-lst) %>%
arrange(Key1, ConsId)
# # A tibble: 10 × 7
# Key1 Key2 NumVal ConsId V1 V2 V3
# <chr> <chr> <int> <int> <int> <int> <int>
# 1 A A1 2 1 2 4 1
# 2 A A2 4 2 4 1 NA
# 3 A A3 1 3 1 NA NA
# 4 B B1 3 4 3 8 2
# 5 B B2 8 5 8 2 NA
# 6 B B3 2 6 2 NA NA
# 7 C C1 1 7 1 6 3
# 8 C C2 6 8 6 3 0
# 9 C C3 3 9 3 0 NA
# 10 C C4 0 10 0 NA NA
After grouping by 'Key1', use shift (from data.table) to get the next values of 'NumVal' as a list, convert each element to a tibble, and unnest the nested list elements into individual columns of the dataset. By default, shift fills with NA at the end.
library(data.table)
library(tidyverse)
DF1 %>%
group_by(Key1) %>%
mutate(new = shift(NumVal, 0:(n()-1), type = 'lead') %>%
map(~
as.list(.x) %>%
set_names(paste0("V", seq_along(.))) %>%
as_tibble)) %>%
unnest %>%
select(-V4)
# A tibble: 10 x 7
# Groups: Key1 [3]
# Key1 Key2 NumVal ConsId V1 V2 V3
# <chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
# 1 A A1 2 1 2 4 1
# 2 A A2 4 2 4 1 NA
# 3 A A3 1 3 1 NA NA
# 4 B B1 3 4 3 8 2
# 5 B B2 8 5 8 2 NA
# 6 B B3 2 6 2 NA NA
# 7 C C1 1 7 1 6 3
# 8 C C2 6 8 6 3 0
# 9 C C3 3 9 3 0 NA
#10 C C4 0 10 0 NA NA
data
DF1 <- data.frame(Key1, Key2, NumVal, stringsAsFactors = FALSE) %>%
arrange(Key2)
DF1$ConsId <- 1:10
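Since the real data needs 12 new columns rather than 3, here is a sketch of how the shift idea might be parameterised in pure data.table. shift() accepts a vector of offsets and returns one vector per offset, so all lead columns can be assigned in a single grouped step. The names DT, k and cols are assumptions for the example, and the rows are assumed to be ordered by Key2 within Key1 as in the question.

```r
library(data.table)

Key1 <- c("A", "B", "C", "A", "C", "B", "B", "C", "A", "C")
Key2 <- c("A1", "B1", "C1", "A2", "C2", "B2", "B3", "C3", "A3", "C4")
NumVal <- c(2, 3, 1, 4, 6, 8, 2, 3, 1, 0)

DT <- data.table(Key1, Key2, NumVal)
setorder(DT, Key2)
DT[, ConsId := .I]

# Number of new columns; set to 12 for the real data
k <- 3
cols <- paste0("V", seq_len(k))

# One lead per offset 0..k-1, computed within each Key1 group
DT[, (cols) := shift(NumVal, n = 0:(k - 1), type = "lead"), by = Key1]
print(DT)
```

Because the work is done once per group instead of once per row, this should scale much better than the row-wise do.call(rbind, lapply(...)) approach, though I have not timed it on a million rows.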
