Expand a dataframe in R into a larger dataframe

I have used the code below to generate a data frame, df1:
df1 <- data.frame(
  ID   = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
  X_Fr = c(NA, NA, NA, NA, NA, 1, 2, 3, 4, 5),
  X_Ax = c(NA, NA, NA, NA, NA, .2, .3, .4, .2, .3),
  Y_Fr = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  Y_Ax = c(.1, .2, .3, .4, .1, .3, .4, .5, .2, .3),
  Z_Fr = c(1, 2, NA, NA, 3, 1, 3, 4, 5, 10),
  Z_Ax = c(.1, .2, NA, NA, .5, .1, .2, .4, .3, .5)
)
ID X_Fr X_Ax Y_Fr Y_Ax Z_Fr Z_Ax
1 A NA NA 1 0.1 1 0.1
2 A NA NA 2 0.2 2 0.2
3 A NA NA 3 0.3 NA NA
4 A NA NA 4 0.4 NA NA
5 A NA NA 5 0.1 3 0.5
6 B 1 0.2 1 0.3 1 0.1
7 B 2 0.3 2 0.4 3 0.2
8 B 3 0.4 3 0.5 4 0.4
9 B 4 0.2 4 0.2 5 0.3
10 B 5 0.3 5 0.3 10 0.5
I would like to expand the dataframe to give the following data frame as output
ID X_Fr Y_Fr Z_Fr X_Ax Y_Ax Z_Ax
1 A 1 1 1 NA 0.1 0.1
2 A 2 2 2 NA 0.2 0.2
3 A 3 3 3 NA 0.3 0.5
4 A 4 4 4 NA 0.4 NA
5 A 5 5 5 NA 0.1 NA
6 B 1 1 1 0.2 0.3 0.1
7 B 2 2 2 0.3 0.4 NA
8 B 3 3 3 0.4 0.5 0.2
9 B 4 4 4 0.2 0.2 0.4
10 B 5 5 5 0.3 0.3 0.3
11 B 6 6 6 NA NA NA
12 B 7 7 7 NA NA NA
13 B 8 8 8 NA NA NA
14 B 9 9 9 NA NA NA
15 B 10 10 10 NA NA 0.5
I have tried the following code to obtain the above dataframe
library(tidyr)
library(dplyr)
df2 <- df1 %>%
  complete(ID, nesting(X_Fr = full_seq(na.omit(c(X_Fr, Y_Fr, Z_Fr)), 1),
                       Y_Fr = full_seq(na.omit(c(X_Fr, Y_Fr, Z_Fr)), 1),
                       Z_Fr = full_seq(na.omit(c(X_Fr, Y_Fr, Z_Fr)), 1)))
I am unable to obtain this result. Could someone take a look?

I think it's a two-step process, so I created two new data frames, processed them, and joined them afterwards.
library(dplyr)
library(tidyr)
df1x <- df1 %>% select(ID, matches('^X|^Y'))  # select ID and any column starting with X or Y
df1y <- df1 %>% select(ID, matches('^Z'))
df1y %>%
  group_by(ID) %>%  # group by the ID column
  arrange(Z_Fr, .by_group = TRUE) %>%  # sort Z_Fr ascending so row_number() lines up
  mutate(Z_Fr = coalesce(Z_Fr, as.numeric(row_number()))) %>%  # fill NAs in Z_Fr with the row number; see ?dplyr::row_number
  ungroup() %>%  # complete() needs an ungrouped data frame
  complete(ID, nesting(Z_Fr = full_seq(Z_Fr, 1))) %>%
  left_join(df1x, by = c('ID', 'Z_Fr' = 'Y_Fr'))  # join on ID, matching Z_Fr (from df1y) to Y_Fr (from df1x)
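Alternatively, a minimal sketch using pivot_longer()/pivot_wider() (tidyr >= 1.0; this is not part of the original answer and assumes every *_Fr column should share the same 1..max frequency sequence within each ID, as in the desired output):
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(-ID, names_to = c("var", ".value"), names_sep = "_") %>%
  filter(!is.na(Fr)) %>%  # drop measurements that have no frequency value
  group_by(ID) %>%
  complete(var, Fr = full_seq(Fr, 1)) %>%  # fill the 1..max frequency grid within each ID
  ungroup() %>%
  pivot_wider(names_from = var, values_from = Ax) %>%
  mutate(X_Fr = Fr, Y_Fr = Fr, Z_Fr = Fr) %>%  # the *_Fr columns coincide in the target layout
  select(ID, X_Fr, Y_Fr, Z_Fr, X_Ax = X, Y_Ax = Y, Z_Ax = Z)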

Related

Different random numbers when two conditions are met in R

I have a data frame with three columns, Distance, Age, and Value, where there are three repeated Value entries for every Distance and Age combination. I would like to generate a random number for Value for certain Distance and Age combinations. I can get a random number to generate; however, it is the same random number repeated, and I need three different random numbers.
Example Data
set.seed(321)
dat <- data.frame(matrix(ncol = 3, nrow = 27))
colnames(dat)[1:3] <- c("Distance", "Age", "Value")
dat$Distance <- rep(c(0.5,1.5,2.5), each = 9)
dat$Age <- rep(1:3, times = 9)
The code below creates a random number for the Distance and Age combination, but the random number is the same for each of the three measurements; they should be different random numbers.
dat$Value <- ifelse(dat$Distance == '0.5' & dat$Age == '1',
                    rep(rnorm(3, 10, 2), 3), NA)
Instead of getting the same repeated random number for the Distance and Age combo
head(dat)
Distance Age Value
1 0.5 1 13.40981
2 0.5 2 NA
3 0.5 3 NA
4 0.5 1 13.40981
5 0.5 2 NA
6 0.5 3 NA
I would like different random numbers for the Distance and Age combo
head(dat)
Distance Age Value
1 0.5 1 13.40981
2 0.5 2 NA
3 0.5 3 NA
4 0.5 1 11.18246
5 0.5 2 NA
6 0.5 3 NA
The numbers for Value don't really matter and are for demonstration purposes only.
Replace rep(rnorm(3,10,2),3) with rnorm(nrow(dat), 10, 2).
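For instance, a minimal sketch of that change applied to the original line (drawing one value per row, so each matching row gets its own random number and non-matching rows stay NA):
dat$Value <- ifelse(dat$Distance == 0.5 & dat$Age == 1,
                    rnorm(nrow(dat), 10, 2), NA)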
Something like this?
library(dplyr)
dat %>%
  mutate(Value = ifelse(Distance == 0.5 & Age == 1,
                        sample(1000, nrow(dat), replace = TRUE), NA))
Distance Age Value
1 0.5 1 478
2 0.5 2 NA
3 0.5 3 NA
4 0.5 1 707
5 0.5 2 NA
6 0.5 3 NA
7 0.5 1 653
8 0.5 2 NA
9 0.5 3 NA
10 1.5 1 NA
11 1.5 2 NA
12 1.5 3 NA
13 1.5 1 NA
14 1.5 2 NA
15 1.5 3 NA
16 1.5 1 NA
17 1.5 2 NA
18 1.5 3 NA
19 2.5 1 NA
20 2.5 2 NA
21 2.5 3 NA
22 2.5 1 NA
23 2.5 2 NA
24 2.5 3 NA
25 2.5 1 NA
26 2.5 2 NA
27 2.5 3 NA
You can eliminate the ifelse():
idx <- dat$Distance == '0.5' & dat$Age == '1'
dat$Value[idx] <- rnorm(sum(idx), 10, 2)
head(dat, 7)
# Distance Age Value
# 1 0.5 1 10.91214
# 2 0.5 2 NA
# 3 0.5 3 NA
# 4 0.5 1 10.84067
# 5 0.5 2 NA
# 6 0.5 3 NA
# 7 0.5 1 11.15517

Problems with Rollapplyr function due to NA values (that should not be removed) in R

I have a dataframe:
date comp ei
1 1/1/73 A NA
2 1/4/73 A 0.6
3 1/7/73 A 0.7
4 1/10/73 A 0.9
5 1/1/74 A 0.4
6 1/4/74 A 0.5
7 1/7/74 A 0.7
8 1/10/74 A 0.7
9 1/1/75 A 0.4
10 1/4/75 A 0.5
11 1/1/73 B 0.8
12 1/4/73 B 0.8
13 1/7/73 B 0.5
14 1/10/73 B 0.6
15 1/1/74 B 0.3
16 1/4/74 B 0.2
17 1/1/73 C NA
18 1/4/73 C 0.6
19 1/7/73 C 0.4
20 1/10/73 C 0.8
21 1/1/74 C 0.7
22 1/4/74 C 0.9
23 1/7/74 C 0.4
24 1/10/74 C 0.3
I want to calculate the rolling standard deviation of ei grouped by comp, using the last 8 lines; if only 6 lines exist so far, it should still take the rolling standard deviation of those. So I use width = 8 and partial = 6 in this code:
library(zoo)
roll <- function(z) rollapplyr(z, width = 8, FUN = sd, fill = NA, partial = 6)
df <- transform(df, roll = ave(ei, comp, FUN = roll))
However, because some of my ei values are NA, the partial argument doesn't help: whenever there is an NA among the past 8 lines, the standard deviation comes out NA even once 6 lines are available. Only for comp = B does partial = 6 work. The results are shown below:
date comp ei roll
1 1/1/73 A NA NA
2 1/4/73 A 0.6 NA
3 1/7/73 A 0.7 NA
4 1/10/73 A 0.9 NA
5 1/1/74 A 0.4 NA
6 1/4/74 A 0.5 NA
7 1/7/74 A 0.7 NA
8 1/10/74 A 0.7 NA
9 1/1/75 A 0.4 0.1726888
10 1/4/75 A 0.5 0.1772811
11 1/1/73 B 0.8 NA
12 1/4/73 B 0.8 NA
13 1/7/73 B 0.5 NA
14 1/10/73 B 0.6 NA
15 1/1/74 B 0.3 NA
16 1/4/74 B 0.2 0.2503331
17 1/1/73 C NA NA
18 1/4/73 C 0.6 NA
19 1/7/73 C 0.4 NA
20 1/10/73 C 0.8 NA
21 1/1/74 C 0.7 NA
22 1/4/74 C 0.9 NA
23 1/7/74 C 0.4 NA
24 1/10/74 C 0.3 NA
I would rather have my results look as below, where the first standard deviation for comp A is calculated in line 7 from the previous 6 non-NA values, and where comp C has a standard deviation in lines 23 and 24:
date comp ei roll
1 1/1/73 A NA NA
2 1/4/73 A 0.6 NA
3 1/7/73 A 0.7 NA
4 1/10/73 A 0.9 NA
5 1/1/74 A 0.4 NA
6 1/4/74 A 0.5 NA
7 1/7/74 A 0.7 0.1751190
8 1/10/74 A 0.7 0.1618347
9 1/1/75 A 0.4 0.1726888
10 1/4/75 A 0.5 0.1772811
11 1/1/73 B 0.8 NA
12 1/4/73 B 0.8 NA
13 1/7/73 B 0.5 NA
14 1/10/73 B 0.6 NA
15 1/1/74 B 0.3 NA
16 1/4/74 B 0.2 0.2503331
17 1/1/73 C NA NA
18 1/4/73 C 0.6 NA
19 1/7/73 C 0.4 NA
20 1/10/73 C 0.8 NA
21 1/1/74 C 0.7 NA
22 1/4/74 C 0.9 NA
23 1/7/74 C 0.4 0.2065591
24 1/10/74 C 0.3 0.2267787
How can I do this without running na.omit before calculating the rolling standard deviation? The reason I don't want to remove NAs is that I need the lines with comp and dates (plus other columns in my real dataset). Also, in my real dataset, removing NA values might drop rows in the middle of a period, so the rolling standard deviation would no longer line up with the dates and my results would be wrong.
Is there a way to deal with this without removing the NA values?
1) FUN computes sd if there are at least 6 non-NAs and otherwise returns NA.
Then proceed as in the question.
library(zoo)
df$date <- as.Date(df$date, "%d/%m/%y")
FUN <- function(x) if (length(na.omit(x)) >= 6) sd(x, na.rm = TRUE) else NA
roll <- function(z) rollapplyr(z, width = 8, FUN = FUN,
                               fill = NA, partial = 6)
transform(df, roll = ave(ei, comp, FUN = roll))
2) The other possibility is to use na.omit and then merge the result back with the original data frame.
library(zoo)
df$date <- as.Date(df$date, "%d/%m/%y")
roll <- function(z) rollapplyr(z, width = 8, FUN = sd, fill = NA, partial = 6)
df_roll_0 <- transform(na.omit(df), roll = ave(ei, comp, FUN = roll))
df_roll_m <- merge(df, df_roll_0, all = TRUE)
o <- with(df_roll_m, order(comp, date))
df_roll <- df_roll_m[o, ]
2a) This could also be expressed using dplyr/tidyr:
library(dplyr)
library(tidyr)
library(zoo)
df$date <- as.Date(df$date, "%d/%m/%y")
roll <- function(z) rollapplyr(z, width = 8, FUN = sd, fill = NA, partial = 6)
df_roll_0 <- df %>%
  drop_na() %>%
  group_by(comp) %>%
  mutate(roll = roll(ei)) %>%
  ungroup()
df %>%
  left_join(df_roll_0)
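A small hedged follow-up: the left_join() above matches on every shared column (date, comp, ei) and prints a joining message; spelling out the keys keeps the match explicit:
df %>%
  left_join(df_roll_0, by = c("date", "comp", "ei"))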
Note
Lines <- " date comp ei
1 1/1/73 A NA
2 1/4/73 A 0.6
3 1/7/73 A 0.7
4 1/10/73 A 0.9
5 1/1/74 A 0.4
6 1/4/74 A 0.5
7 1/7/74 A 0.7
8 1/10/74 A 0.7
9 1/1/75 A 0.4
10 1/4/75 A 0.5
11 1/1/73 B 0.8
12 1/4/73 B 0.8
13 1/7/73 B 0.5
14 1/10/73 B 0.6
15 1/1/74 B 0.3
16 1/4/74 B 0.2
17 1/1/73 C NA
18 1/4/73 C 0.6
19 1/7/73 C 0.4
20 1/10/73 C 0.8
21 1/1/74 C 0.7
22 1/4/74 C 0.9
23 1/7/74 C 0.4
24 1/10/74 C 0.3"
df <- read.table(text = Lines)

Adding a new column (group) to a data frame based on two other data frames

I'm new to R. I want to add a new column to df1 based on df2 and df3.
df1
ind t1 t2
1 12 0.2
2 NA 0.3
3 9 0.3
4 11 0.4
5 11 0.3
6 10 0.4
7 14 0.3
df2 and df3 are subsets of df1
df2
ind t1
1 12
2 NA
3 9
df3
ind t1
4 11
7 14
8 12
desired output
df1
ind t1 t2 group
1 12 0.2 df2
2 NA 0.3 df2
3 9 0.3 df2
4 11 0.4 df3
5 11 0.3 NA
6 10 0.4 NA
7 14 0.3 df3
Is there any way to get the desired output using a for loop with an if statement?
Hi, welcome to SO. Please provide a dput() of your data, but you can do:
library(tidyverse)
# label each subset so the join carries the source name into a 'group' column
df_class <- bind_rows(list(df2 = df2, df3 = df3), .id = "group")
df_result <- df1 %>%
  left_join(df_class)
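Since the question asks specifically about a for loop with an if statement, here is a minimal base-R sketch (not part of the original answer; it assumes the ind column decides membership and uniquely identifies rows):
df1$group <- NA_character_
for (i in seq_len(nrow(df1))) {
  if (df1$ind[i] %in% df2$ind) {
    df1$group[i] <- "df2"
  } else if (df1$ind[i] %in% df3$ind) {
    df1$group[i] <- "df3"
  }
}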

Lag variable by group/time indicator in dplyr

I have data that looks like this:
set.seed(13)
dt <- data.frame(group = c(rep("a", 3), rep("b", 4), rep("c", 3)), var = c(rep(0.1,3), rep(0.3, 4), rep(1.1,3)))
dt
group var
1 a 0.1
2 a 0.1
3 a 0.1
4 b 0.3
5 b 0.3
6 b 0.3
7 b 0.3
8 c 1.1
9 c 1.1
10 c 1.1
I'd like to lag the var variable for all respondents in the grouping variable group. One difficulty is that the groups are of different sizes; otherwise this would be no problem, specifying n as the common group size. My data should look as shown below. How do I do this using dplyr, for example?
group var lag1.var lag2.var
1 a 0.1 NA NA
2 a 0.1 NA NA
3 a 0.1 NA NA
4 b 0.3 0.1 NA
5 b 0.3 0.1 NA
6 b 0.3 0.1 NA
7 b 0.3 0.1 NA
8 c 1.1 0.3 0.1
9 c 1.1 0.3 0.1
10 c 1.1 0.3 0.1
You can create a tibble with the lag variables for each group and then merge it with dt. Try this:
left_join(dt,
          dt %>%
            group_by(group) %>%
            mutate(var = first(var)) %>%
            distinct() %>%
            ungroup() %>%
            mutate(lag1.var = lag(var, order_by = group),
                   lag2.var = lag(lag1.var, order_by = group)) %>%
            select(-var),
          by = "group")
# output
group var lag1.var lag2.var
1 a 0.1 NA NA
2 a 0.1 NA NA
3 a 0.1 NA NA
4 b 0.3 0.1 NA
5 b 0.3 0.1 NA
6 b 0.3 0.1 NA
7 b 0.3 0.1 NA
8 c 1.1 0.3 0.1
9 c 1.1 0.3 0.1
10 c 1.1 0.3 0.1
This assumes that var is always the same within each group.
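If var can vary within a group, a hedged variant (not from the original answer) that first summarises each group to its first value avoids relying on distinct():
grp_lags <- dt %>%
  group_by(group) %>%
  summarise(var_first = first(var)) %>%  # one row per group
  mutate(lag1.var = lag(var_first),
         lag2.var = lag(var_first, 2)) %>%
  select(-var_first)
left_join(dt, grp_lags, by = "group")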
Here is another option. First we nest by group, then we map out the lagged values and then unnest.
library(tidyverse)
dt %>%
  nest(-group) %>%
  mutate(lag1.var = map_dbl(data, ~ .x$var[[1]]) %>% lag(.),
         lag2.var = lag(lag1.var)) %>%
  unnest
#> group lag1.var lag2.var var
#> 1 a NA NA 0.1
#> 2 a NA NA 0.1
#> 3 a NA NA 0.1
#> 4 b 0.1 NA 0.3
#> 5 b 0.1 NA 0.3
#> 6 b 0.1 NA 0.3
#> 7 b 0.1 NA 0.3
#> 8 c 0.3 0.1 1.1
#> 9 c 0.3 0.1 1.1
#> 10 c 0.3 0.1 1.1
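A hedged side note: on tidyr 1.0 and later, the bare nest(-group) and unnest calls above trigger deprecation warnings; an equivalent sketch in the newer syntax might look like this:
library(tidyverse)
dt %>%
  nest(data = -group) %>%  # nest everything except group
  mutate(lag1.var = lag(map_dbl(data, ~ .x$var[[1]])),  # lag the first var of each group
         lag2.var = lag(lag1.var)) %>%
  unnest(data)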

Lagging variable by group does not work in dplyr

I'm desperately trying to lag a variable by group. I found this post that deals with essentially the same problem I'm facing, but the solution does not work for me; I have no idea why.
This is my problem:
library(dplyr)
df <- data.frame(monthvec = c(rep(1:2, 2), rep(3:5, 3)))
df <- df %>%
  arrange(monthvec) %>%
  mutate(growth = ifelse(monthvec == 1, 0.3,
                  ifelse(monthvec == 2, 0.5,
                  ifelse(monthvec == 3, 0.7,
                  ifelse(monthvec == 4, 0.1,
                  ifelse(monthvec == 5, 0.6, NA))))))
df %>%
  group_by(monthvec) %>%
  mutate(lag.growth = lag(growth, order_by = monthvec))
Source: local data frame [13 x 3]
Groups: monthvec [5]
monthvec growth lag.growth
<int> <dbl> <dbl>
1 1 0.3 NA
2 1 0.3 0.3
3 2 0.5 NA
4 2 0.5 0.5
5 3 0.7 NA
6 3 0.7 0.7
7 3 0.7 0.7
8 4 0.1 NA
9 4 0.1 0.1
10 4 0.1 0.1
11 5 0.6 NA
12 5 0.6 0.6
13 5 0.6 0.6
This is what I'd like it to be in the end:
df$lag.growth <- c(NA, NA, 0.3, 0.3, 0.5, 0.5, 0.5, 0.7,0.7,0.7, 0.1,0.1,0.1)
monthvec growth lag.growth
1 1 0.3 NA
2 1 0.3 NA
3 2 0.5 0.3
4 2 0.5 0.3
5 3 0.7 0.5
6 3 0.7 0.5
7 3 0.7 0.5
8 4 0.1 0.7
9 4 0.1 0.7
10 4 0.1 0.7
11 5 0.6 0.1
12 5 0.6 0.1
13 5 0.6 0.1
I believe that one problem is that my groups are not of equal length...
Thanks for helping out.
Here is an idea. We group by monthvec in order to get the number of rows (cnt) of each group. We ungroup and use the first value of cnt as the size of the lag. We regroup on monthvec and replace the values in each group with the first value of each group.
library(dplyr)
df %>%
  group_by(monthvec) %>%
  mutate(cnt = n()) %>%
  ungroup() %>%
  mutate(lag.growth = lag(growth, first(cnt))) %>%
  group_by(monthvec) %>%
  mutate(lag.growth = first(lag.growth)) %>%
  select(-cnt)
which gives,
# A tibble: 13 x 3
# Groups: monthvec [5]
monthvec growth lag.growth
<int> <dbl> <dbl>
1 1 0.3 NA
2 1 0.3 NA
3 2 0.5 0.3
4 2 0.5 0.3
5 3 0.7 0.5
6 3 0.7 0.5
7 3 0.7 0.5
8 4 0.1 0.7
9 4 0.1 0.7
10 4 0.1 0.7
11 5 0.6 0.1
12 5 0.6 0.1
13 5 0.6 0.1
You may join your original data with a data frame that has a shifted monthvec.
left_join(df, df %>% mutate(monthvec = monthvec + 1) %>% unique(), by = "monthvec")
# monthvec growth.x growth.y
# 1 1 0.3 NA
# 2 1 0.3 NA
# 3 2 0.5 0.3
# 4 2 0.5 0.3
# 5 3 0.7 0.5
# 6 3 0.7 0.5
# 7 3 0.7 0.5
# 8 4 0.1 0.7
# 9 4 0.1 0.7
# 10 4 0.1 0.7
# 11 5 0.6 0.1
# 12 5 0.6 0.1
# 13 5 0.6 0.1
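If the joined column should carry the name from the desired output, a hedged follow-up (using dplyr's rename()) is to relabel the suffixed columns:
left_join(df, df %>% mutate(monthvec = monthvec + 1) %>% unique(), by = "monthvec") %>%
  rename(growth = growth.x, lag.growth = growth.y)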
