I have a dataframe with missing data for some rows. The correct data can be found in another set of columns. I want to replace the NAs with the correct data.
My data looks like this:
df <- data.frame(M_1=c(1,NA,3,NA,6),
M_2=c(5,NA,3,NA,1),
M_3=c(6,NA,2,NA,4),
M_C_1=c(NA,2,NA,6,NA),
M_C_2=c(NA,1,NA,4,NA),
M_C_3=c(NA,7,NA,3,NA))
df
# M_1 M_2 M_3 M_C_1 M_C_2 M_C_3
#1 1 5 6 NA NA NA
#2 NA NA NA 2 1 7
#3 3 3 2 NA NA NA
#4 NA NA NA 6 4 3
#5 6 1 4 NA NA NA
For all records, I either have a complete set of records for variables M_1, M_2, and M_3
or
I have a complete set for variables M_C_1, M_C_2, and M_C_3.
For each row that has NAs in the first set of variables (M_1:M_3), I would like to replace them with the values from the second set (M_C_1:M_C_3).
I don't need to retain the second set of values.
So my desired data frame would look like:
df
# M_1 M_2 M_3
#1 1 5 6
#2 2 1 7
#3 3 3 2
#4 6 4 3
#5 6 1 4
My real dataset contains many columns in this notation, so I need a general solution (i.e., I don't want to refer to each column individually).
I would like to do this with dplyr if possible.
You could use map2 + coalesce:
library(dplyr)
library(purrr)
map2_dfc(select(df, 1:3), select(df, 4:6), coalesce)
# # A tibble: 5 × 3
# M_1 M_2 M_3
# <dbl> <dbl> <dbl>
# 1 1 5 6
# 2 2 1 7
# 3 3 3 2
# 4 6 4 3
# 5 6 1 4
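Since the two selections above are paired by position, the result depends on the columns being in matching order. Here is a minimal sketch that pairs them by name instead, assuming (as in the example data) that every column M_<n> has a partner called M_C_<n>. The names targets, sources and result are only illustrative.

```r
library(dplyr)

df <- data.frame(M_1 = c(1, NA, 3, NA, 6),
                 M_2 = c(5, NA, 3, NA, 1),
                 M_3 = c(6, NA, 2, NA, 4),
                 M_C_1 = c(NA, 2, NA, 6, NA),
                 M_C_2 = c(NA, 1, NA, 4, NA),
                 M_C_3 = c(NA, 7, NA, 3, NA))

# columns to fill: "M_" followed only by digits
targets <- grep("^M_[0-9]+$", names(df), value = TRUE)
# assumed naming rule for the partner columns: M_1 -> M_C_1
sources <- sub("^M_", "M_C_", targets)
stopifnot(all(sources %in% names(df)))

# coalesce each target column with its partner, pair by pair
result <- as.data.frame(Map(coalesce, df[targets], df[sources]))
result
```

If the real columns follow a different naming rule, only the two regular expressions should need to change.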
Here is another option, using across2() from dplyover:
library(dplyover)
library(stringr)
df %>%
transmute(across2(1:3, 4:6, coalesce,
.names_fn = ~ str_remove(.x, "(?<=\\d)_.*") ))
M_1 M_2 M_3
1 1 5 6
2 2 1 7
3 3 3 2
4 6 4 3
5 6 1 4
Here's a generic example that works if the columns have names that allow them to be identified and they are in the correct order.
library(dplyr)
df <- data.frame(M_1=c(1,NA,3,NA,6),
M_2=c(5,NA,3,NA,1),
M_3=c(6,NA,2,NA,4),
M_C_1=c(NA,2,NA,6,NA),
M_C_2=c(NA,1,NA,4,NA),
M_C_3=c(NA,7,NA,3,NA))
# make a temporary id so we can retain the order later
df <- df %>% mutate(temporary_id = 1:n())
# find the columns corresponding to the final data
# they are assumed to be of the form M_number
df_records <-
df %>%
select(matches('temporary_id|M_[0-9]+')) %>%
na.omit()
# find the extra columns with data to replace in the final data
# they are assumed to be of the form M_C_number
df_extra <-
df %>%
select(matches('temporary_id|M_C_[0-9]+')) %>%
na.omit()
# change the names of the extra columns to match the final data
# this only works if the columns are in the correct order in the original data frame
names(df_extra) <- names(df_records)
# bind the rows of the final and extra data, sort and remove the temporary id
final_df <-
df_records %>%
bind_rows(df_extra) %>%
arrange(temporary_id) %>%
select(-temporary_id)
final_df
# M_1 M_2 M_3
#1 1 5 6
#2 2 1 7
#3 3 3 2
#4 6 4 3
#5 6 1 4
If they are not in the required order, some sorting could be done but I'll leave that for now.
I have two data frames, df1 and df2, whose column names are dates. When I join them I get columns like
date1.x, date1.y, date2.x, date2.y, date3.x, date3.y, date4.x, date4.y, ...
I want to create new columns that hold the product of date1.x and date1.y, and likewise for the other date pairs.
df <- data.frame(id=11:13, date1.x=1:3, date2.x=4:6, date1.y=7:9, date2.y=10:12)
df
# id date1.x date2.x date1.y date2.y
# 1 11 1 4 7 10
# 2 12 2 5 8 11
# 3 13 3 6 9 12
grep("^date.*\\.x$", colnames(df), value = TRUE)
# [1] "date1.x" "date2.x"
datenms <- grep("^date.*\\.x$", colnames(df), value = TRUE)
### make sure all of our 'date#.x' columns have matching 'date#.y' columns
datenms <- datenms[ gsub("x$", "y", datenms) %in% colnames(df) ]
datenms
# [1] "date1.x" "date2.x"
subset(df, select = datenms)
# date1.x date2.x
# 1 1 4
# 2 2 5
# 3 3 6
subset(df, select = gsub("x$", "y", datenms))
# date1.y date2.y
# 1 7 10
# 2 8 11
# 3 9 12
subset(df, select = datenms) * subset(df, select = gsub("x$", "y", datenms))
# date1.x date2.x
# 1 7 40
# 2 16 55
# 3 27 72
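The question asked for the products as new columns. A minimal base R sketch of that last step, reusing the same matching logic. The ".prod" suffix and the helper names (xcols, ycols, prodcols) are arbitrary choices, not anything from the original post.

```r
df <- data.frame(id = 11:13, date1.x = 1:3, date2.x = 4:6,
                 date1.y = 7:9, date2.y = 10:12)

# find the ".x" columns and derive the names of their ".y" partners
xcols <- grep("^date.*\\.x$", names(df), value = TRUE)
ycols <- sub("x$", "y", xcols)

# keep only the pairs where the ".y" partner actually exists
has_pair <- ycols %in% names(df)
xcols <- xcols[has_pair]
ycols <- ycols[has_pair]

# multiply pairwise and attach the results under new names
prodcols <- sub("\\.x$", ".prod", xcols)
df[prodcols] <- df[xcols] * df[ycols]
df
```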
There are a number of ways to do this, but I suggest getting used to transforming your data into a format that is easy to work with. The first answer showed you one way to do what you want without transforming your data. My answer shows how to transform the data so that calculations (this one and others) are easy, and then how to perform the calculation once the data is tidy.
Making your data tidy helps to perform easier aggregations, to graph results, to perform feature engineering for models, etc.
library(dplyr)
library(tidyr)
df <- data.frame(id=11:13, date1.x=1:3, date2.x=4:6, date1.y=7:9, date2.y=10:12)
df
# id date1.x date2.x date1.y date2.y
# 1 11 1 4 7 10
# 2 12 2 5 8 11
# 3 13 3 6 9 12
# Convert the data to a tidy format that is easier for computers to calculate
tidy_df <- df %>%
pivot_longer(
cols = starts_with("date"), # We are tidying any column starting with date
names_to = c("date_num","date_source"), # creating two columns for names
values_to = c("date_value"), # creating one column for values
names_prefix = "date", # removing the "date" prefix
names_sep = "\\." # splitting the names on the period `.`
)
tidy_df
# id date_num date_source date_value
# <int> <chr> <chr> <int>
# 1 11 1 x 1
# 2 11 2 x 4
# 3 11 1 y 7
# 4 11 2 y 10
# 5 12 1 x 2
# 6 12 2 x 5
# 7 12 1 y 8
# 8 12 2 y 11
# 9 13 1 x 3
# 10 13 2 x 6
# 11 13 1 y 9
# 12 13 2 y 12
# Now that the data is tidy we can do easier dataframe grouping and aggregation
tidy_df %>%
group_by(id,date_num) %>%
summarise(date_value_mult = prod(date_value)) %>%
ungroup()
# id date_num date_value_mult
# <int> <chr> <dbl>
# 1 11 1 7
# 2 11 2 40
# 3 12 1 16
# 4 12 2 55
# 5 13 1 27
# 6 13 2 72
# If/When you eventually want the data in a more human readable format you can
# pivot the data back into a human readable format. This is likely after all
# computer calculations are done and you want to present the data. For storing
# the data (such as in a database) you would not need/want this step.
tidy_df %>%
group_by(id,date_num) %>%
summarise(date_value_mult = prod(date_value)) %>%
ungroup() %>%
pivot_wider(
names_from = date_num,
values_from = date_value_mult,
names_prefix = "date"
)
# id date1 date2
# <int> <dbl> <dbl>
# 1 11 7 40
# 2 12 16 55
# 3 13 27 72
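A variation on the same idea, offered as a sketch rather than a replacement: pivot_longer() accepts the special name ".value" in names_to, which keeps the x and y parts as separate columns so that each pair ends up side by side on one row. The names paired_df and date_value_mult are illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(id = 11:13, date1.x = 1:3, date2.x = 4:6,
                 date1.y = 7:9, date2.y = 10:12)

paired_df <- df %>%
  pivot_longer(
    cols = starts_with("date"),
    names_to = c("date_num", ".value"), # ".value": the suffix (x / y) becomes a column
    names_prefix = "date",
    names_sep = "\\."
  ) %>%
  mutate(date_value_mult = x * y)       # one row per id and date, so a plain product works

paired_df
```

This keeps one row per id and date number, which can be handy when the two sources need to be combined in ways other than a product.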
I have an incomplete time series dataframe and I need to insert rows of NAs for missing time stamps. There should always be 6 time stamps per day, which is indicated by the variable "Signal" (1-6) in the dataframe. I am trying to merge the incomplete dataframe A with a vector B containing all Signals. Simplified example data below:
B <- rep(1:6,2)
A <- data.frame(Signal = c(1,2,3,5,1,2,4,5,6), var1 = c(1,1,1,1,1,1,1,1,1))
Expected <- data.frame(Signal = c(1,2,3,NA, 5, NA, 1,2,NA,4,5,6), var1 = c(1,1,1,NA,1,NA,1,1,NA,1,1,1))
Note that B represents a dataframe with multiple variables and the NAs in Expected are rows of NAs in the dataframe. Also the actual dataframe has more observations (84 in total).
Would be awesome if you guys could help me out!
If you already know there are 6 timestamps in a day you can do this without B. We can create groups for each day and use complete to add the missing observations with NA.
library(dplyr)
library(tidyr)
A %>%
group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
complete(Signal = 1:6) %>%
ungroup() %>%
select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 NA
# 5 5 1
# 6 6 NA
# 7 1 1
# 8 2 1
# 9 3 NA
#10 4 1
#11 5 1
#12 6 1
If you also need Signal to be NA for the missing combinations, you can use:
A %>%
group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
complete(Signal = 1:6) %>%
mutate(Signal = replace(Signal, is.na(var1), NA)) %>%
ungroup() %>%
select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 NA NA
# 5 5 1
# 6 NA NA
# 7 1 1
# 8 2 1
# 9 NA NA
#10 4 1
#11 5 1
#12 6 1
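The grouping expression does most of the work here, so it may help to look at it on its own. The base R sketch below computes the same day index and then fills the gaps with expand.grid() and merge() instead of complete(). It assumes every day has at least one row and that Signal increases within a day. The names day, grid and out are illustrative.

```r
A <- data.frame(Signal = c(1, 2, 3, 5, 1, 2, 4, 5, 6), var1 = rep(1, 9))

# a new day starts whenever Signal drops compared with the previous row
A$day <- cumsum(c(TRUE, diff(A$Signal) < 0))
A$day
# [1] 1 1 1 1 2 2 2 2 2

# every (day, Signal) combination that should exist
grid <- expand.grid(Signal = as.numeric(1:6), day = unique(A$day))

# left join: combinations missing from A get NA in var1
out <- merge(grid, A, by = c("day", "Signal"), all.x = TRUE)
out <- out[order(out$day, out$Signal), ]
out
```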
Suppose I have the following data and data frame:
sample_data <- c(1:14)
sample_data2 <- c(NA,NA,NA, "break", NA, NA, "break", NA,NA,NA,NA,NA,NA,"break")
sample_df <- as.data.frame(sample_data)
sample_df$sample_data2 <- sample_data2
When I print this data frame, the results are as follows:
sample_data sample_data2
1 1 <NA>
2 2 <NA>
3 3 <NA>
4 4 break
5 5 <NA>
6 6 <NA>
7 7 break
8 8 <NA>
9 9 <NA>
10 10 <NA>
11 11 <NA>
12 12 <NA>
13 13 <NA>
14 14 break
How would I program it so that at every "break", it outputs the max from that row up? For instance, I would want the code to output the set of (4,7,14). Additionally, it should only look for the max within each interval, that is, from just after the previous "break" up to the next "break".
I apologize in advance if I used any incorrect nomenclature.
I build the groups by looking for the word "break" and shifting the resulting index by one row, so that each "break" row closes its own group. Then a few dplyr commands get the max of every group.
library(dplyr)
sample_df_new <- sample_df %>%
mutate(group = c(1, cumsum(grepl("break", sample_data2)) + 1)[1:length(sample_data2)]) %>%
group_by(group) %>%
summarise(group_max = max(sample_data))
sample_df_new
# A tibble: 3 x 2
group group_max
<dbl> <dbl>
1 1 4
2 2 7
3 3 14
I have an answer using data.table:
library(data.table)
sample_df <- setDT(sample_df)
sample_df[,group := (rleid(sample_data2)-0.5)%/%2]
sample_df[,.(maxvalues = max(sample_data)),by = group]
group maxvalues
1: 0 4
2: 1 7
3: 2 14
The tricky part is (rleid(sample_data2)-0.5)%/%2: rleid creates an index that increases at each change in the column:
sample_data sample_data2 rleid
1: 1 NA 1
2: 2 NA 1
3: 3 NA 1
4: 4 break 2
5: 5 NA 3
6: 6 NA 3
7: 7 break 4
8: 8 NA 5
9: 9 NA 5
10: 10 NA 5
11: 11 NA 5
12: 12 NA 5
13: 13 NA 5
14: 14 break 6
If you subtract 0.5 from that index and take the integer quotient of the division by 2 (%/%), each run of NAs gets the same value as the "break" row that follows it. That gives a constant index you can use for the grouping operation:
sample_data sample_data2 group
1: 1 NA 0
2: 2 NA 0
3: 3 NA 0
4: 4 break 0
5: 5 NA 1
6: 6 NA 1
7: 7 break 1
8: 8 NA 2
9: 9 NA 2
10: 10 NA 2
11: 11 NA 2
12: 12 NA 2
13: 13 NA 2
14: 14 break 2
Then it is just a matter of taking the maximum for each group. You can easily translate it into dplyr if that is easier for you.
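A possible dplyr translation, sketched with a slightly different grouping rule: instead of rleid, it starts a new group on the row after each "break" via cumsum(lag(...)). That rule is a substitution, not the data.table logic above, and the column names is_break, group and maxvalues are illustrative.

```r
library(dplyr)

sample_df <- data.frame(
  sample_data = 1:14,
  sample_data2 = c(NA, NA, NA, "break", NA, NA, "break",
                   NA, NA, NA, NA, NA, NA, "break")
)

res <- sample_df %>%
  mutate(
    # TRUE only on the "break" rows (NA-safe)
    is_break = !is.na(sample_data2) & sample_data2 == "break",
    # the group counter goes up on the row *after* each break
    group = cumsum(dplyr::lag(is_break, default = FALSE))
  ) %>%
  group_by(group) %>%
  summarise(maxvalues = max(sample_data))

res
```

Because the counter only depends on where the breaks are, this version should also cope with two consecutive "break" rows, which the alternating rleid pattern does not assume.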
Here are 2 ways with base R. The trick is to define a grouping variable, grp.
grp <- !is.na(sample_df$sample_data2) & sample_df$sample_data2 == "break"
grp <- rev(cumsum(rev(grp)))
grp <- -1*grp + max(grp)
tapply(sample_df$sample_data, grp, max, na.rm = TRUE)
aggregate(sample_data ~ grp, sample_df, max, na.rm = TRUE)
Data.
This is simplified data creation code.
sample_data <- 1:14
sample_data2 <- c(NA,NA,NA, "break", NA, NA, "break", NA,NA,NA,NA,NA,NA,"break")
sample_df <- data.frame(sample_data, sample_data2)
Looks like there are lots of different ways of doing this. This is how I went about it:
rows <- which(sample_data2 == "break") #Get the row indices for where "break" appears
findmax <- function(maxrow) {
max(sample_data[1:maxrow])
} #Create a function that returns the max "up to" a given row
sapply(rows, findmax) #apply it for each of your rows
### [1] 4 7 14
Note that this works "up to" the given row. To get the maximum value between the two breaks would probably be easier with one of the other solutions, but you could also do it by looking at the j-1 row to jth row from the rows object.
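The j-1 to j idea mentioned above could look roughly like this. It is a sketch under the assumption that each interval starts on the row after the previous "break" (or on row 1) and ends on the "break" row itself. The names ends, starts and interval_max are illustrative.

```r
sample_data <- 1:14
sample_data2 <- c(NA, NA, NA, "break", NA, NA, "break",
                  NA, NA, NA, NA, NA, NA, "break")

ends <- which(sample_data2 == "break")  # rows where an interval ends: 4 7 14
starts <- c(1, head(ends, -1) + 1)      # rows where an interval starts: 1 5 8

# max of sample_data within each start:end interval
interval_max <- mapply(function(s, e) max(sample_data[s:e]), starts, ends)
interval_max
# [1]  4  7 14
```

With this example data the result equals the "up to" version because sample_data is increasing. The two would differ as soon as an earlier interval contains a larger value.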
Depending on whether you want the maximum "sample_data" value between the "sample_data2" == break rows to include the break row (e.g. row 1 to row 4) or exclude it (e.g. row 1 to row 3), you can do something like this with tidyverse:
Excluding the break rows:
library(dplyr)
library(tidyr) # for fill()
sample_df %>%
group_by(sample_data2) %>%
mutate(temp = ifelse(is.na(sample_data2), NA_character_, paste0(gl(length(sample_data2), 1)))) %>%
ungroup() %>%
fill(temp, .direction = "up") %>%
filter(is.na(sample_data2)) %>%
group_by(temp) %>%
summarise(res = max(sample_data))
temp res
<chr> <dbl>
1 1 3.
2 2 6.
3 3 13.
Including the break rows:
sample_df %>%
group_by(sample_data2) %>%
mutate(temp = ifelse(is.na(sample_data2), NA_character_, paste0(gl(length(sample_data2), 1)))) %>%
ungroup() %>%
fill(temp, .direction = "up") %>%
group_by(temp) %>%
summarise(res = max(sample_data))
temp res
<chr> <dbl>
1 1 4.
2 2 7.
3 3 14.
Both snippets create an ID variable called "temp" using gl() for the rows where "sample_data2" == break and then fill the NA rows upward with that ID. The first snippet then filters out the "sample_data2" == break rows and takes the maximum "sample_data" value per group, while the second takes the maximum per group including the "sample_data2" == break rows.
I have time series data that I'm predicting on, so I am creating lag variables to use in my statistical analysis. I'd like a quick way to create multiple variables given specific inputs so that I can easily cross-validate and compare models.
The following is example code that adds 2 lags for 2 different variables (4 total) given a certain category (A, B, C):
# Load dplyr
library(dplyr)
# create day, category, and 2 value vectors
days = 1:9
cats = rep(c('A','B','C'),3)
set.seed(19)
values1 = round(rnorm(9, 16, 4))
values2 = round(rnorm(9, 16, 16))
# create data frame
data = data.frame(days, cats, values1, values2)
# mutate new lag variables
LagVal = data %>% arrange(days) %>% group_by(cats) %>%
mutate(LagVal1.1 = lag(values1, 1)) %>%
mutate(LagVal1.2 = lag(values1, 2)) %>%
mutate(LagVal2.1 = lag(values2, 1)) %>%
mutate(LagVal2.2 = lag(values2, 2))
LagVal
days cats values1 values2 LagVal1.1 LagVal1.2 LagVal2.1 LagVal2.2
<int> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 A 16 -10 NA NA NA NA
2 2 B 14 24 NA NA NA NA
3 3 C 16 -6 NA NA NA NA
4 4 A 12 25 16 NA -10 NA
5 5 B 20 14 14 NA 24 NA
6 6 C 18 -5 16 NA -6 NA
7 7 A 21 2 12 16 25 -10
8 8 B 19 5 20 14 14 24
9 9 C 18 -3 18 16 -5 -6
My problem comes in at the # mutate new lag variables step, since I have about a dozen predictor variables that I would potentially want to lag up to 10 times (~13k row dataset), and I don't have the heart to create 120 new variables.
Here is my attempt at writing a function which mutates new variables given the inputs for data (dataset to mutate), variables (the variables you wish to lag), and lags (the number of lags per variable):
MultiMutate = function(data, variables, lags){
# select the data to be working with
FuncData = data
# Loop through desired variables to mutate
for (i in variables){
# Loop through number of desired lags
for (u in 1:lags){
FuncData = FuncData %>% arrange(days) %>% group_by(cats) %>%
# Mutate new variable for desired number of lags. Give new variable a name with the lag number appended
mutate(paste(i, u) = lag(i, u))
}
}
FuncData
}
To be honest I'm just sort of lost on how to get this to work. The ordering of my for-loops and overall logic makes sense, but the way the function takes characters into variables and the overall syntax seems way off. Is there a simple way to fix up this function to get my desired result?
In particular, I'm looking for:
A function like MultiMutate(data = data, variables = c(values1, values2), lags = 2) that would create the exact result of LagVal from above.
Dynamically naming the variables based on the variable and their lag. I.e. value1.1, value1.2, value2.1, value2.2, etc.
Thank you in advance and let me know if you need additional information. If there's a simpler way to get what I'm looking for, then I am all ears.
You'll have to reach deeper into the tidyverse toolbox to add them all at once. If you nest data for each value of cats, you can iterate over the nested data frames, iterating the lags over the values* columns in each.
library(tidyverse)
set.seed(47)
df <- tibble(days = 1:9,
             cats = rep(c('A','B','C'),3),
             values1 = round(rnorm(9, 16, 4)),
             values2 = round(rnorm(9, 16, 16)))
df %>% nest(data = -cats) %>%
  mutate(lags = map(data, function(dat) {
    imap_dfc(dat[-1], ~set_names(map(1:2, lag, x = .x),
                                 paste0(.y, '_lag', 1:2)))
  })) %>%
  unnest(c(data, lags)) %>%
  arrange(days)
#> # A tibble: 9 x 8
#> cats days values1 values2 values1_lag1 values1_lag2 values2_lag1
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 1 24. -7. NA NA NA
#> 2 B 2 19. 1. NA NA NA
#> 3 C 3 17. 17. NA NA NA
#> 4 A 4 15. 24. 24. NA -7.
#> 5 B 5 16. -13. 19. NA 1.
#> 6 C 6 12. 17. 17. NA 17.
#> 7 A 7 12. 27. 15. 24. 24.
#> 8 B 8 16. 15. 16. 19. -13.
#> 9 C 9 15. 36. 12. 17. 17.
#> # ... with 1 more variable: values2_lag2 <dbl>
data.table::shift makes this simpler, as it's vectorized. Naming takes more work than the actual lagging:
library(data.table)
setDT(df)
df[, sapply(1:2, function(x){paste0('values', x, '_lag', 1:2)}) := shift(.SD, 1:2),
by = cats, .SDcols = values1:values2][]
#> days cats values1 values2 values1_lag1 values1_lag2 values2_lag1
#> 1: 1 A 24 -7 NA NA NA
#> 2: 2 B 19 1 NA NA NA
#> 3: 3 C 17 17 NA NA NA
#> 4: 4 A 15 24 24 NA -7
#> 5: 5 B 16 -13 19 NA 1
#> 6: 6 C 12 17 17 NA 17
#> 7: 7 A 12 27 15 24 24
#> 8: 8 B 16 15 16 19 -13
#> 9: 9 C 15 36 12 17 17
#> values2_lag2
#> 1: NA
#> 2: NA
#> 3: NA
#> 4: NA
#> 5: NA
#> 6: NA
#> 7: -7
#> 8: 1
#> 9: 17
In these cases, I rely on the magic of dplyr and tidyr:
library(dplyr)
library(tidyr)
set.seed(47)
# create data
s_data = tibble(
  days = 1:9,
  cats = rep(c('A', 'B', 'C'), 3),
  values1 = round(rnorm(9, 16, 4)),
  values2 = round(rnorm(9, 16, 16))
)
max_lag = 2 # define max number of lags
# create lags
s_data %>%
  gather(key = "key", value = "value", -days, -cats) %>% # gather all variables that will be lagged
  mutate(n_lag = list(0:max_lag)) %>% # add list-column with lag numbers
  unnest(n_lag) %>% # unnest the list column
  arrange(cats, key, n_lag, days) %>% # order the data.frame
  group_by(cats, key, n_lag) %>% # group by relevant variables
  # create lag. when grouped by vars above, n_lag is a constant vector, take 1st value
  mutate(lag_val = lag(value, n_lag[1])) %>%
  ungroup() %>%
  # create some fancy labels
  mutate(var_name = ifelse(n_lag == 0, key, paste0("Lag", key, ".", n_lag))) %>%
  select(-c(key, value, n_lag)) %>% # drop unnecessary data
  spread(var_name, lag_val) %>% # spread your newly created variables
  select(days, cats, starts_with("val"), starts_with("Lag")) # reorder
## # A tibble: 9 x 8
## days cats values1 values2 Lagvalues1.1 Lagvalues1.2 Lagvalues2.1 Lagvalues2.2
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 A 24. -7. NA NA NA NA
## 2 2 B 19. 1. NA NA NA NA
## 3 3 C 17. 17. NA NA NA NA
## 4 4 A 15. 24. 24. NA -7. NA
## 5 5 B 16. -13. 19. NA 1. NA
## 6 6 C 12. 17. 17. NA 17. NA
## 7 7 A 12. 27. 15. 24. 24. -7.
## 8 8 B 16. 15. 16. 19. -13. 1.
## 9 9 C 15. 36. 12. 17. 17. 17.
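For completeness, here is a sketch of how the loop-based function from the question could be made to work, assuming the variables are passed as character strings and that days and cats stay hard-coded as in the original attempt. The name multi_mutate and the "Lag<variable>.<k>" naming are illustrative choices. The two pieces the attempt was missing are `!!name :=` for a dynamic column name and `.data[[v]]` to look a column up by string.

```r
library(dplyr)

set.seed(47)
dat <- data.frame(days = 1:9,
                  cats = rep(c("A", "B", "C"), 3),
                  values1 = round(rnorm(9, 16, 4)),
                  values2 = round(rnorm(9, 16, 16)))

# hypothetical helper: adds `lags` lagged copies of each column in `variables`
multi_mutate <- function(data, variables, lags) {
  out <- data %>% arrange(days) %>% group_by(cats)
  for (v in variables) {        # v is a column name as a string
    for (k in seq_len(lags)) {  # k is the lag distance
      new_name <- paste0("Lag", v, ".", k)
      # `!!new_name :=` sets the column name, `.data[[v]]` fetches the column
      out <- out %>% mutate(!!new_name := dplyr::lag(.data[[v]], k))
    }
  }
  ungroup(out)
}

lagged <- multi_mutate(dat, variables = c("values1", "values2"), lags = 2)
names(lagged)
```

A call such as multi_mutate(dat, c("values1", "values2"), 10) would then add ten lags per variable without writing the mutate() calls by hand.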