Variance over time with two dataframes in R - r

I have two dataframes. df1 is a dataframe where every row is a score that someone gives.
df1
title <- c("x","x","x","x","y","y","y","y","y")
day <- c(0,2,2,4,1,1,3,3,4)
score <- c(7,7,6,4,8,1,7,1,5)
df1 = data.frame(title,day,score)
df2 is a panel dataset in long format, keyed by title and day, with a lot of variables. I'm looking for a way to add (mutate) two variables to it: the variance of the scores on day x, and the cumulative variance over time (i.e. the variance of the score(s) on day x together with all previous scores).
It should look like this:
title <- c("x","x","x","x","x","y","y","y","y","y")
day <- c(0,1,2,3,4,0,1,2,3,4)
variance_day_x <- c(0,0,0.5,0,0,0,24.5,0,18,0)
variance_cumulative <- c(0,0,0.3333,0.3333,2,0,24.5,24.5,14.25,10.8)
df2 <- data.frame(title,day,variance_day_x,variance_cumulative)
As you can see I need to mutate 2 variables out of df1 into df2. The variance per day is the first variable, where variance = 0 when there is 0 or 1 score available on that day because there is nothing to calculate. The second variable is cumulative variance where the variance needs to be updated every time there are new scores available.
Hope this explained my problem well enough. I'm stuck at this moment, hope you guys can help!

Using tidyverse you could try something like this. First group_by title and use a custom cumulative variance function that can be called from mutate. The daily variance is computed after grouping by both title and day. complete will fill in missing days, and fill will carry forward the cumulative variance for those missing days. You can replace the NA with zero if you would like with replace_na.
library(tidyverse)
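cumvar <- function(x) {
  # variance of the first i scores, for every i (NA for i = 1)
  sapply(seq_along(x), function(i) var(x[1:i]))
}
df1 %>%
  group_by(title) %>%
  mutate(cvar = cumvar(score)) %>%
  group_by(title, day) %>%
  summarise(variance_day_x = var(score),
            variance_cumulative = last(cvar)) %>%
  # the result is still grouped by title; recent tidyr versions do not
  # allow completing on a grouping column, so only day is listed here
  complete(day = 0:4) %>%
  fill(variance_cumulative, .direction = "down")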
Output
# A tibble: 10 x 4
# Groups: title [2]
title day variance_day_x variance_cumulative
<chr> <dbl> <dbl> <dbl>
1 x 0 NA NA
2 x 1 NA NA
3 x 2 0.5 0.333
4 x 3 NA 0.333
5 x 4 NA 2
6 y 0 NA NA
7 y 1 24.5 24.5
8 y 2 NA 24.5
9 y 3 18 14.2
10 y 4 NA 10.8
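The cumvar helper above recomputes var on a growing prefix, so its cost grows quadratically with the number of scores per title. As a hedged alternative, the same running sample variance can be obtained from cumulative sums. The name cumvar_fast is introduced here for illustration and is not part of the answer; the sum-of-squares formula it uses can lose precision when values are large and their spread is tiny.

```r
# reference version from the answer
cumvar <- function(x) sapply(seq_along(x), function(i) var(x[1:i]))

# hypothetical vectorised variant:
# var of the first n values = (sum(x^2) - sum(x)^2 / n) / (n - 1)
cumvar_fast <- function(x) {
  n <- seq_along(x)
  v <- (cumsum(x^2) - cumsum(x)^2 / n) / (n - 1)
  v[n == 1] <- NA_real_  # a single score has no variance (0/0 otherwise)
  v
}

y_scores <- c(8, 1, 7, 1, 5)  # scores for title "y" in df1
all.equal(cumvar(y_scores), cumvar_fast(y_scores))
```

Either function can be passed to mutate or ave in the same way.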

A somewhat messy base R solution:
# assumes df1 is sorted by title, as in the example, so that the
# per-title results line up with the rows of df1
df_variances <- cbind(df1, data.frame(do.call("rbind", lapply(split(df1, df1$title),
  function(x) {
    # variance of all scores up to and including each row
    variance_cumulative <- sapply(seq_len(nrow(x)), function(i) var(x$score[1:i]))
    # variance of the scores that share a day (NA when a day has one score)
    variance_day_x <- ave(x$score, x$day, FUN = var)
    data.frame(variance_day_x = variance_day_x,
               variance_cumulative = variance_cumulative)
  }
)), row.names = NULL))
df_clean <- replace(df_variances, is.na(df_variances), 0)

Another base R solution. I also use a custom cumvar function, plus @Ruben's great repeat_last function to fill NAs with the last known value.
This solution is based mainly on ave, which applies a function to a variable, grouped by other variables. Since the days are not complete, we merge the original data onto a complete grid of all unique titles and days. We calculate the cumulative variances first, before the daily variances; the idea is to pick the latest value per day afterwards using length. Finally we delete the duplicates, and it's done.
cumvar <- function(x) sapply(1:length(x), function(i) {var(x[1:i])})
df1$vari.cum <- with(df1, ave(score, title, FUN=cumvar))
compl <- expand.grid(title=unique(df1$title), day=unique(df1$day))
dfx <- merge(compl, df1, all.x= TRUE)
dfx$vari.cum <- with(dfx, ave(vari.cum, title, FUN=repeat_last))
res <- within(dfx, {
vari.day <- ave(score, title, day, FUN=var)
vari.cum <- ave(vari.cum, title, day, FUN=function(x) x[length(x)])
})
res <- res[!duplicated(res[c("title", "day")]), c(1:2, 5:4)]
res
# title day vari.day vari.cum
# 1 x 0 NA NA
# 2 x 1 NA NA
# 3 x 2 0.5 0.3333333
# 5 x 3 NA 0.3333333
# 6 x 4 NA 2.0000000
# 7 y 0 NA NA
# 8 y 1 24.5 24.5000000
# 10 y 2 NA 24.5000000
# 11 y 3 18.0 14.2500000
# 13 y 4 NA 10.8000000
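The code above calls repeat_last, which is not defined in this answer. A minimal stand-in is sketched below, under the assumption that it should carry the last non-NA value forward and leave leading NAs untouched; the original function may differ in its details.

```r
# assumed behaviour: last observation carried forward, leading NAs kept
repeat_last <- function(x) {
  idx <- cumsum(!is.na(x))       # position of the most recent non-NA value
  c(NA, x[!is.na(x)])[idx + 1]   # slot 1 is the NA used before any value is seen
}

repeat_last(c(NA, 0.5, NA, NA, 2, NA))
```

Defining this before running the answer's code makes the ave(vari.cum, title, FUN=repeat_last) step runnable.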

Related

Appending a column to each data frame within a list

I have a list of dataframes and want to append a new column to each, but I keep getting various error messages. Can anybody explain why the code below doesn't work for me? I'd be happy if rowid_to_column works, as the data in my actual set is already ordered correctly; otherwise I'd like a new column with a sequence going from 1:length(data$data).
##dataset
data<- tibble(Location = c(rep("London",6),rep("Glasgow",6),rep("Dublin",6)),
Day= rep(seq(1,6,1),3),
Average = runif(18,0,20),
Amplitude = runif(18,0,15))%>%
nest_by(Location)
###map + rowid_to_column
attempt1<- data%>%
map(.,rowid_to_column(.,var = "hour"))
##mutate
attempt2<-data %>%
map(., mutate("Hours" = 1:6))
###add column
attempt3<- data%>%
map(.$data,add_column(.data,hours = 1:6))
newcolumn<- 1:6
###lapply
attempt4<- lapply(data,cbind(data$data,newcolumn))
Many thanks,
Stuart
You were nearly there with your base R attempt, but you want to iterate over data$data, which is a list of data frames.
data$data <- lapply(data$data, function(x) {
hour <- seq_len(nrow(x))
cbind(x, hour)
})
data$data
# [[1]]
# Day Average Amplitude hour
# 1 1 6.070539 1.123182 1
# 2 2 3.638313 8.218556 2
# 3 3 11.220683 2.049816 3
# 4 4 12.832782 14.858611 4
# 5 5 12.485757 7.806147 5
# 6 6 19.250489 6.181270 6
Edit: Updated, as I realised it was iterating over columns rather than rows. This approach also works if the data frames have different numbers of rows, which the methods that hard-code the vector 1:6 do not.
A data.table approach:
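As a small check of the point about differing row counts, here is a sketch on a made-up list of data frames (frames and its contents are hypothetical, not the asker's data). Because the index comes from seq_len(nrow(x)), each element gets a column of the right length.

```r
# two toy data frames with different numbers of rows
frames <- list(
  a = data.frame(v = c(10, 20, 30)),
  b = data.frame(v = c(5, 6))
)

frames <- lapply(frames, function(x) {
  x$hour <- seq_len(nrow(x))  # 1..n for this particular data frame
  x
})

frames$b
```

A fixed 1:6 would fail here, since neither data frame has six rows.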
library(data.table)
setDT(data)
data[, data := lapply(data, function(x) cbind(x, new_col = 1:6))]
data$data
# [[1]]
# Day Average Amplitude test new_col
# 1 1 11.139917 0.3690539 1 1
# 2 2 5.350847 7.0925508 2 2
# 3 3 9.602104 6.1782818 3 3
# 4 4 14.866074 13.7356913 4 4
# 5 5 1.114201 1.1007080 5 5
# 6 6 2.447236 5.9944926 6 6
#
# [[2]]
# Day Average Amplitude test new_col
# 1 1 17.230213 13.966576 1 1
# .....
A purrr approach:
data<- tibble(Location = c(rep("London",6),rep("Glasgow",6),rep("Dublin",6)),
Day= rep(seq(1,6,1),3),
Average = runif(18,0,20),
Amplitude = runif(18,0,15))%>%
group_split(Location) %>%
purrr::map_dfr(~.x %>% mutate(Hours = c(1:6)))
If you want to use your approach and preserve the same data structure, this is a way again using purrr (you need to ungroup, otherwise it will not work due to the rowwise grouping)
data %>% ungroup() %>%
mutate_at("data", .f = ~map(.x, ~.x %>% mutate(Hours = c(1:6))) )

Replace NA with random numbers within `group_by` in dplyr

I have a data frame in long format and I want to replace missing values by random numbers, but I want to do this group wise with different settings...
library(dplyr)
set.seed(1)
imp_df <-
data.frame(exp=rep(letters[1:3], each=2),
rep=1:2,
mean=1:6,
sd=seq(0,0.5,0.1))
df <-
data.frame(
exp=rep(letters[1:3], each=20),
rep=1:2,
int=rnorm(60,10,5)
)
df[sample(1:60,25,replace=F), 'int'] <- NA
My data looks like the above. In imp_df I have the settings for the rnorm function, based on the experiment exp and the replicate rep.
My data frame df has some missing values, and I want to replace each NA with a random number drawn using the settings for its group.
How can I do it using dplyr or tidyr?
Edit
After the answer from @starja, I found a simple, but possibly slow, solution using rowwise together with left_join.
df %>%
left_join(imp_df) %>%
rowwise() %>%
mutate(imp.int=if_else(
is.na(int),
rnorm(1, mean, sd),
int
)) %>%
print(n=60)
Are there other ways to do this?
Edit 2
Since the rowwise approach is pretty slow and I couldn't get it running within some dplyr code, I used a for loop to go through imp_df with the imputation settings.
This is a pretty quick solution, but not as readable as I was hoping:
df$imp.int <- df$int
for(line in 1:nrow(imp_df)) {
imp_settings <- as.list(imp_df[line,])
rows_missing_values <- which(
df$exp == imp_settings$exp &
df$rep == imp_settings$rep &
is.na(df$imp.int)
)
df$imp.int[rows_missing_values] <-
stats::rnorm(length(rows_missing_values), imp_settings$mean, imp_settings$sd)
}
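So we first add a column imp.int for the imputed values, then go through the imputation settings line by line, replacing the NAs for each group.
Another option uses purrr::map2 inside ifelse. Note that map2 returns a list, so int ends up as a list column; map2_dbl would keep it numeric: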
library(dplyr)
library(purrr)
df %>%
left_join(imp_df, by = c("exp", "rep")) %>%
mutate(int = ifelse(is.na(int),
map2(mean, sd, ~ rnorm(1, .x, .y)), int))
exp rep int mean sd
1 a 1 1 1 0.0
2 a 2 10.91822 2 0.1
3 a 1 5.821857 1 0.0
4 a 2 17.9764 2 0.1
5 a 1 11.64754 1 0.0
6 a 2 5.897658 2 0.1
7 a 1 12.43715 1 0.0
8 a 2 13.69162 2 0.1
9 a 1 12.87891 1 0.0
10 a 2 1.986482 2 0.1
I guess there are cleverer solutions out there that use vectorisation, but if you don't have super large data, I like to use a purrr::map function for this together with a small custom made function:
library(dplyr)
set.seed(1)
imp_df <-
data.frame(exp=rep(letters[1:3], each=2),
rep=1:2,
mean=1:6,
sd=seq(0,0.5,0.1))
df <-
data.frame(
exp=rep(letters[1:3], each=20),
rep=1:2,
int=rnorm(60,10,5)
)
df[sample(1:60,25,replace=F), 'int'] <- NA
replace_fun <- function(x, mean, sd) {
if (is.na(x)) {
rnorm(1, mean, sd)
} else {
x
}
}
df %>%
left_join(imp_df, by = c("exp", "rep")) %>%
mutate(int = purrr::pmap_dbl(list(int, mean, sd), replace_fun)) %>%
head()
#> exp rep int mean sd
#> 1 a 1 1.000000 1 0.0
#> 2 a 2 10.918217 2 0.1
#> 3 a 1 5.821857 1 0.0
#> 4 a 2 17.976404 2 0.1
#> 5 a 1 11.647539 1 0.0
#> 6 a 2 5.897658 2 0.1
Created on 2021-05-27 by the reprex package (v0.3.0)
(If you want, you can remove the mean/sd columns with select(-c(mean, sd)).)
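For the vectorised route mentioned above, one possible sketch in base R relies on rnorm accepting vectors for mean and sd. The helper names key and miss are introduced here for illustration; the data setup is copied from the question.

```r
set.seed(1)
imp_df <- data.frame(exp = rep(letters[1:3], each = 2),
                     rep = 1:2,
                     mean = 1:6,
                     sd = seq(0, 0.5, 0.1))
df <- data.frame(exp = rep(letters[1:3], each = 20),
                 rep = 1:2,
                 int = rnorm(60, 10, 5))
df[sample(1:60, 25, replace = FALSE), "int"] <- NA

# row of imp_df holding the settings for each row of df
key <- match(paste(df$exp, df$rep), paste(imp_df$exp, imp_df$rep))
miss <- is.na(df$int)

df$imp.int <- df$int
# one draw per missing value, each with its own group's mean and sd
df$imp.int[miss] <- rnorm(sum(miss), imp_df$mean[key[miss]], imp_df$sd[key[miss]])
```

This draws all replacements in a single rnorm call instead of looping over rows or groups.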

How to take difference between variable and lag determined by month date per group?

Essentially, I have a dataset with variables indicating group, date and value of variable. I need to take the difference between the value and the end-of-previous year value per group. Since the data is balanced, I was trying to do that with dplyr::lag, inserting the lag given the month of the observation:
x <- x %>% group_by(g) %>% mutate(y = v - lag(v, n=month(d)))
This, however, does not work.
The results should be:
Mock dataset:
x <- data.frame('g'=c('B','B','B','C','A','A','A','A','A','A'),'d'=c('2018-11-30', '2018-12-31','2019-01-31','2019-12-31','2016-12-31','2017-11-30','2017-12-31','2018-12-31','2019-01-31','2019-02-28'),'v'=c(300,200,250,100,400,150,200,500,400,500))
Desired variable:
y <- c(NA,NA,50,NA,NA,-250,-200,300,-100,0)
New dataset:
cbind(x,y)
An idea via dplyr is to find the 31 December row in each group, use its index to subtract, and then set the rows up to and including it to NA (this assumes exactly one 31 December row per group), i.e.
library(dplyr)
x %>%
group_by(g) %>%
mutate(new = which(sub('^[0-9]+-([0-9]+-[0-9]+)$', '\\1', d) == '12-31'),
y = v - v[new],
y = replace(y, row_number() <= new, NA)) %>%
select(-new)
which gives,
# A tibble: 7 x 4
# Groups: g [3]
g d v y
<fct> <fct> <dbl> <dbl>
1 B 2018-11-30 300 NA
2 B 2018-12-31 200 NA
3 B 2019-01-31 250 50
4 C 2017-12-31 400 NA
5 A 2018-12-31 500 NA
6 A 2019-01-31 400 -100
7 A 2019-02-28 500 0
In the end I decided to create an auxiliary variable ('eoy') that holds, for each row, the row index of the corresponding end-of-year observation within its group. It requires a loop and is inefficient, but it facilitates the remaining computations that depend on it. The desired computation then becomes:
mutate('y' = v - v[eoy])
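A loop-free way to get the same result is sketched below in base R. It assumes dates are stored as "YYYY-MM-DD" strings and that the end-of-year row is always dated 31 December; the variable target is introduced here for illustration. For each row it builds the date of the previous year's end and looks that date up within the same group.

```r
x <- data.frame(
  g = c("B", "B", "B", "C", "A", "A", "A", "A", "A", "A"),
  d = c("2018-11-30", "2018-12-31", "2019-01-31", "2019-12-31", "2016-12-31",
        "2017-11-30", "2017-12-31", "2018-12-31", "2019-01-31", "2019-02-28"),
  v = c(300, 200, 250, 100, 400, 150, 200, 500, 400, 500),
  stringsAsFactors = FALSE
)

x$y <- NA_real_
for (rows in split(seq_len(nrow(x)), x$g)) {
  d <- x$d[rows]
  v <- x$v[rows]
  # 31 December of the year before each observation
  target <- paste0(as.integer(substr(d, 1, 4)) - 1, "-12-31")
  # match() gives NA when the group has no such row, so y stays NA there
  x$y[rows] <- v - v[match(target, d)]
}
x
```

On the mock dataset, group B's January 2019 row becomes 250 - 200 = 50, and rows without a previous year-end observation in their group stay NA.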

Replace missing data by using another data table for multiple columns

I have many columns in a table where there is missing data. I want to be able to pull in the information from another table if the data is missing for a particular record based on ID. I thought about possibly joining the two tables and writing a for loop where if column X is NA then pull in information from column Y, however, I have many columns and would require writing many of these conditions.
I want to create a function or a loop where I can pass in the data column names with the missing data and be able to pass in the column name from another table to get the information from.
Reproducible Example:
ID <- c(1,2,3,4,5,6)
Year <- c(1990,1987,NA,NA,1968,1992)
Month <- c(1,NA,8,12,NA,5)
Day <- c(3,NA,NA,NA,NA,30)
New_Data = data.frame(ID=ID,Year=Year,Month=Month,Day=Day)
ID <- c(2,3,4,5)
Year <- c(NA,1994,1967,NA)
Month <- c(4,NA,NA,10)
Day <- c(23,12,16,9)
Old_Data = data.frame(ID=ID,Year=Year,Month=Month,Day=Day)
Expected Output:
ID <- c(1,2,3,4,5,6)
Year <- c(1990,1987,1994,1967,1968,1992)
Month <- c(1,4,8,12,10,5)
Day <- c(3,23,12,16,9,30)
New_Data = data.frame(ID=ID,Year=Year,Month=Month,Day=Day)
Use rbind to combine the two data frames, then group_by with summarise_all to keep the first non-missing value per ID (New_Data comes first, so its values take priority):
library(dplyr)
rbind(New_Data,Old_Data)%>%group_by(ID)%>%dplyr::summarise_all(function(x) x[!is.na(x)][1])
# A tibble: 6 x 4
ID Year Month Day
<dbl> <dbl> <dbl> <dbl>
1 1 1990 1 3
2 2 1987 4 23
3 3 1994 8 12
4 4 1967 12 16
5 5 1968 10 9
6 6 1992 5 30
An option using dplyr::left_join and dplyr::coalesce can be as:
library(dplyr)
New_Data %>% left_join(Old_Data, by="ID") %>%
mutate(Year = coalesce(Year.x, Year.y),
Month = coalesce(Month.x, Month.y),
Day = coalesce(Day.x, Day.y)) %>%
select(ID, Year, Month, Day)
# ID Year Month Day
# 1 1 1990 1 3
# 2 2 1987 4 23
# 3 3 1994 8 12
# 4 4 1967 12 16
# 5 5 1968 10 9
# 6 6 1992 5 30
Here's a solution using only base functions from another SO question
I modified it to your needs (created a function, and made an argument for the key column name):
fill_missing_data = function(df1, df2, keyColumn) {
commonNames <- names(df1)[which(colnames(df1) %in% colnames(df2))]
commonNames <- commonNames[commonNames != keyColumn]
dfmerge <- merge(df1, df2, by = keyColumn, all = TRUE)  # use the key argument rather than a hard-coded "ID"
for(i in commonNames){
left <- paste(i, ".x", sep="")
right <- paste(i, ".y", sep="")
dfmerge[is.na(dfmerge[left]),left] <- dfmerge[is.na(dfmerge[left]),right]
dfmerge[right]<- NULL
colnames(dfmerge)[colnames(dfmerge) == left] <- i
}
return(dfmerge)
}
result = fill_missing_data(New_Data, Old_Data, "ID")
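A variant that avoids the merge and the .x/.y suffixes is sketched below. The function fill_from and its arguments are hypothetical names introduced for illustration; it assumes the key is unique in the lookup table and, by default, fills every non-key column the two tables share.

```r
New_Data <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                       Year = c(1990, 1987, NA, NA, 1968, 1992),
                       Month = c(1, NA, 8, 12, NA, 5),
                       Day = c(3, NA, NA, NA, NA, 30))
Old_Data <- data.frame(ID = c(2, 3, 4, 5),
                       Year = c(NA, 1994, 1967, NA),
                       Month = c(4, NA, NA, 10),
                       Day = c(23, 12, 16, 9))

fill_from <- function(new, old, key,
                      cols = setdiff(intersect(names(new), names(old)), key)) {
  idx <- match(new[[key]], old[[key]])  # row of old for each row of new (NA if absent)
  for (col in cols) {
    miss <- is.na(new[[col]])
    new[[col]][miss] <- old[[col]][idx[miss]]  # stays NA when old has no value either
  }
  new
}

result <- fill_from(New_Data, Old_Data, key = "ID")
result
```

Passing cols = c("Year", "Day"), for example, restricts the fill to those columns, and the row order of New_Data is preserved.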

How to get a frequency table of all columns of complete data frame in R?

I want to create a frequency table from a data frame and save it in Excel. Using the table() function I can only get the frequencies of one particular column. But I want a frequency table for all the columns together, and the levels or type of variable may differ from column to column. It would be like a summary of the data frame, but with only frequencies and no mean or other measures.
I was trying something like this
for(i in 1:230){
rm(tb)
tb<-data.frame(table(mydata[i]))
tb2<-cbind(tb2,tb)
}
But it's showing the following Error
Error in data.frame(..., check.names = FALSE) : arguments imply
differing number of rows: 15, 12
In place of cbind() I also used data.frame(), but the error didn't change.
You are getting an error because you are trying to combine the data frames that have different dimensions. From what I understand, your problem is two-fold: (1) you want to get the frequency distribution of each column regardless of type; and, (2) you want to save all of the results in a single Excel sheet.
For the first problem, you can use the mapply() function.
set.seed(1)
dat <- data.frame(
x = sample(LETTERS[1:5], 15, replace = TRUE),
y = rbinom(5, 15, prob = 0.4)
)
mylist <- mapply(table, dat); mylist
# $x
#
# A B C D E
# 2 5 1 4 3
#
# $y
#
# 5 6 7 11
# 3 3 6 3
You can also use purrr::map().
library(purrr)
dat %>% map(table)
The second problem has several solutions in this question: Export a list into a CSV or TXT file in R. In particular, LyzandeR's answer will enable you to do just what you intended. If you prefer to save the outputs in separate files, you can do:
mapply(write.csv, mylist, file=paste0(names(mylist), '.csv'))
Maybe an rbind solution is better as it allows you to handle variables with different levels:
dt = data.frame(x = c("A","A","B","C"),
y = c(1,1,2,1))
dt
# x y
# 1 A 1
# 2 A 1
# 3 B 2
# 4 C 1
dt_res = data.frame()
for (i in 1:ncol(dt)){
dt_temp = data.frame(t(table(dt[,i])))
dt_temp$Var1 = names(dt)[i]
dt_res = rbind(dt_res, dt_temp)
}
names(dt_res) = c("Variable","Levels","Freq")
dt_res
# Variable Levels Freq
# 1 x A 2
# 2 x B 1
# 3 x C 1
# 4 y 1 3
# 5 y 2 1
And an alternative (probably faster) process using apply:
dt = data.frame(x = c("A","A","B","C"),
y = c(1,1,2,1))
dt
ff = function(x){
y = data.frame(t(table(x)))
y$Var1 = NULL
names(y) = c("Levels","Freq")
return(y)
}
dd = do.call(rbind, apply(dt, 2, ff))
dd
# Levels Freq
# x.1 A 2
# x.2 B 1
# x.3 C 1
# y.1 1 3
# y.2 2 1
# extract variable names from row names
dd$Variable = sapply(row.names(dd), function(x) unlist(strsplit(x,"[.]"))[1])
dd
# Levels Freq Variable
# x.1 A 2 x
# x.2 B 1 x
# x.3 C 1 x
# y.1 1 3 y
# y.2 2 1 y
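Recovering the variable name by splitting row names on "." breaks if a column name itself contains a dot. A hedged alternative is to iterate over names(dt) with lapply and attach the name directly; freq_long is a name introduced here for illustration.

```r
dt <- data.frame(x = c("A", "A", "B", "C"),
                 y = c(1, 1, 2, 1))

freq_long <- do.call(rbind, lapply(names(dt), function(nm) {
  # table() on an unnamed vector yields the columns Var1 and Freq
  tab <- as.data.frame(table(dt[[nm]]), stringsAsFactors = FALSE)
  data.frame(Variable = nm, Levels = tab$Var1, Freq = tab$Freq)
}))
freq_long
```

The result is one long table with a Variable column, which can be written to a single file with write.csv.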
Edit (2021-03-29): tidyverse Principles
Here is some updated code that utilizes tidyverse, specifically functions from dplyr, tibble, and purrr. The code is a bit more readable and easier to carry out as well. Example data set is provided.
tibble(
a = rep(c(1:3), 2),
b = factor(rep(c("Jan", "Feb", "Mar"), 2)),
c = factor(rep(LETTERS[1:3], 2))
) ->
dat
dat #print df
# A tibble: 6 x 3
a b c
<int> <fct> <fct>
1 1 Jan A
2 2 Feb B
3 3 Mar C
4 1 Jan A
5 2 Feb B
6 3 Mar C
Get counts and proportions across columns.
library(purrr)
library(dplyr)
library(tibble)
#library(tidyverse) #to load assortment of pkgs
#output tables - I like to use parentheses & specifying my funs
purrr::map(
dat, function(.x) {
count(tibble(x = .x), x) %>%
mutate(pct = (n / sum(n) * 100))
})
#here is the same code but more concise (purrr formula shorthand)
purrr::map(dat, ~ count(tibble(x = .x), x) %>%
mutate(pct = (n / sum(n) * 100)))
$a
# A tibble: 3 x 3
x n pct
<int> <int> <dbl>
1 1 2 33.3
2 2 2 33.3
3 3 2 33.3
$b
# A tibble: 3 x 3
x n pct
<fct> <int> <dbl>
1 Feb 2 33.3
2 Jan 2 33.3
3 Mar 2 33.3
$c
# A tibble: 3 x 3
x n pct
<fct> <int> <dbl>
1 A 2 33.3
2 B 2 33.3
3 C 2 33.3
Old code...
The table() function returns a "table" object, which is nigh impossible to manipulate using R in my experience. I tend to just write my own function to circumvent this issue. Let's first create a data frame with some categorical variables/features (wide formatted data).
We can use lapply() in conjunction with the table() function found in base R to create a list of frequency counts for each feature.
freqList = lapply(select_if(dat, is.factor),
function(x) {
df = data.frame(table(x))
names(df) = c("x", "y")
return(df)
}
)
This approach allows each list object to be easily indexed and further manipulated if necessary, which can be really handy with data frames containing a lot of features. Use print(freqList) to view all of the frequency tables.

Resources