I have the following dataset:
id observation_date Observation_value
1 2015-02-23 5
1 2015-02-24 6
1 2015-03-01 24
1 2015-07-16 2
1 2015-09-28 9
1 2015-12-05 12
I would like to create monthly averages of observation_value. In those cases where there are no values for a certain month, I would like to fill in the data with the average between the months where I have data.
Using the data in the Note at the end (we have added a second id): convert to zoo using read.zoo, with column 1 to split by and column 2 as the index with "yearmon" class, and in the same statement aggregate with mean over year/month, giving the zoo object z. Then convert to ts, which fills in the missing months with NA, convert back to zoo, and use na.approx to fill in the NAs (or use na.spline or na.locf, depending on what you want). fortify.zoo(zz) and fortify.zoo(zz, melt = TRUE) can be used to convert zoo objects to data frames.
library(zoo)
z <- read.zoo(dat, FUN = as.yearmon, index = 2, split = 1, aggregate = mean)
zz <- na.approx(as.zoo(as.ts(z)))
giving
> zz
1 2
Feb 2015 5.5 5.5
Mar 2015 24.0 24.0
Apr 2015 18.5 18.5
May 2015 13.0 13.0
Jun 2015 7.5 7.5
Jul 2015 2.0 2.0
Aug 2015 5.5 5.5
Sep 2015 9.0 9.0
Oct 2015 10.0 10.0
Nov 2015 11.0 11.0
Dec 2015 12.0 12.0
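Since fortify.zoo is mentioned above, here is a minimal self-contained sketch of the conversion back to data frames (toy data cut down from the Note below; same column layout assumed):

```r
library(zoo)

Lines <- "id observation_date Observation_value
1 2015-02-23 5
1 2015-03-01 24
2 2015-02-23 5
2 2015-03-01 24"
dat <- read.table(text = Lines, header = TRUE)

# same pipeline as above: split by id, index by year/month, aggregate by mean
z  <- read.zoo(dat, FUN = as.yearmon, index = 2, split = 1, aggregate = mean)
zz <- na.approx(as.zoo(as.ts(z)))

fortify.zoo(zz)               # wide: Index plus one column per id
fortify.zoo(zz, melt = TRUE)  # long: Index / Series / Value
```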
Note
Lines <- "id observation_date Observation_value
1 2015-02-23 5
1 2015-02-24 6
1 2015-03-01 24
1 2015-07-16 2
1 2015-09-28 9
1 2015-12-05 12
2 2015-02-23 5
2 2015-02-24 6
2 2015-03-01 24
2 2015-07-16 2
2 2015-09-28 9
2 2015-12-05 12"
dat <- read.table(text = Lines, header = TRUE)
I have a dataframe like this:
X ID X1 X2 X3 X4 X5
BIL 1 1 2 7 1 5
Date 1 12.2 13.5 1.1 26.9 7.9
Year 1 2012 2013 2020 1999 2017
BIL 2 7 9 2 1 5
Date 2 12.2 13.5 1.1 26.9 7.9
Year 2 2022 2063 2000 1989 2015
BIL 3 1 2 7 1 5
Date 3 12.2 13.5 1.1 26.9 7.9
Year 3 2012 2013 2020 1999 2017
I would like to transform it into a new data frame with BIL, Date, and Year as column names and the values listed in the rows below, for example:
ID BIL Date Year
1 1 1 12.2 2012
2 1 2 13.5 2013
3 1 7
4 1 1
5 1 5
6 2 7 12.2 2022
7 2 9 13.5 2063
Any help would really be appreciated!
Edit: Is there any way to also add a grouping variable, like the ID I added above?
This strategy will work:
1. Create an ID column by comparing column X with its first value ("BIL") and taking the cumulative sum of the matches.
2. Transform into long format, dropping the unwanted column (your original variable names) with the names_to = NULL argument.
3. Transform back into wide format, this time using the correct variable names.
4. Collect multiple instances into list columns using the values_fn = list argument of pivot_wider.
5. Unnest all columns except ID.
df <- read.table(text = 'X X1 X2 X3 X4 X5
BIL 1 2 7 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2012 2013 2020 1999 2017
BIL 7 9 2 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2022 2063 2000 1989 2015
BIL 1 2 7 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2012 2013 2020 1999 2017', header = T)
library(tidyverse)
df %>% mutate(ID = cumsum(X == df[1,1])) %>%
pivot_longer(!c(X,ID), names_to = NULL) %>%
pivot_wider(id_cols = c(ID), names_from = X, values_from = value, values_fn = list) %>%
unnest(!ID)
#> # A tibble: 15 x 4
#> ID BIL Date Year
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1 12.2 2012
#> 2 1 2 13.5 2013
#> 3 1 7 1.1 2020
#> 4 1 1 26.9 1999
#> 5 1 5 7.9 2017
#> 6 2 7 12.2 2022
#> 7 2 9 13.5 2063
#> 8 2 2 1.1 2000
#> 9 2 1 26.9 1989
#> 10 2 5 7.9 2015
#> 11 3 1 12.2 2012
#> 12 3 2 13.5 2013
#> 13 3 7 1.1 2020
#> 14 3 1 26.9 1999
#> 15 3 5 7.9 2017
Created on 2021-05-17 by the reprex package (v2.0.0)
This will also give you the same results:
df %>% mutate(ID = cumsum(X == df[1,1])) %>%
pivot_longer(!c(X,ID)) %>%
pivot_wider(id_cols = c(ID, name), names_from = X, values_from = value) %>%
select(-name)
Get the data in long format, create a unique row number for each value in the X column, and convert it back to wide format.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -X) %>%
group_by(X) %>%
mutate(row = row_number()) %>%
ungroup %>%
pivot_wider(names_from = X, values_from = value) %>%
select(-row, -name)
# BIL Date Year
# <dbl> <dbl> <dbl>
# 1 1 12.2 2012
# 2 2 13.5 2013
# 3 7 1.1 2020
# 4 1 26.9 1999
# 5 5 7.9 2017
# 6 7 12.2 2022
# 7 9 13.5 2063
# 8 2 1.1 2000
# 9 1 26.9 1989
#10 5 7.9 2015
#11 1 12.2 2012
#12 2 13.5 2013
#13 7 1.1 2020
#14 1 26.9 1999
#15 5 7.9 2017
In data.table with melt + dcast
library(data.table)
dcast(melt(setDT(df), id.vars = 'X'), rowid(X)~X, value.var = 'value')
data
df <- structure(list(X = c("BIL", "Date", "Year", "BIL", "Date", "Year",
"BIL", "Date", "Year"), X1 = c(1, 12.2, 2012, 7, 12.2, 2022,
1, 12.2, 2012), X2 = c(2, 13.5, 2013, 9, 13.5, 2063, 2, 13.5,
2013), X3 = c(7, 1.1, 2020, 2, 1.1, 2000, 7, 1.1, 2020), X4 = c(1,
26.9, 1999, 1, 26.9, 1989, 1, 26.9, 1999), X5 = c(5, 7.9, 2017,
5, 7.9, 2015, 5, 7.9, 2017)), class = "data.frame", row.names = c(NA, -9L))
A base R option using reshape (first wide and then long)
p <- reshape(
transform(
df,
id = ave(X, X, FUN = seq_along)
),
direction = "wide",
idvar = "id",
timevar = "X"
)
q <- reshape(
setNames(p, gsub("(.*)\\.(.*)", "\\2.\\1", names(p))),
direction = "long",
idvar = "id",
varying = -1
)
and you will see
id time BIL Date Year
1.X1 1 X1 1 12.2 2012
2.X1 2 X1 7 12.2 2022
3.X1 3 X1 1 12.2 2012
1.X2 1 X2 2 13.5 2013
2.X2 2 X2 9 13.5 2063
3.X2 3 X2 2 13.5 2013
1.X3 1 X3 7 1.1 2020
2.X3 2 X3 2 1.1 2000
3.X3 3 X3 7 1.1 2020
1.X4 1 X4 1 26.9 1999
2.X4 2 X4 1 26.9 1989
3.X4 3 X4 1 26.9 1999
1.X5 1 X5 5 7.9 2017
2.X5 2 X5 5 7.9 2015
3.X5 3 X5 5 7.9 2017
You may unlist the values, reshape them with matrix, coerce the transpose with as.data.frame, and use the first cells of column X for setNames.
setNames(as.data.frame(t(matrix(unlist(dat[-1]), 3, 15))), dat[1:3, 1])
# BIL Date Year
# 1 1 12.2 2012
# 2 7 12.2 2022
# 3 1 12.2 2012
# 4 2 13.5 2013
# 5 9 13.5 2063
# 6 2 13.5 2013
# 7 7 1.1 2020
# 8 2 1.1 2000
# 9 7 1.1 2020
# 10 1 26.9 1999
# 11 1 26.9 1989
# 12 1 26.9 1999
# 13 5 7.9 2017
# 14 5 7.9 2015
# 15 5 7.9 2017
If you want it less hardcoded, use:
m <- length(unique(dat$X))
n <- ncol(dat[-1]) * m
setNames(as.data.frame(t(matrix(unlist(dat[-1]), m, n))), dat[1:m, 1])
Data:
dat <- read.table(header=T, text='X X1 X2 X3 X4 X5
BIL 1 2 7 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2012 2013 2020 1999 2017
BIL 7 9 2 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2022 2063 2000 1989 2015
BIL 1 2 7 1 5
Date 12.2 13.5 1.1 26.9 7.9
Year 2012 2013 2020 1999 2017
')
I have two dataframes with the following matching keys: year, region and province. They each have a set of variables (in this illustrative example I use x1 for df1 and x2 for df2) and both variables have several missing values on their own.
df1                              df2
year region province x1 ... xn   year region province x2 ... xn
2019 1      5        3           2019 1      5        NA
2019 2      4        27          2019 2      4        NA
2019 2      4        15          2019 2      4        NA
2018 3      7        12          2018 3      7        13
2018 3      7        NA          2018 3      7        15
2018 3      7        NA          2018 3      7        17
I want to merge both dataframes such that they end up like this:
year region province x1 x2
2019 1 5 3 NA
2019 2 4 27 NA
2019 2 4 15 NA
2018 3 7 12 13
2018 3 7 NA 15
2018 3 7 NA 17
2017 4 9 NA 12
2017 4 9 19 30
2017 4 9 20 10
However, when doing so using merged_df <- merge(df1, df2, by=c("year","region","province"), all.x=TRUE), R seems to create a lot of additional missing values in each of the variable columns (x1 and x2) which were not there before. What is happening here? I have tried sorting both using df1 %>% arrange(province,-year) and df2 %>% arrange(province,-year), which is enough to give both dataframes a matching order, only to find the same issue when running the merge command. I've tried a bunch of other stuff too, but nothing seems to work. R's output looks sort of like this:
year region province x1 x2
2019 1 5 NA NA
2019 2 4 NA NA
2019 2 4 NA NA
2018 3 7 NA NA
2018 3 7 NA NA
2018 3 7 NA NA
2017 4 9 15 NA
2017 4 9 19 30
2017 4 9 20 10
I have done this before; in fact, one of the dataframes is an already merged dataframe in which I did not encounter this issue.
Maybe the concept of merge() is not clear. I include two examples with example data; I hope they help you.
#Data
set.seed(123)
DF1 <- data.frame(year=rep(c(2017,2018,2019),3),
region=rep(c(1,2,3),3),
province=round(runif(9,1,5),0),
x1=rnorm(9,3,1.5))
DF2 <- data.frame(year=rep(c(2016,2018,2019),3),
region=rep(c(1,2,3),3),
province=round(runif(9,1,5),0),
x2=rnorm(9,3,1.5))
#Merge keeping all rows of DF1 (left join)
Merged1 <- merge(DF1,DF2,by=intersect(names(DF1),names(DF2)),all.x=T)
Merged1
year region province x1 x2
1 2017 1 2 2.8365510 NA
2 2017 1 3 3.7557187 NA
3 2017 1 5 4.9208323 NA
4 2018 2 4 2.8241371 NA
5 2018 2 5 6.7925048 1.460993
6 2018 2 5 0.4090941 1.460993
7 2019 3 1 5.5352765 NA
8 2019 3 3 3.8236451 4.256681
9 2019 3 3 3.2746239 4.256681
#Merge including all rows of both, even ids with no match (full join)
Merged2 <- merge(DF1,DF2,by=intersect(names(DF1),names(DF2)),all = T)
Merged2
year region province x1 x2
1 2016 1 3 NA 4.052034
2 2016 1 4 NA 2.062441
3 2016 1 5 NA 2.673038
4 2017 1 2 2.8365510 NA
5 2017 1 3 3.7557187 NA
6 2017 1 5 4.9208323 NA
7 2018 2 1 NA 0.469960
8 2018 2 2 NA 2.290813
9 2018 2 4 2.8241371 NA
10 2018 2 5 6.7925048 1.460993
11 2018 2 5 0.4090941 1.460993
12 2019 3 1 5.5352765 NA
13 2019 3 2 NA 1.398264
14 2019 3 3 3.8236451 4.256681
15 2019 3 3 3.2746239 4.256681
16 2019 3 4 NA 1.906663
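To diagnose the asker's extra NAs, one option (not part of the original answer, a sketch) is dplyr::anti_join, which lists the rows of DF1 whose key combination has no partner in DF2; each such row becomes an NA in x2 after a left merge:

```r
library(dplyr)

# same example data as above
set.seed(123)
DF1 <- data.frame(year = rep(c(2017, 2018, 2019), 3),
                  region = rep(c(1, 2, 3), 3),
                  province = round(runif(9, 1, 5), 0),
                  x1 = rnorm(9, 3, 1.5))
DF2 <- data.frame(year = rep(c(2016, 2018, 2019), 3),
                  region = rep(c(1, 2, 3), 3),
                  province = round(runif(9, 1, 5), 0),
                  x2 = rnorm(9, 3, 1.5))

# rows of DF1 with no matching (year, region, province) in DF2
unmatched <- anti_join(DF1, DF2, by = c("year", "region", "province"))
unmatched
```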
I have 2 data frames. In df1 we have NA values in p1 which need to be replaced with the mean of the previous 2 years' Average_f1 from df2.
E.g. in df1, for row 5 the year is 2015 and the bin is 5, so we need to replace the NA with the mean of the previous 2 years (2013 & 2014) for the same bin from df2; for row 7 we have only 1 year's value.
df1               df2
year p1 bin       year bin_p1 Average_f1
2013 20  1        2013 5      29.5
2013 24  1        2014 5      16.5
2014 10  2        2015 NA     30
2014 11  2        2016 7      12
2015 NA  5
2016 10  3
2017 NA  7
output
df1
year p1 bin
2013 20 1
2013 24 1
2014 10 2
2014 11 2
2015 **23** 5
2016 10 3
2017 **12** 7
Thanks in advance
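One possible sketch with dplyr (an assumption about the intended logic: "previous 2 years" means year-1 and year-2 for the same bin; column names as in the question):

```r
library(dplyr)

df1 <- data.frame(year = c(2013, 2013, 2014, 2014, 2015, 2016, 2017),
                  p1   = c(20, 24, 10, 11, NA, 10, NA),
                  bin  = c(1, 1, 2, 2, 5, 3, 7))
df2 <- data.frame(year       = c(2013, 2014, 2015, 2016),
                  bin_p1     = c(5, 5, NA, 7),
                  Average_f1 = c(29.5, 16.5, 30, 12))

res <- df1 %>%
  rowwise() %>%
  mutate(p1 = if (is.na(p1)) {
           # mean of Average_f1 for the same bin over the previous 2 years;
           # with only 1 year available, the mean is just that value
           mean(df2$Average_f1[df2$bin_p1 %in% bin &
                               df2$year   %in% c(year - 1, year - 2)])
         } else p1) %>%
  ungroup()
res
```

For 2015/bin 5 this averages 29.5 and 16.5 to 23, and for 2017/bin 7 it picks up the single 2016 value of 12, matching the output shown above.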
Hi, I would like to reshape my data frame profile_table_long, which represents 24 hours of data per day for 50 companies over 2 years.
Data - date from 2015-01-01 to 2016-12-31
name - name of firm 1:50
hour - hour 1:24 (with additional 2a between 2 and 3)
load - variable
x <- NULL
x$Data <- rep(seq(as.Date("2015/1/1"), as.Date("2016/12/31"), "days"), length.out=913750)
x$Name <- rep(rep(1:50, each=731), length.out=913750)
x$hour <- rep(rep(c(1, 2, "2a", 3:24), each=36550),length.out=913750)
x$load <- sample(2000:2500, 913750, replace=T)
x <- data.frame(x)
Data name hour load
1 2015-01-01 1 1 8837.050
2 2015-01-01 1 2 6990.952
3 2015-01-01 1 2a 8394.421
4 2015-01-01 1 3 8267.276
5 2015-01-01 1 4 8324.069
6 2015-01-01 1 5 8644.901
7 2015-01-01 1 6 8720.878
8 2015-01-01 1 7 9213.204
9 2015-01-01 1 8 9601.976
10 2015-01-01 1 9 8549.170
11 2015-01-01 1 10 9379.324
12 2015-01-01 1 11 9370.418
13 2015-01-01 1 12 7159.201
14 2015-01-01 1 13 8497.344
15 2015-01-01 1 14 6419.835
16 2015-01-01 1 15 9354.910
17 2015-01-01 1 16 9320.462
18 2015-01-01 1 17 9263.098
19 2015-01-01 1 18 9167.991
20 2015-01-01 1 19 9004.010
21 2015-01-01 1 20 9134.466
22 2015-01-01 1 21 7631.472
23 2015-01-01 1 22 6492.074
24 2015-01-01 1 23 6888.025
25 2015-01-01 1 24 8821.283
25 2015-01-02 1 1 8902.135
I would like to make it look like that:
data hour name1 name2 .... name49 name50
2015-01-01 1 load load .... load load
2015-01-01 2 load load .... load load
.....
2015-01-01 24 load load .... load load
2015-01-02 1 load load .... load load
.....
2016-12-31 24 load load .... load load
I tried spread() from the tidyr package, profile_table_tidy <- spread(profile_table_long, name, load), but I am getting an error: Error: Duplicate identifiers for rows.
This method uses the reshape2 package:
library("reshape2")
profile_table_wide = dcast(data = profile_table_long,
formula = Data + hour ~ name,
value.var = "load")
You might also want to choose a value for fill. Good luck!
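For completeness, a hedged tidyr alternative: the "Duplicate identifiers" error from spread() can be avoided with pivot_wider, which accepts a values_fn to collapse any duplicate Data/hour/name combinations (toy data standing in for profile_table_long):

```r
library(tidyr)

# small stand-in for profile_table_long, same column names as the question
profile_table_long <- data.frame(
  Data = as.Date(c("2015-01-01", "2015-01-01", "2015-01-01", "2015-01-01")),
  name = c(1, 1, 2, 2),
  hour = c(1, 2, 1, 2),
  load = c(2100, 2200, 2300, 2400)
)

profile_table_wide <- pivot_wider(profile_table_long,
                                  id_cols     = c(Data, hour),
                                  names_from  = name,
                                  values_from = load,
                                  values_fn   = mean)  # collapses duplicates, if any
profile_table_wide
```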