R - split data frame and save to different files


I have a data frame with monthly temperature data for several locations:
> df4[1:36,]
location variable cut month year freq
1 Adamantina temperature 10 Jan 1981 21.0
646 Adamantina temperature 10 Feb 1981 20.5
1291 Adamantina temperature 10 Mar 1981 21.5
1936 Adamantina temperature 10 Apr 1981 21.5
2581 Adamantina temperature 10 May 1981 24.0
3226 Adamantina temperature 10 Jun 1981 21.5
3871 Adamantina temperature 10 Jul 1981 22.5
4516 Adamantina temperature 10 Aug 1981 23.5
5161 Adamantina temperature 10 Sep 1981 19.5
5806 Adamantina temperature 10 Oct 1981 21.5
6451 Adamantina temperature 10 Nov 1981 23.0
7096 Adamantina temperature 10 Dec 1981 19.0
2 Adolfo temperature 10 Jan 1981 24.0
647 Adolfo temperature 10 Feb 1981 20.0
1292 Adolfo temperature 10 Mar 1981 24.0
1937 Adolfo temperature 10 Apr 1981 23.0
2582 Adolfo temperature 10 May 1981 18.0
3227 Adolfo temperature 10 Jun 1981 21.0
3872 Adolfo temperature 10 Jul 1981 22.0
4517 Adolfo temperature 10 Aug 1981 19.0
5162 Adolfo temperature 10 Sep 1981 19.0
5807 Adolfo temperature 10 Oct 1981 24.0
6452 Adolfo temperature 10 Nov 1981 24.0
7097 Adolfo temperature 10 Dec 1981 24.0
3 Aguai temperature 10 Jan 1981 24.0
648 Aguai temperature 10 Feb 1981 20.0
1293 Aguai temperature 10 Mar 1981 22.0
1938 Aguai temperature 10 Apr 1981 20.0
2583 Aguai temperature 10 May 1981 21.5
3228 Aguai temperature 10 Jun 1981 20.5
3873 Aguai temperature 10 Jul 1981 24.0
4518 Aguai temperature 10 Aug 1981 23.5
5163 Aguai temperature 10 Sep 1981 18.5
5808 Aguai temperature 10 Oct 1981 21.0
6453 Aguai temperature 10 Nov 1981 22.0
7098 Aguai temperature 10 Dec 1981 23.5
What I need to do is to programmatically split this data frame by location and create a .Rdata file for every location.
In the example above, I would have three different files - Adamantina.Rdata, Adolfo.Rdata and Aguai.Rdata - containing all the columns but only the rows corresponding to those locations.
It needs to be efficient and programmatic, because in my actual data I have about 700 different locations and about 50 years of data for every location.
Thanks in advance.

This borrows from a previous answer, but I don't believe that answer does what you want.
First, as they suggest, split up your data set:
splitData <- split(df4, df4$location)
Now go through this list and save the data frames one by one, using the list names to build the file names:
allNames <- names(splitData)
for (thisName in allNames) {
  saveName <- paste0(thisName, '.Rdata')
  # saveRDS() stores a single object; read it back with readRDS(), not load()
  saveRDS(splitData[[thisName]], file = saveName)
}
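A minimal sketch of the full round trip, assuming a small made-up data frame `toy` and a temporary output directory (both hypothetical, not from the question). It uses the .rds extension only because files written by saveRDS() are read back with readRDS() rather than load():

```r
# Toy stand-in for df4 (made-up values, for illustration only)
toy <- data.frame(
  location = rep(c("Adamantina", "Adolfo", "Aguai"), each = 2),
  month    = rep(c("Jan", "Feb"), times = 3),
  freq     = c(21.0, 20.5, 24.0, 20.0, 24.0, 20.0),
  stringsAsFactors = FALSE
)

out_dir <- tempdir()                  # write somewhere disposable
pieces  <- split(toy, toy$location)   # named list, one data frame per location

# One file per location; the list names become the file names
for (loc in names(pieces)) {
  saveRDS(pieces[[loc]], file = file.path(out_dir, paste0(loc, ".rds")))
}

# Read one location back; readRDS() returns the object itself
adolfo <- readRDS(file.path(out_dir, "Adolfo.rds"))
```

With about 700 locations the loop is unchanged; only the number of list elements grows.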

To split the data frame, use split(df4, df4$location). It returns a list of data frames whose elements are named Adamantina, Adolfo, Aguai, etc.
To save data frames into a single locations.RData file, use save(Adamantina, Adolfo, Aguai, file="locations.RData"); this requires objects with those names to exist in the workspace. save.image(file="filename.RData") will save everything in the current R session into filename.RData.
You can read more about save and save.image in their help pages (?save, ?save.image).
Edit:
If the number of splits is too large to name each one by hand, use this approach:
locations <- split(df4, df4$location)
save(locations, file = "locations.RData")
locations.RData will then load as a single list named locations.
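If one .RData file per location is wanted, as the question asks, one possible sketch is to assign each piece to a variable named after its location inside a scratch environment and pass that name to save() through its list= and envir= arguments. The toy data frame and the temporary directory are assumptions for illustration:

```r
# Toy stand-in for df4 (made-up values)
toy <- data.frame(
  location = rep(c("Adamantina", "Adolfo"), each = 2),
  freq     = c(21.0, 20.5, 24.0, 20.0),
  stringsAsFactors = FALSE
)
out_dir   <- tempdir()
locations <- split(toy, toy$location)

for (loc in names(locations)) {
  scratch <- new.env()                              # keeps the workspace clean
  assign(loc, locations[[loc]], envir = scratch)    # object named after the location
  save(list = loc, envir = scratch,
       file = file.path(out_dir, paste0(loc, ".RData")))
}

# load() restores the object under its saved name and returns that name
restored    <- new.env()
loaded_name <- load(file.path(out_dir, "Adolfo.RData"), envir = restored)
```

Each file then loads as a data frame carrying the location's name, at the cost of one file per location.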

Related

Seasonal package: Forecasts end date [...] must end on or before user-defined regression variables end date

I'm relatively new to R and had a question regarding time series format for forecasting and seasonal adjustment using the seasonal package. I'm working with import.spc to generate function calls based on spec files.
Currently, I have FORECAST{MAXLEAD=48}, with my time series ending in 2022-02. I'm getting this error:
- forecasts end date, 2026.Feb, must end on or before user-defined regression variables end date, 2022.Feb.
Is this because my time series ends earlier than 2026-02? I tried appending "NA"s to the end of my historicals but it didn't do much.
Alternatively, I also tried setting FORECAST{MAXLEAD=0}, but I ran into this error:
Error: X-13 has returned a non-zero exist status, which means that the current spec file cannot be processed for an unknown reason.
See my code below:
library("tidyverse")
library("seasonal")
fn<-import.spc("C:\\PATH\\TO\\SPEC\\FILE.spc")
x<-import.ts("C:\\PATH\\TO\\DATA\\FILE.dat")
x %>% (fn[1]$seas)
FILE.spc
SERIES{
TITLE = "Logging"
START = 2016.01
PERIOD = 12
SAVE = (A1 B1)
PRINT = BRIEF
NAME = '1011330000 - AE'
FILE = '"C:\\PATH\\TO\\DATA\\FILE.dat"}
TRANSFORM{FUNCTION = NONE
}
REGRESSION{
USER = (dum1 dum2 dum3 dum4 dum5 dum6 dum7 dum8 dum9 dum10 dum11)
START = 1986.01
USERTYPE = TD
FILE = 'C:\\PATH\\TO\\FILE\\FDUM8606.dat'
SAVE = (TD AO LS TC)
}
ARIMA{
MODEL = (0 1 1)(0 1 1)
}
ESTIMATE{
MAXITER = 3000
}
FORECAST{
MAXLEAD = 0
}
OUTLIER{
CRITICAL = 10.5
TYPES = AO
}
X11{
SEASONALMA = (s3x3)
MODE = ADD
PRINT = (BRIEF)
SAVE = (D10 D11 D16)
APPENDFCST = YES
FINAL = USER
SAVELOG = (Q Q2 M7 FB1 FD8 MSF)
}
FDUM8606.dat can be found here
FILE.dat
2016 2 51.1
2016 3 50.4
2016 4 47.9
2016 5 49.8
2016 6 52.0
2016 7 52.6
2016 8 52.6
2016 9 51.9
2016 10 52.1
2016 11 51.4
2016 12 49.9
2017 1 48.2
2017 2 49.6
2017 3 48.0
2017 4 47.6
2017 5 48.9
2017 6 50.4
2017 7 50.7
2017 8 50.6
2017 9 50.1
2017 10 49.7
2017 11 50.7
2017 12 50.2
2018 1 49.2
2018 2 49.8
2018 3 48.7
2018 4 47.8
2018 5 49.0
2018 6 49.2
2018 7 50.8
2018 8 50.6
2018 9 50.0
2018 10 49.6
2018 11 49.1
2018 12 49.7
2019 1 49.3
2019 2 48.1
2019 3 47.7
2019 4 45.4
2019 5 47.1
2019 6 48.8
2019 7 49.3
2019 8 50.5
2019 9 49.5
2019 10 51.6
2019 11 51.2
2019 12 49.1
2020 1 47.9
2020 2 47.9
2020 3 46.7
2020 4 42.0
2020 5 44.3
2020 6 45.7
2020 7 46.8
2020 8 46.7
2020 9 46.6
2020 10 47.5
2020 11 47.0
2020 12 48.1
2021 1 48.1
2021 2 48.0
2021 3 46.3
2021 4 43.4
2021 5 43.7
2021 6 46.8
2021 7 47.6
2021 8 48.0
2021 9 46.0
2021 10 45.5
2021 11 45.4
2021 12 44.7
2022 1 44.8
2022 2 45.1

Converting Dates to Julian Date

I am currently trying to do Theil-Sen trend estimates with a number of time series. How should I convert the Date variables so that they can be run in mblm package? The dates currently exist like so 'Apr 1981'. I want to use monthly medians in this assessment. See attached data.frame.
Thanks!
mo yr doc Date
04 1981 2.800 Apr 1981
05 1982 2.700 May 1982
10 1999 0.500 Oct 1999
05 2000 2.400 May 2000
06 2000 1.200 Jun 2000
07 2000 0.950 Jul 2000
08 2000 0.700 Aug 2000
09 2000 0.750 Sep 2000
10 2000 0.600 Oct 2000
11 2000 0.785 Nov 2000
12 2000 0.660 Dec 2000
01 2001 0.710 Jan 2001
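One possible way to get a numeric time variable, sketched under the assumption that the first day of each month is an acceptable stand-in date: build a Date from the yr and mo columns, then convert it to days since 1970-01-01 with as.numeric(). The data frame name `doc_df`, the choice of day 1, and the decimal-year column are assumptions; which scale suits the trend estimate best is left open.

```r
# First three rows of the sample data, retyped for illustration
doc_df <- data.frame(
  mo  = c("04", "05", "10"),
  yr  = c(1981, 1982, 1999),
  doc = c(2.8, 2.7, 0.5),
  stringsAsFactors = FALSE
)

# ISO "YYYY-MM-DD" strings parse the same way in every locale
doc_df$date <- as.Date(sprintf("%d-%s-01", as.integer(doc_df$yr), doc_df$mo))

# Days since 1970-01-01, usable as a numeric regressor
doc_df$days <- as.numeric(doc_df$date)

# Alternative scale: decimal years, so a slope reads as "per year"
doc_df$dec_year <- doc_df$yr + (as.integer(doc_df$mo) - 1) / 12
```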

Convert rows to Columns in R

My Dataframe:
> head(scotland_weather)
JAN Year.1 FEB Year.2 MAR Year.3 APR Year.4 MAY Year.5 JUN Year.6 JUL Year.7 AUG Year.8 SEP Year.9 OCT Year.10
1 293.8 1993 278.1 1993 238.5 1993 191.1 1947 191.4 2011 155.0 1938 185.6 1940 216.5 1985 267.6 1950 258.1 1935
2 292.2 1928 258.8 1997 233.4 1990 149.0 1910 168.7 1986 137.9 2002 181.4 1988 211.9 1992 221.2 1981 254.0 1954
3 275.6 2008 244.7 2002 201.3 1992 146.8 1934 155.9 1925 137.8 1948 170.1 1939 202.3 2009 193.9 1982 248.8 2014
4 252.3 2015 227.9 1989 200.2 1967 142.1 1949 149.5 2015 137.7 1931 165.8 2010 191.4 1962 189.7 2011 247.7 1938
5 246.2 1974 224.9 2014 180.2 1979 133.5 1950 137.4 2003 135.0 1966 162.9 1956 190.3 2014 189.7 1927 242.3 1983
6 245.0 1975 195.6 1995 180.0 1989 132.9 1932 129.7 2007 131.7 2004 159.9 1985 189.1 2004 189.6 1985 240.9 2001
NOV Year.11 DEC Year.12 WIN Year.13 SPR Year.14 SUM Year.15 AUT Year.16 ANN Year.17
1 262.0 2009 300.7 2013 743.6 2014 409.5 1986 455.6 1985 661.2 1981 1886.4 2011
2 244.8 1938 268.5 1986 649.5 1995 401.3 2015 435.6 1948 633.8 1954 1828.1 1990
3 242.2 2006 267.2 1929 645.4 2000 393.7 1994 427.8 2009 615.8 1938 1756.8 2014
4 231.3 1917 265.4 2011 638.3 2007 393.2 1967 422.6 1956 594.5 1935 1735.8 1938
5 229.9 1981 264.0 2006 608.9 1990 391.7 1992 397.0 2004 590.6 1982 1720.0 2008
6 224.9 1951 261.0 1912 592.8 2015 389.1 1913 390.1 1938 589.2 2006 1716.5 1954
Year.X column is not ordered. I wish to convert this into the following format:
month year rainfall_mm
Jan 1993 293.8
Feb 1993 278.1
Mar 1993 238.5
...
Nov 2015 230.0
I tried t() but it keeps the year column separate.
I also tried reshape2's recast(data, formula, ..., id.var, measure.var), but something is missing, as both the month and Year.X columns are numeric (num and int).
> str(scotland_weather)
'data.frame': 106 obs. of 34 variables:
$ JAN : num 294 292 276 252 246 ...
$ Year.1 : int 1993 1928 2008 2015 1974 1975 2005 2007 1990 1983 ...
$ FEB : num 278 259 245 228 225 ...
$ Year.2 : int 1990 1997 2002 1989 2014 1995 1998 2000 1920 1918 ...
$ MAR : num 238 233 201 200 180 ...
$ Year.3 : int 1994 1990 1992 1967 1979 1989 1921 1913 2015 1978 ...
$ APR : num 191 149 147 142 134 ...

Because the value columns and the 'Year.X' columns in 'scotland_weather' alternate, one way is to index with c(TRUE, FALSE), which is recycled to select every other column starting from the first, similar to seq(1, ncol(scotland_weather), by = 2). Likewise, c(FALSE, TRUE) selects the same columns as seq(2, ncol(scotland_weather), by = 2). We extract each set of columns, transpose it (t) and flatten it to a vector (c). The next step is to extract the column names that are not 'Year' columns, for which grepl can be used. Finally, data.frame binds the vectors into a data.frame.
res <- data.frame(
  month       = names(scotland_weather)[!grepl('Year', names(scotland_weather))],
  year        = c(t(scotland_weather[c(FALSE, TRUE)])),
  rainfall_mm = c(t(scotland_weather[c(TRUE, FALSE)])))
head(res,4)
# month year rainfall_mm
#1 JAN 1993 293.8
#2 FEB 1993 278.1
#3 MAR 1993 238.5
#4 APR 1947 191.1
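To see why the transpose matters, here is a small sketch on a two-month, two-row toy frame (values copied from the first two rows of the question; the names `toy`, `value_cols` and `long` are made up). Flattening t(...) walks the data row by row, so months cycle fastest and each rainfall value stays next to its own year:

```r
toy <- data.frame(
  JAN = c(293.8, 292.2), Year.1 = c(1993L, 1928L),
  FEB = c(278.1, 258.8), Year.2 = c(1993L, 1997L)
)

value_cols <- !grepl("Year", names(toy))   # TRUE for JAN and FEB

long <- data.frame(
  # month names repeated once per original row, written out explicitly
  month       = rep(names(toy)[value_cols], times = nrow(toy)),
  year        = c(t(toy[!value_cols])),    # Year.1, Year.2 of row 1, then row 2
  rainfall_mm = c(t(toy[value_cols])),     # JAN, FEB of row 1, then row 2
  stringsAsFactors = FALSE
)
```

The answer's version relies on data.frame() recycling the month names; the explicit rep() here does the same thing visibly.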

The problem is not only that you need to reshape your data: the years for the first column are in the second column, the years for the third column are in the fourth, and so on.
Here is a solution using tidyr (with dplyr for filter and select).
library(tidyr)
library(dplyr)
# TRUE when column y sits immediately to the right of column x
is_pair <- function(x, y) match(y, names(scotland_weather)) - match(x, names(scotland_weather)) == 1
scotland_weather %>%
  gather("month", "rainfall_mm", -starts_with("Year")) %>%  # stack the value columns
  gather("yearname", "year", starts_with("Year")) %>%       # stack the Year.X columns
  filter(is_pair(month, yearname)) %>%                      # keep each month with its own Year.X
  select(-yearname)

Linear model/lmList with nested/multiple group categories?

I am trying to build a model for monthly energy consumption based on weather, grouped by location (there are ~1100) AND year (I would like to do it from 2011-2014). The data is called factin and looks like this:
Store Month Days UPD HD CD Year
1 August, 2013 31 6478.27 0.06 10.03 2013
1 September, 2013 30 6015.38 0.50 5.67 2013
1 October, 2013 31 5478.21 5.29 1.48 2013
1 November, 2013 30 5223.78 18.60 0.00 2013
1 December, 2013 31 5115.80 20.52 0.23 2013
6 January, 2011 31 4517.56 27.45 0.00 2011
6 February, 2011 28 4116.07 16.75 0.07 2011
6 March, 2011 31 3981.78 12.68 0.39 2011
6 April, 2011 30 4041.68 3.83 2.53 2011
6 May, 2011 31 4287.23 1.61 6.58 2011
And my model code, which just spits out 1 set of coefficients for all the years of each store, looks like this:
factout <- lmList(UPD ~ HD + CD | Store, factin)
My question is: is there any way I can get coefficients for each store AND year without creating a separate data frame for each year?

library(nlme)  # assumed source of lmList(); the summary()$coef indexing below follows nlme
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Store Month year Days UPD HD CD Year
1 August 2013 31 6478.27 0.06 10.03 2013
1 September 2013 30 6015.38 0.50 5.67 2013
1 October 2013 31 5478.21 5.29 1.48 2013
1 November 2013 30 5223.78 18.60 0.00 2013
1 December 2013 31 5115.80 20.52 0.23 2013
6 January 2011 31 4517.56 27.45 0.00 2011
6 February 2011 28 4116.07 16.75 0.07 2011
6 March 2011 31 3981.78 12.68 0.39 2011
6 April 2011 30 4041.68 3.83 2.53 2011
6 May 2011 31 4287.23 1.61 6.58 2011")
factout <- lmList(UPD ~ HD + CD | Store, dat)
data.frame(Store = unique(dat$Store), summary(factout)$coef[1:2,1,1:3])
(Intercept) HD CD
1 5405.108 -12.90986 107.2061
6 3581.307 32.93137 102.9780
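The output above still has one row per store. A hedged base-R sketch for one fit per store-and-year combination is to split on both columns at once and fit lm() to each piece. The same combined key could presumably be passed to lmList() as a single grouping factor, but that is an assumption not shown in the answer. The rows below are the question's sample with only the columns the model needs:

```r
dat <- read.table(header = TRUE, text = "
Store UPD HD CD Year
1 6478.27 0.06 10.03 2013
1 6015.38 0.50 5.67 2013
1 5478.21 5.29 1.48 2013
1 5223.78 18.60 0.00 2013
1 5115.80 20.52 0.23 2013
6 4517.56 27.45 0.00 2011
6 4116.07 16.75 0.07 2011
6 3981.78 12.68 0.39 2011
6 4041.68 3.83 2.53 2011
6 4287.23 1.61 6.58 2011
")

# One data frame per observed Store/Year pair; drop = TRUE skips empty pairs
groups <- split(dat, list(dat$Store, dat$Year), drop = TRUE)

# Fit the same model to each piece; rows are named like "1.2013"
coefs <- t(sapply(groups, function(d) coef(lm(UPD ~ HD + CD, data = d))))
```

Each row of `coefs` holds the intercept and the HD and CD slopes for one store in one year, without creating a separate data frame per year by hand.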

correlation between two data frames in R

I have one data frame which has sales values for the time period Oct. 2000 to Dec. 2001 (15 months). Also I have profit values for the same time period as above and I want to find the correlation between these two data frames month wise for these 15 months in R. My data frame sales is:
Month sales
Oct. 2000 24.1
Nov. 2000 23.3
Dec. 2000 43.9
Jan. 2001 53.8
Feb. 2001 74.9
Mar. 2001 25
Apr. 2001 48.5
May. 2001 18
Jun. 2001 68.1
Jul. 2001 78
Aug. 2001 48.8
Sep. 2001 48.9
Oct. 2001 34.3
Nov. 2001 54.1
Dec. 2001 29.3
My second data frame profit is:
period profit
Oct 2000 14.1
Nov 2000 3.3
Dec 2000 13.9
Jan 2001 23.8
Feb 2001 44.9
Mar 2001 15
Apr 2001 58.5
May 2001 18
Jun 2001 58.1
Jul 2001 38
Aug 2001 28.8
Sep 2001 18.9
Oct 2001 24.3
Nov 2001 24.1
Dec 2001 19.3
I know that for the first two months I cannot get the correlation, as there are not enough values, but from Dec. 2000 onwards I want to calculate the correlation using all months up to and including the current one. So for Dec. 2000 I will consider the values of Oct. 2000, Nov. 2000 and Dec. 2000, which gives me 3 sales values and 3 profit values. Similarly, for Jan. 2001 I will consider the values of Oct. 2000, Nov. 2000, Dec. 2000 and Jan. 2001, giving 4 sales values and 4 profit values. Thus for every month I will also use the previous months' values to calculate the correlation, and my output should be something like this:
Month Correlation
Oct. 2000 NA or Empty
Nov. 2000 NA or Empty
Dec. 2000 x
Jan. 2001 y
. .
. .
Dec. 2001 a
I know that in R there is a function cor(sales, profit) but how can I find out the correlation for my scenario?

Make some sample data:
> sales = c(1,4,3,2,3,4,5,6,7,6,7,5)
> profit = c(4,3,2,3,4,5,6,7,7,7,6,5)
> data = data.frame(sales=sales,profit=profit)
> head(data)
sales profit
1 1 4
2 4 3
3 3 2
4 2 3
5 3 4
6 4 5
Here's the beef:
> data$runcor = c(NA,NA,
sapply(3:nrow(data),
function(i){
cor(data$sales[1:i],data$profit[1:i])
}))
> data
sales profit runcor
1 1 4 NA
2 4 3 NA
3 3 2 -0.65465367
4 2 3 -0.63245553
5 3 4 -0.41931393
6 4 5 0.08155909
7 5 6 0.47368421
8 6 7 0.69388867
9 7 7 0.78317543
10 6 7 0.81256816
11 7 6 0.80386072
12 5 5 0.80155885
So now data$runcor[3] is the correlation of the first 3 sales and profit numbers.
Note that I call this runcor because it's a "running correlation", like a "running sum", which is the sum of all elements so far. This is the correlation of all pairs so far.

Another possibility (if dat1 and dat2 are the initial datasets):
dat1$Month <- gsub("\\.", "", dat1$Month)
datN <- merge(dat1, dat2, sort=FALSE, by.x="Month", by.y="period")
indx <- sequence(3:nrow(datN)) #create index to replicate the rows
indx1 <- cumsum(c(TRUE,diff(indx) <0)) #create another index to group the rows
#calculate the correlation grouped by `indx1`
datN$runcor <- setNames(c(NA, NA,by(datN[indx,-1],
list(indx1), FUN=function(x) cor(x$sales, x$profit) )), NULL)
datN
# Month sales profit runcor
#1 Oct 2000 24.1 14.1 NA
#2 Nov 2000 23.3 3.3 NA
#3 Dec 2000 43.9 13.9 0.5155911
#4 Jan 2001 53.8 23.8 0.8148546
#5 Feb 2001 74.9 44.9 0.9345166
#6 Mar 2001 25.0 15.0 0.9119941
#7 Apr 2001 48.5 58.5 0.7056301
#8 May 2001 18.0 18.0 0.6879528
#9 Jun 2001 68.1 58.1 0.7647177
#10 Jul 2001 78.0 38.0 0.7357748
#11 Aug 2001 48.8 28.8 0.7351366
#12 Sep 2001 48.9 18.9 0.7190413
#13 Oct 2001 34.3 24.3 0.7175138
#14 Nov 2001 54.1 24.1 0.7041889
#15 Dec 2001 29.3 19.3 0.7094334
