I am currently trying to compute Theil-Sen trend estimates for a number of time series. How should I convert the Date variable so that it can be used with the mblm package? The dates currently look like 'Apr 1981'. I want to use monthly medians in this assessment. See the attached data.frame.
Thanks!
mo yr doc Date
04 1981 2.800 Apr 1981
05 1982 2.700 May 1982
10 1999 0.500 Oct 1999
05 2000 2.400 May 2000
06 2000 1.200 Jun 2000
07 2000 0.950 Jul 2000
08 2000 0.700 Aug 2000
09 2000 0.750 Sep 2000
10 2000 0.600 Oct 2000
11 2000 0.785 Nov 2000
12 2000 0.660 Dec 2000
01 2001 0.710 Jan 2001
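Since mblm() (like lm()) needs a numeric predictor, the 'Apr 1981' strings have to become numbers first. A minimal sketch of one approach, building decimal years from the mo/yr columns (the data values are copied from the posted data frame; the mblm call is guarded because the package may not be installed, and `repeated = FALSE` selects the plain Theil-Sen estimator):

```r
# Toy subset of the posted data frame
df <- data.frame(
  mo  = c(4, 5, 10, 5),
  yr  = c(1981, 1982, 1999, 2000),
  doc = c(2.800, 2.700, 0.500, 2.400)
)

# Decimal years: Apr 1981 -> 1981.25
df$t <- df$yr + (df$mo - 1) / 12

# Alternatively, parse 'Apr 1981' by pinning the day to the 1st
# (the %b month-name format is locale-dependent):
# df$date <- as.Date(paste("01", df$Date), format = "%d %b %Y")

if (requireNamespace("mblm", quietly = TRUE)) {
  fit <- mblm::mblm(doc ~ t, df, repeated = FALSE)  # Theil-Sen slope
  print(coef(fit))
}
```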
I'm attempting to find the percent differences of state characteristics (using a defined index created using factor analysis) between the years 2012 and 2017. However some states begin with a score of -0.617 (2012) and end with -1.25 (2017), creating a positive percent difference rather than a negative.
The only other thing I've tried is subtracting 1 from the fraction factor1/lag(factor1). Below is the code I'm currently working with:
STFACTOR %>%
dplyr::select(FIPSst, Geography, Year, factor1) %>%
filter(Year %in% c(2012, 2017)) %>%
group_by(Geography) %>%
mutate(pct_change = (factor1/lag(factor1)-1) * 100)
Here are the variations I tried, each followed by its result:
mutate(pct_change = (1-factor1/lag(factor1)) * 100)
FIPSst Geography Year factor1[,1] pct_change
<chr> <fct> <int> <dbl> <dbl>
1 01 Alabama 2012 1.82 NA
2 01 Alabama 2017 0.945 47.9
3 04 Arizona 2012 0.813 NA
4 04 Arizona 2017 0.108 86.7
5 05 Arkansas 2012 1.52 NA
6 05 Arkansas 2017 0.626 58.8
7 06 California 2012 1.04 NA
8 06 California 2017 0.0828 92.1
9 08 Colorado 2012 -0.617 NA
10 08 Colorado 2017 -1.25 -102.
mutate(pct_change = (factor1/lag(factor1)-1) * 100)
FIPSst Geography Year factor1[,1] pct_change
<chr> <fct> <int> <dbl> <dbl>
1 01 Alabama 2012 1.82 NA
2 01 Alabama 2017 0.945 -47.9
3 04 Arizona 2012 0.813 NA
4 04 Arizona 2017 0.108 -86.7
5 05 Arkansas 2012 1.52 NA
6 05 Arkansas 2017 0.626 -58.8
7 06 California 2012 1.04 NA
8 06 California 2017 0.0828 -92.1
9 08 Colorado 2012 -0.617 NA
10 08 Colorado 2017 -1.25 102.
I would expect the final result to look like this:
FIPSst Geography Year factor1[,1] pct_change
<chr> <fct> <int> <dbl> <dbl>
1 01 Alabama 2012 1.82 NA
2 01 Alabama 2017 0.945 -47.9
3 04 Arizona 2012 0.813 NA
4 04 Arizona 2017 0.108 -86.7
5 05 Arkansas 2012 1.52 NA
6 05 Arkansas 2017 0.626 -58.8
7 06 California 2012 1.04 NA
8 06 California 2017 0.0828 -92.1
9 08 Colorado 2012 -0.617 NA
10 08 Colorado 2017 -1.25 -102.
mutate(pct_change = (factor1-lag(factor1))/lag(abs(factor1)) * 100)
Above is the final solution to the problem: subtract the old value from the new one before dividing by the absolute value of the old value.
We can use
mutate(pct_change =(factor1 - lag(factor1))/abs(lag(factor1)) * 100)
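To see why dividing by the absolute value of the old value fixes the sign for series that start negative, here is a small base-R sketch (the helper name pct_change is illustrative, not from the thread):

```r
# Percent change whose sign reflects the direction of movement,
# even when the starting value is negative
pct_change <- function(old, new) (new - old) / abs(old) * 100

pct_change(1.82, 0.945)    # Alabama decreases: result is negative
pct_change(-0.617, -1.25)  # Colorado also decreases: result is negative too
```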
I have a data frame with monthly temperature data for several locations:
> df4[1:36,]
location variable cut month year freq
1 Adamantina temperature 10 Jan 1981 21.0
646 Adamantina temperature 10 Feb 1981 20.5
1291 Adamantina temperature 10 Mar 1981 21.5
1936 Adamantina temperature 10 Apr 1981 21.5
2581 Adamantina temperature 10 May 1981 24.0
3226 Adamantina temperature 10 Jun 1981 21.5
3871 Adamantina temperature 10 Jul 1981 22.5
4516 Adamantina temperature 10 Aug 1981 23.5
5161 Adamantina temperature 10 Sep 1981 19.5
5806 Adamantina temperature 10 Oct 1981 21.5
6451 Adamantina temperature 10 Nov 1981 23.0
7096 Adamantina temperature 10 Dec 1981 19.0
2 Adolfo temperature 10 Jan 1981 24.0
647 Adolfo temperature 10 Feb 1981 20.0
1292 Adolfo temperature 10 Mar 1981 24.0
1937 Adolfo temperature 10 Apr 1981 23.0
2582 Adolfo temperature 10 May 1981 18.0
3227 Adolfo temperature 10 Jun 1981 21.0
3872 Adolfo temperature 10 Jul 1981 22.0
4517 Adolfo temperature 10 Aug 1981 19.0
5162 Adolfo temperature 10 Sep 1981 19.0
5807 Adolfo temperature 10 Oct 1981 24.0
6452 Adolfo temperature 10 Nov 1981 24.0
7097 Adolfo temperature 10 Dec 1981 24.0
3 Aguai temperature 10 Jan 1981 24.0
648 Aguai temperature 10 Feb 1981 20.0
1293 Aguai temperature 10 Mar 1981 22.0
1938 Aguai temperature 10 Apr 1981 20.0
2583 Aguai temperature 10 May 1981 21.5
3228 Aguai temperature 10 Jun 1981 20.5
3873 Aguai temperature 10 Jul 1981 24.0
4518 Aguai temperature 10 Aug 1981 23.5
5163 Aguai temperature 10 Sep 1981 18.5
5808 Aguai temperature 10 Oct 1981 21.0
6453 Aguai temperature 10 Nov 1981 22.0
7098 Aguai temperature 10 Dec 1981 23.5
What I need to do is to programmatically split this data frame by location and create a .Rdata file for every location.
In the example above, I would have three different files - Adamantina.Rdata, Adolfo.Rdata and Aguai.Rdata - containing all the columns but only the rows corresponding to those locations.
It needs to be efficient and programmatic, because in my actual data I have about 700 different locations and about 50 years of data for every location.
Thanks in advance.
This borrows from a previous answer, but I don't believe that answer does what you want.
First, as they suggest, you want to split up your data set.
splitData <- split(df4, df4$location)
Now, to go through this list and save each data set one by one, you can pull the names off the list:
allNames <- names(splitData)
for (thisName in allNames) {
  # saveRDS() output is conventionally given the .rds extension
  # and read back with readRDS()
  saveName <- paste0(thisName, '.rds')
  saveRDS(splitData[[thisName]], file = saveName)
}
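A self-contained round trip of this approach, using toy data and a temporary directory (note that files written with saveRDS() are read back with readRDS(), not load()):

```r
# Toy stand-in for df4; the real data would have all the columns shown above
df4 <- data.frame(location = rep(c("Adamantina", "Adolfo"), each = 2),
                  freq = c(21.0, 20.5, 24.0, 20.0))

splitData <- split(df4, df4$location)     # one data frame per location
outdir <- tempdir()
for (thisName in names(splitData)) {
  saveRDS(splitData[[thisName]],
          file = file.path(outdir, paste0(thisName, ".rds")))
}

# Read one location back
adamantina <- readRDS(file.path(outdir, "Adamantina.rds"))
```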
To split the data frame, use split(df4, df4$location). This returns a list of data frames named Adamantina, Adolfo, Aguai, etc.
To save these data frames into a single locations.RData file, first assign them into the environment, for example with list2env(split(df4, df4$location), envir = globalenv()), and then use save(Adamantina, Adolfo, Aguai, file="locations.RData"). save.image(file="filename.RData") will save everything in the current R session to filename.RData.
You can read more about save and save.image here.
Edit:
If the number of splits is too large, use this approach instead:
locations <- split(df4, df4$location)
save(locations, file = "locations.RData")
locations.RData will then load as a list.
I am trying to build a model for monthly energy consumption based on weather, grouped by location (there are ~1100) AND year (I would like to do it from 2011-2014). The data is called factin and looks like this:
Store Month Days UPD HD CD Year
1 August, 2013 31 6478.27 0.06 10.03 2013
1 September, 2013 30 6015.38 0.50 5.67 2013
1 October, 2013 31 5478.21 5.29 1.48 2013
1 November, 2013 30 5223.78 18.60 0.00 2013
1 December, 2013 31 5115.80 20.52 0.23 2013
6 January, 2011 31 4517.56 27.45 0.00 2011
6 February, 2011 28 4116.07 16.75 0.07 2011
6 March, 2011 31 3981.78 12.68 0.39 2011
6 April, 2011 30 4041.68 3.83 2.53 2011
6 May, 2011 31 4287.23 1.61 6.58 2011
And my model code, which just spits out 1 set of coefficients for all the years of each store, looks like this:
factout <- lmList(UPD ~ HD + CD | Store, factin)
My question is: is there any way I can get coefficients for each store AND year without creating a separate data frame for each year?
dat <- read.table(header = T, stringsAsFactors = F, text = "Store Month year Days UPD HD CD Year
1 August 2013 31 6478.27 0.06 10.03 2013
1 September 2013 30 6015.38 0.50 5.67 2013
1 October 2013 31 5478.21 5.29 1.48 2013
1 November 2013 30 5223.78 18.60 0.00 2013
1 December 2013 31 5115.80 20.52 0.23 2013
6 January 2011 31 4517.56 27.45 0.00 2011
6 February 2011 28 4116.07 16.75 0.07 2011
6 March 2011 31 3981.78 12.68 0.39 2011
6 April 2011 30 4041.68 3.83 2.53 2011
6 May 2011 31 4287.23 1.61 6.58 2011")
library(nlme)  # provides lmList() (lme4 has a version as well)
factout <- lmList(UPD ~ HD + CD | Store, dat)
data.frame(Store = unique(dat$Store), summary(factout)$coef[1:2,1,1:3])
(Intercept) HD CD
1 5405.108 -12.90986 107.2061
6 3581.307 32.93137 102.9780
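To get coefficients per Store and Year, one option (an assumption on my part, not from the thread) is a combined grouping factor, e.g. dat$StoreYear <- interaction(dat$Store, dat$Year), followed by lmList(UPD ~ HD + CD | StoreYear, dat). The same idea in base R, using the sample data from the answer:

```r
dat <- read.table(header = TRUE, text = "Store Month year Days UPD HD CD Year
1 August 2013 31 6478.27 0.06 10.03 2013
1 September 2013 30 6015.38 0.50 5.67 2013
1 October 2013 31 5478.21 5.29 1.48 2013
1 November 2013 30 5223.78 18.60 0.00 2013
1 December 2013 31 5115.80 20.52 0.23 2013
6 January 2011 31 4517.56 27.45 0.00 2011
6 February 2011 28 4116.07 16.75 0.07 2011
6 March 2011 31 3981.78 12.68 0.39 2011
6 April 2011 30 4041.68 3.83 2.53 2011
6 May 2011 31 4287.23 1.61 6.58 2011")

# One lm() per Store-Year combination; drop = TRUE skips empty combinations
groups <- interaction(dat$Store, dat$Year, drop = TRUE)
fits <- lapply(split(dat, groups), function(d) lm(UPD ~ HD + CD, data = d))
coefs <- t(sapply(fits, coef))
coefs  # one row of (Intercept), HD, CD per Store-Year group
```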
I have one data frame with sales values for the period Oct. 2000 to Dec. 2001 (15 months), and a second data frame with profit values for the same period. I want to find the month-wise correlation between the two over these 15 months in R. My sales data frame is:
Month sales
Oct. 2000 24.1
Nov. 2000 23.3
Dec. 2000 43.9
Jan. 2001 53.8
Feb. 2001 74.9
Mar. 2001 25
Apr. 2001 48.5
May. 2001 18
Jun. 2001 68.1
Jul. 2001 78
Aug. 2001 48.8
Sep. 2001 48.9
Oct. 2001 34.3
Nov. 2001 54.1
Dec. 2001 29.3
My second data frame profit is:
period profit
Oct 2000 14.1
Nov 2000 3.3
Dec 2000 13.9
Jan 2001 23.8
Feb 2001 44.9
Mar 2001 15
Apr 2001 58.5
May 2001 18
Jun 2001 58.1
Jul 2001 38
Aug 2001 28.8
Sep 2001 18.9
Oct 2001 24.3
Nov 2001 24.1
Dec 2001 19.3
Now I know that for the first two months I cannot get the correlation, as there are not enough values, but from Dec. 2000 onwards I want to calculate the correlation taking the previous months' values into consideration. So for Dec. 2000 I will consider the values of Oct. 2000, Nov. 2000 and Dec. 2000, giving me 3 sales values and 3 profit values. Similarly, for Jan. 2001 I will consider the values of Oct. 2000, Nov. 2000, Dec. 2000 and Jan. 2001, giving 4 sales values and 4 profit values. Thus for every month I will also consider the previous months' values when calculating the correlation, and my output should look something like this:
Month Correlation
Oct. 2000 NA or Empty
Nov. 2000 NA or Empty
Dec. 2000 x
Jan. 2001 y
. .
. .
Dec. 2001 a
I know that in R there is a function cor(sales, profit) but how can I find out the correlation for my scenario?
Make some sample data:
> sales = c(1,4,3,2,3,4,5,6,7,6,7,5)
> profit = c(4,3,2,3,4,5,6,7,7,7,6,5)
> data = data.frame(sales=sales,profit=profit)
> head(data)
sales profit
1 1 4
2 4 3
3 3 2
4 2 3
5 3 4
6 4 5
Here's the beef:
> data$runcor = c(NA,NA,
sapply(3:nrow(data),
function(i){
cor(data$sales[1:i],data$profit[1:i])
}))
> data
sales profit runcor
1 1 4 NA
2 4 3 NA
3 3 2 -0.65465367
4 2 3 -0.63245553
5 3 4 -0.41931393
6 4 5 0.08155909
7 5 6 0.47368421
8 6 7 0.69388867
9 7 7 0.78317543
10 6 7 0.81256816
11 7 6 0.80386072
12 5 5 0.80155885
So now data$runcor[3] is the correlation of the first 3 sales and profit numbers.
Note I call this runcor as it's a "running correlation": like a "running sum", which is the sum of all elements so far, this is the correlation of all pairs so far.
Another possibility (if dat1 and dat2 are the initial datasets):
Update
dat1$Month <- gsub("\\.", "", dat1$Month)
datN <- merge(dat1, dat2, sort=FALSE, by.x="Month", by.y="period")
indx <- sequence(3:nrow(datN)) #create index to replicate the rows
indx1 <- cumsum(c(TRUE,diff(indx) <0)) #create another index to group the rows
#calculate the correlation grouped by `indx1`
datN$runcor <- setNames(c(NA, NA,by(datN[indx,-1],
list(indx1), FUN=function(x) cor(x$sales, x$profit) )), NULL)
datN
# Month sales profit runcor
#1 Oct 2000 24.1 14.1 NA
#2 Nov 2000 23.3 3.3 NA
#3 Dec 2000 43.9 13.9 0.5155911
#4 Jan 2001 53.8 23.8 0.8148546
#5 Feb 2001 74.9 44.9 0.9345166
#6 Mar 2001 25.0 15.0 0.9119941
#7 Apr 2001 48.5 58.5 0.7056301
#8 May 2001 18.0 18.0 0.6879528
#9 Jun 2001 68.1 58.1 0.7647177
#10 Jul 2001 78.0 38.0 0.7357748
#11 Aug 2001 48.8 28.8 0.7351366
#12 Sep 2001 48.9 18.9 0.7190413
#13 Oct 2001 34.3 24.3 0.7175138
#14 Nov 2001 54.1 24.1 0.7041889
#15 Dec 2001 29.3 19.3 0.7094334
I have this data.frame:
counts <- data.frame(year = sort(rep(2000:2009, 12)), month = rep(month.abb,10), count = sample(1:500, 120, replace = T))
First 20 rows of data:
head(counts, 20)
year month count
1 2000 Jan 14
2 2000 Feb 182
3 2000 Mar 462
4 2000 Apr 395
5 2000 May 107
6 2000 Jun 127
7 2000 Jul 371
8 2000 Aug 158
9 2000 Sep 147
10 2000 Oct 41
11 2000 Nov 141
12 2000 Dec 27
13 2001 Jan 72
14 2001 Feb 7
15 2001 Mar 40
16 2001 Apr 351
17 2001 May 342
18 2001 Jun 81
19 2001 Jul 442
20 2001 Aug 389
Lets say I try to calculate the standard deviation of these data using the usual R code:
library(plyr)
ddply(counts, .(month), summarise, s.d. = sd(count))
month s.d.
1 Apr 145.3018
2 Aug 140.9949
3 Dec 173.9406
4 Feb 127.5296
5 Jan 148.2661
6 Jul 162.4893
7 Jun 133.4383
8 Mar 125.8425
9 May 168.9517
10 Nov 93.1370
11 Oct 167.9436
12 Sep 166.8740
This gives the standard deviation around the mean of each month. How can I get R to output standard deviation around maximum value of each month?
It sounds like you want the max of the values per month and the average deviation from this maximum value (which is not the same as the standard deviation).
counts <- data.frame(year = sort(rep(2000:2009, 12)), month = rep(month.abb,10), count = sample(1:500, 120, replace = T))
library(data.table)
counts=data.table(counts)
counts[,mean(count-max(count)),by=month]
This question is highly vague. If you want to calculate the standard deviation of the differences to the maximum, you can use this code:
> library(plyr)
> ddply(counts, .(month), summarise, sd = sd(count - max(count)))
month sd
1 Apr 182.5071
2 Aug 114.3068
3 Dec 117.1049
4 Feb 184.4638
5 Jan 138.1755
6 Jul 167.0677
7 Jun 100.8841
8 Mar 144.8724
9 May 173.3452
10 Nov 132.0204
11 Oct 127.4645
12 Sep 152.2162
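One caveat worth noting (my addition, not from the answers above): within each month, max(count) is a constant, and sd() is unchanged by subtracting a constant, so sd(count - max(count)) is identical to plain sd(count). If "spread around the maximum" is meant literally, one alternative is the root-mean-square deviation from the maximum:

```r
set.seed(1)  # toy data matching the question's construction
counts <- data.frame(year = sort(rep(2000:2009, 12)),
                     month = rep(month.abb, 10),
                     count = sample(1:500, 120, replace = TRUE))

# sd() ignores a constant shift within each group...
x <- counts$count[counts$month == "Jan"]
all.equal(sd(x - max(x)), sd(x))  # TRUE

# ...whereas the RMS deviation from the per-month max does not
rms_from_max <- aggregate(count ~ month, data = counts,
                          FUN = function(x) sqrt(mean((x - max(x))^2)))
```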