I am trying to read in a time series and plot it with plot.ts(), but I am getting weird results. Perhaps I did something wrong. I tried including the start and end dates, but the output is still wrong.
Any help appreciated. Thank you.
This is the code and output:
sales1 <- read.csv("TimeS.csv",header=TRUE)
sales1
salesT <- ts(sales1)
salesT
plot.ts(salesT)
output:
> sales1 <- read.csv("TimeS.csv",header=TRUE)
> sales1
year q1 q2 q3 q4
1 1991 4.8 4.1 6.0 6.5
2 1992 5.8 5.2 6.8 7.4
3 1993 6.0 5.6 7.5 7.8
4 1994 6.3 5.9 8.0 8.4
> salesT <- ts(sales1)
> salesT
Time Series:
Start = 1
End = 4
Frequency = 1
year q1 q2 q3 q4
1 1991 4.8 4.1 6.0 6.5
2 1992 5.8 5.2 6.8 7.4
3 1993 6.0 5.6 7.5 7.8
4 1994 6.3 5.9 8.0 8.4
> plot.ts(salesT)
It looks like I can't paste the plot here. Instead of one graph, it shows five separate
plots stacked on top of each other.
Try this:
salesT <- ts(unlist(t(sales1[,-1])), start = c(1991, 1), freq = 4)
The format of the original data is difficult to use directly for a time series. You could try this instead:
sales1 <- t(sales1[,-1])
sales1 <- as.vector(sales1)
my_ts <- ts(sales1, frequency = 4, start=c(1991,1))
plot.ts(my_ts)
Here I think you need to set the time attributes when you create the series (they belong in ts(), not in ts.plot()); try this:
salesT <- ts(c(t(sales1[,-1])), frequency = 4, start = c(1991, 1))
ts.plot(salesT)
This line turns the year column into one of the series, which is unlikely to be what you want:
> salesT <- ts(sales1)
We need to transpose the data frame so that it is read across the rows rather than down the columns, and we use c to turn the resulting matrix into a vector, which forms the data portion of the series. (continued after chart)
# create sales1
Lines <- "year q1 q2 q3 q4
1 1991 4.8 4.1 6.0 6.5
2 1992 5.8 5.2 6.8 7.4
3 1993 6.0 5.6 7.5 7.8
4 1994 6.3 5.9 8.0 8.4"
sales1 <- read.table(text = Lines, header = TRUE)
# convert to ts and plot
salesT <- ts(c(t(sales1[-1])), start = sales1[1, 1], freq = 4)
plot(salesT)
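As a quick sanity check (a small base-R sketch reusing the same data), you can confirm that the series picked up the quarterly layout:

```r
# Rebuild the series from the question's data and inspect its time attributes
sales1 <- read.table(text = "year q1 q2 q3 q4
1991 4.8 4.1 6.0 6.5
1992 5.8 5.2 6.8 7.4
1993 6.0 5.6 7.5 7.8
1994 6.3 5.9 8.0 8.4", header = TRUE)
salesT <- ts(c(t(sales1[-1])), start = sales1[1, 1], frequency = 4)
frequency(salesT)                       # 4: quarterly
start(salesT)                           # year 1991, quarter 1
window(salesT, c(1992, 1), c(1992, 4))  # the four 1992 quarters: 5.8 5.2 6.8 7.4
```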
Regarding the comment, if the data looks like this then it is more straightforward, and the lines below will produce the above plot. We have assumed that the data is sorted and starts at the beginning of a year, so we do not need to use the second column:
Lines2 <- "year qtr sales
1 1991 q1 4.8
2 1991 q2 4.1
3 1991 q3 6.0
4 1991 q4 6.5
5 1992 q1 5.8
6 1992 q2 5.2
7 1992 q3 6.8
8 1992 q4 7.4
9 1993 q1 6.0
10 1993 q2 5.6
11 1993 q3 7.5
12 1993 q4 7.8
13 1994 q1 6.3
14 1994 q2 5.9
15 1994 q3 8.0
16 1994 q4 8.4"
sales2 <- read.table(text = Lines2, header = TRUE)
salesT2 <- ts(sales2$sales, start = sales2$year[1], freq = 4)
plot(salesT2)
Update: fixed; added a response to the comments.
I have a data frame that look like this:
z <- data.frame(ent = c(1, 1, 1, 2, 2, 2, 3, 3, 3), year = c(1995, 2000, 2005, 1995, 2000, 2005, 1995, 2000, 2005), pobtot = c(50, 60, 70, 10, 4, 1, 100, 105, 110))
As you can see, there is a five-year gap between observations for every "ent". I want to interpolate the data for every missing year (1996, 1997, 1998, 1999, 2001, 2002, 2003, 2004) and also forecast 2006, 2007, and 2008. Is there a way to do this?
Any help would be appreciated.
We can use complete to expand the data for each 'ent' and the 'year' range, then with na.approx interpolate the missing values in 'pobtot'
library(dplyr)
library(tidyr)
z %>%
complete(ent, year = 1995:2008) %>%
mutate(pobtot = zoo::na.approx(pobtot, na.rm = FALSE))
Assuming you want linear interpolation: R uses approx() for such things by default, e.g. for drawing lines in a plot, and we can use the same function to interpolate the years. It doesn't extrapolate, though; for that we could use forecast::ets() with default settings, which fits an exponential smoothing state space model. Note, however, that this may produce negative values, and the OP hasn't stated what is needed in such a case. In a by() approach we could do:
library(forecast)
p <- 3 ## define number of years for prediction
res <- do.call(rbind, by(z, z$ent, function(x) {
  yseq <- min(x$year):(max(x$year) + p)            ## sequence of years + prediction
  a <- approx(x$year, x$pobtot, head(yseq, -p))$y  ## linear interpolation
  f <- predict(ets(a), p)                          ## predict `p` years ahead
  r <- c(a, f$mean)                                ## combine interpolation and prediction
  data.frame(ent = x$ent[1], year = yseq, pobtot = r)  ## output as data frame
}))
Result
res
# ent year pobtot
# 1.1 1 1995 50.0
# 1.2 1 1996 52.0
# 1.3 1 1997 54.0
# 1.4 1 1998 56.0
# 1.5 1 1999 58.0
# 1.6 1 2000 60.0
# 1.7 1 2001 62.0
# 1.8 1 2002 64.0
# 1.9 1 2003 66.0
# 1.10 1 2004 68.0
# 1.11 1 2005 70.0
# 1.12 1 2006 72.0
# 1.13 1 2007 74.0
# 1.14 1 2008 76.0
# 2.1 2 1995 10.0
# 2.2 2 1996 8.8
# 2.3 2 1997 7.6
# 2.4 2 1998 6.4
# 2.5 2 1999 5.2
# 2.6 2 2000 4.0
# 2.7 2 2001 3.4
# 2.8 2 2002 2.8
# 2.9 2 2003 2.2
# 2.10 2 2004 1.6
# 2.11 2 2005 1.0
# 2.12 2 2006 0.4
# 2.13 2 2007 -0.2
# 2.14 2 2008 -0.8
# 3.1 3 1995 100.0
# 3.2 3 1996 101.0
# 3.3 3 1997 102.0
# 3.4 3 1998 103.0
# 3.5 3 1999 104.0
# 3.6 3 2000 105.0
# 3.7 3 2001 106.0
# 3.8 3 2002 107.0
# 3.9 3 2003 108.0
# 3.10 3 2004 109.0
# 3.11 3 2005 110.0
# 3.12 3 2006 111.0
# 3.13 3 2007 112.0
# 3.14 3 2008 113.0
We can quickly check this in a plot which, apart from the negative values of entity 2, looks quite reasonable.
with(res, plot(year, pobtot, type='n', main='z'))
with(res[res$year < 2006, ], points(year, pobtot, pch=20, col=3))
with(res[res$year > 2005, ], points(year, pobtot, pch=20, col=4))
with(res[res$year %in% z$year, ], points(year, pobtot, pch=20, col=1))
abline(h=0, lty=3)
legend(2005.25, 50, c('measurem.', 'interpol.', 'extrapol.'), pch=20,
col=c(1, 3, 4), cex=.8, bty='n')
My R dataset (migration) looks like this:
date gender UK USA Canada Mexico
1990 M 4.2 6.3 4.0 5.1
1990 F 5.2 4.3 6.0 4.1
1991 M 3.2 5.3 5.0 7.1
1991 F 4.2 5.3 4.0 4.1
1992 M 3.2 3.3 2.0 5.1
1992 F 6.2 6.3 4.0 3.1
What do I want to do?
I want to create a plot showing the trend line by year of all countries.
I want to color by gender
Facet by countries
What did I do?
I produced the following code
ggplot(migration,
aes(date,gender, color=gender)) +
geom_point() +
facet_wrap(UK~USA~Canada~Mexico)
However, it does not work. Please kindly help me solve this?
library(ggplot2)
library(tidyr)
migl <- gather(data = migration, country, value, -c(date, gender))
ggplot(data = migl,
aes(x = date, y = value, color = gender)) +
geom_point(size=2) +
geom_smooth()+
facet_wrap(~country)
Data:
migration <- read.table(text="date gender UK USA Canada Mexico
1990 M 4.2 6.3 4.0 5.1
1990 F 5.2 4.3 6.0 4.1
1991 M 3.2 5.3 5.0 7.1
1991 F 4.2 5.3 4.0 4.1
1992 M 3.2 3.3 2.0 5.1
1992 F 6.2 6.3 4.0 3.1", header=T)
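In case tidyr is unavailable, the gather() step can also be done in base R with reshape(); this is just an alternative sketch, with the column names country and value mirroring those used above:

```r
# Base-R alternative to gather(): reshape the wide data to long format
migration <- read.table(text = "date gender UK USA Canada Mexico
1990 M 4.2 6.3 4.0 5.1
1990 F 5.2 4.3 6.0 4.1
1991 M 3.2 5.3 5.0 7.1
1991 F 4.2 5.3 4.0 4.1
1992 M 3.2 3.3 2.0 5.1
1992 F 6.2 6.3 4.0 3.1", header = TRUE)
countries <- c("UK", "USA", "Canada", "Mexico")
migl <- reshape(migration, direction = "long", varying = countries,
                v.names = "value", timevar = "country", times = countries)
head(migl)  # one row per date/gender/country, ready for facet_wrap(~country)
```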
I am trying to calculate diameter growth for a set of trees over a number of years, in a dataframe in which each row is a given stem in a given year. Typically, this sort of data has each individual stem as a single row, with that stem's diameter for each year in a separate column, but for various reasons this dataframe needs to remain one row per stem per year. A simplified model version of the data would be as follows:
df<-data.frame("Stem"=c(1:5,1:5,1,2,3,5,1,2,3,5,6),
"Year"=c(rep(1997,5), rep(1998,5), rep(1999,4), rep(2000,5)),
"Diameter"=c(1:5,seq(1.5,5.5,1),2,3,4,6,3,5,7,9,15))
df
Stem Year Diameter
1 1 1997 1.0
2 2 1997 2.0
3 3 1997 3.0
4 4 1997 4.0
5 5 1997 5.0
6 1 1998 1.5
7 2 1998 2.5
8 3 1998 3.5
9 4 1998 4.5
10 5 1998 5.5
11 1 1999 2.0
12 2 1999 3.0
13 3 1999 4.0
14 5 1999 6.0
15 1 2000 3.0
16 2 2000 5.0
17 3 2000 7.0
18 5 2000 9.0
19 6 2000 15.0
What I am trying to accomplish is to make a new column that takes the diameter for a given stem in a given year and subtracts the diameter for that same stem in the previous year. I assume this will require some set of nested for loops, something like:
for (i in seq_along(unique(df$Stem))) {
  for (t in 2:length(unique(df$Year))) {
    .....
  }
}
What I'm struggling with is how to write the function that calculates Diameter[t] - Diameter[t-1] for each stem. Any suggestions would be greatly appreciated.
Try:
> do.call(rbind, lapply(split(df, df$Stem), function(x) transform(x, diff = c(0,diff(x$Diameter)))))
Stem Year Diameter diff
1.1 1 1997 1.0 0.0
1.6 1 1998 1.5 0.5
1.11 1 1999 2.0 0.5
1.15 1 2000 3.0 1.0
2.2 2 1997 2.0 0.0
2.7 2 1998 2.5 0.5
2.12 2 1999 3.0 0.5
2.16 2 2000 5.0 2.0
3.3 3 1997 3.0 0.0
3.8 3 1998 3.5 0.5
3.13 3 1999 4.0 0.5
3.17 3 2000 7.0 3.0
4.4 4 1997 4.0 0.0
4.9 4 1998 4.5 0.5
5.5 5 1997 5.0 0.0
5.10 5 1998 5.5 0.5
5.14 5 1999 6.0 0.5
5.18 5 2000 9.0 3.0
6 6 2000 15.0 0.0
Rnso's answer works. You could also do the slightly shorter:
> df <- df[order(df$Stem), ]
> df$diff <- unlist(tapply(df$Diameter, df$Stem, function(x) c(NA, diff(x))))
Stem Year Diameter diff
1 1 1997 1.0 NA
6 1 1998 1.5 0.5
11 1 1999 2.0 0.5
15 1 2000 3.0 1.0
2 2 1997 2.0 NA
7 2 1998 2.5 0.5
12 2 1999 3.0 0.5
16 2 2000 5.0 2.0
3 3 1997 3.0 NA
8 3 1998 3.5 0.5
13 3 1999 4.0 0.5
17 3 2000 7.0 3.0
4 4 1997 4.0 NA
9 4 1998 4.5 0.5
5 5 1997 5.0 NA
10 5 1998 5.5 0.5
14 5 1999 6.0 0.5
18 5 2000 9.0 3.0
19 6 2000 15.0 NA
Or if you're willing to use the data.table package you can be very succinct:
>require(data.table)
>DT <- data.table(df)
>setkey(DT,Stem)
>DT[, diff := c(NA, diff(Diameter)), by = Stem]
>df <- as.data.frame(DT)
Stem Year Diameter diff
1 1 1997 1.0 NA
2 1 1998 1.5 0.5
3 1 1999 2.0 0.5
4 1 2000 3.0 1.0
5 2 1997 2.0 NA
6 2 1998 2.5 0.5
7 2 1999 3.0 0.5
8 2 2000 5.0 2.0
9 3 1997 3.0 NA
10 3 1998 3.5 0.5
11 3 1999 4.0 0.5
12 3 2000 7.0 3.0
13 4 1997 4.0 NA
14 4 1998 4.5 0.5
15 5 1997 5.0 NA
16 5 1998 5.5 0.5
17 5 1999 6.0 0.5
18 5 2000 9.0 3.0
19 6 2000 15.0 NA
If you have a large dataset, data.table has the advantage of being extremely fast.
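A base-R ave() one-liner is also possible, provided the rows are already in chronological order within each Stem (they are in the example data); sketched here with the question's data:

```r
df <- data.frame(Stem = c(1:5, 1:5, 1, 2, 3, 5, 1, 2, 3, 5, 6),
                 Year = c(rep(1997, 5), rep(1998, 5), rep(1999, 4), rep(2000, 5)),
                 Diameter = c(1:5, seq(1.5, 5.5, 1), 2, 3, 4, 6, 3, 5, 7, 9, 15))
# ave() applies the function within each Stem and returns the results in the
# original row order, so no reordering or unlist() step is needed
df$diff <- ave(df$Diameter, df$Stem, FUN = function(x) c(NA, diff(x)))
head(df)
```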
I'd like to calculate monthly temperature anomalies on a time-series with several stations.
I call here "anomaly" the difference of a single value from a mean calculated on a period.
My data frame looks like this (let's call it "data"):
Station Year Month Temp
A 1950 1 15.6
A 1980 1 12.3
A 1990 2 11.4
A 1950 1 15.6
B 1970 1 12.3
B 1977 2 11.4
B 1977 4 18.6
B 1980 1 12.3
B 1990 11 7.4
First, I made a subset with the years comprised between 1980 and 1990:
data2 <- subset(data, Year >= 1980 & Year <= 1990)
Second, I used plyr to calculate monthly mean (let's call this "MeanBase") between 1980 and 1990 for each station:
data3 <- ddply(data2, .(Station, Month), summarise,
MeanBase = mean(Temp, na.rm=TRUE))
Now, I'd like to calculate, for each line of data, the difference between the corresponding MeanBase and the value of Temp... but I'm not sure I'm going about it the right way (I don't see how to use data3).
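One possible way to use data3 is to merge it back onto data by Station and Month and subtract; in this self-contained sketch, aggregate() stands in for the ddply() call:

```r
# Join the per-station monthly means back onto the full data and subtract
data <- read.table(text = "Station Year Month Temp
A 1950 1 15.6
A 1980 1 12.3
A 1990 2 11.4
A 1950 1 15.6
B 1970 1 12.3
B 1977 2 11.4
B 1977 4 18.6
B 1980 1 12.3
B 1990 11 7.4", header = TRUE)
data2 <- subset(data, Year >= 1980 & Year <= 1990)
data3 <- aggregate(Temp ~ Station + Month, data2, mean)  # stands in for ddply()
names(data3)[3] <- "MeanBase"
merged <- merge(data, data3, all.x = TRUE)  # joins on Station and Month
merged$Anomaly <- merged$Temp - merged$MeanBase
merged  # Station/Month pairs absent from 1980-1990 get NA; note merge() reorders rows
```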
You can use ave in base R to get this.
transform(data,
Demeaned=Temp - ave(replace(Temp, Year < 1980 | Year > 1990, NA),
Station, Month, FUN=function(t) mean(t, na.rm=TRUE)))
# Station Year Month Temp Demeaned
# 1 A 1950 1 15.6 3.3
# 2 A 1980 1 12.3 0.0
# 3 A 1990 2 11.4 0.0
# 4 A 1950 1 15.6 3.3
# 5 B 1970 1 12.3 0.0
# 6 B 1977 2 11.4 NaN
# 7 B 1977 4 18.6 NaN
# 8 B 1980 1 12.3 0.0
# 9 B 1990 11 7.4 0.0
The result column will have NaNs for Month-Station combinations that have no years in your specified range.
I am new in using R and I have a question for which I am trying to find the answer. I have a file organized as follows (it has thousands of rows but I just show a sample for simplicity):
YEAR Month day S1 T1 T2 R1
1965 3 2 11.7 20.6 11.1 18.8
1965 3 3 14.0 16.7 3.3 0.0
1965 3 4 -99.9 -99.9 -99.9 -99.9
1965 3 5 9.2 5.6 0.0 -99.9
1965 3 6 10.1 6.7 0.0 -99.9
1965 3 7 9.7 7.2 1.1 0.0
I would like to know, for each column (T1, T2, and R1), in which Year, Month, and Day the -99.9 values are located; e.g. from 1980/1/3 to 1980/1/27 there are X -99.9 values for T1, from 1990/2/10 to 1990/3/30 there are Y -99.9 values for T1... and so on. Same for T2 and R1.
How can I do this in R?
This is only one file, but I have almost 2000 files with the same problem (I know I need to loop over them); if I know how to do it for one file, I can then create the loop.
I really appreciate any help. Thank you very much in advance for reading and helping!!!
I took the liberty of renaming your last dataframe column to "R1".
lapply(c('T1', 'T2', 'R1'),
       function(x) { dfrm[ dfrm[[x]] == -99.9 ,  # rows to select
                           1:3 ] }               # columns to return
)
#-------------
[[1]]
YEAR Month day
3 1965 3 4
[[2]]
YEAR Month day
3 1965 3 4
[[3]]
YEAR Month day
3 1965 3 4
4 1965 3 5
5 1965 3 6
It wasn't clear whether you wanted the values or counts (and I don't think you can have both in the same report.) If you wanted to name the entries:
> misdates <- .Last.value
> names(misdates) <- c('T1', 'T2', 'R1')
If you wanted a count:
lapply(misdates, NROW)
$T1
[1] 1
$T2
[1] 1
$R1
[1] 3
(You might want to learn how to use NA values. Using numbers as missing values is not recommended data management.)
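On that last point, a sketch of the NA conversion, done right after reading the file (alternatively, pass na.strings = "-99.9" to read.table):

```r
# Convert the -99.9 sentinel to real NA values
dfrm <- read.table(text = "YEAR Month day S1 T1 T2 R1
1965 3 2 11.7 20.6 11.1 18.8
1965 3 3 14.0 16.7 3.3 0.0
1965 3 4 -99.9 -99.9 -99.9 -99.9
1965 3 5 9.2 5.6 0.0 -99.9", header = TRUE)
dfrm[dfrm == -99.9] <- NA
dfrm[is.na(dfrm$T1), 1:3]  # dates where T1 is missing, found with is.na()
```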
If I understand correctly, you want to obtain how many "-99.9"s there are per month AND per column.
Here's my code for S1 using plyr. You'll note that I expanded your example to add one more month of data.
library(plyr)
my.table <-read.table(text="YEAR Month day S1 T1 T2 R1
1965 3 2 11.7 20.6 11.1 18.8
1965 3 3 14.0 16.7 3.3 0.0
1965 3 4 -99.9 -99.9 -99.9 -99.9
1965 3 5 9.2 5.6 0.0 -99.9
1965 3 6 10.1 6.7 0.0 -99.9
1965 3 7 9.7 7.2 1.1 0.0
1966 1 7 -99.9 7.2 1.1 0.0
1966 1 8 -99.9 7.2 1.1 0.0
", header=TRUE, as.is=TRUE,sep = " ")
#Create a year/month column to summarise per month
my.table$yearmonth <- paste(my.table$YEAR, "/", my.table$Month, sep = "")
S1 <- count(my.table[my.table$S1 == -99.9, ], "yearmonth")
S1
yearmonth freq
1 1965/3 1
2 1966/1 2
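If plyr is unavailable, a base-R equivalent of the count() call is table():

```r
# Base-R tally of -99.9 occurrences in S1 per year/month
my.table <- read.table(text = "YEAR Month day S1 T1 T2 R1
1965 3 2 11.7 20.6 11.1 18.8
1965 3 3 14.0 16.7 3.3 0.0
1965 3 4 -99.9 -99.9 -99.9 -99.9
1965 3 5 9.2 5.6 0.0 -99.9
1965 3 6 10.1 6.7 0.0 -99.9
1965 3 7 9.7 7.2 1.1 0.0
1966 1 7 -99.9 7.2 1.1 0.0
1966 1 8 -99.9 7.2 1.1 0.0", header = TRUE)
my.table$yearmonth <- paste(my.table$YEAR, my.table$Month, sep = "/")
tab <- table(my.table$yearmonth[my.table$S1 == -99.9])
tab  # 1965/3: 1, 1966/1: 2
```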