Can anyone explain this R ts function?

myts <- ts(mydata[,2:4], start = c(1981, 1), frequency = 4)
The line of code above is clear to me except the part ts(mydata[,2:4]). I know that ts(mydata[,2]) tells R to read from column 2; however, what does 2:4 stand for?
The data look like this:
usnim
1 2002-01-01 4.08
2 2002-04-01 4.10
3 2002-07-01 4.06
4 2002-10-01 4.04
Could you provide an example of when to use 2:4?
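For reference, 2:4 is just the integer sequence 2, 3, 4, so mydata[, 2:4] selects columns 2 through 4 and ts() builds a multivariate (three-column) quarterly series from them. A minimal sketch with made-up column names (rate and spread are placeholders, not part of the original usnim data):

```r
# toy data frame: column 1 is a date, columns 2-4 are the series of interest
mydata <- data.frame(date   = seq(8),
                     usnim  = rnorm(8),
                     rate   = rnorm(8),   # hypothetical extra column
                     spread = rnorm(8))   # hypothetical extra column

# 2:4 expands to c(2, 3, 4), so three columns become three parallel series
myts <- ts(mydata[, 2:4], start = c(1981, 1), frequency = 4)
ncol(myts)  # 3: one time series per selected column
```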

Related

Data aggregation by week or by 3 days

Here is an example of my data:
Date Prec aggregated by week (output)
1/1/1950 3.11E+00 4.08E+00
1/2/1950 3.25E+00 9.64E+00
1/3/1950 4.81E+00 1.15E+01
1/4/1950 7.07E+00
1/5/1950 4.25E+00
1/6/1950 3.11E+00
1/7/1950 2.97E+00
1/8/1950 2.83E+00
1/9/1950 2.72E+00
1/10/1950 2.72E+00
1/11/1950 2.60E+00
1/12/1950 2.83E+00
1/13/1950 1.70E+01
1/14/1950 3.68E+01
1/15/1950 4.24E+01
1/16/1950 1.70E+01
1/17/1950 7.07E+00
1/18/1950 3.96E+00
1/19/1950 3.54E+00
1/20/1950 3.40E+00
1/21/1950 3.25E+00
I have a long time series of precipitation data and I want to aggregate it so that the output (shown in the third column; I calculated it in Excel) is as follows.
If I aggregate by week:
output in 1st cell = average prec from day 1 to day 7
output in 2nd cell = average prec from day 8 to day 14
output in 3rd cell = average prec from day 15 to day 21
If I aggregate by 3 days:
output in 1st cell = average prec from day 1 to day 3
output in 2nd cell = average prec from day 4 to day 6
I will provide the function with the "prec" series and the "time step" as inputs. I tried loops, lubridate, POSIXct, and some other functions, but I can't reproduce the output in the third column.
One piece of code I came up with ran without error, but the output is not correct.
Where dat is my data set.
library(zoo)
library(xts)  # apply.weekly() comes from xts
tt <- as.POSIXct(paste(dat$Date), format = "%m/%d/%Y")  # convert the date format
datZoo <- zoo(dat[, -c(1, 3)], tt)
weekly <- apply.weekly(datZoo, mean)
prec_NLCD <- data.frame(weekly)
I would also like to write it as a function. Your suggestions would be helpful.
Assuming the data shown reproducibly in the Note at the end, create the weekly means, zm, and then merge them with z.
(It would seem to make more sense to merge the means at the point that they are calculated, i.e. merge(z, zm) in place of the line marked ##, but for consistency with the output shown in the question they are put at the head of the data below.)
library(zoo)
z <- read.zoo(text = Lines, header = TRUE, format = "%m/%d/%Y")
zm <- rollapplyr(z, 7, by = 7, mean)
merge(z, zm = zoo(coredata(zm), head(time(z), length(zm)))) ##
giving:
z zm
1950-01-01 3.11 4.081429
1950-01-02 3.25 9.642857
1950-01-03 4.81 11.517143
1950-01-04 7.07 NA
1950-01-05 4.25 NA
1950-01-06 3.11 NA
1950-01-07 2.97 NA
1950-01-08 2.83 NA
1950-01-09 2.72 NA
1950-01-10 2.72 NA
1950-01-11 2.60 NA
1950-01-12 2.83 NA
1950-01-13 17.00 NA
1950-01-14 36.80 NA
1950-01-15 42.40 NA
1950-01-16 17.00 NA
1950-01-17 7.07 NA
1950-01-18 3.96 NA
1950-01-19 3.54 NA
1950-01-20 3.40 NA
1950-01-21 3.25 NA
Note:
Lines <- "Date Prec
1/1/1950 3.11E+00
1/2/1950 3.25E+00
1/3/1950 4.81E+00
1/4/1950 7.07E+00
1/5/1950 4.25E+00
1/6/1950 3.11E+00
1/7/1950 2.97E+00
1/8/1950 2.83E+00
1/9/1950 2.72E+00
1/10/1950 2.72E+00
1/11/1950 2.60E+00
1/12/1950 2.83E+00
1/13/1950 1.70E+01
1/14/1950 3.68E+01
1/15/1950 4.24E+01
1/16/1950 1.70E+01
1/17/1950 7.07E+00
1/18/1950 3.96E+00
1/19/1950 3.54E+00
1/20/1950 3.40E+00
1/21/1950 3.25E+00"
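The rollapplyr call above can also be wrapped into the function form the asker wanted, taking the precipitation series and the time step as inputs (agg_by_step is a hypothetical name; this is a sketch, not tested against the full data set):

```r
library(zoo)

# Aggregate a series into non-overlapping means of 'step' observations:
# step = 7 gives weekly means, step = 3 gives 3-day means.
agg_by_step <- function(prec, step) {
  rollapplyr(prec, step, by = step, mean)
}

agg_by_step(1:6, 3)  # mean of days 1-3 and of days 4-6: 2 5
```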

R algorithm for deleting rows with artefacts

I'm analyzing DEM data of rivers with R and need assistance with the data processing. The DEM data include many artifacts, where the river longitudinal profile goes slightly uphill, which is in fact nonsense. So I would like to have an algorithm to delete all rows from the data set where the Z value (elevation) is higher than the predecessor. To explain it better, just look at the following data rows:
data.frame
ID Z
1 105.2
2 105.4
3 105.3
4 105.1
5 105.1
6 105.2
7 104.9
I would like to delete rows 2, 3 and 6 from the list. I wrote the following code but it doesn't work:
i <- data.frame[1, 2]
for (n in data.frame[, 2]) { if (n - i > 0) data.frame[i, 2] = 0 else i <- n }
I would very much appreciate it if anybody could help.
Apparently, you want to recursively remove values until there are no increasing values. It would be easiest to simply use a while loop:
DF <- read.table(text = "ID Z
1 105.2
2 105.4
3 105.3
4 105.1
5 105.1
6 105.2
7 104.9", header = TRUE)
while(any(diff(DF$Z) > 0)) DF <- DF[c(TRUE, diff(DF$Z) <= 0),]
# ID Z
#1 1 105.2
#4 4 105.1
#5 5 105.1
#7 7 104.9
Test yourself whether this is sufficiently efficient.
I would also like to comment on your whole idea of data cleaning here. I find it very dubious. How do you know that there is no error in the values that don't increase? You might remove perfectly valid values because there is an error in a (strongly) decreasing value.
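If the repeated passes are not essential, the same fixed point can be reached in a single pass: a row survives the while loop exactly when its Z is no larger than every earlier Z, i.e. when it attains the running minimum. A sketch worth verifying against your real data:

```r
DF <- read.table(text = "ID Z
1 105.2
2 105.4
3 105.3
4 105.1
5 105.1
6 105.2
7 104.9", header = TRUE)

# keep only rows whose elevation equals the running minimum so far
DF2 <- DF[DF$Z <= cummin(DF$Z), ]
DF2$ID  # 1 4 5 7, matching the while-loop result
```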

Is there a way to multiply two columns in R so that, e.g., 17.5 x 4 gives me a list of 4 rows: 17.5, 17.5, 17.5, 17.5?

I need to multiply two columns so that the result, columnC, is a list of the columnA value repeated columnB times (sorry if that is confusing; I don't know how else to say it). So columnA (17.5) * columnB (4) gives columnC (17.5, 17.5, 17.5, 17.5).
Is this possible? I need to make a histogram in R, but the data is entered in the A B format (i.e. there were 4 individuals at 17.5, 2 at 16.8, 5 at 15.9, etc.) and I cannot get the plotting to work this way, so I thought that if I changed it to just a list of values it would work. It is a very large data set and doing this manually is prohibitive. Is there a better way to do this? I am new to R, so any help is greatly appreciated.
As far as I can tell the following code will do what you want it to:
dat <- data.frame(x = c(17.5, 16.8, 15.9), y = c(4, 2, 5))
newDat <- data.frame(x = rep(dat$x, dat$y), y = rep(1, sum(dat$y)))
if (!require("ggplot2")) {  # install ggplot2 if it is not already available
  install.packages("ggplot2", repos = "http://ftp.heanet.ie/mirrors/cran.r-project.org/", dependencies = TRUE)
  library("ggplot2")
}
ggplot(newDat, aes(x = x, y = y, fill = factor(x))) + geom_bar(stat = "identity")
Depending on the size of your data this might not make sense, and you might want to do something other than appending a column of 1s to your data frame, but for this toy example it works fine.
Suppose your data frame (called df) is as follows:
A B
1 17.5 5
2 16.8 8
One way to expand (i.e. replicate) is
df <- df[rep(rownames(df), df$B),]
# A B
#1 17.5 5
#1.1 17.5 5
#1.2 17.5 5
#1.3 17.5 5
#1.4 17.5 5
#2 16.8 8
#2.1 16.8 8
#2.2 16.8 8
#2.3 16.8 8
#2.4 16.8 8
#2.5 16.8 8
#2.6 16.8 8
#2.7 16.8 8
If you want to 'tidy' your rownames you can just do,
rownames(df) <- NULL
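If the only goal is a histogram, the expansion step alone is enough: rep() repeats each value by its count, and the resulting vector can go straight into hist(). A sketch using the toy data from above:

```r
df <- data.frame(A = c(17.5, 16.8, 15.9), B = c(4, 2, 5))

# repeat each A value B times: 17.5 x4, 16.8 x2, 15.9 x5
values <- rep(df$A, df$B)
length(values)  # 11
hist(values)    # base-graphics histogram of the expanded values
```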

Populate a column with forecasts of panel data using data.table in R

I have panel data with "entity" and "year". I have a column "x" with values that I treat as a time series. I want to create a new column "xp" where, for each "entity" and each "year", the value is the forecast obtained from the previous 5 years. If there are fewer than 5 previous values available, xp = NA.
For the sake of generality, the forecast is the output of a function built in R from a couple of predefined functions found in packages like "forecast". If it is easier with a specific function, let's use forecast(auto.arima(x.L5:x.L1), h = 1).
For now, I use data.table in R because it is so much faster for all the other manipulations I make on my dataset.
However, what I want to do is not data.table 101 and I struggle with it.
I would so much appreciate a bit of your time to help me on that.
Thanks.
Here is an extract of what I would like to do:
entity year x xp
1 1980 21 NA
1 1981 23 NA
1 1982 32 NA
1 1983 36 NA
1 1984 38 NA
1 1985 45 42.3 =f((21,23,32,36,38))
1 1986 50 48.6 =f((23,32,36,38,45))
2 1991 2 NA
2 1992 4 NA
2 1993 6 NA
2 1994 8 NA
2 1995 10 NA
2 1996 12 12.4 =f((2,4,6,8,10))
2 1997 14 13.9 =f((4,6,8,10,12))
...
As suggested by Eddi, I found a way using rollapply:
DT <- data.table(mydata)
DT <- DT[order(entity, year)]
DT[, xp := rollapply(.SD$x, 5, timeseries, align = "right", fill = NA), by = entity]
with:
timeseries <- function(x) {
  fit <- auto.arima(x)  # from the forecast package
  value <- as.data.frame(forecast(fit, h = 1))[1, 1]
  return(value)
}
For a sample of mydata, it works perfectly. However, when I use the whole dataset (150k lines), after some computing time, I get the following error message:
Error in seq.default(start.at,NROW(data),by = by) : wrong sign in 'by' argument
Where does it come from?
Could it come from the "5" parameter in rollapply and from particular entities in the dataset (not enough data...)?
Thanks again for your time and help.
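One guess (unverified): entities with fewer than 5 rows make rollapply's internal seq() run backwards, which would produce exactly that "wrong sign in 'by' argument" error. A guard that returns NA for short groups, sketched with mean standing in for the asker's timeseries() forecast function (safe_roll is a hypothetical name):

```r
library(data.table)
library(zoo)

# return all-NA when a group is shorter than the rolling window
safe_roll <- function(x, k, f) {
  if (length(x) < k) return(rep(NA_real_, length(x)))
  rollapplyr(x, k, f, fill = NA)
}

DT <- data.table(entity = c(rep(1, 6), 2, 2),
                 year   = c(1980:1985, 1991, 1992),
                 x      = c(21, 23, 32, 36, 38, 45, 2, 4))
# mean stands in for timeseries(); entity 2 has only 2 rows, so it stays NA
DT[, xp := safe_roll(x, 5, mean), by = entity]
```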

R merge with itself

Can I merge data like
name,#797,"Stachy, Poland"
at_rank,#797,1
to_center,#797,4.70
predicted,#797,4.70
according to the second column, taking the first column as the column names?
name at_rank to_center predicted
#797 "Stachy, Poland" 1 4.70 4.70
Upon request, the whole set of data: http://sprunge.us/cYSJ
The first problem, of reading the data in, should not be a problem if your strings with commas are quoted (which they seem to be). Using read.csv with the header=FALSE argument does the trick with the data you shared. (Of course, if the data file had headers, delete that argument.)
From there, you have several options. Here are two.
reshape (base R) works fine for this:
myDF <- read.csv("http://sprunge.us/cYSJ", header=FALSE)
myDF2 <- reshape(myDF, direction="wide", idvar="V2", timevar="V1")
head(myDF2)
# V2 V3.name V3.at_rank V3.to_center V3.predicted
# 1 #1 Kitoman 1 2.41 2.41
# 5 #2 Hosaena 2 4.23 9.25
# 9 #3 Vinzelles, Puy-de-Dôme 1 5.20 5.20
# 13 #4 Whitelee Wind Farm 6 3.29 8.07
# 17 #5 Steveville, Alberta 1 9.59 9.59
# 21 #6 Rocher, Ardèche 1 0.13 0.13
The reshape2 package is also useful in these cases. It has simpler syntax and the output is also a little "cleaner" (at least in terms of variable names).
library(reshape2)
myDFw_2 <- dcast(myDF, V2 ~ V1)
# Using V3 as value column: use value.var to override.
head(myDFw_2)
# V2 at_rank name predicted to_center
# 1 #1 1 Kitoman 2.41 2.41
# 2 #10 4 Icaraí de Minas 6.07 8.19
# 3 #100 2 Scranton High School (Pennsylvania) 5.78 7.63
# 4 #1000 1 Bat & Ball Inn, Clanfield 2.17 2.17
# 5 #10000 3 Tăuteu 1.87 5.87
# 6 #10001 1 Oak Grove, Northumberland County, Virginia 5.84 5.84
Look at the reshape package from Hadley. If I understand correctly, you are just pivoting your data from long to wide.
I think in this case all you really need to do is transpose, cast to a data.frame, set the colnames to the first row, and then remove the first row. It might be possible to skip the last step through some combination of arguments to data.frame, but I don't know what they are right now.
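The transpose idea in the last answer, sketched for a single ID (assuming only the four rows shown in the question, not the full linked file):

```r
# long-format input: variable name, ID, value
myDF <- data.frame(V1 = c("name", "at_rank", "to_center", "predicted"),
                   V2 = "#797",
                   V3 = c("Stachy, Poland", "1", "4.70", "4.70"),
                   stringsAsFactors = FALSE)

wide <- as.data.frame(t(myDF$V3), stringsAsFactors = FALSE)  # transpose values
colnames(wide) <- myDF$V1                                    # first column -> names
wide$name  # "Stachy, Poland"
```

Note that everything comes out as character; numeric columns would still need converting with as.numeric().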
