I have a data frame with climatic values such as temperature_max, temperature_min... for different locations. The data form a time series, but there are some specific days on which no data were recorded. I would like to impute the missing values taking into account both the date and the location (the PLACE variable in the data frame).
I have tried to impute those missing values with Amelia, but no imputation is done and a warning is shown.
Checking variables:
head(df): PLACE, DATE, TEMP_MAX, TEMP_MIN, TEMP_AVG
PLACE       DATE  TEMP_MAX  TEMP_MIN  TEMP_AVG
    F 12/01/2007      19.7       2.5      10.1
    F 13/01/2007      18.8       3.5      10.4
    F 14/01/2007      17.3       2.4      10.4
    F 15/01/2007      19.5       4.0       9.2
    F 16/01/2007        NA        NA        NA
    F 17/01/2007      21.5       2.8       9.7
    F 18/01/2007      17.7       3.3      12.9
    F 19/01/2007      18.3       3.8       9.7
    A 16/01/2007      17.7       3.4       9.7
    A 17/01/2007        NA        NA        NA
    A 18/01/2007      19.7       6.2      10.4
    A 19/01/2007      17.7       3.8      10.1
    A 20/01/2007      18.6       3.8      12.9
This is just some of the records of my data set.
DF <- amelia(df, m = 4, ts = "DATE", cs = "PLACE")
where DATE is the time-series variable (01/01/2001, 02/01/2001, 03/01/2001...), but if you filter by PLACE the time series are not identical (they do not have the same start and end dates).
I have 3 questions:
I am not sure whether the time series should be complete for all places, i.e. the same start and end date for every place.
I am not using the lags or polytime parameters, so am I imputing correctly with respect to the time-series structure? I am not sure how to use the lags parameter, although I have checked the package documentation.
The last question: when I run that code, a warning is issued and no imputation is done.
Warning: There are observations in the data that are completely missing.
These observations will remain unimputed in the final datasets.
-- Imputation 1 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 2 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 3 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 4 --
No missing data in bootstrapped sample: EM chain unnecessary
Can someone help me with this?
Thanks very much for your time!
For the software it does not matter whether different places have different start and end dates; that is more a question for you and your understanding of the data. I would ask myself whether those gaps are really missing data (missing at random); if so, I would create the corresponding empty rows in the data set, and otherwise I would not.
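If you do decide to create those empty rows, one simple sketch is to build the full PLACE x DATE grid and merge the observations back onto it (this assumes DATE has already been converted with as.Date, and it uses the same global date range for every place, which may not be what you want if the places genuinely start and end at different times):
full_grid <- expand.grid(PLACE = unique(df$PLACE),
                         DATE  = seq(min(df$DATE), max(df$DATE), by = "day"))
df_complete <- merge(full_grid, df, by = c("PLACE", "DATE"), all.x = TRUE)
# days with no recorded temperatures now appear explicitly, filled with NA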
You want to use lags so that past values of a variable help predict its missing values. It is not mandatory (the function can impute missing data without such a specification), but it can be useful.
I contacted the author of the package and he told me that you need to specify the splinetime or polytime argument to make sure that Amelia uses the time-series information to impute. For instance, if you set polytime = 3, it will impute based on a cubic polynomial in time. If you do that, I think you shouldn't see that warning anymore.
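A sketch of what the call might look like with the time-series arguments added (the polytime value and the choice of lagged variables are illustrative rather than taken from the original post, and DATE is converted to a numeric day index because Amelia expects a numeric time variable):
library(Amelia)
df$DATE   <- as.Date(df$DATE, format = "%d/%m/%Y")
df$DAYNUM <- as.numeric(df$DATE - min(df$DATE))    # numeric time index
a.out <- amelia(df[, c("PLACE", "DAYNUM", "TEMP_MAX", "TEMP_MIN", "TEMP_AVG")],
                m = 4,
                ts = "DAYNUM", cs = "PLACE",
                polytime = 3,                                 # cubic trend in time
                lags = c("TEMP_MAX", "TEMP_MIN", "TEMP_AVG")) # previous day's values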
I have a small dataset containing some date/time information, e.g.:
type    start       end         price  time
rental  Location A  Location B  0      23:50:00
rental  Location A  Location B  0      18:32:00
rental  Location A  Location B  0      10:10:00
rental  Location A  Location B  0      09:54:00
rental  Location A  Location B  0      20:48:00
I want to write a set of if/or statements in R to create a new column (price) that gives a price for on-peak and off-peak times. This is a data set of bike rental times, and I want to compare it to the cost of peak-time public transit travel.
So, there are two values possible in the column price: $2.9 and $2.4.
Peak times are between 6:30-9:30 and 16:30-19:30.
There has to be a better way to do this, but for now I wrote the following set of conditions:
First, I used as.POSIXlt so I could use $hour and $min to split out the hours and minutes from the data individually.
The data frame I start from is called data.
time2 <- strptime(data$time, "%H:%M:%OS")                  # parse the time strings
posixlt <- as.POSIXlt(time2, format = "%d-%m-%Y %H:%M:%S")
names(unclass(posixlt))                                     # list the components ($hour, $min, ...)
peak <- posixlt
From the new peak object containing the times, I want to generate a new data.frame that contains the original data plus separate columns for the hours and minutes.
df <- cbind(data, peak$hour, peak$min)
as.numeric(peak$hour)   # $hour and $min are already integer; these lines just print them
as.numeric(peak$min)
Now I set my conditions to account for the different time possibilities and the respective prices in the df$price column.
df$price[peak$hour <6] <- 2.4
df$price[((peak$hour >= 6) & (peak$hour <=9))] <- 2.9
df$price[peak$hour==9 & peak$min >=30] <- 2.4
df$price[peak$hour>9 & peak$hour <=16] <- 2.4
df$price[peak$hour==16 & peak$min >=30] <- 2.9
df$price[peak$hour>16 & peak$hour<19] <- 2.9
df$price[peak$hour>19] <- 2.4
df$price[peak$hour==19 & peak$min <=30] <- 2.9
df$price[peak$hour==19 & peak$min >=30] <- 2.4
It worked, but is there a more efficient way of doing this that I am overlooking?
Perhaps I didn't need to do all the prep work with the time column, and I'm curious to see what other options there are for future reference.
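For reference, here is a more compact sketch of the same rule (not from the original post; it assumes data$time holds "HH:MM:SS" strings, converts them to minutes since midnight, and tests the two peak windows in a single vectorised expression):
tt <- as.character(data$time)                                        # "HH:MM:SS" strings
mins <- as.integer(substr(tt, 1, 2)) * 60 + as.integer(substr(tt, 4, 5))
is_peak <- (mins >= 6*60 + 30 & mins <= 9*60 + 30) |
           (mins >= 16*60 + 30 & mins <= 19*60 + 30)
data$price <- ifelse(is_peak, 2.9, 2.4)                              # peak fare 2.9, off-peak 2.4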
I'm quite a newbie in R, so I'm interested in whether my solution is optimal. Even though it works, it is (a bit) long, and I would like your advice on whether the way I solved it is the best, since that would help me learn new techniques and functions in R.
I have a dataset of students identified by their id, together with the school they are matched to and the score they obtained on a specific test (so, in short, 3 variables: id, match and score).
I need to construct the following table: for students whose score lies between two given percentiles, I need the average (across those students) of the average score of the school they are matched to. In other words, for each school I take the average score of the students matched to it, and then I average those school averages within each percentile class (yes, a school's average can appear more than once in this calculation). In plain English it answers: "A student belonging to the x-th percentile in terms of score will on average be matched to a school of this average quality".
Here is an example:
So in that case, if I take the median (15) for the split (rather than percentiles) I would like to obtain:
[0,15] : 9.5
(15,24] : 20.25
So for students with a score between 0 and 15 I take the average of the average score of the schools they are matched to (note that school b's average appears twice, but that's fine).
Here is how I did it:
match <- c("a","b","a","b","c")
score <- c(18,4,15,8,24)
scoreQuant <- cut(score,quantile(score,probs=seq(0,1,0.1),na.rm=TRUE))
AvgeSchScore <- tapply(score,match,mean,na.rm=TRUE)
AvgScore <- 0
for(i in 1:length(score)) {
AvgScore[i] <- AvgeSchScore[match[i]]
}
results <- tapply(AvgScore,scoreQuant,mean,na.rm = TRUE)
Do you have a more direct way of doing it? I think the weak point is the loop in step 3; maybe apply() would be better? But I'm not sure how to use it here (I tried to write my own function, but it crashed, so I brute-forced it).
Thanks :)
The main fix is to eliminate the for loop with:
AvgScore <- AvgeSchScore[match]
R allows you to subset in ways that you cannot in many other languages. tapply returns a vector named by the levels of the factor you grouped by; indexing it with match looks up those names and returns one school average per student, which is what replaces the loop.
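Putting it together, a minimal sketch of the whole base-R pipeline, using a median split (probs = seq(0, 1, 0.5)) so the numbers match the [0,15] / (15,24] example from the question (the first interval is labelled [4,15] because the breaks come from the observed scores):
match <- c("a", "b", "a", "b", "c")
score <- c(18, 4, 15, 8, 24)
AvgeSchScore <- tapply(score, match, mean, na.rm = TRUE)   # average score per school
AvgScore <- AvgeSchScore[match]                            # one school average per student
scoreQuant <- cut(score, quantile(score, probs = seq(0, 1, 0.5), na.rm = TRUE),
                  include.lowest = TRUE)
tapply(AvgScore, scoreQuant, mean, na.rm = TRUE)
#  [4,15] (15,24]
#    9.50   20.25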
data.table
If you would like to try data.table you may see speed improvements.
library(data.table)
match <- c("a","b","a","b","c")
score <- c(18,4,15,8,24)
dt <- data.table(id=1:5, match, score)
scoreQuant <- cut(dt$score,quantile(dt$score,probs=seq(0,1,0.1),na.rm=TRUE))
dt[, AvgeScore := mean(score), match][, mean(AvgeScore), scoreQuant]
# scoreQuant V1
#1: (17.4,19.2] 16.5
#2: NA 6.0
#3: (12.2,15] 16.5
#4: (7.2,9.4] 6.0
#5: (21.6,24] 24.0
It may be faster than base R. If the value in the NA row bothers you, you can delete it after.
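Alternatively (a small tweak, not part of the original answer), you can avoid the NA group altogether: it appears because cut() excludes the lowest score from the first interval by default, so passing include.lowest = TRUE keeps every student inside a quantile class:
scoreQuant <- cut(dt$score,
                  quantile(dt$score, probs = seq(0, 1, 0.1), na.rm = TRUE),
                  include.lowest = TRUE)
dt[, AvgeScore := mean(score), match][, mean(AvgeScore), scoreQuant]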
I have an example data set that looks like this:
Ho<-c(12,12,12,24,12,11,12,12,14,12,11,13,25,25,12,11,13,12,11,11,12,14,12,2,2,2,11,12,13,14,12,11,12,3,2,2,2,3,2,2,1,14,12,11,13,11,12,13,12,11,12,12,12,2,2,2,12,12,12,12,15)
This data set has both positive and negative spikes in it that I would like to use as markers for calculating means within the data. I would define the start of a spike as any number that is 40% greater or less than the number preceding it. A spike ends when the value jumps back by more than 40%. So ideally I would like to locate each spike in the data set and take the mean of the 5 data points immediately following the last number of the spike.
As can be seen, a spike can last for up to 5 data points. The rule for averaging I would like to follow is:
Start averaging after the last recorded spike data point, not after the first spike data point. So if a spike lasts for three data points, begin averaging after the third spiked data point.
So the ideal output would look something like this:
1= 12.2
2= 11.8
3= 12.4
4= 12.2
5= 12.6
The first spike is Ho[4], followed by the 5 numbers (12,11,12,12,14), for a mean of 12.2.
The next spike in the data is at data points Ho[13:14] (25,25), followed by the set of 5 numbers (12,11,13,12,11), for an average of 11.8.
And so on for the rest of the sequence.
It kind of seems like you're actually defining a spike to mean differing from the "medium" values in the dataset, as opposed to differing from the previous value. I've operationalized this by defining a spike as being any data more than 40% above or below the median value (which is 12 for the sample data posted). Then you can use the nifty rle function to get at your averages:
r <- rle(Ho >= median(Ho)*0.6 & Ho <= median(Ho)*1.4)   # TRUE = within 40% of the median
keep <- r$values & seq_along(r$values) > 1              # non-spike runs that directly follow a spike
run.begin <- cumsum(r$lengths)[keep] - r$lengths[keep] + 1
run.end <- run.begin + pmin(4, r$lengths[keep] - 1)     # at most the 5 points after the spike
apply(cbind(run.begin, run.end), 1, function(x) mean(Ho[x[1]:x[2]]))
# [1] 12.2 11.8 12.4 12.2 12.6
So here is some code that seems to get the same result as yours.
#Data
Ho<-c(12,12,12,24,12,11,12,12,14,12,11,13,25,25,12,11,13,12,11,11,12,14,12,2,2,2,11,12,13,14,12,11,12,3,2,2,2,3,2,2,1,14,12,11,13,11,12,13,12,11,12,12,12,2,2,2,12,12,12,12,15)
#plot(seq_along(Ho), Ho)
#find changes
diffs <- tail(Ho,-1)/head(Ho,-1)              # ratio of each value to the one before it
idxs <- which(diffs>1.4 | diffs<.6)+1         # positions where the series jumps by more than 40%
starts <- idxs[seq(2, length(idxs), by=2)]    # assumes the jumps alternate: spike start, spike end, ...
ends <- pmin(starts+4, length(Ho))            # up to 5 points after each detected spike end
#find means
mapply(function(a,b) mean(Ho[a:b]), starts, ends)
I am using the TTR package to generate stock indicators. However, the indicator functions add NA (where applicable -- e.g. CMO, SMA, CMF, etc.) to the beginning of the series instead of the end. Is there a way to align the output to the left so the NA values are added to the end of the series as opposed to the beginning?
For example:
library(TTR)
x = 1:10
# TTR's simple moving average
SMA(x,n=2)
[1] NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
The zoo package has an align option to pad the series with NAs at the end:
library(zoo)
rollmean(x,2,na.pad=TRUE,align='left')
[1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 NA
Is there a way to specify something like this in TTR, as I need to generate indicators beyond moving averages? I guess I could write a wrapper around these functions and manually shift the resulting values, but I am not sure if there is a better way to do it.
Also, since TTR is heavily used to add indicators to stock prices, I am wondering why the padding is at the beginning rather than the end, especially since most historical prices are sorted in descending order (by date). In the above example, if x[1] is the price of a stock today and x[10] the price 10 days ago, shouldn't the moving average (span = 2) for today be the average of today + yesterday? As much as I would like to add NAs at the end, I would also like to make sure I am not misinterpreting how these indicators are used.
Thanks,
-e
I couldn't find an option in the function call to shift the series in the other direction. However, I now understand why TTR shifts the series downwards: historical stock prices obtained via quantmod's getSymbols() come back sorted by date in ascending order, whereas the quotes I downloaded manually from Yahoo! (or via ystockquote.py) are in descending order. I just re-sorted my data by date and used the TTR library as-is.
There were certain vectors that I wanted shifted up (padded with NAs at the end), and I just used this code:
miss_len <- length(x[is.na(x)])       # number of leading NAs produced by the indicator
x <- x[!is.na(x)]                     # drop them
length(x) <- length(x) + miss_len     # extending the vector pads NAs at the end
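The same trick can be wrapped into a small helper for reuse (a sketch; the function name shift_na_to_end is my own, not part of TTR):
shift_na_to_end <- function(v) {
  miss_len <- sum(is.na(v))            # leading NAs produced by the indicator
  c(v[!is.na(v)], rep(NA, miss_len))   # move them to the end
}
library(TTR)
shift_na_to_end(SMA(1:10, n = 2))
# [1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5  NA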