sum only the negative values in a row from an xts-object - r

I have an xts-object and I want to sum only the negative values in the first row.
The code
sum(test[test<0])
gives me the error
Error in `[.xts`(test, test < 0) : 'i' or 'j' out of range
but
sum(test[1],na.rm=TRUE)
works, but then I have the sum of all the values, not just the negative ones:
[1] -0.9786889
Without giving an example of the data yet, does anybody know why this simple code doesn't work?
The dimensions of the xts-object are 216 x 39 (from dim(test)).

There is a chance that this is related to an open xts issue.
It appears that xts objects cannot be subset in this manner. Instead, you should use coredata(), an xts function that returns the "core data" as a plain matrix.
It's a little messy but:
sum(coredata(test)[1, ][coredata(test)[1, ] < 0])
should work. If you wanted to do this for all rows, you would need to use something like apply() or a for loop.
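For instance, here is a minimal sketch of the all-rows case (assuming test is the 216 x 39 xts object from the question):
## sum of the negative values in every row, ignoring NAs
row_neg_sums <- apply(coredata(test), 1, function(r) sum(r[r < 0], na.rm = TRUE))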

Here are some possibilities:
sum(pmin(test[1], 0), na.rm = TRUE)
sum(test[1] * (test[1] < 0), na.rm = TRUE)
sum(Filter(function(x) x < 0, coredata(test[1])))
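A quick reproducible check on made-up data (the xts object below is assumed, not from the question), confirming the three expressions agree:
library(xts)
set.seed(1)
test <- xts(matrix(rnorm(12), nrow = 3), order.by = Sys.Date() + 0:2)
sum(pmin(test[1], 0), na.rm = TRUE)
sum(test[1] * (test[1] < 0), na.rm = TRUE)
sum(Filter(function(x) x < 0, coredata(test[1])))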

Related

Looped Equation on different subset of data R

I am trying to set up an earning pattern on some data. I'm doing this by creating an 'Earned_Multiplier' variable which I can then apply to whatever other variable is needed later on. Where 'Earned_Duration' is >0 and <=30, the Earned_Multiplier should be equal to ((Earned_Duration/30)*0.347); where 'Earned_Duration' is >30 and <=60, the Earned_Multiplier should be equal to (0.347+((Earned_Duration/30)*0.16)); and so on.
I'm hoping the below should make sense given the above description. Unfortunately I am getting the error message "NAs are not allowed in subscripted assignments". I feel like this is likely because I need to be using a loop to do the calculation?
Could anyone help direct me as to how to build this loop and making sure it does the right calculation for each different subset?
Output_All$Earned_Multiplier <- 1
Output_All$Earned_Multiplier[Output_All$Earned_Duration == 0] <- 0
Output_All$Earned_Multiplier[(Output_All$Earned_Duration > 0) &
(Output_All$Earned_Duration <= 30)] <- 0+
((Output_All$Earned_Duration/30)*.347) # Month 1
Output_All$Earned_Multiplier[(Output_All$Earned_Duration > 30) &
(Output_All$Earned_Duration <= 60)] <- .347+(((Output_All$Earned_Duration-
30)/30)*.16) # Month 2
Output_All$Earned_Multiplier[(Output_All$Earned_Duration > 60) &
(Output_All$Earned_Duration <= 90)] <- .507+(((Output_All$Earned_Duration-
60)/30)*.085) # Month 3
It would be helpful to provide a dummy dataset so we could work on that. You probably have some NAs in your dataset causing that error.
In any case, using the dplyr library you could do an ifelse statement along with a mutate to create a new column with your calculation result:
library(dplyr)
Output_All <- Output_All %>%
  mutate(Earned_Multiplier = ifelse(Earned_Duration == 0, 0,
                             ifelse(Earned_Duration > 0 & Earned_Duration <= 30, (Earned_Duration/30)*0.347,
                             ifelse(Earned_Duration > 30 & Earned_Duration <= 60, 0.347 + ((Earned_Duration/30)*0.16),
                             NA))))  # supply a final "else" value here, or continue with more ifelse statements
Regarding the NAs:
If you do have NAs and they are causing you issues, depending on your preference, you can include this as part of your logical statements:
!is.na(Earned_Duration) # don't forget to add & if you add it as a condition
to make sure that NAs are disregarded.
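As a follow-up sketch (not from the original answer), dplyr::case_when reads more cleanly than nested ifelse calls and handles the NA and "else" cases explicitly; the formulas below are taken from the question's code and the column names are assumed to match:
library(dplyr)
Output_All <- Output_All %>%
  mutate(Earned_Multiplier = case_when(
    is.na(Earned_Duration)                       ~ NA_real_,
    Earned_Duration == 0                         ~ 0,
    Earned_Duration > 0  & Earned_Duration <= 30 ~ (Earned_Duration/30)*0.347,
    Earned_Duration > 30 & Earned_Duration <= 60 ~ 0.347 + ((Earned_Duration - 30)/30)*0.16,
    Earned_Duration > 60 & Earned_Duration <= 90 ~ 0.507 + ((Earned_Duration - 60)/30)*0.085,
    TRUE                                         ~ 1    # default, as in the question's first line
  ))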

Removing dataframe outliers in R with `boxplot.stats`

I'm relatively new at R, so please bear with me.
I'm using the Ames dataset (full description of dataset here; link to dataset download here).
I'm trying to create a subset data frame that will allow me to run a linear regression analysis, and I'm trying to remove the outliers using the boxplot.stats function. I created a frame that will include my samples using the following code:
regressionFrame <- data.frame(subset(ames_housing_data[,c('SalePrice','GrLivArea','LotArea')] , BldgType == '1Fam'))
My next objective was to remove the outliers, so I tried to subset using a which() function:
regressionFrame <- regressionFrame[which(regressionFrame$GrLivArea != boxplot.stats(regressionFrame$GrLivArea)$out),]
Unfortunately, that produced the
longer object length is not a multiple of shorter object length
error. Does anyone know a better way to approach this, ideally using the which() subsetting function? I'm assuming it would include some form of lapply(), but for the life of me I can't figure out how. (I figure I can always learn fancier methods later, but this is the one I'm going for right now since I already understand it.)
Nice use of boxplot.stats.
You cannot SAFELY test using != if boxplot.stats returns more than one outlier in $out. An analogy here is 1:5 != 1:3. You probably want !(1:5 %in% 1:3) instead.
regressionFrame <- subset(regressionFrame,
subset = !(GrLivArea %in% boxplot.stats(GrLivArea)$out))
What I mean by SAFELY, is that 1:5 != 1:3 gives a wrong result with a warning, but 1:6 != 1:3 gives a wrong result without warning. The warning is related to the recycling rule. In the latter case, 1:3 can be recycled to have the same length of 1:6 (that is, the length of 1:6 is a multiple of the length of 1:3), so you will be testing with 1:6 != c(1:3, 1:3).
A simple example.
x <- c(1:10/10, 101, 102, 103) ## has three outliers: 101, 102 and 103
out <- boxplot.stats(x)$out ## `boxplot.stats` has picked them out
x[x != out] ## this gives a warning and wrong result
x[!(x %in% out)] ## this removes them from x
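If you prefer to keep the which() style from the question, the same %in% idea works there too (a sketch, assuming the regressionFrame built above):
regressionFrame <- regressionFrame[which(!(regressionFrame$GrLivArea %in%
                                            boxplot.stats(regressionFrame$GrLivArea)$out)), ]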

Vectorized ifelse conundrum

I have two arrays "begin" and "end_a" which contain some integer indices, except that some of the entries in "end_a" are NA.
And panelDataset is a matrix which contains the data. I want to take the means of the rows of panelDataset corresponding to non-NA entries of begin and end_a.
I have this working in serial fashion and it works fine, but when I tried to vectorize it as follows
switch_mu=ifelse(!is.na(end_a),mean(panelDataset[begin: end_a,4]),NA)
It gives an error: Error in begin:end_a : NA/NaN argument.
When I check the entries of end_a separately for NAs using is.na(end_a), it does show the correct entries of the array as NA. So, that is not an issue.
I know I am missing something trivial. Any thoughts?
Try this:
means <- apply(na.omit(cbind(begin, end_a)), 1,
               function(x) mean(panelDataset[x[1]:x[2], 4]))
replace(end_a, !is.na(end_a), means)
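A small reproducible check on made-up data (the objects below are assumed stand-ins for the question's begin, end_a and panelDataset):
panelDataset <- matrix(1:40, ncol = 4)   # 10 rows; column 4 holds the values 31:40
begin <- c(1, 4, 7)
end_a <- c(3, NA, 9)
means <- apply(na.omit(cbind(begin, end_a)), 1,
               function(x) mean(panelDataset[x[1]:x[2], 4]))
replace(end_a, !is.na(end_a), means)     # 32 NA 38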

R returns list instead of filling in dataframe column

I am trying to use apply() to fill in an additional column in a dataframe and by calling a function I created with each row of the data frame.
The dataframe, called Hit.Data, has 2 columns, Zip.Code and Hits. Here are a few rows:
Zip.Code , Hits
97222 , 20
10100 , 35
87700 , 23
The apply code is the following:
Hit.Data$Zone = apply(Hit.Data, 1, function(x) lookupZone("89000", x["Zip.Code"]))
The lookupZone() function is the following:
lookupZone <- function(sourceZip, destZip){
  sourceKey <- substr(sourceZip, 1, 3)
  destKey <- substr(destZip, 1, 3)
  return(zipToZipZoneMap[[sourceKey]][[destKey]])
}
All the lookupZone() function does is take the 2 strings, truncates to the required characters and looks up the values. What happens when I run this code though is that R assigns a list to Hit.Data$Zone instead of filling in data row by row.
> typeof(Hit.Data$Zone)
[1] "list
What baffles me is that when I use apply and just tell it to put a number in it works correctly:
> Hit.Data$Zone = apply(Hit.Data, 1, function(x) 2)
> typeof(Hit.Data$Zone)
[1] "double"
I know R has a lot of strange behavior around dropping dimensions of matrices and doing odd things with lists but this looks like it should be pretty straightforward. What am I missing? I feel like there is something fundamental about R I am fighting, and so far it is winning.
Your problem is that you are occasionally looking up non-existing entries in your hashmap, which causes hash to silently return NULL. Consider:
> hash("890", hash("972"=3, "101"=3, "877"=3))[["890"]][["101"]]
[1] 3
> hash("890", hash("972"=3, "101"=3, "877"=3))[["890"]][["100"]]
NULL
If apply encounters any NULL values, then it can't coerce the result to a vector, so it will return a list. Same will happen with sapply.
You have to ensure that all possible combinations of the first three zip code digits in your data are present in your hash, or you need logic in your code to return NA instead of NULL for missing entries.
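A hedged sketch of that second option: wrap the lookup so missing keys yield NA instead of NULL (lookupZoneSafe is a hypothetical helper, not from the question; zipToZipZoneMap is the asker's existing map):
lookupZoneSafe <- function(sourceZip, destZip){
  sourceKey <- substr(sourceZip, 1, 3)
  destKey <- substr(destZip, 1, 3)
  val <- zipToZipZoneMap[[sourceKey]][[destKey]]
  if (is.null(val)) NA else val   # NA keeps apply/sapply able to return a vector
}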
As others have said, it's hard to diagnose without knowing what zipToZipZoneMap contains, but you could try this:
Hit.Data$Zone <- sapply(Hit.Data$Zip.Code, function(x) lookupZone("89000", x))

Hmisc - cut2 - create factors from times

I'm trying to use the cut2() function from the Hmisc package to create a factor based on time periods.
Here's some code:
library(Hmisc)
i.time <- as.POSIXct("2013-07-16 13:55:14 CEST")
f.time <- i.time+as.difftime(1, units="hours")
data.points <- seq(from=i.time, to=f.time, by="1 sec")
cut.points <- seq(from=i.time, to=f.time, by="60 sec")
intervals <- cut2(x=data.points, cuts=cut.points, minmax=TRUE)
I expected intervals to be created such that each point in data.points was placed in an interval of time.
But there are some NA values in the end:
> tail(intervals, 1)
[1] <NA>
60 Levels: [2013-07-16 13:55:14,2013-07-16 13:56:14) ... [2013-07-16 14:54:14,2013-07-16 14:55:14]
I was expecting that the option minmax=TRUE would make sure that the cuts included all the values in data.points.
Can anyone clarify what's going on here? How can I use the cut2 function to generate a factor that includes all the values in the data?
The reason I use cut2 in preference to cut is that its default for "right" is the way I expect it to work (left-closed intervals). Looking at the code I see that when 'cuts' is present in the argument list, the cut function is used with a shifted set of cuts that has the effect of making the intervals left-closed, and the code then relabels the factor to change the "("'s to "["'s, but it does not use include.lowest = TRUE. This has the effect of turning the last value into <NA>. Frankly, I see this as a bug.
After looking at this more closely I see that cut2's help page does not promise to handle either Date or date-time objects, so "bug" is too strong. It completely fails with Date objects, and it appears to be only an accident that it is almost correct with POSIXct objects. (This implementation is somewhat surprising to me in that I always assumed it was just using cut(..., right=FALSE, include.lowest=TRUE).)
You can alter the code and one idea I had was to extend the range back to the right end point in the original data by changing this line:
r <- range(x, na.rm = TRUE)
To this line:
r <- range(c(x,max(x)+min(diff(x.unique))/2), na.rm = TRUE)
It's not exactly the result I expected, since you get a new category at the right end because the penultimate interval is still open on the right (cut3 below is just a copy of cut2 with that line changed):
intervals <- cut3(x=data.points, cuts=cut.points, minmax=TRUE)
> tail(intervals, 1)
[1] 2013-07-16 14:55:14
61 Levels: [2013-07-16 13:55:14,2013-07-16 13:56:14) ...
> tail(intervals, 2)
[1] [2013-07-16 14:54:14,2013-07-16 14:55:14) 2013-07-16 14:55:14
61 Levels: [2013-07-16 13:55:14,2013-07-16 13:56:14) ...
A different idea gives a more satisfactory result. Change only this line:
y <- cut(x, k2)
To this:
y <- cut(x, k2, include.lowest=TRUE)
Giving the expected right and left closed interval and no NA:
tail(intervals, 2)
[1] [2013-07-16 14:54:14,2013-07-16 14:55:14] [2013-07-16 14:54:14,2013-07-16 14:55:14]
60 Levels: [2013-07-16 13:55:14,2013-07-16 13:56:14) ...
Note: include.lowest=TRUE with right=FALSE will actually behave as "include.highest". And I'm scratching my head about why I get the desired behavior in this case when I did not also need to do something with the 'right' parameter. I sent Frank Harrell a message, and he is willing to consider revisions to the code to handle other cases. I'm working on that.
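A quick numeric illustration of that include.lowest / right=FALSE interaction (ordinary numeric data, just to show the boundary handling):
cut(1:5, breaks = c(1, 3, 5), right = FALSE)                         # 5 falls outside -> NA
cut(1:5, breaks = c(1, 3, 5), right = FALSE, include.lowest = TRUE)  # 5 lands in [3,5]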
Why this is an issue: The labeling for cut.POSIXt and cut.Date differs from the labeling of cut.numeric (actually cut.default) results. The former two's labeling strategy is to just report the beginnings of the intervals, whereas the labeling from cut.numeric includes "[" and ")" and the ends of the intervals. Compare the output from these:
levels( cut(0+1:100, 3) )
levels( cut(Sys.time()+1:100, 3) )
levels( cut(Sys.Date()+1:100, 3) )
from ??cut2:
minmax :
if cuts is specified but min(x) < min(cuts) or max(x) > max(cuts),
augments cuts to include min and max x
Checking your arguments:
x=data.points
cuts=cut.points
r <- range(x, na.rm = TRUE)
(r[1] < min(cuts) | (r[2] > max(cuts)))
FALSE ## no need to augment the cuts with min and max
So here setting minmax doesn't change the result. But here is a result using cut with include.lowest=TRUE:
res <- cut(x=data.points, breaks=cut.points, include.lowest=TRUE)
table(is.na(res))
