Conditional Logic and Looping - r

I am trying to create a conditional loop to create a new variable called BigSales which should be given a value of 'yes' if either the date occurred before 2012 or the total gross for the day exceeded $65 million. Otherwise, it should be given a value of 'no'.
I tried:
for(i in 1:45){
if(movies$Gross[i] > 65 | movies$Date[i] < 2012-01-01){
movies$BigSales[i] <- "yes"}
else (
movies$BigSales[i] <- "no"
)
}
But I got the error message:
Error in if (movies$Gross[i] > 65 | movies$Date[i] < 2012 - 1 - 1) { :
missing value where TRUE/FALSE needed
In addition to that, the data set contains 100 observations, but it is only reading 45. How can I solve this?

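It's possible to add a conditional column in this manner, but there are tools that make it easier and more readable.
library(dplyr)
# "either ... or" needs the vectorised | operator; && would only test a single value
movies <- mutate(movies, BigSales = ifelse(Gross > 65 | Date < as.Date("2012-01-01"), "yes", "no"))
You should also be careful when working with dates: call str(movies$Date) to make sure it is of class "Date", and if it is not, convert it with as.Date.
To answer your question as you asked it: you didn't put quotes around the date, so R evaluated it as the arithmetic expression 2012 - 1 - 1 (that is, 2010), as the error message shows. The "missing value" error itself means the condition came out as NA for some row. If you'd prefer to keep the loop you have, compare against as.Date("2012-01-01") instead. The loop only reads 45 rows because its range is hard-coded as 1:45; 1:nrow(movies) would cover all 100 observations.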
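For anyone who wants to keep the loop, here is a minimal sketch of a corrected version. The three-row movies data frame is made up for illustration (the real data is not shown in the question), and it assumes Date is already of class Date.

```r
# Hypothetical stand-in for the real movies data frame.
movies <- data.frame(
  Gross = c(70, 30, 50),
  Date  = as.Date(c("2013-05-01", "2011-07-15", "2014-02-20"))
)

# Unquoted, the "date" is plain arithmetic: 2012 minus 1 minus 1.
unquoted <- 2012-01-01

cutoff <- as.Date("2012-01-01")
movies$BigSales <- NA_character_

# seq_len(nrow(movies)) covers every row instead of a hard-coded 1:45.
for (i in seq_len(nrow(movies))) {
  if (movies$Gross[i] > 65 || movies$Date[i] < cutoff) {
    movies$BigSales[i] <- "yes"
  } else {
    movies$BigSales[i] <- "no"
  }
}
```

Inside if, || is fine because each comparison is a single value; the vectorised | is what you need once you drop the loop.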

ifelse is vectorised, meaning it evaluates the condition for each element of the input vector and returns a vector of the same length.
Another point: since the OP wants any date before 2012 to count as BigSales "yes", checking only the year of movies$Date will do the trick.
In base R, a solution could be:
movies$BigSales <- ifelse(movies$Gross > 65 | as.numeric(format(movies$Date,"%Y")) < 2012,
"yes", "no")
Note: movies$Date is expected to be of class Date or POSIXct.
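As a small illustration of the vectorised behaviour, the sketch below runs the same test on made-up vectors (gross and dates are hypothetical names and values, not the OP's data). It also shows that a missing value yields NA in the result rather than stopping with the "missing value where TRUE/FALSE needed" error that if produces.

```r
# Hypothetical values; the third gross figure is deliberately missing.
gross <- c(70, 30, NA, 50)
dates <- as.Date(c("2013-05-01", "2011-07-15", "2014-02-20", "2014-03-01"))

# Extract the year as a number, then test all rows in one call.
year <- as.numeric(format(dates, "%Y"))
big_sales <- ifelse(gross > 65 | year < 2012, "yes", "no")
# big_sales is "yes" "yes" NA "no"
```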

Related

extracting value of variable from dataframe

I have one issue in selecting a value of one variable conditional on the value of another variable in a dataframe.
Dilutionfactor=c(1,3,9,27,80)
Log10Dilutionfactor=log10(Dilutionfactor)
Protection=c(100,81.25,40,10.52,0)
RM=as.data.frame(cbind(Dilutionfactor,Log10Dilutionfactor,Protection))
Now I want the value of Log10Dilutionfactor for the row where Protection is equal to 50 (if it appears) or, failing that, the value immediately below 50.
When I used subset(RM,Protection<= 50) it gave three rows, and when I tried RM[grepl(RM$Protection<=50,Log10Dilutionfactor),] it gave 0 rows with a warning message. I would really appreciate it if someone could help me.
You can use two nested subset calls:
subset(RM,Protection==max(subset(RM,Protection<= 50)$Protection))$Log10Dilutionfactor
# [1] 0.954243
You could use
with(RM, Log10Dilutionfactor[which(Protection == max(Protection[Protection <= 50]))])
# [1] 0.9542425
Or find the index of the Protection value that is closest to 50 (note that this can pick a value above 50):
index = which(abs(RM$Protection-50)<=min(abs(RM$Protection-50)))
and then look it up in whatever column you want, e.g. for Dilutionfactor:
RM$Dilutionfactor[index]
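The two ideas above ("largest value at or below 50" versus "closest to 50") can be sketched side by side with which.max and which.min. This rebuilds RM from the question; the names below, at_or_below and closest are only for illustration.

```r
Dilutionfactor <- c(1, 3, 9, 27, 80)
RM <- data.frame(
  Dilutionfactor      = Dilutionfactor,
  Log10Dilutionfactor = log10(Dilutionfactor),
  Protection          = c(100, 81.25, 40, 10.52, 0)
)

# Largest Protection that does not exceed 50.
below <- RM[RM$Protection <= 50, ]
at_or_below <- below$Log10Dilutionfactor[which.max(below$Protection)]

# Protection closest to 50 in either direction.
closest <- RM$Log10Dilutionfactor[which.min(abs(RM$Protection - 50))]
```

With this data both pick the row where Protection is 40, but they would differ if, say, a Protection of 55 were present.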

Descriptive statistics of time variables

I want to compute simple descriptive statistics (mean, etc.) of the times when people go to bed, and I ran into two problems. The original data comes from an Excel file in which only the time that people went to bed was typed in, in 24-hour format. My problem is that R doesn't recognize when people went to bed at 1:00 am the next day: a person who went to bed at 10 pm should be 3 hours apart from one who went to bed at 1:00 am, not 21 hours.
In my data frame the variable in_bed is in POSIXct format, so I thought I would apply an if statement saying that if the time is before 12:00, I want to add 24 hours.
My function is:
Patr$in_bed <- if(Patr$in_bed <= ) {
Patr$in_bed + 24*60*60
}
My data frame looks like this
in_bed
1 1899-12-30 22:13:00
2 1899-12-30 23:44:00
3 1899-12-30 00:08:00
If I run my function, my variable gets deleted and the following warning message is printed:
Warning message:
In if (Patr$in_bed < "1899-12-30 12:00") { :
the condition has length > 1 and only the first element will be used
What am I doing wrong, or does anyone have a better idea? And can I run functions such as mean on variables in POSIXct format, and if not, how do I do it?
When you compare Patr$in_bed (a vector) with "1899-12-30 12:00" (a single value), you get a logical vector. But an if statement requires a single logical value, so R issues a warning and considers only the first element of the vector (from R 4.2.0 onward this is an error instead). Because the first element (22:13) fails the test and there is no else branch, the if returns NULL, and assigning NULL to Patr$in_bed deletes the column.
You can try:
Patr$in_bed <- Patr$in_bed + 24*60*60 * (Patr$in_bed < as.POSIXct("1899-12-30 12:00"))
Explanation: the comparison in parentheses returns a logical vector, which is converted to numeric (0 for FALSE and 1 for TRUE). The times for which the condition is true therefore get 24*60*60 seconds added, and the others get 0.
But if the dates were recorded correctly, the POSIXct format would already include them and there would be no need to add 24 hours. For instance,
as.POSIXct("1899-12-31 01:00:00") - as.POSIXct("1899-12-30 22:00:00")
returns a time difference of 3 hours, not 21.
To answer your last question: yes, you can compute the mean of a POSIXct vector, simply with:
mean(Patr$in_bed)
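Putting the shift and the mean together, here is a minimal sketch using the three bedtimes shown in the question. The time zone is fixed to UTC purely so the example is reproducible; that choice and the variable names are assumptions, not part of the original data.

```r
in_bed <- as.POSIXct(
  c("1899-12-30 22:13:00", "1899-12-30 23:44:00", "1899-12-30 00:08:00"),
  tz = "UTC"
)
noon <- as.POSIXct("1899-12-30 12:00:00", tz = "UTC")

# Times before noon are treated as "after midnight" and pushed to the next day.
shifted <- in_bed + 24 * 60 * 60 * (in_bed < noon)

# The mean is itself a POSIXct value; print only the clock time.
mean_bedtime <- mean(shifted)
mean_clock <- format(mean_bedtime, "%H:%M", tz = "UTC")
# mean_clock is "23:21"; without the shift the mean would land in the afternoon
```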
Hope it helps,
Jérémy

Conditional Label in R without Loops

I'm trying to find the best way (best as in performance), given a data frame of the form below, to add a new column called "Season" holding one of the four seasons of the year:
MON DAY YEAR
1 1 1 2010
2 1 1 2010
3 1 1 2010
4 1 1 2010
5 1 1 2010
6 1 1 2010
One straightforward way to do this is to loop over the MON and DAY columns and assign the values one by one, but I think there is a better way. I've seen suggestions in other posts for ifelse, := or apply, but in most of them the problem is binary, or the value can be assigned by a single function f of the parameters.
In my situation I believe a vector containing the four season labels plus the conditions would somehow suffice, but I don't see how to put everything together. My situation is more like a switch/case.
Using modulo arithmetic and the fact that arithmetic operators coerce logical-values to 0/1 will be far more efficient if the number of rows is large:
d$SEASON <- with(d, c( "Winter","Spring", "Summer", "Autumn")[
1+(( (DAY>=21) + MON-1) %/% 3)%%4 ] )
The first added "1" shifts the range of the %%4 operation on all the results inside the parentheses from 0:3 to 1:4. The subtracted "1" shifts the (inner) 1:12 month range back to 0:11, and (DAY >= 21) advances the boundary months forward by one.
I'll start by giving a simple answer, then I'll delve into the details.
A quick way to do this is to check the values of MON and DAY and output the correct season. This is trivial:
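To check the arithmetic in the modulo one-liner, here it is applied to a handful of boundary dates. The small data frame d is made up for the check.

```r
# Hypothetical rows chosen to sit on either side of each season boundary.
d <- data.frame(
  MON = c(1, 3, 3, 6, 9, 12, 12),
  DAY = c(15, 20, 21, 21, 21, 20, 21)
)

d$SEASON <- with(d, c("Winter", "Spring", "Summer", "Autumn")[
  1 + (((DAY >= 21) + MON - 1) %/% 3) %% 4])

# 20 March is still Winter, 21 March is Spring, and 21 December wraps
# around to Winter because (1 + 12 - 1) %/% 3 is 4, and 4 %% 4 is 0.
```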
f=function(m,d){
if(m==12 && d>=21) i=3
else if(m>9 || (m==9 && d>=21)) i=2
else if(m>6 || (m==6 && d>=21)) i=1
else if(m>3 || (m==3 && d>=21)) i=0
else i=3
}
This function f, given a month and a day (in that order), returns an integer corresponding to the season. It doesn't matter much whether it's an integer or a string; an integer only saves a bit of memory, which is a technicality.
Now you want to apply it to your data.frame. No need to use a loop for this; we'll use mapply. d will be our simulated data.frame. We'll convert the output to a factor to get nice season names.
d=data.frame(MON=rep(1:12,each=30),DAY=rep(1:30,12),YEAR=2012)
d$SEA=factor(
mapply(f,d$MON,d$DAY),
levels=0:3,
labels=c("Spring","Summer","Autumn","Winter")
)
There you have it!
I realize seasons don't always change on the 21st. If you need fine tuning, you could define a lookup array as a global variable to store the accurate days. Given a season and a year, you could access the corresponding day and replace the "21"s in the f function with the right lookups (you would obviously add a third argument for the year).
About the things you mentioned in your question:
ifelse is the "functional" way to make a conditional test. On single values it's only slightly better than the conditional statements, but it is vectorized, meaning that if the argument is a vector, it is applied to each of its elements. I'm not very familiar with it, but it's the way to go for an optimized solution.
mapply is derived from sapply of the "apply family" and allows you to call a function with several arguments on vectors (see ?mapply).
I don't think := is a standard operator in R, which brings me to my next point:
data.table! It's a package that provides a new structure that extends data.frame for fast computing and typing (among other things). := is an operator in that package and allows you to define new columns. In our case you could write d[,SEA:=mapply(f,MON,DAY)] if d is a data.table.
If you really care about performance, I can't insist enough on using data.table, as it is a major improvement if you have a lot of data. I don't know if it would really affect computing time with the solution I proposed, though.
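For what a fully vectorized version of the same rule could look like in base R, here is one possible sketch (not the method from the answer above). It assumes the same fixed "21st" boundaries, encodes each date as MON*100 + DAY, and uses findInterval; the names key, breaks and labels are illustrative.

```r
# Hypothetical rows on either side of each season boundary.
d <- data.frame(
  MON = c(1, 3, 3, 6, 9, 12, 12),
  DAY = c(15, 20, 21, 21, 21, 20, 21)
)

# 321 means 21 March, 621 means 21 June, and so on.
key    <- d$MON * 100 + d$DAY
breaks <- c(321, 621, 921, 1221)
labels <- c("Winter", "Spring", "Summer", "Autumn", "Winter")

# findInterval returns 0..4, the number of boundaries already passed.
d$SEA <- labels[findInterval(key, breaks) + 1]
```

There is no per-row function call here, so it should scale better than mapply on large data, though I haven't benchmarked it.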

Using Loops to find the Difference between two Rows in the same Column

So I have 252 rows of data in column 4, and I would like to find the difference between each pair of consecutive rows throughout the entire column.
My current code is:
appleClose<-NULL
for (i in 1:Apple[1]){
appleClose[i] <- AA[i,4]
}
appleClose[]
I tried, and failed, with:
appleClose<-NULL
for (i in 1:Apple[1]){
appleClose[i] <- AA[i,4] - AA[i+1,4]
}
appleClose[]
Edit:
I am trying to optimize a stock market portfolio in retrospect.
AA is the ticker symbol for Apple. I downloaded that information through some R code written earlier in the program.
I have not checked out the diff function yet. I will do that now.
The error I am receiving is
Error in `[.xts`(AA, i + 1, 4) : subscript out of bounds
Is this what you mean?
> Apple=runif(5,1,10)
#5 numbers
> Apple
[1] 3.362267 2.489085 3.899513 5.591127 9.315716
#4 differences
> diff(Apple)
[1] -0.8731816 1.4104271 1.6916143 3.7245894
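or, depending on your data, either
> diff(AA$Apple)
or maybe
> diff(AA[,4])
Another option (if this is what you mean; your question is not very clear). If AA is an xts object, wrap each side in as.numeric() first, because xts arithmetic aligns values by timestamp:
AA[-1,4] - AA[-dim(AA)[1],4]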
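To make the relationship between these options concrete, here is a small sketch on a plain numeric vector (prices is made-up data, not the AA object). It also shows why the loop in the question fails: with n values there are only n - 1 differences, so the index i + 1 runs out of bounds on the last row unless the loop stops one step early.

```r
# Hypothetical closing prices.
prices <- c(10, 12, 11, 15)
n <- length(prices)

# Built-in consecutive differences.
d1 <- diff(prices)

# Same thing by dropping the first and the last element.
d2 <- prices[-1] - prices[-n]

# Loop version: stop at n - 1 so that i + 1 never exceeds n.
d3 <- numeric(n - 1)
for (i in seq_len(n - 1)) {
  d3[i] <- prices[i + 1] - prices[i]
}
# all three are 2 -1 4
```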

How to subset a spatial polygon in R by matching partial strings?

I want to split a polygon shapefile (deforestation in the Brazilian Amazon) by year of deforestation. The years are in a string field, like "d2010_1", "d2010_2", "d2011_1" and so on. I want to split it into 5-year periods. I tried the following:
d00a04 <- prodes[grepl("d2000",prodes@data$CLASS_NAME) ||
grepl("d2001",prodes@data$CLASS_NAME) ||
grepl("d2002",prodes@data$CLASS_NAME) ||
grepl("d2003",prodes@data$CLASS_NAME) ||
grepl("d2004",prodes@data$CLASS_NAME),]
but it gave the following error:
Error in if (is.numeric(i) && i < 0) { :
missing value where TRUE/FALSE needed
I also tried:
anos00a04 = c("d2000","d2001","d2002","d2003","d2004")
d00a04 <- subset(prodes,prodes@data$CLASS_NAME %in% anos00a04)
but it gave the same error message. I've seen some examples in other questions, but I need to test whether the beginning of the string matches, not use numerical operators such as <, > or ==. Any help, please?
EDIT: I figured out a way, but something strange is happening. I did the following:
anos <- sort(unique(prodes@data$CLASS_NAME))
anos00a04 <- anos[2:20]
The first command gives me all 49 levels from the original shapefile. The second returns only those between 2000 and 2004. So far so good. But when I ask to see the second variable, it shows the 19 items (d2000_2 d2000_3 d2001_0 d2001_3 d2001_4...), but below it says: "49 Levels: d1997_0 d2000_2 d2000_3..." including those that were supposed to stay out (and were left out of the listing). What's happening?
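A rough sketch of what seems to be going on, using a plain factor instead of the shapefile so that it runs on its own (class_name and its values are made up). Two things are involved: || looks at only a single value (and is an error on longer vectors in recent versions of R), whereas | or a single anchored regular expression works element by element; and subsetting a factor keeps all of its original levels until droplevels is called.

```r
# Hypothetical stand-in for prodes@data$CLASS_NAME.
class_name <- factor(c("d1997_0", "d2000_2", "d2001_3",
                       "d2004_1", "d2005_0", "d2010_1"))

# One anchored pattern covers d2000 through d2004, one TRUE/FALSE per row.
keep <- grepl("^d200[0-4]", class_name)

# The subset shows 3 values but still carries all 6 levels.
subset_names <- class_name[keep]
n_before <- nlevels(subset_names)

# droplevels discards the levels that no longer occur.
trimmed <- droplevels(subset_names)
n_after <- nlevels(trimmed)
```

On the real object, the logical vector would presumably be used as prodes[keep, ]; that part is an assumption I have not tested on a SpatialPolygonsDataFrame.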
PS: "anos" is the portuguese word for "years".

Resources