I have to migrate an R script to Python and found the following if chain in R:
PreparedData <- PreparedData %>% mutate(T.Churn3 = ifelse(lead(T.Purchases) == 0 & T.Purchases > 0 | lead(T.Purchases, 2) == 0 & lead(T.Purchases) > 0 & T.Purchases >0 | lead(T.Purchases, 2) == 0 & lead(T.Purchases) == 0 & T.Purchases >0, 1, 0))
Now I'm struggling with the evaluation order here for R. For me this statement looks unnecessarily bloated. This is how I understand the order of evaluation:
1. Check if the purchases of the next row are zero and the purchases of the current row are greater than zero.
2. If 1. does not apply, check if the purchases 2 rows ahead are zero and the purchases of the next row and of the current row are greater than zero.
3. If 1. and 2. do not apply, check if the purchases 2 rows ahead are zero, the purchases of the next row are zero and the current purchases are greater than zero.
I'm really not sure if that is right, but it is the only reading that makes any sense to me. If that assumption of mine is right, then the third part of the statement would be unnecessary, because whenever the third part holds, the first part holds as well.
Can anyone shed some light here?
Best regards,
André
Thanks to Roland's comment I think I was able to figure out what that statement actually does, and it is indeed not very efficient. So for anyone else struggling with the logical operators, maybe the following approach can also help:
Working through the chain, I evaluated each logical operation and wrote its outcome as T/F one line below, doing the & parts first because & binds tighter than | in R. After that I combined the T/F values with the next part of the logical chain until I reached the end of the line. In the end the picture looked like this:
All that was left was to plug in all possibilities. The chain is three & groups joined by |, and the third group can only be true when the first one is, so it can be dropped. Factoring out the common T.Purchases > 0, the whole chain can be simplified to:
ifelse(T.Purchases > 0 & (lead(T.Purchases) == 0 | lead(T.Purchases, 2) == 0 & lead(T.Purchases) > 0), 1, 0)
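To double-check the grouping, here is a minimal base-R sketch. The vector purchases is made-up sample data and lead_n() is a hypothetical stand-in for dplyr's lead(), written only so the example runs without extra packages. It relies on & binding more tightly than | in R and compares the chain from the question with versions where the implicit parentheses are written out and the redundant group is dropped.

```r
# Hypothetical stand-in for dplyr::lead(): shift a vector up by n, padding with NA.
lead_n <- function(x, n = 1) c(x[-seq_len(n)], rep(NA, n))

# Made-up purchase counts, one value per period.
purchases <- c(3, 0, 2, 1, 0, 0, 4, 2, 0, 5)

l1 <- lead_n(purchases, 1)
l2 <- lead_n(purchases, 2)

# The chain exactly as written in the question.
original <- ifelse(l1 == 0 & purchases > 0 |
                   l2 == 0 & l1 > 0 & purchases > 0 |
                   l2 == 0 & l1 == 0 & purchases > 0, 1, 0)

# The same chain with the implicit grouping made explicit.
grouped <- ifelse((l1 == 0 & purchases > 0) |
                  (l2 == 0 & l1 > 0 & purchases > 0) |
                  (l2 == 0 & l1 == 0 & purchases > 0), 1, 0)

# Third group dropped: it can only be TRUE when the first group is.
reduced <- ifelse((l1 == 0 & purchases > 0) |
                  (l2 == 0 & l1 > 0 & purchases > 0), 1, 0)

# Common condition purchases > 0 factored out.
factored <- ifelse(purchases > 0 & (l1 == 0 | l2 == 0 & l1 > 0), 1, 0)

identical(original, grouped)
identical(original, reduced)
identical(original, factored)
```

The last rows come out as NA in every version, because lead_n() (like lead()) has nothing to look ahead to there. That is worth keeping in mind when porting the logic to Python.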
Related
I am trying to create a new column ($Correct) in a data frame based on values in two other columns ($Condition and $Response).
I realise that there are multiple ways of achieving this (I have since used another method), but I'm interested in the reason why the method below did not work.
training_data.df$Correct<- 0
training_data.df$Correct[training_data.df$Condition==2 & training_data.df$Response==1] <- 1
training_data.df$Correct[(training_data.df$Condition==1|3) & training_data.df$Response==2] <- 1
This method produces the correct values in the output (the new $Correct column), except for cases where $Condition==2 and $Response==2 (the value '1' prints in the $Correct column rather than '0').
This line of code works correctly on its own, but not in combination with the other (last) line for $Condition==1|3.
Can anyone explain why this occurs?
training_data.df$Condition==1|3
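reads as:
"(training_data.df$Condition is equal to 1)"
or
"three".
"(training_data.df$Condition is equal to 1)" can be TRUE or FALSE.
"three" not so much: R coerces any non-zero number to TRUE, so the whole expression is TRUE for every row, including the rows where Condition is 2.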
Whereas what I think you mean is:
"training_data.df$Condition is equal to (either 1 or 3)".
This would be (training_data.df$Condition==1 | training_data.df$Condition==3) or training_data.df$Condition %in% c(1,3).
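A small sketch of the difference, using a made-up vector named condition in place of training_data.df$Condition:

```r
# Made-up condition codes for illustration.
condition <- c(1, 2, 3, 2, 1)

# == is evaluated before |, so this is (condition == 1) | 3.
# The bare 3 is coerced to TRUE, which makes every element TRUE.
wrong <- condition == 1 | 3

# Two ways to express "condition is 1 or 3".
right_or <- condition == 1 | condition == 3
right_in <- condition %in% c(1, 3)

wrong     # TRUE TRUE TRUE TRUE TRUE
right_or  # TRUE FALSE TRUE FALSE TRUE
right_in  # TRUE FALSE TRUE FALSE TRUE
```

Because wrong is TRUE everywhere, the assignment in the question only depends on Response == 2, which is why rows with Condition 2 and Response 2 end up as 1.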
I have a dataset called college, and one of the columns is 'accepted'. There are two values for this column: 1 (which means the student was accepted) and 0 (which means the student was not accepted). I want to find the percentage of accepted students.
I did this...
table(college$accepted)
which gave me the frequency of 1 and 0. (1 = 44,224 and 0 = 75,166). I then manually added those two values together (119,390) and divided the 44,224/119,390. This is fine and gets me the value I was looking for. But I would really like to know how I could do this with R code, since I'm sure there is a way to do it that I just haven't thought of.
Thanks!
Perhaps you can use prop.table like below
prop.table(table(college$accepted))["1"]
If it's a simple 0/1 column then you only need to take the column mean.
mean_accepted <- mean(college$accepted)
You could also sum the column and divide by the total number of entries in the column:
sum(college$accepted)/length(college$accepted)
To make the code more explicit and describe your intent better, I suggest using a condition to identify the cases that meet your criteria for inclusion. For example:
college$accepted == 1
Then take the average of the logical vector to compute the proportion (between 0 and 1), multiply by 100 to make it a percentage.
100 * mean(college$accepted == 1, na.rm = TRUE)
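To see that these approaches agree, here is a minimal sketch on made-up data. The vector accepted stands in for college$accepted and is not the real dataset.

```r
# Made-up 0/1 acceptance flags standing in for college$accepted.
accepted <- c(1, 0, 0, 1, 0, 1, 0, 0)

p_table <- as.numeric(prop.table(table(accepted))["1"])  # share of the "1" level
p_mean  <- mean(accepted)                                # mean of a 0/1 column
p_ratio <- sum(accepted) / length(accepted)              # explicit sum over count
p_cond  <- mean(accepted == 1, na.rm = TRUE)             # mean of a logical vector

c(p_table, p_mean, p_ratio, p_cond)  # all four are 0.375
100 * p_cond                         # 37.5
```

The versions differ once the column contains NA: mean(accepted) and the sum/length ratio then return NA, while the last form drops the missing values because of na.rm = TRUE.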
I have a dataset named bwght which contains the variable cigs (cigarettes smoked per day).
When I calculate the mean of cigs in the dataset bwght using:
mean(bwght$cigs), I get a number 2.08.
Only 212 of the 1388 women in the sample smoke (and 1176 do not smoke):
summary(bwght$cigs>0) gives the result:
Mode FALSE TRUE NA's
logical 1176 212 0
I'm asked to find the average of cigs among the women who smoke (the 212).
I'm having a hard time finding the right syntax for excluding the non-smokers (cigs = 0).
I have tried:
mean(bwght$cigs| bwght$cigs>0)
mean(bwght$cigs>0 | bwght$cigs=TRUE)
if (bwght$cigs > 0){
sum(bwght$cigs)
}
x <-as.numeric(bwght$cigs, rm="0");
mean(x)
But nothing seems to work! Can anyone please help me??
If you want to exclude the non-smokers, you have a few options. The easiest is probably this:
mean(bwght[bwght$cigs>0,"cigs"])
With a data frame, the first index is the row and the second is the column. So, you can subset using dataframe[1,2] to get the first row, second column. You can also use a logical condition in the row selection. By using bwght$cigs>0 as the first index, you are subsetting to only the rows where cigs is greater than zero.
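A minimal sketch of this kind of subsetting, assuming a plain data.frame. The small bwght built here is made-up data with the same column name, not the real dataset.

```r
# Made-up stand-in for the bwght data; only the cigs column matters here.
bwght <- data.frame(cigs = c(0, 0, 10, 0, 20, 0, 0, 30))

smokes <- bwght$cigs > 0          # logical vector, TRUE for smokers

mean(bwght$cigs)                  # 7.5, all women, zeros included
mean(bwght[smokes, "cigs"])       # 20, rows where smokes is TRUE, column cigs
mean(bwght$cigs[smokes])          # 20, same idea on the bare vector
```

Indexing the column vector directly, as in the last line, does not depend on how the data frame class handles single-column selection.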
Your other ones didn't work for the following reasons:
mean(bwght$cigs| bwght$cigs>0)
This is effectively a logical comparison. You're asking for the TRUE / FALSE result of bwght$cigs OR bwght$cigs>0, and then taking the mean of it. mean() does accept logicals, treating TRUE as 1 and FALSE as 0, so what you get back is the proportion of smokers (212/1388, about 0.15) rather than the average number of cigarettes.
mean(bwght$cigs>0 | bwght$cigs=TRUE)
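This one doesn't parse at all: a single = inside a call is treated as naming an argument, so R stops with an unexpected '=' error. Written with == it would have the same problem as above, because | returns a logical and R would take the mean of logicals.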
if(bwght$cigs > 0){sum(bwght$cigs)}
By any chance, were you a SAS programmer originally? This looks like how I used to type at first. Basically, if() doesn't work the same way in R as it does in SAS. In that example, you are using bwght$cigs > 0 as the if condition, which won't work because if() expects a single TRUE or FALSE. Older versions of R only look at the first element of the vector resulting from bwght$cigs > 0 (with a warning), and since R 4.2.0 a condition of length greater than one is an error. R handles looping differently from SAS, so check out functions like lapply, tapply, and so on.
x <-as.numeric(bwght$cigs, rm="0")
mean(x)
as.numeric() has no rm argument, so as far as I can tell the extra argument is simply ignored: x is still the full cigs column, zeros included, and mean(x) gives the same 2.08 as before.
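A small sketch of what mean() does with a logical vector, using a made-up cigs vector rather than the real data:

```r
# Made-up cigarette counts for illustration.
cigs <- c(0, 0, 10, 0, 20, 0, 0, 30)

is_smoker <- cigs | cigs > 0   # | turns both sides into logicals
is_smoker                      # FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE

mean(is_smoker)                # 0.375, the share of smokers, not a mean of cigs
mean(cigs[cigs > 0])           # 20, the mean among smokers
```

So the first attempt runs without error but answers a different question: it reports how many women smoke, as a fraction, instead of how much the smokers smoke.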
mean(bwght[bwght$cigs>0,"cigs"])
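For me this statement failed with the warning "argument is not numeric or logical: returning NA". That typically happens when bwght is a tibble rather than a plain data frame, because [ , "cigs"] then returns a one-column table instead of a vector.
Converting to a matrix solved this: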
mean(data.matrix(bwght[bwght$cigs>0,"cigs"]))
I'm new to R and I'm looking through a book called "Discovering Statistics using R".
Although the book implies you don't need any statistical background, some of the content isn't covered/explained...
I'm trying to sum the elements of a vector starting from position 1 until a positive element is present.
I found this question which is very similar to what I'm trying to achieve. However when I implement it, it doesn't always seem to work (and it sometimes appears to include the first positive element)...
My program is:
vecA <- runif(10, -10, 10);
sumA <-sum(vecA [1:min(which(vecA < 0))]);
Is there a more robust way to calculate this without using loops that works every time and doesn't add the positive element? I'm not at the looping stage of my books yet.
I also found this site which asks a similar question but their answer errors:
sum(vecA [seq_len(which.max(vecA > 0)]);
You can use the following code:
sum(vecA * !cumsum(vecA > 0))
This also works if the first element is positive or all elements are negative.
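To illustrate those edge cases, here is the same expression wrapped in a function. The name sum_before_first_positive is made up for this sketch, and the inputs are fixed vectors instead of runif() so the results are reproducible.

```r
# Sum the leading elements of x up to, but not including, the first positive one.
# cumsum(x > 0) stays 0 until the first positive element, so !cumsum(...) is TRUE
# only for the leading run, and multiplying by it zeroes out everything after.
sum_before_first_positive <- function(x) sum(x * !cumsum(x > 0))

sum_before_first_positive(c(-2, -3, 4, -1, 5))   # -5
sum_before_first_positive(c(4, -1, -2))          # 0, first element is positive
sum_before_first_positive(c(-1, -2, -3))         # -6, no positive element
```

One caveat: because the later elements are multiplied by zero rather than dropped, an Inf after the first positive value would turn the result into NaN. Subsetting with x[cumsum(x > 0) == 0] avoids that.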
You want to use > not < to sum all elements until the first positive one is reached.
You're currently summing from 1 until the first negative value is reached (including the first negative value).
sum(vecA[1:min(which(vecA>0))-1])
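which() returns the positions of all positive elements, so min(which(vecA>0)) is the position of the first positive one. Because : binds tighter than -, the index is actually 0 up to that position minus 1; R ignores the 0, so only the elements before the first positive value are summed. Note that this fails if vecA contains no positive element at all.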
The match function is usually the fastest way to find the first occurrence of some element in a vector, so another version of this could look as follows:
first.positive <- match(TRUE, vecA > 0)
sumA <- sum( vecA[ 1 : first.positive ] ) - vecA[first.positive]
This will give you zero if the first element is positive.
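match() returns NA when there is no positive element, and 1:NA is an error, so a more defensive variant needs a guard. The following is only a sketch, with the function name sum_until_positive made up for illustration. It uses seq_len() instead of subtracting the element afterwards.

```r
# Sum everything before the first positive element, using match() to find it.
sum_until_positive <- function(x) {
  first_positive <- match(TRUE, x > 0)          # NA when nothing is positive
  if (is.na(first_positive)) return(sum(x))     # no positive element: sum it all
  sum(x[seq_len(first_positive - 1)])           # seq_len(0) is empty, so sum is 0
}

sum_until_positive(c(-2, -3, 4, -1, 5))   # -5
sum_until_positive(c(4, -1, -2))          # 0
sum_until_positive(c(-1, -2, -3))         # -6
```

Summing the whole vector when nothing is positive is an assumption about the intended behaviour. Returning NA in that case would be just as defensible.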