I'm keen to make the leap from SPSS to R.
Filtering cases is a common operation in SPSS. Can somebody please advise on why I am receiving an error message?
2019dataset=read.spss("C:\\SPSS data\\2019dataset.sav")
selected_2019dataset <- 2019dataset[ which(2019dataset$hhweight > 0 & 2019dataset$income~=0 & 2019dataset$age > 16 & 2019dataset$age < 59),]
I'm getting an error saying that there is an unexpected '='
The filter I am trying to replicate in SPSS syntax is:
SELECT IF ((hhweight > 0) AND (income~=0) AND (age > 16 AND age <59)).
I've been following the example here:
https://www.statmethods.net/management/subset.html
Grateful for any suggestions.
Thanks.
Instead of 2019dataset$income~=0
try 2019dataset$income!=0 if you want "not equal to"
or 2019dataset$income==0 if you want "equal to"
Spaces might make reading clearer so 2019dataset$income != 0 or 2019dataset$income == 0 would be an improvement, and you may not need which, but these are less important
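To make the difference between the two indexing styles concrete, here is a minimal sketch on made-up data. The data frame `dataset_2019` and its values are placeholders, not the real SPSS file. Note that an object name starting with a digit, such as 2019dataset, is not a valid R name unless it is wrapped in backticks, so the sketch uses a name that starts with a letter.

```r
# Toy data standing in for the SPSS file (values are made up for illustration)
dataset_2019 <- data.frame(
  hhweight = c(1.2, 0,   0.8, 1.5, 2.0),
  income   = c(500, 300, 0,   NA,  900),
  age      = c(30,  40,  25,  50,  70)
)

# Logical condition equivalent to the SPSS SELECT IF; != is R's "not equal"
keep <- dataset_2019$hhweight > 0 &
  dataset_2019$income != 0 &
  dataset_2019$age > 16 &
  dataset_2019$age < 59

# which() drops rows where the condition is NA; plain logical indexing
# would instead return an all-NA row for each of them
selected_2019 <- dataset_2019[which(keep), ]
print(selected_2019)
```

So which() is not required, but it does change what happens to rows with missing values, which may matter for survey data.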
I transitioned from SPSS to R and I prefer to use the tidyverse package, which I think is a bit more intuitive.
Your code would look something like:
library(tidyverse)
selected_2019dataset <- 2019dataset %>%
  filter(hhweight > 0 & income != 0 & age > 16 & age < 59)
Related
I have to migrate an R script to Python and found the following if chain in R:
PreparedData <- PreparedData %>% mutate(T.Churn3 = ifelse(lead(T.Purchases) == 0 & T.Purchases > 0 | lead(T.Purchases, 2) == 0 & lead(T.Purchases) > 0 & T.Purchases >0 | lead(T.Purchases, 2) == 0 & lead(T.Purchases) == 0 & T.Purchases >0, 1, 0))
Now I'm struggling with the evaluation order here for R. For me this statement looks unnecessarily bloated. This is how I understand the order of evaluation:
1. Check if the purchases of the next row are zero and the purchases of the current row are bigger than zero.
2. If 1. does not apply, check if the purchases 2 rows ahead are zero and the purchases of the next row and of the actual row are bigger than zero.
3. If 1. and 2. do not apply, check if the purchases 2 rows ahead are zero, the purchases of the next row are zero and the current purchases are bigger than zero.
I'm really not sure if that is right, but it is the only thing which could barely make sense to me. If that assumption of mine is right, then the third statement part would be unnecessary because the first statement part is part of the third statement.
Can anyone shed some light here?
Best regards,
André
Thanks to Roland's comment I think I was able to figure out what that statement actually does and it indeed is not very efficient. So for anyone else struggling with the logical operators maybe the following approach can also help:
Starting from left I looked at each logical operation and wrote their outcome as T/F one line below. After that I used the T/F with the next part of the logical chain until I reached the end of the line. In the end the picture looked like that:
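All that remained was to plug in every possibility. After that I noticed that the third clause is already covered by the first (it is the first clause plus one extra condition), so the whole chain can be simplified to:
ifelse(lead(T.Purchases) == 0 & T.Purchases > 0 | lead(T.Purchases, 2) == 0 & lead(T.Purchases) > 0 & T.Purchases > 0, 1, 0)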
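The tabulation described above can also be done mechanically. This is a minimal sketch, assuming purchases are never negative so that each value is either zero or positive. The names `p`, `l1` and `l2` are placeholders for T.Purchases, lead(T.Purchases) and lead(T.Purchases, 2). In R, & binds more tightly than |, so the original condition is an OR of three AND-clauses, which the parentheses below make explicit.

```r
# Every combination of "== 0" / "> 0" for the current row (p), the next row
# (l1) and the row after that (l2); 0 and 1 stand in for real purchase counts
grid <- expand.grid(p = c(0, 1), l1 = c(0, 1), l2 = c(0, 1))

# The original three-clause condition, written out with its implicit grouping
original <- with(grid,
  (l1 == 0 & p > 0) |
  (l2 == 0 & l1 > 0 & p > 0) |
  (l2 == 0 & l1 == 0 & p > 0))

# Candidate: the same condition without the third clause
candidate <- with(grid,
  (l1 == 0 & p > 0) |
  (l2 == 0 & l1 > 0 & p > 0))

# Side-by-side truth table, then a single TRUE/FALSE verdict
print(cbind(grid, original, candidate))
print(identical(original, candidate))
```

Here the two columns agree on all eight rows, which confirms that the third clause adds nothing once the first is present.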
I have a variable (RANK) that I am trying to assign certain values to based on certain conditions. There are 13 possible ranks (1 being the most serious type of crime to 10 being the least serious, and 13 which is basically 'missing' and what I am trying to replace based on the crime committed)
I have made a set of rules using str_detect to indicate (True or False) if a certain string of text is included in the observation. I currently have 22 rules, but will be including more as I build out the code.
I have tried both case_when and if_else. However, I know that case_when takes the first condition that matches, meaning the conditions need to be in a very particular order, which unfortunately I don't think will work for me. Example code below.
cleaned_data<-dataset_toclean2 %>% mutate(
RANK=case_when(
Murder==FALSE & sexoffense==FALSE & felony==TRUE & commonlaw==FALSE & kidnap==TRUE|FALSE & robbery==TRUE|FALSE & rwdw==TRUE|FALSE & awdw==TRUE|FALSE & abuse==TRUE|FALSE & discharge==TRUE|FALSE|burglary==TRUE & second==FALSE~"5",
secondmurder==TRUE~"3",
Murder==TRUE & FirstDegree==TRUE|FALSE & conspir==TRUE|FALSE & attempt==TRUE|FALSE & access==TRUE|FALSE & aid==TRUE|FALSE & solic==TRUE|FALSE | mansla== TRUE|FALSE| Murder==TRUE~"2",
sexoffense==TRUE~"4",
TRUE ~ "13"
) )
However, when I view the cleaned_data, all of the 2's, 3's, and 4's are correct but the 5's only have a handful that are accurately changed. The rest of the observations that should be 5's remain as 13 (unranked) even though they fit my rules.
Any suggestions would be really appreciated, as I know it will get pretty unruly as I make more rules for the remaining ranks.
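One possible explanation, offered as a sketch since the data is not shown: in R, == binds more tightly than &, which binds more tightly than |. A term such as kidnap==TRUE|FALSE therefore does not mean "kidnap may be either value". It is read as (kidnap == TRUE) | FALSE, and in a longer chain the literal FALSE gets ANDed with whatever follows it. Under that reading the first rule reduces to roughly "(all the conditions up to kidnap==TRUE) or (burglary==TRUE & second==FALSE)", which would explain why only a handful of rows become 5. The vectors below are made up to show the effect.

```r
kidnap  <- c(TRUE, FALSE)
robbery <- c(FALSE, TRUE)

# Parsed as (kidnap == TRUE) | FALSE, which is simply kidnap
a <- kidnap == TRUE | FALSE
print(a)

# Parsed as (kidnap == TRUE) | (FALSE & robbery == TRUE) | FALSE:
# the robbery test is ANDed with a literal FALSE and can never contribute
b <- kidnap == TRUE | FALSE & robbery == TRUE | FALSE
print(b)

# If the intent was "any of these flags is set", say so explicitly
any_flag <- kidnap | robbery
print(any_flag)
```

If a flag is allowed to be either TRUE or FALSE, the simplest fix is to leave it out of the condition entirely. Where alternatives are intended, grouping them in parentheses keeps them from interacting with the surrounding & terms.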
I am trying to create a conditional loop to create a new variable called BigSales which should be given a value of 'yes' if either the date occurred before 2012 or the total gross for the day exceeded $65 million. Otherwise, it should be given a value of 'no'.
I tried:
for(i in 1:45){
if(movies$Gross[i] > 65 | movies$Date[i] < 2012-01-01){
movies$BigSales[i] <- "yes"}
else (
movies$BigSales[i] <- "no"
)
}
But I got the error message:
Error in if (movies$Gross[i] > 65 | movies$Date[i] < 2012 - 1 - 1) { :
missing value where TRUE/FALSE needed
In addition to that, the data set contains 100 observations, but it is only reading 45. How can I solve this?
It's possible to add a conditional column in this manner, but there are tools out there that make this easier and more comprehensible.
library(plyr)
library(dplyr)
movies <- mutate(movies, BigSales = ifelse(Gross > 65 | Date < as.Date("2012-01-01"), "yes", "no"))
You should also be careful working with dates: call str(movies$Date) to make sure it's of the "Date" type, and if not you should pass it to as.Date.
To answer your question as you asked it, you didn't put quotes around the date you listed, so R evaluated it as the arithmetic expression 2012 - 1 - 1, which is 2010. If you'd prefer to solve this problem with the code you have, use "2012-01-01".
ifelse is vectorised, meaning it takes each item of the input vector, evaluates the condition for it, and returns a vector of results.
Another point: the OP mentioned that any date before 2012 counts as BigSales "yes", so checking only the year of movies$Date will do the trick.
In base R, a solution could be:
movies$BigSales <- ifelse(movies$Gross > 65 | as.numeric(format(movies$Date,"%Y")) < 2012,
"yes", "no")
Note: movies$Date is expected to be of type Date or POSIXct.
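For reference, here is a small self-contained sketch of the vectorised approach on made-up rows. The values are invented, and Gross is assumed to be in millions of dollars. The "missing value where TRUE/FALSE needed" error means the condition inside if() evaluated to NA, for example because Gross or Date is missing in that row. ifelse() passes the NA through instead of stopping.

```r
# Made-up rows for illustration; Gross is assumed to be in millions of dollars
movies <- data.frame(
  Date  = as.Date(c("2011-06-01", "2012-03-15", "2013-07-04", "2012-11-20")),
  Gross = c(40, 70, 30, NA)
)

# Unquoted, 2012-01-01 is arithmetic: 2012 minus 1 minus 1
print(2012-01-01)

# Vectorised over all rows, so no loop and no hard-coded row count
cutoff <- as.Date("2012-01-01")
movies$BigSales <- ifelse(movies$Gross > 65 | movies$Date < cutoff, "yes", "no")
print(movies)
```

Because the whole column is processed at once, all rows are covered, which also avoids the 1:45 limit in the original loop. If a loop is still preferred, seq_len(nrow(movies)) would iterate over every row.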
I have 252 rows of data in column 4, and I would like to find the difference between each pair of consecutive rows throughout the entire column.
My current code is:
appleClose<-NULL
for (i in 1:Apple[1]){
appleClose[i] <- AA[i,4]
}
appleClose[]
I tried, and failed, with:
appleClose<-NULL
for (i in 1:Apple[1]){
appleClose[i] <- AA[i,4] - AA[i+1,4]
}
appleClose[]
Edit:
I am trying to optimize a stock market portfolio in retrospect.
AA is the ticker symbol for Apple. I downloaded that information through some R code written earlier in the program.
I have not checked out the diff function yet. I will do that now.
The error I am receiving is
Error in `[.xts`(AA, i + 1, 4) : subscript out of bounds
Is this what you mean?
> Apple=runif(5,1,10)
#5 numbers
> Apple
[1] 3.362267 2.489085 3.899513 5.591127 9.315716
#4 differences
> diff(Apple)
[1] -0.8731816 1.4104271 1.6916143 3.7245894
or depending on your data either
>diff(AA$Apple)
or maybe
>diff(AA[,4])
Another option (if this is what you are referring to; your question is not very clear):
AA[-1, 4] - AA[-dim(AA)[1], 4]
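To show why the loop fails and how diff() relates to it, here is a minimal sketch on a plain numeric vector. `close_px` is a made-up stand-in for the closing-price column. The "subscript out of bounds" error is what one would expect when the loop reaches the last row and asks for row i + 1, which does not exist.

```r
# Stand-in for the closing-price column (made-up values)
close_px <- c(10.0, 10.5, 9.8, 11.2)
n <- length(close_px)

# A loop can only run to n - 1, because row i + 1 must exist
loop_diff <- numeric(n - 1)
for (i in seq_len(n - 1)) {
  loop_diff[i] <- close_px[i + 1] - close_px[i]
}

# diff() returns the same n - 1 differences in one call
print(diff(close_px))
print(all.equal(loop_diff, diff(close_px)))
```

Note that diff() computes "next minus current", the opposite sign of AA[i,4] - AA[i+1,4] in the attempt above. Also, if AA is an xts object, as the error message suggests, arithmetic between two xts objects is aligned on the time index, so subtracting shifted subsets by hand may not give consecutive differences; diff(AA[,4]) is likely the safer choice there.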
I want to divide a polygon shapefile (deforestation in the Brazilian Amazon) by year of deforestation. The years are in a string field, like "d2010_1", "d2010_2", "d2011_1" and so on. I want to divide it into 5-year periods. I tried the following:
d00a04 <- prodes[grepl("d2000",prodes@data$CLASS_NAME) ||
grepl("d2001",prodes@data$CLASS_NAME) ||
grepl("d2002",prodes@data$CLASS_NAME) ||
grepl("d2003",prodes@data$CLASS_NAME) ||
grepl("d2004",prodes@data$CLASS_NAME),]
but it gave the following error:
Error in if (is.numeric(i) && i < 0) { :
missing value where TRUE/FALSE needed
I also tried:
anos00a04 = c("d2000","d2001","d2002","d2003","d2004")
d00a04 <- subset(prodes,prodes@data$CLASS_NAME %in% anos00a04)
but it gave the same error message. I've seen some examples like here, here and here, but I need to see if the beginning of the string matches, not numerical operators such as <, > or ==. Any help, please?
EDIT: I figured out a way, but something strange is happening. I did the following:
anos <- sort(unique(prodes@data$CLASS_NAME))
anos00a04 <- anos[2:20]
The first command gives me all 49 levels from the original shapefile. The second returns only those between 2000 and 2004. So far so good. But when I print the second variable, it shows the 19 items (d2000_2 d2000_3 d2001_0 d2001_3 d2001_4...), but below it says: "49 Levels: d1997_0 d2000_2 d2000_3..." including those that were supposed to stay out (and were left out of the listing). What's happening?
PS: "anos" is the portuguese word for "years".
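A few general R behaviours are relevant here, sketched below on a made-up factor rather than the real shapefile. `||` is the scalar operator, whereas `|` works element by element, so a row selector needs `|` or a single pattern. `%in%` tests for exact matches, so "d2000" never matches "d2000_2". And subsetting a factor keeps all of its original levels until they are dropped explicitly, which is why all 49 levels are still listed. The vector `class_name` and its values are placeholders.

```r
# Made-up class names in the same style as the shapefile attribute
class_name <- factor(c("d1997_0", "d2000_2", "d2001_3", "d2004_1", "d2005_0"))

# One anchored pattern covers d2000 to d2004; grepl() returns one value per row
in_period <- grepl("^d200[0-4]", class_name)
print(in_period)

# Subsetting a factor keeps all original levels; droplevels() removes unused ones
anos00a04 <- droplevels(class_name[in_period])
print(levels(anos00a04))
```

Assuming prodes is a Spatial*DataFrame with one attribute row per polygon, a logical vector built this way from prodes@data$CLASS_NAME could then be used as the row index, along the lines of prodes[in_period, ].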