I have some data that I'm working with from a Udacity course (Link: Reddit Survey Responses). I'm trying to simplify the Employment Status variable by replacing any multi-word values with single-word alternatives using:
RS$employment.status <- ifelse(RS$employment.status == "Not employed, but looking for work",
"Unemployed", RS$employment.status)
However, when I run the code any values that aren't supposed to be replaced are replaced with numeric values. Given that the else case is to use the field's value, I'm not sure why the text isn't preserved as-is.
Here's a screenshot of the initial data
And the after
So if anyone could point out
why the substitution is being made when it doesn't look like it should be;
what would be the correct way to accomplish what I'm trying to achieve;
it would be much appreciated.
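The problem is that this variable is stored as a factor. ifelse() does not keep the factor attributes of its "no" argument, so the values that should stay unchanged come back as the factor's underlying integer codes instead of its labels. To fix this, you can either pass stringsAsFactors = FALSE when you read your data, or convert the column to character inside the call: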
RS$employment.status <- ifelse(RS$employment.status == "Not employed, but looking for work",
"Unemployed", as.character(RS$employment.status))
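To illustrate the factor issue from the employment.status answer above, here is a minimal sketch on a small made-up vector (the values are illustrative and not taken from the Reddit survey data). It assumes the column was read in as a factor, which was the default for read.csv() before R 4.0.0.

```r
# Hypothetical stand-in for RS$employment.status, stored as a factor
status <- factor(c("Employed full time",
                   "Not employed, but looking for work",
                   "Retired"))
target <- "Not employed, but looking for work"

# Passing the factor directly: unchanged values come back as integer codes
broken <- ifelse(status == target, "Unemployed", status)

# Converting to character first keeps the original labels
fixed <- ifelse(status == target, "Unemployed", as.character(status))

print(broken)
print(fixed)
```

Comparing the two printed vectors should show the labels surviving only in the second one.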
I am a beginner in R, so this is a very basic question. I could not find a specific answer to it, so I would like to ask here.
I'm confronted with the following challenge: I'd like to recode a character variable and create a new one from it.
Specifically, the variable in my data frame (data) is called "driver", with the categories "market", "legislation", "technology", and "mixed".
Now I would simply like to create a new variable, "driverrec", with the values "market" and "others". The three remaining categories should be combined into "others".
I tried it with this page: http://rprogramming.net/recode-data-in-r/
Basically, I tried to adapt the following code to my data, but it won't work for more than one category.
#Create a new field called NewGrade
SchoolData$NewGrade <- recode(SchoolData$Grade,"5='Elementary'")
# my attempt
driverrec <- data$driver
recode(driverrec, "'Mixed'='others'")  # this is working
But the whole recode is not working:
recode(driverrec, "'Mixed'='others'", "'Technology'='others'",
"'Legislation'='others'", "'Market'='market'" )
I am looking forward to your answers, and thank you for your help.
I found a solution that does not use the recode command:
data$driverrec[data$driver=='Market'] <- 'market'
data$driverrec[is.na(data$driverrec)] <- 'others'
This worked fine, in case someone is looking for a solution ;)
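For a two-way split like this, a single vectorized ifelse() in base R may be simpler than several assignments. The sketch below uses a small made-up data frame; the capitalized category values are assumed from the recode attempts in the question.

```r
# Hypothetical data frame with the four categories from the question
data <- data.frame(
  driver = c("Market", "Legislation", "Technology", "Mixed", "Market"),
  stringsAsFactors = FALSE
)

# "Market" becomes "market"; every other category becomes "others"
data$driverrec <- ifelse(data$driver == "Market", "market", "others")

print(table(data$driverrec))
```

Note that rows where driver is NA stay NA in driverrec with this approach, rather than being counted as "others".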
My question involves summarizing a data frame from which I am supposed to delete all empty cases. I tried using na.rm, but it didn't work because the rows without a value actually contain the text "Not Available", so I was getting an error due to missing data.
Looking around for what I could do, I came across a script where the person selects the rows using the following command:
filtered <- x[x$State==s &
x$Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack != 'Not Available',
c("Hospital.Name","Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack")]
I fixed the issue of how to select the "Not Available" rows, but I don't understand what the ==s does. Can anyone explain it to me, please?
A few things here:
your subsetting operation is doing three things at once:
selecting all rows where the State variable is equal to the value stored in the variable s (which must have been set before this line was run; otherwise you'd get an error); this is the meaning of x$State == s ...
and (this is what the & operator means) where the Hospital-30-day-mortality-rates variable is not equal to the string 'Not Available', i.e. is not missing
and selecting just the hospital name and mortality-rate columns from the data set (this is what the bit after the , is doing)
If you are reading the data in from a file using read.csv() or read.table(), you could use the na.strings argument to specify that "Not Available" should automatically get transformed to R's missing value, NA
you might want to rename your long-named variable (there are handy renaming functions in the gdata, sjmisc, plyr, and dplyr packages: pick one)
you can also use subset from base R, or filter and select from dplyr, to perform these operations
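The points above can be sketched together as follows. The inline CSV and the short column name Rate are made up for illustration (the real file uses the much longer mortality-rate column name), and s is set by hand here.

```r
# Hypothetical miniature of the hospital data, supplied inline as CSV lines
csv_lines <- c(
  "State,Hospital.Name,Rate",
  "TX,Hospital A,14.2",
  "TX,Hospital B,Not Available",
  "AL,Hospital C,15.1"
)

# na.strings turns "Not Available" into NA, so Rate is read as numeric
x <- read.csv(text = csv_lines, na.strings = "Not Available",
              stringsAsFactors = FALSE)

s <- "TX"  # the state to keep; must exist before the subsetting line runs

# Rows: matching state AND non-missing rate; columns: name and rate only
filtered <- subset(x, State == s & !is.na(Rate),
                   select = c(Hospital.Name, Rate))
print(filtered)
```

Because the missing values are real NAs after reading, arguments such as na.rm = TRUE in mean() should then work as expected on that column.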
I'm trying to match everything except a specific string in R, and I've seen a bunch of posts on this suggesting a negative lookaround, but I haven't gotten that to work.
I have a dataset looking at crime incidents in SF, and I want to sort cases that have a resolution or do not. In the resolution field, cases have things listed like arrest booked, arrest cited, juvenile booked, etc., or none. I want to relabel all the specific resolutions like the different arrests to "RESOLVED" and keep the instances with "NONE" as such. So, I thought I could gsub or grep for not "NONE".
Based on what I've read on finding all strings except one specific string, I would have thought this would work:
resolution_vector = grep("^(?!NONE$).*", trainData$Resolution, fixed=TRUE)
Where I make a vector that searches through my training dataset, specifically the resolution column, and finds the terms that aren't "NONE". But, I just get an empty vector.
Does anyone have suggestions, or know why this might not be working in R? Or, even if there was a way to just use gsub, how do I say "not NONE" for my regex in R?
trainData$Resolution = gsub("!NONE", "RESOLVED", trainData$Resolution)  # what's the way to negate the string here?
Based on your explanation, it seems as though you don't need regular expressions (e.g. gsub()) at all. You can use != since you are looking for all non-matches of an exact string. Note that within() returns a modified copy, so the result has to be assigned back. Perhaps you want
trainData <- within(trainData, {
## next line only necessary if you have a factor column
Resolution <- as.character(Resolution)
Resolution[Resolution != "NONE"] <- "RESOLVED"
})
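resolution_vector = grep("^(?!NONE$).*", trainData$Resolution, perl=TRUE)
You need to use the option perl=TRUE, because lookaheads such as (?!...) are only supported by the Perl-compatible engine. You also need to drop fixed=TRUE: it makes grep treat the pattern as a literal string, and perl=TRUE is ignored when both are given.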
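As a quick check of the lookahead approach, here is a minimal sketch on a made-up vector standing in for trainData$Resolution. It also shows invert = TRUE, a grep() argument that avoids the lookahead entirely.

```r
# Hypothetical stand-in for trainData$Resolution
resolution <- c("ARREST, BOOKED", "NONE", "JUVENILE BOOKED", "NONE")

# Indices of entries that are not exactly "NONE" (needs the Perl engine)
not_none <- grep("^(?!NONE$).*", resolution, perl = TRUE)

# Same indices without a lookahead: match "NONE" and invert the result
not_none_inverted <- grep("^NONE$", resolution, invert = TRUE)

# Relabel everything except "NONE" in one step
relabeled <- sub("^(?!NONE$).*", "RESOLVED", resolution, perl = TRUE)

print(not_none)
print(relabeled)
```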
Sorry about the title. I'm actually having a hard time figuring out how to even phrase the question, which is why I can't just google it.
I want to get information from a data frame in R using a variable as the column title.
test = data.frame(season=c('winter','summer'), temp=c('cold','hot'))
what.season = 'winter'
test$what.season
The third line obviously doesn't work, but what I am trying to pass it is the value of what.season so that it reads test$winter and returns 'cold'
Edit for future readers: I'm tired and I phrased it wrong, but the correct answer got at what I was trying to do.
Here is how I would do it
test[test$season == "winter", ]$temp
The $ operator at the end selects the column of interest, while the logical comparison == selects the row of interest.
You can also use the subset function:
> subset(test, season==what.season, select=temp)
temp
1 cold
You can use the %in% operator:
test$temp[test$season%in%what.season]
test$season%in%what.season returns a logical vector after checking every row of the column test$season against the values of what.season (here, winter). You can then use that logical vector to select the matching values from the column test$temp.
The shortest way (that I know of) would be test[test$season==what.season, 'temp'].
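The answers above look up a row by value. If the goal is instead what the question title literally describes, selecting a column whose name is stored in a variable, the [[ operator accepts a character string where $ does not. A minimal sketch of both, where the variable name what.column is made up for illustration:

```r
test <- data.frame(season = c("winter", "summer"),
                   temp = c("cold", "hot"),
                   stringsAsFactors = FALSE)
what.season <- "winter"

# Row lookup by value, as in the answers above
temp_for_season <- test$temp[test$season == what.season]

# Column lookup by a name held in a variable (hypothetical variable name)
what.column <- "temp"
column_values <- test[[what.column]]

print(temp_for_season)
print(column_values)
```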
Using the data frame mtcars on RStudio.
Say for example I want to subset mtcars[mtcars$cyl == 4,]
Tabbing after mtcars$ will provide a drop down list of variable names in the data frame.
Tabbing after mtcars[mtcars$ does not return the variable names.
Why does this happen?
It will if you add a space:
mtcars[ mtcars$
Otherwise you're expecting R to look in something called mtcars[mtcars, not mtcars.
I was going to ask the same thing. I disagree with the answer that you are expecting R to look for something called mtcars[mtcars, because you can't even make that without putting it all in quotes anyway, e.g.
test[test <- c(1,3,2) # leaves you stuck with the next line being +
The only way to make such an abomination is:
"test[test" <- c(1,3,2)
And once made, you still can't use
test[test[2]
You still need to use quotes
"test[test"[2]
So, as far as I can tell, tabbing after mtcars[mtcars$ failing is either a bug or has some sort of reason behind it. If there is a reason does anyone know what it is?