How do I recode a character variable? - r

I am a beginner in R, so this is a very basic question. I could not find a specific answer to it, so I would like to ask here.
I'd like to recode a character variable and create a new one from it.
Specifically, the variable in my data frame (data) is called "driver", with the categories "market", "legislation", "technology", and "mixed".
Now I would simply like to create a new variable, "driverrec", with the values "market" and "others". The three remaining categories should be combined into "others".
I tried following this page: http://rprogramming.net/recode-data-in-r/
Basically, I tried to adapt the following code to my data, but it won't work for more than one category.
#Create a new field called NewGrade
SchoolData$NewGrade <- recode(SchoolData$Grade,"5='Elementary'")
# my attempt
driverrec <- data$driver
recode(driverrec, "'Mixed'='others'")
This works.
But the whole recode is not working:
recode(driverrec, "'Mixed'='others'", "'Technology'='others'",
"'Legislation'='others'", "'Market'='market'" )
Thank you in advance for your help.

I found a solution that does not use the recode command:
data$driverrec[data$driver == 'Market'] <- 'market'
data$driverrec[is.na(data$driverrec)] <- 'others'
This worked fine. I'm posting it in case someone else is looking for a solution ;)
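For reference, the same two-way recode can be sketched in base R with ifelse(). The data frame below is made-up example data, and the capitalised category labels are an assumption taken from the attempts above. As a side note, car::recode expects all recode rules in a single string separated by semicolons rather than as separate arguments, which is likely why the multi-argument call failed.

```r
# Hypothetical example data; only the column name `driver` comes from the question.
data <- data.frame(
  driver = c("Market", "Legislation", "Technology", "Mixed", "Market"),
  stringsAsFactors = FALSE
)

# Keep "Market" as "market" and collapse every other category into "others".
# Rows where driver is NA would stay NA with this comparison.
data$driverrec <- ifelse(data$driver == "Market", "market", "others")

print(table(data$driverrec))
```

Unlike the two-step assignment, this creates the whole column in one pass, so there is no intermediate state with missing values to fill in.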

Related

Assign a Value based on the numbers in a separate columns in R

I think I already know the possible solution, but I don't know exactly how to go about it, so please bear with me.
I have a dataset of YouTube trends. I want to read the values from two columns (likes and dislikes) and, based on their contents, make an entry in a new column. If the likes are higher than the dislikes, the video should be labelled 'positive'; if it has more dislikes, it should be 'negative'.
I'm mainly unsure how to go about this because most previous questions are based on one column rather than two. I know some mentioned using cut, but would it still work the same?
All help is appreciated, thanks.
You can use a simple ifelse:
df$new_col <- ifelse(df$likes > df$dislikes, 'positive', 'negative')
This can also be written without ifelse as:
df$new_col <- c('negative', 'positive')[as.integer(df$likes > df$dislikes) + 1]
You can also use Vectorize to create a vectorized version of a function. If your function func takes two arguments, vfunc <- Vectorize(func) lets you call df$newcol <- vfunc(df$likes, df$dislikes), which returns the result for each row as a vector that is assigned to the new column.
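A small sketch of both approaches side by side may help. The data values and the helper name label_video are made up for illustration; only the column names likes and dislikes come from the question. Note that with a strict > comparison, videos with equal likes and dislikes end up as 'negative'.

```r
# Hypothetical example data; only the column names come from the question.
df <- data.frame(likes = c(100, 5, 20), dislikes = c(10, 50, 20))

# Vectorised comparison over both columns; ties fall into 'negative' here.
df$new_col <- ifelse(df$likes > df$dislikes, "positive", "negative")

# Hypothetical scalar function that handles one pair of values at a time.
label_video <- function(likes, dislikes) {
  if (likes > dislikes) "positive" else "negative"
}

# Vectorize wraps the scalar function so it can be applied to whole columns.
vlabel <- Vectorize(label_video)
df$new_col2 <- vlabel(df$likes, df$dislikes)

print(df)
```

For a simple comparison like this, ifelse is usually the more direct choice; Vectorize is mainly useful when the per-row logic is already written as a scalar function.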

Keep the ID variable when creating seqelist in TraMineR

I am creating a list of sequences with TraMineR with the following code:
cj.seqe <- seqecreate(id=cj$party_id, time=cj$DATE_IN_num, event=cj$EVT_CD)
However, this list only contains the events and drops the id variable. I would like to merge the event sequences back to the original data. Is there a way to do this? I couldn't find anything in the docs. Thanks!
Maybe you could try with the following:
cj.seqe <- seqecreate(data=cj, id="party_id", time=cj$DATE_IN_num, event=cj$EVT_CD)
I suggest this because, in my experience, defining the id in the seqecreate function this way
id=dataset.name$variable_id didn't work!
Whereas defining the id with id="variable_id" worked!
I hope this helps, even though I have no rigorous explanation.

How to subset (without filtering) multiple columns from a data frame in R

I'm sorry this may have been done to death, but all the answers I've found veer all over the map into extreme exotica. I can subset using [[]] (I've learned from stackoverflow that I'm not supposed to use subset() and similar for my scripts, since they're intended for interactive use) for a single column, but I can't figure out how to make the leap to more than one column. These two work, of course:
outcomeA <- outcome[['Hospital.Name']]
outcomeB <- outcome[['TX']]
But I've tried a dozen permutations to get both of those columns, like so:
outcomeC <- outcome[[c('Hospital.Name', 'TX')]] (gives "subscript out of bound")
outcomeC <- outcome[c('Hospital.Name', 'TX')] (gives "undefined columns selected")
etc, but they all fail. Can someone please put me out of my misery and help me select more than one column?
Thanks - Ed
Did you try this with a comma and single brackets?
outcomeC <- outcome[,c('Hospital.Name', 'TX')]
Also, you can only select columns that exist in your data. Check the names against:
names(outcome)
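As a rough illustration: double brackets extract a single column, while single brackets with a character vector select several, and "undefined columns selected" typically means at least one requested name does not match names(outcome) exactly. The data frame below is a made-up stand-in for the asker's data, and wanted and missing_cols are names introduced only for this sketch.

```r
# Hypothetical stand-in for the `outcome` data frame from the question.
outcome <- data.frame(
  Hospital.Name = c("A", "B"),
  TX = c(1, 2),
  Other = c(TRUE, FALSE)
)

wanted <- c("Hospital.Name", "TX")

# Report any requested names that are not actual column names.
missing_cols <- setdiff(wanted, names(outcome))
print(missing_cols)

# Single brackets with a character vector return a data frame with those columns.
outcomeC <- outcome[, wanted]
print(outcomeC)
```

If missing_cols is not empty, the names it lists are the ones to compare against names(outcome) for typos or different capitalisation.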

Strangeness with filtering in R and showing summary of filtered data

I have a data frame loaded using the CSV Library in R, like
mySheet <- read.csv("Table.csv", sep=";")
I now can print a summary on that mySheet object
summary(mySheet)
and it shows a summary for each column. For example, one column named Diagnose has the unique values RCM, UCM, and HCM, and the summary shows the number of occurrences of each of these values.
I now filter by a diagnose, like
subSheet <- mySheet[mySheet$Diagnose=='UCM',]
which seems to work: when I just type subSheet in the console, it prints only the rows where the value matches 'UCM'.
However, if I do a summary on that subSheet, like
summary(subSheet)
it still 'knows' about the other two possibilities, RCM and HCM, and prints them with a count of 0. However, I expected that the newly created object would NOT know about the possible values of the original mySheet I initially loaded.
Is there any way to get rid of those other possible values after filtering? I also tried subset, but this just seems to be a kind of shortcut to '[' for interactive use... I also tried DROP=TRUE as an option, but this didn't change anything.
Totally mind squeezing :D Any help is highly appreciated!
What you are dealing with here are factors created when reading the csv file. You can get subSheet to forget the unused factor levels with
subSheet$Diagnose <- droplevels(subSheet$Diagnose)
or
subSheet$Diagnose <- subSheet$Diagnose[ , drop=TRUE]
just before you do summary(subSheet).
Personally I dislike factors, as they cause me too many problems, and I only convert strings to factors when I really need to. So I would have started with something like
mySheet <- read.csv("Table.csv", sep=";", stringsAsFactors=FALSE)
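The effect can be reproduced without the CSV file. The sketch below uses a made-up data frame in place of Table.csv, with Diagnose explicitly stored as a factor; the Value column is hypothetical. In R 4.0.0 and later, read.csv no longer converts strings to factors by default, so this behaviour mainly shows up in older versions or when factors are requested explicitly.

```r
# Hypothetical stand-in for the CSV data, with Diagnose stored as a factor.
mySheet <- data.frame(
  Diagnose = factor(c("RCM", "UCM", "HCM", "UCM")),
  Value = c(1.2, 3.4, 5.6, 7.8)
)

# Filtering rows keeps the factor's full set of levels.
subSheet <- mySheet[mySheet$Diagnose == "UCM", ]
print(levels(subSheet$Diagnose))

# droplevels removes the levels that no longer occur in the data.
subSheet$Diagnose <- droplevels(subSheet$Diagnose)
print(levels(subSheet$Diagnose))

print(summary(subSheet))
```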

Generating a dummy variable from lots of categories

So... I have a large data set with a variable that has many categories. I want to create new variables that group some of those categories into one.
I could do that with a conditional statement, but given the number of categories it would take me forever to go one line at a time. Also, while my original variable is numeric, the values themselves are random, so I can't use logical or range statements.
How do I create this conditional variable based on many particular values?
I tried the following, but without success. Below is an example of the different categories I want to group into one.
classes <- c(549,162,210,222,44,96,62,208,525,202,149,442,427,
564,423,106,422,546,205,560,127,536,34,261,568,
366,524,401,548,95,156,8,528, 430,527,556,203,554,523,
501,530,55,252,585,19,540,71,204,502,504, 196,436,48,
102,526,201,521,23,558,552,118,416,117,216,510,494,
516,544,518)
So this seemed pretty intuitive to me, but it doesn't work.
df$chem<- cbind(ifelse(df$class == classes ,1,0))
Needless to say, I'm a beginner, and this is probably not so hard to do, but I've been looking for a solution to this particular problem and I can't seem to find it. What am I missing? Thanks!
You are looking for %in%, not ==.
For example:
df$chem <- cbind(ifelse(df$class %in% classes ,1,0))
or using the logical to numeric conversion
df$chem <- as.numeric(df$class %in% classes)
If you want individual dummy variables for all the categories in df$class, you can use the class.ind function in the nnet package (which ships as a recommended package):
library(nnet)
class_ind <- class.ind(df$class)
# add if you want to combine with the original
df_ind <- do.call(cbind, list(df, class.ind(df$class)))
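To see why == fails here, it may help to compare the two operators on a tiny example. The data frame and the shortened classes vector below are made up for illustration, and chem_wrong is a name introduced only for this sketch. The == operator compares the two vectors position by position (recycling the shorter one), whereas %in% checks each value for membership in the whole set.

```r
# Hypothetical example data; `classes` is a shortened version of the list above.
df <- data.frame(class = c(549, 7, 162, 999, 44))
classes <- c(549, 162, 210, 222, 44)

# %in% tests each value of df$class for membership in `classes`.
df$chem <- as.numeric(df$class %in% classes)

# == compares element by element: 549 vs 549, 7 vs 162, 162 vs 210, and so on,
# so 162 is missed even though it is in `classes`.
df$chem_wrong <- as.numeric(df$class == classes)

print(df)
```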
