Generating a dummy variable from lots of categories - r

So... I have a large data set with a variable that has many categories. I want to create new variables that group some of those categories into one.
I could do that with a conditional statement, but given the number of categories it would take me forever to go one line at a time. Also, while my original variable is numeric, the values themselves are arbitrary, so I can't use logical or range statements.
How do I create this conditional variable based on many particular values?
I tried the following, but without success. Below is an example of the different categories I want to group into one.
classes <- c(549, 162, 210, 222, 44, 96, 62, 208, 525, 202, 149, 442, 427,
             564, 423, 106, 422, 546, 205, 560, 127, 536, 34, 261, 568,
             366, 524, 401, 548, 95, 156, 8, 528, 430, 527, 556, 203, 554, 523,
             501, 530, 55, 252, 585, 19, 540, 71, 204, 502, 504, 196, 436, 48,
             102, 526, 201, 521, 23, 558, 552, 118, 416, 117, 216, 510, 494,
             516, 544, 518)
So this seemed pretty intuitive to me, but it doesn't work.
df$chem <- cbind(ifelse(df$class == classes, 1, 0))
Needless to say I'm a beginner, and this is probably not so hard to do, but I've been looking for a solution to this particular problem and I can't seem to find it. What am I missing? Thanks!

You are looking for %in%, not ==.
e.g.
df$chem <- cbind(ifelse(df$class %in% classes, 1, 0))
or using the logical to numeric conversion
df$chem <- as.numeric(df$class %in% classes)
If you want individual dummy variables for all the categories in df$class, you can use the class.ind function from the nnet package (which ships with R as a recommended package).
library(nnet)
class_ind <- class.ind(df$class)
# if you want to combine with the original
df_ind <- do.call(cbind, list(df, class.ind(df$class)))
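For a quick check, here is a minimal sketch of both approaches on a tiny made-up data frame (the values below are invented for illustration):
# toy stand-in for the real df
df <- data.frame(class = c(549, 7, 162, 999, 44))

# 0/1 flag: does the class fall in the target group?
df$chem <- as.numeric(df$class %in% classes)
# df$chem is now 1 0 1 0 1

# one dummy column per distinct value of class
library(nnet)
df_ind <- cbind(df, class.ind(df$class))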

Related

Assign a value based on the numbers in separate columns in R

So I kind of already know the possible solution, but I don't know exactly how to go about it, so please give me a bit of grace here.
I have a dataset of YouTube trends, and I want to read the values from two columns (likes and dislikes) and, based on their contents, make an entry in a new column. If the likes are higher than the dislikes, the video should be labelled 'positive', and if it has more dislikes it should be 'negative'.
I'm mainly unsure how to go about this, since most of the previous questions are based on one column rather than two. I know some mentioned using cut, but would it still work the same?
All help is appreciated, thanks.
You can use a simple ifelse:
df$new_col <- ifelse(df$likes > df$dislikes, 'positive', 'negative')
This can also be written without ifelse as:
df$new_col <- c('negative', 'positive')[as.integer(df$likes > df$dislikes) + 1]
You can also use Vectorize to create a vectorized version of a function: vfunc <- Vectorize(func) lets you call df$newcol <- vfunc(df$likes, df$dislikes) if your function takes two arguments, returning the result for each row as a vector that is assigned to the new column, as in the sketch below.
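A minimal sketch (func here is an invented row-wise scorer, not something given in the question):
# a function written for single values rather than vectors
func <- function(likes, dislikes) {
  if (likes > dislikes) 'positive' else 'negative'
}

vfunc <- Vectorize(func)                   # vectorized wrapper
df$newcol <- vfunc(df$likes, df$dislikes)  # one result per row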

In R, dataframe[-NULL] returns an empty dataframe

I'm creating some routines in R to ease model creation and to distinguish several groups based on several parameters (e.g. original watches vs. fake ones, using common watch attributes).
During the process, I keep track of the potentially excluded lines in a vector (empty at first), and I get rid of them at the end using:
model$var <- raw_data[-line_excluded,]
The problem is that if line_excluded is c() (i.e. no line excluded), model$var is an empty data frame, whereas in that case I want all the lines of the data frame.
The only solution I have thought of is the use of
if (!is.null(line_excluded)) {
  model$var <- raw_data[-line_excluded, ]
}
But that's not really pretty, and I have several tracking variables like line_excluded that need this treatment.
Thanks for the help
You can do it another way using setdiff(), which can deal with an empty line_excluded, i.e.,
model$var <- raw_data[setdiff(seq(nrow(raw_data)), line_excluded), ]
You can also try:
model$var <- raw_data[!(1:nrow(raw_data) %in% line_excluded),]
This is similar to what @ThomasIsCoding suggested: you look for the row numbers that are not in your line_excluded.
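A small sketch of the behaviour and of both fixes (raw_data here is a made-up stand-in):
raw_data <- data.frame(id = 1:5, x = letters[1:5])
line_excluded <- c()                        # nothing excluded yet

raw_data[-line_excluded, ]                  # 0 rows: -c() is integer(0)
raw_data[setdiff(seq(nrow(raw_data)), line_excluded), ]    # all 5 rows
raw_data[!(seq_len(nrow(raw_data)) %in% line_excluded), ]  # all 5 rows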

Subsetting within a function

I'm trying to subset a dataframe within a function using a mixture of fixed variables and some variables which are created within the function (I only know the variable names, but cannot vectorise them beforehand). Here is a simplified example:
a <- c(1, 2, 3, 4)
b <- c(2, 2, 3, 5)
c <- c(1, 1, 2, 2)
D <- data.frame(a, b, c)
subbing <- function(Data, GroupVar, condition) {
  g <- Data$c + 3
  h <- Data$c + 1
  NewD <- data.frame(a, b, g, h)
  subset(NewD, select = c(a, b, GroupVar), GroupVar %in% condition)
}
Keep in mind that in my application I cannot compute g and h outside of the function. Sometimes I'll want to make a selection according to the values of h (as above) and other times I'll want to use g. There's also the possibility I may want to use both, but even just being able to subset using one of them would be great.
subbing(D,GroupVar=h,condition=5)
This returns an error saying that the object h cannot be found. I've tried to amend subset using as.formula and all sorts of things but I've failed every single time.
Besides the convenience of the function, there is a further reason why I'd like to use subset.
In the function I'm actually working on I use subset twice. The first time it's the simple subset function. It's just been pointed out below that another blog explored why it's probably best to use the good old data[colnames()=="g",]. Thanks for the suggestion, I'll have a go.
There is, however, another issue. I also use subset (or rather a variation) in my function because I'm dealing with several complex survey designs (see the survey package), so subset.survey.design allows you to get the right variance estimation for subgroups. If I selected my group using [ ] I would get the wrong standard errors for my parameters, so I guess this is quite an important issue.
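For context, a rough sketch of that survey workflow (the design object, weight, and variable names below are invented for illustration; subset() on a design object keeps the design information, while plain [ ] indexing of the underlying data would not):
library(survey)
# a design object carries the weights needed for correct variance estimation
dsgn <- svydesign(ids = ~1, weights = ~wt, data = raw_survey)
sub  <- subset(dsgn, group == "original")   # dispatches to subset.survey.design
svymean(~price, sub)                        # standard errors reflect the design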
Thank you
The error happens right when the function tries to evaluate GroupVar at the beginning: R is looking for an object h on its own (not within the data frame).
The best thing to do is refer to the column names in quotes in the subset function. But of course, then you'd have to sidestep the condition part:
subbing <- function(Data, GroupVar, condition) {
  ....
  DF <- subset(Data, select = c("a", "b", GroupVar))
  DF <- DF[DF[, 3] %in% condition, ]
}
That will do the trick, although it can be annoying to have one data frame indexing inside another.
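Putting that together, here is a sketch of the whole function using quoted column names (the grouping variable is passed as a string, e.g. "g" or "h"; this is one way to do it, not the only one):
subbing <- function(Data, GroupVar, condition) {
  # build the derived columns inside the function, as in the question
  NewD <- data.frame(a = Data$a, b = Data$b,
                     g = Data$c + 3, h = Data$c + 1)
  DF <- NewD[, c("a", "b", GroupVar)]   # select columns by name
  DF[DF[[GroupVar]] %in% condition, ]   # filter on the chosen column
}

subbing(D, GroupVar = "g", condition = 5)   # rows of D where g equals 5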

How to use a value that is specified in a function call as a "variable"

I am wondering if it is possible in R to use a value that is declared in a function call as a "variable" part of the function itself, similar to the functionality that is available in SAS IML.
Given something like this:
put.together <- function(suffix, numbers) {
  new.suffix <<- as.data.frame(numbers)
  return(new.suffix)
}
x <- c(seq(1000, 1012, 1))
put.together(part.a, x)
new.part.a ##### does not exist!!
new.suffix ##### does exist
As it is written, the function returns a dataframe called new.suffix, as it should because that is what I'm asking it to do.
I would like to get a dataframe returned that is called new.part.a.
EDIT: Additional information was requested regarding the purpose of the analysis
The purpose of the question is to produce dataframes that will be sent to another function for analysis.
There exists a data bank where elements are organized into groups by number, and other people organize the groups
into a meaningful set.
Each group has an id number. I use the information supplied by others to put the groups together as they are specified.
For example, I would be given a set of id numbers like: part-1 = 102263, 102338, 202236, 302342, 902273, 102337, 402233.
So, part-1 has seven groups, each group having several elements.
I use the id numbers in a merge so that only the groups of interest are extracted from the large data bank.
The following is what I have for one set:
### all.possible.elements.bank <- .csv file from large database ###
id.part.1 <- as.data.frame(c(102263, 102338, 202236, 302342, 902273, 102337, 402233))
bank.names <- c("bank.id")
colnames(id.part.1) <- bank.names
part.sort <- matrix(seq(1,nrow(id.part.1),1))
sort.part.1 <- cbind(id.part.1, part.sort)
final.part.1 <- as.data.frame(merge(sort.part.1, all.possible.elements.bank,
by="bank.id", all.x=TRUE))
The process above is repeated many, many times.
I know that I could do this for all of the collections that I would pull together, but I thought I would be able to wrap the selection process into a function. The only things that would change would be the part numbers (part-1, part-2, etc.) and the groups that are selected out.
It is possible using the assign function (and possibly deparse and substitute), but it is strongly discouraged to do things like this. Why can't you just return the data frame and call the function like:
new.part.a <- put.together(x)
That is generally the better approach.
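A sketch of what that could look like for the workflow in the question (the list and the names below are illustrative, not prescribed):
put.together <- function(numbers) {
  as.data.frame(numbers)
}

# collect the parts in a named list instead of creating new.part.a, new.part.b, ...
parts <- list()
parts[["part.a"]] <- put.together(seq(1000, 1012, 1))
parts[["part.b"]] <- put.together(seq(2000, 2012, 1))

parts[["part.a"]]   # retrieve a part by name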
If you really want to change things in the global environment then you may want a macro; see the defmacro function in the gtools package and, most importantly, read the document in the references section of the help page.
This is rarely something you should want to do... assigning to things outside of the function environment can get you into all sorts of trouble.
However, you can do it using assign:
put.together <- function(suffix, numbers) {
  assign(paste('new',
               deparse(substitute(suffix)),
               sep = '.'),
         as.data.frame(numbers),
         envir = parent.env(environment()))
}
put.together(part.a, 1:20)
But like Greg said, it's usually not necessary, and always dangerous if used incorrectly.

Unable to filter a data frame?

I am using something like this to filter my data frame:
d1 = data.frame(data[data$ColA == "ColACat1" & data$ColB == "ColBCat2", ])
When I print d1, it works as expected. However, when I type d1$ColB, it still prints everything from the original data frame.
> print(d1)
ColA ColB
-----------------
ColACat1 ColBCat2
ColACat1 ColBCat2
> print(d1$ColA)
[1] ColACat1 ColACat1
Levels: ColACat1 ColACat2
Maybe this is expected, but when I pass d1 to ggplot, it messes up my graph and does not use the filter. Is there any way I can filter the data frame and get only the records that match the filter? I want d1 to not know about the existence of data.
As you allude to, the default behavior in R is to treat character columns in data frames as a special data type, called a factor. This is a feature, not a bug, but like any useful feature if you're not expecting it and don't know how to properly use it, it can be quite confusing.
Factors are meant to represent categorical (rather than numerical, or quantitative) variables, which come up often in statistics.
The subsetting operations you used do in fact work normally. Namely, they will return the correct subset of your data frame. However, the levels attribute of that variable remains unchanged, and still has all the original levels in it.
This means that any method written in R that is designed to take advantage of factors will treat that column as a categorical variable with a bunch of levels, many of which just aren't present. In statistics, one often wants to track the presence of 'missing' levels of categorical variables.
I actually also prefer to work with stringsAsFactors = FALSE, but many people frown on that since it can reduce code portability. (TRUE was the default prior to R 4.0.0, so sharing code with someone running an older version may be risky unless you preface every single script with a call to options; since R 4.0.0 the default is FALSE.)
A potentially more convenient solution, particularly for data frames, is to combine the subset and droplevels functions:
subsetDrop <- function(...) {
  droplevels(subset(...))
}
and use this function to extract subsets of your data frames in a way that is assured to remove any unused levels in the result.
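For instance, with the column names from the question (a sketch, assuming data is the original data frame):
d1 <- subsetDrop(data, ColA == "ColACat1" & ColB == "ColBCat2")
levels(d1$ColA)   # only the levels that actually occur in the subset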
This was such a pain! ggplot messes up if you don't do this right. Using this option at the beginning of my script solved it:
options(stringsAsFactors = FALSE)
Looks like it is the intended behavior but unfortunately I had turned this feature on for some other purpose and it started causing trouble for all my other scripts.

Resources