Unable to filter a data frame? - r

I am using something like this to filter my data frame:
d1 = data.frame(data[data$ColA == "ColACat1" & data$ColB == "ColBCat2", ])
When I print d1, it works as expected. However, when I type d1$ColB, it still prints everything from the original data frame.
> print(d1)
ColA ColB
-----------------
ColACat1 ColBCat2
ColACat1 ColBCat2
> print(d1$ColA)
Levels: ColACat1 ColACat2
Maybe this is expected but when I pass d1 to ggplot, it messes up my graph and does not use the filter. Is there anyway I can filter the data frame and get only the records that match the filter? I want d1 to not know the existence of data.

As you allude to, the default behavior in R is to treat character columns in data frames as a special data type, called a factor. This is a feature, not a bug, but like any useful feature if you're not expecting it and don't know how to properly use it, it can be quite confusing.
factors are meant to represent categorical (rather than numerical, or quantitative) variables, which comes up often in statistics.
The subsetting operations you used do in fact work normally. Namely, they will return the correct subset of your data frame. However, the levels attribute of that variable remains unchanged, and still has all the original levels in it.
This means that any method written in R that is designed to take advantage of factors will treat that column as a categorical variable with a bunch of levels, many of which just aren't present. In statistics, one often wants to track the presence of 'missing' levels of categorical variables.
I actually also prefer to work with stringsAsFactors = FALSE, but many people frown on that since it can reduce code portability. (TRUE is the default, so sharing your code with someone else may be risky unless you preface every single script with a call to options).
A potentially more convenient solution, particularly for data frames, is to combine the subset and droplevels functions:
subsetDrop <- function(...){
droplevels(subset(...))
}
and use this function to extract subsets of your data frames in a way that is assured to remove any unused levels in the result.

This was such a pain! ggplot messes up if you don't do this right. Using this option at the beginning of my script solved it:
options(stringsAsFactors = FALSE)
Looks like it is the intended behavior but unfortunately I had turned this feature on for some other purpose and it started causing trouble for all my other scripts.

Related

R empty data frame after subsetting by factor

I need to subset my data depending on the content of one factor variable.
I tried to do it with subset:
new <- subset(data, original$Group1=="SALAD")
data is already a subset from a bigger data frame, in original I have the factor variable which should identify the wanted rows.
This works perfectly for one level of the factor variable, but (and I really don´t understand why!!) when I do it with the other factor level "BREAD" it creates the data frame but says "no data available" - so it is empty. I´ve imported the data from SPSS, if this matters. I´ve already checked the factor levels, but the naming should be right!
Would be really grateful for help, I spent 3 hours on this problem and wasn´t able to find a solution.
I´ve also tried other ways to subset my data (e.g. split), but I want a data frame as output.
Do you have advice in general, what is the best way to subset a data frame if I want e.g. 3 columns of this data frame and these should be extracted depending on the level of a factor (most Code examples are only for one or all columns..)
The entire point of the subset function (as I understand it) is to look inside the data frame for the right variable - so you can type
subset(data, var1 == "value")
instead of
data[data$var1 == "value,]
Please correct me anyone if that is incorrect.
Now, in you're case, you are explicitly taking Group1 from the data frame original and using that to subset data - which you say is a subset of original. Based on this, I see no reason to believe (and every reason not to believe) that the elements of original$Group1 will align with the rows of data. If Group1 is defined within data, why not just use the copy defined there - which is aligned correctly? If not, you need to be very explicit about what you are trying to accomplish, so that you can ensure that things are aligned correctly.

How to pass a name to a function like dplyr::distinct()

I have a list of five data frames full of user responses to a survey.
In each of these data frames, the second column is the user id number. Some of the users took the survey multiple times, and I am trying to weed out the duplicate responses and just keep the first record.
The naming conventions are fairly standard, so the column in the first data frame is called akin to survey1_id and the second is survey2_id, etc. with the exception being that the column in the third data frame is called survey3a_id.
So basically what I tried to do was this:
for (i in seq(1,5)) {
newdata <- distinct(survey_list[[i]], grep(names("^survey.*_id$", survey_list[[i]]), value = TRUE))
}
But this doesn't work.
I originally thought it was just because the grep output had quotes around it, but I tried to strip them with noquote() and that didn't work. I then realized that distinct() doesn't actually evaluate the second argument, it just takes it literally, so I tried to force it to evaluate using eval(), but that didn't work. (Not sure I really expected it to.)
So now I'm kind of stuck. I don't know if the best solution is just to write five individual lines of code or, for a more generalizable solution, to sort and compare item-by-item in a loop? Was just hoping for a cleaner solution. I'm kind of new to this stuff.

Why does R mix up numerical with categorial variables?

I am confused. I input a .csv file in R and want to fit a linear multivariate regression model.
However, R declares all my obvious numeric variables to be factors and my categorial variables to be integers. Therefore, I cannot fit the model.
Does anyone know how to resolve this?
I know this is probably so basic. But I really need to know this. Elsewhere, I found only posts concerning how to declare factors. But this does not apply here.
Any suggestions very much appreciated!
The easiest way, imo, to handle this is to just tell R what type of data your columns contain when you read them into the workspace. For example, if you have a csv file where the first column should be characters, columns 2-21 should be numeric, and column 22 should be a factor, here's how I would read that csv file into the workspace:
Data <- read.csv("MyData.csv", colClasses=c("character", rep("numeric", 20), "factor"))
Sometimes (with certain versions of R, as Andrew points out) float entries in a CSV are long enough that it thinks they are strings and not floats. In this case, you can do the following
data <- read.csv("filename.csv")
data$some.column <- as.numeric(as.character(data$some.column))
Or you could pass stringsAsFactors=F to the read.csv call, and just apply as.numeric in the next line. That might be a bad idea though if you have a lot of data.
It's a little harder to say what's going on with the categorical variables. You might want to try just treating those as strings and see how that works. Sometimes R will treat factor vectors as being of numeric type, so this is a good first sanity check. If that doesn't work, you can also see if the regression functions in question will let you declare how the variables should be treated.
It is hard to tell without a sample of your data file and the commands that you have been using to try and work with the data, but here are some general problems that can lead to what you describe (though there could be other possibilities as well).
The read.csv and read.table (which is called by read.csv) function will try and guess the types of data when they are not told what each column should be (the colClasses argument). If everything looks like a number then it will convert to a number, but if it sees anything in the first lines that does not look like part of a number then it will read it in as character and convert to a factor. Some of the common reasons why what you think should be a number but R sees something non-numeric include: a finger slip results in a letter somewhere in the column; similar looking substitutions, O for 0 or l for 1; a comma where one is not expected, many European files use , where R expects . (but there are options to tell R what you want here) or if you use read.table without setting sep when it really is a comma separated file.
If you have a categorical variable represented by integers, then R will convert it to integers unless you tell it to make a factor. If you use as.numeric on a factor then it will return the integers used to represent the factor internally. How to convert a factor with labels that are numbers to a numeric is a question (and answer) in the FAQ.
If this does not point you in the right direction then give us a sample of your data and what commands you are using.

Strangeness with filtering in R and showing summary of filtered data

I have a data frame loaded using the CSV Library in R, like
mySheet <- read.csv("Table.csv", sep=";")
I now can print a summary on that mySheet object
summary(mySheet)
and it will show me a summary for each column, for example, one column named Diagnose has the unique values RCM, UCM, HCM and it shows the number of occurences of each of these values.
I now filter by a diagnose, like
subSheet <- mySheet[mySheet$Diagnose=='UCM',]
which seems to be working, when I just type subSheet in the console it will print only the rows where the value has been matched with 'UCM'
However, if I do a summary on that subSheet, like
summary(subSheet)
it still 'knows' about the other two possibilities RCM and HCM and prints those having a value of 0. However, I expected that the new created object will NOT know about the possible values of the original mySheet I initially loaded.
Is there any way to get rid of those other possible values after filtering? I also tried subset but this one just seems to be some kind of shortcut to '[' for the interactive mode... I also tried DROP=TRUE as option, but this one didn't change the game.
Totally mind squeezing :D Any help is highly appreciated!
What you are dealing with here are factors from reading the csv file. You can get subSheet to forget the missing factors with
subSheet$Diagnose <- droplevels(subSheet$Diagnose)
or
subSheet$Diagnose <- subSheet$Diagnose[ , drop=TRUE]
just before you do summary(subSheet).
Personally I dislike factors, as they cause me too many problems, and I only convert strings to factors when I really need to. So I would have started with something like
mySheet <- read.csv("Table.csv", sep=";", stringsAsFactors=FALSE)

Remove values from a dataset based on a vector of those values

I have a dataset that looks like this, except it's much longer and with many more values:
dataset <- data.frame(grps = c("a","b","c","a","d","b","c","a","d","b","c","a"), response = c(1,4,2,6,4,7,8,9,4,5,0,3))
In R, I would like to remove all rows containing the values "b" or "c" using a vector of values to remove, i.e.
remove<-c("b","c")
The actual dataset is very long with many hundreds of values to remove, so removing values one-by-one would be very time consuming.
Try:
dataset[!(dataset$grps %in% remove),]
There's also subset:
subset(dataset, !(grps %in% remove))
... which is really just a wrapper around [ that lets you skip writing dataset$ over and over when there are multiple subset criteria. But, as the help page warns:
This is a convenience function intended for use interactively. For
programming it is better to use the standard subsetting functions like
‘[’, and in particular the non-standard evaluation of argument
‘subset’ can have unanticipated consequences.
I've never had any problems, but the majority of my R code is scripting for my own use with relatively static inputs.
2013-04-12
I have now had problems. If you're building a package for CRAN, R CMD check will throw a NOTE if you have use subset in this way in your code - it will wonder if grps is a global variable, even though subset is evaluating it within dataset's environment (not the global one). So if there's any possiblity your code will end up in a package and you feel squeamish about NOTEs, stick with Rcoster's method.

Resources