Strangeness with filtering in R and showing summary of filtered data

I have a data frame loaded from a CSV file in R, like
mySheet <- read.csv("Table.csv", sep=";")
I now can print a summary on that mySheet object
summary(mySheet)
and it will show me a summary for each column. For example, one column named Diagnose has the unique values RCM, UCM and HCM, and the summary shows the number of occurrences of each of these values.
I now filter by a diagnose, like
subSheet <- mySheet[mySheet$Diagnose=='UCM',]
which seems to be working, when I just type subSheet in the console it will print only the rows where the value has been matched with 'UCM'
However, if I do a summary on that subSheet, like
summary(subSheet)
it still 'knows' about the other two possibilities, RCM and HCM, and prints them with a count of 0. I expected that the newly created object would NOT know about the possible values of the original mySheet I initially loaded.
Is there any way to get rid of those other possible values after filtering? I also tried subset but this one just seems to be some kind of shortcut to '[' for the interactive mode... I also tried DROP=TRUE as option, but this one didn't change the game.
Totally mind squeezing :D Any help is highly appreciated!

What you are dealing with here are factors, created when the csv file was read. You can get subSheet to forget the unused factor levels with
subSheet$Diagnose <- droplevels(subSheet$Diagnose)
or
subSheet$Diagnose <- subSheet$Diagnose[ , drop=TRUE]
just before you do summary(subSheet).
Personally I dislike factors, as they cause me too many problems, and I only convert strings to factors when I really need to. So I would have started with something like
mySheet <- read.csv("Table.csv", sep=";", stringsAsFactors=FALSE)
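The behaviour can be reproduced without the original file. The following is only a minimal sketch: the small data frame stands in for Table.csv, and its values and the Age column are invented for illustration. It also uses the data frame method of droplevels, which cleans every factor column at once.

```r
# Hypothetical stand-in for the data read from Table.csv
mySheet <- data.frame(
  Diagnose = factor(c("RCM", "UCM", "HCM", "UCM")),
  Age      = c(34, 51, 47, 29)
)

subSheet <- mySheet[mySheet$Diagnose == "UCM", ]
levels(subSheet$Diagnose)   # all three original levels are still attached

# Drop the unused levels from every factor column of the data frame
subSheet <- droplevels(subSheet)
levels(subSheet$Diagnose)   # only "UCM" remains
summary(subSheet)
```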

Related

For loop to create multiple empty data frames gives error

I wrote a for loop to create multiple empty data frames, using a vector of names. Although it seemed really easy at first, I got an error message: Error in ID_names[i] <- data.frame() : replacement has length zero
To be more specific, I'll provide you with a reproducible example:
ID_names <- c("Athens","Rome","Barcelona","London","Paris","Madrid")
for(i in 1:length(ID_names)){
ID_names[i] <- data.frame()
}
Do you have any idea why this is wrong? Please don't just provide a solution, but also explain why this for loop is wrong, so that I can avoid this kind of mistake in the future.
You are trying to store a dataframe in one element of a vector (ID_names[i]) which is not possible. You might want to create a list of empty dataframes and assign names to it which can be done using replicate.
ID_names <- c("Athens","Rome","Barcelona","London","Paris","Madrid")
list_data <- setNames(replicate(length(ID_names), data.frame()), ID_names)
However, initialising empty dataframes like this is very rarely useful. It tends to create more confusion down the road. Depending on your actual use case, there might be better ways to handle this.
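If the loop form is preferred, a sketch along the following lines should work. It assumes a named list is an acceptable container; list_data is the same illustrative name as above. The point is that [[ ]] on a list can hold any object, whereas an element of a character vector can hold only a single string.

```r
ID_names <- c("Athens", "Rome", "Barcelona", "London", "Paris", "Madrid")

# A list, unlike a character vector, can store one data frame per element
list_data <- vector("list", length(ID_names))
names(list_data) <- ID_names

for (name in ID_names) {
  # [[ ]] replaces a single list element with an arbitrary object
  list_data[[name]] <- data.frame()
}

str(list_data, max.level = 1)
```

An individual data frame is then reached with list_data[["Rome"]] or list_data$Rome.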

How to deal with long variable names when using stargazer to make tables in R?

I am trying to display the first 20 rows of a data frame using stargazer. But some of the variable names are so long (such as Prevelance of unnourishment (% of population)) that the table just cannot fit. I understand that renaming the variables with shorter names would work, but that's not the approach I'm looking for. I also thought about changing the LaTeX code that is produced, but it turned out that it cannot be changed. I guess the best way is to do something in the R command. Mine is:
stargazer(as.matrix(data[1:20,]), type='latex')
How should I change it to make the table fit in?
Thanks a lot!
Use abbreviate to shorten the names. You can control their length with the minlength argument; for more information, see ?abbreviate.
With strict = TRUE, the abbreviated names are not guaranteed to be unique, so you may want to apply make.unique to them afterwards.
colnames(data) <- abbreviate( colnames(data), minlength = 3, strict = TRUE )
stargazer(as.matrix(data[1:20,]), type='latex')
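The make.unique step mentioned above is not part of that snippet. A minimal sketch of how the two calls might be combined follows; the column names and the minlength value are made up for illustration.

```r
# Hypothetical column names standing in for the real data
long_names <- c("Prevalence of undernourishment (% of population)",
                "Prevalence of underweight (% of children under 5)")

# strict = TRUE enforces the length limit but may produce duplicates
short_names <- abbreviate(long_names, minlength = 8, strict = TRUE)

# make.unique() appends .1, .2, ... to any duplicated abbreviations
short_names <- make.unique(unname(short_names))
short_names
```

The result could then be assigned with colnames(data) <- short_names before calling stargazer.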

How to subset (without filtering) multiple columns from a data frame in R

I'm sorry this may have been done to death, but all the answers I've found veer all over the map into extreme exotica. I can subset using [[]] (I've learned from stackoverflow that I'm not supposed to use subset() and similar for my scripts, since they're intended for interactive use) for a single column, but I can't figure out how to make the leap to more than one column. These two work, of course:
outcomeA <- outcome[['Hospital.Name']]
outcomeB <- outcome[['TX']]
But I've tried a dozen permutations to get both of those columns, like so:
outcomeC <- outcome[[c('Hospital.Name', 'TX')]] (gives "subscript out of bounds")
outcomeC <- outcome[c('Hospital.Name', 'TX')] (gives "undefined columns selected")
etc, but they all fail. Can someone please put me out of my misery and help me select more than one column?
Thanks - Ed
Did you try this, with a comma and single brackets?
outcomeC <- outcome[,c('Hospital.Name', 'TX')]
Also, you can only select columns that exist in your data. Check the names you are asking for against:
names(outcome)
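An "undefined columns selected" error usually means that at least one requested name is not a column. The check can be sketched as follows. The outcome data frame here is invented, and it deliberately assumes that TX is a value of a State column rather than a column itself; the real data may differ.

```r
# Hypothetical data frame standing in for `outcome`
outcome <- data.frame(Hospital.Name = c("A", "B"),
                      State         = c("TX", "CA"),
                      Rate          = c(1.2, 3.4))

wanted <- c("Hospital.Name", "TX")

# Requested names that are not columns of outcome
setdiff(wanted, names(outcome))

# Keep only the requested names that actually exist;
# drop = FALSE keeps the result a data frame even for a single column
outcomeC <- outcome[, intersect(wanted, names(outcome)), drop = FALSE]
```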

Why does R not auto-complete data frame variables when subsetting

Using the data frame mtcars on RStudio.
Say for example I want to subset mtcars[mtcars$cyl == 4,]
Tabbing after mtcars$ will provide a drop down list of variable names in the data frame.
Tabbing after mtcars[mtcars$ does not return the variable names.
Why does this happen?
It will if you add a space:
mtcars[ mtcars$
Otherwise you're expecting R to look in something called mtcars[mtcars, not mtcars...
I was going to ask the same thing. I disagree with the answer that you are expecting R to look for something called mtcars[mtcars, because you can't even make that without putting it all in quotes anyway, e.g.
test[test <- c(1,3,2) # leaves you stuck with the next line being +
The only way to make such an abomination is:
"test[test" <- c(1,3,2)
And once made you still can't use
test[test[2]
You still need to quote the name, this time with backticks, because double quotes here would just index the string itself
`test[test`[2]
So, as far as I can tell, the failure of tab completion after mtcars[mtcars$ is either a bug or has some sort of reason behind it. If there is a reason, does anyone know what it is?

Unable to filter a data frame?

I am using something like this to filter my data frame:
d1 = data.frame(data[data$ColA == "ColACat1" & data$ColB == "ColBCat2", ])
When I print d1, it works as expected. However, when I type d1$ColB, it still prints everything from the original data frame.
> print(d1)
ColA ColB
-----------------
ColACat1 ColBCat2
ColACat1 ColBCat2
> print(d1$ColA)
Levels: ColACat1 ColACat2
Maybe this is expected, but when I pass d1 to ggplot, it messes up my graph and does not use the filter. Is there any way I can filter the data frame and get only the records that match the filter? I want d1 to not know about the existence of data.
As you allude to, R has traditionally converted character columns in data frames to a special data type called a factor (this was the default before R 4.0.0). This is a feature, not a bug, but like any useful feature, it can be quite confusing if you're not expecting it and don't know how to use it properly.
Factors are meant to represent categorical (rather than numerical, or quantitative) variables, which come up often in statistics.
The subsetting operations you used do in fact work normally. Namely, they will return the correct subset of your data frame. However, the levels attribute of that variable remains unchanged, and still has all the original levels in it.
This means that any method written in R that is designed to take advantage of factors will treat that column as a categorical variable with a bunch of levels, many of which just aren't present. In statistics, one often wants to track the presence of 'missing' levels of categorical variables.
I actually also prefer to work with stringsAsFactors = FALSE, but many people frown on that since it can reduce code portability. (TRUE was the default before R 4.0.0, so sharing your code with someone else may be risky unless you preface every single script with a call to options.)
A potentially more convenient solution, particularly for data frames, is to combine the subset and droplevels functions:
subsetDrop <- function(...){
droplevels(subset(...))
}
and use this function to extract subsets of your data frames in a way that is assured to remove any unused levels in the result.
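As a usage sketch, the wrapper can be called with the same arguments as subset. The data frame below is invented, reusing only the column names and categories from the question.

```r
subsetDrop <- function(...) {
  droplevels(subset(...))
}

# Hypothetical data with the column names from the question
data <- data.frame(
  ColA = factor(c("ColACat1", "ColACat1", "ColACat2")),
  ColB = factor(c("ColBCat2", "ColBCat2", "ColBCat1"))
)

d1 <- subsetDrop(data, ColA == "ColACat1" & ColB == "ColBCat2")
levels(d1$ColA)   # unused levels have been dropped
levels(d1$ColB)
```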
This was such a pain! ggplot messes up if you don't do this right. Using this option at the beginning of my script solved it:
options(stringsAsFactors = FALSE)
Looks like it is the intended behavior but unfortunately I had turned this feature on for some other purpose and it started causing trouble for all my other scripts.
