adding rows to data.frame with mixed data format - r

I have a table containing mixed values, including character string and numerical value. I chose to use a data.frame to store it. However, I have met serious difficulties in adding extra rows to the data frame. Error messages, like invalid factor level, NA generated, keep occurring.
Besides using a data.frame, are there any data structure that can help avoid the issue of invalid factor level, NA generated, while still supporting the mixed data format?

data.frame (or data.table) would most likely me the data structure you are looking for.
To put it plainly: in order to add "a new row", every new element needs to confirm to whatever restrictions pertain to its column. Mostly, that means being of the same class.
If you are adding elements to a factor column (as the error you received indicates) then there is an additional requirement that the new values must also be levels of that factor. (see ?factor for more info)
If one is using factors deliberately, then the above is a good thing. However, if one has a column that was unintentionally coerced to a factor, then the above is a P.I.T.A.
Unfortunately, the default on most functions that generate data.frames is to have stringsAsFactors=TRUE. This, imho, is annoying. But all you have to do is turn that flag off, and you should be set.

Related

Converting chr to numeric and still not able to take mean

I am working with a dataframe from NYC opendata. On the information page it claims that a column, ACRES, is numeric, but when I download it is chr. I've tried the following:
parks$ACRES <- as.numeric(as.character(parks$ACRES))
which turned the column info type into dbl, but I was unable to take the mean, so I tried:
parks$ACRES <- as.integer(as.numeric(parks$ACRES))
I've also tried sapply() and I get an error message with NAs introduced by coercion. I tried convert() to but R didn't recognize it though it is supposed to be part of dplyr.
Either way I get NA as a result for the mean.
I've tried taking the mean a few different ways:
mean(parks[["ACRES"]])
mean(parks$ACRES)
Which also didn't work? Is it the dataframe? I'm wondering since it is from the government there are limits?
I'd appreciate any help.
You have NAs in your data. Either they were there before you converted or some of the data can't be converted to numeric directly (do you have comma separators for the 1000s in your input? Those need to be removed before converting to numeric).
Identifying why you have NAs and fixing if necessary is the first step you'll need to do. If the NAs are valid then what you want to do is to add the na.rm = TRUE parameter to the mean function which ignores NAs while calculating the mean.
Check to see how ACRES is being loaded in (i.e., what data type is it?). If it's being loaded in as a factor, you will have trouble changing a factor to a numerical value. The way to solve this is to use the 'stringsAsFactors = FALSE' argument in your read.csv or whatever function you're using to read in the data.

R empty data frame after subsetting by factor

I need to subset my data depending on the content of one factor variable.
I tried to do it with subset:
new <- subset(data, original$Group1=="SALAD")
data is already a subset from a bigger data frame, in original I have the factor variable which should identify the wanted rows.
This works perfectly for one level of the factor variable, but (and I really don´t understand why!!) when I do it with the other factor level "BREAD" it creates the data frame but says "no data available" - so it is empty. I´ve imported the data from SPSS, if this matters. I´ve already checked the factor levels, but the naming should be right!
Would be really grateful for help, I spent 3 hours on this problem and wasn´t able to find a solution.
I´ve also tried other ways to subset my data (e.g. split), but I want a data frame as output.
Do you have advice in general, what is the best way to subset a data frame if I want e.g. 3 columns of this data frame and these should be extracted depending on the level of a factor (most Code examples are only for one or all columns..)
The entire point of the subset function (as I understand it) is to look inside the data frame for the right variable - so you can type
subset(data, var1 == "value")
instead of
data[data$var1 == "value,]
Please correct me anyone if that is incorrect.
Now, in you're case, you are explicitly taking Group1 from the data frame original and using that to subset data - which you say is a subset of original. Based on this, I see no reason to believe (and every reason not to believe) that the elements of original$Group1 will align with the rows of data. If Group1 is defined within data, why not just use the copy defined there - which is aligned correctly? If not, you need to be very explicit about what you are trying to accomplish, so that you can ensure that things are aligned correctly.

Error in col2rgb(d) : invalid color name in tweenr

I'm getting this error a lot in using tweenr in RStudio on mac but I'm unable to replicate it using dummy dataset. My dataset is a list of data frames with I want to apply tween_states. Works fine on dummy data, but always return Error in col2rgb(d) : invalid color name and recognise my first character column as a 'color' whenever I use real data.
Hard to be sure, but I think you are passing too many columns to the tweenr function.
The data you send to the tweenr function should be trimmed column wise to only contain the columns used as argument names and one additional column of values that will be tweened
Getting the same issue! I fixed it by making sure the first column only has numbers, no strings. For whatever reason the first column is interpreted as colors if it contains strings. I didn't need to trim any columns down as the other poster suggested.

What's the easiest way to ignore one row of data when creating a histogram in R?

I have this csv with 4000+ entries and I am trying to create a histogram of one of the variables. Because of the way the data was collected, there was a possibility that if data was uncollectable for that entry, it was coded as a period (.). I still want to create a histogram and just ignore that specific entry.
What would be the best or easiest way to go about this?
I tried making it so that the histogram would only use the data for every entry except the one with the period by doing
newlist <- data1$var[1:3722]+data1$var[3724:4282]
where 3723 is the entry with the period, but R said that + is not meaningful for factors. I'm not sure if I went about this the right way, my intention was to create a vector or list or table conjoining those two subsets above into one bigger list called newlist.
Your problem is deeper that you realize. When R read in the data and saw the lone . it interpreted that column as a factor (categorical variable).
You need to either convert the factor back to a numeric variable (this is FAQ 7.10) or reread the data forcing it to read that column as numeric, if you are using read.table or one of the functions that calls read.table then you can set the colClasses argument to specify a numeric column.
Once the column of data is a numeric variable then a negative subscript or !is.na will work (or some functions will automatically ignore the missing value).

Unable to filter a data frame?

I am using something like this to filter my data frame:
d1 = data.frame(data[data$ColA == "ColACat1" & data$ColB == "ColBCat2", ])
When I print d1, it works as expected. However, when I type d1$ColB, it still prints everything from the original data frame.
> print(d1)
ColA ColB
-----------------
ColACat1 ColBCat2
ColACat1 ColBCat2
> print(d1$ColA)
Levels: ColACat1 ColACat2
Maybe this is expected but when I pass d1 to ggplot, it messes up my graph and does not use the filter. Is there anyway I can filter the data frame and get only the records that match the filter? I want d1 to not know the existence of data.
As you allude to, the default behavior in R is to treat character columns in data frames as a special data type, called a factor. This is a feature, not a bug, but like any useful feature if you're not expecting it and don't know how to properly use it, it can be quite confusing.
factors are meant to represent categorical (rather than numerical, or quantitative) variables, which comes up often in statistics.
The subsetting operations you used do in fact work normally. Namely, they will return the correct subset of your data frame. However, the levels attribute of that variable remains unchanged, and still has all the original levels in it.
This means that any method written in R that is designed to take advantage of factors will treat that column as a categorical variable with a bunch of levels, many of which just aren't present. In statistics, one often wants to track the presence of 'missing' levels of categorical variables.
I actually also prefer to work with stringsAsFactors = FALSE, but many people frown on that since it can reduce code portability. (TRUE is the default, so sharing your code with someone else may be risky unless you preface every single script with a call to options).
A potentially more convenient solution, particularly for data frames, is to combine the subset and droplevels functions:
subsetDrop <- function(...){
droplevels(subset(...))
}
and use this function to extract subsets of your data frames in a way that is assured to remove any unused levels in the result.
This was such a pain! ggplot messes up if you don't do this right. Using this option at the beginning of my script solved it:
options(stringsAsFactors = FALSE)
Looks like it is the intended behavior but unfortunately I had turned this feature on for some other purpose and it started causing trouble for all my other scripts.

Resources