Sorry for a kind of newb-ish question; I've been using R for years, but I hadn't noticed this behavior until a student pointed it out to me and I can't explain it. First, build a little data frame. x-values greater than 100 are supposed to be illegal, but some have snuck in here. We also have a "group" independent variable:
x = c(20, 30, 50, 60, 150, 35, 55, 75, 45, 145)
g = c(1,1,1,1,1,2,2,2,2,2)
df = data.frame(cbind(x,g))
Now, box plots, both grouped and ungrouped, which show all the data, including the illegal values, as they should:
boxplot(x~g)
boxplot(x)
So, we want to remove the illegal values by selecting only those rows in the frame with x-values less than 100. The grouped version works exactly as expected:
boxplot(x~g, data=df[x < 100,])
But the ungrouped one doesn't! All the data, including the values over 100, are plotted. Why does the previous call work when this one doesn't?
boxplot(x, data=df[x < 100,])
I'm sure I'm missing something simple, but for the life of me I can't figure out what it is, and I couldn't find the answer via Google or searching here.
boxplot is an S3 generic, which means that depending on what the first argument is, totally different functions are actually being called. boxplot.formula has different arguments than boxplot.default. Specifically, boxplot.default has no data argument at all; it's probably being sucked into ... and is then ignored as an unknown graphical parameter.
Try boxplot(x[x < 100]) instead.
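If you want to see the dispatch for yourself, methods() lists the available methods (a quick aside, not part of the original answer; exact output depends on your R version):
methods(boxplot)
# [1] boxplot.default boxplot.formula* boxplot.matrix
Only the formula method's usage in ?boxplot includes a data argument.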
The reason is that boxplot is reading x from the global environment, not from the data frame.
Note that this does not work either:
df1 = df[x < 100, ]
boxplot(x, data=df1)
However, this works:
boxplot(df[df$x < 100, 'x'])
So say I have a dataframe with a column for "play" and two columns with values:
df <- data.frame(Play = c("Comedy", "Midsummer", "Hamlet"),
he = c(105, 20, 210),
she = c(100, 23, 212))
I would like to get two vectors, one containing each Play with a higher value for "he" than "she", and one for the opposite, so each Play that has a higher value for "she" than "he".
I've looked at a few ways of going about it, but none really seems to work. I tried building an 'if (x > y) {print z}' function and then apply()ing it over my data frame, but I'm far too inexperienced and ran into too many problems. There ought to be a simpler way than that …
as.character(df$Play)[df$he>df$she]
as.character(df$Play)[df$he<df$she]
Do the above two expressions solve your problem?
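For reference, with the sample data above the two expressions evaluate to (a quick check of the logic):
as.character(df$Play)[df$he > df$she]
# [1] "Comedy"
as.character(df$Play)[df$he < df$she]
# [1] "Midsummer" "Hamlet"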
I'm attempting to create a new variable which combines two existing variables. I've taught myself how to do this with an if else statement with a loop. However, all of the examples online create variables to use in their if else loops by assigning a few values to a variable name. I understand why the online examples do this, but it makes it hard for me to figure out how to incorporate the if else loop into an existing tibble.
The new variable should use values from one existing variable (b) if they are not missing and values from another existing variable (a) if the scores for the first variable (b) are missing. At time 1, participants might have taken test a, test b, or both. I've filtered out those participants who took neither test. Now I'm trying to create a new variable (c) which combines the two tests. If participants took test b (b is not missing), the new variable should reflect test b scores. If participants did not take test b (b is missing), the new variable should reflect test a scores. I can make an example work using the code below, but I can't get a similar format to work with the variables in my actual data.
a <- c(40, 50, 60, 70, 80, 90, 100)
b <- c(10, NA, 30, NA, NA, 40, 50)
c <- vector("double")
for (i in seq_along(b)) {
  if (!is.na(b[i])) {
    c[i] <- b[i]   # test b was taken: use its score
  } else {
    c[i] <- a[i]   # test b missing: fall back to test a
  }
}
c
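A vectorized alternative (a sketch, not from the original thread): ifelse() makes the same element-wise choice without an explicit loop, and the same expression works directly on tibble or data frame columns:
c2 <- ifelse(!is.na(b), b, a)  # use b where present, otherwise fall back to a
c2
# [1] 10 50 30 70 80 40 50
With dplyr loaded, dplyr::coalesce(b, a) expresses the same thing.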
I am at the point with R where I would like to start writing my own functions because I tend to need to do the same things over and over. However, I am struggling to see how I can generalize what I write. Looking at source code has not helped me learn very well because often it seems that .Internal or .Primitive functions (or other commands I do not know) are used extensively. I would like to simply start by turning my normal copy-pasted solutions into functions - fancier things can come later!
As an example: I do a lot of data formatting that requires doing some operation and then filling in a data frame with zeros for all other combinations that did not have any data (e.g., years that had no observations and were therefore never recorded). I need to do this over and over for different data sets that have different sets of variables, but the idea and implementation are always the same.
My non-function way of solving this has been (for a specific implementation and minimal example):
df <- data.frame(County = c(1, 45, 57),
                 Year = c(2002, 2003, 2003),
                 Level = c("Mean", "Mean", "Mean"),
                 Obs = c(1.4, 1.9, 10.2))
#Create expanded version of data frame
Counties <- seq(from = 1, to = 77, by = 2)
Years <- seq(from = 1999, to = 2014, by = 1)
Levels <- c("Max", "Mean")
Expansion <- expand.grid(Counties, Years, Levels)
Expansion[4] <- 0
colnames(Expansion) <- colnames(df)
#Merge and order them so that the observed value is on top
df_full <- merge(Expansion, df, all = TRUE)
df_full$duplicate <- with(df_full,
                          paste(Year, County, Level))
df_full <- df_full[order(df_full$Year,
                         df_full$County,
                         df_full$Level,
                         -abs(df_full$Obs)), ]
#Deduplicate by taking the first that shows up (the observation)
df_full <- df_full[ !duplicated(df_full$duplicate), ]
df_full$duplicate <- NULL
I would like to generalize this so that I could somehow put in a data frame (and probably select the columns I need to order by, since that sometimes changes) and then get the expanded version out. My first implementation consisted of a function with too many arguments (the data frame and then all the column names I wanted to order/expand.grid by), and it also did not work:
gridExpand <- function(df, col1, col2=NULL, col3=NULL, measure){
  #Started with "Expansion" being a global outside of the function
  #It is identical to the first part of the above code
  ex <- merge(Expansion, df, all = TRUE)
  ex$dupe <- with(ex,
                  paste(col1, col2, col3))
  ex <- ex[order(with(ex,
                      col1, col2, col3, -abs(measure)))]
  ex <- ex[ !duplicated(ex$dupe)]
  ex <- subset(ex, select = -(dupe))
}
df_full <- gridExpand(df, Year, County, Level, Obs)
Error in paste(col1, col2, col3) : object 'Year' not found
I am assuming that this did not work because R has no way to know where 'Year' came from. I could potentially try paste(df, "$Year") but it would create "df$Year" which obviously will not work. And I do not ever see anyone else do this in their functions so clearly I am missing how it is that people reference things in data frame relevant functions.
I would ideally like to know of some resources that could help with thinking about generalization, or if someone can point me in the right direction to solving this particular problem I think it might help me see what I am doing wrong. I do not know of a better way to ask for help - I have been trying to read tutorials on writing functions for about 3 months and it is not clicking.
At a glance, the biggest thing that you can do is to not use non-standard-evaluation shortcuts inside your functions: things like $, subset() and with(). These are functions intended for convenient interactive use, not extensible programmatic use. (See, e.g., the Warning in ?subset which should probably be added to ?with, fortunes::fortune(312), fortunes::fortune(343).)
fortunes::fortune(312)
The problem here is that the $ notation is a magical shortcut and like
any other magic if used incorrectly is likely to do the programmatic
equivalent of turning yourself into a toad. -- Greg Snow (in
response to a user that wanted to access a column whose name is stored
in y via x$y rather than x[[y]])
R-help (February 2012)
fortunes::fortune(343)
Sooner or later most R beginners are bitten by this all too convenient shortcut. As an R
newbie, think of R as your bank account: overuse of $-extraction can lead to undesirable
consequences. It's best to acquire the [[ and [ habit early.
-- Peter Ehlers (about the use of $-extraction)
R-help (March 2013)
When you start writing functions that work on data frames, if you need to reference column names you should pass them in as strings, and then use [ or [[ to get the column based on the string stored in the variable. This is the simplest way to make functions flexible with user-specified column names. For example, here's a simple stupid function that tests if a data frame has a column of the given name:
does_col_exist_1 = function(df, col) {
  return(!is.null(df$col))
}
does_col_exist_2 = function(df, col) {
  return(!is.null(df[[col]]))  # equivalent to df[, col]
}
These yield:
does_col_exist_1(mtcars, col = "jhfa")
# [1] FALSE
does_col_exist_1(mtcars, col = "mpg")
# [1] FALSE
does_col_exist_2(mtcars, col = "jhfa")
# [1] FALSE
does_col_exist_2(mtcars, col = "mpg")
# [1] TRUE
The first function is wrong because $ doesn't evaluate what comes after it: no matter what value I set col to when I call the function, df$col will look for a column literally named "col". The brackets, however, will evaluate col and see "oh hey, col is set to "mpg", let's look for a column of that name."
If you want lots more understanding of this issue, I'd recommend the Non-Standard Evaluation Section of Hadley Wickham's Advanced R book.
I'm not going to re-write and debug your functions, but if I wanted to my first step would be to remove all $, with(), and subset(), replacing with [. There's a pretty good chance that's all you need to do.
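To make the recommendation concrete, here is a rough sketch of the string-based approach (assumptions: the grouping columns are passed as a named list giving the full set of values for each, and measure names the observation column as a string; this is an illustration, not a tested drop-in replacement):
gridExpand <- function(df, levels, measure) {
  # 'levels': named list with the complete set of values per grouping column
  # 'measure': the observation column name, as a string
  cols <- names(levels)
  expansion <- expand.grid(levels, stringsAsFactors = FALSE)
  expansion[[measure]] <- 0
  full <- merge(expansion, df, all = TRUE)
  # order so the real observation (largest |measure|) comes first per group
  full <- full[do.call(order, c(full[cols], list(-abs(full[[measure]])))), ]
  # deduplicate: keep only the first row of each grouping-column combination
  full[!duplicated(full[cols]), ]
}
df_full <- gridExpand(df,
                      levels = list(County = seq(1, 77, by = 2),
                                    Year = 1999:2014,
                                    Level = c("Max", "Mean")),
                      measure = "Obs")
Note there is no $, with(), or subset() anywhere: every column lookup goes through [ or [[ with a string.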
I would like to use the data.table package in R to dynamically generate aggregations, but I am running into an error. Below, let my.dt be of type data.table.
sex <- c("M","F","M","F")
age <- c(19, 23, 26, 21)
dependent.variable <- c(1400, 1500, 1250, 1100)
my.dt <- data.table(sex, age, dependent.variable)
grouping.vars <- c("sex", "age")
for (i in 1:2) {
  my.dt[, sum(dependent.variable), by = grouping.vars[i]]
}
If I run this, I get errors:
Error in `[.data.table`(my.dt, , sum(dependent.variable), by = grouping.vars[i]) :
  by must evaluate to list
Yet the following works without error:
my.dt[,sum(dependent.variable), by=sex]
I see why the error is occurring, but I do not see how to use a vector with the by parameter.
[UPDATE] Two years after the question was asked ...
On running the code in the question, data.table is now more helpful and returns this (using 1.8.2):
Error in `[.data.table`(my.dt, , sum(dependent.variable), by = grouping.vars[i]) :
'by' appears to evaluate to column names but isn't c() or key(). Use by=list(...)
if you can. Otherwise, by=eval(grouping.vars[i]) should work. This is for efficiency
so data.table can detect which columns are needed.
and following the advice in the second sentence of the error:
my.dt[,sum(dependent.variable), by=eval(grouping.vars[i])]
sex V1
1: M 2650
2: F 2600
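The same advice works inside the question's loop; results inside a for loop are not auto-printed, so wrap the call in print():
for (i in 1:2) {
  print(my.dt[, sum(dependent.variable), by = eval(grouping.vars[i])])
}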
Old answer from Jul 2010 (by can now be double and character, though):
Strictly speaking, the by needs to evaluate to a list of vectors, each with storage mode integer, though. So the numeric vector age could also be coerced to integer using as.integer(). This is because data.table uses radix sorting (very fast), but the radix algorithm is specifically for integers only (see Wikipedia's entry for 'radix sort'). Integer storage for key columns and ad hoc by is one of the reasons data.table is fast. A factor is of course an integer lookup to unique strings.
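With the question's data, that coercion would have looked like this (for illustration only; per the update above it is no longer required):
my.dt[, sum(dependent.variable), by = list(sex, age = as.integer(age))]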
The idea behind by being a list() of expressions is that you are not restricted to column names. It is usual to write expressions of column names directly in the by. A common one is to aggregate by month; for example:
DT[,sum(col1), by=list(region,month(datecol))]
or a very fast way to group by year-month is to use a non-epoch-based date, such as yyyymmddL, as seen in some of the examples in the package, like this:
DT[,sum(col1), by=list(region,month=datecol%/%100L)]
Notice how you can name the columns inside the list() like that.
To define and reuse complex grouping expressions:
e = quote(list(region,month(datecol)))
DT[,sum(col1),by=eval(e)]
DT[,sum(col2*col3/col4),by=eval(e)]
Or, if you don't want to re-evaluate the by expressions each time, you can save the result once and reuse it for efficiency, e.g., if the by expressions themselves take a long time to calculate/allocate, or you need to reuse them many times:
byval = DT[,list(region,month(datecol))]
DT[,sum(col1),by=byval]
DT[,sum(col2*col3/col4),by=byval]
Please see http://datatable.r-forge.r-project.org/ for the latest info and status. A new presentation will be up there soon, and we hope to release v1.5 to CRAN soon too, containing several bug fixes and new features detailed in the NEWS file. The datatable-help list has about 30-40 posts a month, which may be of interest too.
I made two changes to your original code:
sex <- c("M","F","M","F")
age <- c(19, 23, 26, 21)
age<-as.factor(age)
dependent.variable <- c(1400, 1500, 1250, 1100)
my.dt <- data.table(sex, age, dependent.variable)
for (a in 1:2) {
  print(my.dt[, sum(dependent.variable), by = list(sex, age)[a]])
}
The numeric vector age should be coerced to a factor. As for the by parameter, do not quote column names; group them into list(...). At least this is what the author has suggested.
As you would expect from a DSL aimed at data analysis, R handles missing/incomplete data very well, for instance:
Many R functions have an na.rm flag that, when set to TRUE, removes the NAs:
> v = mean( c(5, NA, 6, 12, NA, 87, 9, NA, 43, 67), na.rm=T)
> v
[1] 32.71429
But if you want to deal with NAs before the function call, you need to do something like this:
to remove each 'NA' from a vector:
vx = vx[!is.na(vx)]
to remove each 'NA' from a vector and replace it w/ a '0':
vx = ifelse(is.na(vx), 0, vx)
to remove each entire row that contains an 'NA' from a data frame:
dfx = dfx[complete.cases(dfx),]
All of these functions permanently remove 'NA' or rows with an 'NA' in them.
Sometimes this isn't quite what you want, though. Making an 'NA'-excised copy of the data frame might be necessary for the next step in the workflow, but in subsequent steps you often want those rows back (e.g., to calculate a column-wise statistic for a column whose rows were removed by a prior call to complete.cases(), even though that column itself has no 'NA' values in it).
To be as clear as possible about what I'm looking for: Python/NumPy has a masked array class, with a mask that lets you conceal, but not remove, NAs during a function call. Is there an analogous facility in R?
Exactly what to do with missing data -- which may be flagged as NA if we know it is missing -- may well differ from domain to domain.
To take an example related to time series, where you may want to skip, or fill, or interpolate, or interpolate differently: the (very useful and popular) zoo package has all these functions related to NA handling:
zoo::na.approx zoo::na.locf
zoo::na.spline zoo::na.trim
allowing you to approximate (using different algorithms), carry observations forward or backward, interpolate using splines, or trim.
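For instance (a small sketch, assuming the zoo package is installed):
library(zoo)
z <- zoo(c(1, NA, NA, 4, 5), order.by = 1:5)
na.approx(z)  # linear interpolation: 1 2 3 4 5
na.locf(z)    # last observation carried forward: 1 1 1 4 5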
Another example would be the numerous missing imputation packages on CRAN -- often providing domain-specific solutions. [ So if you call R a DSL, what is this? "Sub-domain specific solutions for domain specific languages" or SDSSFDSL? Quite a mouthful :) ]
But for your specific question: no, I am not aware of a bit-level flag in base R that allows you to mark observations as 'to be excluded'. I presume most R users would resort to functions like na.omit() et al or use the na.rm=TRUE option you mentioned.
It's good practice to look at the data and infer the type of missing values: is it MCAR (missing completely at random), MAR (missing at random), or MNAR (missing not at random)? Based on these three types, you can study the underlying structure of the missing values and conclude whether imputation is applicable at all (you're lucky if it's not MNAR, because in that case missing values are considered non-ignorable and are related to some unknown underlying influence, factor, process, variable... whatever).
Chapter 3 of "Interactive and Dynamic Graphics for Data Analysis: With R and GGobi" by Di Cook and Deborah Swayne is a great reference on this topic.
You'll see the norm package in action in that chapter, but the Hmisc package also has data imputation routines. See also Amelia, cat (for imputation of categorical missings), mi, mitools, VIM, and vmv (for missing data visualisation).
Honestly, I still don't quite understand whether your question is about statistics or about R's missing-data imputation capabilities. I reckon I've provided good references on the second one. As for the first: you can replace your NAs with a measure of central tendency (mean, median, or similar), which reduces the variability; or with a random constant "pulled out" of the observed (recorded) cases; or you can run a regression with the variable that contains NAs as the criterion and the other variables as predictors, then assign the predicted values to the NAs. It's an elegant way to deal with NAs, but quite often it will not go easy on your CPU (I have a Celeron at 1.1GHz, so I have to be gentle).
This is an optimization problem... there's no definitive answer; you have to decide which method to use and why. But it's always good practice to look at the data! =)
Be sure to check Cook & Swayne - it's an excellent, skilfully written guide. "Linear Models with R" by Faraway also contains a chapter about missing values.
So there.
Good luck! =)
The function na.exclude() sounds like what you want, although it's only an option for some (important) functions.
In the context of fitting and working with models, R has a family of generic functions for dealing with NAs: na.fail(), na.pass(), na.omit(), and na.exclude(). These are, in turn, arguments for some of R's key modeling functions, such as lm(), glm(), and nls(), as well as functions in the MASS, rpart, and survival packages.
All four generic functions basically act as filters. na.fail() will only pass the data through if there are no NAs, otherwise it fails. na.pass() passes all cases through. na.omit() and na.exclude() will both leave out cases with NAs and pass the other cases through. But na.exclude() has a different attribute that tells functions processing the resulting object to take into account the NAs. You could see this attribute if you did attributes(na.exclude(some_data_frame)). Here's a demonstration of how na.exclude() alters the behavior of predict() in the context of a linear model.
fakedata <- data.frame(x = c(1, 2, 3, 4), y = c(0, 10, NA, 40))
## We can tell the modeling function how to handle the NAs
r_omitted <- lm(x~y, na.action="na.omit", data=fakedata)
r_excluded <- lm(x~y, na.action="na.exclude", data=fakedata)
predict(r_omitted)
# 1 2 4
# 1.115385 1.846154 4.038462
predict(r_excluded)
# 1 2 3 4
# 1.115385 1.846154 NA 4.038462
Your default na.action, by the way, is determined by options("na.action"); it starts out as "na.omit", but you can set it.
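For example (a quick check of the session default; the option stores the function's name):
getOption("na.action")
# [1] "na.omit"
options(na.action = "na.exclude")  # change the default for this session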