Arguments for subset within a function in R: colon vs. greater than or equal to

Suppose I have the following data.
x<- c(1,2, 3,4,5,1,3,8,2)
y<- c(4,2, 5,6,7,6,7,8,9)
data<-cbind(x,y)
x y
1 1 4
2 2 2
3 3 5
4 4 6
5 5 7
6 1 6
7 3 7
8 8 8
9 2 9
Now, if I subset this data to select only the observations with "x" between 1 and 3 I can do:
s1<- subset(data, x>=1 & x<=3)
and obtain my desired output:
x y
1 1 4
2 2 2
3 3 5
4 1 6
5 3 7
6 2 9
However, if I subset using the colon operator I obtain a different result:
s2<- subset(data, x==1:3)
x y
1 1 4
2 2 2
3 3 5
This time it only includes the first observation in which "x" was 1,2, or 3. Why?
I would like to use the ":" operator because I am writing a function in which the user inputs a range of "x" values and gets back the average of the "y" variable over that range. I would prefer that the user be able to pass this argument with the ":" operator to the subset call inside my function, but I don't know why subsetting with ":" gives me different results.
I'd appreciate any suggestions in this regard.

You can use %in% instead of ==
subset(data, x %in% 1:3)
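In general, if we are comparing two vectors of unequal length, %in% should be used. == compares element by element and recycles the shorter vector, so x == 1:3 compares x against 1, 2, 3, 1, 2, 3, ... position by position, which is why only the first three rows matched. Recycling happens silently when the length of the longer vector is a multiple of the length of the shorter one, and gives a warning otherwise, so relying on it can easily fail.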
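To see the recycling concretely, here is a minimal sketch using the data from the question. The helper name mean_y_in_range is made up for illustration, and storing the data as a data frame rather than the cbind matrix is my assumption, not something the question requires.

```r
x <- c(1, 2, 3, 4, 5, 1, 3, 8, 2)
y <- c(4, 2, 5, 6, 7, 6, 7, 8, 9)
data <- data.frame(x, y)

# == works position by position; 1:3 is recycled to the length of x
recycled <- rep(1:3, length.out = length(x))
identical(x == 1:3, x == recycled)  # TRUE

# %in% asks "is each x one of these values?", regardless of position
x %in% 1:3

# Hypothetical helper: average y over the rows whose x is in `values`
mean_y_in_range <- function(data, values) {
  mean(data$y[data$x %in% values])
}

mean_y_in_range(data, 1:3)  # 5.5
```

The user can still pass the range with ":" because 1:3 is just the vector 1, 2, 3 by the time it reaches %in%.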

Related

Assign elements of a vector in different vector groups [duplicate]

I have a vector X that contains positive numbers that I want to bin/discretize. For this vector, I want the numbers [0, 10) to show up just as they exist in the vector, but numbers [10,∞) to be 10+.
I'm using:
x <- c(0,1,3,4,2,4,2,5,43,432,34,2,34,2,342,3,4,2)
binned.x <- as.factor(ifelse(x > 10,"10+",x))
but this feels klugey to me. Does anyone know a better solution or a different approach?
How about cut:
binned.x <- cut(x, breaks = c(-1:9, Inf), labels = c(as.character(0:9), '10+'))
Which yields:
# [1] 0 1 3 4 2 4 2 5 10+ 10+ 10+ 2 10+ 2 10+ 3 4 2
# Levels: 0 1 2 3 4 5 6 7 8 9 10+
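The cut call above works because x holds whole numbers; a value such as 9.5 would land in the last interval, (9, Inf], and be labelled "10+". If non-integer values are possible, one variation (a sketch, not the answerer's code) is to use left-closed intervals with right = FALSE so that the bins match [0, 10) and [10, ∞) exactly:

```r
x <- c(0, 1, 3, 4, 2, 4, 2, 5, 43, 432, 34, 2, 34, 2, 342, 3, 4, 2)

# Breaks 0, 1, ..., 10, Inf give the intervals [0,1), [1,2), ..., [9,10), [10,Inf)
binned.x <- cut(x,
                breaks = c(0:10, Inf),
                right  = FALSE,
                labels = c(0:9, "10+"))

table(binned.x)
```

With right = FALSE, a value of 9.5 is labelled "9" and exactly 10 is labelled "10+".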
Your question is inconsistent.
In the description 10 belongs to the "10+" group, but in the code 10 is a separate level.
If 10 should be in the "10+" group then your code should be
as.factor(ifelse(x >= 10,"10+",x))
In this case you could truncate data to 10 (if you don't want a factor):
pmin(x, 10)
# [1] 0 1 3 4 2 4 2 5 10 10 10 2 10 2 10 3 4 2
x[x>=10]<-"10+"
This will give you a vector of strings, because assigning "10+" coerces the whole vector to character. You can use as.numeric(x) to convert back to numbers ("10+" becomes NA), or as.factor(x) to get your result above.
Note that this will modify the original vector itself, so you may want to copy to another vector and work on that.
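A small sketch of the copy-first approach mentioned above, using a shortened made-up vector. The name binned is just for illustration.

```r
x <- c(0, 5, 10, 43)

# Work on a copy so the numeric original is left untouched
binned <- x
binned[binned >= 10] <- "10+"   # the whole copy becomes character

binned
is.numeric(x)                   # the original is still numeric

# Converting back turns "10+" into NA (and warns), so suppress the warning here
back <- suppressWarnings(as.numeric(binned))
```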

Using Strings to Identify Sequence of Column Names in R

I am currently trying to use pre-defined strings to identify multiple column names in R.
To be more explicit, I am using the ave function to create identification variables for subgroups of a dataframe. The twist is that I want the identification variables to be flexible, in such a manner that I would just pass it as a generic string.
A sample code would be:
ids = with(df,ave(rep(1,nrow(df)),subcolumn1,subcolumn2,subcolumn3,FUN=seq_along))
I would like to run this code in the following fashion (code below does not work as expected):
subColumnsString = c("subcolumn1","subcolumn2","subcolumn3")
ids = with(df,ave(rep(1,nrow(df)),subColumnsString ,FUN=seq_along))
I tried something with eval, but still did not work:
subColumnsString = c("subcolumn1","subcolumn2","subcolumn3")
ids = with(df,ave(rep(1,nrow(df)),eval(parse(text=subColumnsString)),FUN=seq_along))
Any ideas?
Thanks.
EDIT: Working code example of what I want:
df = mtcars
id_names = c("vs","am")
idDF_correct = transform(df,idItem = as.numeric(interaction(vs,am)))
idDF_wrong = cbind(df,ave(rep(1,nrow(df)),df[id_names],FUN=seq_along))
Note how in idDF_correct, the unique combinations are correctly mapped into unique values of idItem. In idDF_wrong this is not the case.
I think this achieves what you requested. Here I use the mtcars dataset that ships with R:
subColumnsString <- c("cyl","gear")
ids = with(mtcars, ave(rep(1,nrow(mtcars)), mtcars[subColumnsString], FUN=seq_along))
Just index your data.frame using the sub columns, which returns a list that naturally works with ave.
EDIT
ids = ave(rep(1,nrow(mtcars)), mtcars[subColumnsString], FUN=seq_along)
You can omit the with and just call plain ol' ave, as G. Grothendieck stated. You should also use their answer, as it is much more general.
This defines a function whose arguments are:
data, the input data frame
by, a character vector of column names in data
fun, a function to use in ave
Code--
Ave <- function(data, by, fun = seq_along) {
  do.call(function(...) ave(rep(1, nrow(data)), ..., FUN = fun), data[by])
}
# test
Ave(CO2, c("Plant", "Treatment"), seq_along)
giving:
[1] 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3
[39] 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6
[77] 7 1 2 3 4 5 6 7
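Note that the ave calls above number the rows within each group. If what is wanted is the group identifier from the EDIT in the question (one value per unique combination), a possible sketch is to hand the selected columns straight to interaction, which accepts a list of grouping variables. This is my reading of the EDIT, not something the answers state.

```r
df <- mtcars
id_names <- c("vs", "am")   # grouping columns supplied as strings

# One id per unique combination of the columns named in id_names
idItem <- as.numeric(interaction(df[id_names]))

# Same result as naming the columns explicitly, as in idDF_correct
identical(idItem, as.numeric(interaction(df$vs, df$am)))

# Row counter within each group, as in the answers above
withinGroup <- ave(seq_len(nrow(df)), df[id_names], FUN = seq_along)

idDF <- cbind(df, idItem = idItem, withinGroup = withinGroup)
```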

How to remove outliers from multiple columns of a data frame

I would like to get a data frame that contains only the rows whose values are within 2 SD of the mean in each numeric column.
I know how to do it for a single column but how can I do it for a bunch of columns at once?
Here is the toy data frame:
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c",header = TRUE)
Here is the line of code that keeps only the data within 2 SD for a single column (birds). How can I do it for all numeric columns at once?
df[!(abs(df$birds - mean(df$birds)) / sd(df$birds) > 2), ]
target birds wolfs Country
2 3 8 4 b
3 1 2 8 c
4 1 2 3 a
5 1 8 3 a
6 6 1 2 a
7 6 7 1 b
8 6 1 5 c
We can use lapply to loop over the dataset columns and subset the numeric vectors (by using an if/else condition) based on the mean and sd. The result is a list, since the filtered columns can end up with different lengths.
lapply(df, function(x) if(is.numeric(x)) x[!(abs((x-mean(x))/sd(x))>2)] else x)
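To make the "list, not data frame" point concrete, here is a short sketch that runs the same per-column filter on the toy data and checks how many values survive in each column. The name res is made up for illustration.

```r
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c", header = TRUE)

# Filter each numeric column on its own; non-numeric columns pass through
res <- lapply(df, function(x) {
  if (is.numeric(x)) x[abs((x - mean(x)) / sd(x)) <= 2] else x
})

# birds loses one value (the 21), so the columns no longer line up as rows
lengths(res)
```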
EDIT:
I was under the impression that we need to remove the outliers for each column separately. But if we need to keep only the rows that have no outliers in any of the numeric columns, we can loop through the columns with lapply as before and, instead of returning the values of 'x', return the positions (seq_along(x)) that pass the test. Reduce with intersect then gives the positions common to all list elements, and that numeric index can be used for subsetting the rows.
lst <- lapply(df, function(x) if(is.numeric(x))
    seq_along(x)[!(abs((x-mean(x))/sd(x))>2)] else seq_along(x))
df[Reduce(intersect,lst),]
I'm guessing that you are trying to filter your data set by checking that all of the numeric columns are within 2 SD (?)
In that case I would suggest creating two filters: the first indicates which columns are numeric, and the second checks that all of them are within 2 SD. For the second condition, we can use the built-in scale function, which centers each column on its mean and divides by its SD.
indx <- sapply(df, is.numeric)
indx2 <- rowSums(abs(scale(df[indx])) <= 2) == sum(indx)
df[indx2,]
# target birds wolfs Country
# 2 3 8 4 b
# 3 1 2 8 c
# 4 1 2 3 a
# 5 1 8 3 a
# 6 6 1 2 a
# 7 6 7 1 b
# 8 6 1 5 c
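If this needs to be reused, the two filters can be wrapped in a small function. This is only a sketch; the function name keep_within_sd and the argument k are made up, and it assumes the numeric columns contain no missing values.

```r
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c", header = TRUE)

# Hypothetical helper: keep rows where every numeric column is within k SD
keep_within_sd <- function(data, k = 2) {
  num <- vapply(data, is.numeric, logical(1))  # which columns are numeric
  z <- abs(scale(data[num]))                   # absolute z-scores per column
  data[rowSums(z > k) == 0, , drop = FALSE]    # rows with no outlier anywhere
}

res <- keep_within_sd(df)
rownames(res)
```

Passing k = 3 would apply the same rule with a 3 SD cutoff.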

recursive replacement in R

I am trying to clean some data and would like to replace zeros with the value from the previous date. I was hoping the following code would work, but it doesn't:
temp = c(1,2,4,5,0,0,6,7)
temp[which(temp==0)]=temp[which(temp==0)-1]
returns
1 2 4 5 5 0 6 7
instead of
1 2 4 5 5 5 6 7
Which I was hoping for.
Is there a nice way of doing this without looping?
The operation is called "Last Observation Carried Forward" and is usually used to fill data gaps. It's a common operation for time series and is thus implemented in package zoo:
temp = c(1,2,4,5,0,0,6,7)
temp[temp==0] <- NA
library(zoo)
na.locf(temp)
#[1] 1 2 4 5 5 5 6 7
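You could use essentially the same logic, except that you'll want to apply it to the values vector that results from using rle, so that each run of zeros is replaced in one step (this assumes the vector does not start with a zero):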
temp = c(1,2,4,5,0,0,6,0)
o <- rle(temp)
o$values[o$values == 0] <- o$values[which(o$values == 0) - 1]
inverse.rle(o)
#[1] 1 2 4 5 5 5 6 6
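The original attempt fails because the right-hand side is evaluated in full before anything is assigned, so the second zero is filled with the value before it, which is still a zero at that point. For completeness, here is a base R sketch that avoids both the loop and the extra package by carrying forward the position of the last non-zero value with cummax. The function name locf_zero is made up, and leading zeros are left as they are since there is nothing to carry forward.

```r
# Hypothetical helper: replace zeros with the last preceding non-zero value
locf_zero <- function(v) {
  # Position of the most recent non-zero element (0 if none seen yet)
  idx <- cummax(seq_along(v) * (v != 0))
  out <- v
  ok <- idx > 0                 # skip leading zeros
  out[ok] <- v[idx[ok]]
  out
}

locf_zero(c(1, 2, 4, 5, 0, 0, 6, 7))
# [1] 1 2 4 5 5 5 6 7
```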

