R data frame select by global variable

I'm not sure how to do this without getting an error. Here is a simplified example of my problem.
Say I have this data frame DF
a b c d
1 2 3 4
2 3 4 5
3 4 5 6
Then I have a variable
x <- min(c(1,2,3))
Now I want to do the following
y <- DF[a == x]
But when I try to refer to a variable like "x", I get the "undefined columns selected" error, apparently because R is looking for a column "x" in my data frame.
How can I do what I am trying to do in R?

You may benefit from reading An Introduction to R, especially on matrices, data.frames and indexing. Your a is a column of a data.frame, so it has to be referenced through DF (as DF$a or DF[,"a"]); your x is a scalar. The comparison you have there does not work.
Maybe you meant
R> DF$a == min(c(1,2,3))
[1] TRUE FALSE FALSE
R> DF[,"a"] == min(c(1,2,3))
[1] TRUE FALSE FALSE
R>
which tells you that the first row fits but not the other two. Wrapping this in which() gives you indices instead.
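A minimal sketch of how this comparison can be used to select rows with the global variable x. The data.frame() call is an assumption, since the question only shows the printed table:

```r
# Rebuild the example data frame from the question (assumed construction)
DF <- data.frame(a = 1:3, b = 2:4, c = 3:5, d = 4:6)
x <- min(c(1, 2, 3))

# Logical vector with one TRUE/FALSE per row
keep <- DF$a == x

# Integer positions of the TRUE entries
idx <- which(keep)

# Either form goes in the row position; note the trailing comma
y <- DF[keep, ]
y_idx <- DF[idx, ]
```

Both y and y_idx hold the first row of DF. The original DF[a == x] fails for two reasons: a is not visible outside DF, and without the comma the index is taken as a column selection.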

I think this is what you're looking for:
> x <- min(DF$a)
> DF[DF$a == x,]
a b c d
1 1 2 3 4
An easier way (avoiding the 'x' variable) would be this:
> DF[which.min(DF$a),]
a b c d
1 1 2 3 4
or this:
> subset(DF, a==min(a))
a b c d
1 1 2 3 4
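One difference between these alternatives may matter when the minimum is not unique: which.min() returns only the first position of the minimum, while the == comparison and subset() keep every tied row. A small sketch, using a hypothetical data frame DF2 (not from the question) that contains a tie:

```r
# Hypothetical data with the minimum of column a occurring twice
DF2 <- data.frame(a = c(1, 1, 3), b = c(2, 9, 4))

# which.min() stops at the first minimum, so one row comes back
first_only <- DF2[which.min(DF2$a), ]

# Comparing against min() keeps all rows that share the minimum
all_ties <- DF2[DF2$a == min(DF2$a), ]
all_ties_subset <- subset(DF2, a == min(a))
```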

Related

Duplicating R dataframe vector values using another vector as a guide

I have the following R dataframe: df = data.frame(value=c(5,4,3,2,1), a=c(2,0,1,6,9), b=c(7,0,0,3,4)). I would like to duplicate the values of a and b by the number of times of the corresponding position values in value. For example, Expanding b would look like b_ex = c(7,7,7,7,7,2,2,2,4). No values of three or four would be in b_ex because values of zero are in b[2] and b[3]. The expanded vectors would be assigned names and be stand-alone.
Thanks!
Maybe you are looking for:
result <- lapply(df[-1], function(x) rep(x[x != 0], df$value[x != 0]))
#$a
#[1] 2 2 2 2 2 1 1 1 6 6 9
#$b
#[1] 7 7 7 7 7 3 3 4
To have them as separate vectors in the global environment, use list2env:
list2env(result, .GlobalEnv)
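Note that list2env() creates objects named after the list elements, here a and b, and would overwrite existing objects with those names. If names like b_ex from the question are wanted, one possible sketch is to rename the list first. The separate environment env is an assumption made only so the example does not touch the global workspace; passing .GlobalEnv instead would create a_ex and b_ex there.

```r
df <- data.frame(value = c(5, 4, 3, 2, 1),
                 a = c(2, 0, 1, 6, 9),
                 b = c(7, 0, 0, 3, 4))

# Repeat each non-zero entry by the matching entry of df$value
result <- lapply(df[-1], function(x) rep(x[x != 0], df$value[x != 0]))

# Rename the elements to a_ex and b_ex before exporting them
names(result) <- paste0(names(result), "_ex")

# Export into a scratch environment (assumed here instead of .GlobalEnv)
env <- new.env()
list2env(result, envir = env)
env$b_ex
```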

using which command in R to copy subset of an array

I want to use which in R to copy a segment of array. However, it seems like which skips the repetitive elements. Here is an example:
a <- c(1,2,3,4,1,2,2,3)
b <- c(1,2)
a <- a[which(a==b)]
a
[1] 1 2 1 2
I want to have an output like:
a
[1] 1 2 1 2 2
Any ideas?
I think you want %in%. a %in% b returns a logical vector that is TRUE wherever a value of a is also in b. Indexing a with that vector gives those values of a that are also in b.
> a <- c(1,2,3,4,1,2,2,3)
> b <- c(1,2)
> a[a %in% b]
[1] 1 2 1 2 2
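The reason which(a == b) appears to skip elements is that == compares element by element and recycles the shorter vector b along a; it does not test membership. A short sketch that makes the recycling visible (rep_len() is used here only to show what R does implicitly):

```r
a <- c(1, 2, 3, 4, 1, 2, 2, 3)
b <- c(1, 2)

# What == effectively compares a against: b repeated to the length of a
b_recycled <- rep_len(b, length(a))

# Position-by-position comparison versus membership test
elementwise <- a == b
membership <- a %in% b

a[elementwise]  # values that match b at the same recycled position
a[membership]   # values that occur anywhere in b
```

The seventh element of a is 2, but the recycled b holds 1 at that position, so the elementwise comparison drops it.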

Subsetting data frame by factor level

I have a big data frame with state names in one column and different indexes in the other columns.
I want to subset by state and create an object suitable for minimization of the index or a data frame with the calculation already given.
Here's one simple (short) example of what I have
m
x y
1 A 1.0
2 A 2.0
3 A 1.5
4 B 3.0
5 B 3.5
6 C 7.0
I want to get this
m
x y
1 A 1.0
2 B 3.0
3 C 7.0
I don't know if a function with a for loop is necessary. Like
minimize<-function(x,...)
for (i in m$x){
do something with data by factor value
apply to that something the min function in every column
return(y)
}
so when you call
minimize(A)
[1] 1
I tried to use %in% but it didn't work (I got this error).
A%in%m
Error in match(x, table, nomatch = 0L) : object 'A' not found
When I define it it goes like this.
A<-c("A")
"A"%in%m
[1] FALSE
Thank you in advance
Use aggregate
> aggregate(.~x, FUN=min, m)
x y
1 A 1
2 B 3
3 C 7
See this post to get some other alternatives.
Try aggregate:
aggregate(y ~ x, m, min)
x y
1 A 1
2 B 3
3 C 7
Using data.table
require(data.table)
m <- data.table(m)
m[, j=min(y), by=x]
# x V1
# 1: A 1
# 2: B 3
# 3: C 7
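For completeness, a base R sketch without the formula interface. It rebuilds m from the question (the data.frame() call is an assumption) and uses tapply() for a named vector of minima, which is close to the minimize(A) lookup the question describes, and ave() to keep the full rows that attain each group minimum:

```r
# Rebuild the example data from the question (assumed construction)
m <- data.frame(x = c("A", "A", "A", "B", "B", "C"),
                y = c(1.0, 2.0, 1.5, 3.0, 3.5, 7.0))

# Named minima, one per level of x; mins[["A"]] looks up a single state
mins <- tapply(m$y, m$x, min)

# ave() returns the group minimum aligned with every row,
# so the comparison keeps the rows where y equals its group minimum
min_rows <- m[m$y == ave(m$y, m$x, FUN = min), ]
```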

Select rows with identical columns from a data frame

I have a data frame with several columns.
I want to select the rows with no NAs (as with complete.cases)
and all columns identical.
E.g., for
> f <- data.frame(a=c(1,NA,NA,4),b=c(1,NA,3,40),c=c(1,NA,5,40))
> f
a b c
1 1 1 1
2 NA NA NA
3 NA 3 5
4 4 40 40
I want the vector TRUE,FALSE,FALSE,FALSE selecting just the first row because there all 3 columns are the same and none is NA.
I can do
Reduce("==",f[complete.cases(f),])
but that creates an intermediate data frame which I would love to avoid (to save memory).
Try this:
R > index <- apply(f, 1, function(x) all(x==x[1]))
R > index
[1] TRUE NA NA FALSE
R > index[is.na(index)] <- FALSE
R > index
[1] TRUE FALSE FALSE FALSE
The best (IMO) solution is from David Winsemius:
which( rowSums(f==f[[1]]) == length(f) )
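Note that which() returns row positions rather than the logical vector asked for in the question. A sketch of a variant that yields the logical vector directly, using na.rm = TRUE so that comparisons involving NA count as non-matching (this is an adaptation, not part of the quoted answer):

```r
f <- data.frame(a = c(1, NA, NA, 4), b = c(1, NA, 3, 40), c = c(1, NA, 5, 40))

# f == f[[1]] compares every column with the first one and gives a logical matrix;
# a row qualifies only if all length(f) comparisons are TRUE
same <- rowSums(f == f[[1]], na.rm = TRUE) == length(f)
same
```

A row with an NA anywhere can never reach the full count, so no separate complete.cases() step is needed in this sketch.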

R - find all unique values among subsets of a data frame

I have a data frame with two columns. The first column defines subsets of the data. I want to find all values in the second column that only appear in one subset in the first column.
For example, from:
df=data.frame(
data_subsets=rep(LETTERS[1:2],each=5),
data_values=c(1,2,3,4,5,2,3,4,6,7))
data_subsets data_values
A 1
A 2
A 3
A 4
A 5
B 2
B 3
B 4
B 6
B 7
I would want to extract the following data frame.
data_subsets data_values
A 1
A 5
B 6
B 7
I have been playing around with duplicated but I just can't seem to make it work. Any help is appreciated. There are a number of topics tackling similar problems, I hope I didn't overlook the answer in my searches!
EDIT
I modified the approach from @Matthew Lundberg of counting the number of elements and extracting from the data frame. For some reason his approach was not working with the data frame I had, so I came up with this, which is less elegant but gets the job done:
counts=rowSums(do.call("rbind",tapply(df$data_subsets,df$data_values,FUN=table)))
extract=names(counts)[counts==1]
df[match(extract,df$data_values),]
First, find the count of each element in df$data_values:
x <- sapply(df$data_values, function(x) sum(as.numeric(df$data_values == x)))
> x
[1] 1 2 2 2 1 2 2 2 1 1
Now extract the rows:
> df[x==1,]
data_subsets data_values
1 A 1
5 A 5
9 B 6
10 B 7
Note that you missed "A 5" above. There is no "B 5".
You had the right idea with duplicated. The trick is to combine fromLast = TRUE and fromLast = FALSE options to get a full list of non-duplicated rows.
!duplicated(df$data_values,fromLast = FALSE)&!duplicated(df$data_values,fromLast = TRUE)
[1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
Indexing your data.frame with this vector gives:
df[!duplicated(df$data_values,fromLast = FALSE)&!duplicated(df$data_values,fromLast = TRUE),]
data_subsets data_values
1 A 1
5 A 5
9 B 6
10 B 7
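The duplicated() approach treats a value that occurs twice within the same subset as a duplicate too. If such values should still count as belonging to one subset, a possible sketch is to count the distinct subsets per value instead (the names n_subsets and keep are made up for this illustration):

```r
df <- data.frame(
  data_subsets = rep(LETTERS[1:2], each = 5),
  data_values = c(1, 2, 3, 4, 5, 2, 3, 4, 6, 7))

# Number of distinct subsets in which each value occurs
n_subsets <- tapply(df$data_subsets, df$data_values,
                    function(s) length(unique(s)))

# Look up that count for every row; keep values tied to a single subset
keep <- as.vector(n_subsets[as.character(df$data_values)] == 1)
df[keep, ]
```

On the example data this selects rows 1, 5, 9 and 10, the same rows as the duplicated() version.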
A variant of P Lapointe's answer would be
df[! df$data_values %in% unique(df)[duplicated( unique(df)$data_values ), ]$data_values,]
The unique() deals with the possibility (not in your test data) that some rows in the data may be identical and you still want to keep them, as long as the same data_values does not appear for distinct data_subsets (or distinct other columns).
You can use the 'dplyr' and 'explore' libraries to overcome this problem.
library(dplyr)
library(explore)
df=data.frame(
data_subsets=rep(LETTERS[1:2],each=5),
data_values=c(1,2,3,4,5,2,3,4,6,7))
df %>% describe(data_subsets)
######## output ########
#variable = data_subsets
#type = character
#na = 0 of 10 (0%)
#unique = 2
# A = 5 (50%)
# B = 5 (50%)
