Define factors whose levels depend on another variable - r

Consider this mock data:
set.seed(20120220)
x <- c(rep("a", 4), rep("b", 4))
y <- c(sample(c(1, 2), 8, replace = TRUE))
z <- data.frame(cbind(x, y))
Data frame z will look like this:
x y
1 a 1
2 a 1
3 a 1
4 a 2
5 b 2
6 b 1
7 b 2
8 b 2
I want to run something akin to factor(z$y, levels = 1:2, labels = c("alpha", "beta")), but I don't want every 1 to become alpha and every 2 to become beta. I want that to happen only for x = a. If x = b, I want 1 to become gamma and 2 to become delta.
In other words, I want my data frame to look like this:
x y
1 a alpha
2 a alpha
3 a alpha
4 a beta
5 b delta
6 b gamma
7 b delta
8 b delta
This is what I came up with so far:
for (i in 1:nrow(z)) {
if (z$x[i] == "a")
z$y[i] <- factor(z$y[i], levels = 1:2, labels = c("alpha", "beta"))
else
z$y[i] <- factor(z$y[i], levels = 1:2, labels = c("gamma", "delta"))
}
But it gives me several warning messages (one for each i) like this:
Warning messages:
1: In `[<-.factor`(`*tmp*`, i, value = c(NA, 1L, 1L, 2L, 2L, 1L, 2L, :
invalid factor level, NAs generated
And then, when I call z again, the data frame is a mess, every y has been made into <NA>.
I bet there's a simple solution for this, but I've been trying several approaches for hours to no avail. My head is about to explode! Help!

> z$ynew <- ifelse(z$x == "a", ifelse(z$y == 1, "alpha", "beta"),
ifelse(z$y == 1, "gamma", "delta"))
> z
x y ynew
1 a 1 alpha
2 a 1 alpha
3 a 1 alpha
4 a 2 beta
5 b 2 delta
6 b 1 gamma
7 b 2 delta
8 b 2 delta
If you want 'ynew' to be a factor, then just: z$ynew <- factor(z$ynew)

What about using merge ?
# define x and y to 'alpha', 'beta' etc. correspondences
# (it's just one row for each possible factor)
auxDf <- data.frame( x = c('a', 'a', 'b', 'b' ),
y = c( 1, 2, 1, 2 ),
newy= c('alpha', 'beta', 'gamma', 'delta'))
# merge the 2 data.frame getting a new data.frame with the factors column
newDf <- merge(z,auxDf)
newDf

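Here's one additional step to make the previous answer a bit quicker:
you can use 'unique' to pull out all the unique combinations in a data frame.
Sort them before attaching the labels, because unique() returns rows in order of first appearance, which need not match the order of the labels.
auxDf <- unique(z)
auxDf <- auxDf[order(auxDf$x, auxDf$y), ]
auxDf$newy <- c('alpha', 'beta', 'gamma', 'delta')
Then, as in the previous post:
newDf <- merge(z, auxDf)
newDf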

I've managed to come up with a solution that works, even though it is quite messy.
First, create subsets of the data frame z for each x
z1 <- subset(z, x == "a")
z2 <- subset(z, x == "b")
Then, apply factor() to each subset:
z1$y <- factor(z1$y, levels = 1:2, labels = c("alpha", "beta"))
z2$y <- factor(z2$y, levels = 1:2, labels = c("gamma", "delta"))
And finally, reunite the subsets into the original object.
z <- rbind(z1, z2)
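For comparison, here is a minimal sketch of a lookup-based variant that avoids both the loop and the split/rbind step. It assumes the x/y pairs printed in the question, typed in by hand rather than sampled. The names `lookup` and `key` are illustrative only and do not come from the answers above.

```r
# Mock data as printed in the question (y typed in by hand instead of sampled)
z <- data.frame(x = rep(c("a", "b"), each = 4),
                y = c(1, 1, 1, 2, 2, 1, 2, 2),
                stringsAsFactors = FALSE)

# Hypothetical lookup table: one label per (x, y) combination
lookup <- c("a.1" = "alpha", "a.2" = "beta",
            "b.1" = "gamma", "b.2" = "delta")

# Build a key for every row and translate it in one vectorised step
key <- paste(z$x, z$y, sep = ".")
z$ynew <- factor(unname(lookup[key]), levels = unname(lookup))
z
```

Because the labels are looked up as character strings before factor() is called, no value is ever assigned into a factor that lacks the matching level, which is what triggered the warnings in the original loop.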

Related

R error in '[<-.data.frame 'C' *temp*',… replacement has # items, need #

I have a function that I've executed many times but is now throwing an error, which I do not understand. I'm trying to apply the function over a list.
I did not write the function and I have limited experience with functions. So, I'm not sure how to troubleshoot the code.
function:
myfun<-function(Year, SampleID, Species, Abundance, resamps) {
library(vegan)
counter<-1
simbaseline<-data.frame(array(NA,dim=c(length(unique(Year)),5)))
names(simbaseline)<-c("Year", "Jaccard","Horn","Bray","Pearson")
simnext<-data.frame(array(NA,dim=c(length(unique(Year)),5)))
names(simnext)<-c("Year", "Jaccard","Horn","Bray","Pearson")
simhind<-data.frame(array(NA,dim=c(length(unique(Year)),5)))
names(simhind)<-c("Year", "Jaccard","Horn","Bray","Pearson")
counter2<-1
# getting vector with number of samples per year
nsamples<-c()
for(y in unique(Year)){
nsamples<-c(nsamples, length(unique(SampleID[Year==y])))
}
t<-1
minsample<-min(nsamples)
for(repeats in 1:resamps){
raref<-data.frame(array(NA,dim=c(1,3)))
names(raref)<-c("Year","Species","Abundance")
for(y in unique(Year)){
#getting samples for this year
samps<-unique(SampleID[Year==y])
# re-sampling samples to equalize number of samples
sam<-as.character(sample(samps,minsample,replace=T))
# getting data that belongs to bootstrapped samples
rarefyear<-data.frame(SampleID[which(SampleID %in% sam & Year == y)],
Species[which(SampleID %in% sam & Year == y)],
Abundance[which(SampleID %in% sam & Year == y)])
names(rarefyear)<-c("SampleID", "Species", "Abundance")
# calculating pooled abundances of each species to store
spabun<-tapply(as.numeric(rarefyear[,3]),rarefyear[,2],sum)
spar<-data.frame(rep(y, length(spabun)),names(spabun),spabun, row.names=NULL)
names(spar)<-c("Year","Species","Abundance")
raref<-rbind(raref,spar)
counter<-counter+1
}
# calculating year by species table of abundance
rareftabtemp<-with(raref,tapply(Abundance,list(Year,Species),function(x)x))
rareftabtemp[is.na(rareftabtemp)]<-0
Pearsoncor<-cor(t(log(rareftabtemp+1)), method="pearson")
# calculating between year similarities (NOT DISTANCE!) with Jaccard, Morisita-Horn, Bray and Pearson correlations
Jacsim<-as.matrix(1-vegdist(rareftabtemp, method="jaccard"))
Hornsim<-as.matrix(1-vegdist(rareftabtemp, method="horn"))
Braysim<-as.matrix(1-vegdist(rareftabtemp, method="bray"))
n<-length(unique(Year))
simbaseline[counter2:(counter2+n-2),]<-
cbind(unique(Year)[2:n],Jacsim[2:n],Hornsim[2:n],Braysim[2:n],Pearsoncor[2:n])
simnext[counter2:(counter2+n-2),]<-
cbind(unique(Year)[2:n],Jacsim[row(Jacsim)-col(Jacsim)==1],
Hornsim[row(Hornsim)-col(Hornsim)==1],
Braysim[row(Braysim)-col(Braysim)==1],
Pearsoncor[row(Pearsoncor)-col(Pearsoncor)==1])
# added hindcasting
simhind[counter2:(counter2+n-2),]<-
cbind(unique(Year)[1:(n-1)],
Jacsim[row(Jacsim) %in% 1:(max(row(Jacsim))-1) &
col(Jacsim)==max(col(Jacsim))],
Hornsim[row(Hornsim)%in%1:(max(row(Hornsim))-1) &
col(Hornsim)==max(col(Hornsim))],
Braysim[row(Braysim)%in%1:(max(row(Braysim))-1) &
col(Braysim)==max(col(Braysim))],
Pearsoncor[row(Pearsoncor)%in%1:(max(row(Pearsoncor))-1) &
col(Pearsoncor)==max(col(Pearsoncor))])
counter2<-counter2+n
}
baselinesim<-data.frame(unique(Year)[2:n],
tapply(simbaseline$Jaccard,simbaseline$Year,mean),
tapply(simbaseline$Horn,simbaseline$Year,mean),
tapply(simbaseline$Bray,simbaseline$Year,mean),
tapply(simbaseline$Pearson,simbaseline$Year,mean))
names(baselinesim)<-c("Year", "Jaccard","Horn","Bray","Pearson")
nextsim<-data.frame(unique(Year)[2:n],
tapply(simnext$Jaccard,simnext$Year,mean),
tapply(simnext$Horn,simnext$Year,mean),
tapply(simnext$Bray,simnext$Year,mean),
tapply(simnext$Pearson,simnext$Year,mean))
names(nextsim)<-c("Year", "Jaccard","Horn","Bray","Pearson")
hindcastsim<-data.frame(unique(Year)[1:(n-1)],
tapply(simhind$Jaccard,simhind$Year,mean),
tapply(simhind$Horn,simhind$Year,mean),
tapply(simhind$Bray,simhind$Year,mean),
tapply(simhind$Pearson,simhind$Year,mean))
names(hindcastsim)<-c("Year", "Jaccard","Horn","Bray","Pearson")
a<-list(baselinesim,nextsim,hindcastsim)
return(a)
}
error:
Error in `[<-.data.frame`(`*tmp*`, counter2:(counter2 + n - 2), , value = c(NA, : replacement has 2 items, need 5
Traceback
6.
stop(sprintf(ngettext(m, "replacement has %d item, need %d",
"replacement has %d items, need %d"), m, n * p), domain = NA)
5.
`[<-.data.frame`(`*tmp*`, counter2:(counter2 + n - 2), , value = structure(c(NA,
2009), .Dim = 2:1))
4.
`[<-`(`*tmp*`, counter2:(counter2 + n - 2), , value = structure(c(NA,
2009), .Dim = 2:1))
3.
myfun(x$Year, x$Bay, x$Species, x$Abundance, 20)
2.
FUN(X[[i]], ...)
1.
lapply(summer.split, function(x) myfun(x$Year, x$Bay,
x$Species, x$Abundance, 20))
Again, the function worked
Someone seems to have asked a similar question before, and it was answered by @Marat Talipov, but I'm not experienced enough to make sense of the solution.
The answer was:
This error pops up when you're unlucky and i <- runif(n) < 1/2 consists only of FALSE, i.e. no permutations happen. You need to add a check in the swap function to fix this problem.
R error in '[<-.data.frame'... replacement has # items, need #
A subset of my data can be found here:
https://fil.email/sI4Kyhaj
The data was split by "Bay" to generate the list
Note that the function may not throw an error on a different machine because it seems to occur periodically.
The problem is located in the piece of code below (and in the similar indexed reassignments of data frames elsewhere in your code):
simbaseline[counter2:(counter2+n-2),]<-
cbind(unique(Year)[2:n],Jacsim[2:n],Hornsim[2:n],Braysim[2:n],Pearsoncor[2:n])
What happened is that the number of rows you are trying to assign on the left-hand side is not equal to the number on the right-hand side. To avoid this, you can use an intermediate data frame filled with, e.g., NAs, and then assign that temporary data frame to your target data frame. Please see the simulated code below, with explanations of how it can be done:
# simulation
df <- data.frame(i = 1:10, l = letters[1:10], stringsAsFactors = FALSE)
head(df)
# i l
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
# 6 6 f
# with error
df[1:5, ] <- cbind(1:3, c("a", "b", "c"))
# Error in `[<-.data.frame`(`*tmp*`, 1:5, , value = c("1", "2", "3", "a", :
# replacement has 6 items, need 10
# without error
dftemp_in <- cbind(1:3, c("a", "b", "c"))
dftemp_out <- df[1:5, ]
dftemp_out[] <- NA
dftemp_out[seq(nrow(dftemp_in)), ] <- dftemp_in
df[1:5, ] <- dftemp_out
df
# i l
# 1 1 a
# 2 2 b
# 3 3 c
# 4 <NA> <NA>
# 5 <NA> <NA>
# 6 6 f
# 7 7 g
# 8 8 h
# 9 9 i
# 10 10 j
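One more observation, offered as a guess from the traceback rather than a confirmed diagnosis: the replacement value shown there is c(NA, 2009), which is what unique(Year)[2:n] returns when a list element contains only a single year, because 2:n counts down to c(2, 1) when n is 1. The sketch below shows that behaviour and one possible guard. The names `years`, `groups` and `has_enough_years` are made up for illustration.

```r
years <- 2009                 # a group with a single year
n <- length(unique(years))    # n is 1

2:n                           # counts down: 2 1
unique(years)[2:n]            # NA 2009, the value shown in the traceback

# seq_len() never counts down, so the index is empty when n is 1
idx <- seq_len(n)[-1]
unique(years)[idx]            # numeric(0)

# Hypothetical guard: only keep groups with at least two distinct years
has_enough_years <- function(d) length(unique(d$Year)) >= 2
groups <- list(ok = data.frame(Year = c(2008, 2009)),
               short = data.frame(Year = 2009))
Filter(has_enough_years, groups)  # keeps only the "ok" element
```

If this is indeed the cause, filtering the split list in the same way before calling lapply() would skip the groups that cannot produce between-year similarities. It would also fit the remark that the error occurs only periodically.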

How to make a fuzzy join in R using more than one variable on each side

I would like to join the two data frames :
a <- data.frame(x=c(1,3,5))
b <- data.frame(start=c(0,4),end=c(2,6),y=c("a","b"))
with a condition like (x>start)&(x<end) in order to get such a result:
# x y
#1 1 a
#2 3 <NA>
#3 5 b
I don't want to make a potentially large cartesian product and then select only the few rows matching the condition and I'd like a solution using the tidyverse (I am not interested in a solution using SQL which would be a confession of failure). I thought of the 'fuzzyjoin' package but I cannot find examples fitting my need : the function to apply for the condition has only two arguments. I also tried to put 'start' and 'end' into a single argument with data.frame(z=I(purrr::map2(b$start,b$end,list)),y=b$y)
# z y
#1 0, 2 a
#2 4, 6 b
but although the data looks fine fuzzy_left_join doesn't accept it.
I search for solutions working in more general cases (n variables on the LHS, m on the RHS, not necessarily numeric with arbitrary conditions).
UPDATE
I also want to be able to express conditions like (x=start+1)|(x=end+1) giving here:
# x y
#1 1 a
#2 3 a
#3 5 b
For this case you don't need multi_by or multi_match_fun; this works:
library(fuzzyjoin)
fuzzy_left_join(a, b, by = c(x = "start", x = "end"), match_fun = list(`>`, `<`))
# x start end y
# 1 1 0 2 a
# 2 3 NA NA <NA>
# 3 5 4 6 b
I eventually went to the code of fuzzy_join and found a way to do what I want even without proper documentation. fuzzy_left_join doesn't work, but there is the following way (not really pretty, and it actually does a cartesian product):
g <- function(x,y) (x>y[,"start"])&(x<y[,"end"])
fuzzy_join(a,b, multi_by = list(x="x",y=c("start","end"))
, multi_match_fun = g, mode = "left") %>% select(x,y)
A data.table approach could be
library(data.table)
name1 <- setdiff(names(setDT(b)), names(setDT(a)))
#perform left outer join and then select required columns
a[b, (name1) := mget(name1), on = .(x > start, x < end)][, .(x, y)]
which gives
x y
1: 1 a
2: 3 <NA>
3: 5 b
Sample data:
a <- data.frame(x = c(1, 3, 5))
b <- data.frame(start = c(0, 4), end = c(2, 6), y = c("a", "b"))
Update: In case you want to join both dataframes on (x=start+1)|(x=end+1) condition then you can try
library(data.table)
DT1 <- as.data.table(a)
DT2 <- as.data.table(b)
#Perform 1st join on "x = start+1" and then another on "x = end+1". Finally row-bind both results.
DT <- rbindlist(list(DT1[DT2[, start_temp := start+1], on = c(x = "start_temp"), .(x, y), nomatch = 0],
DT1[DT2[, end_temp := end+1], on = c(x = "end_temp"), .(x, y), nomatch = 0]))
DT
# x y
#1: 1 a
#2: 5 b
#3: 3 a
A possible answer to explain what I am trying to do : extending dplyr in some way. And I will be happy to know if there are ways to improve this solution or some problems I didn't see.
The solution avoids the cartesian product, but duplicates into lists of data frames both one of the input data frame and the result. I didn't include the final column selection of x and y that is easy to code.
my_left_join <- function(.DATA1,.DATA2,.WHERE)
{
call = as.list(match.call())
df1 <- .DATA1
df1$._row_ <- 1:nrow(df1)
dfl1 <- replyr::replyr_split(df1,"._row_")
eval(substitute(
dfl2 <- mapply(function(.x)
{filter(.DATA2,with(.x,WHERE)) %>%
mutate(._row_=.x$._row_)}
, dfl1, SIMPLIFY=FALSE)
,list(WHERE=call$.WHERE)))
df2 <- replyr::replyr_bind_rows(dfl2)
left_join(df1,df2,by="._row_") %>% select(-._row_)
}
my_left_join(a,b,(x>start)&(x<end))
# x start end y
#1 1 0 2 a
#2 3 NA NA <NA>
#3 5 4 6 b
my_left_join(a,b,(x==(start+1))|(x==(end+1)))
# x start end y
#1 1 0 2 a
#2 3 0 2 a
#3 5 4 6 b
You can try a GenomicRanges solution
library(GenomicRanges)
# setup GRanges objects
a_gr <- GRanges(1, IRanges(a$x,a$x))
b_gr <- GRanges(1, IRanges(b$start, b$end))
# find overlaps between the two data sets
res <- as.data.frame(findOverlaps(a_gr,b_gr))
# create the expected output
a$y <- NA
a$y[res$queryHits] <- as.character(b$y)[res$subjectHits]
a
x y
1 1 a
2 3 <NA>
3 5 b
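As a package-free point of comparison, here is a rough base R sketch of a left join on an arbitrary condition. It evaluates the condition one left-hand row at a time, so the full cartesian product is never materialised, although it will be slow for large inputs. The function `cond_left_join` and its `cond` argument are hypothetical names, not part of any package mentioned above.

```r
a <- data.frame(x = c(1, 3, 5))
b <- data.frame(start = c(0, 4), end = c(2, 6), y = c("a", "b"),
                stringsAsFactors = FALSE)

# cond(l, r) receives one row of `left` and all of `right`,
# and must return a logical vector with one entry per row of `right`
cond_left_join <- function(left, right, cond) {
  hits <- lapply(seq_len(nrow(left)), function(i) {
    h <- which(cond(left[i, , drop = FALSE], right))
    if (length(h) == 0) NA_integer_ else h   # NA keeps unmatched left rows
  })
  li <- rep(seq_len(nrow(left)), lengths(hits))
  ri <- unlist(hits)
  out <- cbind(left[li, , drop = FALSE], right[ri, , drop = FALSE])
  rownames(out) <- NULL
  out
}

res1 <- cond_left_join(a, b, function(l, r) l$x > r$start & l$x < r$end)
res2 <- cond_left_join(a, b, function(l, r) l$x == r$start + 1 | l$x == r$end + 1)
res1[, c("x", "y")]
res2[, c("x", "y")]
```

Because the condition is an ordinary function, it can refer to any number of columns on either side, which is the general case asked about in the question.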

Run a separate function for each item depending on the value of another variable with dplyr

I have a dataset which contains a categorical variable. Depending on the value of this variable, I want to run a different function for each such value. All the possible functions have the same return type. I might wish to run say, sin() if category is 'A', cos() if category is 'B', and tan() if category is 'C'.
The real application for this is in simulating populations, where outcomes depend on the values of categories, but sometimes in very different ways.
Toy example
library(dplyr)
category=c('A','B','C')
N <- 100
pop <- as.data.frame(ID <- seq(1:N))
pop <- as.tbl(pop)
pop$Category <- sample(category,N,replace=TRUE)
pop$score <- runif(N)
pop
tf <- function(x,EXPR) {
switch(EXPR,
A = cos(x),
B = sin(x),
C = tan(x)
)}
pop$results <- tf(pop$Score,pop$Category)
This code fails,reasonably enough, with the error message
Error in switch(EXPR, A = cos(x), B = sin(x), C = tan(x)) : EXPR must be a length 1 vector
I have looked, carefully, at dplyr and do, and I can easily see how to run the same function for each category separately. However, I need a function which depends on the category value.
Suggestions greatly appreciated.
The rowwise function is what you need to force it evaluate row by row...
pop<-data.frame(ID=1:100,
category = sample(c("A", "B", "C"),100,replace=TRUE),
score = runif(100))
exprs<-function(category, score){
if(category=="A")
ret <- sin(score)
if(category=="B")
ret <- cos(score)
if(category=="C")
ret <- tan(score)
ret }
pop %>%
rowwise %>%
mutate(answer = exprs(category, score))
Source: local data frame [100 x 4]
Groups:
# A tibble: 100 × 4
ID category score answer
<int> <fctr> <dbl> <dbl>
1 1 C 0.5219332 0.5751317
2 2 C 0.9266336 1.3314972
3 3 B 0.2729260 0.9629863
4 4 B 0.6575110 0.7915158
5 5 B 0.0910481 0.9958580
6 6 C 0.9968752 1.5467554
7 7 A 0.3429183 0.3362369
8 8 A 0.9101669 0.7896062
9 9 B 0.9291849 0.5984872
10 10 C 0.8913347 1.2379742
# ... with 90 more rows
You can use Vectorize():
set.seed(42)
category=c('A','B','C')
N <- 10
pop <- data.frame(ID=seq(1:N), Category=sample(category,N,replace=TRUE), score=runif(N), stringsAsFactors = FALSE)
tf <- function(x, EXPR) switch(EXPR,
'A' = cos(x),
'B' = sin(x),
'C' = tan(x))
TF <- Vectorize(tf)
pop$result <- TF(pop$score, pop$Category)
or (thx to @42 for the comment)
pop$result <- mapply(tf, pop$score, pop$Category)
The error appears because you are passing the complete vector instead of one record at a time. I used lapply to call your function for each row, and it works:
library(dplyr)
category=c('A','B','C')
N <- 100
pop <- data.frame(ID = seq(1:N))
pop$Category <- sample(category,N,replace=TRUE)
pop$Category <- as.factor(pop$Category)
pop$score <- runif(N)
tf <- function(x,EXPR) {
switch(EXPR,
A = cos(x),
B = sin(x),
C = tan(x)
)}
## call tf for every row in the dataframe
pop$results <-lapply( seq_len(nrow(pop)) , function (i) {
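# as.character() makes switch() match on the label rather than the factor's integer code
tf(pop$score[i], as.character(pop$Category[i]))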
}) %>% unlist
Thanks
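Another option, sketched here under the assumption that each function is itself vectorised (as sin, cos and tan are), is to keep the functions in a named list and apply each one to its whole category at once instead of row by row. The list `funs` is a hypothetical name, and the A/B/C mapping follows the question's prose (sin for 'A', cos for 'B', tan for 'C').

```r
set.seed(42)
pop <- data.frame(ID = 1:10,
                  Category = sample(c("A", "B", "C"), 10, replace = TRUE),
                  score = runif(10),
                  stringsAsFactors = FALSE)

# Hypothetical mapping from category label to function
funs <- list(A = sin, B = cos, C = tan)

pop$result <- NA_real_
for (k in names(funs)) {
  idx <- pop$Category == k                      # rows belonging to category k
  pop$result[idx] <- funs[[k]](pop$score[idx])  # one vectorised call per category
}
pop
```

This makes one function call per category rather than one per row, which may matter when simulating large populations.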

assign values to dataframe subset in R

I'm having trouble assigning a dataframe to a subset of another. In the example below, the line
ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")
only modifies one column instead of two. I would expect it either to modify no columns or both, not only one. I wrote the function to fill in the PrefName and CountyID columns in dataframe ds where they are NA by joining ds to another dataframe cs.
As you can see if you run it, the test is failing because PrefName is not getting filled in. After doing a bit of debugging, I realized that join() is doing exactly what it is expected to do, but the actual assignment of the result of that join somehow drops the PrefName back to a NA.
# fully copy-paste-run-able (but broken) code
suppressMessages({
library("plyr")
library("methods")
library("testthat")
})
# Fill in the missing PrefName/CountyIDs in delstat
# - Find the missing values in Delstat
# - Grab the CityState Primary Record values
# - Match on zipcode to fill in the holes in the delstat data
# - Remove any codes that could not be fixed
# - @param ds: delstat dataframe with 6 columns (see test case)
# - @param cs: citystate dataframe with 6 columns (see test case)
getMissingCounties <- function(ds, cs) {
if (length(is.na(ds$CountyID))) {
cavities <- which(is.na(ds$CountyID))
fillings <- cs[cs$PrimRec==TRUE, c(1,3,4)]
ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")
ds <- ds[!is.na(ds$CountyID),]
}
return(ds)
}
test_getMissingCounties <- function() {
ds <- data.frame(
CityStateKey = c(1, 2, 3, 4 ),
ZipCode = c(11, 22, 33, 44 ),
Business = c(1, 1, 1, 1 ),
Residential = c(1, 1, 1, 1 ),
PrefName = c("One", NA , NA, NA),
CountyID = c(111, NA, NA, NA))
cs <- data.frame(
ZipCode = c(11, 22, 22, 33, 55 ),
Name = c("eh", "eh?", "eh?", "eh!?", "ah." ),
PrefName = c("One", "To", "Two", "Three", "Five"),
CountyID = c(111, 222, 222, 333, 555 ),
PrimRec = c(TRUE, FALSE, TRUE, TRUE, TRUE ),
CityStateKey = c(1, 2, 2, 3, 5 ))
expected <- data.frame(
CityStateKey = c(1, 2, 3 ),
ZipCode = c(11, 22, 33 ),
Business = c(1, 1, 1 ),
Residential = c(1, 1, 1 ),
PrefName = c("One", "Two", "Three"),
CountyID = c(111, 222, 333 ))
expect_equal(getMissingCounties(ds, cs), expected)
}
# run the test
test_getMissingCounties()
The results are:
CityStateKey ZipCode Business Residential PrefName CountyID
1 11 1 1 One 111
2 22 1 1 <NA> 222
3 33 1 1 <NA> 333
Any ideas why PrefName is getting set to NA by the assignment or how to do the assignment so I don't lose data?
The short answer is that you can avoid this problem by making sure that there are no factors in your data frames. You do this by using stringsAsFactors=FALSE in the call(s) to data.frame(...). Note that many of the data import functions, including read.table(...) and read.csv(...) also convert character to factor by default. You can defeat this behavior the same way.
This problem is actually quite subtle, and is also a good example of how R's "silent coercion" between data types creates all sorts of problems.
The data.frame(...) function converts any character vectors to factors by default (in R versions before 4.0.0; since 4.0.0 the default is stringsAsFactors = FALSE). So in your code ds$PrefName is a factor with one level, and cs$PrefName is a factor with 5 levels. So in your assignment statement:
ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")
the 5th column on the LHS is a factor with 1 level, and the 5th column on the RHS is a factor with 5 levels.
Under some circumstances, when you assign a factor with more levels to a factor with fewer levels, the missing levels are set to NA. Consider this:
x <- c("A","B",NA,NA,NA) # character vector
y <- LETTERS[1:5] # character vector
class(x); class(y)
# [1] "character"
# [1] "character"
df <- data.frame(x,y) # x and y coerced to factor
sapply(df,class) # df$x and df$y are factors
# x y
# "factor" "factor"
# assign rows 3:5 of col 2 to col 1
df[3:5,1] <- df[3:5,2] # fails with a warning
# Warning message:
# In `[<-.factor`(`*tmp*`, iseq, value = 3:5) :
# invalid factor level, NA generated
df # missing levels set to NA
# x y
# 1 A A
# 2 B B
# 3 <NA> C
# 4 <NA> D
# 5 <NA> E
The example above is equivalent to your assignment statement. However, notice what happens if you assign all of column 2 to column 1.
# assign all of col 2 to col 1
df <- data.frame(x,y)
df[,1] <- df[,2] # succeeds!!
df
# x y
# 1 A A
# 2 B B
# 3 C C
# 4 D D
# 5 E E
This works.
Finally, a note on debugging: if you are debugging a function, sometimes it is useful to run through the statements line by line at the command line (e.g., in the global environment). If you did that, you would have gotten the warning above, whereas inside a function call the warnings are suppressed.
The constraints of the test can be satisfied by reimplementing getMissingCounties with:
merge(ds[1:4], subset(subset(cs, PrimRec)[c(1, 3, 4)]), by="ZipCode")
Caveat: the ZipCode column is always emitted first, which differs from your expected result.
But to answer the subassignment question: it breaks, because the level sets of PrefName are incompatible between ds and cs. Either avoid using a factor or relevel them. You might have missed R's warning about this, because testthat was somehow suppressing warnings.
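To make the "avoid factors" advice concrete, here is a minimal sketch that fills the holes with match() on character columns instead of assigning a joined data frame into a subset. It uses a trimmed copy of the test data (the Business, Residential and Name columns are left out for brevity). The variable `m` is an illustrative name, and this is one possible rewrite rather than the function from the question.

```r
ds <- data.frame(CityStateKey = c(1, 2, 3, 4),
                 ZipCode      = c(11, 22, 33, 44),
                 PrefName     = c("One", NA, NA, NA),
                 CountyID     = c(111, NA, NA, NA),
                 stringsAsFactors = FALSE)
cs <- data.frame(ZipCode  = c(11, 22, 22, 33, 55),
                 PrefName = c("One", "To", "Two", "Three", "Five"),
                 CountyID = c(111, 222, 222, 333, 555),
                 PrimRec  = c(TRUE, FALSE, TRUE, TRUE, TRUE),
                 stringsAsFactors = FALSE)

fillings <- cs[cs$PrimRec, ]                  # primary records only
cavities <- which(is.na(ds$CountyID))         # rows that need filling
m <- match(ds$ZipCode[cavities], fillings$ZipCode)

# Column-by-column assignment on character/numeric vectors: no level clash
ds$PrefName[cavities] <- fillings$PrefName[m]
ds$CountyID[cavities] <- fillings$CountyID[m]

ds <- ds[!is.na(ds$CountyID), ]               # drop codes that could not be fixed
ds
```

Zip code 44 has no primary record, so its row stays NA after the fill and is removed by the last step.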

selection of observations by combining criteria in R

This topic has probably been brought up before, and I guess the solution is quite simple. However, I haven't managed to work it out so far.
Let's say I have a data.frame (called "data") which contains 10 individuals (id) on which I collected observations at 3 time points (T)
> data <- data.frame(id = rep(c(1:10), 3),
T = gl(3, 10),
X = sample(1:30),
Y = sample(c("yes", "no"), 30, replace = TRUE),
Z = sample(1:40, 30),
Z2 = rnorm(30, mean = 5, sd = 0.5))
> head(data)
id T X Y Z Z2
1 1 1 10 yes 15 5.993605
2 2 1 18 no 22 6.096566
3 3 1 5 no 24 5.101393
4 4 1 15 yes 18 4.944108
5 5 1 23 no 34 4.634176
6 6 1 13 no 27 5.576015
I would like to create a subset of this data.frame (a new data.frame called data2) by selecting only individuals that have "yes" (variable Y) for each of the three time points (variable T), that is, Y="yes" for T=1 and T=2 and T=3.
I know that combining conditions can be achieved by using the "&" sign, and this can be used to relate conditions for the 3 time points. However, my problem is writing the condition for each time point: how do I tell R that I want subjects for which Y="yes" at T="1", for example?
Thank you very much in advance to all.
Have a great day,
Denis
You can do:
keep.ids <- tapply(data$Y, data$id, FUN = function(x)all(x == "yes"))
subset(data, keep.ids[factor(id)])
Or use the plyr package:
library(plyr)
ddply(data, "id", function(x) if(all(x$Y == "yes")) x else NULL)
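A base R variant of the same idea, sketched on a small hand-made data set so the result is easy to check: ave() computes a per-individual flag that is already aligned with the rows, so it can be used directly for subsetting. The data below are made up (4 individuals instead of 10) and `keep` is an illustrative name.

```r
data <- data.frame(id = rep(1:4, 3),
                   T  = gl(3, 4),
                   Y  = c("yes", "yes", "yes", "no",    # T = 1
                          "yes", "no",  "yes", "no",    # T = 2
                          "yes", "yes", "yes", "yes"),  # T = 3
                   stringsAsFactors = FALSE)

# TRUE for every row of an individual whose Y is "yes" at all time points
keep <- as.logical(ave(data$Y == "yes", data$id, FUN = all))

data2 <- data[keep, ]
data2
```

Here only individuals 1 and 3 answered "yes" at all three time points, so data2 contains their six rows.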
