selection of observations by combining criteria in R - r

This topic has probably been brought up and it is a quite simpe solution , i guess. However i couldnt make it up to now.
Lets say i have a data.frame (called "data") which contains 10 individuals (id) on which i collected observations at 3 time points (T)
> data <- data.frame(id = rep(c(1:10), 3),
T = gl(3, 10),
X = sample(1:30),
Y = sample(c("yes", "no"), 30, replace = TRUE),
Z = sample(1:40, 30),
Z2 = rnorm(30, mean = 5, sd = 0.5))
> head(data)
id T X Y Z Z2
1 1 1 10 yes 15 5.993605
2 2 1 18 no 22 6.096566
3 3 1 5 no 24 5.101393
4 4 1 15 yes 18 4.944108
5 5 1 23 no 34 4.634176
6 6 1 13 no 27 5.576015
I would like to create a subset of this data.frame (an new data.frame called data2) by selecting only individuals that have "yes" (variable Y) for each of the three time points (variable T), that means Y="yes" for T=1 and T=2 and T=3.
I know that combining conditions can be achieved by using the "&" sign, and this can be used to relate conditions for the 3 time points. However, my problem is to write each condition for each time point : how to tell R that i want subjects for which Y="yes" at T="1" for example ?
Thank you very much in advance to all.
Have a great day,
Denis

You can do:
keep.ids <- tapply(data$Y, data$id, FUN = function(x)all(x == "yes"))
subset(data, keep.ids[factor(id)])
Or use the plyr package:
library(plyr)
ddply(data, "id", function(x) if(all(x$Y == "yes")) x else NULL)

Related

Index and assign multiple sets of rows at once

I have an imported dataframe Measurements that contains many observations from an experiment.
Measurements <- data.frame(X = 1:4,
Data = c(90, 85, 100, 105))
X Data
1 90
2 85
3 100
4 105
I want to add another column Condition that specifies the treatment group for each datapoint. I know which obervation ranges are from which condition (e.g. observations 1:2 are from the control and observations 3:4 are from the experimental group).
I have devised two solutions already that give the desired output but neither are ideal. First:
Measurements["Condition"] <- c(rep("Cont", 2), rep("Exp", 2))
X Data Condition
1 90 Cont
2 85 Cont
3 100 Exp
4 105 Exp
The benefit of this is it is one line of code/one command. But this is not ideal since I need to do math outside separately (e.g. 3:4 = 2 obs, etc) which can be tricky/unclear/indirect with larger datasets and more conditions (e.g. 47:83 = ? obs, etc) and would be liable to perpetuating errors since a small error in length for an early assignment would also shift the assignment of later groups (e.g. if rep of Cont is mistakenly 1, then Exp gets mistakenly assigned to 2:3 too).
I also thought of assigning like this, which gives the desired output too:
Measurements[1:2, "Condition"] <- "Cont"
Measurements[3:4, "Condition"] <- "Exp"
X Data Condition
1 90 Cont
2 85 Cont
3 100 Exp
4 105 Exp
This makes it more clear/simple/direct which rows will receive which assignment, but this requires separate assignments and repetition. I feel like there should be a way to "vectorize" this assignment, which is the solution I'm looking for.
I'm having trouble finding complex indexing rules from online. Here is my first intuitive guess of how to achieve this:
Measurements[c(1:2, 3:4), "Condition"] <- list("Cont", "Exp")
X Data Condition
1 90 Cont
2 85 Cont
3 100 Cont
4 105 Cont
But this doesn't work. It seems to combine 1:2 and 3:4 into a single equivalent range (1:4) and assigns only the first condition to this range, which suggests I also need to specify the column again. When I try to specify the column again:
Measurements[c(1:2, 3:4), c("Condition", "Condition")] <- list("Cont", "Exp")
X Data Condition Condition.1
1 90 Cont Exp
2 85 Cont Exp
3 100 Cont Exp
4 105 Cont Exp
For some reason this creates a second new column (??), and it again seems to combine 1:2 and 3:4 into essentially 1:4. So I think I need to index the two row ranges in a way that keeps them separate and only specify the column once, but I'm stuck on how to do this. I assume the solution is simple but I can't seem to find an example of what I'm trying to do. Maybe to keep them separate I do have to assign them separately, but I'm hoping there is a way.
Can anyone help? Thank you a ton in advance from an R noobie!
If you already have a list of observations which belong to each condition you could use dplyr::case_when to do a conditional mutate. Depending on how you have this information stored you could use something like the following:
library(dplyr)
Measurements <- data.frame(X = 1:4,
Data = c(90, 85, 100, 105))
# set which observations belong to each condition
Cont <- 1:2
Exp <- 3:4
Measurements %>%
mutate(Condition = case_when(
X %in% Cont ~ "Cont",
X %in% Exp ~ "Exp"
))
# X Data Condition
# 1 90 Cont
# 2 85 Cont
# 3 100 Exp
# 4 105 Exp
Note that this does not require the observations to be in consecutive rows.
I normally see this done with a merge operation. The trick is getting your conditions data into a nice shape.
composeConditions <- function(...) {
conditions <- list(...)
data.frame(
X = unname(unlist(conditions)),
condition = unlist(unname(lapply(
names(conditions),
function(x) rep(x, times = length(conditions[x][[1]]))
)))
)
}
conditions <- composeConditions(Cont = 1:2, Exp = 3:4)
> conditions
X condition
1 1 Cont
2 2 Cont
3 3 Exp
4 4 Exp
merge(Measurements, conditions, by = "X")
X Data condition
1 1 90 Cont
2 2 85 Cont
3 3 100 Exp
4 4 105 Exp
Efficient for larger datasets is to know the data pattern and the data id.
Measurements <- data.frame(X = 1:4, Data = c(90, 85, 100, 105))
dat <- c("Cont","Exp")
pattern <- c(1,1,2,2)
Or draw pattern from data, e.g. conditional from Measurements$Data
pattern <- sapply( Measurements$Data >=100, function(x){ if(x){2}else{1} } )
# [1] 1 1 2 2
Then you can add the data simply by doing:
Measurements$Condition <- dat[pattern]
# X Data Condition
#1 1 90 Cont
#2 2 85 Cont
#3 3 100 Exp
#4 4 105 Exp

Using a custom summary function for factors within multiple columns

I conducted a survey with a large number of items, each of which has distinct categorical response options stored as factors. I need to summarize these columns in an efficient manner, preferably with functionality like that provided by forcats::fct_count(). I also need to know how many non-NA responses were provided for each variable, since different items were shown to different respondents. I wrote a function to make a tidy little summary data frame, but am struggling to efficiently run this function along each column and then combine the results into a single object (ala ddply).
I've tried sapply(), gather()-ing the data to long format and then running ddply(), but the problem of the distinct levels for each variable seems to keep getting in the way. See below for a reproducible example of the data set and my summarizing function. I could run the function for each variable (as shown below), but I know there's gotta be a more efficient way to do this that doesn't involve creating a ton of individual summary data-frame objects. Thanks for any help you can provide.
data <- data.frame(
ID = c(1:50),
X = as.factor(sample(c("yes", "no", NA), 50, replace = TRUE)),
Y = as.factor(sample(c("a", "b", "c", NA), 50, replace = TRUE)),
Z = as.factor(sample(c("d", "e", "f", "g", "h", NA), 50, replace = TRUE))
)
library(tidyverse)
library(forcats)
factorsummaries.f <- function(x) {
x <- na.omit(x)
counts <- fct_count(fct_drop(x), sort = T)
counts$f <- as.character(counts$f)
total <- data.frame(f = "sum", n = as.numeric(sum(counts$n)))
return(bind_rows(counts, total))
}
factorsummaries.f(data$X)
factorsummaries.f(data$Y)
Perhaps you are looking for purrr::map_dfr
map_dfr(data[,2:ncol(data)], factorsummaries.f, .id = "colname")
#output
colname f n
<chr> <chr> <dbl>
1 X no 18
2 X yes 17
3 X sum 35
4 Y a 14
5 Y c 13
6 Y b 12
7 Y sum 39
8 Z g 10
9 Z d 9
10 Z h 8
11 Z f 6
12 Z e 5
13 Z sum 38

Faster Way to Create a Subset within a Loop or Apply Function in R

I'm new to R, so apologies in advance for bad form in my code.
I'm trying to figure out the best way to go through a dataframe, row by row, and modify a value based on logic that references other columns within that row or an entirely different dataframe. The issue is that the logic I'm using necessitates creating and subsetting a dataframe for each row to retrieve a minimum value. My real data set is 47000 rows and 15 columns, so creating 47,000 subsets is taking a long time.
Here are sample datasets to help describe what I'm talking about.
df1 <- data.frame('A' = c(rep("Beer", 2), rep("Chip", 2)), 'B' = c(NA, 3,
NA,9), 'C' = 5:8, 'D' = NA)
df2 <- data.frame('Q' = c(rep("Beer", 2), rep("Chip", 2)), 'R' = 6:9, 'S' =
c(12, 15, 4, 18), 'T' = c(23, 45, 75, 34))
df1:
A B C D
Beer NA 5 NA
Beer 3 6 NA
Chip NA 7 NA
Chip 9 8 NA
df2:
Q R S T
Beer 6 12 23
Beer 7 15 45
Chip 8 4 75
Chip 9 18 34
This loop does what I want, namely checking whether a value is NA in column B or not, if it isn't then use that value in for column D, if it is NA then retrieve the minimum value from a filtered subset of df2. In the real use case I have other filtering conditions.
require(dplyr)
for (i in 1:nrow(df1)) {
if (!(is.na(df1$B[i]))) {
df1$D[i] <- df1$B[i]}
else {x <- filter(df2, df1$A[i] == df2$Q)
x <- min(x$S)
df1$D[i] <- x
}
}
Everyone says to avoid loops in R, so I created this function using apply which also works (although is a little more difficult to follow):
FUNC <- function(x) {
apply(x, 1, function(y) {
if (!(is.na(y[2]))) {
y[4] <- y[2]}
else {z <- filter(df2, y[1] == df2$Q)
z <- min(z$S)
y[4] <- z}
}
)
}
df1$D <- as.numeric(FUNC(df1))
Output:
A B C D
Beer NA 5 12
Beer 3 6 3
Chip NA 7 4
Chip 9 8 9
Aside question: is there a way to reference items in vector y by name instead of by index position?
So is there a better way to do this? Right now both methods take about 5-8 minutes to run through 47,000+ rows which seems long to me.
df1$D <- df2 %>%
rename(A=Q) %>%
group_by(A) %>%
summarise(D=min(S)) %>%
right_join(df1, by="A") %>%
mutate(D=ifelse(is.na(B), D.x, B)) %>%
`[[`("D")

Generate random numbers by group with replacement

** edited because I'm a doofus - with replacement, not without **
I have a large-ish (>500k rows) dataset with 421 groups, defined by two grouping variables. Sample data as follows:
df<-data.frame(group_one=rep((0:9),26), group_two=rep((letters),10))
head(df)
group_one group_two
1 0 a
2 1 b
3 2 c
4 3 d
5 4 e
6 5 f
...and so on.
What I want is some number (k = 12 at the moment, but that number may vary) of stratified samples, by membership in (group_one x group_two). Membership in each group should be indicated by a new column, sample_membership, which has a value of 1 through k (again, 12 at the moment). I should be able to subset by sample_membership and get up to 12 distinct samples, each of which is representative when considering group_one and group_two.
Final data set would thus look something like this:
group_one group_two sample_membership
1 0 a 1
2 0 a 12
3 0 a 5
4 1 a 5
5 1 a 7
6 1 a 9
Thoughts? Thanks very much in advance!
Maybe something like this?:
library(dplyr)
df %>%
group_by(group_one, group_two) %>%
mutate(sample_membership = sample(1:12, n(), replace = FALSE))
Here's a one-line data.table approach, which you should definitely consider if you have a long data.frame.
library(data.table)
setDT(df)
df[, sample_membership := sample.int(12, .N, replace=TRUE), keyby = .(group_one, group_two)]
df
# group_one group_two sample_membership
# 1: 0 a 9
# 2: 0 a 8
# 3: 0 c 10
# 4: 0 c 4
# 5: 0 e 9
# ---
# 256: 9 v 4
# 257: 9 x 7
# 258: 9 x 11
# 259: 9 z 3
# 260: 9 z 8
For sampling without replacement, use replace=FALSE, but as noted elsewhere, make sure you have fewer than k members per group. OR:
If you want to use "sampling without unnecessary replacement" (making this up -- not sure what the right terminology is here) because you have more than k members per group but still want to keep the groups as evenly sized as possible, you could do something like:
# example with bigger groups
k <- 12L
big_df <- data.frame(group_one=rep((0:9),260), group_two=rep((letters),100))
setDT(big_df)
big_df[, sample_round := rep(1:.N, each=k, length.out=.N), keyby = .(group_one, group_two)]
big_df[, sample_membership := sample.int(k, .N, replace=FALSE), keyby = .(group_one, group_two, sample_round)]
head(big_df, 15) # you can see first repeat does not occur until row k+1
Within each "sampling round" (first k observations in the group, second k observations in the group, etc.) there is sampling without replacement. Then, if necessary, the next sampling round makes all k assignments available again.
This approach would really evenly stratify the sample (but perfectly even is only possible if you have a multiple of k members in each group).
Here is a base R method, that assumes that your data.frame is sorted by groups:
# get number of observations for each group
groupCnt <- with(df, aggregate(group_one, list(group_one, group_two), FUN=length))$x
# for reproducibility, set the seed
set.seed(1234)
# get sample by group
df$sample <- c(sapply(groupCnt, function(i) sample(12, i, replace=TRUE)))
Untested example using dplyr, if it doesn't work it might point you in the right direction.
library( dplyr )
set.seed(123)
df <- data.frame(
group_one = as.integer( runif( 1000, 1, 6) ),
group_two = sample( LETTERS[1:6], 1000, TRUE)
) %>%
group_by( group_one, group_two ) %>%
mutate(
sample_membership = sample( seq(1, length(group_one) ), length(group_one), FALSE)
)
Good luck!

R: get corresponding value from another data frame

I'm new to R and here and I need some help to structure my data.
I have two data sets:
One of them is a long format within subjects data set which is large and looks a little bit like this:
long.format <- data.frame(subject.no = c(1, 1, 1, 1, 2, 2, 2, 2), condition = c("prime", "prime", "prime", "prime", "control", "control","control","control"), response = c(1,1,1,0,1,1,1,0))
subject.no condition response
>1 1 prime 1
>2 1 prime 1
>3 1 prime 1
>4 1 prime 0
>5 2 control 1
>6 2 control 1
>7 2 control 1
>8 2 control 0
The other one is already in wide format and looks like this
wide.format <- data.frame(subject = c(1, 2), age = c(26,27), gender = c("m","f"))
subject age gender
>1 1 26 m
>2 2 27 f
The only thing I want to do now is to get the value in "condition" (and only this!) from the long format data frame to the corresponding subject in the wide data frame by adding a new column in the wide data frame (by using the columns subject.no and subject, respectively).
So the final data frame should look like this:
wide.format.aim <- data.frame(subject = c(1, 2), age = c(26,27), gender = c("m","f"), condition = c("prime","control"))
subject age gender condition
>1 1 26 m prime
>2 2 27 f control
I've tried merging but this ended up with a long format data frame added with the information from the wide format data frame... but I want it the other way around...
This is what I've tried:
test.it <- merge(x=wide.format, y=long.format[,c("subject.no", "condition")], all.x=T, by.x="subject", by.y="subject.no")
Any suggestions?
Thanks in advance!
You are interested merging the unique values from long.format[,c("subject.no", "condition")]:
unique(long.format[,c("subject.no", "condition")])
# subject.no condition
#1 1 prime
#5 2 control
You can merge using those values
merge(x = wide.format,
y = unique(long.format[,c("subject.no", "condition")]),
by.x = "subject",
by.y = "subject.no")
# subject age gender condition
#1 1 26 m prime
#2 2 27 f control

Resources