How to create a subset by using another subset as condition? - r

I want to create a subset using another subset as a condition. I can't show my actual data, but I can show an example that deals with the core of my problem.
For example, I have 10 subjects with 10 observations each. So an example of my data would be to create a simple data frame using this:
ID <- rep(1:10, each = 10)
x <- rnorm(100)
y <- rnorm(100)
df <- data.frame(ID,x,y)
Which creates:
ID x y
1 1 0.08146318 0.26682668
2 1 -0.18236757 -1.01868755
3 1 -0.96322876 0.09565239
4 1 -0.64841436 0.09202456
5 1 -1.15244873 -0.38668929
6 1 0.28748521 -0.80816416
7 1 -0.64243912 0.69403155
8 1 0.84882350 -1.48618271
9 1 -1.56619331 -1.30379070
10 1 -0.29069417 1.47436411
11 2 -0.77974847 1.25704185
12 2 -1.54139896 1.25146126
13 2 -0.76082748 0.22607239
14 2 -0.07839719 1.94448322
15 2 -1.53020374 -2.08779769
etc.
Some of these subjects were positive for an event (for example subject 3, 5 and 7), so I have created a subset for that using:
event_pos <- subset(df, ID %in% c("3","5","7"))
Now, I also want to create a subset for the subjects who were negative for an event. I could use something like this:
event_neg <- subset(df, ID %in% c("1","2","4","6","8","9","10"))
The problem is, my data set is too large to specify all the individuals of the negative group. Is there a way to use my subset event_pos to get all the subjects with negative events in one subset?
TL;DR
Can I get a subset_2 by removing the subset_1 from the data frame?

You can use:
ind_list <- c("3","5","7")
event_neg <- subset(df, (ID %in% ind_list) == FALSE)
or
event_neg <- subset(df, !(ID %in% ind_list))
Hope that helps.
Gottaviannoni
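Since the question asks specifically about reusing event_pos, here is a minimal sketch that takes the IDs straight from that subset instead of retyping them. The set.seed() call and the numeric IDs are my own assumptions, added only to make the example reproducible.

```r
# Rebuild the example data from the question (seed is an assumption)
set.seed(1)
df <- data.frame(ID = rep(1:10, each = 10), x = rnorm(100), y = rnorm(100))

# Subjects that were positive for the event
event_pos <- subset(df, ID %in% c(3, 5, 7))

# Everything whose ID does not occur in event_pos
event_neg <- subset(df, !(ID %in% event_pos$ID))

# Equivalent with plain indexing
event_neg2 <- df[!(df$ID %in% event_pos$ID), ]
```

Because the condition only looks at event_pos$ID, it keeps working however event_pos was built, as long as the ID column is present in both data frames.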

Related

Is there any way to make a data frame using the last values of a repeated loop?

I'd like to make a data frame using only the last computed values from a repeat loop.
For the repeat and sample functions, I'm using this data. The numbers in the Prob column are the probabilities of each number occurring.
[Image: the data frame a, with columns Plus and Prob]
b <- 1
repeat {
  c <- sample(a$Plus, size = 1, prob = a$Prob)
  cat(b, '\t', c, '\n')
  b <- b + 1
  if (c >= 10) {
    break
  }
}
#I'm interested in the result greater than 10 only
If I run the code above, then it will compute something like
1 4
2 8
3 13
If I run this again, it will compute different results, like:
1 9
2 3
3 7
4 3
5 11
What I'd like to do is to make a data frame using only the last outputs of each loop.
For example, using the computed data above, I'd like to make a frame that looks like
Trial Result
3 13
5 11
Is there any way to repeat this loop the number of times I want to and make a data frame using only the last outputs of each repeated function?
You can use a user-defined function to do this. Since you haven't given your data frame a, I've defined it as follows:
library(tidyverse)
a <- tibble(
  Plus = 1:15,
  Prob = seq(from = 15, to = 1, by = -1)
)
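The following function approximates your repeat loop and stores the relevant results in a tibble. Rather than drawing one value at a time, it draws length(a$Plus) values in a single call to sample() and keeps the first one that is at least 10. Note that sample() draws without replacement by default, so this is not identical to repeated single draws. I've left your variable b out of this because, as far as I can see, it doesn't contribute to your desired output.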
samplefun <- function(a) {
  c <- sample(a$Plus, size = length(a$Plus), prob = a$Prob)
  res <- tibble(
    Trial = which(c >= 10)[1],
    Result = c[which(c >= 10)[1]]
  )
  return(res)
}
Then use map_dfr to return as many samples as you like:
nsamples <- 5
map_dfr(1:nsamples, ~ samplefun(a))
Output:
# A tibble: 5 x 2
Trial Result
<int> <int>
1 4 11
2 6 14
3 5 11
4 2 10
5 4 15
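If every draw needs to be independent, as in the repeat loop from the question, a base-R sketch along these lines may be closer to the original behaviour. It calls sample() with size = 1 on each iteration and records only the trial number and value of the draw that ended the loop. The data frame a, the function name run_trial and the seed are assumptions of mine, since the real data was only shown as an image.

```r
# Assumed stand-in for the data frame `a` from the question
a <- data.frame(Plus = 1:15, Prob = 15:1)

# Hypothetical helper: repeat single draws until one is >= 10,
# then return only the final trial number and its value
run_trial <- function(a) {
  trial <- 0
  repeat {
    trial <- trial + 1
    result <- sample(a$Plus, size = 1, prob = a$Prob)
    if (result >= 10) break
  }
  data.frame(Trial = trial, Result = result)
}

set.seed(42)  # only for reproducibility
n_runs <- 5
results <- do.call(rbind, lapply(seq_len(n_runs), function(i) run_trial(a)))
results
```

Each row of results corresponds to one complete run of the loop, so n_runs controls how many rows the final data frame has.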

purrr::compose or any alternative to reduce the run time for a long nested function in R?

Problem: I have several (10+) custom functions, each defining a step in the workflow. I want to run a nested function of these steps over a large data frame for n (50+) periods iteratively. My current function achieves the result but it is too slow and not very elegant.
Example Input
id x_1975 z_1975
1 1 1 NA
2 2 2 NA
3 3 3 NA
4 4 4 NA
5 5 5 NA
Step 1:
Compare initial x values (x_1975) against a cutoff=3. If x is greater than 3, then the z value should be "Y".
Step 2:
If z value is "Y", then x value in next year should be x times 2. Otherwise, it should be x times 5. Although the z values can be skipped altogether, I need the categorical column to create summary stats.
Note:
The data set I am working with has 20 variables that need to be calculated based on similar logic.
Desired Output
id x_1975 z_1975 x_1976 z_1976 x_1977 z_1977 x_1978
1 1 1 <NA> 5 Y 10 Y 20
2 2 2 <NA> 10 Y 20 Y 40
3 3 3 <NA> 15 Y 30 Y 60
4 4 4 Y 8 Y 16 Y 32
5 5 5 Y 10 Y 20 Y 40
6 6 6 Y 12 Y 24 Y 48
What I have tried:
I tried setting the data in long format, but found it complicated to iterate over rows.
I pre-allocated all columns with the appropriate class. That reduced the run time a little, although not enough.
I have been trying to use purrr::compose to nest all the functions, but I have not been able to make it work.
Reproducible Example
library(dplyr)
library(purrr)
# Create Data Frame
n <- 6
dat <- data.frame(id = 1:n,
                  x_1975 = seq(1, 6, 1),
                  z_1975 = NA)
cut_off <- 3
# Functions
# Set a value for "z_" variables in period t by comparing the "x_" value in period t against the cut_off value.
func_1 <- function(dat, yr) {
  # pre-define variables
  z <- paste0("z_", yr)
  x <- paste0("x_", yr)
  # Calculate values for "z_" in period t
  dat <- dat %>% mutate(!!sym(z) :=
    case_when(!!sym(x) > cut_off ~ "Y",
              TRUE ~ as.character(NA)
    ))
}
# Calculate the value for "x_" variables in period t+1 based on "z_" variables in period t.
func_2 <- function(dat, yr) {
  # pre-define variables
  x <- paste0("x_", yr + 1)
  x_lag <- paste0("x_", yr)
  z <- paste0("z_", yr)
  # Calculate "x_" value for t+1
  dat <- dat %>% mutate(!!sym(x) := case_when(
    !!sym(z) == "Y" ~ !!sym(x_lag) * 2,
    TRUE ~ !!sym(x_lag) * 5
  ))
}
# Join function 1 and function 2 together. The joined function needs to iterate over `dat` from the beginning year to the ending year.
joined_func <- function(dat, beginning, ending) {
  for (year in seq(beginning, ending, 1)) {
    dat <- func_1(dat, year)
    # Output of step 1 is used as input for step 2
    dat <- func_2(dat, year)
  }
  return(dat)
}
# Run the code from 1975 to 2025. The data_output has the desired output, but the runtime needs to be reduced.
data_output <- joined_func(dat,1975,1977)
# Tried to use the compose function from purrr. but getting error.
my_funs <- c(func_1, func_2)
f1 <- invoke(compose, my_funs)
joined_func_2 <- function(dat, beginning, ending) {
  for (year in seq(beginning, ending, 1)) {
    dat <- f1(dat, year = year)
  }
}
data_output_2 <- joined_func_2(dat,1975,1977)
# Error message:
# Error in f1(dat, year = year) : unused argument (year = year).
Questions
a) how do I make purrr::compose work? b) any other way to achieve efficiency?
Would really appreciate if someone could help me on this!
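One observation on the error message: func_1 and func_2 both name their second argument yr, while f1 is called with year = year, which is enough to produce "unused argument (year = year)". Note also that joined_func_2 never returns dat. On the efficiency question, one option is to skip the repeated mutate() calls and keep the values in plain matrices with one column per year. The sketch below is only an assumption about how that could look for the two example steps; the function name simulate_years is hypothetical, and it has not been tried on the real 20-variable data.

```r
# Hypothetical base-R rewrite of the two steps: one matrix column per year
simulate_years <- function(x0, beginning, ending, cut_off = 3) {
  years <- beginning:ending
  n <- length(x0)

  # Pre-allocate: x needs one extra column for the year after `ending`
  x <- matrix(NA_real_, nrow = n, ncol = length(years) + 1,
              dimnames = list(NULL, paste0("x_", c(years, ending + 1))))
  z <- matrix(NA_character_, nrow = n, ncol = length(years),
              dimnames = list(NULL, paste0("z_", years)))
  x[, 1] <- x0

  for (j in seq_along(years)) {
    above <- x[, j] > cut_off                   # step 1: compare with cut-off
    z[above, j] <- "Y"
    x[, j + 1] <- x[, j] * ifelse(above, 2, 5)  # step 2: next year's x
  }

  data.frame(id = seq_len(n), x, z, stringsAsFactors = FALSE)
}

out <- simulate_years(1:6, 1975, 1977)
out
```

The columns come out grouped (all x_ columns, then all z_ columns) rather than interleaved as in the desired output, but the values for the 1975 to 1977 example are the same.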

Using sum(x:y) to create a new variable/vector from existing values in R

I am working in R with a data frame d:
ID <- c("A","A","A","B","B")
eventcounter <- c(1,2,3,1,2)
numberofevents <- c(3,3,3,2,2)
d <- data.frame(ID, eventcounter, numberofevents)
> d
ID eventcounter numberofevents
1 A 1 3
2 A 2 3
3 A 3 3
4 B 1 2
5 B 2 2
where numberofevents is the highest value in the eventcounter for each ID.
Currently, I am trying to create an additional vector z <- c(6,6,6,3,3).
If numberofevents == 3, it is supposed to calculate sum(1:3), which equals 3 + 2 + 1 = 6.
If numberofevents == 2, it is supposed to calculate sum(1:2), which equals 2 + 1 = 3.
Working with a large set of data, I thought it might be convenient to create this additional vector
by using the sum function in R d$z<-sum(1:d$numberofevents), i.e.
sum(1:3) # for the rows 1-3
and
sum(1:2) # for the rows 4-5.
However, I always get this warning:
Numerical expression has x elements: only the first is used.
You can try ave
d$z <- with(d, ave(eventcounter, ID, FUN=sum))
Or using data.table
library(data.table)
setDT(d)[,z:=sum(eventcounter), ID][]
Try using the sapply or lapply functions in R.
sapply(numberofevents, function(x) sum(1:x))
It works for me.
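The warning in the question appears because the colon operator only accepts single numbers, so 1:d$numberofevents uses just the first element of the column. Since sum(1:n) equals n * (n + 1) / 2, a fully vectorized sketch is also possible. This assumes numberofevents always holds positive whole numbers.

```r
ID <- c("A", "A", "A", "B", "B")
eventcounter <- c(1, 2, 3, 1, 2)
numberofevents <- c(3, 3, 3, 2, 2)
d <- data.frame(ID, eventcounter, numberofevents)

# Closed form of sum(1:n), applied to the whole column at once
d$z <- d$numberofevents * (d$numberofevents + 1) / 2
d$z
```

This avoids both the warning and the per-row function call that sapply makes.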

Replicate variable based off match of two other variables in R

I've got a seemingly simple question that I can't answer: I've got three vectors:
x <- c(1,2,3,4)
weight <- c(5,6,7,8)
y <- c(1,1,1,2,2,2)
I want to create a new vector that replicates the values of weight for each time an element in x matches y such that it produces the following new weight vector associated with y:
y_weight <- c(5,5,5,6,6,6)
Any thoughts on how to do this (either loop or vectorized)? Thanks
You want the match function. The call
match(y, x)
returns the indices of the matches, which you can then use to build your new weight vector:
weight[match(y, x)]
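Spelled out step by step with the vectors from the question, the lookup can be sketched as follows. The last two lines use a made-up vector y2 purely to illustrate that match() returns NA for values that do not occur in x.

```r
x <- c(1, 2, 3, 4)
weight <- c(5, 6, 7, 8)
y <- c(1, 1, 1, 2, 2, 2)

idx <- match(y, x)       # position in x of each element of y
y_weight <- weight[idx]  # look up the weight at those positions
y_weight

# Hypothetical extra case: a value of y that is missing from x
y2 <- c(1, 99)
weight[match(y2, x)]     # second element is NA
```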
#Using plyr
library(plyr)
df<-as.data.frame(cbind(x,weight)) # converting to dataframe
df<-rename(df,c(x="y")) # rename x as y for joining dataframes
y<-as.data.frame(y) # converting to dataframe
mydata <- join(df, y, by = "y",type="right")
> mydata
y weight
1 1 5
2 1 5
3 1 5
4 2 6
5 2 6
6 2 6

Comparing two columns: logical- is value from column 1 also in column 2?

I'm pretty confused about how to go about this. Say I have two columns in a data frame. One column is a numerical series in order (x); the other specifies some value from the first, or -1 (y). These are results from a matching experiment, where the goal is to see if multiple photos were taken of the same individual. In the example below, there are 10 photos, but only 6 unique individuals. In the y column, the corresponding x is reported if there is a match; y is -1 for no match (these might as well be NAs). If there are more than 2 photos per individual, the match number will be the most recent record (photos 1, 5 and 7 are the same individual below). The group is the time period in which the photo was taken (there are no matches within a group!). Hopefully I've got this example right:
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(-1,-1,-1,-1,1,-1,1,-1,2,4)
group <- c(1,1,1,2,2,2,3,3,3,3)
DF <- data.frame(x,y,group)
I would like to create a new variable to name the unique individuals, and have a final dataset with a single row per individual (i.e. only have 6 rows instead of 10), that also includes the group information. I.e. if an individual is in all three groups, there could be a value of "111" or if just in the first and last group it would be "101". Any tips?
Thanks for asking about the resulting dataset. I realized my group explanation was bad based on the actual numbers I gave, so I changed the results slightly. Bonus would also be nice to have, but not critical.
name <- c(1,2,3,4,6,8)
group_history <- as.character(c('111','101','100','011','010','001'))
bonus <- as.character(c('1,5,7','2,9','3','4,10','6','8'))
results_I_want <- data.frame(name,group_history,bonus)
My word, more mistakes fixed above...
Using the (updated) example you gave
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(-1,-1,-1,-1,1,-1,1,-1,3,4)
group <- c(1,1,1,2,2,2,3,3,3,3)
DF <- data.frame(x,y,group)
Use x and y to create a mapping from higher numbers to the lower numbers that refer to the same person. Note that the names of the vector are character strings, even though they consist of digits.
bottom.df <- DF[DF$y==-1,]
mapdown.df <- DF[DF$y!=-1,]
mapdown <- c(mapdown.df$y, bottom.df$x)
names(mapdown) <- c(mapdown.df$x, bottom.df$x)
We don't know how many iterations it will take to get everything down to the lowest number, so we have to use a while loop.
oldx <- DF$x
newx <- mapdown[as.character(oldx)]
while (any(oldx != newx)) {
  oldx <- newx
  newx <- mapdown[as.character(oldx)]
}
The result is the set each photo belongs to, named by the lowest number in that set.
DF$id <- unname(newx)
Getting the group membership is harder. Use reshape2 to convert this into wide format (one column per group), where the column is "1" if the individual appeared in that group and "0" if not.
library("reshape2")
wide <- dcast(DF, id~group, value.var="id",
fun.aggregate=function(x){if(length(x)>0){"1"}else{"0"}})
Finally, paste these "0"/"1" memberships together to get the grouping variable you described.
wide$grouping = apply(wide[,-1], 1, paste, collapse="")
The result:
> wide
id 1 2 3 grouping
1 1 1 1 1 111
2 2 1 0 0 100
3 3 1 0 1 101
4 4 0 1 1 011
5 6 0 1 0 010
6 8 0 0 1 001
No "bonus" yet.
EDIT:
To get the bonus information, it helps to redo the mapping to keep everything. If you have a lot of cases, this could be slow.
Replace the oldx/newx part with:
iterx <- matrix(DF$x, ncol=1)
iterx <- cbind(iterx, mapdown[as.character(iterx[,1])])
while (any(iterx[, ncol(iterx)] != iterx[, ncol(iterx) - 1])) {
  iterx <- cbind(iterx, mapdown[as.character(iterx[, ncol(iterx)])])
}
DF$id <- iterx[,ncol(iterx)]
To generate the bonus data, you can then use
bonus <- tapply(iterx[,1], iterx[,ncol(iterx)], paste, collapse=",")
wide$bonus <- bonus[as.character(wide$id)]
Which gives:
> wide
id 1 2 3 grouping bonus
1 1 1 1 1 111 1,5,7
2 2 1 0 0 100 2
3 3 1 0 1 101 3,9
4 4 0 1 1 011 4,10
5 6 0 1 0 010 6
6 8 0 0 1 001 8
Note that this isn't the same as your example output, but I don't think your example output is right (how can you have a group_history of "000"?)
EDIT:
Now it agrees.
Another solution for bonus variable
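f_bonus <- function(data = DF) {
  # Unmatched photos (y == -1) define the individuals; number them by position
  data_a <- subset(data, y == -1, select = x)
  data_a$pos <- seq(nrow(data_a))
  # Matched photos take the position of the individual they point to
  data_b <- subset(data, y != -1, select = c(x, y))
  data_b$pos <- match(data_b$y, data_a$x)
  # Drop the y column, stack both sets, and paste the photo numbers per individual
  data_t <- rbind(data_a, data_b[-2])
  data_t <- with(data_t, tapply(x, pos, paste, sep = "", collapse = ","))
  return(data_t)
}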
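For the group history itself, a base-R sketch without reshape2 could look like the following. It uses the data from the answer above (with y[9] = 3) and assumes that x is simply 1, 2, ..., n, so that a value of y can be used directly as a row index. The variable names id, presence and group_history are my own.

```r
x <- 1:10
y <- c(-1, -1, -1, -1, 1, -1, 1, -1, 3, 4)
group <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)

# Start with each photo as its own individual, then follow the links in y
id <- x
linked <- y != -1
id[linked] <- y[linked]
# Keep following links until every id points at an unmatched photo
while (any(y[id] != -1)) {
  sel <- y[id] != -1
  id[sel] <- y[id][sel]
}

# One row per individual, one column per group: TRUE if seen in that group
presence <- table(id, group) > 0
group_history <- apply(presence, 1, function(r) paste(as.integer(r), collapse = ""))
group_history
```

The names of group_history are the individual ids (1, 2, 3, 4, 6, 8), so it can be combined with the bonus vector by name.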
