Function to generate subset of data in R - r

I have created a data frame from which I'm trying to obtain a subset using the criteria outlined below. In simple terms, I'd like to write a function that selects half of all male ids below the age of 4. I need the function for other tasks later, hence the need to write one.
I've tried the simple code below, but it returns an error, and I would really appreciate some help. I think the error comes from the filter() call, but I have no idea why.
library(dplyr)
set.seed(2016)
longData <- data.frame(patient=rep(paste(letters[1:20], sep = "_", 1:20), each=5),
age=rep(runif(20, 1, 10), each = 5),var=runif(100, 50, 1000),
time=rep(1:5, 20), gender = factor(rep(c("male", "female"), c(60, 40))))
ptNo <- function(data, id = "patient")
{
ptNumber <- data %>% filter(!duplicated(data[id])) %>% nrow()
return(ptNumber)
}
randSubsetFun <- function(df, gender, seedNumb){
data <- filter(df, gender, age < 4)
set.seed(seedNumb)
n <- round(ptNo(data)/2)
fd <- sample_n(data, n)
return(fd)
}
randSubsetFun(df = longData, gender = "male", seedNumb = 1)
The error reads the following:
Warning message:
In Ops.factor(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, :
‘&’ not meaningful for factors
The annoying thing is that when I ran these lines individually outside of the function, they worked fine. They just don't work within the function.
Any help would be warmly appreciated. Any suggestions as to how I can improve this code would be welcome too. I guess the problem really is how do I call a factor from the filter function via this function (If that makes sense)?
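One likely cause, sketched under assumptions: inside filter(), the bare name gender resolves to the gender column (a factor) rather than to the function argument, so dplyr ends up combining a factor with age < 4 using &. The sketch below renames the argument to sex (a hypothetical name) and compares it with the column explicitly. It also samples patient ids rather than rows, which is an assumption about what "half of all male ids" means.

```r
library(dplyr)

# Hypothetical rewrite: the argument is called `sex` so that it cannot be
# confused with the `gender` column inside filter().
randSubsetFun <- function(df, sex, seedNumb) {
  # Unique patient ids that match the criteria
  ids <- df %>%
    filter(gender == sex, age < 4) %>%
    distinct(patient) %>%
    pull(patient) %>%
    as.character()

  # Draw half of those ids (rounded) reproducibly
  set.seed(seedNumb)
  keep <- sample(ids, round(length(ids) / 2))

  # Return every row belonging to the sampled patients
  filter(df, patient %in% keep)
}
```

With the data from the question this would be called as randSubsetFun(df = longData, sex = "male", seedNumb = 1).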

Related

cannot reference grouped data in summarize(across(...))

When I try to create several columns within summarize(), I can reference a newly created column name in the same summarize statement.
Example:
Goal: Try to calculate the standard error ("se") based on the standard deviation ("sd").
Step 1 (start to assign sd for se):
data %>%
group_by(style) %>%
summarise(across(score,list(mean = mean, sd = sd, se = sd)))
returns
style score_mean score_sd score_se
* <fct> <dbl> <dbl> <dbl>
1 S1 3.5 0.707 0.707
Step 2: calculate se based on sd
data %>%
group_by(style) %>%
summarise(across(score,list(mean = mean, sd = sd, se = sd/sqrt(nrow(score)))))
returns
Error: Problem with `summarise()` input `..1`.
x non-numeric argument to binary operator
ℹ Input `..1` is `across(score, list(mean = mean, sd = sd, se = sd/sqrt(nrow(data))))`.
ℹ The error occured in group 1: style = "S1".
Step 3 debugging assignment term
3a) grouped data reference
I replaced score in nrow(score) with other column names or even nrow(data), but they all led to the same error message.
3b) assignment operation
I replaced the assignment for se, sd/sqrt(nrow(score)), with different variations, all leading to the same error. The simplest was sd/2, so even dividing by a constant doesn't work.
3c) assignment reference
I replaced sd by score_sd to reference the new column created, as seen in the output (Step 1). Still the same error message.
Question: Why does Step 1 work but not Step 2?
The error message just refers to the whole across() statement, so doesn't help to narrow down the root cause.
My hunch is that I have to reference the grouped data somehow, but I tried
se = sd(.)/sqrt(nrow(data) with no success.
Would be grateful for any hints...
Minimal reproducible example:
data <- structure(list(style = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L,
3L, 4L, 5L), .Label = c("S1", "S2", "S3", "S4", "S5"), class = "factor"),
param = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L
), .Label = c("A", "B", "C"), class = "factor"), score = c(4,
1, 1, 3, 3, 3, 5, 1, 1, 1)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
After many trial & error attempts, I found the solution myself. This is for everyone who is not yet familiar with the across function, as dplyr 1.0.0 is not yet released.
So the answer to my question is:
You must reference the grouped data with the . placeholder, but only if you use the purrr-style formula operator ~.
However, you must not pass . to n(), because n() takes no arguments.
The second point took endless trials to find out, and is the reason why I wanted to share this solution.
It may not seem intuitive that n(), although written with parentheses, never takes the . placeholder: it always refers to the current group.
This is what this double trick looks like:
data %>%
group_by(style) %>%
summarise(across(
score,
list(mean = mean, sd = sd, se = ~sd(.)/sqrt(n()))
))
If you know it, it's easy :-)
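A small sketch of why Step 2 fails, using toy data that is an assumption rather than the data from the question: each element of the list passed to across() has to be a function (or a ~ formula), and sd/2 tries to divide the function sd itself by a number. An ordinary anonymous function works as well as the ~ form.

```r
library(dplyr)

# Toy data (an assumption): two scores per style
toy <- tibble(
  style = factor(c("S1", "S2", "S1", "S2")),
  score = c(4, 1, 3, 5)
)

# sd is a function, so sd / 2 is "function divided by number" and fails
msg <- tryCatch(sd / 2, error = function(e) conditionMessage(e))

# Wrapping the calculation in a function makes it a valid list element
res <- toy %>%
  group_by(style) %>%
  summarise(across(
    score,
    list(mean = mean, sd = sd, se = function(x) sd(x) / sqrt(length(x)))
  ))
res
```

Here length(x) plays the role of n(), since x is the score vector of the current group.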

Producing two new variables using recode in R?

I'm gradually getting used to recoding variables in R but I'm having a bit of trouble creating two new variables. For example, I have tried the following:
income2018$income2 <- dplyr::recode(income2018$income, '51' = 1L, '52' = 1L, '53' = 2L)
income2018$income3 <- dplyr::recode(income2018$income, '57' = 1L, '58' = 1L, '50' = 2L)
It doesn't look like the values are being correctly applied to the new variables.
Here is the SPSS syntax that I am attempting to recreate:
RECODE income (51,52=1)(53=2) into income2
RECODE income (57,58=1)(50=2) into income3
I'd be very grateful for any assistance.
Many thanks.
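It looks like you might need to rearrange your code a little bit, but it's hard to tell without a reprex.
You might want to try calling recode() inside mutate(), so that it receives the income column rather than the whole data frame:
income2018 <- income2018 %>%
  dplyr::mutate(income2 = dplyr::recode(income, '51' = 1L, '52' = 1L, '53' = 2L),
                income3 = dplyr::recode(income, '57' = 1L, '58' = 1L, '50' = 2L))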
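As an alternative sketch, case_when() can express the same mapping with explicit conditions. The income vector below is made up for illustration; values that match no condition become NA, which I assume is what the SPSS RECODE ... INTO syntax does for a new target variable.

```r
library(dplyr)

# Made-up income codes for illustration
income <- c(50, 51, 52, 53, 57, 58, 99)

# Codes 51 and 52 become 1, code 53 becomes 2, everything else NA
income2 <- case_when(
  income %in% c(51, 52) ~ 1L,
  income == 53 ~ 2L
)

# Codes 57 and 58 become 1, code 50 becomes 2, everything else NA
income3 <- case_when(
  income %in% c(57, 58) ~ 1L,
  income == 50 ~ 2L
)
```

Inside a data frame the same expressions would go into mutate().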

Merging three factors so their dependent variable sums in R

Not sure if someone has answered this - I have searched, but so far nothing has worked for me. I have a very large dataset that I am trying to narrow. I need to combine three factors in my "PROG" variable ("Grad.2","Grad.3","Grad.H") so that they become a single variable ("Grad") where the dependent variable ("NUMBER") of each comparable set of values is summed.
ie.
YEAR = "92/93" AGE = "20-24" PROG = "Grad.2" NUMBER = "50"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.3" NUMBER = "25"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.H" NUMBER = "2"
turns into
YEAR = "92/93" AGE = "20-24" PROG = "Grad" NUMBER = "77"
I want to then drop all other factors for PROG so that I can compare the enrollment rates for Grad without worrying about the other factors (which I deal with separately). So my active independent variables are YEAR and AGE, while the dependent variable is NUMBER.
I hope this shows my data adequately:
structure(list
(YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("92/93", "93/94", "94/95", "95/96", "96/97",
"97/98", "98/99", "99/00", "00/01", "01/02", "02/03", "03/04",
"04/05", "05/06", "06/07", "07/08", "08/09", "09/10", "10/11",
"11/12", "12/13", "13/14", "14/15", "15/16"), class = "factor"),
AGE = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L), .Label = c("1-19",
"20-24", "25-30", "31-34", "35-39", "40+", "NR", "T.Age"), class = c("ordered",
"factor")),
PROG = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
19L, 19L, 19L), .Label = c("T.Prog", "Basic", "Career", "Grad.H",
"Grad2", "Grad3", "Grad2.Qual", "Grad3.Qual", "Health.Res",
"NoProg.Grad", "NoProg.Other", "NoProg.Und.Grad", "NoProg.NoCred",
"Other", "Post.Und.Grad", "Post.Career", "Pre-U", "Career.Qual",
"Und.Grad", "Und.Grad.Qual"), class = "factor"),
NUMBER = c(104997L,
347235L, 112644L, 38838L, 35949L, 50598L, 5484L, 104991L,
333807L, 76692L)), row.names = c(7936L, 7948L, 7960L, 7972L,
7984L, 7996L, 8008L, 10459L, 10471L, 10483L), class = "data.frame")
In terms of why I am using factors, I don't know how else I should enter the data. Factors made sense, and they were how R interpreted the raw data when I uploaded it.
I am working on the suggestions below. Not had success yet, but I am still learning how to get R to do what I want, and frequently mess up. Will respond to each of you as soon as I have a reasonable answer to give. (And once I stop banging my poor head on my desk... sigh)
If I understand your question correctly, this should do it.
I am assuming your data frame is named df:
library(tidyverse)
df %>%
mutate(PROG = ifelse(PROG %in% c("Grad2", "Grad3","Grad.H"),
"Grad",
NA)) %>% ##combines the 3 Grad variables into one
filter(!is.na(PROG)) %>% ##drops the other variables
group_by(YEAR, AGE) %>%
summarise(NUMBER = sum(NUMBER))
Slightly different approach: only take factors you want, drop the factor variable (because you want to treat them as a group) and sum up all NUMBER values while grouping by all other variables. df is your data.
aggregate(NUMBER ~ .,  # formula passed by position, so the argument name does not matter
data = subset(df, PROG %in% c("Grad2", "Grad3", "Grad.H"), select = -PROG),
FUN = sum)
There are multiple ways to do this, but I agree with FScott that you are likely looking for the levels() function to rename the factor levels. Here is how I would do the second step of summing.
library(magrittr)
library(dplyr)
#do the renaming of the PROG variables here
#sum by PROG
df <- df %>%
group_by(PROG) %>% # you could add more variable names here to group by i.e. group_by(PROG, AGE, YEAR)
mutate(group.sum= sum(NUMBER))
This chunk will make a new column in df named group.sum, holding the sum within each group defined by the group_by() call.
If you want to condense the data frame further, so that the individual values in NUMBER are replaced with group.sum, there are again many ways to do this, but here is a simple one.
#condense df down
df$NUMBER <- df$group.sum # overwrite the individual values with the group sums
df <- df[,-ncol(df)] # drop group.sum, which is the last column
df <- unique(df)
A side note: I wouldn't recommend doing the above chunk, because you lose information in your data, and your data is tidier with just the extra column group.sum.
I think the levels() function is what you are looking for. From the manual:
## combine some levels
z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z
levels(z) <- c("fruit", "veg", "fruit")
z
I named your data temp and ran this code. It works for me.
temp$PROG2 <- temp$PROG
# the new labels are matched to the 20 existing levels in order;
# repeated labels merge the corresponding levels
levels(temp$PROG2) <- c(rep("Other",3),rep("Grad",5),rep("Other",12))
temp
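A compact sketch of the levels() approach combined with the summing step, using the three example rows from the question plus one extra made-up row. The level names Grad2, Grad3 and Grad.H follow the dput output, and the data frame here is an assumption for illustration only.

```r
# Toy data modelled on the example in the question (the last row is made up)
df <- data.frame(
  YEAR = "92/93",
  AGE = "20-24",
  PROG = factor(c("Grad2", "Grad3", "Grad.H", "Und.Grad")),
  NUMBER = c(50L, 25L, 2L, 100L)
)

# Assigning the same label to several levels merges them into one level
grad_levels <- c("Grad2", "Grad3", "Grad.H")
levels(df$PROG)[levels(df$PROG) %in% grad_levels] <- "Grad"

# Keep only the merged level and sum NUMBER within YEAR and AGE
grad_totals <- aggregate(NUMBER ~ YEAR + AGE + PROG,
                         data = df[df$PROG == "Grad", ],
                         FUN = sum)
grad_totals
```

This should give a single row with NUMBER equal to 77, matching the example in the question.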

Counting an event only every X days per subject (in an irregular time series)

I've got data where I'm counting episodes of care (like ER visits). The trick is, I can't count every single visit, because sometimes a 2nd or 3rd visit is actually a follow-up for a previous problem. So I've been given direction to count visits by using a 30 day "clean period" or "black out period", such that, I look for the first event (VISIT 1) by patient (min date), I count that event, then apply rules so as NOT to count any visits that occur in the 30 days following the first event. After that 30 day window has elapsed, I can begin looking for the 2nd visit (VISIT 2), count that one, then apply the 30 day black out again (NOT counting any visits that occur in the 30 days after visit #2)... wash, rinse, repeat...
I have rigged together a very sloppy solution that requires a lot of babysitting and manual checking of steps(see below). I have to believe that there is a better way. HELP!
data1 <- structure(list(ID = structure(c(2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 4L), .Label = c("", "patient1", "patient2",
"patient3"), class = "factor"), Date = structure(c(14610, 14610,
14627, 14680, 14652, 14660, 14725, 15085, 15086, 14642, 14669,
14732, 14747, 14749), class = "Date"), test = c(1L, 1L, 1L, 2L,
1L, 1L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 2L)), .Names = c("ID", "Date",
"test"), class = "data.frame", row.names = c(NA, 14L))
library(doBy)
## create a table of first events
step1 <- summaryBy(Date~ID, data = data1, FUN=min)
step1$Date30 <- step1$Date.min+30
step2 <- merge(data1, step1, by.x="ID", by.y="ID")
## use an ifelse to essentially remove any events that shouldn't be counted
step2$event <- ifelse(as.numeric(step2$Date) >= step2$Date.min & as.numeric(step2$Date) <= step2$Date30, 0, 1)
## basically repeat steps above until I dont capture any more events
## there just has to be a better way
data3 <- step2[step2$event==1,]
data3<- data3[,1:3]
step3 <- summaryBy(Date~ID, data = data3, FUN=min)
step3$Date30 <- step3$Date.min+30
step4 <- merge(data3, step3, by.x="ID", by.y="ID")
step4$event <- ifelse(as.numeric(step4$Date) >= step4$Date.min & as.numeric(step4$Date) <= step4$Date30, 0, 1)
data4 <- step4[step4$event==1,]
data4<- data4[,1:3]
step5 <- summaryBy(Date~ID, data = data4, FUN=min)
step5$Date30 <- step5$Date.min+30
## then I rbind the "keepers"
## in this case steps 1 and 3 above
final <- rbind(step1,step3, step5)
## then reformat
final <- final[,1:2]
final$Date.min <- as.Date(final$Date.min,origin="1970-01-01")
## again, extremely clumsy, but it works... HELP! :)
This solution is loop-free and uses only base R. It produces a logical vector ok which selects the acceptable rows of data1.
ave runs the indicated anonymous function over each patient separately.
We define a state vector consisting of the current date and the start of the period for which no other dates are considered. Each date is represented by as.numeric(x) where x is the date. step takes the state vector and the current date and updates the state vector. Reduce runs it over the data and then we take only results for which the minimum and current date are the same and for which the current date is not a duplicate.
step <- function(init, curdate) {
c(curdate, if (curdate > init[2] + 30) curdate else init[2])
}
ok <- !!ave(as.numeric(data1$Date), paste(data1$ID), FUN = function(d) {
x <- do.call("rbind", Reduce(step, d, c(-Inf, 0), acc = TRUE))
x[-1,1] == x[-1,2] & !duplicated(x[-1,1])
})
data1[ok, ]
Since that kind of manipulation is error-prone and not straightforward, I would write a separate function to discard events in the blackout period. The function contains a loop, which basically does what you were doing by hand, until there is nothing left to do.
blackout <- function(dates, period=30) {
dates <- sort(dates)
while( TRUE ) {
spell <- as.numeric(diff(dates)) <= period
if(!any(spell)) { return(dates) }
i <- which(spell)[1] + 1
dates <- dates[-i]
}
}
# Tests
stopifnot(
length(
blackout( seq.Date(Sys.Date(), Sys.Date()+50, by=1) )
) == 2
)
stopifnot(
length(
blackout( seq.Date(Sys.Date(), by=31, length=5) )
) == 5
)
It can be used as follows.
library(plyr)
ddply(data1, "ID", summarize, Date=blackout(Date))
How about
do.call('rbind', lapply(split(data1, factor(data1$ID)), function(x) (x <- x[order(x$Date),])[c(T, diff(x$Date) > 30),]))
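One thing to check with a diff()-based filter: it compares each visit with the visit immediately before it, not with the last visit that was actually counted. The sketch below uses a hypothetical helper count_visits() and made-up dates to show where the two can differ.

```r
# Hypothetical helper: keep a visit only if it falls more than `period`
# days after the last visit that was counted
count_visits <- function(dates, period = 30) {
  dates <- sort(dates)
  keep <- logical(length(dates))
  last_counted <- -Inf
  for (i in seq_along(dates)) {
    d <- as.numeric(dates[i])
    if (d > last_counted + period) {
      keep[i] <- TRUE
      last_counted <- d
    }
  }
  dates[keep]
}

# Made-up visits on day 0, day 20 and day 40
visits <- as.Date("2010-01-01") + c(0, 20, 40)

# Blackout measured from the last counted visit: day 0 and day 40 are kept
by_blackout <- count_visits(visits)

# Consecutive differences: only day 0 is kept, since each gap is 20 days
by_diff <- visits[c(TRUE, diff(visits) > 30)]
```

Applied per patient, for example with split() and lapply(), count_visits() follows the rule described in the question.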

Use of x`apply` to speed up loops

This is a more focussed question based on another question I have open at Vectorize/Speed up Code with Nested For Loops
Basically, I want to speed up the execution of this code. I was thinking of using one of the apply family of functions. The apply function would have to use/perform the following:
Input: loop over regions 1 to 10; vectors sed and borewidth with preallocated dimensions filled with NAs
Process: fill data in each of sed and borewidth in the manner implemented in the inner for loop
Output: sed and borewidth vectors
Assumptions (h/t Simon Urbanek): the begin, finish points of each row are contiguous, sequential and for each region, begin at 0.
Code is as below:
for (region in 1:10) {
# subset standRef and sample by region code
standRef.region <- standRef[which(standRef$region == region),]
sample.region <- sample[which(sample$region == region),]
for (i in 1:nrow(sample.region))
{
# create a dataframe - locations - that includes:
# 1) those indices of standRef.region in which the value of the location column is greater than the value of the ith row of the begin column of sample.region
# 2) those indices of standRef.region in which the value of the location column is less than the value of the ith row of the finish column of sample.region
locations <- standRef.region[which((standRef.region$location > sample.region$begin[i]) & (standRef.region$location < sample.region$finish[i])),]
sed[end_tracker:(end_tracker + nrow(locations))] <- sample.region$sed[i]
borewidth[end_tracker:(end_tracker + nrow(locations))] <- sample.region$borewidth[i]
# update end_tracker to the number of locations rows for this iteration
end_tracker <- end_tracker + nrow(locations)
}
cat("Finished region", region,"\n")
}
Sample data for borewidth and sed. Edit: corrected formatting error in dput
structure(list(region = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
begin = c(0L, 2253252L, 7091077L, 9120205L, 0L, 135094L,
941813L, 5901391L, 6061324L), finish = c(2253252L, 7091077L,
9120205L, 17463033L, 135094L, 941813L, 5901391L, 6061324L,
7092402L), sed = c(3.31830840984048, 1.38014704208403, 6.13049140975458,
2.10349875097134, 0.48170587509345, 0.13058713509175, 9.13509713513509,
6.13047153058701, 3.81734081501503), borewidth = c(3L, 5L,
2L, 1L, 1L, 1L, 2L, 4L, 4L)), .Names = c("region", "begin",
"finish", "sed", "borewidth"), class = "data.frame", row.names = c(NA,
-9L))
TIA.
With some extra assumptions based on the data you posted (incl. the other question), this is one way you could do it:
index <- unlist(lapply (unique(standRef$region), function(reg) {
reg.filter <- which(standRef$region == reg)
samp.filter <- which(sample$region == reg)
samp.filter[cut(standRef$location[reg.filter],c(0L,sample$finish[samp.filter]),labels=F)]
}))
sed <- sample$sed[index]
borewidth <- sample$borewidth[index]
The extra assumption is that your samples are contiguous, sequential (all your examples were) and start at 0. This allows us to use cut() on the $finish column instead of treating each interval separately. One difference is that your code left gaps at the breaks, but I'm assuming that was not intentional.
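To illustrate the cut() lookup on its own, here is a minimal sketch for a single region. The finish values are taken from region 1 of the sample data, while the sed values are rounded and the location vector is made up.

```r
# Interval end points for one region (from the sample data, region 1)
finish <- c(2253252, 7091077, 9120205, 17463033)
sed <- c(3.3, 1.4, 6.1, 2.1)  # rounded for readability

# Made-up reference locations
location <- c(10, 2253252, 2253253, 9000000, 17000000)

# cut() returns, for each location, the index of the interval
# (previous finish, finish] that contains it
idx <- cut(location, c(0, finish), labels = FALSE)

# Look up the per-interval value for every location in one step
sed_at_location <- sed[idx]
```

Note that with the default right = TRUE a location exactly on a break belongs to the interval that ends there, and a location of exactly 0 falls outside the first interval unless include.lowest = TRUE is set.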
