Create group id column using dplyr if numbers between certain values [duplicate] - r

I am trying to categorize a numeric variable (age) into groups defined by intervals so it will not be continuous. I have this code:
data$agegrp(data$age >= 40 & data$age <= 49) <- 3
data$agegrp(data$age >= 30 & data$age <= 39) <- 2
data$agegrp(data$age >= 20 & data$age <= 29) <- 1
The above code does not work under the survival package. It gives me:
invalid function in complex assignment
Can you point me where the error is? data is the dataframe I am using.

I would use findInterval() here:
First, make up some sample data
set.seed(1)
ages <- floor(runif(20, min = 20, max = 50))
ages
# [1] 27 31 37 47 26 46 48 39 38 21 26 25 40 31 43 34 41 49 31 43
Use findInterval() to categorize your "ages" vector.
findInterval(ages, c(20, 30, 40))
# [1] 1 2 2 3 1 3 3 2 2 1 1 1 3 2 3 2 3 3 2 3
Alternatively, as recommended in the comments, cut() is also useful here:
cut(ages, breaks=c(20, 30, 40, 50), right = FALSE)
cut(ages, breaks=c(20, 30, 40, 50), right = FALSE, labels = FALSE)
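As for the error itself: data$agegrp(...) uses round parentheses, which R reads as a call to a function, hence "invalid function in complex assignment". Subsetting needs square brackets. A minimal sketch, assuming a small made-up data frame (the ages below are illustrative, not from the question):

```r
# Illustrative data; the ages are made up for this sketch
data <- data.frame(age = c(23, 35, 47, 29, 40))

# Create the column first so each assignment indexes a full-length vector
data$agegrp <- NA_real_

# Square brackets select the rows to fill
data$agegrp[data$age >= 40 & data$age <= 49] <- 3
data$agegrp[data$age >= 30 & data$age <= 39] <- 2
data$agegrp[data$age >= 20 & data$age <= 29] <- 1

data$agegrp
```

Ages outside 20 to 49 keep the NA they were initialised with.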

We can use dplyr:
library(dplyr)
data <- data %>% mutate(agegroup = case_when(age >= 40 & age <= 49 ~ '3',
                                             age >= 30 & age <= 39 ~ '2',
                                             age >= 20 & age <= 29 ~ '1')) # end function
Compared to other approaches, dplyr is easier to write and interpret.

This answer provides two ways to solve the problem using the data.table package, which would greatly improve the speed of the process. This is crucial if one is working with large data sets.
1st Approach: an adaptation of the previous answer but now using data.table + including labels:
library(data.table)
agebreaks <- c(0,1,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,500)
agelabels <- c("0-1","1-4","5-9","10-14","15-19","20-24","25-29","30-34",
"35-39","40-44","45-49","50-54","55-59","60-64","65-69",
"70-74","75-79","80-84","85+")
setDT(data)[ , agegroups := cut(age,
                                breaks = agebreaks,
                                right = FALSE,
                                labels = agelabels)]
2nd Approach: This is a more wordy method, but it also makes it more clear what exactly falls within each age group:
setDT(data)[age <1, agegroup := "0-1"]
data[age >0 & age <5, agegroup := "1-4"]
data[age >4 & age <10, agegroup := "5-9"]
data[age >9 & age <15, agegroup := "10-14"]
data[age >14 & age <20, agegroup := "15-19"]
data[age >19 & age <25, agegroup := "20-24"]
data[age >24 & age <30, agegroup := "25-29"]
data[age >29 & age <35, agegroup := "30-34"]
data[age >34 & age <40, agegroup := "35-39"]
data[age >39 & age <45, agegroup := "40-44"]
data[age >44 & age <50, agegroup := "45-49"]
data[age >49 & age <55, agegroup := "50-54"]
data[age >54 & age <60, agegroup := "55-59"]
data[age >59 & age <65, agegroup := "60-64"]
data[age >64 & age <70, agegroup := "65-69"]
data[age >69 & age <75, agegroup := "70-74"]
data[age >74 & age <80, agegroup := "75-79"]
data[age >79 & age <85, agegroup := "80-84"]
data[age >84, agegroup := "85+"]
Although the two approaches should give the same result for whole-number ages, I prefer the 1st one for two reasons: (a) it is shorter to write and (b) the age groups are ordered in the correct way, which is crucial when it comes to visualizing the data.
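The ordering point can be checked in base R. This is a small sketch with made-up ages and a shortened set of breaks (both are assumptions for illustration): cut() returns a factor whose levels follow the breaks, while a plain character column has no inherent order.

```r
# Illustrative ages, made up for this sketch
age <- c(3, 12, 7, 0, 22)

breaks <- c(0, 1, 5, 10, 15, 20, 25)
labels <- c("0-1", "1-4", "5-9", "10-14", "15-19", "20-24")

# cut() returns a factor whose levels follow the order of the breaks
grp_factor <- cut(age, breaks = breaks, right = FALSE, labels = labels)
levels(grp_factor)

# A character column sorts as text, so "5-9" ends up after "20-24"
grp_char <- as.character(grp_factor)
sort(unique(grp_char), method = "radix")

# The order can be restored by converting to a factor with explicit levels
grp_fixed <- factor(grp_char, levels = labels)
```

The last line is one way to get the same ordering out of the 2nd approach, where agegroup is created as a character column.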

Let's say that your ages were stored in the dataframe column labeled age. Your dataframe is df, and you want a new column age_grouping containing the "bucket" that your ages fall in.
In this example, suppose that your ages ranged from 0 -> 100, and you wanted to group them every 10 years. The following code would accomplish this by storing these intervals in a new age grouping column:
df$age_grouping <- cut(df$age, seq(0, 100, 10))
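One detail worth checking with this call is how cut() treats the interval edges. By default the intervals are open on the left, so an age of exactly 0 gets NA. A short sketch with a few made-up ages (illustrative values only):

```r
# Illustrative ages, chosen to sit on interval edges
age <- c(0, 10, 11, 100)

# Default: right = TRUE, include.lowest = FALSE, so intervals are (0,10], (10,20], ...
default_grp <- cut(age, seq(0, 100, 10))
is.na(default_grp)   # age 0 is outside (0,10] and becomes NA

# Left-closed intervals [0,10), [10,20), ...; include.lowest = TRUE also keeps age 100
closed_grp <- cut(age, seq(0, 100, 10), right = FALSE, include.lowest = TRUE)
as.character(closed_grp)
```

Which variant is right depends on whether a 10-year-old should belong to the first or the second bucket.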

Related

R - Sum total time of multiple overlapping and/or discontinuous periods

For example, these are the days on which certain types of roles are present in an office:
type  day_in  day_out
A     1       10
A     5       15
A     31      35
B     5       15
C     10      20
C     45      55
D     41      50
I want the number of days the office is occupied. There is a continuous office presence from days 1 to 20, 31 to 35, and 41 to 55, so the answer I want is 40 days.
I have a solution based on pivoting the data and setting flags on the days when the state switches between occupied and unoccupied, using a for loop to cycle through each row. But I came to this solution reluctantly after failing to work out a vectorized approach.
Is there a vectorized way to do the operation from my for loop? Or any ideas for different algorithms would also be welcome.
My solution with example data is below:
library(dplyr)
library(tidyr)
df_raw <- read.table(
header = TRUE,
text = "
type day_in day_out
A 1 10
A 5 15
A 31 35
B 5 15
C 10 20
C 45 55
D 41 50
"
)
# occupancy from day 1 to 20, 31 to 35 & 41 to 55 = 40 days
# Unoccupied for 15 days
df <- df_raw %>%
tidyr::pivot_longer(cols = c(day_in, day_out), names_to = "in_out", values_to = "day") %>%
arrange(day)
# Create these columns to prevent warning "Unknown or uninitialised column" later
df$current_types <- NA
df$flag <- NA
# Loop to create flags on day when occupancy switches from occupied to unoccupied or vice-versa
for (rown in 1:nrow(df)) {
  df$current_types[rown] <- if (rown == 1) {
    df$type[rown]
  } else {
    if (df$in_out[rown] == "day_in") {
      paste(df$current_types[rown - 1], df$type[rown], collapse = " ")
    } else {
      trimws(gsub(paste0("\\s?", df$type[rown], "\\s?"), " ", df$current_types[rown - 1]))
    }
  }
  # if there are no current type then unoccupied. It may or may not be occupied again afterwards.
  df$flag[rown] <- if (rown == 1 | (df$in_out[rown] == "day_out" & nchar(df$current_types[rown]) == 0)) {
    1
  } else {
    if (df$in_out[rown] == "day_in" & nchar(df$current_types[rown - 1]) == 0) 1 else 0
  }
}
# Then filter the flags, "pivot" to get each occupancy start and end in one row and sum the total days occupied
df %>%
filter(flag == 1) %>%
mutate(
start = if_else(in_out == "day_out" & lag(in_out) == "day_in", dplyr::lag(day), NULL),
stop = if_else(in_out == "day_out", day, NULL)
) %>%
filter(in_out == "day_out") %>%
summarise(days_occupied = sum(stop - start + 1))
You can generate day sequences for each role and count the number of unique days:
length(unique(unlist(apply(df_raw[, c('day_in', 'day_out')],
1,
function(x) seq(x[1], x[2])))))
Or using pipes:
df_raw[, c('day_in', 'day_out')] %>%
apply(1, function(x) seq(x[1], x[2])) %>%
unlist %>%
unique %>%
length
Another simple solution would be to create a vector with the size of your timespan and flag all occupied days and count them afterwards.
df <- data.frame(
type = c("A","A","A","B","C","C","D"),
day_in = c(1,5,31,5,10,45,41),
day_out = c(10,15,35,15,20,55,50))
occupation <- rep(0, max(df$day_out))
for(i in 1:nrow(df)){
occupation[df[i,'day_in']:df[i,'day_out']] <- 1
}
# 40
sum(occupation)
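A loop-free alternative, not taken from the answers above, is to merge overlapping periods after sorting by start day: a new block starts whenever day_in is later than every day_out seen so far. A minimal sketch under that assumption, in base R, with illustrative names (prev_max_out, block):

```r
df <- data.frame(
  type    = c("A", "A", "A", "B", "C", "C", "D"),
  day_in  = c(1, 5, 31, 5, 10, 45, 41),
  day_out = c(10, 15, 35, 15, 20, 55, 50))

# Sort by start day so that overlaps can be detected with a running maximum
df <- df[order(df$day_in), ]

# Latest day_out among all earlier rows (-Inf for the first row)
prev_max_out <- c(-Inf, head(cummax(df$day_out), -1))

# A new block begins when a period starts after everything before it has ended
block <- cumsum(df$day_in > prev_max_out)

# First and last day of each merged block, then the inclusive day count
starts <- tapply(df$day_in, block, min)
ends   <- tapply(df$day_out, block, max)
days_occupied <- sum(ends - starts + 1)
days_occupied
```

Unlike the sequence-based answers, this does not materialise every single day, which may matter if the periods span very long ranges.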

How to subset in RStudio properly?

I have created the following data frame:
age <- c(21,35,829,2)
sex <- c("m","f","m","c")
height <- c(181,173,171,166)
weight <- c(69,58,75,60)
dat <- as.data.frame(cbind(age,sex,height,weight), stringsAsFactors = FALSE)
dat$age <- as.numeric(age)
dat
I want to choose now only the rows of students which are older than 20 or younger than 80.
Why does this work : dat[dat$age<20| dat$age>80,] ; subset(dat, age < 20 | age > 80)
But this does not: dat[dat$age>20| dat$age<80,] ; subset(dat, age > 20 | age < 80)
I can subset the rows who are NOT younger than 80 or older than 20, but not those who are actually in this interval.
What is the mistake?
Thanks in advance.
Because your condition allows basically every possible age. Your conditions are independent (because you are using the | operator), so every row that satisfies at least one of them is selected by your filter. Every age in your data.frame is either higher than 20 or, if not, certainly lower than 80.
If you want to select every row with an age between 20 and 80, change the logical operator to &, so that both conditions must hold at the same time:
dat[dat$age>20 & dat$age<80,]
subset(dat, age > 20 & age < 80)
This results in:
age sex height weight
1 21 m 181 69
2 35 f 173 58
Now, if you want to select all the rows that are outside of this interval, you can negate the condition with the ! operator, as suggested by @r2evans in the comment section. It would be something like this:
dat[!(dat$age > 20 & dat$age < 80),]
subset(dat, !(age > 20 & age < 80))
This results in:
age sex height weight
3 829 m 171 75
4 2 c 166 60
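The difference between the two operators can be checked directly on the logical vectors. A small sketch using the ages from the question; the names inside and outside are only illustrative:

```r
age <- c(21, 35, 829, 2)

# With |, a row passes if either test is TRUE; every number is > 20 or < 80
either <- age > 20 | age < 80
all(either)

# With &, both tests must be TRUE, which selects the interval
inside <- age > 20 & age < 80

# Negating the interval equals flipping both tests and joining them with |
outside <- !inside
identical(outside, age <= 20 | age >= 80)
```

This is also why the first pair of commands in the question behaves as expected: age < 20 | age > 80 describes the outside of an interval, where "or" is the right connective.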
Why not use dplyr filter?
library(dplyr)
df_age <- dat %>%
  dplyr::filter(age > 20, age < 80)

Matching intervals with values in another table in R

Currently, I have one df and a price table.
df:
Order Number  wgt
01            22
02            5
03            35

Price table:
wgt_intvl  price
0-15       50
15-25      75
25-50      135
What I'd like is to match the weight from the df into an interval of the price table in R. For example, the first order (Order Number 01) corresponds with a price of 75. Therefore, I want to add a column in the first df, say df$cost that corresponds with the appropriate price according to wgt_intvl in the price table.
The way I see to do it is with an if-else statement, but this is highly inefficient and I was wondering if there is a better way to do it. In reality these tables are much longer - there is no logical "buildup" in price or weight interval. I have 15 weight intervals in this table. My current solution would look like this:
If(wgt < 15){
df$cost <- 50
} else if (wgt > 15 & wgt < 25){
df$cost <- 75
} else if(wgt > 25 & wgt < 50){
df$cost <- 135
}
This times fifteen, using the corresponding prices of the price table. I'd love a more efficient solution. Thanks in advance!
Using the data shown reproducibly in the Note at the end, form the vector of cutpoints (i.e. the first number in each interval) and then use findInterval to find the interval corresponding to the weight.
cutpoints <- as.numeric(sub("-.*", "", dfprice$wgt_intvl))
transform(dfmain, price = dfprice$price[findInterval(wgt, cutpoints)])
giving:
Order wgt price
1 01 22 75
2 02 5 50
3 03 35 135
4 04 25 135
Note
dfmain <- data.frame(Order = c("01", "02", "03", "04"), wgt = c(22, 5, 35, 25),
stringsAsFactors = FALSE)
dfprice <- data.frame(wgt_intvl = c("0-15", "15-25", "25-50"),
price = c(50, 75, 135), stringsAsFactors = FALSE)
Instead of an if-statement you could use a more efficient case_when operation:
library(dplyr)
df %>%
  mutate(cost = case_when(
    wgt < 15 ~ 50,
    wgt < 25 ~ 75,   # conditions are checked in order, so this covers 15 <= wgt < 25
    TRUE ~ 135))
Alternatively you could use cut() to transform wgt to wgt_intvl and match via left_join().
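The cut() idea can be sketched as follows, reusing dfmain and dfprice from the Note above. To keep the example free of package dependencies it uses base match() for the lookup instead of left_join(); it also assumes the intervals in the price table are sorted, contiguous and left-closed. The names edges and cost are illustrative.

```r
dfmain <- data.frame(Order = c("01", "02", "03", "04"), wgt = c(22, 5, 35, 25),
                     stringsAsFactors = FALSE)
dfprice <- data.frame(wgt_intvl = c("0-15", "15-25", "25-50"),
                      price = c(50, 75, 135), stringsAsFactors = FALSE)

# Interval edges parsed from the labels (assumes sorted, contiguous intervals)
edges <- unique(as.numeric(unlist(strsplit(dfprice$wgt_intvl, "-"))))

# Assign each weight to a left-closed interval, labelled like the price table
dfmain$wgt_intvl <- as.character(
  cut(dfmain$wgt, breaks = edges, labels = dfprice$wgt_intvl, right = FALSE)
)

# Look up the price that belongs to each interval label
dfmain$cost <- dfprice$price[match(dfmain$wgt_intvl, dfprice$wgt_intvl)]
dfmain
```

Weights outside the table's range get NA for both wgt_intvl and cost, which makes gaps in the price table visible instead of silently assigning a price.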

Creating a contingency table by hypergeometric sampling with the Titanic's database

I created a contingency table from the Titanic passenger data by hypergeometric sampling, meaning that both marginal totals are preset and equal. It was created by crossing the Sex and Survived columns of 328 cases (164 men and 164 women). This is the code:
First, I ungrouped the data and deleted the unneeded columns:
titanic = as.data.frame(Titanic)
titanic = titanic[rep(1:nrow(titanic),titanic$Freq),]
titanic = titanic[,c(2,4)]
Then I selected a sample of men:
men = subset(titanic, titanic$Sex == 'Male')
men = men [sample(nrow(men),164), ]
table(men$Sex, men$Survived)
# No Yes
# Male 133 31
# Female 0 0
Now the row of women must be filled in with the appropriate values:
n = summary.factor(men$Survived)
womenYes = subset(titanic, (titanic$Sex == 'Female' & titanic$Survived=='Yes'))
womenYes = subset(womenYes[1:n[1], ])
womenNo = subset(titanic, (titanic$Sex == 'Female' & titanic$Survived=='No'))
womenNo = subset(womenNo[1:n[2], ])
women = merge(womenYes, womenNo, all = TRUE)
hyperSample = merge(men, women, all = TRUE)
table(hyperSample$Sex, hyperSample$Survived)
# No Yes
# Male 133 31
# Female 31 133
It works, but it looks a bit ugly, and I think someone could find a much more elegant or efficient way to do it. Thanks.
You can sample in two stages, both using rhyper: First to determine the number of men and women subject to only sampling 328 and assuming populations were sex-distributed as in the original sample. This is what you might do if you were trying to bootstrap a statistic like a rate ratio. And then secondly, use rhyper twice more to determine the numbers of survivors subject to the same probabilities in the original sample rows.
MFmat <- apply(Titanic, c(2, 4), sum)
nMale <- rhyper(1, rowSums(MFmat)[1], rowSums(MFmat)[2], 328)
#[1] 262
nFemale <- 328 - nMale
DMale <- rhyper(1, MFmat[1,1], MFmat[1,2], nMale)
SurvMale = nMale-DMale
DFemale = rhyper(1, MFmat[2,1], MFmat[2,2], nFemale)
SurvFemale = nFemale - DFemale
matrix( c( DMale, DFemale, SurvMale, SurvFemale), ncol=2,
dimnames=dimnames(MFmat) )
#----
Survived
Sex No Yes
Male 223 42
Female 22 41
I suppose you could sample the two rows separately, and you should be able to use the logic above, if that is what you have decided to do. Which way is more appropriate will depend on the underlying problem.
# Fixed row marginals....
nMale <-164
nFemale <- 164
DMale <- rhyper(1, MFmat[1,1], MFmat[1,2], nMale)
SurvMale = nMale-DMale
DFemale = rhyper(1, MFmat[2,1], MFmat[2,2], nFemale)
SurvFemale = nFemale - DFemale
matrix( c( DMale, DFemale, SurvMale, SurvFemale), ncol=2,
dimnames=dimnames(MFmat) )
#----------------
Survived
Sex No Yes
Male 127 37
Female 39 125
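If all four margins are fixed at 164, as in the question, the 2x2 table has only one free cell. The question's procedure can then be sketched at the table level: draw the number of non-surviving men among 164 sampled men with rhyper and derive the other three cells. This is my reading of the asker's scheme, not the sampling shown above, and the names n, male_no and hyper_tab are illustrative.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Sex x Survived counts from the built-in Titanic table
MFmat <- apply(Titanic, c(2, 4), sum)

n <- 164
# Number of non-survivors among n men sampled without replacement
male_no <- rhyper(1, MFmat["Male", "No"], MFmat["Male", "Yes"], n)

# With every margin fixed at n, the remaining cells are determined
hyper_tab <- matrix(c(male_no, n - male_no, n - male_no, male_no),
                    ncol = 2, dimnames = dimnames(MFmat))
hyper_tab
```

The sketch does not check that the data contain enough women in each cell to fill the second row, which the row-level subsetting in the question depends on.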

Identifying groups of individuals if conditional occurrence in one of them (next)

I am coming back to a previous question for which I got nice suggestions, but I need an additional push: the idea is to create a binary variable whose value depends on the individual status of any related family member. This value is shared by all members of the same family. Here again is a reproducible example:
family <- factor(rep(c("001","002","003"), c(10,8,15)),levels=c("001","002","003"), labels=c("001","002","003"), ordered=TRUE)
sx <- c(1,2,2,2,1,2,2,2,1,1,2,1,2,1,2,1,2,2,2,2,1,2,1,2,1,2,1,2,1,2,1,2,2)
ag <- c(22,8,4,2,55,9,44,65,1,7,32,2,2,1,6,9,18,99,73,1,2,3,4,5,6,7,8,9,10,18,11,22,33)
st <- factor(rep(c("a","b","c"),11))
DF <- data.frame(family, ag,sx,st) ; DF
One nice trick proposed by @Psidom allowed me to create this new variable NoMan, taking the value 1 for all individuals from a family that does not include any man older than 16:
DF <- ddply(DF, .(family), transform, NoMan = +!any(sx == 1 & ag > 16)) ; DF ## works well !!
I am now trying to add another condition related to the age: NoMan also would equal 1 whenever any of the male family members older than 16 has "a" or "b" as the attribute for the factor st. I tried the following but this did not work:
DF <- ddply(DF, .(family), transform, NoMan = !any(sx == 1 & ag > 16) |
all(sx == 1 & ag > 16 & st=="a") |
all(sx == 1 & ag > 16 & st=="b")) ; DF
Any clue about the reason family 001 does not take the value 1 as NoMan? Thank you...
Compare with the following:
DF <- ddply(DF, .(family), transform, NoMan = +!any((sx == 1 & ag > 16) &
((sx == 1 & ag > 16 & st != "a") & (sx == 1 & ag > 16 & st != "b"))))
Or use a simplified one
DF <- ddply(DF, .(family), transform, NoMan = +!any(sx == 1 & ag > 16
& (st != "a" & st != "b")))
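The attempt in the question fails because all(sx == 1 & ag > 16 & st == "a") is only TRUE when every member of the family is a man over 16 with st "a", which no family with women or children can satisfy. The corrected logic asks instead whether any man over 16 has an st other than "a" or "b". For comparison, here is a sketch of the same logic in base R with ave() rather than ddply(); the names offending and fam1 are illustrative.

```r
family <- rep(c("001", "002", "003"), c(10, 8, 15))
sx <- c(1,2,2,2,1,2,2,2,1,1,2,1,2,1,2,1,2,2,2,2,1,2,1,2,1,2,1,2,1,2,1,2,2)
ag <- c(22,8,4,2,55,9,44,65,1,7,32,2,2,1,6,9,18,99,73,1,2,3,4,5,6,7,8,9,10,18,11,22,33)
st <- rep(c("a", "b", "c"), 11)
DF <- data.frame(family, ag, sx, st)

# TRUE for a man older than 16 whose st is neither "a" nor "b"
offending <- DF$sx == 1 & DF$ag > 16 & !(DF$st %in% c("a", "b"))

# Per family: 1 if no such man exists, 0 otherwise
DF$NoMan <- as.integer(ave(offending, DF$family, FUN = function(x) !any(x)))

# The all() term from the question, evaluated for family 001
fam1 <- DF[DF$family == "001", ]
all(fam1$sx == 1 & fam1$ag > 16 & fam1$st == "a")
```

With this example data, the two men over 16 in family 001 have st "a" and "b", so the family now gets NoMan equal to 1.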
