Identifying groups of individuals if conditional occurrence in one of them (next) - r

I am coming back to a previous question for which I got good suggestions, but I need an additional push. The idea is to create a binary variable whose value depends on the individual status of any related family member. This value is shared by all members of the same family. Here again is a reproducible example:
family <- factor(rep(c("001","002","003"), c(10,8,15)),levels=c("001","002","003"), labels=c("001","002","003"), ordered=TRUE)
sx <- c(1,2,2,2,1,2,2,2,1,1,2,1,2,1,2,1,2,2,2,2,1,2,1,2,1,2,1,2,1,2,1,2,2)
ag <- c(22,8,4,2,55,9,44,65,1,7,32,2,2,1,6,9,18,99,73,1,2,3,4,5,6,7,8,9,10,18,11,22,33)
st <- factor(rep(c("a","b","c"),11))
DF <- data.frame(family, ag,sx,st) ; DF
One nice trick proposed by @Psidom allowed me to create this new variable NoMan, which takes the value 1 for all individuals from a family that does not include any man older than 16:
library(plyr) # provides ddply()
DF <- ddply(DF, .(family), transform, NoMan = +!any(sx == 1 & ag > 16)) ; DF ## works well !!
I am now trying to add another condition: NoMan would also equal 1 whenever any of the male family members older than 16 has "a" or "b" as the value of the factor st. I tried the following, but it did not work:
DF <- ddply(DF, .(family), transform, NoMan = !any(sx == 1 & ag > 16) |
  all(sx == 1 & ag > 16 & st=="a") |
  all(sx == 1 & ag > 16 & st=="b")) ; DF
Any clue why family 001 does not get the value 1 for NoMan? Thank you.

Compare with the following:
DF <- ddply(DF, .(family), transform, NoMan = +!any((sx == 1 & ag > 16) &
  ((sx == 1 & ag > 16 & st != "a") & (sx == 1 & ag > 16 & st != "b"))))
Or use a simplified version:
DF <- ddply(DF, .(family), transform, NoMan = +!any(sx == 1 & ag > 16
  & (st != "a" & st != "b")))
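The attempt in the question most likely fails because all(sx == 1 & ag > 16 & st == "a") requires every row of the family to be a man older than 16 with status "a", which is FALSE as soon as the family contains a woman or a child. As a minimal sketch of the same logic in base R (assuming plyr is not wanted; the names adult_man, blocking and NoManOld are hypothetical), ave() can broadcast a per-family any() back to every row:

```r
# Sample data from the question
family <- rep(c("001", "002", "003"), c(10, 8, 15))
sx <- c(1,2,2,2,1,2,2,2,1,1,2,1,2,1,2,1,2,2,2,2,1,2,1,2,1,2,1,2,1,2,1,2,2)
ag <- c(22,8,4,2,55,9,44,65,1,7,32,2,2,1,6,9,18,99,73,1,2,3,4,5,6,7,8,9,10,18,11,22,33)
st <- rep(c("a", "b", "c"), 11)
DF <- data.frame(family, ag, sx, st)

# Row-level flags
adult_man <- DF$sx == 1 & DF$ag > 16                  # a man older than 16
blocking  <- adult_man & !(DF$st %in% c("a", "b"))    # ... whose status is neither "a" nor "b"

# First rule: 1 when the family has no man older than 16
DF$NoManOld <- as.integer(!ave(adult_man, DF$family, FUN = any))

# Refined rule: 1 unless the family has a man older than 16 with another status
DF$NoMan <- as.integer(!ave(blocking, DF$family, FUN = any))

unique(DF[, c("family", "NoManOld", "NoMan")])
```

With the sample data, family 001 gets 0 under the first rule but 1 under the refined rule, because its two men older than 16 have statuses "a" and "b".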

Related

Filter multiple values in 2 different columns for specified conditions

Sample data:
df <- data.frame(
  dr=c('john', 'jill', 'jill', 'john'),
  service=c('PT','SN','SN','PT'),
  Hours=c(6,5,4,8)
)
I tried filtering it to show only the rows with Hours greater than 4 for SN and greater than 6 for PT.
The output should be:
dr    service  Hours
jill  SN       5
john  PT       8
I have 2000 rows to filter this way and am unsure how to go about it.
You can specify the conditions as follows:
library(dplyr)
df <- data.frame(
  dr=c('john', 'jill', 'jill', 'john'),
  service=c('PT','SN','SN','PT'),
  Hours=c(6,5,4,8)
)
df %>%
  filter(
    (service == 'SN' & Hours > 4) |
    (service == 'PT' & Hours > 6)
  )
In base R you can try using subset:
subset(
  df,
  service == "SN" & Hours > 4 | service == "PT" & Hours > 6
)
or place the evaluation in brackets to extract rows that meet the conditions:
df[(df$service == "SN" & df$Hours > 4) | (df$service == "PT" & df$Hours > 6),]
Output
dr service Hours
2 jill SN 5
4 john PT 8
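When there are many services, writing one clause per service gets long. A minimal sketch, assuming the cutoffs can be kept in a named vector (thresholds and cutoff are hypothetical names, not part of the answers above), looks up each row's cutoff by its service and compares in one step:

```r
df <- data.frame(
  dr = c("john", "jill", "jill", "john"),
  service = c("PT", "SN", "SN", "PT"),
  Hours = c(6, 5, 4, 8),
  stringsAsFactors = FALSE
)

# One cutoff per service (hypothetical lookup table)
thresholds <- c(SN = 4, PT = 6)

# Look up the cutoff for each row; as.character() guards against factor columns
cutoff <- thresholds[as.character(df$service)]

# which() drops rows whose service has no cutoff (the comparison would be NA)
result <- df[which(df$Hours > cutoff), ]
result
```

Adding a new rule then only means adding one entry to thresholds.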

How to subset in RStudio properly?

I have created the following data frame:
age <- c(21,35,829,2)
sex <- c("m","f","m","c")
height <- c(181,173,171,166)
weight <- c(69,58,75,60)
dat <- as.data.frame(cbind(age,sex,height,weight), stringsAsFactors = FALSE)
dat$age <- as.numeric(age)
dat
I now want to select only the rows of students who are older than 20 or younger than 80.
Why does this work: dat[dat$age<20| dat$age>80,] ; subset(dat, age < 20 | age > 80)
But this does not: dat[dat$age>20| dat$age<80,] ; subset(dat, age > 20 | age < 80)
I can subset the rows of students who are NOT between 20 and 80, but not those who are actually in this interval.
What is the mistake?
Thanks in advance.
Because your condition allows basically every possible age. Your two conditions are independent (because you are using the | operator), so every row that satisfies at least one of them is selected by your filter. Every age in your data.frame is either higher than 20 or, if not, certainly lower than 80.
If you want to select every row with an age between 20 and 80, change the logical operator to & so that both conditions must hold, like this:
dat[dat$age>20 & dat$age<80,]
subset(dat, age > 20 & age < 80)
This results in:
age sex height weight
1 21 m 181 69
2 35 f 173 58
Now, if you want to select all the rows that are outside of this interval, you could negate this logical condition with the ! operator, as suggested by @r2evans in the comment section. It would be something like this:
dat[!(dat$age > 20 & dat$age < 80),]
subset(dat, !(age > 20 & age < 80))
This results in:
age sex height weight
3 829 m 171 75
4 2 c 166 60
Why not use dplyr filter?
library(dplyr)
df_age <- dat %>%
  dplyr::filter(age > 20, age < 80)
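To see the difference between the two operators directly, here is a small sketch on the ages from the question (the vector names either, both and outside are hypothetical):

```r
age <- c(21, 35, 829, 2)

# | is TRUE when at least one side holds: every age is > 20 or < 80
either <- age > 20 | age < 80

# & is TRUE only when both sides hold: ages strictly between 20 and 80
both <- age > 20 & age < 80

# Negating the & condition gives the ages outside the interval
outside <- !both

data.frame(age, either, both, outside)
```

Negating the & condition is equivalent to age <= 20 | age >= 80, which is why the first filter in the question appeared to work.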

R: sum of number of rows of one data based on row-specific dynamic conditions from another data

Consider the following data:
Country1 = c("Brazil", "India", "China","China","Brazil")
Date1<-as.Date(c("2001-01-21", "2002-04-13","2003-06-19","2006-06-19","2007-06-19"))
Name1<-c("B","C","A","A","A")
Data1<-data.frame(Country1,Date1,Name1)
Name2<-c("B","B","C","C","C","A","A","A")
Quality2<-c("good","good","medium","good","good","bad","good","good")
Country2<-c("China","Brazil","Taiwan","India","India","United States","China","Brazil")
Date2<-as.Date(c("2002-02-21", "1999-03-13","1998-08-19", "1996-09-13","2000-12-12","1998-07-21","2005-03-22","2003-06-19"))
Data2<-data.frame(Name2,Quality2,Country2,Date2)
In Data1, I want to add a column by the name of "Result". The "Result" (for each row of Data1) should be the sum of number of rows of Data2 which meet four conditions (1) Data2$Name2 should match row’s entry of Data1$Name1, (2) Data2$Country2 should match row’s entry of Data1$Country1, (3) Data2$Quality2 should be “good”, (4) Data2$Date2 should be less than row’s entry of Data1$Date1. So, Data1$Result should be 1, 2, 0, 1, and 1.
For example, for the first row, Data1$Result should be 1 because Data2 has only 1 row that meets these conditions:
sum(Data2$Name2==as.character(Data1$Name1)[1] & Data2$Country2==as.character(Data1$Country1)[1] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[1])
Or, in other words
sum(Data2$Name2=="B" & Data2$Country2=="Brazil" & Data2$Quality2=="good" & Data2$Date2 < "2001-01-21")
In the same way, for the second row, Data1$Result should be 2 because Data2 has 2 rows that meet these conditions:
sum(Data2$Name2==as.character(Data1$Name1)[2] & Data2$Country2==as.character(Data1$Country1)[2] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[2])
Or,
sum(Data2$Name2=="C" & Data2$Country2=="India" & Data2$Quality2=="good" & Data2$Date2 < "2002-04-13")
For the third row, Data1$Result should be 0 because Data2 does not have any row that meets these conditions:
sum(Data2$Name2==as.character(Data1$Name1)[3] & Data2$Country2==as.character(Data1$Country1)[3] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[3])
Alternatively,
sum(Data2$Name2=="A" & Data2$Country2=="China" & Data2$Quality2=="good" & Data2$Date2 < "2003-06-19")
Same goes for 4th and 5th rows:
sum(Data2$Name2==as.character(Data1$Name1)[4] & Data2$Country2==as.character(Data1$Country1)[4] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[4])
sum(Data2$Name2==as.character(Data1$Name1)[5] & Data2$Country2==as.character(Data1$Country1)[5] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[5])
As a beginner in R, I wrote the following code:
sum(Data2$Name2==as.character(Data1$Name1)[1:nrow(Data1)] & Data2$Country2==as.character(Data1$Country1)[1:nrow(Data1)] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[1:nrow(Data1)])
However, it does not return the desired outcome. I want to write a dynamic code based on row number of Data1. In my actual data, I have around 100,000 observations in each data.
Ideally, I am looking for some code that R reads depending on row number of Data1 “n”.
For example, for the 1st row, R should execute
sum(Data2$Name2==as.character(Data1$Name1)[1] & Data2$Country2==as.character(Data1$Country1)[1] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[1])
For the second row,
sum(Data2$Name2==as.character(Data1$Name1)[2] & Data2$Country2==as.character(Data1$Country1)[2] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[2])
For (let's say) the 54,342nd row
sum(Data2$Name2==as.character(Data1$Name1)[54342] & Data2$Country2==as.character(Data1$Country1)[54342] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[54342])
For the nth row
sum(Data2$Name2==as.character(Data1$Name1)[n] & Data2$Country2==as.character(Data1$Country1)[n] & Data2$Quality2=="good" & Data2$Date2 < Data1$Date1[n])
Also, I want to add another column in Data1 by the name of “Min.Date.Result” which gives me the smallest (oldest) value of Data2$Date2 which meets the same four conditions. So Data1$Min.Date.Result should be “1999-03-13”, “1996-09-13”,“NA”, "2005-03-22", "2003-06-19".
We can filter Data2 to keep the "good" rows, join it with Data1, group by Country2, and then count the rows where Date2 < Date1 and take the minimum of those dates.
library(dplyr)
Data2 %>%
  filter(Quality2 == 'good') %>%
  right_join(Data1, by = c('Name2' = 'Name1', 'Country2' = 'Country1')) %>%
  group_by(Country2) %>%
  summarise(Result = sum(Date2 < Date1),
            Date1 = min(Date2[Date2 < Date1]))
# A tibble: 3 x 3
# Country2 Result Date1
# <chr> <int> <date>
#1 Brazil 1 1999-03-13
#2 China 0 NA
#3 India 2 1996-09-13
For the updated data, we can change the approach and do:
Data1 %>%
  left_join(Data2, by = c('Name1' = 'Name2', 'Country1' = 'Country2')) %>%
  group_by(Country1, Date1) %>%
  summarise(Result = sum(Date2 < Date1 & Quality2 == "good"),
            Date = min(Date2[Date2 < Date1 & Quality2 == "good"]))
# Country1 Date1 Result Date
# <chr> <date> <int> <date>
#1 Brazil 2001-01-21 1 1999-03-13
#2 China 2003-06-19 0 NA
#3 China 2006-06-19 1 2005-03-22
#4 India 2002-04-13 2 1996-09-13
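The one-line attempt in the question does not work because == and < compare the vectors element by element (recycling the shorter one) instead of comparing each row of Data1 against all rows of Data2. As a minimal base R sketch of the row-by-row logic the question describes (matching_dates and earliest are hypothetical names; this loops over Data1, so it may be slow on 100,000 rows and is meant only to make the logic explicit):

```r
Data1 <- data.frame(
  Country1 = c("Brazil", "India", "China", "China", "Brazil"),
  Date1 = as.Date(c("2001-01-21", "2002-04-13", "2003-06-19", "2006-06-19", "2007-06-19")),
  Name1 = c("B", "C", "A", "A", "A"),
  stringsAsFactors = FALSE
)
Data2 <- data.frame(
  Name2 = c("B", "B", "C", "C", "C", "A", "A", "A"),
  Quality2 = c("good", "good", "medium", "good", "good", "bad", "good", "good"),
  Country2 = c("China", "Brazil", "Taiwan", "India", "India", "United States", "China", "Brazil"),
  Date2 = as.Date(c("2002-02-21", "1999-03-13", "1998-08-19", "1996-09-13",
                    "2000-12-12", "1998-07-21", "2005-03-22", "2003-06-19")),
  stringsAsFactors = FALSE
)

# For row i of Data1, collect Date2 of every Data2 row meeting the four conditions
matching_dates <- lapply(seq_len(nrow(Data1)), function(i) {
  keep <- Data2$Name2 == Data1$Name1[i] &
    Data2$Country2 == Data1$Country1[i] &
    Data2$Quality2 == "good" &
    Data2$Date2 < Data1$Date1[i]
  Data2$Date2[keep]
})

# Number of matching rows per Data1 row
Data1$Result <- lengths(matching_dates)

# Earliest matching date, or NA when nothing matches
earliest <- vapply(
  matching_dates,
  function(d) if (length(d) > 0) as.numeric(min(d)) else NA_real_,
  numeric(1)
)
Data1$Min.Date.Result <- as.Date(earliest, origin = "1970-01-01")
Data1
```

This keeps one output row per row of Data1, including rows that share a country with another row.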

Create group id column using dplyr if numbers between certain values [duplicate]

I am trying to categorize a numeric variable (age) into groups defined by intervals so it will not be continuous. I have this code:
data$agegrp(data$age >= 40 & data$age <= 49) <- 3
data$agegrp(data$age >= 30 & data$age <= 39) <- 2
data$agegrp(data$age >= 20 & data$age <= 29) <- 1
The above code is not working under the survival package. It gives me:
invalid function in complex assignment
Can you point out where the error is? data is the data frame I am using.
I would use findInterval() here:
First, make up some sample data:
set.seed(1)
ages <- floor(runif(20, min = 20, max = 50))
ages
# [1] 27 31 37 47 26 46 48 39 38 21 26 25 40 31 43 34 41 49 31 43
Use findInterval() to categorize your "ages" vector.
findInterval(ages, c(20, 30, 40))
# [1] 1 2 2 3 1 3 3 2 2 1 1 1 3 2 3 2 3 3 2 3
Alternatively, as recommended in the comments, cut() is also useful here:
cut(ages, breaks=c(20, 30, 40, 50), right = FALSE)
cut(ages, breaks=c(20, 30, 40, 50), right = FALSE, labels = FALSE)
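As for the error itself, it comes from the parentheses: data$agegrp(...) is read as a function call, whereas subsetting a column uses square brackets. A minimal sketch of the corrected assignments, on a hypothetical six-row data frame, compared against findInterval():

```r
# Hypothetical sample data covering the edges of each group
data <- data.frame(age = c(20, 29, 30, 39, 40, 49))

# Create the column first, then fill it with bracket subsetting
data$agegrp <- NA_real_
data$agegrp[data$age >= 40 & data$age <= 49] <- 3
data$agegrp[data$age >= 30 & data$age <= 39] <- 2
data$agegrp[data$age >= 20 & data$age <= 29] <- 1

# The same grouping in one call
data$agegrp_fi <- findInterval(data$age, c(20, 30, 40))
data
```

Both columns agree for ages from 20 to 49; findInterval() would additionally return 0 for ages below 20 and 3 for ages of 50 or more.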
We can use dplyr:
library(dplyr)
data <- data %>% mutate(agegroup = case_when(age >= 40 & age <= 49 ~ '3',
                                             age >= 30 & age <= 39 ~ '2',
                                             age >= 20 & age <= 29 ~ '1')) # end function
Compared to other approaches, dplyr is easier to write and interpret.
This answer provides two ways to solve the problem using the data.table package, which would greatly improve the speed of the process. This is crucial if one is working with large data sets.
1st Approach: an adaptation of the previous answer, but now using data.table and including labels:
library(data.table)
agebreaks <- c(0,1,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,500)
agelabels <- c("0-1","1-4","5-9","10-14","15-19","20-24","25-29","30-34",
               "35-39","40-44","45-49","50-54","55-59","60-64","65-69",
               "70-74","75-79","80-84","85+")
setDT(data)[ , agegroups := cut(age,
                                breaks = agebreaks,
                                right = FALSE,
                                labels = agelabels)]
2nd Approach: This is a wordier method, but it makes it clearer what exactly falls within each age group:
setDT(data)[age <1, agegroup := "0-1"]
data[age >0 & age <5, agegroup := "1-4"]
data[age >4 & age <10, agegroup := "5-9"]
data[age >9 & age <15, agegroup := "10-14"]
data[age >14 & age <20, agegroup := "15-19"]
data[age >19 & age <25, agegroup := "20-24"]
data[age >24 & age <30, agegroup := "25-29"]
data[age >29 & age <35, agegroup := "30-34"]
data[age >34 & age <40, agegroup := "35-39"]
data[age >39 & age <45, agegroup := "40-44"]
data[age >44 & age <50, agegroup := "45-49"]
data[age >49 & age <55, agegroup := "50-54"]
data[age >54 & age <60, agegroup := "55-59"]
data[age >59 & age <65, agegroup := "60-64"]
data[age >64 & age <70, agegroup := "65-69"]
data[age >69 & age <75, agegroup := "70-74"]
data[age >74 & age <80, agegroup := "75-79"]
data[age >79 & age <85, agegroup := "80-84"]
data[age >84, agegroup := "85+"]
Although the two approaches should give the same result, I prefer the 1st one for two reasons: (a) it is shorter to write and (b) the age groups are ordered in the correct way, which is crucial when it comes to visualizing the data.
Let's say that your ages are stored in the data frame column labeled age. Your data frame is df, and you want a new column age_grouping containing the "bucket" that each age falls in.
In this example, suppose that your ages range from 0 to 100, and you want to group them every 10 years. The following code would accomplish this by storing these intervals in a new age_grouping column:
df$age_grouping <- cut(df$age, seq(0, 100, 10))
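One detail worth checking with this call: by default cut() builds intervals that are open on the left and closed on the right, so an age of exactly 0 falls outside the first interval and becomes NA. A small sketch on hypothetical ages, showing the default next to include.lowest = TRUE:

```r
# Hypothetical ages at the boundaries
df <- data.frame(age = c(0, 10, 11, 100))

# Default: intervals (0,10], (10,20], ..., (90,100]; age 0 is not covered
default_groups <- cut(df$age, seq(0, 100, 10))

# include.lowest = TRUE closes the first interval on the left as well
closed_groups <- cut(df$age, seq(0, 100, 10), include.lowest = TRUE)

data.frame(age = df$age, default_groups, closed_groups)
```

If an age of 10 should belong to the 10-19 bucket instead, right = FALSE flips the closed side, as in the data.table answer.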

How to efficiently match offerlines based on a rolling window in R

I currently have more than 600,000 offerlines, which I want to match efficiently based on the product bought and the timeframe.
By timeframe I mean that, starting from the base line, I look at all offerlines dated at most 10 days before or 10 days after it. Every line in that window with the same product should be matched.
However, my current approach is very time-consuming: after running for a complete night, it only got to line 45,000.
I know parallelism is one option, but I want to know if there are better ways (packages, functions, logic).
Input data
Offerline n°,Customer n°,Offerdate,Product
(we clean to 1 offerline n° per day per custno, for a certain product)
Logic => match lines with same product, different Customer n°
Desired output
Base customer, Related Customer, Offerline n°, Matched Offerline n°, Product, Offerdate base, Offerdate matched line.
Current code
for(i in 1:nrow(X)){
  sku <- X[i,]$product
  date <- X[i,]$order.offer_date
  cust <- X[i,]$customer_code
  oon <- X[i,]$order.offer_number
  F <- data.frame()
  F <- X %>%
    filter(product == (X[i,]$product) & (order.offer_date <= date + 10 & order.offer_date >= date - 10)& customer_code != cust)
  if(nrow(F)== 0){next}
  else{
    for(j in 1:nrow(F)){
      skuc <- F[j,]$product
      datec <- F[j,]$order.offer_date
      custc <- F[j,]$customer_code
      oonc <- F[j,]$order.offer_number
      if(custc == cust | oon == oonc){next}
      else if(skuc != sku){next}
      else if(skuc == sku){
        if(datec <= date + 10 & datec >= date - 10){
          z <- z + 1
          Y[z,]$count <- j
          Y[z,]$base <- oon
          Y[z,]$related <- oonc
          Y[z,]$baseSku <- sku
          Y[z,]$relSku <- skuc
          Y[z,]$basedate <- as.Date(date)
          Y[z,]$reldate <- as.Date(datec)
          Y[z,]$basecust <- cust
          Y[z,]$relcust <- custc
        }
        else{next}
      }
    }
    next
  }
}
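One way to avoid the nested loops is to let a join do the pairing. The following is only a sketch under stated assumptions: the column names are taken from the code above, the five sample rows are hypothetical, and the self-join by product may need a lot of memory on 600,000 rows if single products have very many offerlines (a package with range joins might scale better, which is not shown here). It pairs every offerline with every other offerline for the same product, then keeps the pairs from different customers that are at most 10 days apart:

```r
# Hypothetical sample data using the column names from the loop above
X <- data.frame(
  order.offer_number = 1:5,
  customer_code = c("A", "B", "C", "A", "B"),
  order.offer_date = as.Date(c("2020-01-01", "2020-01-05", "2020-01-20",
                               "2020-01-08", "2020-02-15")),
  product = c("p1", "p1", "p1", "p2", "p2"),
  stringsAsFactors = FALSE
)

# Self-join: every pair of offerlines sharing a product
pairs <- merge(X, X, by = "product", suffixes = c(".base", ".rel"))

# Distance in days between the two offer dates of each pair
day_gap <- abs(as.numeric(pairs$order.offer_date.rel - pairs$order.offer_date.base))

# Keep pairs from different customers within the 10-day window
keep <- pairs$customer_code.base != pairs$customer_code.rel & day_gap <= 10
Y <- pairs[keep, ]
Y
```

Each match appears twice, once with each offerline as the base line, which mirrors what the loop produces.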
