Subset data based on condition in R [duplicate]

I want to select those households where every member's age is greater than 20 in R.
household Members_age
100 75
100 74
100 30
101 20
101 50
101 60
102 35
102 40
102 5
Here two households satisfy the condition: households 100 and 101.
How can I do this in R?
What I did is the following, but it's not working.
sqldf("select household,Members_age from data group by household having Members_age > 20")
household Members_age
100 75
102 35
Please suggest a solution. Here is the sample dataset:
library(dplyr)
library(sqldf)
data <- data.frame(household = c(100,100,100,101,101,101,102,102,102),
Members_age = c(75,74,30,20,50,60,35,40,5))

You can use ave.
data[ave(data$Members_age, data$household, FUN=min) > 20,]
# household Members_age
#1 100 75
#2 100 74
#3 100 30
Or, to get only the households:
unique(data$household[ave(data$Members_age, data$household, FUN=min) > 20])
#[1] 100
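To see why this works, it may help to look at what ave returns on its own. A small sketch using the sample data from the question (grp_min and keep are just illustrative names, not part of the answer above):

```r
data <- data.frame(household = c(100,100,100,101,101,101,102,102,102),
                   Members_age = c(75,74,30,20,50,60,35,40,5))

# ave() applies FUN within each household and returns a vector as long
# as the input, so every row carries its household's minimum age
grp_min <- ave(data$Members_age, data$household, FUN = min)
grp_min
#[1] 30 30 30 20 20 20  5  5  5

# a household passes only if its youngest member is older than 20
keep <- grp_min > 20
data[keep, ]
```

Because the result lines up row by row with data, it can be used directly as a logical row index.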

I understand SQL's HAVING clause, but your request "all member's age is greater than 20" does not match your sqldf output. This is because HAVING with a bare column is really only looking at one row for each household (here the first), which is why we see 102 (and shouldn't) and why we don't see 101 (which is right, but only because its first row happens to be 20).
To implement your logic, I suggest changing your sqldf code to the following:
sqldf("select household,Members_age from data group by household having min(Members_age) > 20")
# household Members_age
# 1 100 30
which is effectively the SQL analog of GKi's ave answer.
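The difference between the two queries can be imitated in base R. This is only a sketch: it assumes the bare column in HAVING is taken from the first row of each household, which is what the output in the question suggests but which SQLite does not promise. The names first_age and min_age are made up for the illustration.

```r
data <- data.frame(household = c(100,100,100,101,101,101,102,102,102),
                   Members_age = c(75,74,30,20,50,60,35,40,5))

# what the original query effectively tested: one row per household
# (assumed here to be the first one)
first_age <- tapply(data$Members_age, data$household, function(x) x[1])
names(first_age)[first_age > 20]
#[1] "100" "102"

# what having min(Members_age) > 20 tests: the youngest member
min_age <- tapply(data$Members_age, data$household, min)
names(min_age)[min_age > 20]
#[1] "100"
```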
An alternative:
library(dplyr)
data %>%
group_by(household) %>%
filter(all(Members_age > 20)) %>%
ungroup()
# # A tibble: 3 x 2
# household Members_age
# <dbl> <dbl>
# 1 100 75
# 2 100 74
# 3 100 30
and if you just need one row per household, then add %>% distinct(household) or perhaps %>% distinct(household, .keep_all = TRUE).
But for base R, I think nothing is likely to be better than GKi's use of ave.

Related

How to merge rows in a data frame when the values in a column are the same in R [duplicate]

I have a number of observations from the same unit, and I need to merge the rows. So a data frame like
data.frame(
fir =c("001","006","001", "006", "062"),
sec = c(10,5,6,7,8),
thd = c(45,67,84,54,23))
fir sec thd
001 10 45
006 5 67
001 6 84
006 7 54
062 8 23
The first column has a 3 digit number representing a unit. I need to add the rows together to get a total for each unit. The other columns are numeric values that need adding together. So the dataframe would look like,
fir sec thd
001 16 129
006 12 121
062 8 23
I need it to work for any unique number in the first column.
Any ideas? Thank you for any help!
Welcome! This is a classic case of a group-by operation: we group the rows by fir and, in this case, take the sum of the sec and thd columns within each group.
library(tidyverse)
df <- data.frame(
fir =c("001","006","001", "006", "062"),
sec = c(10,5,6,7,8),
thd = c(45,67,84,54,23))
df %>%
group_by(fir) %>%
summarise(sec_sum = sum(sec),
thd_sum = sum(thd))
We can do a group by 'sum'
library(dplyr)
df1 %>%
group_by(fir) %>%
summarise_all(sum)
# A tibble: 3 x 3
# fir sec thd
# <fct> <dbl> <dbl>
#1 001 16 129
#2 006 12 121
#3 062 8 23
Or with aggregate from base R
aggregate(. ~ fir, df1, sum)
data
df1 <- data.frame(
fir =c("001","006","001", "006", "062"),
sec = c(10,5,6,7,8),
thd = c(45,67,84,54,23))

Rank most recent scores of students within a given date - 30 days window

Following is what my dataframe/data.table looks like. The rank column is my desired calculated field.
library(data.table)
df <- fread('
Name Score Date Rank
John 42 1/1/2018 3
Rob 85 12/31/2017 2
Rob 89 12/26/2017 1
Rob 57 12/24/2017 1
Rob 53 08/31/2017 1
Rob 72 05/31/2017 2
Kate 87 12/25/2017 1
Kate 73 05/15/2017 1
')
df[,Date:= as.Date(Date, format="%m/%d/%Y")]
I am trying to calculate the rank of each student at every given point in time in the data within a 30-day window. For that, I need to fetch the most recent scores of all students at a given point in time and then pass them to the rank function.
In the 1st row, as of 1/1/2018, John has two competitors in the past 30-day window: Rob, with the most recent score of 85 on 12/31/2017, and Kate, with the most recent score of 87 on 12/25/2017. Both of these dates fall within the 1/1/2018 - 30 day window. John gets a rank of 3 with the lowest score of 42. If only one student falls within the date (at a given row) - 30 day window, then the rank is 1.
In the 3rd row the date is 12/26/2017, so Rob's score as of 12/26/2017 is 89. There is only one other student who falls in the time window of 12/26/2017 - 30 days, and that is Kate with her most recent score (87) on 12/25/2017. So within the time window of 12/26/2017 - 30 days, Rob's score of 89 is higher than Kate's score of 87, and therefore Rob gets rank 1.
I was thinking about using the framework from "Efficient way to perform running total in the last 365 day window", but I am struggling to think of a way to fetch the most recent scores of all students at a given point in time before using rank.
This seems to work:
ranks = df[.(d_dn = Date - 30L, d_up = Date), on=.(Date >= d_dn, Date <= d_up), allow.cart=TRUE][,
.(LatestScore = last(Score)), by=.(Date = Date.1, Name)]
setorder(ranks, Date, -LatestScore)
ranks[, r := rowid(Date)]
df[ranks, on=.(Name, Date), r := i.r]
Name Score Date Rank r
1: John 42 2018-01-01 3 3
2: Rob 85 2017-12-31 2 2
3: Rob 89 2017-12-26 1 1
4: Rob 57 2017-12-24 1 1
5: Rob 53 2017-08-31 1 1
6: Rob 72 2017-05-31 2 2
7: Kate 87 2017-12-25 1 1
8: Kate 73 2017-05-15 1 1
... using last since the Cartesian join seems to sort and we want the latest measurement.
How the update join works
The i. prefix means it's a column from i in the x[i, ...] join, and the assignment := is always in x. So it's looking up each row of i in x and where matches are found, copying values from i to x.
Another way that is sometimes useful is to look up x rows in i, something like df[, r := ranks[df, on=.(Name,Date), x.r]] in which case x.r is still from the ranks table (now in the x position relative to the join).
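A toy version of both forms may make the prefixes easier to follow. This is only a sketch: the tables x and i and the column r2 are made up for the illustration and stand in for df and ranks.

```r
library(data.table)

x <- data.table(Name = c("A", "B", "C"), Score = c(10, 20, 30))
i <- data.table(Name = c("C", "A"), r = c(1L, 2L))

# update join: look up each row of i in x and copy i's r into x
x[i, on = .(Name), r := i.r]
x$r
# 2 NA 1  (B has no match in i, so it stays NA)

# the other direction: look up each row of x in i; inside this inner
# join i sits in the x position, so its column is referred to as x.r
x[, r2 := i[x, on = .(Name), x.r]]
x$r2
# 2 NA 1
```

Both forms give the same column here; they differ mainly in which table drives the lookup.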
There's also...
ranks = df[CJ(Name = Name, Date = Date, unique=TRUE), on=.(Name, Date), roll=30, nomatch=0]
setnames(ranks, "Score", "LatestScore")
# and then use the same last three lines above
I'm not sure about efficiency of one vs another, but I guess it depends on number of Names, frequency of measurement and how often measurement days coincide.
A solution that uses data.table though not sure if it is the most efficient usage:
df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date),
.(Rank=frank(-c(iScore[1L], .SD[Name != iName, max(Score), by=.(Name)]$V1),
ties.method="first")[1L]),
by=.EACHI,
on=.(Date >= StartDate, Date <= EndDate)]
Explanation:
1) The outer square brackets do a non-equi join within a date range (i.e. from 30 days before the date of each row up to that date). Try studying the output below against the input data:
# keep a copy of Date first: in the join result, Date holds the bounds from i
df[, OrigDate := Date]
df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date),
c(.(RowGroup=.GRP),
.SD[, .(Name, Score, Date, OrigDate, iName, iScore, iDate, StartDate, EndDate)]),
by=.EACHI,
on=.(Date >= StartDate, Date <= EndDate)]
2) by=.EACHI performs the j calculation for each row of i.
3) Inside j, iScore[1L] is the score for the current row, and .SD[Name != iName] takes the scores that do not belong to the student in the current row. Then we take max(Score) for each of those other students within the 30-day window.
4) Concatenate all these scores and calculate the rank of the current row's score, breaking ties in favor of the first occurrence.
Note:
See ?data.table to understand what i, j, by, on and .EACHI refer to.
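Step 4 can be tried in isolation. The sketch below uses base rank as a stand-in for frank (an assumption made to keep it dependency-free) and takes the scores the question lists for John's row on 1/1/2018: his own score first, then the competitors' most recent scores within the window.

```r
# current row's score first, then the other students' scores
scores <- c(John = 42, Rob = 85, Kate = 87)

# negate so the highest score gets rank 1; ties go to the first occurrence
ranks <- rank(-scores, ties.method = "first")

# the first element is the rank of the current row
unname(ranks[1L])
#[1] 3
```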
EDIT after comments by OP:
I would add an OrigDate column and pick the scores that match the latest date.
df[, OrigDate := Date]
df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date),
.(Name=iName, Score=iScore, Date=iDate,
Rank=frank(-c(iScore[1L],
.SD[Name != iName, Score[OrigDate==max(OrigDate)], by=.(Name)]$V1),
ties.method="first")[1L]),
by=.EACHI,
on=.(Date >= StartDate, Date <= EndDate)]
I came up with the following partial solution, but ran into a problem: is it possible for two people to appear with the same date?
If not, have a look at the following piece of code:
library(tidyverse) # easy manipulation
library(lubridate) # time handling
# This function can be added to
get_top <- function(df, date_sel) {
temp <- df %>%
filter(Date > date_sel - months(1)) %>% # look one month in the past from given date
group_by(Name) %>% # and for each occurring name
summarise(max_score = max(Score)) %>% # find the maximal score
arrange(desc(max_score)) %>% # sort them
mutate(Rank = 1:n()) # and rank them
temp
}
Now you have to find the name in the table for the given date and return its rank.
library(data.table)
library(magrittr)
setorder(df, -Date)
fun <- function(i){
df[i:nrow(df), head(.SD, 1), by = Name] %$%
rank(-Score[Date > df$Date[i] - 30])[1]
}
df[, rank := sapply(1:.N, fun)]
This can be done by joining to each row of df those rows of df that fall within the 30 days before it (or on the same date) and have higher or equal scores. Then, for each original row and each joined Name, keep only the most recent joined row. The count of the remaining joined rows for each original row of df is the rank.
library(sqldf)
sqldf("with X as
(select a.rowid r, a.*, max(b.Date) Date
from df a join df b
on b.Date between a.Date - 30 and a.Date and b.Score >= a.Score
group by a.rowid, b.Name)
select Name, Date, Score, count(*) Rank
from X
group by r
order by r")
giving:
Name Date Score Rank
1 John 2018-01-01 42 3
2 Rob 2017-12-31 85 2
3 Rob 2017-12-26 89 1
4 Rob 2017-12-24 57 1
5 Rob 2017-08-31 53 1
6 Rob 2017-05-31 72 2
7 Kate 2017-12-25 87 1
8 Kate 2017-05-15 73 1
A tidyverse solution (dplyr + tidyr):
df %>%
complete(Name,Date) %>%
group_by(Name) %>%
mutate(last_score_date = `is.na<-`(Date,is.na(Score))) %>%
fill(Score,last_score_date) %>%
filter(!is.na(Score) & Date-last_score_date <30) %>%
group_by(Date) %>%
mutate(Rank = rank(-Score)) %>%
right_join(df)
# # A tibble: 8 x 5
# # Groups: Date [?]
# Name Date Score last_score_date Rank
# <chr> <date> <int> <date> <dbl>
# 1 John 2018-01-01 42 2018-01-01 3
# 2 Rob 2017-12-31 85 2017-12-31 2
# 3 Rob 2017-12-26 89 2017-12-26 1
# 4 Rob 2017-12-24 57 2017-12-24 1
# 5 Rob 2017-08-31 53 2017-08-31 1
# 6 Rob 2017-05-31 72 2017-05-31 2
# 7 Kate 2017-12-25 87 2017-12-25 1
# 8 Kate 2017-05-15 73 2017-05-15 1
We add all missing combinations of Date and Name.
Then we create a column last_score_date, equal to Date when Score isn't NA.
By filling NAs down, Score becomes the latest score.
We filter out NAs and keep only scores that are less than 30 days old.
That's our table of valid scores by date.
From there it's easy to add ranks,
and a final right_join on the original table gives us the expected output.
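The `is.na<-` call in the mutate step is the least familiar part of the steps above. A minimal base R sketch of what it does, using made-up vectors rather than the real table:

```r
Date  <- as.Date(c("2017-12-24", "2017-12-25", "2017-12-26"))
Score <- c(57, NA, 89)

# `is.na<-`(x, value) returns x with NA at the positions where value is TRUE,
# so the dates survive only on rows that actually have a score
last_score_date <- `is.na<-`(Date, is.na(Score))
last_score_date
#[1] "2017-12-24" NA           "2017-12-26"
```

Filling this column downwards then carries the date of the last real score onto the rows that complete() added.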
data
library(data.table)
df <- fread('
Name Score Date
John 42 01/01/2018
Rob 85 12/31/2017
Rob 89 12/26/2017
Rob 57 12/24/2017
Rob 53 08/31/2017
Rob 72 05/31/2017
Kate 87 12/25/2017
Kate 73 05/15/2017
')
df[,Date:= as.Date(Date, format="%m/%d/%Y")]

Grouping within group in R, plyr/dplyr

I'm working on the baseball data set:
data(baseball, package="plyr")
library(dplyr)
baseball[,1:4] %>% head
id year stint team
4 ansonca01 1871 1 RC1
44 forceda01 1871 1 WS3
68 mathebo01 1871 1 FW1
99 startjo01 1871 1 NY2
102 suttoez01 1871 1 CL1
106 whitede01 1871 1 CL1
First I want to group the data set by team in order to find the first year each team appears, and the number of distinct players that have ever played for each team:
baseball[,1:4] %>% group_by(team) %>%
summarise("first_year"=min(year), "num_distinct_players"=n_distinct(id))
# A tibble: 132 × 3
team first_year num_distinct_players
<chr> <int> <int>
1 ALT 1884 1
2 ANA 1997 29
3 ARI 1998 43
4 ATL 1966 133
5 BAL 1954 158
Now I want to add a column showing the maximum number of years any player (id) has played for the team in question. To do this, I need to somehow group by player within the existing group (team), and select the maximum number of rows. How do I do this?
Perhaps this helps
baseball %>%
select(1:4) %>%
group_by(id, team) %>%
dplyr::mutate(nyear = n_distinct(year)) %>%
group_by(team) %>%
dplyr::summarise(first_year = min(year),
num_distinct_players = n_distinct(id),
maxYear = max(nyear))
I tried doing this with base R and came up with this. It's fairly slow.
df = data.frame(t(sapply(split(baseball, baseball$team), function(x)
cbind( min(x$year),
length(unique(x$id)),
max(sapply(split(x,x$id), function(y)
nrow(y))),
names(which.max(sapply(split(x,x$id), function(y)
nrow(y)))) ))))
colnames(df) = c("Year", "Unique Players", "Longest played duration",
"Longest Playing Player")
First, split by team into different groups
For each group, obtain the minimum year as first year when the team appears
Get length of unique ids which is the number of players in that team
Split each group into subgroup by id and obtain the maximum number of rows that will give the maximum duration played by a player in that team
For each subgroup, get names of the id with maximum rows which gives the name of the player that played for the longest time in that team
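The same steps can be followed on a small table where the result is easy to check by hand. This is a sketch on made-up data (toy, res and the column names are illustrative), and like the code above it counts rows per player, not distinct years.

```r
toy <- data.frame(id   = c("a", "a", "b", "a", "c", "c"),
                  team = c("X", "X", "X", "Y", "Y", "Y"),
                  year = c(2000, 2001, 2000, 2002, 2001, 2002),
                  stringsAsFactors = FALSE)

# split by team, then summarise each team's rows
res <- do.call(rbind, lapply(split(toy, toy$team), function(x) {
  n_rows <- table(x$id)  # rows per player within this team
  data.frame(team                 = x$team[1],
             first_year           = min(x$year),
             num_distinct_players = length(unique(x$id)),
             max_rows             = max(n_rows),
             longest_player       = names(which.max(n_rows)),
             stringsAsFactors     = FALSE)
}))
res
```

Team X first appears in 2000 and its longest-serving player is a (2 rows); team Y first appears in 2001 and its longest-serving player is c (2 rows).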

How to assign a value depending on two conditions including column names. (add environmental variable to tracking data)

I have a data frame (track) with the position (longitude and latitude) and date (day of the year) of tracking points for different animals, and another data frame (Var) which gives the mean temperature for every day of the year at different locations.
I would like to add a new column Temp to my data frame (track), where the value comes from Var and corresponds to the date and GPS location of each tracking point in track.
Here is a really simple subset of my data and what I would like to obtain.
track = data.frame(
animals=c(1,1,1,2,2),
Longitude=c(117,116,117,117,116),
Latitude=c(18,20,20,18,20),
Day=c(1,3,4,1,5))
Var = data.frame(
Longitude=c(117,117,116,116),
Latitude=c(18,20,18,20),
Day1=c(22,23,24,21),
Day2=c(21,28,27,29),
Day3=c(12,13,14,11),
Day4=c(17,19,20,23),
Day5=c(32,33,34,31)
)
TrackPlusVar = data.frame(
animals=c(1,1,1,2,2),
Longitude=c(117,116,117,117,116),
Latitude=c(18,20,20,18,20),
Day=c(1,3,4,1,5),
Temp= c(22,11,19,22,31)
)
I have no idea how to look up the value for the same date and GPS location, since the date is encoded in a column name. Any idea would be very useful!
This is a dplyr and tidyr approach.
library(dplyr)
library(tidyr)
# reshape table Var
Var %>%
gather(Day,Temp,-Longitude, -Latitude) %>%
mutate(Day = as.numeric(gsub("Day","",Day))) -> Var2
# join tables
track %>% left_join(Var2, by=c("Longitude", "Latitude", "Day"))
# animals Longitude Latitude Day Temp
# 1 1 117 18 1 22
# 2 1 116 20 3 11
# 3 1 117 20 4 19
# 4 2 117 18 1 22
# 5 2 116 20 5 31
If the process that creates your tables makes sure that all your cases belong to both tables, then you can use inner_join instead of left_join to make the process faster.
If you're still not happy with the speed you can use a data.table join process to check if it is faster, like:
library(data.table)
Var2 = setDT(Var2, key = c("Longitude", "Latitude", "Day"))
track = setDT(track, key = c("Longitude", "Latitude", "Day"))
Var2[track][order(animals,Day)]
# Longitude Latitude Day Temp animals
# 1: 117 18 1 22 1
# 2: 116 20 3 11 1
# 3: 117 20 4 19 1
# 4: 117 18 1 22 2
# 5: 116 20 5 31 2
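For comparison, the lookup can also be done in base R without reshaping, by treating Var as a matrix and indexing it with (row, column) pairs. This is a sketch, not part of the answer above; row_idx and col_idx are illustrative names, and it assumes every tracking point has a matching location and DayN column in Var.

```r
track <- data.frame(animals = c(1,1,1,2,2),
                    Longitude = c(117,116,117,117,116),
                    Latitude = c(18,20,20,18,20),
                    Day = c(1,3,4,1,5))
Var <- data.frame(Longitude = c(117,117,116,116),
                  Latitude = c(18,20,18,20),
                  Day1 = c(22,23,24,21), Day2 = c(21,28,27,29),
                  Day3 = c(12,13,14,11), Day4 = c(17,19,20,23),
                  Day5 = c(32,33,34,31))

# row of Var holding each tracking point's location
row_idx <- match(paste(track$Longitude, track$Latitude),
                 paste(Var$Longitude, Var$Latitude))
# column of Var holding each tracking point's day ("Day1", "Day3", ...)
col_idx <- match(paste0("Day", track$Day), names(Var))

# index the table as a matrix with (row, column) pairs
track$Temp <- as.matrix(Var)[cbind(row_idx, col_idx)]
track$Temp
#[1] 22 11 19 22 31
```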

How to create dataframe subset of the one patient observation with the lowest score on a variable

Hello I have a dataset with multiple patients, each with multiple observations.
I want to select the earliest observation for each patient.
Example:
Patient ID Tender Swollen pt_visit
101 1 10 6
101 6 12 12
101 4 3 18
102 9 5 18
102 3 6 24
103 5 2 12
103 2 1 18
103 8 0 24
The pt_visit variable is the number of months the patient was in the study at the time of the observation. What I need is the first observation for each patient ID, that is, the row with the lowest number of months in the pt_visit column.
My desired results:
Patient ID Tender Swollen pt_visit
101 1 10 6
102 9 5 18
103 5 2 12
Assuming your data frame is called df, use the ddply function in the plyr package:
require(plyr)
firstObs <- ddply(df, "PatientID", function(x) x[x$pt_visit == min(x$pt_visit), ])
I would use the data.table package:
library(data.table)
Data <- data.table(Data)
# sorting by the key puts each patient's earliest visit first
setkey(Data, Patient_ID, pt_visit)
Data[,.SD[1], by=Patient_ID]
Assuming that the Patient ID column is actually named Patient_ID, here are a few approaches. DF is assumed to be the name of the input data frame:
sqldf
library(sqldf)
sqldf("select Patient_ID, Tender, Swollen, min(pt_visit) pt_visit
from DF
group by Patient_ID")
or
sqldf("select *, min(pt_visit) pt_visit from DF group by Patient_ID")[-ncol(DF)]
Note: The above two alternatives use an extension to SQL only found in SQLite, so be sure you are using the SQLite backend. (SQLite is the default backend for sqldf unless RH2, RPostgreSQL or RMySQL is loaded.)
subset/ave
subset(DF, ave(pt_visit, Patient_ID, FUN = rank) == 1)
Note: This makes use of the fact that there are no duplicate pt_visit values within the same Patient_ID. If there were, we would need to specify the ties.method= argument to rank.
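If ties were possible, one way to pass that argument through ave is an anonymous function. A sketch using the example data from the question, with the column assumed to be named Patient_ID as above:

```r
DF <- data.frame(Patient_ID = c(101,101,101,102,102,103,103,103),
                 Tender     = c(1,6,4,9,3,5,2,8),
                 Swollen    = c(10,12,3,5,6,2,1,0),
                 pt_visit   = c(6,12,18,18,24,12,18,24))

# ties.method = "first" gives rank 1 to exactly one row per patient,
# even if two visits share the same pt_visit value
first_obs <- subset(DF, ave(pt_visit, Patient_ID,
                            FUN = function(x) rank(x, ties.method = "first")) == 1)
first_obs
```

On this data there are no ties, so the result is the same three rows as in the desired output.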
I almost think there should be a subset parameter named "by" that would do the same as it does in data.table. This is a base solution:
do.call(rbind, lapply( split(dfr, dfr$PatientID),
function(x) x[which.min(x$pt_visit),] ) )
PatientID Tender Swollen pt_visit
101 101 1 10 6
102 102 9 5 18
103 103 5 2 12
I guess you can see why @hadley built 'plyr'.
