Summing Counts of a wide variable once per subject - r

I have a dataset, subjects, that contains a set of variables corresponding to medications the subjects have taken. From those wide variables, I've created a unique set of all the meds in the dataset (regardless of which specific variable each one came from).
I want to find the subject count for each med, such that if a subject lists a med once or more, that med's count is increased by one.
Below is a slow way of doing it, but I have 9 med variables and over 50,000 subjects. Can someone help me figure out a more efficient way? Thanks.
library(data.table)
subjects <- data.table(
med1= as.factor(c("NASONEX","ALBUTEROL","","BENADRYL","THEODUR")),
med2= as.factor(c("","ALBUTEROL","ASBRON","NASONEX","MONTEKULAST")),
medrecent= as.factor(c("MONTEKLUAST","","","THEODUR",""))
)
medvarnames <- c("med1","med2","medrecent")
allmeds <- data.table(
med=subjects[,unique(unlist(sapply(.SD,levels))), .SDcols=medvarnames],
count=0
)
for(i in 1: dim(subjects)[1]){
allmeds[, count := count +
sapply(allmeds$med,function(.m){
as.integer(
any(.m == subjects[i,.SD,.SDcols=medvarnames],na.rm=TRUE)
)
})
]
}
allmeds
med count
1: 4
2: ALBUTEROL 1
3: BENADRYL 1
4: NASONEX 2
5: THEODUR 2
6: ASBRON 1
7: MONTEKULAST 1
8: MONTEKLUAST 1

What about
as.data.frame(table(unlist(apply(subjects, 1, unique))))
There might be faster methods but it does a decent job (~1.5 sec) on a 50,000-by-9 table like you mentioned.
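Another option, sketched here under the assumption that each row is one subject and that blank strings mean "no medication listed", is to reshape to long format and count distinct subject/med pairs. The id column and the names long and counts are introduced only for this illustration, and the sample columns are plain character vectors rather than factors.

```r
library(data.table)

subjects <- data.table(
  med1      = c("NASONEX", "ALBUTEROL", "", "BENADRYL", "THEODUR"),
  med2      = c("", "ALBUTEROL", "ASBRON", "NASONEX", "MONTEKULAST"),
  medrecent = c("MONTEKLUAST", "", "", "THEODUR", "")
)
medvarnames <- c("med1", "med2", "medrecent")

# Hypothetical row identifier: one row is assumed to be one subject
subjects[, id := .I]

# Long format: one row per (subject, variable) pair
long <- melt(subjects, id.vars = "id", measure.vars = medvarnames,
             value.name = "med")

# Keep each (subject, med) pair once, drop blanks, then count subjects per med
counts <- unique(long[med != "", .(id, med)])[, .(count = .N), by = med]
counts[order(med)]
```

Because duplicates are removed per subject before counting, a subject who lists ALBUTEROL in two variables still contributes only one to its count.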

Related

How to measure the average gap size in a time series panel per id?

To deal with product time series, many of which show intermittent demand, I want to measure how large the gaps of zero values within each series are.
In the next step I want to measure the average gap length per id. In my example this would be 4.33 for ID 1.
I found an older solution for measuring gap sizes in time series, but it does not give me the result in a form that I can process further to derive measures like the average, minimum and maximum gap size:
Gap size calculation in time series with R
library(tidyverse)
library(lubridate)
library(data.table)
data <- tibble(id = as.factor(c(rep("1",24),rep("2",24),rep("3",24))),
date = rep(c(ymd("2013-01-01")+ months(0:23)),3),
value = c(c(rep(4,5),0,0,0,0,0,0,0,0,7,0,0,0,0,11,23,54,33,45,0),
c(4,6,1,2,3,4,4,6,8,11,18,6,6,1,7,7,13,9,4,33,3,6,81,45),
c(rep(4,5),0,0,0,5,2,0,0,0,7,0,0,8,0,11,23,54,33,0,0))
)
# this gives me the repeated gap size per observation
setDT(data)
data[, gap := rep(rle(value)$lengths, rle(value)$lengths) * (value == 0)]
# I want the distinct gap size per id
1: c(8,4,1)
2: c(0)
3: c(3,3,2,1,2)
If I were able to determine the number of gaps per id, I could also calculate the mean gap size by retrieving the total number of zeros per id like this (13/3 = 4.33):
# total number of zeros per id
data <- as_tibble(data)
data %>% group_by(id) %>% summarise(zero_sum = length(which(value == 0)))
You could use rle:
library(data.table)
setDT(data)
data[,.(n=with(rle(value==0),lengths*values)),by=id][n>0]
id n
<fctr> <int>
1: 1 8
2: 1 4
3: 1 1
4: 3 3
5: 3 3
6: 3 2
7: 3 1
8: 3 2
or in the expected format:
data[,.(n=list(with(rle(value==0),{r = lengths*values;
r <- r[r!=0];
if (length(r)==0) {r <- 0L};
r }))),by=id]
id n
<fctr> <list>
1: 1 8,4,1
2: 2 0
3: 3 3,3,2,1,2
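Building on the rle idea, the per-id summary the question asks for (mean, minimum and maximum gap size) might be derived as follows. This is only a sketch: the names gaps and gap_stats are made up for the illustration, and the data is rebuilt without the date column so the block runs on its own.

```r
library(data.table)

data <- data.table(
  id = rep(c("1", "2", "3"), each = 24),
  value = c(c(rep(4,5),0,0,0,0,0,0,0,0,7,0,0,0,0,11,23,54,33,45,0),
            c(4,6,1,2,3,4,4,6,8,11,18,6,6,1,7,7,13,9,4,33,3,6,81,45),
            c(rep(4,5),0,0,0,5,2,0,0,0,7,0,0,8,0,11,23,54,33,0,0))
)

# One row per run of zeros, per id (lengths of the runs where value == 0)
gaps <- data[, .(n = with(rle(value == 0), lengths[values])), by = id]

# Summary statistics per id; ids without any gap drop out of the result
gap_stats <- gaps[, .(n_gaps = .N, mean_gap = mean(n),
                      min_gap = min(n), max_gap = max(n)), by = id]
gap_stats
```

For id 1 the gaps are 8, 4 and 1, so mean_gap is 13/3, matching the 4.33 from the question. Note that id 2 has no zeros and therefore does not appear in gap_stats; it would have to be joined back in if a row with 0 is wanted.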

How to check for skipped values in a series in a R dataframe column?

I have a dataframe price1 in R that has four columns:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
There are ten Car names in all in price1, so the above is just to give an idea of the structure. Each car name should have 54 observations corresponding to 54 weeks. But there are some weeks for which no observation exists (e.g., Week 3 and 4 in the above case). For these missing weeks, I need to plug in information from another dataframe price2:
Name AveragePrice AverageRebate
Car 1 20000 500
Car 2 20000 400
Car 3 20000 400
---- ---- ---
Car 10 20400 450
So, I need to identify the missing week for each Car name in price1, capture the row corresponding to that Car name in price2, and insert the row in price1. I just can't wrap my head around a possible approach, so unfortunately I do not have a code snippet to share. Most of my search in SO is leading me to answers regarding handling missing values, which is not what I am looking for. Can someone help me out?
I am also indicating the desired output below:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 3 20200 410
Car 1 4 20300 420
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
---- -- ---- ---
Car 10 54 21400 600
Note that the output now has Car 1 info for Week 3 and 4, which I should fetch from price2. The final output should contain 54 observations for each of the 10 car names, so a total of 540 rows.
try this, good luck
library(data.table)
carNames <- paste('Car', 1:10)
df <- data.table(Name = rep(carNames, each = 54), Week = rep(1:54, times = 10))
df <- merge(df, price1, by = c('Name', 'Week'), all.x = TRUE)
df <- merge(df, price2, by = 'Name', all.x = TRUE)
df[, `:=`(Price = ifelse(is.na(Price), AveragePrice, Price), Rebate = ifelse(is.na(Rebate), AverageRebate, Rebate))]
df[, 1:4]
So if I understand your problem correctly, you basically have two dataframes and you want to make sure the dataframe price1 has the correct values for each car name in the 'Name' column?
Here's what I would do, but it probably isn't the optimal way:
#loop over the rows of price1
for(i in seq_len(nrow(price1))){
#look up the row of price2 for the same car name
j <- match(price1$Name[i], price2$Name)
#if the price is NA, replace it with the corresponding value in price2
if (is.na(price1$Price[i])){
price1$Price[i] <- price2$AveragePrice[j]
}
}
Hope this helps (:
If I understand your question correctly, you only want to see what is in the 2nd table and not in the first. You will just want to use an anti_join. Note that the order you feed the tables into the anti_join matters.
library(tidyverse)
complete_table <-
price2 %>%
anti_join(price1, by = "Name")
To expand your first table to cover all 54 weeks use complete() or you can even fudge it and right_join a table that you will purposely build with all 54 weeks in it. Then anything that doesn't join to this second table gets an NA in that column.
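A minimal sketch of the complete() route is shown here. The price1 and price2 tables are toy stand-ins with made-up values, and only 5 weeks are used instead of 54 to keep the output short; filled is a name chosen for the illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for price1 and price2 (made-up values, 5 weeks instead of 54)
price1 <- tibble(
  Name   = c("Car 1", "Car 1", "Car 1", "Car 2", "Car 2"),
  Week   = c(1L, 2L, 5L, 1L, 3L),
  Price  = c(20000, 20000, 20000, 18000, 18500),
  Rebate = c(500, 400, 400, 300, 350)
)
price2 <- tibble(
  Name          = c("Car 1", "Car 2"),
  AveragePrice  = c(20000, 18250),
  AverageRebate = c(433, 325)
)

filled <- price1 %>%
  complete(Name, Week = 1:5) %>%       # add NA rows for missing Name/Week pairs
  left_join(price2, by = "Name") %>%   # attach the per-car averages
  mutate(Price  = coalesce(Price, AveragePrice),
         Rebate = coalesce(Rebate, AverageRebate)) %>%
  select(Name, Week, Price, Rebate)
filled
```

With the real data, Week = 1:54 should give 54 rows per car, with the gaps filled from the averages in price2.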

R count and subtract events from a data frame

I am trying to calculate family sizes from a data frame, which also contains two types of events: family members who died, and those who left the family. I would like to take these two parameters into account in order to compute the actual family size.
Here is a reproducible example of my problem, with 3 families only:
family <- factor(rep(c("001","002","003"), c(10,8,15)), levels=c("001","002","003"), labels=c("001","002","003"), ordered=TRUE)
dead <- c(0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0)
left <- c(0,0,0,0,0,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0)
DF <- data.frame(family, dead, left) ; DF
I could count N = total family members (in each family) in a second dataframe DF2, by simply using table()
DF2 <- with(DF, data.frame(table(family)))
colnames(DF2)[2] <- "N" ; DF2
family N
1 001 10
2 002 8
3 003 15
But I cannot find a proper way to get the actual number of people (for example, by creating a new variable N2 in DF2), calculated by subtracting from N the number of members who died or left the family. I suppose I have to relate the two dataframes DF and DF2 in some way. I have looked for other related questions on this site but could not find the right answer...
If anyone has a good idea, it would be great !
Thank you in advance..
Deni
Logic: first we group_by(family) and then calculate two numbers: i) the total number of observations in each group, and ii) that total minus sum(dead) + sum(left).
In the dplyr package, n() gives the total number of observations in each group.
In data.table, .N does the same job.
library(dplyr)
DF %>% group_by(family) %>% summarise( total = n(), current = n()-sum(dead,left, na.rm = TRUE))
# family total current
# (fctr) (int) (dbl)
#1 001 10 6
#2 002 8 4
#3 003 15 7
library(data.table)
# setDT() converts DF to a data.table by reference if it is a data.frame; if it is already a data.table, just use DF.
setDT(DF)[, .(total = .N, current = .N - sum(dead, left, na.rm = TRUE)), by = family]
# family total current
#1: 001 10 6
#2: 002 8 4
#3: 003 15 7
Here is a base R option
do.call(data.frame, aggregate(dl~family, transform(DF, dl = dead + left),
FUN = function(x) c(total=length(x), current=length(x) - sum(x))))
Or a modified version is
transform(aggregate(. ~ family, transform(DF, total = 1,
current = dead + left)[c(1,4:5)], FUN = sum), current = total - current)
# family total current
#1 001 10 6
#2 002 8 4
#3 003 15 7
I finally found another solution (from another post) that works fine and computes everything from the original DF table. It uses the ddply function from the plyr package:
library(plyr)
DF <- ddply(DF,.(family),transform,total=length(family))
DF <- ddply(DF,.(family),transform,actual=length(family)-sum(dead=="1")-sum(left=="1"))
DF
Thanks a lot to everyone who helped ! Deni

Combining all immediately previous rows that have the same value as last row in R

I have a dataframe that looks similar to this:
ID Description
1 "Low Blood Sugar, High Temperature"
1 "No Appetite"
2 "Blood Test Taken"
2 "D4556 Applied, No Obvious Reaction"
3 "At Rest"
1 "Lower Temperature, Improving"
1 "Walked 50m"
1 "Sedated"
What I want to do is return as an array, the last patient ID and all comments that occurred to the same patient immediately previously. That is, not anything that happened earlier. For example, I want to return:
ID Description
1 "Lower Temperature, Improving, Walked 50m, Sedated"
I can work this out with for loops, but I have a big dataframe and want something more efficient. I'm already subsetting over different variables using ddply, this is why I only need the last observations for the last ID.
Another data.table approach (using rleid):
library(data.table) #1.9.6
res <- setDT(df)[, list(ID[1L], toString(Description)), by = rleid(ID)]
You can remove the rleid column later on using
res[, rleid := NULL]
If you only want to return the last ID in aggregated form, you could do the following which should be pretty quick:
idx <- df[,rleid(ID)]
df[idx == max(idx), list(ID[1L], toString(Description))]
# V1 V2
#1: 1 Lower Temperature, Improving, Walked 50m, Sedated
I think this might be a solution:
#use rle to get the number of rows for the last id
idrows <- rle(df$ID)$lengths[length(rle(df$ID)$lengths)]
#use rle in the same way to get the actual id value
id <- rle(df$ID)$values[length(rle(df$ID)$values)]
#combine the above two variables to calculate the needed values
#i.e. the id and the pasted-together descriptions
data.frame(id = id, Description = paste(df[(nrow(df)-idrows+1):nrow(df), 'Description'],
collapse=','))
id Description
1 1 Lower Temperature, Improving,Walked 50m,Sedated
Just to make it a bit more clear, df[(nrow(df)-idrows+1):nrow(df), 'Description'] selects the last rows for the needed id and paste pastes those descriptions together.
First we create a unique id by group in column ID:
df$id <- cumsum(c(1, diff(df$ID)!=0))
ID Description id
1 1 Low Blood Sugar, High Temperature 1
2 1 No Appetite 1
3 2 Blood Test Taken 2
4 2 D4556 Applied, No Obvious Reaction 2
5 3 At Rest 3
6 1 Lower Temperature, Improving 4
7 1 Walked 50m 4
8 1 Sedated 4
Then we select the last rows meeting the condition and create the vector:
c(t(df[df$id==max(df$id),][2]))
Output:
[1] "Lower Temperature, Improving" "Walked 50m" "Sedated"
With data.table you can try:
require(data.table)
setDT(df)
df[,list(ID=ID[1],Description=paste(Description,collapse=", ")),
by=cumsum(c(TRUE,df$ID[2:nrow(df)]!=df$ID[1:(nrow(df)-1)]))][,
list(ID,Description)]
#ID Description
#1: 1 Low Blood Sugar, High Temperature, No Appetite
#2: 2 Blood Test Taken, D4556 Applied, No Obvious Reaction
#3: 3 At Rest
#4: 1 Lower Temperature, Improving, Walked 50m, Sedated
Although this old question already has an accepted answer, I felt challenged to add another data.table solution which uses rleid() but is different to this answer:
library(data.table) # CRAN version 1.10.4 used
setDT(DF)[, .(ID, Description = toString(Description)), by = rleid(ID)][, .SD[.N]]
# rleid ID Description
#1: 4 1 Lower Temperature, Improving, Walked 50m, Sedated
This returns the aggregated last entries concerning the very last patient ID as requested by the OP.
As already mentioned in the other answer, rleid can be removed by chaining [, rleid := NULL].
Just to mention, we can retrieve the aggregated last entries for each patient ID with a slight modification:
setDT(DF)[, .(Description = toString(Description)), .(ID, rleid(ID))][, .SD[.N], ID]
# ID rleid Description
#1: 1 4 Lower Temperature, Improving, Walked 50m, Sedated
#2: 2 2 Blood Test Taken, D4556 Applied, No Obvious Reaction
#3: 3 3 At Rest
Note that it is not required to name the by parameter as long as it is the third unnamed parameter.
Data
library(data.table)
DF <- fread(
'ID Description
1 "Low Blood Sugar, High Temperature"
1 "No Appetite"
2 "Blood Test Taken"
2 "D4556 Applied, No Obvious Reaction"
3 "At Rest"
1 "Lower Temperature, Improving"
1 "Walked 50m"
1 "Sedated"'
, data.table = FALSE)
Note that the parameter data.table = FALSE instructs fread() to return a data.frame to be in line with OP's specifications.

How to find sum and average for some columns based on the numbers from another column in R

GIVEN DATA
I have 6 columns of data of vehicle trajectory (observation of vehicles' change in position, velocity, etc over time) a part of which is shown below:
Vehicle ID Frame ID Global X Vehicle class Vehicle velocity Lane
1 177 6451181 2 24.99 5
1 178 6451182 2 24.95 5
1 179 6451184 2 24.91 5
1 180 6451186 2 24.90 5
1 181 6451187 2 24.96 5
1 182 6451189 2 25.08 5
Vehicle ID is the identification of individual vehicles e.g. vehicle 1, vehicle 2, etc. It is repeated in the column for each frame in which it was observed. Please note that each frame is 0.1 seconds long so 10 frames make 1 second. The IDs of frames is in Frame ID column. Vehicle class is the type of vehicle (1=motorcycle, 2=car, 3=truck). Vehicle velocity column represents instantaneous speed of vehicle in that instant of time i.e. in a frame. Lane represents the number or ID of the lane in which vehicle is present in a particular frame.
WHAT I NEED TO FIND
The data I have is for 15 minutes period. The minimum frame ID is 5 and maximum frame ID is 9952. I need to find the total number of vehicles in every 30 seconds time period. This means that starting from the first 30 seconds (frame ID 5 to frame ID 305), I need to know the unique vehicle IDs observed. Also, for these 30 seconds period, I need to find the average velocity of each vehicle class. This means that e.g. for cars I need to find the average of all velocities of those vehicles whose vehicle class is 2.
I need to find this for every 30-second time period, i.e. 5-305, 305-605, 605-905,..., 9605-9905. The output should be tables for cars, trucks and motorcycles like this:
Time Slots Total Cars Average Velocity
5-305 xx xx
305-605 xx xx
. . .
. . .
9605-9905 xx xx
WHAT I HAVE TRIED SO FAR
# Finding the minimum and maximum Frame ID for creating 30-seconds time slots
minfid <- min(data$'Frame ID') # this was 5
maxfid <- max(data$'Frame ID') # this was 9952
for (i in 'Frame ID'==5:Frame ID'==305) {
table ('Vehicle ID')
mean('Vehicle Velocity', 'Vehicle class'==2)
} #For cars in first 30 seconds
I can't generate the required output and I don't know how can I do this for all 30 second periods. Please help.
It's a bit tough to make sure code is completely correct with your data since there is only one vehicle in the sample you show. That said, this is a typical split-apply-combine type analysis you can execute easily with the data.table package:
library(data.table)
dt <- data.table(df) # I just did a `read.table` on the text you posted
dt[, frame.group:=cut(Frame_ID, seq(5, 9905, by=300), include.lowest=T)]
Here, I just converted your data into a data.table (df was a direct import of your data posted above), and then created 300 frame buckets using cut. Then, you just let data.table do the work. In the first expression we calculate total unique vehicles per frame.group
dt[, list(tot.vehic=length(unique(Vehicle_ID))), by=frame.group]
# frame.group tot.vehic
# 1: [5,305] 1
Now we group by frame.group and Vehicle_class to get average speed and count for those combinations:
dt[, list(tot.vehic=length(unique(Vehicle_ID)), mean.speed=mean(Vehicle_velocity)), by=list(frame.group, Vehicle_class)]
# frame.group Vehicle_class tot.vehic mean.speed
# 1: [5,305] 2 1 24.965
Again, a bit silly when we only have one vehicle, but this should work for your data set.
EDIT: to show that it works:
library(data.table)
set.seed(101)
dt <- data.table(
Frame_ID=sample(5:9905, 50000, rep=T),
Vehicle_ID=sample(1:400, 50000, rep=T),
Vehicle_velocity=runif(50000, 25, 100)
)
dt[, frame.group:=cut(Frame_ID, seq(5, 9905, by=300), include.lowest=T)]
dt[, Vehicle_class:=Vehicle_ID %% 3]
head(
dt[order(frame.group, Vehicle_class), list(tot.vehic=length(unique(Vehicle_ID)), mean.speed=mean(Vehicle_velocity)), by=list(frame.group, Vehicle_class)]
)
# frame.group Vehicle_class tot.vehic mean.speed
# 1: [5,305] 0 130 63.34589
# 2: [5,305] 1 131 61.84366
# 3: [5,305] 2 129 64.13968
# 4: (305,605] 0 132 61.85548
# 5: (305,605] 1 132 64.76820
# 6: (305,605] 2 133 61.57129
Maybe it's your data?
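To see how the cut() bucketing used above treats boundary frames, here is a small base R check. The frame IDs are made up for the illustration and only the first three 300-frame slots are built.

```r
# Frame IDs chosen to sit on and around the bucket boundaries
frames <- c(5, 305, 306, 605, 606)

# Breaks every 300 frames (30 seconds at 10 frames per second)
slots <- cut(frames, breaks = seq(5, 905, by = 300),
             include.lowest = TRUE, dig.lab = 5)
as.character(slots)
# "[5,305]" "[5,305]" "(305,605]" "(305,605]" "(605,905]"
```

The intervals are closed on the right, so frame 305 belongs to the first slot and frame 306 to the second. Frames beyond the last break get NA, which would apply to frame IDs above 9905 when the breaks stop at 9905.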
Here is a plyr version:
data$timeSlot <- cut(data$FrameID,
breaks = seq(5, 9905, by=300),
dig.lab=5,
include.lowest=TRUE)
# split & combine
library(plyr)
data.sum1 <- ddply(.data = data,
.variables = c("timeSlot"),
.fun = summarise,
totalCars = length(unique(VehicleID)),
AverageVelocity = mean(velocity)
)
# include VehicleClass
data.sum2 <- ddply(.data = data,
.variables = c("timeSlot", "VehicleClass"),
.fun = summarise,
totalCars = length(unique(VehicleID)),
AverageVelocity = mean(velocity)
)
The column names like FrameID would have to be edited to match the ones you use:
data <- read.table(sep = "", header = TRUE, text = "
VehicleID FrameID GlobalX VehicleClass velocity Lane
1 177 6451181 2 24.99 5
1 178 6451182 2 24.95 5
1 179 6451184 2 24.91 5
1 180 6451186 2 24.90 5
1 181 6451187 2 24.96 5
1 182 6451189 2 25.08 5")
data.sum1
# timeSlot totalCars AverageVelocity
# 1 [5,305] 1 24.965
data.sum2
# timeSlot VehicleClass totalCars AverageVelocity
# 1 [5,305] 2 1 24.965

Resources