Related
HAVE = data.frame(INSTRUCTOR = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
STUDENT = c(1, 2, 2, 2, 1, 3, 1, 1, 1, 1, 2, 1),
SCORE = c(10, 1, 0, 0, 7, 3, 5, 2, 2, 4, 10, 2),
TIME = c(1,1,2,3,2,1,1,2,3,1,1,2))
WANT = data.frame(INSTRUCTOR = c(1, 2, 3),
SCORE.DIF = c(-9, NA, 6))
For each INSTRUCTOR, I wish to find the SCORE of the first and second STUDENT, and subtract their scores. The STUDENT code varies so I wish not to use '==1' vs '==2'
I try:
HAVE[, .SD[1:2], by = 'INSTRUCTOR']
but do not know how to subtract vertically and obtain 'WANT' data frame from 'HAVE'
library(data.table)
setDT(HAVE)
unique(HAVE, by = c("INSTRUCTOR", "STUDENT")
)[, .(SCORE.DIF = diff(SCORE[1:2])), by = INSTRUCTOR]
# INSTRUCTOR SCORE.DIF
# <num> <num>
# 1: 1 -9
# 2: 2 NA
# 3: 3 6
To use your new TIME variable, we can do
HAVE[, .SD[which.min(TIME),], by = .(INSTRUCTOR, STUDENT)
][, .(SCORE.DIF = diff(SCORE[1:2])), by = INSTRUCTOR]
# INSTRUCTOR SCORE.DIF
# <num> <num>
# 1: 1 -9
# 2: 2 NA
# 3: 3 6
One might be tempted to replace SCORE[1:2] with head(SCORE,2), but that won't work: head(SCORE,2) will return length-1 if the input is length-2, as it is with instructor 2 (who only has one student albeit multiple times). When you run diff on length-1 (e.g., diff(1)), it returns a 0-length vector, which in the above data.table code reduces to zero rows for instructor 2. However, when there is only one student, SCORE[1:2] resolves to c(SCORE[1], NA), for which the diff is length-1 (as needed) and NA (as needed).
I have a two dataframes A and B that both have multiple columns. They share the common columns "week" and "store". I would like to join these two dataframes on the matching values of the common columns.
For example this is a small subset of the data that I have:
A = data.frame(retailer = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
store = c(5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6),
week = c(2021100301, 2021092601, 2021091901, 2021091201, 2021082901, 2021082201, 2021081501, 2021080801,
2021080101, 2021072501, 2021071801, 2021071101, 2021070401, 2021062701, 2021062001, 2021061301),
dollars = c(121817.9, 367566.7, 507674.5, 421257.8, 453330.3, 607551.4, 462674.8,
464329.1, 339342.3, 549271.5, 496720.1, 554858.7, 382675.5,
373210.9, 422534.2, 381668.6))
and
B = data.frame(
week = c("2020080901", "2017111101", "2017061801", "2020090701", "2020090701", "2020090701",
"2020091201","2020082301", "2019122201", "2017102901"),
store = c(14071, 11468, 2428, 17777, 14821, 10935, 5127, 14772, 14772, 14772),
fill = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
)
I would like to join these two tables on the matching week AND store values in order to incorporate the "fill" column from B into A. Where the values don't match, I would like to have a label "0" in the fill column, instead of a 1. Is there a way I can do this? I am not sure which join to use as well, or if "merge" would be better for this? Essentially I am NOT trying to get rid of any rows that do not have the matching values for the two common columns. Thanks for any help!
We may do a left_join
library(dplyr)
library(tidyr)
A %>%
mutate(week = as.character(week)) %>%
left_join(B) %>%
mutate(fill = replace_na(fill, 0))
I am working with a long-format longitudinal dataset where each person has 1, 2 or 3 time points. In order to perform certain analyses I need to make sure that each person has the same number of rows even if it consists of NAs because they did not complete the certain time point.
Here is a sample of the data before adding the rows:
structure(list(Values = c(23, 24, 45, 12, 34, 23), P_ID = c(1,
1, 2, 2, 2, 3), Event_code = c(1, 2, 1, 2, 3, 1), Site_code = c(1,
1, 3, 3, 3, 1)), class = "data.frame", row.names = c(NA, -6L))
This is the data I aim to get after adding the relevant rows:
structure(list(Values = c(23, 24, NA, 45, 12, 34, 23, NA, NA),
P_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3), Event_code = c(1, 2,
3, 1, 2, 3, 1, 2, 3), Site_code = c(1, 1, 1, 3, 3, 3, 1,
1, 1)), class = "data.frame", row.names = c(NA, -9L))
I want to come up with code that would automatically add rows to the dataset conditionally on whether the participant has had 1, 2 or 3 visits. Ideally it would make rest of data all NAs while copying Participant_ID and site_code but if not possible I would be satisfied just with creating the right number of rows.
We could use fill after doing a complete
library(dplyr)
library(tidyr)
ExpandedDataset %>%
complete(P_ID, Event_code) %>%
fill(Site_code)
I came with quite a long code, but you could group it in a function and make it easier:
Here's your dataframe:
df <- data.frame(ID = c(rep("P1", 2), rep("P2", 3), "P3"),
Event = c("baseline", "visit 2", "baseline", "visit 2", "visit 3", "baseline"),
Event_code = c(1, 2, 1, 2, 3, 1),
Site_code = c(1, 1, 2, 2, 2, 1))
How many records you have per ID?
values <- summary(df$ID)
What is the maximum number of records for a single patient?
target <- max(values)
Which specific patients have less records than the maximum?
uncompliant <- names(which(values<target))
And how many records do you have for those patients who have missing information?
rowcount <- values[which(values<target)]
So now, let's create the vectors of the data frame we will add to your original one. First, IDs:
IDs <- vector()
for(i in 1:length(rowcount)){
y <- rep(uncompliant[i], target - rowcount[i])
IDs <- c(IDs, y)
}
And now, the sitecodes:
SC <- vector()
for(i in 1:length(rowcount)){
y <- rep(unique(df$Site_code[which(df$ID == uncompliant[i])]), target - rowcount[i])
SC <- c(SC, y)
}
Finally, a data frame with the values we will introduce:
introduce <- data.frame(ID = IDs, Event = rep(NA, length(IDs)),
Event_code = rep(NA, length(IDs)),
Site_code = SC)
Combine the original dataframe with the new values to be added and sort it so it looks nice:
final <- as.data.frame(rbind(df, introduce))
final <- final[order(v$ID), ]
This question is very similar to a question asked in another thread which can be found here. I'm trying to achieve something similar: within groups (events) subtract the first date from the last date. I'm using the dplyr package and code provided in the answers of this thread. Subtracting the first date from the last date works, however it does not provide satisfactory results; the resulting time difference is displayed in numbers, and there seems to be no distinction between different time units (e.g., minutes and hours) --> subtractions in first 2 events are correct, however in the 3rd one it is not i.e. should be minutes. How can I manipulate the output by dplyr so that the resulting subtractions are actually a correct reflection of the time difference? Below you will find a sample of my data (1 group only) and the code that I used:
df<- structure(list(time = structure(c(1428082860, 1428083340, 1428084840,
1428086820, 1428086940, 1428087120, 1428087240, 1428087360, 1428087480,
1428087720, 1428088800, 1428089160, 1428089580, 1428089700, 1428090120,
1428090240, 1428090480, 1428090660, 1428090780, 1428090960, 1428091080,
1428091200, 1428091500, 1428091620, 1428096060, 1428096420, 1428096540,
1428096600, 1428097560, 1428097860, 1428100440, 1428100560, 1428100680,
1428100740, 1428100860, 1428101040, 1428101160, 1428101400, 1428101520,
1428101760, 1428101940, 1428102240, 1428102840, 1428103080, 1428103620,
1428103980, 1428104100, 1428104160, 1428104340, 1428104520, 1428104700,
1428108540, 1428108840, 1428108960, 1428110340, 1428110460, 1428110640
), class = c("POSIXct", "POSIXt"), tzone = ""), event = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)), .Names = c("time",
"event"), class = "data.frame", row.names = c(NA, 57L))
df1 <- df %>%
group_by(event) %>%
summarize(first(time),last(time),difference = last(time)-first(time))
We can use difftime and specify the unit to get all the difference in the same unit.
df %>%
group_by(event) %>%
summarise(First = first(time),
Last = last(time) ,
difference= difftime(last(time), first(time), unit='hour'))
This question is very similar to a question asked in another thread which can be found here. I'm trying to achieve something similar: within groups (events) subtract the first date from the last date. I'm using the dplyr package and code provided in the answers of this thread. Subtracting the first date from the last date works, however it does not provide satisfactory results; the resulting time difference is displayed in numbers, and there seems to be no distinction between different time units (e.g., minutes and hours) --> subtractions in first 2 events are correct, however in the 3rd one it is not i.e. should be minutes. How can I manipulate the output by dplyr so that the resulting subtractions are actually a correct reflection of the time difference? Below you will find a sample of my data (1 group only) and the code that I used:
df<- structure(list(time = structure(c(1428082860, 1428083340, 1428084840,
1428086820, 1428086940, 1428087120, 1428087240, 1428087360, 1428087480,
1428087720, 1428088800, 1428089160, 1428089580, 1428089700, 1428090120,
1428090240, 1428090480, 1428090660, 1428090780, 1428090960, 1428091080,
1428091200, 1428091500, 1428091620, 1428096060, 1428096420, 1428096540,
1428096600, 1428097560, 1428097860, 1428100440, 1428100560, 1428100680,
1428100740, 1428100860, 1428101040, 1428101160, 1428101400, 1428101520,
1428101760, 1428101940, 1428102240, 1428102840, 1428103080, 1428103620,
1428103980, 1428104100, 1428104160, 1428104340, 1428104520, 1428104700,
1428108540, 1428108840, 1428108960, 1428110340, 1428110460, 1428110640
), class = c("POSIXct", "POSIXt"), tzone = ""), event = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)), .Names = c("time",
"event"), class = "data.frame", row.names = c(NA, 57L))
df1 <- df %>%
group_by(event) %>%
summarize(first(time),last(time),difference = last(time)-first(time))
We can use difftime and specify the unit to get all the difference in the same unit.
df %>%
group_by(event) %>%
summarise(First = first(time),
Last = last(time) ,
difference= difftime(last(time), first(time), unit='hour'))