I've tried creating a variable that represents the lagged version of another variable relative to the whole change of the variable within the group.
Let's use this example dataframe:
game_data <- data.frame(player = c(1,1,1,2,2,2,3,3,3), level = c(1,2,3,1,2,3,1,2,3), score=as.numeric(c(0,150,170,80,100,110,75,100,0)))
game_data
player level score
1 1 1 0
2 1 2 150
3 1 3 170
4 2 1 80
5 2 2 100
6 2 3 110
7 3 1 75
8 3 2 100
9 3 3 0
I've tried the following, but while lagging the variable works, I am not able to create a new variable that shows the lag of the variable relative to the whole change for the player:
result <-
+ game_data %>%
+ group_by(player) %>%
+ mutate(
+ lag_score = score - dplyr::lag(score, n=1, default = NA),
+ lag_score_relative = lag_score/sum(lag_score))
result
# A tibble: 9 x 5
# Groups: player [3]
player level score lag_score lag_score_relative
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 NA NA
2 1 2 150 150 NA
3 1 3 170 20 NA
4 2 1 80 NA NA
5 2 2 100 20 NA
6 2 3 110 10 NA
7 3 1 75 NA NA
8 3 2 100 25 NA
9 3 3 0 -100 NA
For example, for player 1 it should be in
Level 1: NA/170 = NA
Level 2: 150/170
Level 3: 20/170
Thanks in advance, I hope anyone can help.
If you sum the lagged scores you include an NA. The sum then returns NA. You divide by NA which in the end returns NA for every value. To avoid this just set the na.rm argument to TRUE in your call of sum and NAs do not get included in the sum:
game_data <- data.frame(player = c(1,1,1,2,2,2,3,3,3), level = c(1,2,3,1,2,3,1,2,3),
score=as.numeric(c(0,150,170,80,100,110,75,100,0)))
game_data %>%
group_by(player) %>%
mutate(
lag_score = score - dplyr::lag(score, n=1, default = NA),
lag_score_relative = lag_score/sum(lag_score, na.rm = TRUE))
Related
I have a dataset containing 4 organisation units (org_unit) with different number of participants and 2 Questions (Q1,Q2) on a 2-degree scale (1:2). I want to know how many people per unit answered the respective question with [1] and divide them by the total number of participants / unit.
Org_unit <- c(1,1,1,1,2,2,2,3,3,4)
Q1 <- c(1,2,1,2,1,2,1,2,1,2)
Q2 <- c(-9,-9,-9,-9,-9,-9,-9,-9,-9,-9)
The problem is, my Q2 only consists of [-9] which stands for non-response. I therefore assigned NA to [-9].
DF <- data.frame(Org_unit, Q1, Q2)
DF[DF == -9] <- NA
DF
Org_unit Q1 Q2
1 1 1 NA
2 1 2 NA
3 1 1 NA
4 1 2 NA
5 2 1 NA
6 2 2 NA
7 2 1 NA
8 3 2 NA
9 3 1 NA
10 4 2 NA
Next I calculated the proportion of people who answered Q1 with [1], which works fine.
prop_q1 <- DF %>%
group_by(Org_unit) %>%
summarise(count = n(),
prop = mean(Q1 == 1))
prop_q1
# A tibble: 4 x 3
Org_unit count prop
<dbl> <int> <dbl>
1 1 4 0.5
2 2 3 0.667
3 3 2 0.5
4 4 1 0
when i run the same code for Q2 however, I get the same amount of members per unit (count = c(1,2,3,4), although nobody answered the question and I don't want them to be registered as participants, since they technically didn't participate in the study.
prop_q2 <- DF %>%
group_by(Org_unit) %>%
summarise(count = n(),
prop = mean(Q2 == 1))
prop_q2
# A tibble: 4 x 3
Org_unit count prop
<dbl> <int> <dbl>
1 1 4 NA
2 2 3 NA
3 3 2 NA
4 4 1 NA
Is there a way to calculate the right amount of members per unit when facing NA's? [-9]
Thanks!
Would
prop_q2 <- DF %>%
filter(!is.na(Q2)) %>%
group_by(Org_unit) %>%
summarise(count = n(),
prop = mean(Q2 == 1))
do the job?
Given that you want to do this across multiple columns, I think that using across() within the dplyr verbs will be better for you. I explain the solution below.
Org_unit <- c(1,1,1,1,2,2,2,3,3,4)
Q1 <- c(1,2,1,2,1,2,1,2,1,2)
Q2 <- c(1,-9,-9,-9,-9,-9,-9,-9,-9,-9) #Note one response
df <- tibble(Org_unit, Q1, Q2)
df %>%
mutate(across(starts_with("Q"), ~na_if(., -9))) %>%
group_by(Org_unit) %>%
summarize(across(starts_with("Q"),
list(
N = ~sum(!is.na(.)),
prop = ~sum(. == 1, na.rm = TRUE)/sum(!is.na(.)))
))
# A tibble: 4 x 5
Org_unit Q1_N Q1_prop Q2_N Q2_prop
* <dbl> <int> <dbl> <int> <dbl>
1 1 4 0.5 1 1
2 2 3 0.667 0 NaN
3 3 2 0.5 0 NaN
4 4 1 0 0 NaN
First, we take the data frame (which I created as a tibble) and substitute NA for all values that equal -9 for all columns that start with a capital "Q". This converts all question columns to have NAs in place of -9s.
Second, we group by the organizational unit and then summarize using two functions. The first sums all values where the response to the question is not NA. The string _N will be appended to columns with these values. The second calculates the proportion and will have _prop appended to the values.
I have a long form of clinical data that looks something like this:
patientid <- c(100,100,100,101,101,101,102,102,102,104,104,104)
outcome <- c(1,1,1,1,1,NA,1,NA,NA,NA,NA,NA)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
A patient should be kept in the database only if they 2 or 3 observations (so patients that have complete data for 0 or only 1 time points should be thrown out. So for this example my desired result is this:
patientid <- c(100,100,100,101,101,101)
outcome <- c(1,1,1,1,1,NA)
time <- c(1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
Hence patients 102 and 104 are thrown out of the database because of they were missing the outcome variable in 2 or 3 of the time points.
We can create a logical expression on the sum of non-NA elements as a logical vector, grouped by 'patientid' to filter patientid's having more than one non-NA 'outcome'
library(dplyr)
Data %>%
group_by(patientid) %>%
filter(sum(!is.na(outcome)) > 1) %>%
ungroup
-output
# A tibble: 6 x 3
# patientid outcome time
# <dbl> <dbl> <dbl>
#1 100 1 1
#2 100 1 2
#3 100 1 3
#4 101 1 1
#5 101 1 2
#6 101 NA 3
A base R option using subset + ave
subset(
Data,
ave(!is.na(outcome), patientid, FUN = sum) > 1
)
giving
patientid outcome time
1 100 1 1
2 100 1 2
3 100 1 3
4 101 1 1
5 101 1 2
6 101 NA 3
A data.table option
setDT(Data)[, Y := sum(!is.na(outcome)), patientid][Y > 1, ][, Y := NULL][]
or a simpler one (thank #akrun)
setDT(Data)[Data[, .I[sum(!is.na(outcome)) > 1], .(patientid)]$V1]
which gives
patientid outcome time
1: 100 1 1
2: 100 1 2
3: 100 1 3
4: 101 1 1
5: 101 1 2
6: 101 NA 3
library(dplyr)
Data %>%
group_by(patientid) %>%
mutate(observation = sum(outcome, na.rm = TRUE)) %>% # create new variable (observation) and count the observation per patient
filter(observation >=2) %>%
ungroup
output:
# A tibble: 6 x 4
patientid outcome time observation
<dbl> <dbl> <dbl> <dbl>
1 100 1 1 3
2 100 1 2 3
3 100 1 3 3
4 101 1 1 2
5 101 1 2 2
6 101 NA 3 2
I have a data frame with three variables and some missing values in one of the variables that looks like this:
subject <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
part <- c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)
df1 <- data.frame(subject,part,sad)
I have created a new data frame with the mean values of 'sad' per subject and part using a loop, like this:
columns<-c("sad.m",
"part",
"subject")
df2<-matrix(data=NA,nrow=1,ncol=length(columns))
df2<-data.frame(df2)
names(df2)<-columns
tn<-unique(df1$subject)
row=1
for (s in tn){
for (i in 0:3){
TN<-df1[df1$subject==s&df1$part==i,]
df2[row,"sad.m"]<-mean(as.numeric(TN$sad), na.rm = TRUE)
df2[row,"part"]<-i
df2[row,"subject"]<-s
row=row+1
}
}
Now I want to include an additional variable 'missing' in that indicates the percentage of rows per subject and part with missing values, so that I get df3:
subject <- c(1,1,1,1,2,2,2,2)
part<-c(0,1,2,3,0,1,2,3)
sad.m<-df2$sad.m
missing <- c(0,50,50,25,50,50,50,25)
df3 <- data.frame(subject,part,sad.m,missing)
I'd really appreciate any help on how to go about this!
It's best to try and avoid loops in R where possible, since they can get messy and tend to be quite slow. For this sort of thing, dplyr library is perfect and well worth learning. It can save you a lot of time.
You can create a data frame with both variables by first grouping by subject and part, and then performing a summary of the grouped data frame:
df2 = df1 %>%
dplyr::group_by(subject, part) %>%
dplyr::summarise(
sad_mean = mean(na.omit(sad)),
na_count = (sum(is.na(sad) / n()) * 100)
)
df2
# A tibble: 8 x 4
# Groups: subject [2]
subject part sad_mean na_count
<dbl> <dbl> <dbl> <dbl>
1 1 0 4.75 0
2 1 1 2 50
3 1 2 2.5 50
4 1 3 1.67 25
5 2 0 5.5 50
6 2 1 4.5 50
7 2 2 4 50
8 2 3 4 25
For each subject and part you can calculate mean of sad and calculate ratio of NA value using is.na and mean.
library(dplyr)
df1 %>%
group_by(subject, part) %>%
summarise(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100)
# subject part sad.m perc_missing
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 4.75 0
#2 1 1 2 50
#3 1 2 2.5 50
#4 1 3 1.67 25
#5 2 0 5.5 50
#6 2 1 4.5 50
#7 2 2 4 50
#8 2 3 4 25
Same logic with data.table :
library(data.table)
setDT(df1)[, .(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100), .(subject, part)]
Try this dplyr approach to compute df3:
library(dplyr)
#Code
df3 <- df1 %>% group_by(subject,part) %>% summarise(N=100*length(which(is.na(sad)))/length(sad))
Output:
# A tibble: 8 x 3
# Groups: subject [2]
subject part N
<dbl> <dbl> <dbl>
1 1 0 0
2 1 1 50
3 1 2 50
4 1 3 25
5 2 0 50
6 2 1 50
7 2 2 50
8 2 3 25
And for full interaction with df2 you can use left_join():
#Left join
df3 <- df1 %>% group_by(subject,part) %>%
summarise(N=100*length(which(is.na(sad)))/length(sad)) %>%
left_join(df2)
Output:
# A tibble: 8 x 4
# Groups: subject [2]
subject part N sad.m
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 4.75
2 1 1 50 2
3 1 2 50 2.5
4 1 3 25 1.67
5 2 0 50 5.5
6 2 1 50 4.5
7 2 2 50 4
8 2 3 25 4
Given the following data:
test = data.frame(x = c(NA,1,1,2,3,4),
y = c(NA,1,2,3,4,4))
I want to perform some calculations and store these as new columns. The calculations, however, might result in a variable amount of columns. E.g. suppose I want store for each row the column index of the column(s) that contain the minimum per row. E.g. in row 1, both columns contain the minimum, hence I need to create two columns.
Using the tidyverse approach, I know I can use the set_names argument when passing my function as a list. But this doesn't work when I don't know the number of columns my calculation will create. See also here: https://community.rstudio.com/t/how-to-handle-lack-of-names-with-unnest-wider/40496
My approach for the calculations:
library(tidyverse)
test %>%
rowwise() %>%
mutate(dist = min(c_across(everything())),
code = list(which(c_across(cols = c(everything(), -dist)) == dist))) %>%
ungroup() %>%
unnest_wider(code)
which automatically names the unnested columns with "...1" and "...2":
# A tibble: 6 x 5
x y dist ...1 ...2
<dbl> <dbl> <dbl> <int> <int>
1 NA NA NA NA NA
2 1 1 1 1 2
3 1 2 1 1 NA
4 2 3 2 1 NA
5 3 4 3 1 NA
6 4 4 4 1 2
But that's not what I want. I also tried to use the named_repair argument within the unnest_wider, i.e. unnest_wider(code, names_repair = ~paste0("code", .x)) but this renames all columns.
Any ideas (preferably in the tidyverse approach)? Expected outcome:
# A tibble: 6 x 5
x y dist code_1 code_2
<dbl> <dbl> <dbl> <int> <int>
1 NA NA NA NA NA
2 1 1 1 1 2
3 1 2 1 1 NA
4 2 3 2 1 NA
5 3 4 3 1 NA
6 4 4 4 1 2
EDITED to add an example where one row contains only missings.
Edit 2: this is my current solution. But it is really ugly and requires to stop half way through. Problem here is that the rename_with function doesn't recognize the on-the-fly generated "length_code" column when I put everything into one pipe.
test2 <- test %>%
rowwise() %>%
mutate(dist = min(c_across(everything())),
code = list(which(c_across(cols = c(everything(), -dist)) == dist)),
length_code = length(code)) %>%
ungroup() %>%
unnest_wider(code) %>%
test3 <- test2 %>%
rename_with(.cols = starts_with("..."), .fn = ~paste0("code_", 1:max(test2$length_code)))
which gives:
# A tibble: 6 x 6
x y dist code_1 code_2 length_code
<dbl> <dbl> <dbl> <int> <int> <int>
1 NA NA NA NA NA 0
2 1 1 1 1 2 2
3 1 2 1 1 NA 1
4 2 3 2 1 NA 1
5 3 4 3 1 NA 1
6 4 4 4 1 2 2
I have an incomplete time series dataframe and I need to insert rows of NAs for missing time stamps. There should always be 6 time stamps per day, which is indicated by the variable "Signal" (1-6) in the dataframe. I am trying to merge the incomplete dataframe A with a vector Bcontaining all Signals. Simplified example data below:
B <- rep(1:6,2)
A <- data.frame(Signal = c(1,2,3,5,1,2,4,5,6), var1 = c(1,1,1,1,1,1,1,1,1))
Expected <- data.frame(Signal = c(1,2,3,NA, 5, NA, 1,2,NA,4,5,6), var1 = c(1,1,1,NA,1,NA,1,1,NA,1,1,1)
Note that Brepresents a dataframe with multiple variables and the NAs in Expected are rows of NAs in the dataframe. Also the actual dataframe has more observations (84 in total).
Would be awesome if you guys could help me out!
If you already know there are 6 timestamps in a day you can do this without B. We can create groups for each day and use complete to add the missing observations with NA.
library(dplyr)
library(tidyr)
A %>%
group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
complete(Signal = 1:6) %>%
ungroup() %>%
select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 NA
# 5 5 1
# 6 6 NA
# 7 1 1
# 8 2 1
# 9 3 NA
#10 4 1
#11 5 1
#12 6 1
If in the output you need Signal as NA for missing combination you can use
A %>%
group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
complete(Signal = 1:6) %>%
mutate(Signal = replace(Signal, is.na(var1), NA)) %>%
ungroup %>%
select(-gr)
# Signal var1
# <dbl> <dbl>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 NA NA
# 5 5 1
# 6 NA NA
# 7 1 1
# 8 2 1
# 9 NA NA
#10 4 1
#11 5 1
#12 6 1