Imputing missing data conditional on other values in R - r

I working on a longitudinal patient dataset with some missing data. I'm trying to replicate a missing data imputation approach used by a published study. A snapshot of the first 18 rows of this dataset is below. Briefly, here there are 6 patients belonging to 3 different groups. Each person has been assessed over 3 years across a variety of tests. There is also information on Age, disease severity and a functional capacity score:
ID Group Time Age Severity Func.score Test1 Test2 Test3 Test4
1 A 1 60 5 50 -4 888 5 4
1 A 2 61 6 45 3 3 4 4
1 A 3 62 7 40 2 2 888 4
2 A 1 59 5 50 5 3 6 3
2 A 2 60 6 40 4 2 5 3
2 A 3 61 7 35 3 1 888 2
3 B 1 59 6 40 -4 -4 7 5
3 B 2 59 7 40 3 3 7 5
3 B 3 60 8 30 1 888 888 2
4 B 1 55 7 50 5 888 7 4
4 B 2 56 8 NA 3 1 6 3
4 B 3 57 9 NA 1 -4 6 888
5 C 1 54 7 40 6 6 5 5
5 C 2 55 8 40 4 5 5 4
5 C 3 56 8 35 2 888 5 3
6 C 1 60 6 50 6 7 4 4
6 C 2 61 6 40 5 6 4 888
6 C 3 62 7 30 3 5 4 888
Missing data in this dataset is coded in 3 possible ways. If NA, then the measure was not administered. If -4, the person could not complete the test due to a cognitive problem (i.e., they have poor memory etc.). If 888, then the person couldn't complete the test because of a physical problem (i.e., they have difficulty writing, drawing etc.).
My aim is to impute this missing data using two strategies.
If the missing data are because of a cognitive problem (i.e., where -4), then I want to impute the lowest possible score, given their specific time point and group membership. For example, for Test1 for ID1, I want the -4 substituted with 5 (as that is the only score that belongs to Time 1 and Group A).
If the missing data are because of a physical problem (i.e., where 888), I want to impute this using a regression equation using Age, Severity, and Functional score (Func.score) and all other available Test scores to predict that missing data point.
How can I build this conditional imputing into a dplyr::mutate or an ifelse or case_when function?

In tidymodels, you would have to set them to NA and not use coded values (I really do wish that we had different types of missing values).
No guarantees on this since we don't have a reproducible example but this might work for you:
some_recipe %>%
# for case 1 in your post
step_mutate(Test1 = ifelse(Test1 == -4, 5, Test1)) %>%
# for case 2
step_mutate(
# probably better to do this with across()
Test1 = ifelse(Test1 == 888, NA_integer_, Test1),
Test2 = ifelse(Test2 == 888, NA_integer_, Test2),
Test3 = ifelse(Test3 == 888, NA_integer_, Test3),
Test4 = ifelse(Test4 == 888, NA_integer_, Test4)
) %>%
step_impute_linear(starts_with("Test"),
impute_with = vars(Age, Severity, Func.score,
starts_with("Test")))

Related

Creating an summary dataset with multiple objects and multiple observations per object

I have a dataset with the reports from a local shop, where each line has a client's ID, date of purchase and total value per purchase.
I want to create a new plot where for each client ID I have all the purchases in the last month or even just sample purchases in a range of dates I choose.
The main problem is that certain customers might buy once a month, while others can come daily - so the number of observations per period of time can vary.
I have tried subsetting my dataset to a specific range of time, but either I choose a specific date - and then I only get a small % of all customers, or I choose a range and get multiple observations for certain customers.
(In this case - I wouldn't mind getting the earliest observation)
An important note: I know how to create a for loop to solve this problem, but since the dataset is over 4 million observations it isn't practical since it would take an extremely long time to run.
A basic example of what the dataset looks like:
ID Date Sum
1 1 1 234
2 1 2 45
3 1 3 1
4 2 4 223
5 3 5 546
6 4 6 12
7 2 1 20
8 4 3 30
9 6 2 3
10 3 5 45
11 7 6 456
12 3 7 65
13 8 8 234
14 1 9 45
15 3 2 1
16 4 3 223
17 6 6 546
18 3 4 12
19 8 7 20
20 9 5 30
21 11 6 3
22 12 6 45
23 14 9 456
24 15 10 65
....
And the new data set would look something like this:
ID 1Date 1Sum 2Date 2Sum 3Date 3Sum
1 1 234 2 45 3 1
2 1 20 4 223 NA NA
3 2 1 5 546 5 45
...
Thanks for your help!
I think you can do this with a bit if help from dplyr and tidyr
library(dplyr)
library(tidyr)
dd %>% group_by(ID) %>% mutate(seq=1:n()) %>%
pivot_wider("ID", names_from="seq", values_from = c("Date","Sum"))
Where dd is your sample data frame above.

R: how to use expand.grid to generate combinations based on group

I am trying to get all combinations of values per group. I want to prevent combination of values between different groups.
To create all combinations of values (no matter which group the value belongs) vaI can use:
expand.grid(value, value)
Awaited result should be the subset of result of previous command.
Example:
#base data
value = c(1,3,5, 1,5,7,9, 2)
group = c("a", "a", "a","b","b","b","b", "c")
base <- data.frame(value, group)
#creating ALL combinations of value
allComb <- expand.grid(base$value, base$value)
#awaited result is subset of allComb.
#Note: first colums shows the number of row from allComb.
#Empty rows are separating combinations per group and are shown only for clarification.
Var1 Var2
1 1 1
2 3 1
3 5 1
11 1 3
12 3 3
13 5 3
21 1 5
22 3 5
23 5 5
34 1 1
35 5 1
36 7 1
37 9 1
44 1 5
45 5 5
46 7 5
47 9 5
54 1 7
55 5 7
56 7 7
57 9 7
64 1 9
65 5 9
66 7 9
67 9 9
78 2 2

R: Sum column from table 2 based on value in table 1, and store result in table 1

I am a R noob, and hope some of you can help me.
I have two data sets:
- store (containing store data, including location coordinates (x,y). The location are integer values, corresponding to GridIds)
- grid (containing all gridIDs (x,y) as well as a population variable TOT_P for each grid point)
What I want to achieve is this:
For each store I want loop over the grid date, and sum the population of the grid ids close to the store grid id.
I.e basically SUMIF the grid population variable, with the condition that
grid(x) < store(x) + 1 &
grid(x) > store(x) - 1 &
grid(y) < store(y) + 1 &
grid(y) > store(y) - 1
How can I accomplish that? My own take has been trying to use different things like merge, sapply, etc, but my R inexperience stops me from getting it right.
Thanks in advance!
Edit:
Sample data:
StoreName StoreX StoreY
Store1 3 6
Store2 5 2
TOT_P GridX GridY
8 1 1
7 2 1
3 3 1
3 4 1
22 5 1
20 6 1
9 7 1
28 1 2
8 2 2
3 3 2
12 4 2
12 5 2
15 6 2
7 7 2
3 1 3
3 2 3
3 3 3
4 4 3
13 5 3
18 6 3
3 7 3
61 1 4
25 2 4
5 3 4
20 4 4
23 5 4
72 6 4
14 7 4
178 1 5
407 2 5
26 3 5
167 4 5
58 5 5
113 6 5
73 7 5
76 1 6
3 2 6
3 3 6
3 4 6
4 5 6
13 6 6
18 7 6
3 1 7
61 2 7
25 3 7
26 4 7
167 5 7
58 6 7
113 7 7
The output I am looking for is
StoreName StoreX StoreY SUM_P
Store1 3 6 479
Store2 5 2 119
I.e for store1 it is the sum of TOT_P for Grid fields X=[2-4] and Y=[5-7]
One approach would be to use dplyr to calculate the difference between each store and all grid points and then group and sum based on these new columns.
#import library
library(dplyr)
#create example store table
StoreName<-paste0("Store",1:2)
StoreX<-c(3,5)
StoreY<-c(6,2)
df.store<-data.frame(StoreName,StoreX,StoreY)
#create example population data (copied example table from OP)
df.pop
#add dummy column to each table to enable cross join
df.store$k=1
df.pop$k=1
#dplyr to join, calculate absolute distance, filter and sum
df.store %>%
inner_join(df.pop, by='k') %>%
mutate(x.diff = abs(StoreX-GridX), y.diff=abs(StoreY-GridY)) %>%
filter(x.diff<=1, y.diff<=1) %>%
group_by(StoreName) %>%
summarise(StoreX=max(StoreX), StoreY=max(StoreY), tot.pop = sum(TOT_P) )
#output:
StoreName StoreX StoreY tot.pop
<fctr> <dbl> <dbl> <int>
1 Store1 3 6 721
2 Store2 5 2 119

Using tidyverse gather() to output multiple value vectors with a single key in a data frame

Despite the conventions of R, data collection and entry is for me most easily done in vertical columns. Therefore, I have a question about efficiently converting to horizontal rows with the gather() function in the tidyverse library. I find myself using gather() over and over which seems inefficient. Is there a more efficient way? And can an existing vector serve as the key? Here is an example:
Let's say we have the following health metrics on baby birds.
bird day_1_mass day_2_mass day_1_heart_rate day_3_heart_rate
1 1 5 6 60 55
2 2 6 8 62 57
3 3 3 3 45 45
Using the gather function I can reorganize the mass data into rows.
horizontal.data <- gather(vertical.data,
key = age,
value = mass,
day_1_mass:day_2_mass,
factor_key=TRUE)
Giving us
bird day_1_heart_rate day_3_heart_rate age mass
1 1 60 55 day_1_mass 5
2 2 62 57 day_1_mass 6
3 3 45 45 day_1_mass 3
4 1 60 55 day_2_mass 6
5 2 62 57 day_2_mass 8
6 3 45 45 day_2_mass 3
And use the same function again to similarly reorganize heart rate data.
horizontal.data.2 <- gather(horizontal.data,
key = age2,
value = heart_rate,
day_1_heart_rate:day_3_heart_rate,
factor_key=TRUE)
Producing a new dataframe
bird age mass age2 heart_rate
1 1 day_1_mass 5 day_1_heart_rate 60
2 2 day_1_mass 6 day_1_heart_rate 62
3 3 day_1_mass 3 day_1_heart_rate 45
4 1 day_2_mass 6 day_1_heart_rate 60
5 2 day_2_mass 8 day_1_heart_rate 62
6 3 day_2_mass 3 day_1_heart_rate 45
7 1 day_1_mass 5 day_3_heart_rate 55
8 2 day_1_mass 6 day_3_heart_rate 57
9 3 day_1_mass 3 day_3_heart_rate 45
10 1 day_2_mass 6 day_3_heart_rate 55
11 2 day_2_mass 8 day_3_heart_rate 57
12 3 day_2_mass 3 day_3_heart_rate 45
So it took two steps, but it worked. The questions are 1) Is there a way to do this in one step? and 2) Can it alternatively be done with one key (the "age" vector) that I can then simply replace as numeric data?
if I get the question right, you could do that by first gathering everything together, and then "spreading" on mass and heart rate:
library(forcats)
library(dplyr)
mass_levs <- names(vertical.data)[grep("mass", names(vertical.data))]
hearth_levs <- names(vertical.data)[grep("heart", names(vertical.data))]
horizontal.data <- vertical.data %>%
gather(variable, value, -bird, factor_key = TRUE) %>%
mutate(day = stringr::str_sub(variable, 5,5)) %>%
mutate(variable = fct_collapse(variable,
"mass" = mass_levs,
"hearth_rate" = hearth_levs)) %>%
spread(variable, value)
, giving:
bird day mass hearth_rate
1 1 1 5 60
2 1 2 6 NA
3 1 3 NA 55
4 2 1 6 62
5 2 2 8 NA
6 2 3 NA 57
7 3 1 3 45
8 3 2 3 NA
9 3 3 NA 45
we can see how it works by going through the pipe one pass at a time.
First, we gather everyting on a long format:
horizontal.data <- vertical.data %>%
gather(variable, value, -bird, factor_key = TRUE)
bird variable value
1 1 day_1_mass 5
2 2 day_1_mass 6
3 3 day_1_mass 3
4 1 day_2_mass 6
5 2 day_2_mass 8
6 3 day_2_mass 3
7 1 day_1_heart_rate 60
8 2 day_1_heart_rate 62
9 3 day_1_heart_rate 45
10 1 day_3_heart_rate 55
11 2 day_3_heart_rate 57
12 3 day_3_heart_rate 45
then, if we want to keep a "proper" long table, as the OP suggested we have to create a single key variable. In this case, it makes sense to use the day (= age). To create the day variable, we can extract it from the character strings now in variable:
%>% mutate(day = stringr::str_sub(variable, 5,5))
here, str_sub gets the substring in position 5, which is the day (note that if in the full dataset you have multiple-digits days, you'll have to tweak this a bit, probably by splitting on _):
bird variable value day
1 1 day_1_mass 5 1
2 2 day_1_mass 6 1
3 3 day_1_mass 3 1
4 1 day_2_mass 6 2
5 2 day_2_mass 8 2
6 3 day_2_mass 3 2
7 1 day_1_heart_rate 60 1
8 2 day_1_heart_rate 62 1
9 3 day_1_heart_rate 45 1
10 1 day_3_heart_rate 55 3
11 2 day_3_heart_rate 57 3
12 3 day_3_heart_rate 45 3
now, to finish we have to "spread " the table to have a mass and a heart rate column.
Here we have a problem, because currently there are 2 levels each corresponding to mass and hearth rate in the variable column. Therefore, applying spread on variable would give us again four columns.
To prevent that, we need to aggregate the four levels in variable into two levels. We can do that by using forcats::fc_collapse, by providing the association between the new level names and the "old" ones. Outside of a pipe, that would correspond to:
horizontal.data$variable <- fct_collapse(horizontal.data$variable,
mass = c("day_1_mass", "day_2_mass",
heart = c("day_1_hearth_rate", "day_3_heart_rate")
However, if you have many levels it is cumbersome to write them all. Therefore, I find beforehand the level names corresponding to the two "categories" using
mass_levs <- names(vertical.data)[grep("mass", names(vertical.data))]
hearth_levs <- names(vertical.data)[grep("heart", names(vertical.data))]
mass_levs
[1] "day_1_mass" "day_2_mass"
hearth_levs
[1] "day_1_heart_rate" "day_3_heart_rate"
therefore, the third line of the pipe can be shortened to:
%>% mutate(variable = fct_collapse(variable,
"mass" = mass_levs,
"hearth_rate" = hearth_levs))
, after which we have:
bird variable value day
1 1 mass 5 1
2 2 mass 6 1
3 3 mass 3 1
4 1 mass 6 2
5 2 mass 8 2
6 3 mass 3 2
7 1 hearth_rate 60 1
8 2 hearth_rate 62 1
9 3 hearth_rate 45 1
10 1 hearth_rate 55 3
11 2 hearth_rate 57 3
12 3 hearth_rate 45 3
, so that we are now in the condition to "spread" the table again according to variable using:
%>% spread(variable, value)
bird day mass hearth_rate
1 1 1 5 60
2 1 2 6 NA
3 1 3 NA 55
4 2 1 6 62
5 2 2 8 NA
6 2 3 NA 57
7 3 1 3 45
8 3 2 3 NA
9 3 3 NA 45
HTH
If you insist on a single command , i can give you one
setup the data.frame
c1<-c(1,2,3)
c2<-c(5,6,3)
c3<-c(6,8,3)
c4<-c(60,62,45)
c5<-c(55,57,45)
dt<-as.data.table(cbind(c1,c2,c3,c4,c5))
colnames(dt)<-c("bird","day_1_mass","day_2_mass","day_1_heart_rate","day_3_heart_rate")
Now use this single command to get the final outcome
merge(melt(dt[,c("bird","day_1_mass","day_2_mass")],id.vars = c("bird"),variable.name = "age",value.name="mass"),melt(dt[,c("bird","day_1_heart_rate","day_3_heart_rate")],id.vars = c("bird"),variable.name = "age2",value.name="heart_rate"),by = "bird")
The final outcome is
bird age mass age2 heart_rate
1: 1 day_1_mass 5 day_1_heart_rate 60
2: 1 day_1_mass 5 day_3_heart_rate 55
3: 1 day_2_mass 6 day_1_heart_rate 60
4: 1 day_2_mass 6 day_3_heart_rate 55
5: 2 day_1_mass 6 day_1_heart_rate 62
6: 2 day_1_mass 6 day_3_heart_rate 57
7: 2 day_2_mass 8 day_1_heart_rate 62
8: 2 day_2_mass 8 day_3_heart_rate 57
9: 3 day_1_mass 3 day_1_heart_rate 45
10: 3 day_1_mass 3 day_3_heart_rate 45
11: 3 day_2_mass 3 day_1_heart_rate 45
12: 3 day_2_mass 3 day_3_heart_rate 45
Though already answered, I have a different solution in which you save a list of the gather parameters you would like to run, and then run the gather_() command for each set of parameters in the list.
# Create a list of gather parameters
# Format is key, value, columns_to_gather
gather.list <- list(c("age", "mass", "day_1_mass", "day_2_mass"),
c("age2", "heart_rate", "day_1_heart_rate", "day_3_heart_rate"))
# Run gather command for each list item
for(i in gather.list){
df <- gather_(df, key_col = i[1], value_col = i[2], gather_cols = c(i[3:length(i)]), factor_key = TRUE)
}

Remove duplicate observations based on set of rules

I am trying to remove duplicate observations from a data set based on my variable, id. However, I want the removal of observations to be based on the following rules. The variables below are id, the sex of household head (1-male, 2-female) and the age of the household head. The rules are as follows. If a household has both male and female household heads, remove the female household head observation. If a household as either two male or two female heads, remove the observation with the younger household head. An example data set is below.
id = c(1,2,2,3,4,5,5,6,7,8,8,9,10)
sex = c(1,1,2,1,2,2,2,1,1,1,1,2,1)
age = c(32,34,54,23,32,56,67,45,51,43,35,80,45)
data = data.frame(cbind(id,sex,age))
You can do this by first ordering the data.frame so the desired entry for each id is first, and then remove the rows with duplicate ids.
d <- with(data, data[order(id, sex, -age),])
# id sex age
# 1 1 1 32
# 2 2 1 34
# 3 2 2 54
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 6 5 2 56
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 11 8 1 35
# 12 9 2 80
# 13 10 1 45
d[!duplicated(d$id), ]
# id sex age
# 1 1 1 32
# 2 2 1 34
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 12 9 2 80
# 13 10 1 45
With data.table, this is easy with "compound queries". To order the data when you read it in, set the "key" when you read it in as "id,sex" (required in case any female values would come before male values for a given ID).
> library(data.table)
> DT <- data.table(data, key = "id,sex")
> DT[, max(age), by = key(DT)][!duplicated(id)]
id sex V1
1: 1 1 32
2: 2 1 34
3: 3 1 23
4: 4 2 32
5: 5 2 67
6: 6 1 45
7: 7 1 51
8: 8 1 43
9: 9 2 80
10: 10 1 45

Resources