Replacing NA in longitudinal data with average difference of non-missing values - r

Here is a simplified version of the data I am working with:
dat <- data.frame(country = c("country1", "country2", "country3", "country1", "country2"), measurement = c("m1", "m1", "m1", "m2", "m2"),
y2015 = c(NA, 15, 19, 13, 55), y2016 = c(NA, 17, NA, 10, NA), y2017 = c(14, NA, NA, 9, 45), y2018 = c(18, 22, 16, NA, 40))
I am trying to take the difference between the two non-missing values on either side of the NAs and fill the gap in equal steps, that is, using the average change per year.
For row 5, this would be something like c(55, 50, 45, 40).
However, it also needs to work for rows that have more than one missing value in a sequence, like row 1 and row 3. For row 1, I'd like the difference between 14 and 18 to be extended backwards, so it should look something like c(6, 10, 14, 18). For row 3, the difference between 19 and 16 should be spread evenly across the two missing years, to look something like c(19, 18, 17, 16).
Essentially, I'm looking to create a slope for each country and measurement through the available years, and to interpolate the missing values based on that slope.
I am trying to think of a package for this, or perhaps write a loop. I have looked at the package 'spline', but it does not seem to work, since I want to run a separate linear interpolation for each country and measurement.
Any thoughts would be greatly appreciated!

Use zoo::na.spline:
library(zoo)
dat[-c(1:2)] <- t(na.spline(t(dat[-c(1:2)])))
country measurement y2015 y2016 y2017 y2018
1 country1 m1 6 10 14.00000 18
2 country2 m1 15 17 19.33333 22
3 country3 m1 19 18 17.00000 16
4 country1 m2 13 10 9.00000 10
5 country2 m2 55 50 45.00000 40
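Note that na.spline fits a spline rather than a straight line, which is why row 2 shows 19.33333 instead of 19.5 and row 4 is extrapolated up to 10. If strictly linear filling is wanted, the following is a minimal base R sketch. The helper name fill_linear is hypothetical, and the choice to extrapolate leading and trailing gaps from the two nearest observed values is an assumption about what the question intends.

```r
dat <- data.frame(country = c("country1", "country2", "country3", "country1", "country2"),
                  measurement = c("m1", "m1", "m1", "m2", "m2"),
                  y2015 = c(NA, 15, 19, 13, 55), y2016 = c(NA, 17, NA, 10, NA),
                  y2017 = c(14, NA, NA, 9, 45), y2018 = c(18, 22, 16, NA, 40))

# Hypothetical helper: linear interpolation inside the observed range,
# linear extrapolation outside it (assumes equally spaced years).
fill_linear <- function(y) {
  x  <- seq_along(y)
  ok <- which(!is.na(y))
  if (length(ok) < 2) return(y)            # a slope needs at least two points
  out <- approx(x[ok], y[ok], xout = x)$y  # interior gaps; NA outside the range
  f <- ok[1:2]                             # first two observed positions
  l <- rev(ok)[1:2]                        # last and second-to-last positions
  lead  <- x < f[1]
  trail <- x > l[1]
  out[lead]  <- y[f[1]] + (x[lead]  - f[1]) * (y[f[2]] - y[f[1]]) / (f[2] - f[1])
  out[trail] <- y[l[1]] + (x[trail] - l[1]) * (y[l[1]] - y[l[2]]) / (l[1] - l[2])
  out
}

# Apply row by row to the year columns only
dat[-(1:2)] <- t(apply(dat[-(1:2)], 1, fill_linear))
dat
```

With this sketch, row 2 becomes c(15, 17, 19.5, 22) and row 4 becomes c(13, 10, 9, 8), because the trailing gap continues the 10 to 9 step.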

Related

Calculating rolling mean by ID with variable window width

I have a repeated measures dataset of vital signs. I'm trying to calculate some summary statistics (mean, min, max, slope, etc) of the patient's prior 24 hours of observations, measured by the Admit_to_Perform variable, excluding the current observation. Here is an extract of the first patient's first 15 observations:
df1 <- data.frame(ID = rep(1, 15),
Admit_to_Perform = c(1.07, 1.07, 1.70, 3.73, 3.73, 4.20, 8.87, 11.68, 14.80, 15.67, 19.08, 23.15, 29.68, 36.03, 39.08),
Resp_Rate = c(18, 17, 18, 17, 16, 16, 16, 16, 16, 17, 16, 16, 16, 16, 16))
ID Admit_to_Perform Resp_Rate
1: 1 1.07 18
2: 1 1.07 17
3: 1 1.70 18
4: 1 3.73 17
5: 1 3.73 16
6: 1 4.20 16
7: 1 8.87 16
8: 1 11.68 16
9: 1 14.80 16
10: 1 15.67 17
11: 1 19.08 16
12: 1 23.15 16
13: 1 29.68 16
14: 1 36.03 16
15: 1 39.08 16
What I would like is to add on a column for each summary statistic of Resp_Rate. The first row has no prior observations in the past 24 hours, so it can be blank, but for the second row the mean would be 18, for the third row 17.5, the fourth row 17.667, and so on. However for the 13th row, because Admit_to_Perform is more than 24 hours after the first 6 observations, it would only take the mean of rows 7-12.
I've tried using some of the zoo and data.table functions but don't seem to be getting anywhere.
EDIT: I should probably mention that my dataset exceeds 1.5m rows. Base R and dplyr solutions using any kind of rowwise or filter work, but are too slow to run (4 days and counting before I killed the command).
A quick and dirty solution, with Resp_Rate hard-coded into f(). It will be slow because it filters the dataset once for every row, but it computes the mean over the prior 24 hours.
library(tidyverse)
df1 <- data.frame(ID = rep(1, 15),
Admit_to_Perform = c(1.07, 1.07, 1.70, 3.73, 3.73, 4.20, 8.87, 11.68, 14.80, 15.67, 19.08, 23.15, 29.68, 36.03, 39.08),
Resp_Rate = c(18, 17, 18, 17, 16, 16, 16, 16, 16, 17, 16, 16, 16, 16, 16))
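f <- function(data, id, outcome, time, window=24) {
  # keep this patient's observations from the `window` hours before `time`;
  # the strict `<` also drops observations recorded at exactly `time`
  data <- filter(data,
                 ID==id,
                 Admit_to_Perform>(time-window),
                 Admit_to_Perform<time)
  # no prior observations in the window: nothing to average
  if(nrow(data)==0) return(NA_real_)
  mean(data$Resp_Rate)
}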
df1 %>%
rowwise() %>%
mutate(roll=f(data=., id=ID, outcome=Resp_Rate, time=Admit_to_Perform))
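Because the question mentions more than 1.5m rows, a vectorised alternative may be worth trying. The following is only a sketch, under these assumptions: rows are sorted by Admit_to_Perform within each ID, Resp_Rate has no NAs, and "prior" means earlier rows, so an earlier row with the same timestamp counts (which matches the expected mean of 18 for row 2). The helper name roll_mean_prior is hypothetical.

```r
df1 <- data.frame(ID = rep(1, 15),
                  Admit_to_Perform = c(1.07, 1.07, 1.70, 3.73, 3.73, 4.20, 8.87, 11.68,
                                       14.80, 15.67, 19.08, 23.15, 29.68, 36.03, 39.08),
                  Resp_Rate = c(18, 17, 18, 17, 16, 16, 16, 16, 16, 17, 16, 16, 16, 16, 16))

# Hypothetical helper: mean of the earlier rows whose time is > time - window.
# Assumes `time` is sorted in non-decreasing order and `value` has no NAs.
roll_mean_prior <- function(time, value, window = 24) {
  i  <- seq_along(time)
  cs <- cumsum(c(0, value))                 # cs[k + 1] is the sum of the first k values
  lo <- findInterval(time - window, time)   # number of rows at or before time - window
  n  <- i - 1 - lo                          # rows lo + 1 .. i - 1 fall in the window
  s  <- cs[i] - cs[lo + 1]                  # their sum
  ifelse(n > 0, s / n, NA_real_)
}

# Apply per patient and put the results back in the original row order
df1$roll <- unsplit(
  lapply(split(df1, df1$ID),
         function(d) roll_mean_prior(d$Admit_to_Perform, d$Resp_Rate)),
  df1$ID
)
df1
```

The same cumulative-sum idea extends to other statistics that can be built from running totals; min, max and slope would need a different approach.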

Combining components of a list in r

I have a list that contains data by year. I want to combine these components into a single dataframe, which is matched by row. Example list:
List [[1]]
State Year X Y
23 1971 etc etc
47 1971 etc etc
List[[2]]
State Year X Y
13 1972 etc etc
23 1973 etc etc
47 1973 etc etc
etc....
List[[45]]
State Year X Y
1 2017 etc etc
2 2017 etc etc
3 2017 etc etc
1 2017 etc etc
23 2017 etc etc
47 2017 etc etc
I want the dataframe to look like this (I know I will have to go through and remove some extra columns):
State 1971_X 1971_Y 1972_X 1972_Y....2018_X 2019_Y
1 NA NA NA NA etc etc
2 NA NA etc etc etc etc
3 etc etc etc etc etc etc
...
50 NA NA etc etc etc etc
I have tried the command Outcomewanted=do.call("cbind", examplelist) but get the message
"Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 36, 40, 20, 42, 38, 26, 17, 31, 35, 23, 33, 13, 29, 28, 32, 34, 41, 37, 43, 39, 30, 14, 10, 4, 7"
It seems that the cbind.fill command could be an option but has been retired? Thanks for any help in advance.
You may use reshape after a do.call(rbind()) manoeuvre.
res <- reshape(do.call(rbind, lst), idvar="state", timevar="year", direction="wide")
res
# state x.1971 y.1971 x.1972 y.1972 x.1973 y.1973
# 1 23 1.3709584 0.3631284 NA NA -0.1061245 2.0184237
# 2 24 -0.5646982 0.6328626 NA NA 1.5115220 -0.0627141
# 3 13 NA NA 0.4042683 -0.09465904 NA NA
Data
lst <- list(structure(list(state = c(23, 24), year = c(1971, 1971),
x = c(1.37095844714667, -0.564698171396089), y = c(0.363128411337339,
0.63286260496104)), class = "data.frame", row.names = c(NA,
-2L)), structure(list(state = c(13, 23, 24), year = c(1972, 1973,
1973), x = c(0.404268323140999, -0.106124516091484, 1.51152199743894
), y = c(-0.0946590384130976, 2.01842371387704, -0.062714099052421
)), class = "data.frame", row.names = c(NA, -3L)))

Categorizing values and grouping by year?

I have a dataset with numbers (between 10 and 30) each day for 5 years. I want to know how many days per year are between 24 and 26, 26 and 28, and 28 and 30.
I think I should categorise the data (so any values between 10 and 24 become 0, 24 and 26 become 1, 26 and 28 become 2, 28 and 30 become 3) and then group by year?
Data:
set.seed(22)
dates <- seq(as.Date("2000/01/01"), by = "day", length.out = 1825)
numbers <- sample(10:30, 1825, replace=TRUE)
df1 <- data.frame(dates, numbers)
I tried this to categorise the data:
Test <- cut(df1$numbers,
breaks = c(10, 24, 26, 28, 30),
labels = c("0", "1", "2", "3"))
It seems to work, but I get NAs, so I presume I need to adjust the breaks passed to cut. Then how would I count occurrences per year?
So I would know in 2000 there are x days when between 24 and 26 and so on....
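One likely source of the NAs is that cut uses right-closed intervals by default, so the lowest value, 10, falls outside (10, 24]. A minimal sketch, assuming that behaviour is the cause and that values on a boundary should go to the lower category (24 into "0", 26 into "1", and so on); the column names category and year are introduced here for illustration:

```r
set.seed(22)
dates <- seq(as.Date("2000/01/01"), by = "day", length.out = 1825)
numbers <- sample(10:30, 1825, replace = TRUE)
df1 <- data.frame(dates, numbers)

# include.lowest = TRUE closes the first interval on the left: [10, 24]
df1$category <- cut(df1$numbers,
                    breaks = c(10, 24, 26, 28, 30),
                    labels = c("0", "1", "2", "3"),
                    include.lowest = TRUE)

# Extract the year and cross-tabulate: one row per year, one column per category
df1$year <- format(df1$dates, "%Y")
counts <- table(df1$year, df1$category)
counts
```

If boundary values should instead go to the upper category, right = FALSE (together with include.lowest = TRUE, which then closes the last interval) is the variant to try.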

Need to create a variable based on the equality of other variables

I have a dataset called CSES (Comparative Study of Electoral Systems) where each row corresponds to an individual (one interview in a public opinion survey), from many countries, in many different years.
I need to create a variable which identifies the ideology of the party each person voted for, as perceived by this same person.
However, the dataset identifies this perceived ideology of each party (like many other variables) by letters A, B, C, etc. Then, when it comes to identifying WHICH PARTY each person voted for, it has a UNIQUE CODE NUMBER that does not correspond to these letters across different years (i.e., the same party can have a different letter in different years, and, of course, it is never the same party across different countries, since each country has its own political parties).
Fictitious data to help clarify, reproduce and create a code:
Let’s say:
country = c(1,1,1,1,2,2,2,2,3,3,3,3)
year = c (2000,2000,2004,2004, 2002,2002,2004,2008,2000,2000,2000,2000)
party_A_number = c(11,11,12,12,21,21,22,23,31,31,31,31)
party_B_number = c(12, 12, 11, 11, 22,22,21,22,32,32,32,32)
party_C_number = c(13,13,13,13,23,23,23,21,33,33,33,33)
party_voted = c(12,13,12,11,21,24,23,22,31,32,33,31)
ideology_party_A <- floor(runif (12, min=1, max=10))
ideology_party_B <- floor(runif (12, min=1, max=10))
ideology_party_C <- floor(runif (12, min=1, max=10))
Let’s call the variable I want to create “ideology_voted”:
I need something like:
IF party_A_number == party_voted THEN ideology_voted = ideology_party_A
IF party_B_number == party_voted THEN ideology_voted = ideology_party_B
IF party_C_number == party_voted THEN ideology_voted = ideology_party_C
The real dataset has 9 letters for (up to) 9 main parties in each country, plus dozens of countries and election-years. Therefore, it would be great to have code that iterates through letters A-I instead of "if voted party A, then ...; if voted party B, then ...".
Nevertheless, I am having trouble even when I try longer, repetitive code (one transformation for each party letter, which would give me 8 lines of code).
library(tidyverse)
df <- tibble(
country = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
year = c(2000, 2000, 2004, 2004, 2002, 2002, 2004, 2008, 2000, 2000, 2000, 2000),
party_A_number = c(11, 11, 12, 12, 21, 21, 22, 23, 31, 31, 31, 31),
party_B_number = c(12, 12, 11, 11, 22, 22, 21, 22, 32, 32, 32, 32),
party_C_number = c(13, 13, 13, 13, 23, 23, 23, 21, 33, 33, 33, 33),
party_voted = c(12, 13, 12, 11, 21, 24, 23, 22, 31, 32, 33, 31),
ideology_party_A = floor(runif (12, min = 1, max = 10)),
ideology_party_B = floor(runif (12, min = 1, max = 10)),
ideology_party_C = floor(runif (12, min = 1, max = 10))
)
> df
# A tibble: 12 x 9
country year party_A_number party_B_number party_C_number party_voted ideology_party_A ideology_party_B
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2000 11 12 13 12 9 3
2 1 2000 11 12 13 13 2 6
3 1 2004 12 11 13 12 3 8
4 1 2004 12 11 13 11 7 8
5 2 2002 21 22 23 21 2 7
6 2 2002 21 22 23 24 8 2
7 2 2004 22 21 23 23 1 7
8 2 2008 23 22 21 22 7 7
9 3 2000 31 32 33 31 4 3
10 3 2000 31 32 33 32 7 5
11 3 2000 31 32 33 33 1 6
12 3 2000 31 32 33 31 2 1
# ... with 1 more variable: ideology_party_C <dbl>
It seems you're after conditioning using case_when:
ideology_voted <- df %>% transmute(
  ideology_voted = case_when(
    party_A_number == party_voted ~ ideology_party_A,
    party_B_number == party_voted ~ ideology_party_B,
    party_C_number == party_voted ~ ideology_party_C,
    # no party number matches the vote: leave the ideology missing
    TRUE ~ NA_real_
  )
)
> ideology_voted
# A tibble: 12 x 1
ideology_voted
<dbl>
1 3
2 7
3 3
4 8
5 2
6 NA
7 8
8 7
9 4
10 5
11 6
12 2
Note that case_when checks the conditions in order and uses the first one that is true, which matters if more than one could be true for a row. Rows where no party number matches party_voted (row 6 here) get NA rather than a party code.
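The question also asks for a way to iterate over party letters rather than writing one condition per letter. The following is a minimal base R sketch using matrix indexing; it assumes the columns follow the party_X_number and ideology_party_X naming pattern from the example, and the helper objects (party_letters, num, ideo, hit) are hypothetical names. For the real data, party_letters would be LETTERS[1:9].

```r
set.seed(1)
df <- data.frame(
  party_A_number = c(11, 11, 12, 12, 21, 21, 22, 23, 31, 31, 31, 31),
  party_B_number = c(12, 12, 11, 11, 22, 22, 21, 22, 32, 32, 32, 32),
  party_C_number = c(13, 13, 13, 13, 23, 23, 23, 21, 33, 33, 33, 33),
  party_voted    = c(12, 13, 12, 11, 21, 24, 23, 22, 31, 32, 33, 31),
  ideology_party_A = floor(runif(12, min = 1, max = 10)),
  ideology_party_B = floor(runif(12, min = 1, max = 10)),
  ideology_party_C = floor(runif(12, min = 1, max = 10))
)

party_letters <- c("A", "B", "C")
num  <- as.matrix(df[paste0("party_", party_letters, "_number")])
ideo <- as.matrix(df[paste0("ideology_party_", party_letters)])

# For each row, the first letter whose party number equals the vote (NA if none)
hit <- apply(num == df$party_voted, 1, function(r) which(r)[1])

# Pick ideo[row, hit] for every row; an NA column index yields NA
df$ideology_voted <- ideo[cbind(seq_len(nrow(df)), hit)]
df$ideology_voted
```

Because the ideology values are random, no particular output is shown; the only fixed result is that row 6, whose vote (24) matches no listed party, comes out as NA.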

moving values from one dataframe to another, depending on value of a variable

Not being familiar with R, I've got the following problem: I want to add the values probeposition from the dataframe mlpa to the dataframe patients, with the values of probeposition being linked by values being present both in mlpa and patients (i.e. probe and patprobe). As far as I've seen, this problem is not covered by the usual data management tutorials.
#mlpa:
probe <- c(12,15,18,19)
probeposition <- c(100,1200,500,900)
mlpa = data.frame(probe = probe, probeposition = probeposition)
#patients:
patid <- c('AT', 'GA', 'TT', 'AG', 'GG', 'TA')
patprobe <- c(12, 12, NA, NA, 18, 19)
patients = data.frame(patid = patid, patprobe = patprobe)
#And that's what I finally want:
patprobeposition = c(100, 100, NA, NA, 500, 900)
patients$patprobeposition = patprobeposition
Update
Following Andrie's response, I realised I should mention that there are several "probes" in the patients dataset, so the data would actually look more like this (in fact, there would not only be probe1 and probe2, but probe1-probe4):
mlpa <- data.frame(probe = c(12,15,18,19),
probeposition = c(100,1200,500,900) )
patients <- data.frame(patid = c('AT', 'GA', 'TT', 'AG', 'GG', 'TA'),
probe1 = c(12, 12, NA, NA, 18, 19),
probe2 = c(15, 15, NA, NA, 19, 19) )
And what I want is this:
patients <- data.frame(patid = c('AT', 'GA', 'TT', 'AG', 'GG', 'TA'),
probe1 = c(12, 12, NA, NA, 18, 19),
probe2 = c(15, 15, NA, NA, 19, 19),
position1 = c(100, 100, NA, NA, 500, 900),
position2 = c(1200, 1200, NA, NA, 900, 900))
You can do this very easily using merge, which takes two data frames and joins them on common columns or row names.
The easiest way to get merge to work is to make sure you have matching column names where those columns refer to the same information. To be specific, I have renamed your column patprobe to probe:
mlpa <- data.frame(
probe = c(12,15,18,19),
probeposition = c(100,1200,500,900)
)
patients <- data.frame(
patid = c('AT', 'GA', 'TT', 'AG', 'GG', 'TA'),
probe = c(12, 12, NA, NA, 18, 19)
)
Now you can call merge. However, note that by default merge returns only matching rows (in database terminology, an inner join). What you want is to include all of the rows in patients (a left outer join). You do this by specifying all.x=TRUE:
merge(patients, mlpa, all.x=TRUE, sort=FALSE)
probe patid probeposition
1 12 AT 100
2 12 GA 100
3 18 GG 500
4 19 TA 900
5 NA TT NA
6 NA AG NA
Install the reshape2 package and try the following:
require(reshape2)
m.patients = melt(patients)
m.patients = merge(m.patients, mlpa,
by.x = "value",
by.y = "probe",
all = TRUE)
reshape(m.patients, direction="wide",
timevar="variable", idvar="patid")
This should give you output like the following, which can be cleaned up to match your desired output.
patid value.probe1 probeposition.probe1 value.probe2 probeposition.probe2
1 AT 12 100 15 1200
2 GA 12 100 15 1200
5 GG 18 500 19 900
7 TA 19 900 19 900
9 TT NA NA NA NA
10 AG NA NA NA NA
Update
Of course, you can also do it all with the reshape2 package as below:
m.patients = melt(patients, id.vars="patid", variable.name="time")
m.patients = melt(merge(m.patients, mlpa, by.x = "value",
by.y = "probe", all = TRUE))
dcast(m.patients, patid ~ variable + time )
Which results in:
patid value_probe1 value_probe2 probeposition_probe1 probeposition_probe2
1 AG NA NA NA NA
2 AT 12 15 100 1200
3 GA 12 15 100 1200
4 GG 18 19 500 900
5 TA 19 19 900 900
Update 2: Using Base R Reshape
You can also avoid using the reshape2 package entirely.
patients.l = reshape(patients, direction="long", idvar="patid",
varying=c("probe1", "probe2"), sep="")
reshape(merge(patients.l, mlpa, all = TRUE), direction="wide",
idvar="patid", timevar="time")
This gets you closest to your desired output:
patid probe.1 probeposition.1 probe.2 probeposition.2
1 AT 12 100 15 1200
2 GA 12 100 15 1200
5 GG 18 500 19 900
7 TA 19 900 19 900
9 TT NA NA NA NA
10 AG NA NA NA NA
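If reshaping is not needed, a lookup with match can produce the desired columns directly. This is a minimal base R sketch, assuming every probe column is named probe followed by a number, as in the example; probe_cols and pos are hypothetical names.

```r
mlpa <- data.frame(probe = c(12, 15, 18, 19),
                   probeposition = c(100, 1200, 500, 900))
patients <- data.frame(patid = c('AT', 'GA', 'TT', 'AG', 'GG', 'TA'),
                       probe1 = c(12, 12, NA, NA, 18, 19),
                       probe2 = c(15, 15, NA, NA, 19, 19))

probe_cols <- c("probe1", "probe2")   # extend to probe1-probe4 for the real data

# For each probe column, look up the position of each probe in mlpa;
# probes that are NA or not listed in mlpa give NA
pos <- lapply(patients[probe_cols],
              function(p) mlpa$probeposition[match(p, mlpa$probe)])
names(pos) <- sub("probe", "position", probe_cols)

patients[names(pos)] <- pos
patients
```

Unlike the merge-based approaches, this keeps the original row order of patients.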
