I'm struggling with imputation via the mice package to solve an NA problem in my data analysis. I'm using linear mixed models to calculate intraclass correlation coefficients (ICCs). In my final data frame there are several control variables (as columns) that I use as fixed effects in the model.
Some columns have missing values. I have no problem imputing the NAs with the following commands:
library(mice)
imputation_list <- mice(baseline_df,
                        method = "pmm",  # predictive mean matching (numeric data)
                        m = 5)
# complete() returns the first of the m imputed data sets by default
df_imputation_final <- complete(imputation_list)
But now my problem:
The IDs (persons, in rows) are grouped into families. So I have to impute the NAs such that all persons within one family get the same imputed value.
I need to impute values in the following data frame:
df_test <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
                      family = c("Gerrard", "Gerrard", "Gerrard", "Torres", "Torres", "Torres", "Keita", "Keita", "Keita", "Suarez", "Suarez", "Kuyt", "Kuyt", "Carragher", "Carragher", "Carragher", "Salah", "Salah", "Firmino", "Firmino"),
                      income_family = c(NA, NA, NA, 100, 100, 100, 90, 90, 90, 150, 150, 40, 40, NA, NA, NA, 200, 200, 99, 99))
So all members (IDs 1, 2, 3 and 14, 15, 16) of the families "Gerrard" and "Carragher" need imputation in the income_family variable, and the imputed value must be the same for all members of a family. The result should look like this:
df_final <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
                       family = c("Gerrard", "Gerrard", "Gerrard", "Torres", "Torres", "Torres", "Keita", "Keita", "Keita", "Suarez", "Suarez", "Kuyt", "Kuyt", "Carragher", "Carragher", "Carragher", "Salah", "Salah", "Firmino", "Firmino"),
                       income_family = c(55, 55, 55, 100, 100, 100, 90, 90, 90, 150, 150, 40, 40, 66, 66, 66, 200, 200, 99, 99))
I hope you know what I mean. Thx a lot !!
It's unclear what purpose the long ID variable serves if the values for income_family are the same for every observation of family. I believe the only way to achieve your desired result is to summarize your dataset before imputation.
library(dplyr)
library(mice)
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
                 family = c("Gerrard", "Gerrard", "Gerrard", "Torres", "Torres", "Torres", "Keita", "Keita", "Keita", "Suarez", "Suarez", "Kuyt", "Kuyt", "Carragher", "Carragher", "Carragher", "Salah", "Salah", "Firmino", "Firmino"),
                 income_family = c(NA, NA, NA, 100, 100, 100, 90, 90, 90, 150, 150, 40, 40, NA, NA, NA, 200, 200, 99, 99))
# Collapse to one row per family: the mean equals the shared family value,
# or NA when the family's income is missing
df2 <- df %>%
  group_by(family) %>%
  summarize(income_family = mean(income_family))
imputation_list <- mice(df2, m = 1, printFlag = FALSE)
df_imputation_final <- complete(imputation_list)
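To get back to one row per person, the family-level values can be looked up for each original row. This is a minimal sketch in base R, assuming the completed table has exactly one row per family. Here `family_level` is a hypothetical stand-in for the completed family-level data, and the values 55 and 66 are the placeholders from the question, not real imputations.

```r
# Person-level data as in the question
df <- data.frame(
  ID = 1:20,
  family = rep(c("Gerrard", "Torres", "Keita", "Suarez", "Kuyt",
                 "Carragher", "Salah", "Firmino"),
               times = c(3, 3, 3, 2, 2, 3, 2, 2)),
  income_family = c(NA, NA, NA, 100, 100, 100, 90, 90, 90, 150, 150,
                    40, 40, NA, NA, NA, 200, 200, 99, 99)
)

# Hypothetical stand-in for the completed family-level table;
# 55 and 66 are placeholder values, not actual imputations
family_level <- data.frame(
  family = c("Gerrard", "Torres", "Keita", "Suarez", "Kuyt",
             "Carragher", "Salah", "Firmino"),
  income_family = c(55, 100, 90, 150, 40, 66, 200, 99)
)

# Look up each person's family and copy the family-level value
idx <- match(df$family, family_level$family)
df_final <- transform(df, income_family = family_level$income_family[idx])
```

Because every person in a family reads the same row of `family_level`, the imputed value is identical within a family by construction.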
However, if you want to do proper modelling on multiply-imputed data, you will need to conduct your analyses on the mids object imputation_list, not the large dataframe df_imputation_final. If you're using lme4, see this post for details: Using imputed datasets from library mice() to fit a multi-level model in R
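# Longitudinal multiple imputation
# https://rmisstastic.netlify.app/tutorials/erler_course_multipleimputation_2018/erler_practical_miadvanced_2018
library(mice)
library(lme4)
# The 2l* methods expect an integer cluster variable, so recode family
df$family_id <- as.integer(factor(df$family))
imp <- mice(df, maxit = 0)  # dry run to get the default settings
meth <- imp$method
pred <- imp$predictorMatrix
meth["income_family"] <- "2lonly.pmm"
# In the predictor matrix, -2 marks the cluster (class) variable.
# With real data, set further covariates to 1 in this row.
pred["income_family", ] <- 0
pred["income_family", "family_id"] <- -2
imputation_list <- mice::mice(df,
                              m = 5, maxit = 10,
                              method = meth,
                              seed = 123,
                              predictorMatrix = pred,
                              printFlag = FALSE)
fit <- with(data = imputation_list,
            expr = lme4::lmer(income_family ~ (1 | family)))
pool(fit)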
I have a data frame which looks like this:
y = data.frame(subdel = c(1, 2, 3, 1, 57, 14, 1, 2, 57, 57, 57, 3, 1, 1,
31, 21, 34, 56, 12, 45, 1, 63, 31, 34), muni = c("A01", "A83", "A40", NA, NA, NA, NA, NA, NA, NA, NA, "A45", "B26", "B42","B61", "B70", "B90", "C53", "C89","A45", "B26", "B42","B61", "B70"))
I expect the following result:
z = data.frame(subdel = c(1, 2, 3, 57, 57, 57, 57, 3, 1, 1, 31, 21, 34, 56, 12, 45, 1, 63, 31, 34), muni = c("A01", "A83", "A40", NA, NA, NA, NA, "A45", "B26", "B42","B61", "B70", "B90", "C53", "C89", "A45", "B26", "B42","B61", "B70"))
I want to keep the rows where muni is NA only if subdel == 57, while keeping all other observations (those with a non-NA muni) in the data frame.
Any help would be appreciated.
We can use subset with a logical condition: keep the rows where 'muni' is NA (is.na(muni)) and (&) 'subdel' is 57 (subdel == 57), or (|) the rows where 'muni' is not NA (!is.na(muni)).
subset(y, is.na(muni) & subdel == 57 | !is.na(muni))
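As a quick check of the condition above, here is a minimal sketch that builds the same logical vector with explicit parentheses and indexes the data frame directly. In R, `&` binds more tightly than `|`, so the parentheses only make the grouping visible. The names `keep` and `res` are introduced here for illustration.

```r
y <- data.frame(
  subdel = c(1, 2, 3, 1, 57, 14, 1, 2, 57, 57, 57, 3, 1, 1,
             31, 21, 34, 56, 12, 45, 1, 63, 31, 34),
  muni = c("A01", "A83", "A40", NA, NA, NA, NA, NA, NA, NA, NA,
           "A45", "B26", "B42", "B61", "B70", "B90", "C53", "C89",
           "A45", "B26", "B42", "B61", "B70")
)

# Keep NA rows only when subdel is 57; keep every row with a known muni
keep <- (is.na(y$muni) & y$subdel == 57) | !is.na(y$muni)
res <- y[keep, ]

nrow(res)  # 20 rows remain out of 24
```

Four of the eight rows with a missing muni have subdel 57, so four rows are dropped.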
I want to find the outlier variable in a dataset both in terms of length and value.
In other words, I want to find a variable that is unlike other variables by comparing between variables.
To illustrate, below is a dummy set:
outliers <- data.frame(A = c(32, 31, 32, 38, 23, NA, NA, NA, NA, NA),
B = c(33, 39, 28, 34, 32, NA, NA, NA, NA, NA),
C = c(28, 39, 41, 31, 29, NA, NA, NA, NA, NA),
D = c(8, 9, 19, 28, 31, 23, 18, 13, 93, 2))
Clearly, the variable D is the outlier both in terms of length of observations and its values (i.e. mean).
I want to find a way to locate outlier variables like D in my actual dataset and put them into a list for further inspection.
The difficulty I have in doing this with my actual dataset is that it's very large (there are many lists that contain data frames with hundreds of columns and thousands of rows).
Thank you!
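One possible starting point, offered as a sketch rather than a definitive method: summarise each column by its number of non-missing observations and its mean, then flag columns whose summary lies far from the median across columns. The robust z-score helper `robust_z` and the cut-off of 3.5 are assumptions made for this illustration, not something taken from the question.

```r
outliers <- data.frame(
  A = c(32, 31, 32, 38, 23, NA, NA, NA, NA, NA),
  B = c(33, 39, 28, 34, 32, NA, NA, NA, NA, NA),
  C = c(28, 39, 41, 31, 29, NA, NA, NA, NA, NA),
  D = c(8, 9, 19, 28, 31, 23, 18, 13, 93, 2)
)

# Robust z-score: distance from the median in units of the MAD.
# If the MAD is 0 (most values identical), anything off the median is flagged.
robust_z <- function(x) {
  s <- mad(x)
  if (s == 0) return(ifelse(x == median(x), 0, Inf))
  (x - median(x)) / s
}

# One row of summary statistics per column
col_stats <- data.frame(
  variable = names(outliers),
  n_obs = colSums(!is.na(outliers)),
  mean = colMeans(outliers, na.rm = TRUE),
  stringsAsFactors = FALSE
)

threshold <- 3.5  # assumed cut-off; tune for real data
is_odd <- abs(robust_z(col_stats$n_obs)) > threshold |
  abs(robust_z(col_stats$mean)) > threshold
flagged <- col_stats$variable[is_odd]
flagged
```

For a list of data frames, the same steps could be wrapped in a function and applied with `lapply`, collecting the flagged column names per data frame for later inspection.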
This is a small example of my data set. It contains weekly data for 52 weeks. You can reproduce the data with the code below:
# CODE
#Data
library(tidyverse)
library(plotly)
ARTIFICIALDATA<-dput(structure(list(week = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50, 51, 52), `2019 Series_1` = c(534.771929824561,
350.385964912281, 644.736842105263, 366.561403508772, 455.649122807018,
533.614035087719, 829.964912280702, 466.035087719298, 304.421052631579,
549.473684210526, 649.719298245614, 537.964912280702, 484.982456140351,
785.929824561404, 576.736842105263, 685.508771929824, 514.842105263158,
464.491228070175, 608.245614035088, 756.701754385965, 431.859649122807,
524.315789473684, 739.40350877193, 604.736842105263, 669.684210526316,
570.491228070175, 641.649122807018, 649.298245614035, 664.210526315789,
530.385964912281, 754.315789473684, 646.80701754386, 764.070175438596,
421.333333333333, 470.842105263158, 774.245614035088, 752.842105263158,
575.368421052632, 538.315789473684, 735.578947368421, 522, 862.561403508772,
496.526315789474, 710.631578947368, 584.456140350877, 843.19298245614,
563.473684210526, 568.456140350877, 625.368421052632, 768.912280701754,
679.824561403509, 642.526315789474), `2020 Series_1` = c(294.350877192983,
239.824561403509, 709.614035087719, 569.824561403509, 489.438596491228,
561.964912280702, 808.456140350877, 545.157894736842, 589.649122807018,
500.877192982456, 584.421052631579, 524.771929824561, 367.438596491228,
275.228070175439, 166.736842105263, 58.2456140350878, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA)), row.names = c(NA, -52L), class = c("tbl_df", "tbl",
"data.frame")))
colnames(ARTIFICIALDATA) <- c('week', 'series1', 'series2')
The next step is to plot this data with the plotly package for R. I want a plot like the example below. Because this is weekly data, series1 has 52 observations while series2 has only 16 (series1 is mean data for 2019 and series2 is data for 2020). For that reason, the comparison must use only those 16 weeks (the rows without NA).
Can anybody help me plot this graph with plotly?
Try this:
colnames(ARTIFICIALDATA) <- c("week", "series1", "series2")
ARTIFICIALDATA %>%
  # Drop rows with NA
  drop_na() %>%
  # Convert to long format
  pivot_longer(-week, names_to = "series") %>%
  # Set the labels for the plot. If you want other labels, simply adjust them here
  mutate(label = case_when(
    series == "series1" ~ "2019 Series_1",
    series == "series2" ~ "2020 Series_1")) %>%
  # Compute the sum by series
  group_by(label) %>%
  summarise(sum = sum(value, na.rm = TRUE)) %>%
  ungroup() %>%
  # Plot
  plot_ly(x = ~label, y = ~sum) %>%
  add_bars() %>%
  # Remove the x-axis title, but you can label it as you like
  layout(xaxis = list(title = ""))
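The bar heights in this plot are the column sums over the weeks in which both series are observed. A minimal base R sketch of that calculation follows, using a small made-up data frame (`toy` and its values are illustrative, not taken from the question) so the result can be checked by hand.

```r
# Illustrative data: series2 stops after week 3
toy <- data.frame(
  week = 1:5,
  series1 = c(10, 20, 30, 40, 50),
  series2 = c(1, 2, 3, NA, NA)
)

# Keep only weeks where both series are present (what drop_na() does)
both_present <- complete.cases(toy)

# Sum each series over the shared weeks
sums <- colSums(toy[both_present, c("series1", "series2")])
sums
```

Only weeks 1 to 3 enter both sums, so series1 is compared with series2 over the same period.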
This question comes from a previous one I posted a while ago:
rollsum with fixed dates
I cannot get the given solution to work. I have a large data set; the columns of interest are:
id = c(145658, 145658, 145658, 145658, 145658, 145658, 145658, 145658, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
week_number = c(24, 35, 44, 71, 82, 117, 127, 142, 4, 15, 20, 24, 30, 36, 42, 46, 59, 67, 68, 71, 75, 78, 79, 86, 93, 96)
amount = c(51.9, 51.9, 51.9, 51.9, 51.9, 103.8, 51.9, 51.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 101.0, 168.9, 101.0, 101.0, 135.8, 168.9, 168.9, 67.9)
df = data.frame(id = id, week_number = week_number, amount = amount)
In reality, I have thousands of ids, and each has different week numbers. I want to calculate the rolling sum of the "amount" column over the past n weeks (including the current week) for each id.
An extreme example would be with the past 100 weeks. The results would look like:
past_100wk = c(NA, NA, NA, NA, NA, 363.3, 363.3, 363.3, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
Again, this is an extreme case, but it shows that the result should be NA (or -1) when the row value is not included in the week_number window (100 weeks, in this case).
Thank you!
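One way to read the requirement, sketched in base R under stated assumptions: for each row, sum the amounts of the same id whose week_number lies in the n weeks ending at the current week, and return NA when the current week_number is smaller than n (no full window yet). The helper name `roll_sum_weeks` and this NA rule are assumptions made for illustration.

```r
id <- c(rep(145658, 8), rep(3, 18))
week_number <- c(24, 35, 44, 71, 82, 117, 127, 142,
                 4, 15, 20, 24, 30, 36, 42, 46, 59, 67, 68, 71, 75, 78,
                 79, 86, 93, 96)
amount <- c(51.9, 51.9, 51.9, 51.9, 51.9, 103.8, 51.9, 51.9,
            67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9,
            101.0, 168.9, 101.0, 101.0, 135.8, 168.9, 168.9, 67.9)
df <- data.frame(id = id, week_number = week_number, amount = amount)

# Rolling sum over the n weeks ending at each row's week (inclusive).
# Assumption: rows with week < n have no full window and get NA.
roll_sum_weeks <- function(week, amount, n) {
  vapply(seq_along(week), function(i) {
    if (week[i] < n) return(NA_real_)
    in_window <- week > week[i] - n & week <= week[i]
    sum(amount[in_window])
  }, numeric(1))
}

# Apply per id and put the results back in the original row order
per_id <- lapply(split(df, df$id), function(d) {
  roll_sum_weeks(d$week_number, d$amount, n = 100)
})
df$past_100wk <- unsplit(per_id, df$id)
```

This compares every row with every other row of the same id, which is simple but quadratic per id. With long histories per id, a data.table non-equi join or a sliding-window package may be faster.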