How to find outlier variables in a very large dataset? - r

I want to find the outlier variable in a dataset both in terms of length and value.
In other words, I want to find a variable that is unlike other variables by comparing between variables.
To illustrate, below is a dummy set:
outliers <- data.frame(A = c(32, 31, 32, 38, 23, NA, NA, NA, NA, NA),
B = c(33, 39, 28, 34, 32, NA, NA, NA, NA, NA),
C = c(28, 39, 41, 31, 29, NA, NA, NA, NA, NA),
D = c(8, 9, 19, 28, 31, 23, 18, 13, 93, 2))
Clearly, the variable D is the outlier both in terms of length of observations and its values (i.e. mean).
I want to find a way to locate outlier variables like D in my actual dataset and put them into a list for further inspection.
The difficulty with my actual dataset is that it's very large (there are many lists that contain data frames with hundreds of columns and thousands of rows).
Thank you!
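One possible approach, as a minimal sketch: summarise every column by its number of non-missing observations and its mean, then flag columns whose summary lies far from the other columns' summaries using a robust z-score (median and MAD). The helper names robust_z and flag_outlier_vars and the cutoff of 3 are assumptions made for illustration, not anything from the question, and the sketch assumes all columns are numeric with at least one observed value.

```r
# Dummy data from the question
outliers <- data.frame(
  A = c(32, 31, 32, 38, 23, NA, NA, NA, NA, NA),
  B = c(33, 39, 28, 34, 32, NA, NA, NA, NA, NA),
  C = c(28, 39, 41, 31, 29, NA, NA, NA, NA, NA),
  D = c(8, 9, 19, 28, 31, 23, 18, 13, 93, 2)
)

# Hypothetical helper: robust z-score based on the median and the MAD
robust_z <- function(v) {
  centre <- median(v)
  spread <- mad(v)  # scaled median absolute deviation
  if (spread == 0) {
    # More than half of the values are identical:
    # treat any deviation from the median as extreme
    return(ifelse(v == centre, 0, Inf))
  }
  (v - centre) / spread
}

# Hypothetical helper: names of columns whose length or mean is unusual
flag_outlier_vars <- function(df, cutoff = 3) {
  n_obs <- colSums(!is.na(df))         # observations per column
  means <- colMeans(df, na.rm = TRUE)  # mean per column
  flagged <- abs(robust_z(n_obs)) > cutoff | abs(robust_z(means)) > cutoff
  names(df)[flagged]
}

flag_outlier_vars(outliers)
```

For a list of data frames (say a hypothetical my_list), lapply(my_list, flag_outlier_vars) would return one character vector of flagged column names per data frame. The cutoff and the choice of summaries (length and mean here) would need tuning for real data.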

Related

How to make scatter plot with connections and colors for paired data in ggplot2

I have paired data: the first column holds values for individuals with the disease and the second column holds values for individuals without it. Each row is one pair; for example, in the first pair the value with disease is 27 and the value without disease is 29. I would like to make a scatter plot with the individuals with disease on the x-axis and those without disease on the y-axis, show the two groups in two colors, and connect each pair with a line. I have tried the code below, but I am not sure how to continue. Any guidance is appreciated.
d <- structure(list(id_case = c(27, 17, 35, 18, 27, 40, 20, 25, 30, 20, 35, 26, 30, 31, 15, 11, 41),
id_control = c(29, 26, 39, 22, 24, 41, 29, 24, 25, 21, 29, 24, 26, 29, 15, 11, 35)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -17L))
library(ggplot2)
ggplot(d, aes(id_case, id_control)) +
  geom_point() +
  xlab("with disease") + ylab("without disease")
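One way to read the request, sketched under assumptions: put the two groups on the x-axis, the measured value on the y-axis, colour the points by group, and join the two points of each pair with a line. This is a different layout from the case-versus-control scatter attempted above, since a single point per pair leaves nothing to connect. The names long, pair, group and value are introduced here for illustration, and the plotting part only runs if ggplot2 is installed.

```r
# Data from the question, rebuilt as a plain data frame
d <- data.frame(
  id_case = c(27, 17, 35, 18, 27, 40, 20, 25, 30, 20, 35, 26, 30, 31, 15, 11, 41),
  id_control = c(29, 26, 39, 22, 24, 41, 29, 24, 25, 21, 29, 24, 26, 29, 15, 11, 35)
)

# One row per measurement; `pair` records which row of `d` it came from
long <- data.frame(
  pair = rep(seq_len(nrow(d)), times = 2),
  group = factor(rep(c("with disease", "without disease"), each = nrow(d))),
  value = c(d$id_case, d$id_control)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = group, y = value)) +
    geom_line(aes(group = pair), colour = "grey60") +  # one line per pair
    geom_point(aes(colour = group), size = 2) +        # two colours, one per group
    labs(x = NULL, y = "value", colour = NULL)
  print(p)
}
```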

Subset specific rows in a dataframe, but keeping the observations

I have a data frame that looks like this:
y = data.frame(subdel = c(1, 2, 3, 1, 57, 14, 1, 2, 57, 57, 57, 3, 1, 1,
31, 21, 34, 56, 12, 45, 1, 63, 31, 34), muni = c("A01", "A83", "A40", NA, NA, NA, NA, NA, NA, NA, NA, "A45", "B26", "B42","B61", "B70", "B90", "C53", "C89","A45", "B26", "B42","B61", "B70"))
I expect the following result:
z = data.frame(subdel = c(1, 2, 3, 57, 57, 57, 57, 3, 1, 1, 31, 21, 34, 56, 12, 45, 1, 63, 31, 34), muni = c("A01", "A83", "A40", NA, NA, NA, NA, "A45", "B26", "B42","B61", "B70", "B90", "C53", "C89", "A45", "B26", "B42","B61", "B70"))
Among the rows where muni is NA, I want to keep only those with subdel == 57, while, as you can see, keeping all the other observations in the data frame.
Any help would be appreciated.
We can use subset with a logical condition: keep the rows where 'muni' is NA (is.na(muni)) and (&) 'subdel' is 57 (subdel == 57), or (|) the rows where 'muni' is not NA (!is.na(muni)). Because & binds more tightly than |, no extra parentheses are needed.
subset(y, is.na(muni) & subdel == 57 | !is.na(muni))
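As a quick check, the result of this subset call can be compared against the expected data frame z from the question. The sketch below rebuilds both data frames and compares them column by column; the name res is introduced here for illustration.

```r
y <- data.frame(
  subdel = c(1, 2, 3, 1, 57, 14, 1, 2, 57, 57, 57, 3, 1, 1,
             31, 21, 34, 56, 12, 45, 1, 63, 31, 34),
  muni = c("A01", "A83", "A40", NA, NA, NA, NA, NA, NA, NA, NA, "A45",
           "B26", "B42", "B61", "B70", "B90", "C53", "C89", "A45",
           "B26", "B42", "B61", "B70")
)
z <- data.frame(
  subdel = c(1, 2, 3, 57, 57, 57, 57, 3, 1, 1, 31, 21, 34, 56, 12, 45,
             1, 63, 31, 34),
  muni = c("A01", "A83", "A40", NA, NA, NA, NA, "A45", "B26", "B42",
           "B61", "B70", "B90", "C53", "C89", "A45", "B26", "B42",
           "B61", "B70")
)

res <- subset(y, is.na(muni) & subdel == 57 | !is.na(muni))

# subset() keeps the original row names, so compare the columns directly
identical(res$subdel, z$subdel)
identical(as.character(res$muni), as.character(z$muni))
```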

Modify graph (with plotly)

This is a small example of my data set. It contains weekly data for 52 weeks. You can see the data with the code below:
# CODE
#Data
library(tidyverse)
library(plotly)
ARTIFICIALDATA<-dput(structure(list(week = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50, 51, 52), `2019 Series_1` = c(534.771929824561,
350.385964912281, 644.736842105263, 366.561403508772, 455.649122807018,
533.614035087719, 829.964912280702, 466.035087719298, 304.421052631579,
549.473684210526, 649.719298245614, 537.964912280702, 484.982456140351,
785.929824561404, 576.736842105263, 685.508771929824, 514.842105263158,
464.491228070175, 608.245614035088, 756.701754385965, 431.859649122807,
524.315789473684, 739.40350877193, 604.736842105263, 669.684210526316,
570.491228070175, 641.649122807018, 649.298245614035, 664.210526315789,
530.385964912281, 754.315789473684, 646.80701754386, 764.070175438596,
421.333333333333, 470.842105263158, 774.245614035088, 752.842105263158,
575.368421052632, 538.315789473684, 735.578947368421, 522, 862.561403508772,
496.526315789474, 710.631578947368, 584.456140350877, 843.19298245614,
563.473684210526, 568.456140350877, 625.368421052632, 768.912280701754,
679.824561403509, 642.526315789474), `2020 Series_1` = c(294.350877192983,
239.824561403509, 709.614035087719, 569.824561403509, 489.438596491228,
561.964912280702, 808.456140350877, 545.157894736842, 589.649122807018,
500.877192982456, 584.421052631579, 524.771929824561, 367.438596491228,
275.228070175439, 166.736842105263, 58.2456140350878, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA)), row.names = c(NA, -52L), class = c("tbl_df", "tbl",
"data.frame")))
colnames(ARTIFICIALDATA) <- c('week', 'series1', 'series2')
The next step is to plot this data with the plotly package in R. I want a plot like the example below. Because this is weekly data, series1 has 52 observations while series2 has only 16 (series1 is the mean data for 2019 and series2 is the data for 2020). For that reason, the comparison must use only the 16 weeks in which neither series is NA, as in the example below:
Can anybody help me plot this graph with plotly?
Try this:
colnames(ARTIFICIALDATA) <- c("week", "series1", "series2")
ARTIFICIALDATA %>%
  # Drop rows with NA
  drop_na() %>%
  # Convert to long format
  pivot_longer(-week, names_to = "series") %>%
  # Set the labels for the plot. If you want other labels, simply adjust them here
  mutate(label = case_when(
    series == "series1" ~ "2019 Series_1",
    series == "series2" ~ "2020 Series_1")) %>%
  # Compute the sum by series
  group_by(label) %>%
  summarise(sum = sum(value, na.rm = TRUE)) %>%
  ungroup() %>%
  # Plot
  plot_ly(x = ~label, y = ~sum) %>%
  add_bars() %>%
  # Remove the x-axis title. You can label it as you like
  layout(xaxis = list(title = ""))

Simple restrictions/constraint for multiple imputation (MICE) in R

I want to perform multiple imputation for a set of variables using the MICE package in R.
# Example data
data <- data.frame(
gcs = c(3, 10, NA, NA, NA, 15, 14, 15, 15, 14, 15, NA, 13, 15, 15),
hf = c(50, 66, 78, 99, NA, NA, 56, 55, NA, 76, 98, 105, NA, NA, 65),
...
)
The minimum for gcs is 3 and the maximum is 15, and it must be a whole number. How can I set these constraints in MICE? The same goes for hf, but it only has a lower limit of 0.
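One option that may fit, sketched under assumptions: mice() accepts a post argument, a named vector of R expressions (as strings) that are evaluated on the imputed values of each variable after every imputation step. The helper clamp_round below is hypothetical and only illustrates the clamp-and-round operation; the post strings spell out the same operation inline. Only the two columns shown in the question are used (the elided ones are left out), method = "norm" is chosen here because it can produce out-of-range values, and the mice part only runs if the package is installed.

```r
# Hypothetical helper (not part of mice): clamp to [lower, upper], then round
clamp_round <- function(x, lower = -Inf, upper = Inf, digits = 0) {
  round(pmin(pmax(x, lower), upper), digits)
}
clamp_round(c(1.7, 9.4, 16.2), lower = 3, upper = 15)

# The two columns shown in the question
dat <- data.frame(
  gcs = c(3, 10, NA, NA, NA, 15, 14, 15, 15, 14, 15, NA, 13, 15, 15),
  hf = c(50, 66, 78, 99, NA, NA, 56, 55, NA, 76, 98, 105, NA, NA, 65)
)

if (requireNamespace("mice", quietly = TRUE)) {
  # Dry run (maxit = 0) to obtain the default `post` vector
  ini <- mice::mice(dat, maxit = 0, printFlag = FALSE)
  post <- ini$post
  # Inside mice, imp[[j]][, i] holds the current imputations for variable j
  post["gcs"] <- "imp[[j]][, i] <- round(pmin(pmax(imp[[j]][, i], 3), 15))"
  post["hf"] <- "imp[[j]][, i] <- pmax(imp[[j]][, i], 0)"
  imp <- mice::mice(dat, method = "norm", post = post, m = 5,
                    seed = 1, printFlag = FALSE)
  range(mice::complete(imp)$gcs)
}
```

Another route worth considering is predictive mean matching (method = "pmm"), which imputes values drawn from the observed data and therefore stays within the observed range and on observed (whole-number) values without any post-processing.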

Rollsum with week numbers

This question comes from a previous one I posted a while ago:
rollsum with fixed dates
I cannot get the given solution to work. I have a large data set; the columns of interest are:
id = c(145658, 145658, 145658, 145658, 145658, 145658, 145658, 145658, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
week_number = c(24, 35, 44, 71, 82, 117, 127, 142, 4, 15, 20, 24, 30, 36, 42, 46, 59, 67, 68, 71, 75, 78, 79, 86, 93, 96)
amount = c(51.9, 51.9, 51.9, 51.9, 51.9, 103.8, 51.9, 51.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 101.0, 168.9, 101.0, 101.0, 135.8, 168.9, 168.9, 67.9)
df = data.frame(id = id, week_number = week_number, amount = amount)
In reality, I have thousands of ids, and each has different week numbers. I want to calculate the rolling sum of the "amount" column over the past n weeks (including the present week) for each id.
An extreme example would be with the past 100 weeks. The results would look like:
past_100wk = c(NA, NA, NA, NA, NA, 363.3, 363.3, 363.8, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
Again, this is an extreme case, but it shows that the result should be NA (or -1) when the row's week_number does not allow a full window (100 weeks, in this case).
Thank you!
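A minimal base R sketch of one reading of this, with assumptions stated: for each row, sum the amounts of the same id whose week_number lies in the last n weeks (current week included), and return NA when week_number is smaller than n, that is, when the window would start before week 1. That NA rule is inferred from the expected output above rather than stated in the question, and the helper name roll_sum_weeks is hypothetical.

```r
id <- c(145658, 145658, 145658, 145658, 145658, 145658, 145658, 145658,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
week_number <- c(24, 35, 44, 71, 82, 117, 127, 142, 4, 15, 20, 24, 30, 36,
                 42, 46, 59, 67, 68, 71, 75, 78, 79, 86, 93, 96)
amount <- c(51.9, 51.9, 51.9, 51.9, 51.9, 103.8, 51.9, 51.9, 67.9, 67.9,
            67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 67.9, 101.0, 168.9,
            101.0, 101.0, 135.8, 168.9, 168.9, 67.9)
df <- data.frame(id = id, week_number = week_number, amount = amount)

# Hypothetical helper: sum of `amount` over the last n weeks for a single id
roll_sum_weeks <- function(week, amount, n) {
  vapply(seq_along(week), function(i) {
    # Assumed rule: no full window is available before week n
    if (week[i] < n) return(NA_real_)
    in_window <- week > week[i] - n & week <= week[i]
    sum(amount[in_window])
  }, numeric(1))
}

# Apply per id, then put the results back in the original row order
by_id <- split(df, df$id)
sums <- lapply(by_id, function(g) roll_sum_weeks(g$week_number, g$amount, 100))
df$past_100wk <- unsplit(sums, df$id)
df
```

Under this reading, rows 6 to 8 all come out as 363.3 (the 363.8 in the expected output looks like a typo). The inner loop compares every row with every other row of the same id, which is fine for a sketch but may be slow for ids with very many rows.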
