R visualization for comparing multiple paired values

I have an Excel file with data for 2 databases and multiple measures (Msr) for each. Each measure has classic ratio data: Num/Denom = Rate. Can anybody suggest a visualization I can use in R to graphically find big differences (let's say 10%+) between the Test and X1 databases, for each measure?
So we compare Denom, Num, and Rate between lines 1 and 2, then lines 3 and 4, then lines 5 and 6, etc.
I tried to do this in Excel but read that R could be much better for this purpose. So far, though, most of the paired visualizations I can find use scatter-style displays. I need something more traditional; e.g. in my sample we could flag X1's SRB rate as low.
In my example I have 3 measures; in reality there could be 30. Thanks much for any info.
db <- c('test','x1','test','x1','test','x1')
msr <- c('BCS','BCS','CCS','CCS','SRB','SRB')
denom <- c(11848,11049,35836,38458,54160,56387)
num <- c(5255,6376,16908,18124,26253,15000)
rate <- c(44.35,57.71,47.18,47.13,48.47,26.6)
df <- data.frame(db,msr,denom,num,rate)
df
db msr denom num rate
1 test BCS 11848 5255 44.35
2 x1 BCS 11049 6376 57.71
3 test CCS 35836 16908 47.18
4 x1 CCS 38458 18124 47.13
5 test SRB 54160 26253 48.47
6 x1 SRB 56387 15000 26.60

If I understood correctly, this should do what you want. I reshaped the data so you have one row per msr with separate columns for each db. I used data.table for its performance.
library(data.table)
db <- c('test','x1','test','x1','test','x1')
msr <- c('BCS','BCS','CCS','CCS','SRB','SRB')
denom <- c(11848,11049,35836,38458,54160,56387)
num <- c(5255,6376,16908,18124,26253,15000)
rate <- c(44.35,57.71,47.18,47.13,48.47,26.6)
df <- data.frame(db,msr,denom,num,rate)
#set as a data.table
setDT(df)
#cast into one row per MSR - fill in with the "rate" variable
out <- dcast(df, msr ~ db, value.var = "rate")
#Compute difference
out[, test_x1_diff := test - x1]
#keep rows where the absolute difference is >= 10
out[abs(test_x1_diff) >= 10]
#> msr test x1 test_x1_diff
#> 1: BCS 44.35 57.71 -13.36
#> 2: SRB 48.47 26.60 21.87
Created on 2019-01-11 by the reprex package (v0.2.1)
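To address the visualization part of the question: a paired dot plot (one row per measure, one point per database) scales well to 30 measures and makes large gaps easy to spot. A minimal sketch with ggplot2 (not part of the original answer), reusing df from above:
library(ggplot2)
# one horizontal line per measure; the gap between the two points is the difference
ggplot(df, aes(x = rate, y = msr, colour = db)) +
  geom_line(aes(group = msr), colour = "grey60") +
  geom_point(size = 3) +
  labs(x = "Rate", y = "Measure", colour = "Database")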

Related

How to use mice for multiple imputation of missing values in longitudinal data?

I have a dataset with a repeatedly measured continuous outcome and some covariates of different classes, like in the example below.
Id y Date Soda Team
1 -0.4521 1999-02-07 Coke Eagles
1 0.2863 1999-04-15 Pepsi Raiders
2 0.7956 1999-07-07 Coke Raiders
2 -0.8248 1999-07-26 NA Raiders
3 0.8830 1999-05-29 Pepsi Eagles
4 0.1303 2005-03-04 NA Cowboys
5 0.1375 2013-11-02 Coke Cowboys
5 0.2851 2015-06-23 Coke Eagles
5 -0.3538 2015-07-29 Pepsi NA
6 0.3349 2002-10-11 NA NA
7 -0.1756 2005-01-11 Pepsi Eagles
7 0.5507 2007-10-16 Pepsi Cowboys
7 0.5132 2012-07-13 NA Cowboys
7 -0.5776 2017-11-25 Coke Cowboys
8 0.5486 2009-02-08 Coke Cowboys
I am trying to multiply impute the missing values in Soda and Team using the mice package. As I understand it, because MI is not a causal model, there is no concept of dependent and independent variables. I am not sure how to set up this MI process using mice. I would like some suggestions or advice from others who have encountered missing data in a repeated-measures setting like this, and how they used mice to tackle the problem. Thanks in advance.
Edit
This is what I have tried so far, but this does not capture the repeated measure part of the dataset.
library(mice)
init = mice(dat, maxit = 0)
methd = init$method
predM = init$predictorMatrix
methd[c("Soda")] = "logreg"
methd[c("Team")] = "logreg"
imputed = mice(dat, method = methd, predictorMatrix = predM, m = 5)
There are several options to accomplish what you are asking for. I have decided to impute missing values in covariates in the so-called 'wide' format. I will illustrate this with the following worked example, which you can easily apply to your own data.
Let's first make a reprex. Here, I use the longitudinal Mayo Clinic Primary Biliary Cirrhosis Data (pbc2), which comes with the JM package. This data is organized in the so-called 'long' format, meaning that each patient i has multiple rows and each row contains a measurement of variable x measured on time j. Your dataset is also in the long format. In this example, I assume that pbc2$serBilir is our outcome variable.
# install.packages('JM')
library(JM)
# note: use function(x) instead of \(x) if you use a version of R <4.1.0
# missing values per column
miss_abs <- \(x) sum(is.na(x))
miss_perc <- \(x) round(sum(is.na(x)) / length(x) * 100, 1L)
miss <- cbind('Number' = apply(pbc2, 2, miss_abs), '%' = apply(pbc2, 2, miss_perc))
# --------------------------------
> miss[which(miss[, 'Number'] > 0),]
Number %
ascites 60 3.1
hepatomegaly 61 3.1
spiders 58 3.0
serChol 821 42.2
alkaline 60 3.1
platelets 73 3.8
According to this output, 6 variables in pbc2 contain at least one missing value. Let's pick alkaline from these. We also need patient id and the time variable years.
# subset
pbc_long <- subset(pbc2, select = c('id', 'years', 'alkaline', 'serBilir'))
# sort ascending based on id and, within each id, years
pbc_long <- with(pbc_long, pbc_long[order(id, years), ])
# ------------------------------------------------------
> head(pbc_long, 5)
id years alkaline serBilir
1 1 1.09517 1718 14.5
2 1 1.09517 1612 21.3
3 2 14.15234 7395 1.1
4 2 14.15234 2107 0.8
5 2 14.15234 1711 1.0
Just by quickly eyeballing, we observe that years do not seem to differ within subjects, even though variables were repeatedly measured. For the sake of this example, let's add a little bit of time to all rows of years but the first measurement.
set.seed(1)
# add little bit of time to each row of 'years' but the first row
new_years <- lapply(split(pbc_long, pbc_long$id), \(x) {
  add_time <- 1:(length(x$years) - 1L) + rnorm(length(x$years) - 1L, sd = 0.25)
  c(x$years[1L], x$years[-1L] + add_time)
})
# replace the original 'years' variable
pbc_long$years <- unlist(new_years)
# integer time variable needed to store repeated measurements as separate columns
pbc_long$measurement_number <- unlist(sapply(split(pbc_long, pbc_long$id), \(x) 1:nrow(x)))
# only keep the first 4 repeated measurements per patient
pbc_long <- subset(pbc_long, measurement_number %in% 1:4)
Since we will perform our multiple imputation in wide format (meaning that each participant i has one row and repeated measurements on x are stored in j different columns, so xj columns in total), we have to convert the data from long to wide. Now that we have prepared our data, we can use reshape to do this for us.
# convert long format into wide format
v_names <- c('years', 'alkaline', 'serBilir')
pbc_wide <- reshape(pbc_long,
                    idvar = 'id',
                    timevar = "measurement_number",
                    v.names = v_names, direction = "wide")
# -----------------------------------------------------------------
> head(pbc_wide, 4)[, 1:9]
id years.1 alkaline.1 serBilir.1 years.2 alkaline.2 serBilir.2 years.3 alkaline.3
1 1 1.095170 1718 14.5 1.938557 1612 21.3 NA NA
3 2 14.152338 7395 1.1 15.198249 2107 0.8 15.943431 1711
12 3 2.770781 516 1.4 3.694434 353 1.1 5.148726 218
16 4 5.270507 6122 1.8 6.115197 1175 1.6 6.716832 1157
Now let's multiply impute the missing values in our covariates.
library(mice)
# Setup-run
ini <- mice(pbc_wide, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
visSeq <- ini$visitSequence
# avoid collinearity issues by letting only variables measured
# at the same point in time predict each other
pred[grep("1", rownames(pred), value = TRUE),
grep("2|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("2", rownames(pred), value = TRUE),
grep("1|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("3", rownames(pred), value = TRUE),
grep("1|2|4", colnames(pred), value = TRUE)] <- 0
pred[grep("4", rownames(pred), value = TRUE),
grep("1|2|3", colnames(pred), value = TRUE)] <- 0
# variables that should not be imputed
pred[c("id", grep('^year', names(pbc_wide), value = TRUE)), ] <- 0
# variables that should not serve as predictors
pred[, c("id", grep('^year', names(pbc_wide), value = TRUE))] <- 0
# multiply impute missing values ------------------------------
imp <- mice(pbc_wide, pred = pred, m = 10, maxit = 20, seed = 1)
# Time difference of 2.899244 secs
As can be seen in the traceplots (which can be obtained with plot(imp)), the algorithm has converged nicely. Refer to this section of Stef van Buuren's book for more info on convergence.
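For reference, the convergence check in code (plot.mids draws the traceplots referenced above):
# traceplots of the chain means and standard deviations of the imputed variables
plot(imp)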
Now we need to convert back the multiply imputed data (which is in wide format) to long format, so that we can use it for analyses. We also need to make sure that we exclude all rows that had missing values for our outcome variable serBilir, because we do not want to use imputed values of the outcome.
# need unlisted data
implong <- complete(imp, 'long', include = FALSE)
# 'smart' way of getting all the names of the repeated variables in a usable format
v_names <- as.data.frame(matrix(apply(
  expand.grid(grep('ye|alk|ser', names(implong), value = TRUE)),
  1, paste0, collapse = ''), nrow = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(v_names) <- names(pbc_long)[2:4]
# convert back to long format
longlist <- lapply(split(implong, implong$.imp),
                   reshape, direction = 'long',
                   varying = as.list(v_names),
                   v.names = names(v_names),
                   idvar = 'id', times = 1:4)
# logical that is TRUE if our outcome was not observed
# which should be based on the original, unimputed data
orig_data <- reshape(imp$data, direction = 'long',
                     varying = as.list(v_names),
                     v.names = names(v_names),
                     idvar = 'id', times = 1:4)
orig_data$logical <- is.na(orig_data$serBilir)
# merge into the list of imputed long-format datasets:
longlist <- lapply(longlist, merge, y = subset(orig_data, select = c(id, time, logical)))
# exclude rows for which logical == TRUE
longlist <- lapply(longlist, \(x) subset(x, !logical))
Finally, convert longlist back into a mids object using datalist2mids from the miceadds package.
imp <- miceadds::datalist2mids(longlist)
# ----------------
> imp$loggedEvents
NULL
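A possible next step, not part of the original answer: fit your substantive model on each imputed dataset with with() and pool the estimates with Rubin's rules. A sketch (the model below is purely illustrative):
# fit the same model on each imputed dataset, then pool
fit <- with(imp, lm(serBilir ~ alkaline + time))
summary(pool(fit))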

Obtaining Percentage for Date Observations

I am very new to R and am struggling with this concept. I have a data frame that looks like this:
[screenshot of the data frame from the original post]
I have used summary(FoodFacilityInspections$DateRecent) to get the number of observations for each date listed. I have 3932 observations, though, and wanted to get a summary of:
Dates with the most observations, and the percentage for each
The percentage of observations in each DateRecent category
I have tried:
count(FoodFacilityInspections$DateRecent)
Error in UseMethod("count"): no applicable method for 'count' applied to an object of class "factor"
Using built-in data, since you did not provide example data:
library(data.table)
dtcars <- data.table(mtcars, keep.rownames = TRUE)
Solution
dtcars[, .("count"=.N, "percent"=.N/dtcars[, .N]*100),
by=cyl]
You can use the table function to find out which date occurs the most. Then you can loop through each item in the table (date in your case) and divide it by the total number of rows like this (also using the mtcars dataset):
table(mtcars$cyl)
percent <- c()
for (i in 1:length(table(mtcars$cyl))) {
  percent[i] <- table(mtcars$cyl)[i] / nrow(mtcars) * 100
}
output <- cbind(table(mtcars$cyl), percent)
output
percent
4 11 34.375
6 7 21.875
8 14 43.750
A one-liner using table and proportions in within.
within(as.data.frame.table(with(mtcars, table(cyl))), Pc <- proportions(Freq)*100)
# cyl Freq Pc
# 1 4 11 34.375
# 2 6 7 21.875
# 3 8 14 43.750
An updated solution with total, percent and cumulative percent table based on your data.
library(dplyr)
data <- data.frame(
  "ScoreRecent" = c(100, 100, 100, 100, 100, 100, 100, 100, 100),
  "DateRecent" = c("7/23/2021", "7/8/2021", "5/25/2021", "5/19/2021", "5/20/2021", "5/13/2021", "5/17/2021", "5/18/2021", "5/18/2021"),
  "Facility_Type_Description" = c("Retail Food Stores", "Retail Food Stores", "Food Service Establishment", "Food Service Establishment", "Food Service Establishment", "Food Service Establishment", "Food Service Establishment", "Food Service Establishment", "Food Service Establishment"),
  "Premise_zip" = c(40207, 40207, 40207, 40206, 40207, 40206, 40207, 40206, 40206),
  "Opening_Date" = c("6/27/1988", "6/29/1988", "10/20/2009", "2/28/1989", "10/20/2009", "10/20/2009", "10/20/2009", "10/20/2009", "10/20/2009"))
tab <- function(dataset, var){
  dataset %>%
    group_by({{var}}) %>%
    summarise(n = n()) %>%
    mutate(total = cumsum(n),
           percent = n / sum(n) * 100,
           cumulativepercent = cumsum(n / sum(n) * 100))
}
tab(data, Facility_Type_Description)
Facility_Type_Description n total percent cumulativepercent
<chr> <int> <int> <dbl> <dbl>
1 Food Service Establishment 7 7 77.8 77.8
2 Retail Food Stores 2 9 22.2 100
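Applied to your own data, the same call on the date column should give the date percentages you asked about (assuming FoodFacilityInspections is your data frame):
tab(FoodFacilityInspections, DateRecent)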

Correct variable values in a dataframe applying a function using variable-specific values in another dataframe in R

I have a df called 'covs' with sites in rows and, in columns, 9 different environmental variables for each of these sites. I need to recalculate the value of each cell as (x - center_value(x)) / scale_value(x). However, the center and scale values are different for each environmental covariate, and they are located in another df called 'correction'.
I have found many solutions for applying a function to a whole df, but not for applying covariate-specific values according to the id of the value being transformed.
covs <- read.table(text = "X elev builtup river grip pa npp treecov
384879-2009 1 24.379101 25188.572 1241.8348 1431.1082 5.705152e+03 16536.664 60.23175
385822-2009 2 29.533478 32821.770 2748.9053 1361.7772 2.358533e+03 15773.115 62.38455
385823-2009 3 30.097059 28358.244 2525.7627 1073.8772 4.340906e+03 14899.451 46.03269
386765-2009 4 33.877861 40557.891 927.4295 1049.4838 4.580944e+03 15362.518 53.08151
386766-2009 5 38.605156 36182.801 1479.6178 1056.2130 2.517869e+03 13389.958 35.71379",
header= TRUE)
correction <- read.table(text = "var_name center_values scale_values
1 X 196.5 113.304898393671
2 elev 200.217889868483 307.718211316278
3 builtup 31624.4888660664 23553.2438790344
4 river 1390.41023742909 1549.88661649406
5 grip 5972.67361738244 6996.57793554527
6 pa 2731.33431010861 4504.71055521749
7 npp 10205.2997576655 2913.19658598938
8 treecov 47.9080656134352 17.7101565911347
9 nonveg 7.96755640452006 4.56625351682905", header= TRUE)
Could someone help me write code to recalculate the environmental covariate values in 'covs' using the covariate-specific values reported in 'correction'? E.g. from each value in the column 'elev' of the df 'covs', I need to subtract the 'center_values' entry reported for 'elev' in the 'correction' df, and then divide by the 'scale_values' entry for 'elev'. Thank you for your kind help.
You may assign var_name to row names, then loop over the names of covs to do the calculations in an sapply.
rownames(correction) <- correction$var_name
res <- as.data.frame(sapply(names(covs), function(x)
  (covs[, x] - correction[x, "center_values"]) / correction[x, "scale_values"]))
res
# X elev builtup river grip pa npp treecov
# 1 -1.725433 -0.5714280 -0.27324970 -0.09586213 -0.6491124 0.66015733 2.173339 0.6958541
# 2 -1.716607 -0.5546776 0.05083296 0.87651254 -0.6590217 -0.08275811 1.911239 0.8174114
# 3 -1.707781 -0.5528462 -0.13867495 0.73253905 -0.7001703 0.35730857 1.611340 -0.1058927
# 4 -1.698956 -0.5405596 0.37928543 -0.29871910 -0.7036568 0.41059457 1.770295 0.2921174
# 5 -1.690130 -0.5251972 0.19353224 0.05755748 -0.7026950 -0.04738713 1.093183 -0.6885470
Check e.g. "elev":
(covs[,"elev"] - correction["elev", "center_values"]) / correction["elev", "scale_values"]
# [1] -0.5714280 -0.5546776 -0.5528462 -0.5405596 -0.5251972
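As an aside, base scale() accepts per-column center and scale vectors, so an equivalent vectorized alternative (a sketch, reusing the row names set above) is:
# align correction rows to the column order of covs, then transform in one call
corr <- correction[names(covs), ]
res2 <- as.data.frame(scale(covs, center = corr$center_values,
                            scale = corr$scale_values))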

R dplyr - running total with row-wise calculations

I have a dataframe that keeps track of the activities associated with a bank account (example below).
The initial balance is $5,000 (type "initial"). If type is "in", that means a cash deposit; in this example each deposit is $1,000. If type is "out", that means a withdrawal from the account; in this example each withdrawal is 10% of the account balance.
data <- tibble(
activity=1:6,
type=c("initial","in","out","out","in","in"),
input=c(5000,1000,10,10,1000,1000))
Is there a dplyr solution to keep track of the balance after each activity? I have tried several ways, but I can't seem to find a way to efficiently calculate running totals and the withdrawal amounts (which depend on the running total).
For this example the output should be:
result <- tibble(
activity=1:6,
type=c("initial","in","out","out","in","in"),
input=c(5000,1000,10,10,1000,1000),
balance=c(5000,6000,5400,4860,5860,6860))
Thanks in advance for any suggestions or recommendations!
You can use purrr::accumulate2() to condition the calculation on the value of type:
library(dplyr)
library(purrr)
library(tidyr)
data %>%
  mutate(balance = accumulate2(input, type[-1],
                               .f = function(x, y, type)
                                 if (type == "out") x - x * y / 100 else x + y)) %>%
  unnest(balance)
# A tibble: 6 x 4
activity type input balance
<int> <chr> <dbl> <dbl>
1 1 initial 5000 5000
2 2 in 1000 6000
3 3 out 10 5400
4 4 out 10 4860
5 5 in 1000 5860
6 6 in 1000 6860
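If you prefer base R, Reduce() with accumulate = TRUE expresses the same recurrence (a sketch that reproduces the balances above):
# start from the initial balance, then apply each activity in order
balance <- Reduce(
  function(bal, i) if (data$type[i] == "out") bal * (1 - data$input[i] / 100) else bal + data$input[i],
  seq_len(nrow(data))[-1],
  init = data$input[1], accumulate = TRUE
)
data$balance <- balance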

How to run a Student's t-test in a loop in R?

I'm new to R, and what I want to do is something very simple, but I need help.
I have a database that looks like the one above, where Spot.number = the "name" of a protein, grupo = group (I or II), and APF = a fluorescence reading.
I want to run a Student's t-test for each protein, comparing groups I and II, in a loop.
In the database above there is only 1 protein (147), but in my real database I have 444 proteins.
Starting with some fake data:
set.seed(0)
Spot.number <- rep(147:149, each=10)
grupo <- rep(rep(1:2, each=5), 3)
APF <- rnorm(30)
gel <- data.frame(Spot.number, grupo, APF)
> head(gel)
Spot.number grupo APF
1 147 1 2.1780699
2 147 1 -0.2609347
3 147 1 -1.6125236
4 147 1 1.7863384
5 147 1 2.0325473
6 147 2 0.6261739
You can use lapply to loop through the subsets of gel, split by the Spot.number:
tests <- lapply(split(gel, gel$Spot.number), function(spot) t.test(APF ~ grupo, spot))
or just
tests <- by(gel, gel$Spot.number, function(spot) t.test(APF ~ grupo, spot))
You can then move on to e.g. taking only the p values:
sapply(tests, "[[", "p.value")
# 147 148 149
#0.2941609 0.9723856 0.5726007
or confidence interval
sapply(tests, "[[", "conf.int")
# 147 148 149
# [1,] -0.985218 -1.033815 -0.8748502
# [2,] 2.712395 1.066340 1.4240488
And the resulting vector or matrix will already have the Spot.number as names which can be very helpful.
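One caveat for the real dataset: with 444 proteins you are running 444 tests, so it is worth adjusting the p values for multiple comparisons, e.g. with Benjamini-Hochberg:
# adjusted per-protein p values, keeping the Spot.number names
p.adjust(sapply(tests, "[[", "p.value"), method = "BH")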
You can perform a t.test within each group using dplyr and my broom package. If your data is stored in a data frame called dat, you would do:
library(dplyr)
library(broom)
results <- dat %>%
  group_by(Spot.number) %>%
  do(tidy(t.test(APF ~ grupo, .)))
This works by performing t.test(APF ~ grupo, .) on each group defined by Spot.number. The tidy function from broom then turns it into a one-row data frame so that it can be recombined. The results data frame will then contain one row per protein (Spot.number) with columns including estimate, statistic, and p.value.
See this vignette for more on the combination of dplyr and broom.
