I am working with a dataset that comes with lme4, and am trying to learn how to apply reshape2 to convert it from long to wide [full code at the end of the post].
library(lme4)
data("VerbAgg") # load the dataset
The dataset has 9 variables; 'Anger', 'Gender', and 'id' don't vary with 'item', while 'resp',
'btype', 'situ', 'mode', and 'r2' do.
I have successfully been able to convert the dataset from long to wide format using reshape():
wide <- reshape(VerbAgg, timevar=c("item"),
idvar=c("id", 'Gender', 'Anger'), dir="wide")
This yields 316 observations on 123 variables and appears to be correctly transformed. However, I have had no success using reshape/reshape2 to reproduce the wide data frame.
wide2 <- recast(VerbAgg, id + Gender + Anger ~ item + variable)
Using Gender, item, resp, id, btype, situ, mode, r2 as id variables
Error: Casting formula contains variables not found in molten data: Anger
I may not be 100% clear on how recast defines id variables, but I am very confused why it does not see "Anger". Similarly,
wide3 <- recast(VerbAgg, id + Gender + Anger ~ item + variable,
id.var = c("id", "Gender", "Anger"))
Error: Casting formula contains variables not found in molten data: item
Can anyone see what I am doing wrong? I would love to obtain a better understanding of melt/cast!
Full code:
## load the lme4 package
library(lme4)
data("VerbAgg")
head(VerbAgg)
names(VerbAgg)
# Using base reshape()
wide <- reshape(VerbAgg, timevar=c("item"),
idvar=c("id", 'Gender', 'Anger'), dir="wide")
# Using recast
library(reshape2)
wide2 <- recast(VerbAgg, id + Gender + Anger ~ item + variable)
wide3 <- recast(VerbAgg, id + Gender + Anger ~ item + variable,
id.var = c("id", "Gender", "Anger"))
# Using melt/cast
m <- melt(VerbAgg, id=c("id", "Gender", "Anger"))
wide <- dcast(m, id + Gender + Anger ~ ...)
Aggregation requires fun.aggregate: length used as default
# Yields a list object with a length of 8?
m <- melt(VerbAgg, id=c("id", "Gender", "Anger"), measure.vars = c(4,6,7,8,9))
wide <- dcast(m, id ~ variable)
# Yields a data frame object with 6 variables.
I think the following code does what you want.
library(lme4)
data("VerbAgg")
# Using base reshape()
wide <- reshape(VerbAgg, timevar=c("item"),
idvar=c("id", 'Gender', 'Anger'), dir="wide")
dim(wide) # 316 123
# Using melt/cast
require(reshape2)
m1 <- melt(VerbAgg, id=c("id", "Gender", "Anger","item"), measure=c('resp','btype','situ','mode','r2'))
wide4 <- dcast(m1,id+Gender+Anger~item+variable)
dim(wide4) # 316 123
R> wide[1:5,1:6]
Anger Gender id resp.S1WantCurse btype.S1WantCurse situ.S1WantCurse
1 20 M 1 no curse other
2 11 M 2 no curse other
3 17 F 3 perhaps curse other
4 21 F 4 perhaps curse other
5 17 F 5 perhaps curse other
R> wide4[1:5,1:6]
id Gender Anger S1WantCurse_resp S1WantCurse_btype S1WantCurse_situ
1 1 M 20 no curse other
2 2 M 11 no curse other
3 3 F 17 perhaps curse other
4 4 F 21 perhaps curse other
5 5 F 17 perhaps curse other
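A possible explanation for the two recast() errors in the question, sketched on a made-up toy data frame (the `toy` object and its values are hypothetical, not part of VerbAgg): when no id variables are given, melt() treats factor and character columns as id variables and the remaining numeric columns as measured ones. An integer column such as Anger therefore ends up inside `variable`, and when `item` is left out of id.var it is melted into `variable` as well, so neither is available to the casting formula.

```r
library(reshape2)

# Hypothetical toy data: two subjects, two items, one integer covariate
toy <- data.frame(
  id    = factor(c(1, 1, 2, 2)),
  item  = factor(c("a", "b", "a", "b")),
  Anger = c(20L, 20L, 11L, 11L),
  resp  = factor(c("no", "yes", "yes", "no"))
)

# Default melt: only factor/character columns become id variables,
# so the integer column Anger is melted into 'variable'/'value'
m_default <- melt(toy)
names(m_default)

# Naming id and measure variables explicitly keeps Anger and item as columns
m_explicit <- melt(toy, id.vars = c("id", "Anger", "item"), measure.vars = "resp")
wide_toy <- dcast(m_explicit, id + Anger ~ item + variable)
wide_toy
```

Under these assumptions, the same reasoning suggests why listing "item" among the id variables, as in the melt() call above, makes the formula work.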
I have a dataset with a repeatedly measured continuous outcome and some covariates of different classes, like in the example below.
Id y Date Soda Team
1 -0.4521 1999-02-07 Coke Eagles
1 0.2863 1999-04-15 Pepsi Raiders
2 0.7956 1999-07-07 Coke Raiders
2 -0.8248 1999-07-26 NA Raiders
3 0.8830 1999-05-29 Pepsi Eagles
4 0.1303 2005-03-04 NA Cowboys
5 0.1375 2013-11-02 Coke Cowboys
5 0.2851 2015-06-23 Coke Eagles
5 -0.3538 2015-07-29 Pepsi NA
6 0.3349 2002-10-11 NA NA
7 -0.1756 2005-01-11 Pepsi Eagles
7 0.5507 2007-10-16 Pepsi Cowboys
7 0.5132 2012-07-13 NA Cowboys
7 -0.5776 2017-11-25 Coke Cowboys
8 0.5486 2009-02-08 Coke Cowboys
I am trying to multiply impute missing values in Soda and Team using the mice package. As I understand it, because MI is not a causal model, there is no concept of dependent and independent variables. I am not sure how to set up this MI process using mice. I would like some suggestions or advice from others who have encountered missing data in a repeated-measures setting like this, and how they used mice to tackle the problem. Thanks in advance.
Edit
This is what I have tried so far, but this does not capture the repeated measure part of the dataset.
library(mice)
init = mice(dat, maxit=0)
methd = init$method
predM = init$predictorMatrix
methd [c("Soda")]="logreg";
methd [c("Team")]="logreg";
imputed = mice(dat, method=methd , predictorMatrix=predM, m=5)
There are several options to accomplish what you are asking for. I have decided to impute missing values in covariates in the so-called 'wide' format. I will illustrate this with the following worked example, which you can easily apply to your own data.
Let's first make a reprex. Here, I use the longitudinal Mayo Clinic Primary Biliary Cirrhosis Data (pbc2), which comes with the JM package. This data is organized in the so-called 'long' format, meaning that each patient i has multiple rows and each row contains a measurement of variable x taken at time j. Your dataset is also in the long format. In this example, I assume that pbc2$serBilir is our outcome variable.
# install.packages('JM')
library(JM)
# note: use function(x) instead of \(x) if you use a version of R <4.1.0
# missing values per column
miss_abs <- \(x) sum(is.na(x))
miss_perc <- \(x) round(sum(is.na(x)) / length(x) * 100, 1L)
miss <- cbind('Number' = apply(pbc2, 2, miss_abs), '%' = apply(pbc2, 2, miss_perc))
# --------------------------------
> miss[which(miss[, 'Number'] > 0),]
Number %
ascites 60 3.1
hepatomegaly 61 3.1
spiders 58 3.0
serChol 821 42.2
alkaline 60 3.1
platelets 73 3.8
According to this output, 6 variables in pbc2 contain at least one missing value. Let's pick alkaline from these. We also need patient id and the time variable years.
# subset
pbc_long <- subset(pbc2, select = c('id', 'years', 'alkaline', 'serBilir'))
# sort ascending based on id and, within each id, years
pbc_long <- with(pbc_long, pbc_long[order(id, years), ])
# ------------------------------------------------------
> head(pbc_long, 5)
id years alkaline serBilir
1 1 1.09517 1718 14.5
2 1 1.09517 1612 21.3
3 2 14.15234 7395 1.1
4 2 14.15234 2107 0.8
5 2 14.15234 1711 1.0
Just by quickly eyeballing, we observe that years do not seem to differ within subjects, even though variables were repeatedly measured. For the sake of this example, let's add a little bit of time to all rows of years but the first measurement.
set.seed(1)
# add little bit of time to each row of 'years' but the first row
new_years <- lapply(split(pbc_long, pbc_long$id), \(x) {
add_time <- 1:(length(x$years) - 1L) + rnorm(length(x$years) - 1L, sd = 0.25)
c(x$years[1L], x$years[-1L] + add_time)
})
# replace the original 'years' variable
pbc_long$years <- unlist(new_years)
# integer time variable needed to store repeated measurements as separate columns
pbc_long$measurement_number <- unlist(sapply(split(pbc_long, pbc_long$id), \(x) 1:nrow(x)))
# only keep the first 4 repeated measurements per patient
pbc_long <- subset(pbc_long, measurement_number %in% 1:4)
Since we will perform our multiple imputation in wide format (meaning that each participant i has one row and the j repeated measurements of x are stored in j separate columns), we have to convert the data from long to wide. Now that we have prepared our data, we can use reshape to do this for us.
# convert long format into wide format
v_names <- c('years', 'alkaline', 'serBilir')
pbc_wide <- reshape(pbc_long,
idvar = 'id',
timevar = "measurement_number",
v.names = v_names, direction = "wide")
# -----------------------------------------------------------------
> head(pbc_wide, 4)[, 1:9]
id years.1 alkaline.1 serBilir.1 years.2 alkaline.2 serBilir.2 years.3 alkaline.3
1 1 1.095170 1718 14.5 1.938557 1612 21.3 NA NA
3 2 14.152338 7395 1.1 15.198249 2107 0.8 15.943431 1711
12 3 2.770781 516 1.4 3.694434 353 1.1 5.148726 218
16 4 5.270507 6122 1.8 6.115197 1175 1.6 6.716832 1157
Now let's multiply impute the missing values in our covariates.
library(mice)
# Setup-run
ini <- mice(pbc_wide, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
visSeq <- ini$visitSequence
# avoid collinearity issues by letting only variables measured
# at the same point in time predict each other
pred[grep("1", rownames(pred), value = TRUE),
grep("2|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("2", rownames(pred), value = TRUE),
grep("1|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("3", rownames(pred), value = TRUE),
grep("1|2|4", colnames(pred), value = TRUE)] <- 0
pred[grep("4", rownames(pred), value = TRUE),
grep("1|2|3", colnames(pred), value = TRUE)] <- 0
# variables that should not be imputed
pred[c("id", grep('^year', names(pbc_wide), value = TRUE)), ] <- 0
# variables that should not serve as predictors
pred[, c("id", grep('^year', names(pbc_wide), value = TRUE))] <- 0
# multiply impute missing values ------------------------------
imp <- mice(pbc_wide, pred = pred, m = 10, maxit = 20, seed = 1)
# Time difference of 2.899244 secs
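The four grep() blocks above could be expressed more compactly by comparing the measurement suffix of each variable name. The following is only a sketch in base R: it rebuilds the variable names by hand (an assumption, so that it runs without pbc_wide) and starts from an all-ones matrix with a zero diagonal instead of the matrix returned by mice.

```r
# Variable names as they appear in the wide data (rebuilt by hand for this sketch)
vars <- c("id", paste0(rep(c("years", "alkaline", "serBilir"), times = 4),
                       ".", rep(1:4, each = 3)))

# Start with every variable predicting every other variable
pred <- matrix(1, nrow = length(vars), ncol = length(vars),
               dimnames = list(vars, vars))
diag(pred) <- 0

# Measurement occasion = text after the last dot ("id" has no dot and stays "id")
wave <- sub(".*\\.", "", vars)

# Only variables measured at the same occasion may predict each other
pred[!outer(wave, wave, "==")] <- 0

# id and the years.* columns are neither imputed from others nor used as predictors
no_use <- c("id", grep("^years", vars, value = TRUE))
pred[no_use, ] <- 0
pred[, no_use] <- 0

pred["alkaline.1", ]
```

With these names, the only non-zero entries left link alkaline and serBilir measured at the same occasion.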
As can be seen in the three example traceplots below (which can be obtained with plot(imp)), the algorithm has converged nicely. Refer to this section of Stef van Buuren's book for more info on convergence.
Now we need to convert back the multiply imputed data (which is in wide format) to long format, so that we can use it for analyses. We also need to make sure that we exclude all rows that had missing values for our outcome variable serBilir, because we do not want to use imputed values of the outcome.
# need unlisted data
implong <- complete(imp, 'long', include = FALSE)
# 'smart' way of getting all the names of the repeated variables in a usable format
v_names <- as.data.frame(matrix(apply(
expand.grid(grep('ye|alk|ser', names(implong), value = TRUE)),
1, paste0, collapse = ''), nrow = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(v_names) <- names(pbc_long)[2:4]
# convert back to long format
longlist <- lapply(split(implong, implong$.imp),
reshape, direction = 'long',
varying = as.list(v_names),
v.names = names(v_names),
idvar = 'id', times = 1:4)
# logical that is TRUE if our outcome was not observed
# which should be based on the original, unimputed data
orig_data <- reshape(imp$data, direction = 'long',
varying = as.list(v_names),
v.names = names(v_names),
idvar = 'id', times = 1:4)
orig_data$logical <- is.na(orig_data$serBilir)
# merge into the list of imputed long-format datasets:
longlist <- lapply(longlist, merge, y = subset(orig_data, select = c(id, time, logical)))
# exclude rows for which logical == TRUE
longlist <- lapply(longlist, \(x) subset(x, !logical))
Finally, convert longlist back into a mids using datalist2mids from the miceadds package.
imp <- miceadds::datalist2mids(longlist)
# ----------------
> imp$loggedEvents
NULL
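Once a mids object is available, a typical next step is to fit the analysis model on every completed dataset and pool the estimates. The sketch below only shows that general pattern: it uses the small nhanes data that ships with mice and a plain linear model, both of which are assumptions made for illustration. For repeated measures such as the pbc2 example, a mixed model would probably be the more appropriate analysis model.

```r
library(mice)

# Illustrative imputation on mice's built-in nhanes data (not the pbc2 data)
imp_demo <- mice(nhanes, m = 5, printFlag = FALSE, seed = 1)

# Fit the same model on each of the 5 completed datasets
fit <- with(imp_demo, lm(bmi ~ age))

# Combine the 5 sets of estimates with Rubin's rules
pooled <- pool(fit)
summary(pooled)
```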
I'm having more problems with the tidyr package in R. I am doing an experiment that involves splitting up the data frame into plot, plant, and leaf variables, and since I have a large data frame, I need to do this with code. I'm working in RStudio with the tidyr package.
I need to organize a data frame from this:
library(readr)
library(tidyr)
library(dplyr)
plot <- c("101","101","101","101","101","102","102","102","102","102")
plant <- c("1","2","3","4","5","1","2","3","4","5")
leaf_1 <- c("100","100","100","100","100","100","100","100","100","100")
leaf_2 <- c("90","90","90","90","90","90","90","90","90","90")
leaf_3 <- c("80","80","80","80","80","80","80","80","80","80")
plot <- as.data.frame(plot)
plant <- as.data.frame(plant)
leaf_1 <- as.data.frame(leaf_1)
leaf_2 <- as.data.frame(leaf_2)
leaf_3 <- as.data.frame(leaf_3)
data <- cbind(plot, plant, leaf_1, leaf_2, leaf_3)
View(data)
Into this:
plot <- c("101","101","101", "101","101","101","101","101","101","101","101","101","101","101","101")
plant <- c("1","1","1","2","2","2","3","3","3","4","4","4","5","5","5")
leaf_number <- c("1","2","3","1","2","3","1","2","3","1","2","3","1","2","3")
score <- c("100","90","80","100","90","80","100","90","80","100","90","80","100","90","80")
plot <- as.data.frame(plot)
plant <- as.data.frame(plant)
leaf_number <- as.data.frame(leaf_number)
score <- as.data.frame(score)
example <- cbind(plot, plant, leaf_number, score)
View(example)
Here is what I have already tried:
data1 <- gather(data, leaf_number, score, -plot)
But it just doesn't gather the data frame into what I need. Any help is greatly appreciated, thanks so much everybody!
data <- data.frame(
plot = c(101,101,101,101,101,102,102,102,102,102),
plant = c(1,2,3,4,5,1,2,3,4,5),
leaf_1 = c(100,100,100,100,100,100,100,100,100,100),
leaf_2 = c(90,90,90,90,90,90,90,90,90,90),
leaf_3 = c(80,80,80,80,80,80,80,80,80,80)
)
gather(data, leaf_number, score, -c(plot, plant))
# plot plant leaf_number score
#1 101 1 leaf_1 100
#2 101 2 leaf_1 100
#3 101 3 leaf_1 100
#4 101 4 leaf_1 100
#5 101 5 leaf_1 100
#6 102 1 leaf_1 100
#7 102 2 leaf_1 100
#etc.
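In current versions of tidyr, gather() is superseded by pivot_longer(). A minimal sketch of the same reshaping follows, assuming the numeric version of the data shown above; stripping the "leaf_" prefix via names_prefix is an addition here, made so that leaf_number holds just 1, 2, 3 as in the desired output.

```r
library(tidyr)

# Same layout as the example data: 2 plots x 5 plants, three leaf columns
data <- data.frame(
  plot   = rep(c(101, 102), each = 5),
  plant  = rep(1:5, times = 2),
  leaf_1 = 100,
  leaf_2 = 90,
  leaf_3 = 80
)

# One row per plot/plant/leaf; the "leaf_" prefix is dropped from the names
long <- pivot_longer(
  data,
  cols         = starts_with("leaf_"),
  names_to     = "leaf_number",
  names_prefix = "leaf_",
  values_to    = "score"
)
head(long)
```

pivot_longer() writes the rows plant by plant, so the three leaves of one plant end up next to each other, which is the ordering asked for in the question.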
My problem is really simple: I have a dataframe with 3 columns
> head(subset_only_aster)
compound contrast sign_level
2 10 + 11 + 12 + 13 + 14-MeC30 Precocene.undeveloped - Acetone.undeveloped *
7 10 + 11 + 12 + 13 + 14-MeC30 Precocene.developed - Acetone.undeveloped **
From this I want to make a data frame where 'compound' provides the row names (there are 65 compounds altogether), the levels of 'contrast' (a variable with 6 levels) become the columns (6 columns), and 'sign_level' fills the cells of the data frame.
I don't know where to begin and can't find the answer on the web either. Can anybody help?
Here is a base R solution:
dat <- expand.grid(compounds=letters[1:3], contrast=LETTERS[5:10])
dat[, "sgn"] <- sample(c("*", "**", "***"), nrow(dat), replace=TRUE)
reshape(dat, direction="wide", idvar="compounds", timevar="contrast")
You can use the spread-function in tidyr:
DF<-data.frame(compound=rep(LETTERS[1:2],2),
contrast=c(rep(letters[1],2),rep(letters[2],2)),
signlevel=1:4)
library(tidyr)
DF2<-tidyr::spread(DF,contrast,signlevel)
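Neither answer moves the compound names into the row names, which the question asked for. One way to do that after a base reshape() is sketched below; the data frame and its values are invented for illustration, with column names following the question.

```r
# Hypothetical long data: 3 compounds x 2 contrasts
dat <- expand.grid(compound = c("c1", "c2", "c3"),
                   contrast = c("A-B", "A-C"),
                   stringsAsFactors = FALSE)
dat$sign_level <- c("*", "**", "ns", "***", "*", "**")

# Long to wide: one row per compound, one column per contrast
wide <- reshape(dat, direction = "wide", idvar = "compound", timevar = "contrast")

# Move the compound column into the row names and tidy the column names
rownames(wide) <- wide$compound
wide$compound <- NULL
names(wide) <- sub("^sign_level\\.", "", names(wide))
wide
```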
I'm new to R and what I want to do is very simple, but I need help.
I have a database that looks like the one above, where spot number = "name" of a protein, grupo = group I and II, and APF = fluorescence reading.
I want to run a Student's t-test for each protein, comparing groups I and II, but in a loop.
In the database above there is only 1 protein (147), but in my real database I have 444 proteins.
Starting with some fake data:
set.seed(0)
Spot.number <- rep(147:149, each=10)
grupo <- rep(rep(1:2, each=5), 3)
APF <- rnorm(30)
gel <- data.frame(Spot.number, grupo, APF)
> head(gel)
Spot.number grupo APF
1 147 1 2.1780699
2 147 1 -0.2609347
3 147 1 -1.6125236
4 147 1 1.7863384
5 147 1 2.0325473
6 147 2 0.6261739
You can use lapply to loop through the subsets of gel, split by the Spot.number:
tests <- lapply(split(gel, gel$Spot.number), function(spot) t.test(APF ~ grupo, spot))
or just
tests <- by(gel, gel$Spot.number, function(spot) t.test(APF ~ grupo, spot))
You can then move on to e.g. taking only the p values:
sapply(tests, "[[", "p.value")
# 147 148 149
#0.2941609 0.9723856 0.5726007
or confidence interval
sapply(tests, "[[", "conf.int")
# 147 148 149
# [1,] -0.985218 -1.033815 -0.8748502
# [2,] 2.712395 1.066340 1.4240488
And the resulting vector or matrix will already have the Spot.number as names which can be very helpful.
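With several hundred proteins it may be handier to collect the pieces of interest into one data frame. A small sketch along those lines follows, rebuilding the fake data from above; the name `summary_table` is made up for this example.

```r
set.seed(0)
gel <- data.frame(
  Spot.number = rep(147:149, each = 10),
  grupo       = rep(rep(1:2, each = 5), 3),
  APF         = rnorm(30)
)

# One t-test per protein
tests <- lapply(split(gel, gel$Spot.number),
                function(spot) t.test(APF ~ grupo, data = spot))

# One row per protein with the t statistic and the p value
summary_table <- data.frame(
  Spot.number = names(tests),
  statistic   = sapply(tests, function(tt) unname(tt$statistic)),
  p.value     = sapply(tests, "[[", "p.value"),
  row.names   = NULL
)
summary_table
```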
You can perform a t.test within each group using dplyr and my broom package. If your data is stored in a data frame called dat, you would do:
library(dplyr)
library(broom)
results <- dat %>%
group_by(Spot.number) %>%
do(tidy(t.test(APF ~ grupo, .)))
This works by performing t.test(APF ~ grupo, .) on each group defined by Spot.number. The tidy function from broom then turns it into a one-row data frame so that it can be recombined. The results data frame will then contain one row per protein (Spot.number) with columns including estimate, statistic, and p.value.
See this vignette for more on the combination of dplyr and broom.
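In more recent dplyr versions, do() is superseded, and group_modify() can play the same role. The following is a sketch of that variant on simulated data shaped like the question's (the `gel` data frame is rebuilt here as an assumption so the example is self-contained).

```r
library(dplyr)
library(broom)

set.seed(0)
gel <- data.frame(
  Spot.number = rep(147:149, each = 10),
  grupo       = rep(rep(1:2, each = 5), 3),
  APF         = rnorm(30)
)

# .x is the subset of rows belonging to one Spot.number
results <- gel %>%
  group_by(Spot.number) %>%
  group_modify(~ tidy(t.test(APF ~ grupo, data = .x))) %>%
  ungroup()
results
```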
I have data that looks like this:
sample start end gene coverage
X 1 10 A 5
X 11 20 A 10
Y 1 10 A 5
Y 11 20 A 10
X 1 10 B 5
X 11 20 B 10
Y 1 10 B 5
Y 11 20 B 10
I added additional columns:
data$length <- (data$end - data$start + 1)
data$ct_lt <- (data$length * data$coverage)
I reformatted my data using dcast:
casted <- dcast(data, gene ~ sample, value.var = "coverage", fun.aggregate = mean)
So my new data looks like this:
gene X Y
A 10.00000 10.00000
B 38.33333 38.33333
This is the data format I want, but I would like to aggregate differently. Instead of a plain mean, I would like to take a weighted average, with coverage weighted by length:
( sum (ct_lt) ) / ( sum ( length ) )
How do I go about doing this?
Disclosure: no R in front of me, but I think your friend here may be the dplyr and tidyr packages.
Certainly lots of ways to accomplish this, but I think the following might get you started
library(dplyr)
library(tidyr)
data %>%
select(gene, sample, ct_lt, length) %>%
group_by(gene, sample) %>%
summarise(weight_avg = sum(ct_lt) / sum(length)) %>%
spread(sample, weight_avg)
Hope this helps...
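If staying with reshape2 is preferred, one option is to cast the numerator and the denominator separately and divide the results. This is a sketch only: it rebuilds the small example table from the question, and `num`, `den`, and `weighted` are names made up for this example.

```r
library(reshape2)

# The example rows from the question
data <- data.frame(
  sample   = rep(c("X", "X", "Y", "Y"), times = 2),
  start    = rep(c(1, 11), times = 4),
  end      = rep(c(10, 20), times = 4),
  gene     = rep(c("A", "B"), each = 4),
  coverage = rep(c(5, 10), times = 4)
)
data$length <- data$end - data$start + 1
data$ct_lt  <- data$length * data$coverage

# sum(ct_lt) and sum(length) per gene and sample, each in wide format
num <- dcast(data, gene ~ sample, value.var = "ct_lt",  fun.aggregate = sum)
den <- dcast(data, gene ~ sample, value.var = "length", fun.aggregate = sum)

# Length-weighted mean coverage = sum(ct_lt) / sum(length)
weighted <- num
weighted[, -1] <- num[, -1] / den[, -1]
weighted
```

Both casts use the same formula, so their rows and columns line up and the element-wise division is safe.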