How to go from summarized data to raw data - r

I first want to generate multinomially distributed data using R, and then I want the data in its "raw" form. For example, say that I have generated data by
set.seed(1)
df <- as.data.frame(cbind(rmultinom(1, 13, c(0.1, 0.3, 0.4, 0.2)), seq(from = 0, to = 3, by = 1)))
I get
V1 V2
3 0
2 1
4 2
4 3
I then want the data in a vector at the individual level, so that it looks like
0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
Is there an easy way to do this? I'm new to this and it wasn't as easy as I thought it would be. I tried to create a function that looked something like
xcv <- vector(length = m)
asdf <- function(x, n){
for(i in 1:n){
xcv[j] <- seq(from = x[i,2], to = x[i,2], length.out = x[i,1])
}
return(xcv)
}
This did not work at all, so I hope to get some help.
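One way to get the individual-level vector, offered as a minimal sketch rather than the only approach, is base R's rep() with its times argument, which repeats each value in V2 as many times as the matching count in V1. The data frame below is typed in by hand to mirror the printed output above, so it does not depend on the random draw.

```r
# Summarized data: V1 holds the counts, V2 the category values
# (typed in by hand to match the output shown in the question)
df <- data.frame(V1 = c(3, 2, 4, 4), V2 = c(0, 1, 2, 3))

# Repeat each category value V1 times to recover the individual-level vector
raw <- rep(df$V2, times = df$V1)
print(raw)
```

Calling table(raw) afterwards returns the counts again, which is a quick check that nothing was lost.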

Related

Different return values of the sum of a row with imputed values using 'complete' (mice) and 'update' (survey)

I need to calculate the sum of some variables with imputed values. I did this with complete --> as.mids --> with --> do.call
I needed to do the same thing but in a survey context. Therefore, I did: update --> with --> MIcombine
The means of the variables calculated both ways do not match. Which one is correct?
You may check this different behavior in this toy database:
library(tidyverse)
library(mice)
library(mitools)
library(survey)
mydata <- structure(list(dis1 = c(NA, NA, 1, 0, 0, 1, 1, 1, 1, 0),
dis2 = c(0, 1, 0, 1, NA, 1, 1, 1, 1, 0),
dis3 = c(1, 1, 0, 0, NA, 1, 1, 1, 1, 0),
sex = c(0,0,0,1,0,1,1,1,1,0),
clus = c(1,1,1,1,1,2,2,2,2,2)),
row.names = c(NA, 10L),
class = c("tbl_df", "tbl", "data.frame") )
imp <- mice::mice(mydata, m = 5, seed = 237856)
# calculating numenf with mice::complete
long <- mice::complete(imp, action = "long", include = TRUE)
long$numenf <- long$dis1 + long$dis2 + long$dis3
imp2 <- mice::as.mids(long)
res <- with(imp2, mean(numenf))
do.call(mean, res$analyses) # mean = 2.1
#calculating numenf with update (from survey)
imp1 <- mice::complete(imp)
imp2 <- mice::complete(imp, 2)
imp3 <- mice::complete(imp, 3)
imp4 <- mice::complete(imp, 4)
imp5 <- mice::complete(imp, 5)
listimp <- mitools::imputationList(list(imp1, imp2, imp3, imp4, imp5))
clus <- survey::svydesign(id = ~clus, data = listimp)
clus <- stats::update(clus, numenf = dis1 + dis2 + dis3)
res <- with(clus, survey::svymean(~numenf))
summary(mitools::MIcombine(res)) # mean = 1.98
Answer
Replace do.call(mean, res$analyses) with mean(unlist(res$analyses)).
Rationale
In the first code snippet, res$analyses is a list. When you pass it to do.call, each list element becomes a separate argument, so you are essentially calling:
mean(res$analyses[[1]], res$analyses[[2]], res$analyses[[3]], res$analyses[[4]], res$analyses[[5]])
mean averages only the vector in its first argument; the remaining arguments are not treated as data (see ?mean). Hence, you're just getting 2.1 back, since that is the (mean of the) value of the first analysis.
We can make a vector out of the list by using unlist(res$analyses). Then, we can just feed it to mean as an argument:
mean(unlist(res$analyses))
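The difference can be seen without mice at all. The sketch below uses a made-up list of five numbers standing in for res$analyses (the values are illustrative, not the ones from the toy data) and compares the two calls.

```r
# Made-up per-imputation means standing in for res$analyses
analyses <- list(2.1, 1.9, 2.0, 2.2, 1.8)

# Same as mean(2.1, 1.9, 2.0, 2.2, 1.8): only the first value is treated as data,
# the others are matched to mean's remaining arguments
first_only <- do.call(mean, analyses)

# Flatten the list into one numeric vector, then average all five values
pooled <- mean(unlist(analyses))

print(c(first_only = first_only, pooled = pooled))
```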

Correlation between variables under the for loop

I have the issue shown below, which I tried to solve without success. I have a data frame df1 and need to build a table of correlations between its variables inside a for loop, because I do not want the code to be long and complicated.
df1 <- structure(list(a = c(1, 2, 3, 4, 5), b = c(3, 5, 7, 4, 3), c = c(3,
6, 8, 1, 2), d = c(5, 3, 1, 3, 5)), class = "data.frame", row.names =
c(NA, -5L))
I tried with the below code using 2 for loops
fv <- as.data.frame(combn(names(df1),2,paste, collapse="&"))
colnames(fv) <- "ColA"
fv$ColB <- sapply(strsplit(fv$ColA,"\\&"),'[',1)
fv$ColC <- sapply(strsplit(fv$ColA,"\\&"),'[',2)
asd <- list()
for (i in fv$ColB) {
for (j in fv$ColC) {
asd[i,j] <- as.data.frame(cor(df1[,i],df1[,j]))}}
May I know what I am doing wrong?
We can apply cor directly to the data.frame and convert the result to 'long' format with melt. Because the values in the lower triangle mirror those in the upper triangle, one of the two triangles (and the diagonal) can be set to NA before calling melt:
library(reshape2)
# correlation matrix of all columns of df1
out <- cor(df1)
out[lower.tri(out, diag = TRUE)] <- NA
melt(out, na.rm = TRUE)
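If adding a package is not desirable, roughly the same long table can be built in base R. This is a sketch under that assumption; the names cm, idx and pairs are only illustrative.

```r
df1 <- data.frame(a = c(1, 2, 3, 4, 5), b = c(3, 5, 7, 4, 3),
                  c = c(3, 6, 8, 1, 2), d = c(5, 3, 1, 3, 5))

# Full correlation matrix, computed once instead of pair by pair
cm <- cor(df1)

# Row/column positions of the upper triangle, i.e. each variable pair once
idx <- which(upper.tri(cm), arr.ind = TRUE)

# One row per pair: the two variable names and their correlation
pairs <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                    var2 = colnames(cm)[idx[, "col"]],
                    cor  = cm[idx])
print(pairs)
```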

Using foreach to create new observations and deleting erroneous observations in parallel

I am currently trying to clean a very large data set. I have working code to clean it, but it takes about three days to run without any parallelization, so I want to parallelize it. The original code works fine, but I can't figure out how to parallelize it in R using the doParallel and foreach packages or any other pre-built ones.
In particular, if I observe two data points that have the same time stamp, they should really be one data point. The non-parallelized code can accurately identify the points, flag them to be deleted later and create a new data point that is correct.
I've tried adapting the existing code by converting the for loops into foreach loops with the %do% operator from the foreach package. Doing this works fine. Changing %do% to %dopar% causes the code to stop working. I understand that this is the incorrect way to use %dopar%, but I don't know how to correctly accomplish my goal.
library(doParallel)
library(foreach)
df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
date = c(10, 1, 9, 4, 11),
var2 = c(2, 4, 6, 8, 10),
var3 = c(2, 4, 6, 8, 10),
ind = c(0, 0, 0, 0, 0)) #Indicator for problem observations
df2 <- data.frame(ID = c(1, 2, 3, 4, 5),
date = c(12, 10, 7, 5, 6),
var2 = c(2, 4, 6, 8, 10),
var3 = c(2, 4, 6, 8, 10),
ind = c(0, 0, 0, 0, 0))
foreach (row1 = 1:nrow(df1)) %dopar% {
for (row2 in 1:nrow(df2)) {
if(df1[row1, "date"] == df2[row2, "date"]) { #Observations that occur on the same date should be combined
df1[row1, "ind"] <- 1 #Tag problem observations to delete them later
df2[row2, "ind"] <- 1
temp_obs <- data.frame(ID = df2[row2, "ID"],
date = df1[row1, "date"],
var2 = df1[row1, "var2"],
var3 = df1[row1, "var3"] + df2[row2, "var3"],
ind = 0)
df1 <- rbind(df1, temp_obs)
rm(temp_obs)
}
}
}
The sample code demonstrates my problem in a simpler context. It loops through all observations in df1 and df2, and identifies observations with the same date. It should add a 6th observation to df1, and change the indicators from 0 to 1 in the 1st entry of df1 and the second entry of df2 to indicate that they have been matched. As is, this code does not change df1 or df2 at all. It works when %dopar% is replaced with %do%.
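With %dopar%, each worker operates on its own copy of df1 and df2, so assignments made inside the loop body never reach the data frames in the calling session; only the value returned by each iteration comes back. One possible way around this, sketched here with base R's merge() instead of a parallel loop (a swapped-in technique, not a foreach solution), is to find the matching dates in a single join and then update both data frames from that result. The names merged and new_obs are illustrative.

```r
df1 <- data.frame(ID = c(1, 2, 3, 4, 5), date = c(10, 1, 9, 4, 11),
                  var2 = c(2, 4, 6, 8, 10), var3 = c(2, 4, 6, 8, 10),
                  ind = c(0, 0, 0, 0, 0))
df2 <- data.frame(ID = c(1, 2, 3, 4, 5), date = c(12, 10, 7, 5, 6),
                  var2 = c(2, 4, 6, 8, 10), var3 = c(2, 4, 6, 8, 10),
                  ind = c(0, 0, 0, 0, 0))

# Inner join on date: one row per pair of observations sharing a date
merged <- merge(df1, df2, by = "date", suffixes = c(".1", ".2"))

# Tag the problem observations in both data frames
df1$ind[df1$date %in% merged$date] <- 1
df2$ind[df2$date %in% merged$date] <- 1

# Build the combined observations with the same rules as the loop body
new_obs <- data.frame(ID   = merged$ID.2,
                      date = merged$date,
                      var2 = merged$var2.1,
                      var3 = merged$var3.1 + merged$var3.2,
                      ind  = rep(0, nrow(merged)))
df1 <- rbind(df1, new_obs)
print(df1)
```

Because the matching is done in one vectorized step rather than row by row, this may already be fast enough that parallelization is unnecessary, though that depends on the real data.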

Looping for ggplot how to use the for loop variable i inside the loop

I have a data frame (its exact contents are not important).
I am using it to plot some point curves, like below:
#EXP <- 3 (example)
#EXP_VEC <- c(1:EXP)
for (i in 1:EXP)
{
gg2_plot[i] <- ggplot(subset(gg2,Ei == EXP_VEC[i] ),aes(x=hours, y=variable, fill = Mi)) + geom_point(aes(fill = Mi,color = Mi),size = 3)
}
As you can see, EXP_VEC = c(1, 2, 3, ...) depends on user input; for example, if the user inputs 2, then EXP_VEC = c(1, 2).
The data frame has Ei = 1, 2, 3, 4, ...
Now I have to do the plotting for all these Ei values depending on the user input.
Consider EXP = 3:
now the for loop should produce three plots, for Ei = 1, Ei = 2 and Ei = 3.
If the for loop I have written worked, that would be all I need.
But the for loop is not working. I can't use aes_string because the variable "i" is used outside aes().
For example, consider the following dataset:
dd<-data.frame(
Ei = c(1L, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
Mi = c(1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2),
hours = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
variable = c(0.1023488, 0.1254325, 0.1523245, 0.1225425, 0.1452354,
0.1853324, 0.1452369, 0.1241241, 0.0542232, 0.8542154, 0.021542,
0.2541254))
As you can see, I have two sets of Ei. I want to draw the first plot for Ei = 1 and then, beside it, a second plot for Ei = 2.
So I thought of saving the plots for Ei = 1 and Ei = 2 in two separate variables and then using them in some kind of cascade function which I have yet to find.
How do I do it?
Is there an easy way to do this by just using ggplot without any loop?
If not, how can I use the value of "i" inside my for loop?
I would do something like this:
plot_exp <-
function(i){
dat <- subset(gg2,Ei == i )
if (nrow(dat) > 0)
ggplot(dat,aes(x=hours, y=variable, fill = Mi)) +
geom_point(aes(color = Mi),size = 3)
}
ll <- lapply(seq_len(EXP), plot_exp)
ll is a list of ggplot objects.
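Two further possibilities, sketched against the dd example data from the question: facet_wrap() draws one panel per Ei value side by side without any loop, and if a loop is preferred, each plot has to be stored with [[i]] (list-element assignment) rather than [i]. Mapping color to factor(Mi) is an assumption made here so that the groups get discrete colors.

```r
library(ggplot2)

dd <- data.frame(
  Ei = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  Mi = c(1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2),
  hours = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
  variable = c(0.1023488, 0.1254325, 0.1523245, 0.1225425, 0.1452354,
               0.1853324, 0.1452369, 0.1241241, 0.0542232, 0.8542154,
               0.021542, 0.2541254))

# Option 1: no loop, one panel per Ei value placed side by side
p_facets <- ggplot(dd, aes(x = hours, y = variable, color = factor(Mi))) +
  geom_point(size = 3) +
  facet_wrap(~ Ei)

# Option 2: keep the loop, but store each plot as a list element with [[i]]
EXP <- 2
plots <- vector("list", EXP)
for (i in seq_len(EXP)) {
  plots[[i]] <- ggplot(subset(dd, Ei == i),
                       aes(x = hours, y = variable, color = factor(Mi))) +
    geom_point(size = 3)
}
```

Printing p_facets shows both panels in one figure; plots[[1]] and plots[[2]] can be printed or arranged individually.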

Rearrange data for ANOVA

I haven't quite got my head around R and how to rearrange data. I have an old SPSS data file that needs rearranging so I can conduct an ANOVA in R.
My current data file has this format:
ONE <- matrix(c(1, 2, 777.75, 609.30, 700.50, 623.45, 701.50, 629.95, 820.06, 651.95,"nofear","nofear"), nr=2,dimnames=list(c("1", "2"), c("SUBJECT","AAYY", "BBYY", "AAZZ", "BBZZ", "XX")))
And I need to rearrange it to this:
TWO <- matrix(c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 777.75, 701.5, 700.5, 820.06, 609.3, 629.95, 623.95, 651.95), nr=8, dimnames=list(c("1", "1", "1", "1", "2", "2", "2", "2"), c("SUBJECT","AA", "ZZ", "XX", "RT")))
I am sure that there is an easy way of doing it, rather than hand coding. Thanks for the consideration.
This should do it. You can tweak it a bit, but this is the idea:
library(reshape)
THREE <- melt(as.data.frame(ONE),id=c("SUBJECT","XX"))
THREE$AA <- grepl("AA",THREE$variable)
THREE$ZZ <- grepl("ZZ",THREE$variable)
THREE$variable <- NULL
# cleanup
THREE$XX <- as.factor(THREE$XX)
THREE$AA <- as.numeric(THREE$AA)
THREE$ZZ <- as.numeric(THREE$ZZ)
The reshape package and base reshape() both help with this kind of task, but in this simple case, where you have to generate the indicator variables anyway, hand coding is pretty easy: just take advantage of automatic recycling in R.
TWO <- data.frame(SUBJECT = rep(1:2,each = 4),
AA = rep(1:0, each = 2),
ZZ = 0:1,
XX = 1,
# take the columns in the order AAYY, AAZZ, BBYY, BBZZ so RT lines up with AA and ZZ
RT = as.numeric(t(ONE[, c("AAYY", "AAZZ", "BBYY", "BBZZ")])))
That gives the TWO you asked for, but it doesn't generalize easily to a larger ONE. I think this makes more sense:
n <- nrow(ONE)
TWO <- data.frame(SUBJECT = rep(ONE$SUBJECT, 4),
AB = rep(1:0, each = n),
YZ = rep(0:1, each = 2*n),
fear = ONE$XX,
RT = unlist(ONE[,2:5]))
The latter version gives more descriptive variable names and handles the likely case that your data is much bigger, with XX (fear) varying and more subjects. Also, given that you're reading it in from an SPSS data file, ONE is actually a data frame with numeric columns and character columns stored as factors. The reshaping itself is only this part of the code:
TWO <- data.frame(SUBJECT = rep(ONE$SUBJECT, 4),
fear = ONE$XX,
RT = unlist(ONE[,2:5]))
You could add in other variables afterward.
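To try the generalized version end to end, ONE has to be a data frame rather than the character matrix shown in the question. The sketch below rebuilds ONE that way by hand, as an assumption about what the SPSS import would look like, and then applies the same reshaping.

```r
# ONE as a data frame (assumed shape of the SPSS import), values from the question
ONE <- data.frame(SUBJECT = c(1, 2),
                  AAYY = c(777.75, 609.30), BBYY = c(700.50, 623.45),
                  AAZZ = c(701.50, 629.95), BBZZ = c(820.06, 651.95),
                  XX = c("nofear", "nofear"))

n <- nrow(ONE)
TWO <- data.frame(SUBJECT = rep(ONE$SUBJECT, 4),
                  AB = rep(1:0, each = n),       # 1 = AA, 0 = BB
                  YZ = rep(0:1, each = 2 * n),   # 0 = YY, 1 = ZZ
                  fear = ONE$XX,
                  RT = unlist(ONE[, 2:5]))       # stacks AAYY, BBYY, AAZZ, BBZZ

# Drop the row names inherited from unlist() and sort by subject
rownames(TWO) <- NULL
TWO <- TWO[order(TWO$SUBJECT), ]
print(TWO)
```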
