Rearrange data for ANOVA - r

I haven't quite got my head around R and how to rearrange data. I have an old SPSS data file that needs rearranging so I can conduct an ANOVA in R.
My current data file has this format:
ONE <- matrix(c(1, 2, 777.75, 609.30, 700.50, 623.45, 701.50, 629.95, 820.06, 651.95,"nofear","nofear"), nr=2,dimnames=list(c("1", "2"), c("SUBJECT","AAYY", "BBYY", "AAZZ", "BBZZ", "XX")))
And I need to rearrange it to this:
TWO <- matrix(c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 777.75, 701.5, 700.5, 820.06, 609.3, 629.95, 623.45, 651.95), nr=8, dimnames=list(c("1", "1", "1", "1", "2", "2", "2", "2"), c("SUBJECT","AA", "ZZ", "XX", "RT")))
I am sure that there is an easy way of doing it, rather than hand coding. Thanks for the consideration.

This should do it. You can tweak it a bit, but this is the idea:
library(reshape)
THREE <- melt(as.data.frame(ONE),id=c("SUBJECT","XX"))
THREE$AA <- grepl("AA",THREE$variable)
THREE$ZZ <- grepl("ZZ",THREE$variable)
THREE$variable <- NULL
# cleanup
THREE$XX <- as.factor(THREE$XX)
THREE$AA <- as.numeric(THREE$AA)
THREE$ZZ <- as.numeric(THREE$ZZ)

The reshape package and base R's reshape() both help with this kind of task, but in this simple case, where you have to generate the indicator variables anyway, hand coding is pretty easy: just take advantage of automatic recycling in R.
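TWO <- data.frame(SUBJECT = rep(1:2,each = 4),
AA = rep(1:0, each = 2),
ZZ = 0:1,
XX = 1,
# columns are picked in the order AAYY, AAZZ, BBYY, BBZZ so RT lines up with AA and ZZ
RT = as.numeric(t(ONE[, c("AAYY", "AAZZ", "BBYY", "BBZZ")])))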
That gives the TWO you asked for, but it doesn't generalize easily to a larger ONE. I think this makes more sense:
n <- nrow(ONE)
TWO <- data.frame(SUBJECT = rep(ONE$SUBJECT, 4),
AB = rep(1:0, each = n),
YZ = rep(0:1, each = 2*n),
fear = ONE$XX,
RT = unlist(ONE[,2:5]))
This latter version gives more descriptive variable names and handles the likely case that your data are actually much bigger, with XX (fear) varying and more subjects. Also, given that you're reading it in from an SPSS data file, ONE is actually a data frame with numeric columns and character columns stored as factors, not a character matrix. The reshaping itself was only this part of the code:
TWO <- data.frame(SUBJECT = rep(ONE$SUBJECT, 4),
fear = ONE$XX,
RT = unlist(ONE[,2:5]))
You could add in other variables afterward.
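As a quick check of the recycling logic in the generalized version, here is a minimal sketch. It assumes ONE is a data frame typed in by hand from the values in the question (an assumption for illustration; the real data would be read from the SPSS file).

```r
# ONE rebuilt by hand as a data frame (assumption: mirrors the question's values)
ONE <- data.frame(SUBJECT = 1:2,
                  AAYY = c(777.75, 609.30), BBYY = c(700.50, 623.45),
                  AAZZ = c(701.50, 629.95), BBZZ = c(820.06, 651.95),
                  XX = c("nofear", "nofear"))

n <- nrow(ONE)
TWO <- data.frame(SUBJECT = rep(ONE$SUBJECT, 4),
                  AB = rep(1:0, each = n),      # 1 = AA, 0 = BB; recycled across the four blocks
                  YZ = rep(0:1, each = 2 * n),  # 0 = YY, 1 = ZZ
                  fear = ONE$XX,                # recycled for every block
                  RT = unlist(ONE[, 2:5]))      # stacks AAYY, BBYY, AAZZ, BBZZ column by column

# Spot check: subject 1 in the BB/ZZ cell should carry the BBZZ value
TWO$RT[TWO$SUBJECT == 1 & TWO$AB == 0 & TWO$YZ == 1]
```

Because unlist() stacks the columns in their stored order, the AB and YZ patterns have to follow that same column order, which is why AB alternates in blocks of n and YZ in blocks of 2*n.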

Related

Different return values of the sum of a row with imputed values using 'complete' (mice) and 'update' (survey)

I need to calculate the sum of some variables with imputed values. I did this with complete --> as.mids --> with --> do.call
I needed to do the same thing but in a survey context. Therefore, I did: update --> with --> MIcombine
The means of the variables calculated both ways do not match. Which one is correct?
You can reproduce the discrepancy with this toy dataset:
library(tidyverse)
library(mice)
library(mitools)
library(survey)
mydata <- structure(list(dis1 = c(NA, NA, 1, 0, 0, 1, 1, 1, 1, 0),
dis2 = c(0, 1, 0, 1, NA, 1, 1, 1, 1, 0),
dis3 = c(1, 1, 0, 0, NA, 1, 1, 1, 1, 0),
sex = c(0,0,0,1,0,1,1,1,1,0),
clus = c(1,1,1,1,1,2,2,2,2,2)),
row.names = c(NA, 10L),
class = c("tbl_df", "tbl", "data.frame") )
imp <- mice::mice(mydata, m = 5, seed = 237856)
# calculating numenf with mice::complete
long <- mice::complete(imp, action = "long", include = TRUE)
long$numenf <- long$dis1 + long$dis2 + long$dis3
imp2 <- mice::as.mids(long)
res <- with(imp2, mean(numenf))
do.call(mean, res$analyses) # mean = 2.1
#calculating numenf with update (from survey)
imp1 <- mice::complete(imp)
imp2 <- mice::complete(imp, 2)
imp3 <- mice::complete(imp, 3)
imp4 <- mice::complete(imp, 4)
imp5 <- mice::complete(imp, 5)
listimp <- mitools::imputationList(list(imp1, imp2, imp3, imp4, imp5))
clus <- survey::svydesign(id = ~clus, data = listimp)
clus <- stats::update(clus, numenf = dis1 + dis2 + dis3)
res <- with(clus, survey::svymean(~numenf))
summary(mitools::MIcombine(res)) # mean = 1.98
Answer
Replace do.call(mean, res$analyses) with mean(unlist(res$analyses)).
Rationale
In the first code snippet, res$analyses is a list. Passing it to do.call is essentially equivalent to calling:
mean(res$analyses[[1]], res$analyses[[2]], res$analyses[[3]], res$analyses[[4]], res$analyses[[5]])
mean averages the vector given as its first argument. The remaining arguments are not treated as data; they are matched to mean's other parameters (see ?mean). Hence you just get 2.1 back, since that is the value of the first analysis.
We can make a vector out of the list by using unlist(res$analyses). Then, we can just feed it to mean as an argument:
mean(unlist(res$analyses))
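The difference can be seen on a small stand-alone list. This is only an illustrative sketch: the values in vals are made up and stand in for res$analyses.

```r
# Hypothetical per-imputation results, standing in for res$analyses
vals <- list(2.1, 1.9, 2.0, 2.2, 1.8)

# Equivalent to mean(2.1, 1.9, 2.0, 2.2, 1.8): only the first value is treated as data
wrong <- do.call(mean, vals)

# Flatten the list into one numeric vector first, then average it
right <- mean(unlist(vals))

wrong
right
```

Here wrong is just the first element, while right is the average of all five values.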

Correlation between variables under the for loop

I have an issue, shown below, that I tried to solve without success. I have a data frame df1 and need to build a table of correlations between its variables within a for loop, because I do not want the code to become long and complicated.
df1 <- structure(list(a = c(1, 2, 3, 4, 5), b = c(3, 5, 7, 4, 3), c = c(3,
6, 8, 1, 2), d = c(5, 3, 1, 3, 5)), class = "data.frame", row.names =
c(NA, -5L))
I tried with the below code using 2 for loops
fv <- as.data.frame(combn(names(df1),2,paste, collapse="&"))
colnames(fv) <- "ColA"
fv$ColB <- sapply(strsplit(fv$ColA,"\\&"),'[',1)
fv$ColC <- sapply(strsplit(fv$ColA,"\\&"),'[',2)
asd <- list()
for (i in fv$ColB) {
for (j in fv$ColC) {
asd[i,j] <- as.data.frame(cor(df1[,i],df1[,j]))}}
May I know what I am doing wrong?
We can apply cor directly to the data.frame and convert the result to 'long' format with melt. As the values in the lower triangular part mirror those in the upper triangular part, either part can be set to NA before melting.
library(reshape2)
out <- cor(df1) # full correlation matrix
out[lower.tri(out, diag = TRUE)] <- NA
melt(out, na.rm = TRUE)
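If an explicit pairwise approach is preferred, closer to the loop in the question, a base-R sketch without reshape2 could look like the following. The names pair_names and res are made up for this illustration.

```r
df1 <- data.frame(a = c(1, 2, 3, 4, 5), b = c(3, 5, 7, 4, 3),
                  c = c(3, 6, 8, 1, 2), d = c(5, 3, 1, 3, 5))

# Each column of pair_names holds one unordered pair of variable names
pair_names <- combn(names(df1), 2)

# One row per pair, with the correlation computed for that pair only
res <- data.frame(
  Var1 = pair_names[1, ],
  Var2 = pair_names[2, ],
  cor  = apply(pair_names, 2, function(p) cor(df1[[p[1]]], df1[[p[2]]]))
)
res
```

This visits each pair exactly once, whereas two nested loops over ColB and ColC would visit every combination of the two columns, including repeated and unwanted pairs.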

3D trajectory visualization with path in R

I'm looking for an efficient way to plot time, x, y, z with different colors for different objects - to view proximity of the objects over time.
plot3D::lines3D works with add = TRUE, but it is not very elegant. Here's sample code that works:
data$object_id <- factor(data$object_id)
library(plot3D)
for(tr in unique(data$object_id)) {
lines3D(data$x[data$object_id == tr], data$y[data$object_id == tr], data$z[data$object_id == tr], add = T, col = data$object_id[data$object_id == tr])
}
Example data:
data <- data.frame(object_id = c(1, 1, 2, 2), t = c(0, 1, 0, 1), x = c(0, 1, 1, 0), y = c(0, 1, 1, 0), altitude = c(0, 1, 1, 0))
Desired result: path traced by different objects at a given time along with an arrow that indicates the current direction of heading (determined by joining the last 2 known positions).
At time t = 0, this should yield nothing or should yield points. At t = 1, this should yield 2 lines (one over the other) of different colors: one color for each object.
The 2D equivalent is ggplot2::geom_path, which does all the heavy lifting through the group parameter, drawing a separate path for each level of the grouping variable.

Addressing list object in for loop

I have a data frame with a column that contains lm models. When I access this column for a specific row, e.g. [[2]], I get the summary output of that LM. That works perfectly, but since I have 959 rows in that column, I want to write a for loop in order to do an anova on these regressions. How do I specify that I want to address all the objects in that list in a for loop?
To make this concrete, here is an MWE:
Dataframe:
structure(list(Week = 7:17, Category = c("2", "2", "2", "2",
"2", "2", "2", "2", "2", "2", "2"), Brand = c("3", "3", "3",
"3", "3", "3", "3", "3", "3", "3", "3"), Display = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), Sales = c(0, 0, 0, 0, 13.440948, 40.097397,
32.01384, 382.169189, 2830.748779, 4524.460938, 1053.590576),
Price = c(0, 0, 0, 0, 5.949999, 5.95, 5.950003, 4.87759,
3.787015, 3.205987, 4.898724), Distribution = c(0, 0, 0,
0, 1.394019, 1.386989, 1.621416, 8.209759, 8.552915, 9.692097,
9.445554), Advertising = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), lnSales = c(11.4945151554497, 11.633214247508, 11.5862944141137,
11.5412559646132, 11.4811122484454, 11.4775106999991, 11.6333660772506,
11.4859819773102, 11.5232680456161, 11.5572670584292, 11.5303686934256
), IntrayearCycles = c(4.15446534315765, 3.62757053512638,
2.92387946552647, 2.14946414386239, 1.40455011205262, 0.768856938870769,
0.291497141953598, -0.0131078404184544, -0.162984144025091,
-0.200882782749248, -0.182877633924882), `Competitor Advertising` = c(10584.87063,
224846.3243, 90657.72553, 0, 0, 0, 2396.54212, 0, 0, 0, 40343.49444
), `Competitor Display` = c(0.385629, 2.108133, 2.515806,
4.918288, 3.81749, 3.035847, 2.463194, 3.242594, 1.850399,
1.751096, 1.337943), `Competitor Prices` = c(5.30989, 5.372752,
5.3717245, 5.3295525, 5.298393, 5.319466, 5.1958415, 5.2941095,
5.296757, 5.294059, 5.273578), ZeroSales = c(1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0)), .Names = c("Week", "Category", "Brand",
"Display", "Sales", "Price", "Distribution", "Advertising", "lnSales",
"IntrayearCycles", "Competitor Advertising", "Competitor Display",
"Competitor Prices", "ZeroSales"), row.names = 1255:1265, class = "data.frame")
Then I apply a function to estimate an Error Correction Model (with the ECM package), which produces linear model output. It is applied per group to estimate 959 separate regressions.
f <- function(.) {
xeq <- as.data.frame(select(., lnPrice, lnAdvertising, lnDisplay, IntrayearCycles, lnCompetitorPrices, lnCompADV, lnCompDISP, ADVxDISP, ADVxCYC, DISPxCYC, ADVxDISPxCYC))
xtr <- as.data.frame(select(., lnPrice, lnAdvertising, lnDisplay, IntrayearCycles, lnCompetitorPrices, lnCompADV, lnCompDISP, ADVxDISP, ADVxCYC, DISPxCYC, ADVxDISPxCYC))
print(xeq)
print(xtr)
summary(ecm(.$lnSales, xeq, xtr, includeIntercept = TRUE))
}
Models <- DatasetThesisSynergyClean %>%
group_by(Category, Brand) %>%
do(Model = f(.))
To see the summary of a specific model (here model 2), you can use:
Models$Model[[2]]
Consequently, I want to extract specific values from this summary output. But first I want to extract the Residuals Sum of Squares (RSS) to do an anova. I do this for one list object as follows:
anova_output_Unitmodels <- anova(Models$Model[[2]])
RSS_Unit <- anova_output_Unitmodels$`Sum Sq`[nrow(anova_output_Unitmodels)] #saving the RSS
Now I want to loop this across all the list objects, from object [[1]] to [[959]]. The RSS output has to be saved, and eventually I need to sum all these RSS values.
Furthermore, if this works, I need to extract all coefficients, t-values, and p-values of all variables from all models. For that I also need to address the specific objects in the list and put $coefficients behind them, but I was not able to manage this either.
Here is how I implemented @Roman Lustrik's answer.
extractRSS <- function(x) {
an <- anova(x)
RSS_Unit <- an$`Sum Sq`[nrow(an)]
return(RSS_Unit)
}
sapply(Model, FUN = extractRSS)
I also tried to do it for one specific model, but this gives me an error:
SapplyRSS <- sapply(Models$Model, FUN = extractRSS)
I had another idea and tried to write the for loop differently. It did not work out well, but it's a start.
For a single model you can do
RSS2<- sum(Models$Model[[2]]$residuals^2)
So I thought I could replicate this in a for loop:
for(i in residuals.lm){
AllRSS<- as.matrix(c(1:949))
AllRSS <- as.data.frame(AllRSS)
SumRSS <- sum(Models$Model[[i]]$residuals^2)
SumRSS <- as.data.frame(SumRSS)
TotalRSS <- cbind(SumRSS, AllRSS)}
TotalRSS <- SumRSS[NULL,]
The problem starts with how i is specified in the for statement; I do not know whether this is right. Eventually it leaves me with an empty data frame, or a data frame with the value of the same brand.
@MichaelChirico probably had something like this in mind.
extractRSS <- function(x) {
an <- anova(x)
RSS_Unit <- an$`Sum Sq`[nrow(an)]
return(RSS_Unit)
}
sapply(Models$Model, FUN = extractRSS)
sapply will traverse every Models$Model[[i]] object and extract RSS. You can modify this function to perhaps include other pieces of information. The result will probably be coerced to some simpler object. You can prevent this by sapply(..., simplify = FALSE).
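The same pattern can be tried on a small stand-in list of models. The sketch below assumes plain lm fits on the built-in mtcars data, split by cyl; it does not use the ECM models or the summary objects stored in Models$Model.

```r
# Stand-in for Models$Model: one lm fit per group (assumption: plain lm objects)
models <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ wt, data = d))

extractRSS <- function(x) {
  an <- anova(x)
  an$`Sum Sq`[nrow(an)]  # the last row of the anova table is the residual row
}

rss <- sapply(models, extractRSS)  # one RSS per model
total_rss <- sum(rss)              # summed across all models

# Coefficient tables (estimate, standard error, t value, p value), one per model
coefs <- lapply(models, function(m) coef(summary(m)))
coefs[[1]]
```

Keeping the fitted lm objects in the list, rather than their summaries, is a design choice in this sketch: anova() and summary() can then both be applied to every element.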
A different way of doing this is to export all the list elements as separate objects in the global environment. You do this through:
names(Models$Model) <- paste0("C", Models$Category, "B", Models$Brand)
list2env(Models$Model, .GlobalEnv)
Then I wrote a for loop that addresses these objects and appends each model's values to a data frame that starts out empty. This goes as follows:
TotalRSS2 <- data.frame(RSS = numeric(0), DF = numeric(0)) # empty result, created before the loop
for(X in c("0","1","3")){
ModelX <- get(paste0("C", X, "B2"))
RSS <- sum(ModelX$residuals^2) # residual sum of squares
DF <- ModelX$df[2] # residual degrees of freedom
TotalRSS2 <- rbind(TotalRSS2, data.frame(RSS = RSS, DF = DF))
}
TotalRSS2 must be created before the loop; otherwise the first rbind fails because the object does not exist yet.

How to go from summarized data to raw data

I first want to generate multinomially distributed data using R, and then I want the data in its "raw" form. As an example, say that I have generated data by
set.seed(1)
df <- as.data.frame(cbind(rmultinom(1, 13, c(0.1, 0.3, 0.4, 0.2)), seq(from = 0, to = 3, by = 1)))
I get
V1 V2
3 0
2 1
4 2
4 3
I then want the data in a vector at the individual level, so that it looks like
0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
Is there an easy way to do this? I'm new to this and it wasn't as easy as I thought it would be. I tried to create a function that looked something like
xcv <- vector(length = m)
asdf <- function(x, n){
for(i in 1:n){
xcv[j] <- seq(from = x[i,2], to = x[i,2], length.out = x[i,1])
}
return(xcv)
}
This did not work at all, so I hope to get some help.
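One possible approach, sketched here with the counts typed in by hand rather than drawn from rmultinom (an assumption, so the result does not depend on the random number generator), is to let rep() repeat each category value as many times as its count.

```r
# Counts (V1) and category values (V2) as shown in the question's output
df <- data.frame(V1 = c(3, 2, 4, 4), V2 = 0:3)

# Repeat each value in V2 as many times as the matching count in V1
raw <- rep(df$V2, times = df$V1)
raw
```

With these counts, raw has 13 elements, one per individual, and no explicit loop is needed.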

Resources