How to do the Mann-Whitney U test (wilcox.test) across several columns? - r

So I have data looking a little something like this:
Data:
Area Al    Cd  Cu
A    10000 0.2 30
A    15000 0.5 25
A    NA    NA  NA
B    8000  1.1 55
B    11000 0.2 40
B    13000 0.1 40
etc.
And I want to do a Mann Whitney U test between group A and B separately for each element/column.
I have managed to do this manually for each element individually according to this:
#Data is the above dataframe
Area_A <- subset(Data, Group %in% c("A"))
Area_B <- subset(Data, Group %in% c("B"))
WhitneyU_Al <- wilcox.test(Area_A$Al, Area_B$Al, na.rm = TRUE, paired = FALSE, exact = FALSE)
(I couldn't figure out how to do it based on the rows in the column "Areas" in one data frame, which is why I divided it into two subsets).
Now, I have a lot more columns than just these three (43 to be exact), and I was wondering if there was some way to do this across all columns without changing it manually each time?
I tried a few variations of this:
WhitneyU <- wilcox.test(Area_A, Area_B, na.rm = TRUE, paired = FALSE, exact = FALSE)
#OR
WhitneyU <- wilcox.test(Area_A[2:43], Area_B[2:43], na.rm = TRUE, paired = FALSE, exact = FALSE)
But they both return the error that "'x' must be numeric".
I suspect the answer isn't this easy and that I am barking up the wrong tree? Either that, or the question/answer is too obvious and I am just not seeing it.
When I tried looking up multiple tests most answers deal with how to do multiple tests if you have multiple "groups" (as in, they have area A, B, C and D). Sorry if this has been asked before and I didn't find it (or I didn't understand it). I did look.
Any help is appreciated.
Edit:
Upon request, using dput() on part of my data it looks a bit like this:
structure(list(Group = c("A", "A", "A", "A",
"A", "B", "B", "B", "B", "B", "B"
), Al = c(NA, NA, NA, 18100, 18400, 32500, 33200, 31200,
17400, 13900, 14400), As = c(NA, NA, NA, 16.9, 14.6, 8.83, 8.59,
8.42, 13.4, 13.5, 13.7), B = c(NA, NA, NA, 18, 16, 14, 14, 11,
53, 87, 58), Bi = c(NA, NA, NA, 0.13, 0.12, 0.57, 0.55, 0.52, 0.22,
0.18, 0.21), Ca = c(NA, NA, NA, 5950, 5480, 6220, 6230, 5950,
6850, 8170, 7000), Cd = c(NA, NA, NA, 0.2, 0.2, 0.2, 0.2, 0.18,
0.31, 0.36, 0.46)), row.names = c(1L, 2L, 3L, 4L, 5L, 40L, 41L,
42L, 43L, 44L, 45L), class = c("tbl_df", "tbl", "data.frame"))

wilcox.test requires the first input (x) to be a numeric vector, so passing whole data frames fails. In R, factors have an integer value assigned to them "under the hood" (i.e., A = 1, B = 2, ...), so you can convert the group variable in your data frame df to a factor and use it as the grouping term in a formula. This should perform the test on every other column:
df$Group <- as.factor(df$Group)
lapply(df[-1], function(x){
wilcox.test(x ~ df$Group)
})
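With 43 columns it can help to collect the p-values in one table. The following is a minimal sketch, assuming the data frame is called df and using only the Al and Cd columns from the dput() output in the question. The names tests, p_values and summary_table are illustrative, and the Holm adjustment via p.adjust is an optional extra that the answer above does not mention.

```r
# Two of the columns from the question's dput() output
df <- data.frame(
  Group = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  Al = c(NA, NA, NA, 18100, 18400, 32500, 33200, 31200, 17400, 13900, 14400),
  Cd = c(NA, NA, NA, 0.2, 0.2, 0.2, 0.2, 0.18, 0.31, 0.36, 0.46)
)
df$Group <- as.factor(df$Group)

# One test per column; exact = FALSE as in the question, since the data contain ties
tests <- lapply(df[-1], function(x) wilcox.test(x ~ df$Group, exact = FALSE))

# Pull the p-value out of each result and put everything in one table
p_values <- sapply(tests, function(t) t$p.value)
summary_table <- data.frame(
  element    = names(p_values),
  p_value    = unname(p_values),
  p_adjusted = unname(p.adjust(p_values, method = "holm"))  # optional correction
)
print(summary_table)
```

Rows with NA in a column are dropped by the formula interface for that column only, so the other columns are unaffected.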

Related

LMest: problem introducing covariates to the measurement model when fitting a Latent Markov Model to continuous data

I am working with longitudinal continuous data that reflect the linguistic abilities of children. I want to fit a Latent Transition Model, more precisely a Latent Markov Model, using the LMest package in R. As far as I have understood, this means creating both a measurement model and a latent model, and covariates (X) can be introduced into either of them. However, it fails when I try to add them to the measurement model. Can anyone tell me why?
##### SIMULATED DATA OF THE SAME NATURE
ID <- c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3)
time <- c(0,1,2,3,4,5,6,7,8,0,1,2,3,4,5,6,7,8,0,1,2,3,4,5,6,7,8)
gender <- c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1)
response_y <- c(NA, 0.15, 0.2, 0.4, 0.64, NA, 0.85, 0.89, NA, 0.02, NA, 0.01, 0.11, 0.35, 0.63, NA, NA, NA, NA, 0.3, NA, 0.56, 0.84, 0.81, 0.9, NA, NA)
response_y1 <- c(NA, 0.1, 0.3, 0.5, NA, NA, 0.7, 0.89, NA, NA, NA, 0.01, 0.11, 0.35, NA, NA, NA, NA, NA, 0.3, NA, 0.56, 0.84, NA, 0.9, 0.91, NA)
d <- data.frame(ID, time, gender, response_y, response_y1) # response_y1 is needed by the formulas below
I have so far tried to model it like this:
library(LMest)
## COVARIATES INTRODUCED TO THE MEASUREMENT MODEL (gives error)
lmestCont(responsesFormula = response_y + response_y1 ~ gender, latentFormula = NULL, index = c("ID", "time"), k = 1:5, data = dt$data, modBasic = 1, output = TRUE, tol = 10^-5, out_se = TRUE)
But keep getting this error:
Warning: multivariate data are not allowed; only the first response variable is considered
Steps of EM:
1...2...3...4...5...6...7...8...9...10...11...12...13...14...15...16...17...18...19...20...21...22...23...24...25...26...27...28...29...30...31...32...33...34...35...36...37...38...39...40...41...42...43...44...
Missing data in the dataset. imp.mix function (mix package) used for imputation.
Error in aicv[kv] <- out[[kv]]$aic : replacement has length zero
When introducing the covariates to the latent model it works, and looks like this:
## COVARIATES INTRODUCED TO THE LATENT MODEL (RUNS)
mod_con <- lmestCont(responsesFormula = response_y+ response_y1 ~ NULL, latentFormula = ~ gender | gender, index = c("ID", "time"), k = 1:5, data = dt$data, modBasic = 1, output = TRUE, tol = 10^-5, out_se = TRUE)
All kinds of advice are happily received, also on LMest in general. Maybe I have misunderstood something! Thanks.

Why isn't my mixed model loop working? (RStudio, Crossover design)

I can't figure out why my loop isn't working.
I have a database (36 rows x 51 columns, named "Seleccio") consisting of 3 factors (the first 3 columns: Animal (12 animals), Diet (3 diets) and Period (3 periods)) and 48 variables (many clinical parameters) with 36 observations per column. It is a 3x3 crossover design, so I want to fit a mixed model that includes Animal as a random effect, Period and Diet as fixed effects, and the interaction between them.
A sample of the data (but with fewer rows and columns):
Animal Diet Period Var1 Var2 Var3 Var4 Var5 Var6
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A A A 11 55 1.2 0.023 22 3
2 B A A 13 34 1.6 0.04 23 4
3 C B A 15 13 1.1 0.052 22 2
4 A B B 10 22 1.5 0.067 27 4
5 B C B 9 45 1.4 0.012 24 2
6 C C B 13 32 1.5 0.014 23 3
> dput(sample[1:9,])
structure(list(Animal = c("A", "B", "C", "A", "B", "C", NA, NA,
NA), Diet = c("A", "A", "B", "B", "C", "C", NA, NA, NA), Period = c("A",
"A", "A", "B", "B", "B", NA, NA, NA), Var1 = c(11, 13, 15, 10,
9, 13, NA, NA, NA), Var2 = c(55, 34, 13, 22, 45, 32, NA, NA,
NA), Var3 = c(1.2, 1.6, 1.1, 1.5, 1.4, 1.5, NA, NA, NA), Var4 = c(0.023,
0.04, 0.052, 0.067, 0.012, 0.014, NA, NA, NA), Var5 = c(22, 23,
22, 27, 24, 23, NA, NA, NA), Var6 = c(3, 4, 2, 4, 2, 3, NA, NA,
NA)), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"
))
I want to run a descriptive analysis (normality testing and checking for outliers) of each variable sorted by Diet (which is the treatment), and also fit a mixed model and run an ANOVA and a Tukey test for the fixed effects.
I can do the analysis one variable at a time, but that takes a lot of time. I have tried several times to write a for loop to automate the analysis for all the variables, but it didn't work (I'm pretty new to R).
What I got so far:
sink("output.txt") # to store the output in a file, as it would be too large to be shown in the console
vars <-as.data.frame(Seleccio[,c(4:51)])
fact <-Seleccio$Diet
dim(vars)
for (i in 1:length(vars)) {
variable <- vars[,i]
lme_cer <- lme(variable ~ Period*Diet, random = ~1 | Animal, data = Seleccio) # the model
cat("\n---------\n\n")
cat(colnames(Seleccio)[i]) # the name of each variable, so I don't get lost in the text file
cat("\n")
print(boxplot(vars[,i]~fact)$out) #checking for outliers
print(summary(lme_cer))
print(anova(lme_cer))
print(lsmeans(lme_cer, pairwise~Diet, adjust="tukey"))
}
sink()
This code runs but doesn't do the job: the results it gives for each variable differ from the results I get when I analyse each variable one by one.
I would also like to add to the loop this normality testing sorted by Diet (Treatment) code. I wonder if it would be possible.
aggregate(formula = VARIABLENAME ~ Diet,
data = Seleccio,
FUN = function(x) {y <- shapiro.test(x); c(y$statistic, y$p.value)})
Thank you very much in advance to all of those who are willing to lend me a hand, any help will be greatly appreciated
I don't think I can run the model with only 6 observations, so I couldn't work out why your loop doesn't return the same results as running the analysis one variable at a time. Maybe the problem is with cat(colnames(Seleccio)[i]): you only want the Var names, but for i = 1, 2 and 3 that code returns "Animal", "Diet" and "Period", so every block of output is labelled with the wrong name and the comparison gets mixed up. Using cat(colnames(vars)[i]) might correct that. If you can share more observations of Seleccio, I might be able to help more.
I would suggest creating a list to store the output:
vars <- as.data.frame(Seleccio[,c(4:51)])
fact <- Seleccio$Diet
dim(vars)
output = list() #Create empty list
for (i in 1:length(vars)) {
var = colnames(vars)[i]
output[[var]] = list() #Create one entry for each variable
variable <- vars[,i]
lme_cer <- lme(variable ~ Period*Diet, random = ~1 | Animal, data = Seleccio) # the model
#Fill that entry with each statistics:
output[[var]]$boxplot = boxplot(vars[,i]~fact)$out #checking for outliers
output[[var]]$summary = summary(lme_cer)
output[[var]]$anova = anova(lme_cer)
output[[var]]$lsmeans = lsmeans(lme_cer, pairwise~Diet, adjust="tukey")
output[[var]]$shapiro = aggregate(formula = variable ~ Diet, data = Seleccio,
FUN = function(x) {y <- shapiro.test(x); c(y$statistic, y$p.value)})
}
This way you have the results in your R environment and better options for viewing them: output$Var1 gives all the results for Var1, which should fit in the console; for(i in output){print(i$summary)} prints all of the summaries; etc.
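The per-diet normality check asked about in the question can be run for every variable by looping over column names rather than positions, which also avoids the mislabelling issue described above. This is a rough sketch on simulated data: the Seleccio data frame below is a made-up stand-in with only two variables, and the names var_names and normality are illustrative.

```r
set.seed(1)
# Simulated stand-in for Seleccio: 12 animals x 3 periods, diets rotated per period
Seleccio <- data.frame(
  Animal = rep(LETTERS[1:12], times = 3),
  Period = rep(c("P1", "P2", "P3"), each = 12),
  Diet   = c(rep(c("D1", "D2", "D3"), 4),
             rep(c("D2", "D3", "D1"), 4),
             rep(c("D3", "D1", "D2"), 4)),
  Var1   = rnorm(36, mean = 10),
  Var2   = rnorm(36, mean = 40, sd = 5)
)

# Every column that is not a design factor is treated as a response variable
var_names <- setdiff(names(Seleccio), c("Animal", "Period", "Diet"))

# For each variable: one Shapiro-Wilk test per diet, returned as a diet x (W, p) matrix
normality <- lapply(var_names, function(v) {
  t(sapply(split(Seleccio[[v]], Seleccio$Diet), function(x) {
    s <- shapiro.test(x)
    c(W = unname(s$statistic), p = s$p.value)
  }))
})
names(normality) <- var_names

print(normality$Var1)
```

Because the list is named by variable, normality$Var1 always refers to Var1, regardless of where that column sits in the data frame.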

Skip iteration and return NA in nested for loop in R

Given the data frame:
test <- structure(list(IDcount = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), year = c(1,
2, 3, 4, 5, 1, 2, 3, 4, 5), Otminus1 = c(-0.28, -0.28, -0.44,
-0.27, 0.23, -0.03, -0.06, -0.04, 0, 0.02), N.1 = c(NA, -0.1,
0.01, 0.1, -0.04, -0.04, -0.04, -0.04, -0.05, -0.05), N.2 = c(NA,
NA, -0.09, 0.11, 0.06, NA, -0.08, -0.08, -0.09, -0.09), N.3 = c(NA,
NA, NA, 0.01, 0.07, NA, NA, -0.12, -0.13, -0.13), N.4 = c(NA,
NA, NA, NA, -0.04, NA, NA, NA, -0.05, -0.05), N.5 = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, -0.13)), row.names = c(NA, -10L), groups = structure(list(
IDcount = c(1, 2), .rows = structure(list(1:5, 6:10), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = 1:2, class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
and a results data frame:
results <- structure(list(IDcount = c(1, 2), N.1 = c(NA, NA), N.2 = c(NA,
NA), N.3 = c(NA, NA), N.4 = c(NA, NA), N.5 = c(NA, NA)), row.names = c(NA,
-2L), class = "data.frame")
I would like to perform a nested for loop like the following:
index <- colnames(test) %>% str_which("N.")
betas <- matrix(nrow=length(unique(test$IDcount)), ncol=2)
colnames(betas) <- c("Intercept", "beta")
for (j in colnames(test)[index]) {
for (i in 1:2) {
betas[i,] <- coef(lm(Otminus1~., test[test$IDcount==i, c("Otminus1", j)]))
}
betas <- data.frame(betas)
results[[j]] <- betas$beta
}
The for loop is supposed to run the regression across each column and each ID and write the coefficients into the data frame "results".
This works, as long as each ID has one value in each column. Unfortunately, my data frame "test" is missing values in the column "N.5". The regression and loop can therefore not be performed since all values for this ID are NA.
I now would like to adapt my loop so that iterations are only performed if there is at least one non-NA value for a certain ID in the specific column.
Following the explanation in "R for loop skip to next iteration ifelse", I tried to implement the following:
for (j in colnames(test)[index]) {
for (i in 1:2) {
if(sum(is.na(test[which(test[,1]==i),.]))==length(unique(test$year))) next
betas[i,] <- coef(lm(Otminus1~., test[test$IDcount==i, c("Otminus1", j)]))
}
betas <- data.frame(betas)
results[[j]] <- betas$beta
}
But this doesn't work.
I would like to receive a data frame "results" looking something like this:
IDcount N.1 N.2 N.3 N.4 N.5
1 0.1 0.2 0.5 0.3 NA
2 -5.3 -0.8 -0.4 -0.1 -0.1
Any help would be greatly appreciated!!
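You can use colSums to perform a check. Resetting betas to NA for every column ensures that a skipped ID ends up as NA in results instead of keeping the coefficient from the previous column:
library(dplyr)   # provides %>%
library(stringr) # provides str_which()
index <- colnames(test) %>% str_which("^N\\.")
for (j in colnames(test)[index]) {
  # start from an all-NA matrix so that skipped IDs stay NA
  betas <- matrix(NA_real_, nrow = length(unique(test$IDcount)), ncol = 2,
                  dimnames = list(NULL, c("Intercept", "beta")))
  for (i in 1:2) {
    tmp <- test[test$IDcount == i, c("Otminus1", j)]
    # skip this ID if either column has no non-NA value
    if (any(colSums(!is.na(tmp)) == 0)) next
    betas[i, ] <- coef(lm(Otminus1 ~ ., tmp))
  }
  results[[j]] <- betas[, "beta"]
}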
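The same idea can also be wrapped in a small helper so that the NA handling lives in one place. This is only a sketch: safe_slope is a hypothetical function, not part of any package, and the data are the IDcount, Otminus1, N.1 and N.5 columns from the question. It returns NA whenever fewer than two complete pairs are available.

```r
# Hypothetical helper: slope of y ~ x, or NA when fewer than two complete pairs exist
safe_slope <- function(y, x) {
  ok <- complete.cases(x, y)
  if (sum(ok) < 2) return(NA_real_)
  unname(coef(lm(y[ok] ~ x[ok]))[2])
}

# Subset of the columns from the question
test <- data.frame(
  IDcount  = rep(1:2, each = 5),
  Otminus1 = c(-0.28, -0.28, -0.44, -0.27, 0.23, -0.03, -0.06, -0.04, 0, 0.02),
  N.1 = c(NA, -0.1, 0.01, 0.1, -0.04, -0.04, -0.04, -0.04, -0.05, -0.05),
  N.5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, -0.13)
)

# Rows are IDs, columns are the N.* predictors
results <- sapply(c("N.1", "N.5"), function(j) {
  sapply(split(test, test$IDcount), function(d) safe_slope(d$Otminus1, d[[j]]))
})
print(results)
```

Here N.5 comes out as NA for both IDs, because ID 1 has no values at all and ID 2 has only one, which is not enough to estimate a slope.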

linear regression model with dplyr on specified columns by name

I have the following data frame, each row containing four dates ("y") and four measurements ("x"):
df = structure(list(x1 = c(69.772808673525, NA, 53.13125414839,
17.3033274666411,
NA, 38.6120670385487, 57.7229000792707, 40.7654208618078, 38.9010405201831,
65.7108936694177), y1 = c(0.765671296296296, NA, 1.37539351851852,
0.550277777777778, NA, 0.83037037037037, 0.0254398148148148,
0.380671296296296, 1.368125, 2.5250462962963), x2 = c(81.3285388496182,
NA, NA, 44.369872853302, NA, 61.0746827226573, 66.3965114460601,
41.4256874481852, 49.5461413070349, 47.0936997726146), y2 =
c(6.58287037037037,
NA, NA, 9.09377314814815, NA, 7.00127314814815, 6.46597222222222,
6.2462962962963, 6.76976851851852, 8.12449074074074), x3 = c(NA,
60.4976916064608, NA, 45.3575294731303, 45.159758146854, 71.8459173097114,
NA, 37.9485456227131, 44.6307631013742, 52.4523342186143), y3 = c(NA,
12.0026157407407, NA, 13.5601157407407, 16.1213657407407, 15.6431018518519,
NA, 15.8986805555556, 13.1395138888889, 17.9432638888889), x4 = c(NA,
NA, NA, 57.3383407228293, NA, 59.3921356160536, 67.4231673171527,
31.853845252547, NA, NA), y4 = c(NA, NA, NA, 18.258125, NA,
19.6074768518519,
20.9696527777778, 23.7176851851852, NA, NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
I would like to create an additional column containing the slope of all the y's versus all the x's, for each row (each row is a patient with these 4 measurements).
Here is what I have so far:
df <- df %>% mutate(Slope = lm(vars(starts_with("y") ~
vars(starts_with("x"), data = .)
I am getting an error:
invalid type (list) for variable 'vars(starts_with("y"))'...
What am I doing wrong, and how can I calculate the rowwise slope?
You are using a tidyverse syntax but your data is not tidy...
Maybe you should rearrange your data.frame and rethink the way you store your data.
Here is how to do it in a quick and dirty way (at least if I understood your explanations correctly):
df <- merge(reshape(df[,(1:4)*2-1], dir="long", varying = list(1:4), v.names = "x", idvar = "patient"),
reshape(df[,(1:4)*2], dir="long", varying = list(1:4), v.names = "y", idvar = "patient"))
df$patient <- factor(df$patient)
Then you could loop over the patients, perform a linear regression and get the slopes as a vector:
sapply(levels(df$patient), function(pat) {
coef(lm(y~x,df[df$patient==pat,],na.action = "na.omit"))[2]
})
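If reshaping is not desired, the slope can also be computed row by row on the wide layout. The following is a minimal base R sketch on made-up data with three patients and three x/y pairs; the values and the names x_cols and y_cols are illustrative, and it assumes that the i-th x column pairs with the i-th y column.

```r
# Made-up data in the wide layout of the question
df <- data.frame(
  x1 = c(1, 2, 5),   y1 = c(2, 1, 7),
  x2 = c(2, NA, NA), y2 = c(4, NA, NA),
  x3 = c(3, 4, NA),  y3 = c(6, 3, NA)
)

# Locate the x and y columns before adding any new column
x_cols <- grep("^x", names(df))
y_cols <- grep("^y", names(df))

df$Slope <- sapply(seq_len(nrow(df)), function(i) {
  x <- unlist(df[i, x_cols])
  y <- unlist(df[i, y_cols])
  ok <- complete.cases(x, y)
  # A slope needs at least two complete (x, y) pairs
  if (sum(ok) < 2) return(NA_real_)
  unname(coef(lm(y[ok] ~ x[ok]))[2])
})
print(df)
```

Rows with fewer than two complete pairs get NA instead of raising an error.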

find strings in data.frame to fill in new column

I used dplyr on my data to create a subset of data like this:
dd <- data.frame(ID = c(700689L, 712607L, 712946L, 735907L, 735908L, 735910L, 735911L, 735912L, 735913L, 746929L, 747540L),
`1` = c("eg", NA, NA, "eg", "eg", NA, NA, NA, NA, "eg", NA),
`2` = c(NA, NA, NA, "sk", "lk", NA, NA, NA, NA, "eg", NA),
`3` = c(NA, NA, NA, "sk", "lk", NA, NA, NA, NA, NA, NA),
`4` = c(NA, NA, NA, "lk", "lk", NA, NA, NA, NA, NA, NA),
`5` = c(NA, NA, NA, "lk", "lk", NA, NA, NA, NA, NA, NA),
`6` = c(NA, NA, NA, "lk", "lk", NA, NA, NA, NA, NA, NA))
I now want to check every column except ID if it contains certain strings. In this example I want to create 1 column with "1" for every ID that contains a column with "eg" and "0" for the rest. Likewise one more column which tells me if there is either a "sk" or "lk" in the other columns. After that the old columns except ID can be removed from the data.frame
The difficult part for me is doing this with a dynamic number of columns, as my dplyr-subset will return different amounts of columns based on the specific case, but I need to check every one that is created in every case. I wanted to use unite first to put all strings together but I will have the same problem then: How can I unite all columns except the first ID one.
If this can be solved within dplyr it would be perfect but any working solution is appreciated.
The result should look like this:
result <- data.frame(ID = c(700689L, 712607L, 712946L, 735907L, 735908L, 735910L, 735911L, 735912L, 735913L, 746929L, 747540L),
with_eg = c(1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0),
with_sk_or_lk = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0))
From your description, you want one column to check for "eg" and another column to check for both "lk" and "sk". If this is the case, then the following base R method will work.
dfNew <- cbind(id=dd[1],
eg=pmin(rowSums(dd[-1] == "eg", na.rm=TRUE), 1),
other=pmin(rowSums(dd[-1] == "sk" | dd[-1] == "lk", na.rm=TRUE), 1))
Here, the presence of "eg" is checked across the entire data.frame (except the ID column), which returns a logical matrix. rowSums adds up the TRUE values in each row, with na.rm removing the NAs, and pmin then takes the minimum of each row sum and 1, so that any count greater than 1 is replaced by 1 and any 0 is preserved.
The same logic is used to build the "other" variable, except that the initial logical matrix checks for the presence of either "lk" or "sk". Finally, cbind returns a 3-column data.frame with the desired values.
This returns
dfNew
ID eg other
1 700689 1 0
2 712607 0 0
3 712946 0 0
4 735907 1 1
5 735908 1 1
6 735910 0 0
7 735911 0 0
8 735912 0 0
9 735913 0 0
10 746929 1 0
11 747540 0 0
Here is an admittedly hacky dplyr/purrr solution. Given that your IDs don't seem like they'll ever equal 'eg', 'sk', or 'lk', I haven't included anything to not search the ID column.
library(dplyr)
library(purrr)
dd %>%
split(.$ID) %>%
map_df(~ tibble( # tibble() replaces the deprecated data_frame()
ID = .x$ID,
eg = ifelse(any(.x == 'eg', na.rm = TRUE), 1, 0),
other = ifelse(any(.x == 'lk' | .x == 'sk', na.rm = TRUE), 1, 0)
))
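Since the number of columns changes from case to case, the check can also be wrapped in a small function that looks at every column except ID. This is a sketch in base R: has_any is a hypothetical helper name, and the data are a shortened version of dd from the question.

```r
# Hypothetical helper: 1 if any non-ID column in the row holds one of `values`, else 0
has_any <- function(data, values, id_col = "ID") {
  cols <- setdiff(names(data), id_col)
  hits <- lapply(data[cols], function(col) col %in% values)  # NA counts as no match
  as.integer(Reduce(`|`, hits))
}

# Shortened version of the question's data
dd <- data.frame(
  ID = c(700689L, 712607L, 735907L, 746929L),
  X1 = c("eg", NA, "eg", "eg"),
  X2 = c(NA, NA, "sk", "eg"),
  X3 = c(NA, NA, "lk", NA)
)

result <- data.frame(
  ID            = dd$ID,
  with_eg       = has_any(dd, "eg"),
  with_sk_or_lk = has_any(dd, c("sk", "lk"))
)
print(result)
```

Using %in% rather than == means NA entries simply count as "not found", so no na.rm handling is needed.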
