How to divide my dataset for using PERMANOVA in R

Hello everyone :) I have a data set of individuals belonging to 5 different species (one column), together with their presence/absence in different landscape types (7 other columns).
'data.frame': 1212 obs. of 10 variables:
$ latitude : num 34.5 34.7 34.7 34.8 34.8 ...
$ longitude : num 127 127 127 127 127 ...
$ species : Factor w/ 5 levels "Bufo gargarizans",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Built : int 0 0 0 0 0 0 0 0 0 0 ...
$ Agriculture: int 1 1 0 1 0 0 1 0 0 0 ...
$ Forested : int 0 0 1 0 0 0 0 1 1 1 ...
$ Grassland : int 0 0 0 0 0 0 0 0 0 0 ...
$ Wetland : int 0 0 0 0 0 0 0 0 0 0 ...
$ Bare : int 0 0 0 0 1 0 0 0 0 0 ...
$ Water : int 1 0 0 0 0 1 0 0 0 0 ...
I am trying to use PERMANOVA followed by a Tukey test to see whether the species use the landscape differently or not. My supervisor did it in SPSS and it worked very well, and now I have to do it in R.
I saw that I need 2 csv files to run PERMANOVA in R, but I only have one. Here is the script that I found on the internet and want to use for my analysis.
library(vegan)
data(dune)
data(dune.env)
# default test by terms
adonis2(dune ~ Management*A1, data = dune.env)
In my case, if I understand correctly, I should have 1 data frame with the species and 1 data frame with the environmental variables.
However, my presence/absence values are inside the environmental categories (see the str of my table above). So if I create 1 data frame with species only, that data frame will not contain any numerical values.
So I am totally lost and don't know how to proceed. Can someone help me please? Thank you!

I will split my answer into two parts. The one where I know what I am talking about and one where I am brainstorming ;)
Here is the first part on how to split your data into two data.frames
# Load dplyr (provides %>%, tibble and select) and set a seed for reproducibility
library(dplyr)
set.seed(1312)
# Some sample data
sample_data <- dplyr::tibble(species = sample(LETTERS[1:5], size = 500, replace = TRUE),
                             Built = sample(c(0, 1), size = 500, replace = TRUE),
                             Agriculture = sample(c(0, 1), size = 500, replace = TRUE),
                             Forested = sample(c(0, 1), size = 500, replace = TRUE),
                             Grassland = sample(c(0, 1), size = 500, replace = TRUE),
                             Wetland = sample(c(0, 1), size = 500, replace = TRUE),
                             Bare = sample(c(0, 1), size = 500, replace = TRUE),
                             Water = sample(c(0, 1), size = 500, replace = TRUE))
# Split data: df1 holds the grouping factor, df2 the presence/absence columns
df1 <- sample_data %>%
  dplyr::select(species)
df2 <- sample_data %>%
  dplyr::select(Built:Water)
Now part two: as it says in ?adonis2, the first argument of adonis2 is a formula whose left-hand side must be a community data matrix or a dissimilarity matrix.
Even though I am not sure whether it makes sense, I went wild and followed the instructions :D
df2_dist <- dist(df2)
vegan::adonis2(df2_dist~species, data=df1)
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
vegan::adonis2(formula = d2 ~ species, data = d1)
         Df SumOfSqs      R2      F Pr(>F)
species   4    5.333 0.15802 0.7038   0.88
Residual 15   28.417 0.84198
Total    19   33.750 1.00000
Of course this might be nonsense in terms of content, as I took a purely technical approach here, but maybe it helps you to shape your data as required.

So I made this code :
setwd("C:/Users/Johan/Downloads/memoire Johanna (1)/memoire Johanna")
xdata=read.csv(file="all10_reduce_focal species1.csv", header=T, sep=",")
str(xdata)
xdata$species <- as.factor(xdata$species)
sample_data <- dplyr::tibble(species=sample(LETTERS[1:5],size=1212,replace=T),
Built=sample(c(0,1),size=1212,replace=T),
Agriculture=sample(c(0,1),size=1212,replace=T),
Forested=sample(c(0,1),size=1212,replace=T),
Grassland=sample(c(0,1),size=1212,replace=T),
Wetland=sample(c(0,1),size=1212,replace=T),
Bare=sample(c(0,1),size=1212,replace=T),
Water=sample(c(0,1),size=1212,replace=T))
library(dplyr)
df1 <- sample_data %>%
dplyr::select(species)
df2 <- sample_data %>%
dplyr::select(Built:Water)
df1_dist <- dist(df1)
vegan::adonis2(df1_dist~Built+Agriculture+Grassland+Forested+Wetland+Bare+Water, data=df2)
Species should be the response, as I am trying to see the effect of the landscape on the species. When I do this I get:
Error in vegdist(as.matrix(lhs), method = method, ...) : input data must be numeric
This is because the "species" variable contains only characters. So I changed it to make it numeric:
sample_data <- dplyr::tibble(species=sample(c(1:5),size=1212,replace=T),
Built=sample(c(0,1),size=1212,replace=T),......
But the result I got is different from the SPSS result: I don't have any significant variable (in SPSS, Built, Agriculture, Forested and Water are significant).
I think my code is wrong.
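One thing that may explain the mismatch: the code above regenerates random data with sample() instead of analysing xdata, so no real signal can show up. Below is a minimal sketch, not a definitive analysis, of splitting a single table that has the structure shown in the question. The simulated xdata is only a stand-in for the CSV, the choice of a binary (Jaccard-type) distance from base R's dist() is an assumption, and the vegan call is skipped if the package is not installed. Species goes on the right-hand side because adonis2 needs a numeric matrix or dissimilarity on the left; the test then asks whether landscape use differs among species.

```r
set.seed(1312)

# Stand-in for the real CSV (assumption): one species column, 7 presence/absence columns
n <- 60
land_names <- c("Built", "Agriculture", "Forested", "Grassland",
                "Wetland", "Bare", "Water")
land <- matrix(rbinom(n * length(land_names), 1, 0.2), nrow = n,
               dimnames = list(NULL, land_names))
# Give every individual at least one landscape so no row is all zeros
land[cbind(seq_len(n), sample(length(land_names), n, replace = TRUE))] <- 1
xdata <- data.frame(species = factor(sample(LETTERS[1:5], n, replace = TRUE)),
                    land)

# Response: dissimilarity between individuals based on the landscape columns
land_dist <- dist(xdata[land_names], method = "binary")

# Predictor: the species factor, taken from the same data frame
if (requireNamespace("vegan", quietly = TRUE)) {
  res <- vegan::adonis2(land_dist ~ species, data = xdata, permutations = 999)
  print(res)
}
```

With the real data, the two lines that build land and xdata would be replaced by the read.csv() call from the question; everything else refers to columns by name.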

Related

Test to compare proportions / paired (small) samples / 7-levels categorical variables

I'm working on data from a pre-post survey: the same participants were asked the same questions at 2 different times (so the samples are not independent). I have 19 categorical variables (Likert scale with 7 levels).
For each question, I want to know if there is a significant difference between the "pre" and "post" answer. To do this, I want to compare proportions in each of the 7 categories between pre and post results.
I have two data bases (one 'pre' and one 'post') which I have merged as in the following example (I've made sure that the categorical variables have the same levels for PRE and POST):
prepost <- data.frame(ID = c(1:7),
Quest1_PRE = c('5_SomeA','1_StronglyD','3_SomeD','4_Neither','6_Agree','2_Disagree','7_StronglyA'),
Quest1_POST = c('1_StronglyD','7_StronglyA','6_Agree','7_StronglyA','3_SomeD','5_SomeA','7_StronglyA'))
I tried to perform a McNemar test:
temp <- table(prepost_S1$Quest1_PRE,prepost_S1$Quest1_POST)
mcnemar.test(temp)
> McNemar's Chi-squared test
data: temp
McNemar's chi-squared = NaN, df = 21, p-value = NA
But whatever the question, the test always returns NA values. I think it is because the pivot table (temp) has very low frequencies (I only have 24 participants).
One example of a pivot table (I have 22 participants):
> temp
              1_StronglyD 2_Disagree 3_SomeD 4_Neither 5_SomeA 6_Agree 7_StronglyA
  1_StronglyD           0          0       0         0       0       1           0
  2_Disagree            0          0       0         0       1       0           0
  3_SomeD               0          0       0         0       0       1           1
  4_Neither             0          0       1         1       2       2           2
  5_SomeA               0          0       0         0       1       1           2
  6_Agree               0          0       0         0       0       3           2
  7_StronglyA           0          0       0         0       0       1           2
I've tried aggregating the variables' levels into 5 instead of 7 ("1_Disagree", "2_SomeD", "3_Neither", "4_SomeA", "5_Agree") but it still doesn't work.
Is there an equivalent of Fisher's exact test for paired samples? I've done some research but couldn't find anything helpful.
If not, could you think of any other test that could answer my question (= Do the answers differ significantly between the pre and post survey)?
Thanks!
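A short sketch of where the NaN seems to come from, using the table posted above. For a k x k table, mcnemar.test() sums (n_ij - n_ji)^2 / (n_ij + n_ji) over all pairs of off-diagonal cells, so any pair in which both cells are 0 contributes 0/0. The last part is only one possible alternative, under the assumption that the Likert levels can be treated as ordered scores: a paired Wilcoxon signed-rank test on the numeric codes, shown here on the 7-row example data frame from the question. It tests for a shift between pre and post rather than comparing the proportions category by category.

```r
lev <- c("1_StronglyD", "2_Disagree", "3_SomeD", "4_Neither",
         "5_SomeA", "6_Agree", "7_StronglyA")

# The pivot table from the question (rows = PRE, columns = POST)
temp <- matrix(c(0, 0, 0, 0, 0, 1, 0,
                 0, 0, 0, 0, 1, 0, 0,
                 0, 0, 0, 0, 0, 1, 1,
                 0, 0, 1, 1, 2, 2, 2,
                 0, 0, 0, 0, 1, 1, 2,
                 0, 0, 0, 0, 0, 3, 2,
                 0, 0, 0, 0, 0, 1, 2),
               nrow = 7, byrow = TRUE, dimnames = list(PRE = lev, POST = lev))

mc <- mcnemar.test(temp)

# Count symmetric cell pairs (i, j) / (j, i) that are both empty
upper <- temp[upper.tri(temp)]
lower <- t(temp)[upper.tri(temp)]
empty_pairs <- sum(upper + lower == 0)

# Possible alternative (assumption: levels are ordinal scores 1..7)
prepost <- data.frame(
  Quest1_PRE  = c("5_SomeA", "1_StronglyD", "3_SomeD", "4_Neither",
                  "6_Agree", "2_Disagree", "7_StronglyA"),
  Quest1_POST = c("1_StronglyD", "7_StronglyA", "6_Agree", "7_StronglyA",
                  "3_SomeD", "5_SomeA", "7_StronglyA"))
pre  <- as.integer(factor(prepost$Quest1_PRE,  levels = lev))
post <- as.integer(factor(prepost$Quest1_POST, levels = lev))
wt <- wilcox.test(pre, post, paired = TRUE, exact = FALSE)
```

Printing mc reproduces the NaN statistic, and empty_pairs shows how many of the 21 cell pairs are empty in both directions.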

How to organize my data to create a heatmap using correlation or cluster analysis (x must be numeric problem)

I need some help with generating heatmaps with cluster analysis and correlation (I am new to R). My data looks like this in Excel:
             Gene1  Gene2  Gene3  Gene4  Gene5  ...  Gene296
Bacteria1    0      0      0      0.7    0.2    ...  0
Bacteria2    0.44   0      0      0      0      ...  0.9
Bacteria2    0      0.32   0      0.4    0      ...  0
...          ...    ...    ...    ...    ...    ...  ...
Bacteria117  0      0.2    0.3    0      0.7    ...  0
A value of 0.32 represents a score of 32 from 0 to 100. There are higher scores (0.9 for example) or lower scores (0 or 0.2 for example). I checked for NAs and there are none. I want to do cluster analysis to find out what bacteria form clusters according to my experimental data (scores). The file is CSV. I used this code:
> aa <- read.csv(file.choose())
> str(aa)
#I obtain this structure
'data.frame': 117 obs. of 296 variables:
$ X : Factor w/ 117 levels "Ac_neuii_BVI",..: 45 64 67 104 1 2 3 4 5 6 ...
$ AAC6_Iad : num 0 0 0 0 0 0 0 0 0 0 ...
$ aad6 : num 0 0 0 0 0 0 0 0 0 0 ...
$ abeS : num 0 0 0 0 0 0 0 0 0 0 ...
> is.numeric(aa)
[1] FALSE
When I try to use the correlation or the clustering I get this error:
> az <- cor(aa)
Error in cor(aa) : 'x' must be numeric
I tried as.matrix, but of course the error persists with the matrix. I tried as.numeric, but it didn't work. I removed X with aa$X <- NULL and the problem disappeared (I don't know if this is the correct way to solve it), but the names of the bacteria disappeared too, and I then get a correlation between my genes, not between my genes AND the bacteria. The same thing happens with the clustering using hclust or dist. Is there a way I should organize my csv file? I haven't found a clear article on the internet on how to solve the "x must be numeric" problem, or on how to compute the correlation or the distances between the genes and the bacteria.
Thank you. Sorry for the ignorance on certain things that might appear obvious to you.
You can import the bacteria names as row.names:
aa <- read.csv(file.choose(), row.names = 1)
aa$X is not numeric (it is a factor). You can transform it with:
aa$X = as.numeric(aa$X)
Then az <- cor(aa) will run... but (as noted by @Cole) it does not make sense, since X only holds the names of the bacteria.
You can set the first column to be the names of the rows with the row.names parameter of read.csv:
aa <- read.csv(file.choose(), row.names = 1)
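To make the row.names suggestion concrete, here is a small sketch with made-up scores (the inline CSV text and the four genes and bacteria are assumptions, not the asker's data). Once the names are row names, every column is numeric; cor() then correlates columns (genes), transposing first correlates rows (bacteria), and dist() plus hclust() clusters the bacteria while keeping their labels.

```r
# Made-up stand-in for the CSV file: first column holds the bacteria names
csv_text <- "X,Gene1,Gene2,Gene3,Gene4
Bacteria1,0,0,0.7,0.2
Bacteria2,0.44,0,0,0.9
Bacteria3,0,0.32,0.4,0
Bacteria4,0.1,0.2,0.3,0.7"

# row.names = 1 turns the first column into row names instead of a factor column
aa <- read.csv(text = csv_text, row.names = 1)
m <- as.matrix(aa)        # purely numeric matrix, names kept as dimnames

gene_cor <- cor(m)        # correlation between genes (columns)
bact_cor <- cor(t(m))     # correlation between bacteria (rows)

# Hierarchical clustering of the bacteria on their score profiles
hc <- hclust(dist(m))

# Draw the clustered heatmap only in an interactive session
if (interactive()) heatmap(m, scale = "none")
```

With the real file, read.csv(text = csv_text, ...) would be replaced by read.csv(file.choose(), row.names = 1) as in the answers above.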

R: filling matrix with values does not work

I have a data frame vec that I need to prepare for an image.plot() plot. The structure of vec is as follows:
> str(vec)
'data.frame': 31212 obs. of 5 variables:
$ x : int 8 24 40 56 72 88 104 120 136 152 ...
$ y : int 8 8 8 8 8 8 8 8 8 8 ...
$ dx: num 0 0 0 0 0 0 0 0 0 0 ...
$ dy: num 0 0 0 0 0 0 0 0 0 0 ...
$ d : num 0 0 0 0 0 0 0 0 0 0 ...
Note: the values in $dx, $dy and $d are not zero but only too small to be shown in this overview.
Background: the data is the output of a pixel tracking software. $x and $y are pixel coordinates while in $d are the displacement vector lengths (in pixels) of that pixel.
image.plot() expects as first and second argument the dimension of the matrix as ordered vectors, so I think sort(unique(vec$x)) and sort(unique(vec$y)) respectively should be good. So, I would like to end up with image.plot(sort(unique(vec$x)),sort(unique(vec$y)), data)
The third argument is the actual data. To build this I tried:
# spanning an empty matrix
data = matrix(NA,length(unique(vec$x)),length(unique(vec$y)))
# filling the matrix
data[match(vec$x, sort(unique(vec$x))), match(vec$y, sort(unique(vec$y)))] = vec$d
But, unfortunately, this isn't working. It reports no errors but data contains no values! This works:
for(i in c(1:length(vec$x))) data[match(vec$x[i], sort(unique(vec$x))), match(vec$y[i], sort(unique(vec$y)))] = vec$d[i]
But is very slow.
a) is there a better way to build data?
b) is there a better way to deal with my problem, anyways?
R allows indexing of a matrix by a two-column matrix, where the first column of the index is interpreted as the row index, and the second column as the column index. So create the indexes into data as a two-column matrix
idx = cbind(match(vec$x, sort(unique(vec$x))),
match(vec$y, sort(unique(vec$y))))
and use that
data[idx] = vec$d
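Here is a self-contained sketch of the two-column index approach on a small made-up grid (the toy vec below is an assumption standing in for the pixel-tracking output). It also runs the slow loop from the question once, to check that both approaches fill the matrix identically.

```r
set.seed(1)

# Toy stand-in for vec: a 4 x 3 pixel grid with one displacement value per pixel
vec <- expand.grid(x = seq(8, 56, by = 16), y = seq(8, 40, by = 16))
vec$d <- seq_len(nrow(vec)) / 10
vec <- vec[sample(nrow(vec)), ]   # shuffle rows so the order does not matter

xs <- sort(unique(vec$x))
ys <- sort(unique(vec$y))

# Vectorised fill: each row of idx is one (row, column) position in data
data <- matrix(NA_real_, length(xs), length(ys))
idx <- cbind(match(vec$x, xs), match(vec$y, ys))
data[idx] <- vec$d

# Reference result using the element-by-element loop from the question
data_loop <- matrix(NA_real_, length(xs), length(ys))
for (i in seq_len(nrow(vec))) {
  data_loop[match(vec$x[i], xs), match(vec$y[i], ys)] <- vec$d[i]
}

same <- identical(data, data_loop)
```

The filled matrix can then be passed on as in the question, i.e. image.plot(xs, ys, data).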

How to add dummy variables in R

I know there are several questions about this topic, but none of them seem to answer my specific question.
I have a dataset with five independent variables and I want to add two dummy variables to my regression in R. I have my data in Excel and importing the dataset is not a problem (I use read.csv2). Now, when I want to look at my dummy variables, D1 and D2, I can't. I can see all the other variables. The two dummy variables both take the values 0 and 1 throughout the dataset.
I can easily see a summary of all my data, including D1 and D2 (with median, mean, etc.), and I can call each of the 5 variables separately without any problems at all, but I can't do that with D1 and D2.
> str(tilskuere)
'data.frame': 180 obs. of 7 variables:
$ ATT : int 3166 4315 7123 6575 7895 7323 3579 9571 5345 6595 ...
$ PRICE : int 80 95 120 100 105 115 80 130 105 100 ...
$ viewers: int 41000 43000 56000 66000 157000 91000 51000 30000 36000 72000 ...
$ CB1 : int 10 10 5 2 7 2 3 1 10 1 ...
$ CB2 : num 1 1 1 0 0.33 ...
$ D1 : int 0 0 0 1 0 0 0 0 0 0 ...
$ D2 : int 1 0 0 0 0 1 1 0 0 0 ...
> summary(tilskuere)
> mean(ATT)
[1] 6856.372
> mean(D1)
Error in mean(D1) : object 'D1' not found
To sum up: I can run regressions in R without D1 and D2, but I can't include them as dummy variables because R can't find them when I run the model. R simply says "object 'D1' not found".
I hope someone can help. Thank you in advance.
Kind regards
Mikkel
I added the material from your comment to the text and added some line breaks. It is now clear that the issue is that columns of a data frame are not first-class objects in R. Try:
mean(tilskuere$D1)
You can see what objects are in your workspace with:
ls()
You appear to have an object named ATT in your workspace as well as a length-180 column by the same name in the object named tilskuere.
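A small sketch of the point above, on simulated data (the values in this tilskuere are made up; only the column names follow the question). Columns are reached through the data frame, either with $, with with(), or through the data argument of lm(), so the dummies never need to exist as separate objects in the workspace.

```r
set.seed(42)

# Simulated stand-in for the imported data (assumption: values are invented)
tilskuere <- data.frame(
  ATT   = round(rnorm(180, mean = 6800, sd = 1500)),
  PRICE = sample(seq(80, 130, by = 5), 180, replace = TRUE),
  D1    = rbinom(180, 1, 0.2),
  D2    = rbinom(180, 1, 0.3)
)

# Columns live inside the data frame, so refer to them through it
m1 <- mean(tilskuere$D1)
m2 <- with(tilskuere, mean(D1))

# lm() looks the variables up in `data`, so the dummies can be used directly
fit <- lm(ATT ~ PRICE + D1 + D2, data = tilskuere)
coef(fit)
```

Because of the data argument, the regression works even though mean(D1) on its own fails.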

How do I automatically specify the correct regression model when the number of variables differs between input data sets?

I have a working R program that will be used by my internal client for analysing their nutrient intake data. For each dataset that they have, they will re-run the R program.
A key part of the program is a nonlinear mixed-effects analysis, using nlmer from the lme4 package, that incorporates dummy variables for age. Depending on whether they will be analysing children or adults, the number of age band dummies in the formula will differ, although the reference age band dummy will always be the youngest. I think that the number of possible age bands ranges from 4 to about 6, so it's not a large range. It is a trivial matter to count the number of age band dummies, if I need to condition based on that.
What is the most efficient way for me to wrap the model-based code (the lmer that provides the starting parameter values, the function for the nlmer model, and the model specification in nlmer itself) so that the correct function and models are applied based on the number of age band dummies in the model? The other variables in the model are constant across datasets.
I've already got the program set up for automatically generating the relevant dummies and dropping those that aren't used in the current analysis. The program after the model is pretty well set up as automated as well. I'm just stuck on what to do with automating the two lme4-based analyses and function. These will only be run once for each dataset.
I've been wondering whether I need to write a function to contain all the lme4 related code, or whether there was an easier way. I would appreciate some pointers on how to do this. It took me one day to work out how to get the function working that I needed for the nlmer model, so I am still at a beginner level with functions.
I've searched for other R related automation questions on the site and I didn't find anything similar to what I would like to do.
Thanks in advance.
Update in response to suggestion in the comments about using a string. That sounds like an easy way forward for me, except that I don't then know how to apply the string content in a function as each dummy variable level (excluding the reference category) is used in the function for nlmer. How can I pull apart the string and use only the dummy variables that I have in a function? For example, one analysis might have AgeBand2, AgeBand3, AgeBand4, and another analysis might have AgeBand5 as well as those 3? If this was VBA, I would just create subfunctions based on the number of age dummy variables. I have no idea how to do this efficiently in R.
Can I just wrap a while loop around the lmer, function, and nlmer parts, so I have a series of while loops?
This is the section of code I wish to automate, the number of AgeBand dummy variables differs depending on the dataset that will be analysed (children vs. adults). This is using the dataset that I have been testing a SAS to R translation on, but the real datasets will be very similar. It is necessary to have a nonlinear model as this is the basis of the peer-reviewed published method that I am working off.
library(lme4)
Male.lmer <- lmer(BoxCoxXY ~ AgeBand4 + AgeBand5 + AgeBand6 + AgeBand7 +
AgeBand8 + Race1 + Race3 + Weekend + IntakeDay + (1|RespondentID),
data=Male.AddSugar,
weights=Replicates)
Male.lmer.fixef <- fixef(Male.lmer)
Male.lmer.fixef <- as.data.frame(Male.lmer.fixef)
bA <- Male.lmer.fixef[1,1]
bB <- Male.lmer.fixef[2,1]
bC <- Male.lmer.fixef[3,1]
bD <- Male.lmer.fixef[4,1]
bE <- Male.lmer.fixef[5,1]
bF <- Male.lmer.fixef[6,1]
bG <- Male.lmer.fixef[7,1]
bH <- Male.lmer.fixef[8,1]
bI <- Male.lmer.fixef[9,1]
bJ <- Male.lmer.fixef[10,1]
MD <- deriv(expression(b0 + b1*AgeBand4 + b2*AgeBand5 + b3*AgeBand6 +
b4*AgeBand7 + b5*AgeBand8 + b6*Race1 + b7*Race3 + b8*Weekend + b9*IntakeDay),
namevec=c("b0","b1","b2","b3", "b4", "b5", "b6", "b7", "b8", "b9"),
function.arg=c("b0","b1","b2","b3", "b4", "b5", "b6", "b7", "b8", "b9",
"AgeBand4","AgeBand5","AgeBand6","AgeBand7","AgeBand8",
"Race1","Race3","Weekend","IntakeDay"))
Male.nlmer <- nlmer(BoxCoxXY ~ MD(b0,b1,b2,b3,b4,b5,b6,b7,b8,b9,AgeBand4,AgeBand5,AgeBand6,AgeBand7,AgeBand8,
Race1,Race3,Weekend,IntakeDay)
~ b0|RespondentID,
data=Male.AddSugar,
start=c(b0=bA, b1=bB, b2=bC, b3=bD, b4=bE, b5=bF, b6=bG, b7=bH, b8=bI, b9=bJ),
weights=Replicates
)
These will be the required changes between the datasets:
the number of fixed effect coefficients that I need to assign out of the lmer will change.
in the function, the expression, namevec, and function.arg parts will change
the nlmer, the model statement and start parameter list will change.
I can change the lmer model statement so it takes AgeBand as a factor with levels, but I still need to pull out the values of the coefficients afterwards.
str(Male.AddSugar) gives:
'data.frame': 10287 obs. of 23 variables:
$ RespondentID: int 9966 9967 9970 9972 9974 9976 9978 9979 9982 9993 ...
$ RACE : int 2 3 2 2 3 2 2 2 2 1 ...
$ RNDW : int 26290 7237 10067 75391 1133 31298 20718 23908 7905 1091 ...
$ Replicates : num 41067 2322 17434 21723 375 ...
$ DRXTNUMF : int 27 11 13 18 17 13 13 19 11 11 ...
$ DRDDAYCD : int 1 1 1 1 1 1 1 1 1 1 ...
$ IntakeAmt : num 33.45 2.53 9.58 43.34 55.66 ...
$ RIAGENDR : int 1 1 1 1 1 1 1 1 1 1 ...
$ RIDAGEYR : int 39 23 16 44 13 36 16 60 13 16 ...
$ Subgroup : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 4 3 2 4 1 4 2 5 1 2 ...
$ WKEND : int 1 1 1 0 1 0 0 1 1 1 ...
$ AmtInd : num 1 1 1 1 1 1 1 1 1 1 ...
$ IntakeDay : num 0 0 0 0 0 0 0 0 0 0 ...
$ Weekend : int 1 1 1 0 1 0 0 1 1 1 ...
$ Race1 : num 0 0 0 0 0 0 0 0 0 1 ...
$ Race3 : num 0 1 0 0 1 0 0 0 0 0 ...
$ AgeBand4 : num 0 0 1 0 0 0 1 0 0 1 ...
$ AgeBand5 : num 0 1 0 0 0 0 0 0 0 0 ...
$ AgeBand6 : num 1 0 0 1 0 1 0 0 0 0 ...
$ AgeBand7 : num 0 0 0 0 0 0 0 1 0 0 ...
$ AgeBand8 : num 0 0 0 0 0 0 0 0 0 0 ...
$ YN : num 1 1 1 1 1 1 1 1 1 1 ...
$ BoxCoxXY : num 7.68 1.13 3.67 8.79 9.98 ...
The AgeBand data is incorrectly shown as the ordered factor Subgroup. Because I haven't used it, I haven't gone back and corrected this to a plain factor.
This assumes that you have one variable, "ageband", which is a factor with levels AgeBand2, AgeBand3, AgeBand4, and perhaps others that you want to be ignored. Since R regression functions generally treat the first factor level (by default the lowest in lexicographic order) as the reference level, you would get your correct level chosen automagically. You pick your desired levels by creating a dataset that has only the desired levels.
agelevs <- c("AgeBand2", "AgeBand3", "AgeBand4")
dsub <- droplevels(subset(inpdat, ageband %in% agelevs))  # keep wanted levels, drop the unused ones
# your_fun() is a placeholder for your own wrapper around the model call, e.g.
#   nlmer(y ~ ageband + <other-parameters>, data=dsub, ...)
res <- your_fun(dsub)
If you have gone to the trouble of creating separate variables, then you need to learn to use factors correctly rather than holding on to inefficient habits enforced by training in SPSS or other clunky macro processors.
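If the separate dummy columns have to stay (the nlmer model function in the question takes each dummy as its own argument), the three parts that change between datasets can be generated from the column names instead of being typed out. The following is a minimal sketch under that assumption; build_model_parts() and its argument names are hypothetical, and only base R is used here, so the lme4 calls appear as comments rather than being run.

```r
# Hypothetical helper: derives everything that depends on the number of dummies
build_model_parts <- function(dat,
                              response = "BoxCoxXY",
                              other = c("Race1", "Race3", "Weekend", "IntakeDay"),
                              id = "RespondentID") {
  # Age band dummies actually present in this dataset, in column order
  age <- grep("^AgeBand[0-9]+$", names(dat), value = TRUE)
  preds <- c(age, other)
  b <- paste0("b", 0:length(preds))               # b0 (intercept), b1, b2, ...

  # Formula for the lmer fit that provides the starting values
  lmer_formula <- as.formula(
    paste(response, "~", paste(preds, collapse = " + "), "+ (1 |", id, ")"))

  # Model function with gradient, as built by hand with deriv() in the question
  rhs <- paste(c(b[1], paste0(b[-1], "*", preds)), collapse = " + ")
  MD <- deriv(parse(text = rhs)[[1]], namevec = b, function.arg = c(b, preds))

  # Three-part formula for nlmer; MD is found via the formula's environment
  nlmer_formula <- as.formula(
    paste0(response, " ~ MD(", paste(c(b, preds), collapse = ", "),
           ") ~ b0 | ", id))

  list(age = age, b = b, lmer_formula = lmer_formula,
       MD = MD, nlmer_formula = nlmer_formula)
}

# Toy data with only two age band dummies, to show that the parts adapt
toy <- data.frame(RespondentID = 1:4, BoxCoxXY = c(7.7, 1.1, 3.7, 8.8),
                  AgeBand4 = c(0, 0, 1, 0), AgeBand5 = c(0, 1, 0, 0),
                  Race1 = 0, Race3 = c(0, 1, 0, 0), Weekend = 1, IntakeDay = 0)
parts <- build_model_parts(toy)

# Intended use with lme4 (not run here):
# fit0  <- lmer(parts$lmer_formula, data = Male.AddSugar, weights = Replicates)
# start <- setNames(unname(fixef(fit0)), parts$b)
# fit1  <- nlmer(parts$nlmer_formula, data = Male.AddSugar,
#                start = start, weights = Replicates)
```

The design choice here is to let the column names drive everything, so the same code runs whether the dataset has four or six age bands, with no loop or per-case subfunctions.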

Resources