R: Separating several observations of a variable and building a matrix - r

I have a multiple-response-variable with seven possible observations: "Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker".
If a respondent chose more than one option, however, the answers are not separated in the data (Data).
My goal is to create a matrix with all possible observations as variables, coded 1 (yes) or 0 (no). Currently I am using this command:
einzeln_strategisch_2021 <- data.frame(strategisch_2021[, ! colnames (strategisch_2021) %in% "Q12"], model.matrix(~ Q12 - 1, strategisch_2021)) %>%
This gives me a matrix of the kind I want, but it does not separate the observations, so I now have a matrix with 20 variables instead of the seven I expect.
I also tried separate() like this:
separate(Q12, into = c("Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition", "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft", "Spitzenpolitiker"), ";") %>%
This does separate the observations, but not in the right order and without the matrix.
How do I separate my observations and create a matrix with the possible observations as variables akin to the third picture (Matrix)?
Thank you very much in advance ;)
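A minimal base R sketch of one way to get the 0/1 matrix described above. It assumes the answers in Q12 are joined with ";" (taken from the separate() call in the question); the three example rows and the id column are made up for illustration.

```r
# The seven possible answers, in the column order wanted in the result
options <- c("Inhalt", "Arbeit", "Verhindern Koalition", "Ermöglichen Koalition",
             "Verhindern Kanzlerschaft", "Ermöglichen Kanzlerschaft",
             "Spitzenpolitiker")

# Made-up example data; the real data frame is strategisch_2021
strategisch_2021 <- data.frame(
  id  = 1:3,
  Q12 = c("Inhalt;Arbeit", "Spitzenpolitiker",
          "Arbeit;Verhindern Koalition;Inhalt"),
  stringsAsFactors = FALSE
)

# Split each cell on ";" (assumed delimiter), giving one character vector per row
chosen <- strsplit(strategisch_2021$Q12, ";", fixed = TRUE)

# For each row, mark which of the seven options were chosen (1) or not (0)
indicators <- t(sapply(chosen, function(x) as.integer(options %in% trimws(x))))
colnames(indicators) <- options

# Attach the indicator columns to the remaining variables
einzeln <- cbind(strategisch_2021["id"], indicators)
einzeln
```

Because each answer is matched exactly after splitting, the column order follows `options` rather than the order in which respondents ticked the boxes.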

Related

Omitting NAs from Data

First time posting. Apologies if I'm not as clear as I intend.
I have an Excel (xlsx) spreadsheet of data; it's sequencing data, if that helps. It is generally indexed as follows:
column 1 = organism families (hundreds of organisms down this column)
columns 2-x = specific samples
Many of the cells scattered throughout the data are zero or too low, and I want to omit them. I set up my data so that anything under 5 becomes NA. Since different samples will have more, fewer, or different species omitted by that threshold, I want to separate the data by sample. My code so far is:
#Files work, I just omitted my directories to place online
my_counts <- read_excel("...Family_120821.xlsx" , sheet = "family_Counts")
my_perc <- read_excel("...Family_120821.xlsx" , sheet = "family_Percentages")
my_counts[my_counts < 5] <- NA
my_counts
my_perc[my_perc < 0.05] <- NA
my_perc
S13 <- my_counts$Sample.13
S13A <- na.omit(S13)
S13A
S14 <- my_counts$Sample.14
S14A <- na.omit(S14)
S14A
S15 <- my_counts$Sample.15
S15A <- na.omit(S15)
S15A
...
First question: is there a better way to go about this, so that I can replicate it on different data without typing out each individual sample?
Most important question: when I do this, I get the values I want, with no NAs. But they are plain values, whereas I want another data frame so I can write it back to an xlsx file. As I have it, I lose the association with the organism.
Ex: Before
All samples by associated organisms
Ex: After
Single sample, no NAs, but also no association to organism index
Essentially the following image, but broken into individual samples. With only the organisms that met my threshold of 5 for counts, 0.05 for percents.
enter image description here
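A hedged sketch of one way to keep the organism column attached while looping over all sample columns. The column name family comes from the question; the small my_counts table here is made up, and the threshold of 5 is the one stated above.

```r
# Made-up stand-in for the counts sheet read with read_excel()
my_counts <- data.frame(
  family    = c("Fam_A", "Fam_B", "Fam_C"),
  Sample.13 = c(12, 2, 7),
  Sample.14 = c(0, 9, 30),
  stringsAsFactors = FALSE
)

# Every column except the organism column is treated as a sample
sample_cols <- setdiff(names(my_counts), "family")

# One data frame per sample: family plus that sample, rows below 5 dropped
per_sample <- lapply(setNames(sample_cols, sample_cols), function(s) {
  keep <- !is.na(my_counts[[s]]) & my_counts[[s]] >= 5
  my_counts[keep, c("family", s)]
})

per_sample$Sample.13
```

The result is a named list of data frames, so nothing has to be typed out per sample. If the writexl package is available, a named list like this can be passed to writexl::write_xlsx() to get one sheet per sample.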

How to label CCA-Plot with row.names in R

I've been trying to solve the following problem which I am sure is an easy one (I am just not able to find a solution). I am using the package vegan and want to perform a cca that shows the actual row names as labels (instead of the default "sit1", "sit2", ...).
I created a data frame (ls_Treat1) with cast(), with plot treatments (AB, DB, DL, etc.) as row names and species occurrences as values. The data frame looks as follows:
     species 1  species 2  species 3
AB           0          3          1
DB           1          6          0
DL           3          4          2
I created the data frame with the following code to set the treatments (AB, DB, DL, ...) as row names:
ls_Treat1 <- cast(fungi_ls, Treatment ~ species)
row.names(ls_Treat1)<- ls_Treat1$Treatment
ls_Treat1 <- ls_Treat1[,-1]
When I perform a cca with the following code:
ca <- cca(ls_Treat1)
plot(ca,display="sites")
R puts the default labels "sit1", "sit2", ... into the plot, instead of the actual row names, even though I have performed it this way before and the plots normally showed the right labels. Does this have anything to do with my creating the data frame? I tried to change the treatments (characters) into numbers (integers or factors) but still, the plot won't be labelled with my row names.
Can anyone help me with this?
Thank you very very much!!
The problem is that reshape::cast() does not produce a plain data.frame. The object claims to be a data.frame, but it is not one. cca() does matrix algebra, so we convert the input to a matrix; this works for a standard data.frame but not for the object you supplied. In particular, removing the first column with ls_Treat1 <- ls_Treat1[,-1] also removes the attributes that allow the names to be preserved. It would have worked without removing this column (if the reshape package was still loaded). Upgrading to the reshape2 package and using reshape2::acast() may be a solution.

Resampling in R

Consider the following data:
library(Benchmarking)
d <- data.frame(x1=c(200,200,3000), x2=c(200,200,1000), y=c(100,100,3))
So I have 3 observations.
Now I want to select 2 observations randomly out of d three times (without repetition; there are three combinations in total). For each of these three draws I want to calculate the following:
e <- dea(d[c('x1', 'x2')], d$y)
weighted.mean(eff(e), d$y)
That is, I will get three numbers, which I want to calculate an average of. Can someone show how to do this with a loop function in R?
Example:
There are three combinations in total, so in this case I can only get the same result. If I do the calculation manually, I get the following three results:
0.977 0.977 1
(The results could of course be in another order.)
And the mean of these three numbers is:
0.984
This is a simple example. In my case I have a lot of combinations, where I don't select all of the combinations (e.g. there could be say 1,000,000 combinations, where I only select 1,000 of them).
I think it's better if you use sample.int and replicate instead of doing all the combinations, see my example:
nsample <- 2 # Number of selected observations
nboot <- 10 # Number of times you repeat the process
replicate(nboot, with(d[sample.int(nrow(d), nsample), ],
weighted.mean(eff(dea(data.frame(x1, x2), y)), y)))
I have also checked the link you gave on this issue. If I understand correctly, you want to extract two rows (observations) each time without replacement, so you can use sample:
SelObs <- sample(1:nrow(d),2)
# for getting the selected observations just
dSel <- d[SelObs,]
And then do your calculations
If you want observations that were already selected to be excluded from the next random selection, the approach is similar, but you need an index:
Obs <- 1:nrow(d)
SelObs <- sample(Obs, 2)
dSel <- d[SelObs, ]
# and now, for removing those already selected
Obs <- setdiff(Obs, SelObs)  # drop by value; Obs[-SelObs] drops by position and fails after the first round
# and keep going with next random selections and the above code
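The index-based approach above can be sketched as a loop. This is only an illustration: the fourth data row is made up so that two disjoint pairs exist, and subset_stat() is a hypothetical stand-in for the dea()/eff() calculation so the sketch runs without the Benchmarking package.

```r
set.seed(1)  # only to make the sketch reproducible

# Illustrative data (last row is made up)
d <- data.frame(x1 = c(200, 200, 3000, 500),
                x2 = c(200, 200, 1000, 400),
                y  = c(100, 100, 3, 50))

# Hypothetical stand-in for the per-subset calculation; with Benchmarking
# loaded this would be weighted.mean(eff(dea(...)), y) on the selected rows
subset_stat <- function(dSel) weighted.mean(dSel$y / dSel$x1, dSel$y)

Obs     <- seq_len(nrow(d))  # indices still available for selection
used    <- integer(0)        # indices already drawn
results <- numeric(0)        # one value per draw

# Stop before fewer than 2 indices remain; note that sample() on a single
# number n would sample from 1:n, which is not what is wanted here
while (length(Obs) >= 2) {
  SelObs  <- sample(Obs, 2)
  results <- c(results, subset_stat(d[SelObs, ]))
  used    <- c(used, SelObs)
  Obs     <- setdiff(Obs, SelObs)  # remove by value, not by position
}

mean(results)
```

Each row is used at most once across draws, which differs from the replicate() approach, where every draw starts again from the full data.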

Chi-squared test of independence on all combinations of columns in a dataframe in R

This is my first time posting here and I hope this is all in the right place. I have been using R for basic statistical analysis for some time, but haven't really used it for anything computationally challenging, and I'm very much a beginner on the programming and data manipulation side of R.
I have presence/absence (binary) data on 72 plant species in 323 plots in a single catchment. The dataframe is 323 rows, each representing a plot, with 72 columns, each representing a species. This is a sample of the first 4 columns (some row numbers are missing because the 323 plots are a subset of a larger number of preassigned plots, not all of which were surveyed):
> head(plots[,1:4])
Agrostis.canina Agrostis.capillaris Alchemilla.alpina Anthoxanthum.odoratum
1 1 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
8 0 0 0 0
I want to determine whether any of the plant species in this catchment are associated with any others, and if so, whether that is a positive or negative association. To do this I want to perform a chi-squared test of independence on each combination of species. I need to create a 2x2 contingency table for each species-by-species comparison, run a chi-squared test on each of those contingency tables, and save the output. Ultimately I would like to end up with a list or matrix of all species-by-species tests that shows whether that combination of species has a positive, negative, or no significant association. I'd also like to incorporate some code that only shows an association as positive if all expected values were greater than 5.
I have made a start by writing the following function:
CHI <- function(sppx, sppy) {
  test <- chisq.test(table(sppx, sppy))
  result <- c(test$statistic, test$p.value,
              sign((table(sppx, sppy) - test$expected)[2, 2]))
  return(result)
}
This returns the following:
> CHI(plots$Agrostis.canina, plots$Agrostis.capillaris)
X-squared
1.095869e-27 1.000000e+00 -1.000000e+00
Warning message:
In chisq.test(chitbl) : Chi-squared approximation may be incorrect
Now I'm trying to figure out a way to apply this function to each speciesxspecies combination in the data frame. I essentially want R to take each column, apply the CHI function to that column and each other column in sequence, and so on through all the columns, subtracting each column from the dataframe as it is done so the same species pair is not tested twice. I have tried various methods trying to use "for" loops or "apply" functions, but have not been able to figure this out.
I hope that is clear enough. Any help here would be much appreciated. I have tried looking for existing solutions to this specific problem online, but haven't been able to find any that really helped. If anyone could link me to an existing answer to this that would also be great.
You need the combn function to find all the combinations of the columns and then apply them to your function, something like this:
apply(combn(1:ncol(plots), 2), 2, function(ind) CHI(plots[, ind[1]], plots[, ind[2]]))
I think you are looking for something like this. I used the iris dataset.
require(datasets)
ind<-combn(NCOL(iris),2)
lapply(1:NCOL(ind), function (i) CHI(iris[,ind[1,i]],iris[,ind[2,i]]))
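A sketch of how the combn() idea from the answers above could be turned into a labelled table, one row per species pair. The presence/absence data are simulated, the species names sp1 to sp4 are placeholders, and CHI() is a slightly extended variant of the function in the question (it names its outputs and also returns the smallest expected count, which is an addition for the "expected values greater than 5" check).

```r
set.seed(42)  # simulated presence/absence data, 200 plots x 4 species
plots <- as.data.frame(matrix(rbinom(200 * 4, 1, 0.5), ncol = 4,
                              dimnames = list(NULL, paste0("sp", 1:4))))

# Variant of the question's CHI(): named outputs plus minimum expected count.
# Assumes both species have presences and absences, so the table is 2x2.
CHI <- function(sppx, sppy) {
  tab  <- table(sppx, sppy)
  test <- suppressWarnings(chisq.test(tab))
  c(statistic    = unname(test$statistic),
    p.value      = test$p.value,
    sign         = sign((tab - test$expected)[2, 2]),
    min_expected = min(test$expected))
}

# All unordered pairs of column names; each pair appears exactly once
pairs <- combn(names(plots), 2)

# Apply CHI() to every pair and transpose so that rows are pairs
res <- t(apply(pairs, 2, function(p) CHI(plots[[p[1]]], plots[[p[2]]])))

out <- data.frame(species_x = pairs[1, ], species_y = pairs[2, ], res)
out
```

Pairs with out$min_expected below 5 can then be filtered or flagged before interpreting the sign column. With 72 species this gives choose(72, 2) = 2556 tests, so some correction for multiple testing is probably worth considering.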
The R code below runs a chi-squared test of every categorical variable (factor) in a data frame against one given variable (the x or y argument of the test is held fixed and is explicitly defined):
Define your variable
Change df$variable1 to your desired factor variable, and df to the data frame that contains all the factor variables to be tested against df$variable1.
Define your data frame
A new object (df2) is created that will contain the chi-squared statistic, degrees of freedom, and p-value for each comparison of the given variable with a column of the data frame.
The code was assembled and altered from similar posts on Stack Overflow, none of which produced my desired outcome.
Chi-squared statistic / df / p-value for variable vs. data frame
The "2" argument makes apply() work column-wise; see the MARGIN argument of apply().
df2 <- t(round(cbind(apply(df, 2, function(x) {
ch <- chisq.test(df$variable1, x)
c(unname(ch$statistic), ch$parameter, ch$p.value )})), 3))

Set values less than threshold to zero, with column-specific thresholds

I have two data frames. One of them contains 165 columns (species names) and almost 193,000 rows. Each cell holds a number from 0 to 1, which is the probability that the species is present in that cell.
POINTID Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran
2 0.0279037 0.604687 0.0388309 0.0161980 0.0143966 0.240152
3 0.0294101 0.674846 0.0673055 0.0481405 0.0397423 0.231308
4 0.0292839 0.603869 0.0597947 0.0526606 0.0463431 0.188875
6 0.0331264 0.541165 0.0470451 0.0270871 0.0373348 0.256662
8 0.0393825 0.672371 0.0715808 0.0559353 0.0565391 0.230833
9 0.0376557 0.663732 0.0747417 0.0445794 0.0602539 0.229265
The second data frame contains 164 columns (species names, as in the first data frame) and one row holding the threshold: above it we assume that the species is present, and below it that the species is absent.
Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran Acta_Spic
0.3155 0.2816 0.2579 0.2074 0.3007 0.3513 0.3514
What I want to do is make a new data frame that, for every species in the presence data (my.data), keeps the probability if it is above the threshold (thres) and sets it to zero if it is below.
I know this would need a for loop and an if statement, but I am new to R and I don't know how to do this.
Please help me.
I think you want something like this:
(Make up small reproducible example)
set.seed(101)
speciesdat <- data.frame(pointID=1:10,matrix(runif(100),ncol=10,
dimnames=list(NULL,LETTERS[1:10])))
threshdat <- rbind(seq(0.1,1,by=0.1))
Now process:
thresh <- unlist(threshdat) ## make data frame into a vector
## 'sweep' runs the function column-by-column if MARGIN=2
ss2 <- sweep(as.matrix(speciesdat[,-1]),MARGIN=2,STATS=thresh,
FUN=function(x,y) ifelse(x<y,0,x))
## recombine results with the first column
speciesdat2 <- data.frame(pointID=speciesdat$pointID,ss2)
It's simpler to have the same number of columns (with the same meanings of course).
frame2 = data.frame(POINTID=0, frame2)
R works with vectors so a row of frame1 can be directly compared to frame2
frame1[1,] < frame2
Could use an explicit loop for every row of frame1 but it's common to use the implicit loop of "apply"
answer = apply(frame1, 1, function(x) x < frame2)
This was all a rather sloppy solution (especially changing frame2), but it hopefully demonstrates some basic R. Also, I'd generally prefer arrays and matrices when possible (they can still use labels but are generally faster).
This produces a logical matrix which can be used to generate assignments with "[<-" (assuming the multi-row data frame is named "cols" and the named vector is "vec"):
sweep(cols[-1], 2, vec, ">") # identifies the items to keep
cols[-1][ sweep(cols[-1], 2, vec, "<") ] <- 0
Your example produced a warning about the mismatch of the number of columns with the length of the vector, but presumably you can adjust the length of the vector to be the correct number of entries.
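One way to avoid the column-count mismatch mentioned above is to match thresholds to species by name before calling sweep(). This is a sketch with a few made-up values; my.data and thres are the names used in the question, and only species present in both data frames are processed.

```r
# Made-up miniature versions of the two data frames
my.data <- data.frame(POINTID   = c(2, 3, 4),
                      Abie_Xbor = c(0.0279, 0.4000, 0.0293),
                      Acer_Camp = c(0.6047, 0.6748, 0.1000))
thres <- data.frame(Abie_Xbor = 0.3155, Acer_Camp = 0.2816, Acta_Spic = 0.3514)

# Species columns that exist in both data frames (POINTID and Acta_Spic drop out)
species <- intersect(names(my.data), names(thres))

# Thresholds as a named vector, in the same order as the species columns
thr <- unlist(thres[1, species])

# Column-wise comparison: values below the column's threshold become 0
result <- my.data
result[species] <- as.data.frame(
  sweep(as.matrix(my.data[species]), MARGIN = 2, STATS = thr,
        FUN = function(x, y) ifelse(x < y, 0, x))
)

result
```

Selecting columns by name means the order of species in the two data frames does not have to agree, and columns without a threshold are left unchanged.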
