Sorry I couldn't think of a more informative title, but here's my challenge. I have a matrix and I need to add columns in specific places based on parameters described by a vector. For example, if I have the following matrix:
1, 0, 1, 2, 0
0, 0, 1, 1, 1
1, 1, 0, 0, 0
2, 0, 1, 0, 2
For a particular R package (unmarked), I need to add columns of NA in specific places. I have a vector that describes the columns of the matrix:
1, 1, 1, 2, 3
This indicates that columns 1-3 were from the same sampling period, while columns 4 and 5 were each from a different sampling period. I need the number of columns in the matrix to equal the maximum number of columns from any one sampling period times the number of sampling periods. In this case there are three 1s (the highest count of any unique value in the vector) and three sampling periods (the largest value in the vector), so I need a matrix with 9 columns (3 x 3). Specifically, I need to add two new columns of NAs after the 4th column and two after the 5th. The columns of NAs are just placeholders, so that the number of observations (columns) is the same (3) for each sampling period (indicated by the number in the vector). This is difficult to describe, but in this imaginary example I would want to end up with:
1, 0, 1, 2, NA, NA, 0, NA, NA
0, 0, 1, 1, NA, NA, 1, NA, NA
1, 1, 0, 0, NA, NA, 0, NA, NA
2, 0, 1, 0, NA, NA, 2, NA, NA
This would be described by a vector that looks like:
1, 1, 1, 2, 2, 2, 3, 3, 3
although I don't actually need to produce that vector, just the matrix. Obviously, it was easy to add those columns in this case, but for my data I have a much bigger matrix that will end up with ~200 columns. Plus I will likely have to do this for numerous data sets.
Can anyone help me with a way to code this in R so that I can automate the process of expanding the matrix?
Thanks for any advice or suggestions!
EDIT:
to make things a bit more similar to my actual data here is a reproducible matrix and vector similar to my current ones:
m <- matrix(rpois(120*26, 1), nrow = 120, ncol = 26)
v <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5, 5, 6, 6, 6, 6, 7)
Assuming m is the matrix and v is the vector, you can use something like
counts <- table(v)
size <- nrow(m) * max(counts) # number of cells in each padded block, based on the longest period
matrix(unlist(lapply(names(counts), function(i) {
  x <- m[, v == i] # get the short block
  c(x, rep(NA, size - length(x))) # extend it to size
})), nrow(m))
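A minimal sketch of the same padding idea as a reusable function, checked against the small example from the question. The name pad_periods is hypothetical (it does not come from any package), and the sketch assumes v has exactly one entry per column of m.

```r
# Hypothetical helper: pad every sampling period to the same number of columns.
pad_periods <- function(m, v) {
  stopifnot(is.matrix(m), ncol(m) == length(v))
  width <- max(table(v))  # number of columns in the largest period
  blocks <- lapply(sort(unique(v)), function(p) {
    block <- m[, v == p, drop = FALSE]
    filler <- matrix(NA, nrow = nrow(m), ncol = width - ncol(block))
    cbind(block, filler)  # real columns first, then NA placeholders
  })
  do.call(cbind, blocks)
}

# Example matrix and vector from the question
m <- matrix(c(1, 0, 1, 2, 0,
              0, 0, 1, 1, 1,
              1, 1, 0, 0, 0,
              2, 0, 1, 0, 2), nrow = 4, byrow = TRUE)
v <- c(1, 1, 1, 2, 3)
padded <- pad_periods(m, v)
dim(padded)
```

Building each block with cbind keeps the blocks as matrices, so a period that is already at full width simply gets a zero-column filler.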
To modify the example matrix exactly as you asked, assuming the matrix is mat:
nr <- nrow(mat)
nas <- rep(NA, nr)
l <- lapply( 4:ncol(mat), function(x) matrix(c(mat[,x],nas,nas), nrow = nr) )
cbind(mat[,1:3], do.call(cbind,l))
Hi everyone!
I am trying to run a GWAS analysis in R on some very simple genetic data. It only contains the SNPs and one outcome variable (as well as an ID variable for each observation).
Everything I have found online includes chromosome and position data. I have that for the SNPs, but in a separate file. (My plan is to map the SNPs after the relevant ones have been selected).
How can I go about running a GWAS analysis on this data? Would I need to, or could I use another method to filter to only the most significant SNPs?
I tried this, but it didn't work, because my data is not a gData object.
# SNPs are in A/B notation, with 0 = AA, 1 = AB, and 2 = BB
library(statgenGWAS)
id <- c("person1", "person2", "person3", "person4", "person5", "person6", "person7", "person8", "person9", "person10")
snp1 <- c(0, 1, 2, 2, 1, 0, 0, 0, 1, 1)
snp2 <- c(2, 2, 2, 1, 1, 1, 0, 0, 0, 1)
snp3 <- c(0, 0, 2, 2, 0, 2, 1, 0, 2, 2)
diagnosis <- c(0, 1, 1, 0, 0, 1, 1, 0, 1, 1)
data <- as.data.frame(cbind(id, snp1, snp2, snp3, diagnosis))
gwas1a <- runSingleTraitGwas(gData = data,
traits = "diagnosis")
Any help here is appreciated.
Thank you!
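One possible way to screen SNPs without a gData object is a separate logistic regression per SNP with base R's glm. This is only a sketch of a simple association scan, not the statgenGWAS method; the names dat, snps and p_values are introduced here for illustration, and ten observations are far too few for meaningful p-values. The data frame is built with data.frame() directly, because cbind() on a character id and numeric vectors converts every column to character.

```r
# Genotypes coded 0 = AA, 1 = AB, 2 = BB, as in the question
dat <- data.frame(
  id        = paste0("person", 1:10),
  snp1      = c(0, 1, 2, 2, 1, 0, 0, 0, 1, 1),
  snp2      = c(2, 2, 2, 1, 1, 1, 0, 0, 0, 1),
  snp3      = c(0, 0, 2, 2, 0, 2, 1, 0, 2, 2),
  diagnosis = c(0, 1, 1, 0, 0, 1, 1, 0, 1, 1)
)

snps <- c("snp1", "snp2", "snp3")

# Fit diagnosis ~ snp for each SNP and keep the p-value of the SNP term
p_values <- sapply(snps, function(s) {
  fit <- glm(reformulate(s, response = "diagnosis"),
             data = dat, family = binomial)
  coef(summary(fit))[s, "Pr(>|z|)"]
})

# Smallest p-values first; these are the candidates to map afterwards
sort(p_values)
```

With many SNPs the p-values would normally be corrected for multiple testing, for example with p.adjust(p_values, method = "BH").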
I would like to calculate sums of rows, including adjustment for missing data.
The row sums are "MERSQI" scores (scoring the quality of studies, one study per row). Each column is a question about quality with a specific maximum number of achievable points.
However, some questions were not applicable to some studies, leading to missing values. The row sum should be rescaled to the standard denominator of 18, the maximum score, where max achievable points = the sum of the maximum points of the applicable questions (columns):
total MERSQI score = row sum / max achievable points * 18
For example:
questions <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) #number of question or col number
max_quest <- c(3, 1.5, 1.5, 3, 1, 1, 1, 1, 3) #maximum of every single question
study1 <- c(1.5, 0.5, 1.5, 3, 0, 0, 0, 1, 3) #points for every single questions for study1
study2 <- c(1, 0.5, 0.5, 3, NA, NA, NA, 1, 1, 3) # for study2
study3 <- c(2, 1.5, NA, 3, NA, 1, NA, 1, 1, 3) #for study3
df <- rbind (questions, max_quest, study1, study2, study3)
For study1 we would have a row sum and resulting score of 10.5, as there are no missing values.
For study2 we have a row sum of 10. We have three NAs, so the maximal achievable points for study2 were 15 (= 18 maximal points - 3 * 1 point for the NA questions), giving an adjusted MERSQI score of 12 (= 10 * 18/15).
For study3: row sum = 12.5, maximal achievable points = 14.5 (= 18 - (1.5 + 1 + 1)), adjusted MERSQI score = 15.52 (= 12.5 * 18/14.5)
Do you have any idea how to calculate the row sums while adjusting for missing values? Maybe by going through every row, using a for loop and if with is.na?
Thank you!
PS: Link / explanation to the MERSQI score: https://www.aliem.com/article-review-how-do-you-assess/ and https://pubmed.ncbi.nlm.nih.gov/26107881/
There is an issue with the lengths of the vectors. I edited the dataset so that they are all length 9, but this should work:
apply(mat[, 3:5],
      2,
      FUN = function (x) {
        tot = sum(x, na.rm = TRUE)
        nas = which(is.na(x))
        total_max = sum(max_quest)
        if (!length(nas))
          return(tot)
        else
          return(tot * total_max / (total_max - sum(max_quest[nas])))
      })
Data:
questions <- c(1, 2, 3, 4, 5, 6, 7, 8, 9) #number of question or col number
max_quest <- c(3, 1.5, 1.5, 3, 1, 1, 1, 1, 3) #maximum of every single question
study1 <- c(1.5, 0.5, 1.5, 3, 0, 0, 0, 1, 3) #points for every single questions for study1
study2 <- c(1, 0.5, 0.5, 3, NA, NA, NA, 1, 1) # for study2
study3 <- c(2, 1.5, NA, 3, NA, 1, NA, 1, 1) #for study3
## rename mat because cbind(...) of vectors returns matrix.
mat <- cbind (questions, max_quest, study1, study2, study3)
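The same adjustment can also be sketched without apply, using the length-9 data above with one row per study. The names scores, answered, achievable and mersqi are introduced here for illustration. The result is scaled to sum(max_quest), as in the code above, which is 16 for this edited data rather than the 18 of the full MERSQI.

```r
max_quest <- c(3, 1.5, 1.5, 3, 1, 1, 1, 1, 3)
scores <- rbind(
  study1 = c(1.5, 0.5, 1.5, 3, 0, 0, 0, 1, 3),
  study2 = c(1, 0.5, 0.5, 3, NA, NA, NA, 1, 1),
  study3 = c(2, 1.5, NA, 3, NA, 1, NA, 1, 1)
)

raw <- rowSums(scores, na.rm = TRUE)  # points actually scored per study

# 1 where a question was applicable, 0 where it was NA
answered <- (!is.na(scores)) * 1

# Maximum points each study could have reached on its applicable questions
achievable <- drop(answered %*% max_quest)

# Rescale every study to the same denominator
mersqi <- raw / achievable * sum(max_quest)
round(mersqi, 2)
```

The matrix product sums the maximum points over the applicable questions of each study, so questions with different maxima are weighted correctly.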
For each study column, calculate its sum, multiply by the sum of max_quest, and divide by the sum of max_quest minus the maximum points of the questions that are NA.
library(dplyr)
val <- sum(df$max_quest)
df %>%
summarise(across(starts_with('study'),
~sum(., na.rm = TRUE)* val/(val - sum(max_quest[is.na(.)]))))
data
The data shared is not usable as is because the vectors have incompatible lengths. It also makes more sense to store these values column-wise rather than row-wise.
questions <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
max_quest <- c(3, 1.5, 1.5, 3, 1, 1, 1, 1, 3, 3)
study1 <- c(1.5, 0.5, 1.5, 3, 0, 0, 0, 1, 3, 0)
study2 <- c(1, 0.5, 0.5, 3, NA, NA, NA, 1, 1, 3)
study3 <- c(2, 1.5, NA, 3, NA, 1, NA, 1, 1, 3)
df <- data.frame(questions, max_quest, study1, study2, study3)
This can be done with vectorization.
First compute the row sums, ignoring NAs:
row_sums <- rowSums(df, na.rm = TRUE) # df is the numeric matrix built with rbind(), one row per study
Then pull out studies and max points:
studies <- row_sums[3:length(row_sums)]
max_total <- row_sums[2]
Compute the MERSQI from the adjusted max, which leaves out the maximum points of each study's NA questions:
adjusted_max <- apply(df[3:nrow(df), , drop = FALSE], 1, function(x) sum(df[2, !is.na(x)]))
MERSQI <- studies * max_total / adjusted_max
I want to visualize mean comparison with a boxplot in ggplot2, but instead of having a vector of categorical variables, I have a couple of vectors with 1 or 0 to indicate whether they belong in that category. There's some overlap - i.e., some data points will belong to multiple groups simultaneously.
I'm able to get a boxplot of values for all the values in one group, but not able to add another group's values to the same plot. With as.factor() applied to a dummy variable I'm able to get a boxplot of the means of scores for those in that group vs. not in that group. I've seen posts about faceting that seem like that might be helpful, but none of the examples I've found (Multiple boxplots placed side by side for different column values in ggplot, How do I make a boxplot with two categorical variables in R?) are quite like what I'm trying to do.
score <- c(1, 8, 3, 5, 10, 7, 4, 3, 8, 1)
group1 <- c(0, 0, 1, 0, 1, 1, 0, 1, 0, 1)
group2 <- c(1, 1, 0, 1, 0, 1, 1, 1, 0, 0)
group3 <- c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1)
df <- data.frame(score, group1, group2, group3)
library(ggplot2)
ggplot(aes(y=score, x=as.factor(group1), fill=group1), data=df) +
geom_boxplot() #mean for both values inside and outside group plotted
ggplot(aes(y=score, x=as.numeric(group1), fill=group1), data=df) +
geom_boxplot() #mean for just those values where group1 == 1
I want to end up with either a) multiple plots like what I get from that first line of code, OR b) multiple plots like what I get from the second. The former includes a boxplot for all those values outside the group, the latter does not. Would also be cool to have a boxplot for the overall mean but I really am not sure what's feasible.
I'm not quite sure whether you just want box plots for those with dummy = 1. Either way, data.table::melt can be useful here, since it gives you an easily plottable long format.
library(data.table)
dat.m <- melt(as.data.table(dat), measure.vars = 2:4) # data.table's melt expects a data.table
boxplot(score ~ value + variable, dat.m[dat.m$value == 1, ])
This yields one box of score per group, using only the rows where the dummy equals 1.
Data
dat <- structure(list(score = c(1, 8, 3, 5, 10, 7, 4, 3, 8, 1), group1 = c(0,
0, 1, 0, 1, 1, 0, 1, 0, 1), group2 = c(1, 1, 0, 1, 0, 1, 1, 1,
0, 0), group3 = c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1)), class = "data.frame", row.names = c(NA,
-10L))
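Since the question asked for ggplot2, here is a minimal sketch of the same long-format idea using base R's stack() for the reshaping. The names long, in_group and p are introduced here for illustration. A data point that belongs to several groups appears once in each of its groups.

```r
library(ggplot2)

score  <- c(1, 8, 3, 5, 10, 7, 4, 3, 8, 1)
group1 <- c(0, 0, 1, 0, 1, 1, 0, 1, 0, 1)
group2 <- c(1, 1, 0, 1, 0, 1, 1, 1, 0, 0)
group3 <- c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1)
df <- data.frame(score, group1, group2, group3)

# stack() puts the dummy columns in long format:
# 'values' holds the 0/1 dummy and 'ind' holds the group name
long <- data.frame(
  score = rep(df$score, times = 3),
  stack(df[c("group1", "group2", "group3")])
)

# Keep only the rows where the point belongs to the group
in_group <- long[long$values == 1, ]

p <- ggplot(in_group, aes(x = ind, y = score)) +
  geom_boxplot() +
  labs(x = "group")
p
```

For the variant that also shows the points outside each group, one option is to plot long instead of in_group with x = factor(values) and add facet_wrap(~ ind).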
I have a data frame
testdf <- data.frame(predicted1 = c(1, 0, 1, 3, 2, 1, 1, 0, 1, 0), predicted2 = c(1, 0, 2, 2, 2, 1, 1, 0, 0, 0), predicted3 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), actual = c(1, 0, 2, 3, 2, 1, 1, 1, 0, 0))
I want to add another column to this data frame which tells me the total percentage accuracy when looking at all predicted values. So for example, row 1 of this would have an accuracy of 100%, because all prediction columns predicted the correct value (1).
How can this be done?
Thanks!
We can compare the prediction columns with 'actual', take the rowMeans of the result, multiply by 100, and round if needed:
round(100*rowMeans(testdf[1:3] == testdf$actual), 2)
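A short sketch that stores the result as a new column of the question's data frame. The column name accuracy and the helper vector pred_cols are assumptions made here for illustration.

```r
testdf <- data.frame(
  predicted1 = c(1, 0, 1, 3, 2, 1, 1, 0, 1, 0),
  predicted2 = c(1, 0, 2, 2, 2, 1, 1, 0, 0, 0),
  predicted3 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  actual     = c(1, 0, 2, 3, 2, 1, 1, 1, 0, 0)
)

pred_cols <- c("predicted1", "predicted2", "predicted3")

# The comparison gives a logical matrix (one column per prediction);
# the mean of each row is the share of correct predictions
testdf$accuracy <- round(100 * rowMeans(testdf[pred_cols] == testdf$actual), 2)
testdf$accuracy
```

Selecting the prediction columns by name rather than by position keeps the code working if the column order changes.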
I am trying to write a function that takes a numeric vector x as input and returns a numeric vector of the indexes i such that x[i] == x[i+1].
So far I have written this function, neighbor:
neighbor <- function(l) {
stopifnot(is.numeric(l))
w <- sapply(l, function(x) which(l[x]==l[x+1]))
w
}
So executing this instruction:
neighbor(c(1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0))
Should produce a numeric vector:
1, 4, 5, 7, 9
But I cannot get it working. Any ideas?
I am searching for an elegant solution without control-flow and if-else instructions.
diff will help with this:
x <- c(1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0)
which(diff(x) == 0)
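Wrapped into the function from the question, this could look like the sketch below. It relies on exact equality, so it suits integer-like values such as those in the example; values that differ only by floating-point error would not be matched.

```r
neighbor <- function(x) {
  stopifnot(is.numeric(x))
  # diff(x)[i] is x[i + 1] - x[i], so a zero marks two equal neighbours
  which(diff(x) == 0)
}

neighbor(c(1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0))
# returns 1 4 5 7 9
```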