How to calculate Zscore in R - r

I have log2ratio values for each chromosome position (137221 coordinates) across 15 samples. I want to calculate the Z-score of the log2ratio for each chromosome position (row). I also want to exclude the first three columns because they contain IDs. There are also some NAs scattered among the values.
Thanking you in advance

It's not completely clear what you want. If you want a single Z-score for each entire row (i.e., its mean divided by its standard error), using all but the first three columns, then
f <- function(x) {
  mean(x,na.rm=TRUE)/(sd(x,na.rm=TRUE)/sqrt(length(na.omit(x))))
}
apply(as.matrix(df[,-(1:3)]),1,f)
will do it. That gives you a vector with one value per row.
If you want every value normalized within its row (Z-scores), giving a matrix of the same shape as the data, then I think
t(scale(t(as.matrix(df[,-(1:3)]))))
should work. If neither of those works, you need to post a reproducible example -- or at least tell us precisely what the error messages are.
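As a small illustration of both variants, here is a sketch on a made-up data frame. The column names (chr, start, end, s1 to s3) and all values are assumptions for the example only, not the asker's data.

```r
# Hypothetical example data: three ID columns followed by three sample columns
df <- data.frame(
  chr   = c("chr1", "chr1", "chr2"),
  start = c(100, 200, 300),
  end   = c(150, 250, 350),
  s1    = c(0.1, -0.4, 0.3),
  s2    = c(0.2, NA, 0.5),
  s3    = c(0.3, 0.4, 0.1)
)

# Keep only the numeric sample columns (drop the first three ID columns)
m <- as.matrix(df[, -(1:3)])

# One value per row: the row mean divided by its standard error, ignoring NAs
f <- function(x) {
  n <- sum(!is.na(x))
  mean(x, na.rm = TRUE) / (sd(x, na.rm = TRUE) / sqrt(n))
}
row_z <- apply(m, 1, f)

# One Z-score per cell, standardized within each row; NA cells stay NA
z <- t(scale(t(m)))
```

Here row_z has one entry per chromosome position, while z keeps the positions-by-samples layout of the input.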

Related

Writing a function to give an interval where the values are greater than a certain number

I'm trying to write a function where I find the number of times the value in a data frame is above a certain number x (in this case, 3). Basically, the data start from 1.0, increase, then go below 1.0 (in a span of about 150 data points). I want the function to return to me the number of times the values are above this threshold. I'm fairly new to R and am just confused on how to go about this. Any help is appreciated. Thank you!
If your data frame is called df then sum(df$x>3) will return the number of rows of df where x is greater than 3.
If there are missing values in x and you want to ignore them then use sum(df$x>3, na.rm=TRUE).
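If what you are after is where the values exceed the threshold rather than only how many do, one possible sketch uses which(). The data below are invented for illustration; the threshold of 3 comes from the question.

```r
# Hypothetical data: starts near 1, rises above 3, then falls back below 1
df <- data.frame(x = c(1.0, 2.1, 3.4, 4.8, NA, 3.9, 2.5, 0.9))

# Number of values above the threshold, ignoring missing values
n_above <- sum(df$x > 3, na.rm = TRUE)

# Positions of the values above the threshold (which() skips NA entries)
idx <- which(df$x > 3)

# First and last position above the threshold
interval <- range(idx)
```

Note that range(idx) only makes sense when at least one value is above the threshold; with an empty idx it returns infinite values and a warning.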

Calculating the offset between two columns in a dataframe but ignoring some of the outliers in one of those

I have a data frame with two columns, one of which is the baseline (baseline_CO2) I have calculated using a previous set of data and the other is a set of data I believe to be offset with respect to this baseline value.
I want to quantify this offset and calculate its value in order to correct my original data (CO2_LICOR). In order to do this accurately I need to be able to remove some of the outlier peak values from the CO2_LICOR data in this offset calculation, say all values over 350.
Can anyone help?
The dataframe looks like the following:
If you want to compare the two columns then you can use the approach Jon Spring suggested.
df$offset <- df$baseline_CO2 - df$CO2_LICOR
If you want to filter these values then something like
df_filtered <- df[df$CO2_LICOR < 350, ]
should work.
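Putting the two steps together, a minimal sketch might look like the following. The numbers are made up, and using the mean of the filtered offsets as one constant correction is an assumption about what "quantify this offset" means here.

```r
# Hypothetical data; the column names follow the question
df <- data.frame(
  baseline_CO2 = c(300, 302, 301, 303, 304),
  CO2_LICOR    = c(310, 312, 450, 313, 314)
)

# Drop rows where the LICOR reading is an outlier peak (350 or more)
df_filtered <- df[df$CO2_LICOR < 350, ]

# Per-row offset, and its average over the remaining rows
df_filtered$offset <- df_filtered$baseline_CO2 - df_filtered$CO2_LICOR
mean_offset <- mean(df_filtered$offset, na.rm = TRUE)

# Assumption: apply the average offset as a constant correction to all readings
df$CO2_corrected <- df$CO2_LICOR + mean_offset
```

If CO2_LICOR contains NAs, the logical index would produce NA rows, so which(df$CO2_LICOR < 350) may be the safer index in that case.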

In R: ordering values from 2 DF columns for use in ratio for each row

I want to calculate ratios for each row in a data frame using values from two columns for each row. The data are anatomical measurements from paired muscles, and I need to calculate a ratio of the measurement of one muscle to the measurement of the other. Each row is an individual specimen, and each of the 2 columns in question has measurements for one of the 2 muscles. Which of the two muscles is largest varies among individuals (rows), so I need to figure out how to write a script that always picks the smaller value, which may be in either column, for the numerator, and that always picks the larger values, which also can be in either column, for the denominator, rather than simply dividing all values of one column by values of the other. This might be simple, but I'm not so good with coding yet.
This doesn't work:
ratio <- DF$1/DF$2
I assume that what I need would loop through each row doing something like this:
ratio <- which.min(c(DF$1, DF$2))/which.max(c(DF$1, DF$2))
Any help would be greatly appreciated!
Assuming that you are only dealing with positive values, you could consider something like this:
# example data:
df <- data.frame(x = abs(rnorm(100)), y = abs(rnorm(100)))
# sorting the two columns so that the smaller always appears in the first
# column:
df_sorted <- t(apply(df,1, sort))
# dividing the first col. by the second col.
ratio <- df_sorted[,1]/df_sorted[,2]
Or, alternatively:
ifelse(df[,1] > df[,2], df[,2]/df[,1], df[,1]/df[,2])
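Another option, sketched here with made-up measurements and hypothetical column names m1 and m2, is to let pmin() and pmax() pick the smaller and larger value in each row:

```r
# Hypothetical measurements for two paired muscles
df <- data.frame(m1 = c(2.0, 5.0, 3.0), m2 = c(4.0, 2.5, 3.0))

# Element-wise smaller value divided by element-wise larger value
ratio <- pmin(df$m1, df$m2) / pmax(df$m1, df$m2)
```

This is vectorized, so it avoids the row-wise apply() and does not need an explicit loop.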

Setting a maximum limit for values in a data frame in R

In a data frame (in R), I have two columns - the first is a list of species names (species), the second is the number of occurrence records I have for that species (number). There is a large variation in the number column with most values being <100 but a few being very high values (>100,000), and there are many rows (~4000). Here is a simplified example:
x<-data.frame(species=c("a","b","c","d","e","f","g","h","i","j"),number=c(53,17,67,989,135,67,13,786,100400,28))
Basically what I want to do is reduce the maximum number of records (the value in the number column) until the mean of all the values in this column stabilises.
To do this, I need to set a maximum limit for values in the number column so that any value > this limit is reduced to this maximum limit, and record the mean. I want to repeat this multiple times, each time reducing the maximum limit by 100.
I've not been able to find any similar questions online and am not really sure where to start with this! Any help, even just a point in the right direction, would be much appreciated! Cheers
You should use the pmin function:
pmin(x$number, 1e3)
# to test multiple limits :
mns <- sapply(c(1e6, 1e4, 1e2), function(u) mean(pmin(x$number, u)))
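To lower the limit in steps of 100 as described in the question, the same idea can be run over a sequence of limits. This is only a sketch: starting at the largest value and stopping at 100 are assumptions, and the names limits and res are made up.

```r
# Example data from the question
x <- data.frame(
  species = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"),
  number  = c(53, 17, 67, 989, 135, 67, 13, 786, 100400, 28)
)

# Candidate limits: start at the largest value and step down by 100
limits <- seq(max(x$number), 100, by = -100)

# Mean of the capped values for each limit
mns <- sapply(limits, function(u) mean(pmin(x$number, u)))

# Collect limits and means side by side for inspection or plotting
res <- data.frame(limit = limits, mean = mns)
```

Plotting res$mean against res$limit is one way to see where the mean stops changing much.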

Excel overlapping intervals with a type of vlookup

This is a snippet of the data, of which there is a ton, with an explanation of what I want to do:
File
Basically I have a number of subsets of data (marked 1, 2, ... in a separate column), each containing intervals. I need to know whether the intervals in the same two subsets overlap, and if so, I need the value (column C) associated with the set in columns E-G to be pasted next to the interval in columns J-K that overlaps with the interval in F-G. The problem is that the interval in columns F-G overlaps with multiple intervals in columns J-K.
I've been trying to solve this with
=if(or(and(x>=a,x<=b),and(a>=x,a<=y)),"Overlap","Do not overlap")
But the problem is I can't find a way to do this for multiple overlaps. If you think this can't be done in Excel and know how else to do it (e.g. R), please let me know.
Thank you
In Excel, try this formula in L4, copied down:
=IFERROR(INDEX(C$4:C$100,MATCH(1,INDEX((J4<=G$4:G$100)*(K4>=F$4:F$100)*(I4=E$4:E$100),0),0)),"No overlap")
This will find the first row within each subset (if any) where the F/G interval overlaps with the current row's J/K interval. If no such row exists, you get "No overlap".
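Since the question also asks about R, here is a rough sketch of the same overlap test there. The two data frames and their column names (value, set, start, end) are hypothetical stand-ins for columns C, E-G and I-K of the sheet. Unlike the formula, it returns every overlapping pair rather than only the first.

```r
# Hypothetical intervals that carry a value (columns C and E-G in the sheet)
a <- data.frame(value = c("p", "q"), set = c(1, 2),
                start = c(10, 30), end = c(20, 40))

# Hypothetical intervals to look up (columns I-K in the sheet)
b <- data.frame(set = c(1, 1, 2), start = c(15, 25, 35), end = c(18, 28, 50))

# Build all pairs of intervals that belong to the same subset
pairs <- merge(b, a, by = "set", suffixes = c(".b", ".a"))

# Keep the pairs whose intervals overlap (same test as in the Excel formula)
overlaps <- pairs[pairs$start.b <= pairs$end.a & pairs$end.b >= pairs$start.a, ]
```

With many rows per subset the merge can become large, because it forms every within-subset pair before filtering.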
