If statement for weighted averages in R

I have a data file that is several million lines long, and contains information from many groups. Below is an abbreviated section:
MARKER GROUP1_A1 GROUP1_A2 GROUP1_FREQ GROUP1_N GROUP2_A1 GROUP2_A2 GROUP2_FREQ GROUP2_N
rs10 A C 0.055 1232 A C 0.055 3221
rs1000 A G 0.208 1232 A G 0.208 3221
rs10000 G C 0.134 1232 C G 0.8624 3221
rs10001 C A 0.229 1232 A C 0.775 3221
I would like to create a weighted average of the frequency (FREQ) variable (which in itself is straightforward); however, in this case some of the rows are mismatched (rows 3 & 4). If the letters do not line up, then the frequency of the second group needs to be subtracted from 1 before the weighted mean of that marker is calculated.
I would like to set up a simple IF statement, but I am unsure of the syntax of such a task.
Any insight or direction is appreciated!

Say you've read your data in a data frame called mydata. Then do the following:
mydata$GROUP2_FREQ <- mydata$GROUP2_FREQ - (mydata$GROUP1_A1 != mydata$GROUP2_A1)
It works because R treats TRUE values as 1 and FALSE values as 0.
EDIT: Try the following instead, which takes the absolute value (so mismatched rows give 1 − FREQ rather than FREQ − 1) and compares the allele columns as characters in case they were read in as factors:
mydata$GROUP2_FREQ <- abs( (as.character(mydata$GROUP1_A1) !=
as.character(mydata$GROUP2_A1)) -
as.numeric(mydata$GROUP2_FREQ) )
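With the mismatches fixed, the sample-size-weighted mean itself is then straightforward. A minimal sketch on the sample data from the question (W_FREQ is a made-up name for the result column):

```r
# Sample data mirroring the question (alleles mismatched in rows 3 and 4)
mydata <- data.frame(
  MARKER = c("rs10", "rs1000", "rs10000", "rs10001"),
  GROUP1_A1 = c("A", "A", "G", "C"), GROUP1_A2 = c("C", "G", "C", "A"),
  GROUP1_FREQ = c(0.055, 0.208, 0.134, 0.229), GROUP1_N = 1232,
  GROUP2_A1 = c("A", "A", "C", "A"), GROUP2_A2 = c("C", "G", "G", "C"),
  GROUP2_FREQ = c(0.055, 0.208, 0.8624, 0.775), GROUP2_N = 3221
)

# Flip GROUP2_FREQ where the A1 alleles disagree (TRUE counts as 1)
flip <- mydata$GROUP1_A1 != mydata$GROUP2_A1
mydata$GROUP2_FREQ <- abs(flip - mydata$GROUP2_FREQ)

# Sample-size-weighted mean frequency per marker
mydata$W_FREQ <- (mydata$GROUP1_FREQ * mydata$GROUP1_N +
                  mydata$GROUP2_FREQ * mydata$GROUP2_N) /
                 (mydata$GROUP1_N + mydata$GROUP2_N)
```

For the matched rows the weighted mean equals the shared frequency; for row 3 it falls between 0.134 and the flipped value 1 − 0.8624 = 0.1376.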

Related

iterative wilcox on data.frame in R

I am trying (or rather, wishing I could try) to write a loop in R that runs the Wilcoxon test (wilcox.test) iteratively, comparing two groups of values in each row of a data.frame and returning, for each row, the p-value, which is then stored in a data frame alongside its row label.
The data.frame is as follows:
> tab[1:5,]
mol E12 E15 E22 E25 E26 E27 E38 E44 E47
1 A 7362.40 2475.93 3886.06 5825.59 6882.00 3250.05 3406.65 6416.29 7786.73
2 B 5391.42 2037.88 3330.05 4043.83 5766.20 2591.69 3603.95 14431.89 8320.70
3 C 1195.89 241.24 252.46 865.97 1970.28 899.22 346.36 1135.86 1179.31
4 D 502.64 171.41 434.29 508.22 419.34 260.13 298.14 326.70 167.07
5 E 181.63 171.41 165.30 150.47 164.09 109.19 122.76 212.74 155.60
Column labels are: mol, the specific molecule evaluated (about 20); E12 to E47 the samples for which the value of each molecule is measured.
Groups to be compared are:
P: samples E12, E25, E26, E27, E44. D: samples E15, E22, E38, E47.
The output should look like this:
mol p-value
A 1
B 0.5556
C 0.9048
etc.
I tried to use a for loop, but I am absolutely unable to manage it in this (for me, complicated) context.
Any help with comments on the meaning of the instructions for a newbie like me is much appreciated.
apply() works like a looper on matrices and arrays. In this case, with margin=1 it loops along the rows. Each row, temporarily converted into a vector x, is passed on to function(x) wilcox.test(x[P], x[D])$p.value, the result being one p-value per row. P and D are logical vectors specifying which elements within x should be used in each sample.
tab0 <- read.table(text="mol E12 E15 E22 E25 E26 E27 E38 E44 E47
A 7362.40 2475.93 3886.06 5825.59 6882.00 3250.05 3406.65 6416.29 7786.73
B 5391.42 2037.88 3330.05 4043.83 5766.20 2591.69 3603.95 14431.89 8320.70
C 1195.89 241.24 252.46 865.97 1970.28 899.22 346.36 1135.86 1179.31
D 502.64 171.41 434.29 508.22 419.34 260.13 298.14 326.70 167.07
E 181.63 171.41 165.30 150.47 164.09 109.19 122.76 212.74 155.60",
header=TRUE)
tab <- as.matrix(tab0[,-1])
P <- colnames(tab) %in% c("E12", "E25", "E26", "E27", "E44")
D <- colnames(tab) %in% c("E15", "E22", "E38", "E47")
pv <- apply(tab, 1, function(x) wilcox.test(x[P], x[D])$p.value)
data.frame(tab0[1], p.val=signif(pv, 4))
# mol p.val
# 1 A 0.5556
# 2 B 0.4127
# 3 C 0.1111
# 4 D 0.1905
# 5 E 0.9048

New variable: sum of numbers from a list powered by value of different columns

This is my first question in Stackoverflow. I am not new to R, although I sometimes struggle with things that might be considered basic.
I want to calculate the count median diameter (CMD) for each of my rows from a Particle Size Distribution dataset.
My data looks like this (several rows and 53 columns in total):
date CPC n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94
2015-01-01 00:00:00 5263.434 72.988 140.346 138.801 172.473 344.806 484.415 606.430 739.625 927.082
2015-01-01 01:00:00 4813.182 152.823 80.861 140.017 213.382 264.496 359.455 487.293 840.349 1069.846
Each variable starting with "n" indicates the number of particles for the corresponding size (e.g. variable n3.16 = number of particles of median size 3.16 nm). I will divide the values by 100 prior to the calculations, in order to avoid numbers so high that they prevent the computation.
To compute the CMD, I need to do the following calculation:
CMD = (D1^n1 * D2^n2 * ... * Di^ni)^(1/N)
where Di is the diameter (to be extracted from the column name), ni is the number of particles for diameter Di, and N is the total sum of particles (sum of all the columns starting with "n").
To get the Di, I created a numeric vector from the column names that start with n:
D <- as.numeric(gsub("n", "", names(data)[3:54]))
This is my attempt to create a new variable with the calculation of CMD, although it doesn't work.
data$cmd <- for i in 1:ncol(D) {
prod(D[[i]]^data[,i+2])
}
I also tried to use apply, but again it didn't work:
data$cmd <- for i in 1:ncol(size) {
apply(data,1, function(x) prod(size[[i]]^data[,i+2])
}
I have different datasets from different sites which have different number of columns, so I would like to make code "universal".
Thank you very much
This should work (I had to mutilate your date variable because of read.table, but it is not involved in the calculations, so just ignore that):
> df
date CPC n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94
1 2015-01-01 5263.434 72.988 140.346 138.801 172.473 344.806 484.415 606.430 739.625 927.082
2 2015-01-01 4813.182 152.823 80.861 140.017 213.382 264.496 359.455 487.293 840.349 1069.846
N <- sum(df[3:11]) # did you mean the sum of all n.columns over all rows? if not, you'd need to edit this
> N
[1] 7235.488
D <- as.numeric(gsub("n", "", names(df)[3:11]))
> D
[1] 3.16 3.55 3.98 4.47 5.01 5.62 6.31 7.08 7.94
new <- t(apply(df[3:11], 1, function(x, y) (x^y), y = D))
> new
n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94
[1,] 772457.6 41933406 336296640 9957341349 5.167135e+12 1.232886e+15 3.625318e+17 2.054007e+20 3.621747e+23
[2,] 7980615.0 5922074 348176502 25783108893 1.368736e+12 2.305272e+14 9.119184e+16 5.071946e+20 1.129304e+24
df$CMD <- rowSums(new)^(1/N)
> df
date CPC n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94 CMD
1 2015-01-01 5263.434 72.988 140.346 138.801 172.473 344.806 484.415 606.430 739.625 927.082 1.007526
2 2015-01-01 4813.182 152.823 80.861 140.017 213.382 264.496 359.455 487.293 840.349 1069.846 1.007684
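One caveat: with the 52 "n" columns of the full dataset, individual power terms can overflow to Inf even after dividing by 100. Since the product in the stated formula is a geometric mean, it can be computed safely in log space. A sketch, under the assumption that N is the per-row particle total and that the formula's Di^ni form (diameter raised to count) is intended:

```r
# Diameters recovered from the column names, as in the question
D <- c(3.16, 3.55, 3.98, 4.47, 5.01, 5.62, 6.31, 7.08, 7.94)

# Particle counts, one row per timestamp (values from the question)
counts <- rbind(
  c(72.988, 140.346, 138.801, 172.473, 344.806, 484.415, 606.430, 739.625, 927.082),
  c(152.823, 80.861, 140.017, 213.382, 264.496, 359.455, 487.293, 840.349, 1069.846)
)

# (prod Di^ni)^(1/N) == exp(sum(ni * log(Di)) / N), which never overflows
cmd <- apply(counts, 1, function(n) exp(sum(n * log(D)) / sum(n)))
```

Computed this way, each row's CMD lands between the smallest and largest diameter, as a count-weighted average diameter should.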

R/Plotly: Error in list2env(data) : first argument must be a named list

I'm moderately experienced using R, but I'm just starting to learn to write functions to automate tasks. I'm currently working on a project to run sentiment analysis and topic models of speeches from the five remaining presidential candidates and have run into a snag.
I wrote a function to do a sentence-by-sentence analysis of positive and negative sentiments, giving each sentence a score. Miraculously, it worked and gave me a dataframe with scores for each sentence.
score text
1 1 iowa, thank you.
2 2 thanks to all of you here tonight for your patriotism, for your love of country and for doing what too few americans today are doing.
3 0 you are not standing on the sidelines complaining.
4 1 you are not turning your backs on the political process.
5 2 you are standing up and fighting back.
So what I'm trying to do now is create a function that takes the scores and figures out what percentage of the total is represented by the count of each score and then plot it using plotly. So here is the function I've written:
scoreFun <- function(x){{
tbl <- table(x)
res <- cbind(tbl,round(prop.table(tbl)*100,2))
colnames(res) <- c('Score', 'Count','Percentage')
return(res)
}
percent = data.frame(Score=rownames, Count=Count, Percentage=Percentage)
return(percent)
}
Which returns this:
saPct <- scoreFun(sanders.scores$score)
saPct
Count Percentage
-6 1 0.44
-5 1 0.44
-4 6 2.64
-3 13 5.73
-2 20 8.81
-1 42 18.50
0 72 31.72
1 34 14.98
2 18 7.93
3 9 3.96
4 6 2.64
5 2 0.88
6 1 0.44
9 1 0.44
11 1 0.44
What I had hoped it would return is a dataframe with what has ended up being the rownames as a variable called Score and the next two columns called Count and Percentage, respectively. Then I want to plot the Score on the x-axis and Percentage on the y-axis using this code:
d <- subplot(
plot_ly(clPct, x = rownames, y=Percentage, xaxis="x1", yaxis="y1"),
plot_ly(saPct, x = rownames, y=Percentage, xaxis="x2", yaxis="y2"),
margin = 0.05,
nrows=2
) %>% layout(d, xaxis=list(title="", range=c(-15, 15)),
xaxis2=list(title="Score", range=c(-15,15)),
yaxis=list(title="Clinton", range=c(0,50)),
yaxis2=list(title="Sanders", range=c(0,50)),showlegend = FALSE)
d
I'm pretty certain I've made some obvious mistakes in my function and my plot_ly code, because clearly it's not returning the data frame I want, and running the plotly code leads to the error Error in list2env(data) : first argument must be a named list. Again, though, I'm not very experienced at writing functions, and I haven't found a similar issue when I Google, so I don't know how to fix this.
Any advice would be most welcome. Thanks!
@MLavoie, this code from the question I referenced in my comment did the trick. Many thanks!
scoreFun <- function(x){
tbl <- data.frame(table(x))
colnames(tbl) <- c("Score", "Count")
tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
return(tbl)
}
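For reference, here is the fixed function applied to a small, made-up score vector (the scores are hypothetical, not from the candidates' speeches):

```r
# Fixed function from the answer above
scoreFun <- function(x){
  tbl <- data.frame(table(x))
  colnames(tbl) <- c("Score", "Count")
  tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
  return(tbl)
}

# One row per distinct score, with counts and percentages summing to 100
res <- scoreFun(c(-1, 0, 0, 1, 1, 1, 2))
```

Note that table() stores the scores as factor levels, so convert them back with as.numeric(as.character(res$Score)) before using Score as a continuous axis in plot_ly.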

How to make multiple lines in R

I have data like this and want to draw multiple lines in R: one line for each of SC1, SC2, SC3, SC4 and SC5, with chr (from 1 to 10) on the x-axis.
chr pos SC1 SC2 SC3 SC4 SC5
chr01.8.5 1 0.000 2.420907e-02 1.317053e+00 7.171021e-02 3.280758e-03 1.185807e+00
chr01.6.5 1 0.714 0.040931607 1.150449274 0.042270667 0.044192568 0.976696855
A quick and slightly dirty way is to use ?matlines
# assume d is your data
plot(d$chr, d$pos) # plots the data as points
matlines(d$chr, d[,-(1:2)]) # plots every column except 1,2 against d$chr

Peak detection in Manhattan plot

The attached plot (Manhattan plot) has chromosome positions from the genome on the x-axis and -log(p) on the y-axis, where p is the p-value associated with the points (variants) at that position.
I have used the following R code (from the gap package) to generate it:
require(gap)
affy <-c(40220, 41400, 33801, 32334, 32056, 31470, 25835, 27457, 22864, 28501, 26273,
24954, 19188, 15721, 14356, 15309, 11281, 14881, 6399, 12400, 7125, 6207)
CM <- cumsum(affy)
n.markers <- sum(affy)
n.chr <- length(affy)
test <- data.frame(chr=rep(1:n.chr,affy),pos=1:n.markers,p=runif(n.markers))
oldpar <- par()
par(cex=0.6)
colors <- c("red","blue","green","cyan","yellow","gray","magenta","red","blue","green", "cyan","yellow","gray","magenta","red","blue","green","cyan","yellow","gray","magenta","red")
mhtplot(test,control=mht.control(colors=colors),pch=19,bg=colors)
> head(test)
chr pos p
1 1 1 0.79296584
2 1 2 0.96675136
3 1 3 0.43870076
4 1 4 0.79825513
5 1 5 0.87554143
6 1 6 0.01207523
I am interested in getting the coordinates of the peaks of the plot that lie above a certain -log(p) threshold.
If you want the indices of the values above the 99th percentile:
# Add new column with log values
test = transform(test, log_p = -log10(test[["p"]]))
# Get the 99th percentile
pct99 = quantile(test[["log_p"]], 0.99)
...and get the values from the original data test:
peaks = test[test[["log_p"]] > pct99,]
> head(peaks)
chr pos p log_p
5 1 5 0.002798126 2.553133
135 1 135 0.003077302 2.511830
211 1 211 0.003174833 2.498279
586 1 586 0.005766859 2.239061
598 1 598 0.008864987 2.052322
790 1 790 0.001284629 2.891222
You can use this with any threshold. Note that I have not calculated the first derivative; see this question for some pointers:
How to calculate first derivative of time series
After calculating the first derivative, you can find the peaks by looking at points in the time series where the first derivative is (almost) zero. After identifying these peaks, you can check which ones are above the threshold.
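The derivative-based approach can be sketched in a few lines: a peak is a point where the successive differences flip sign from positive to negative. A minimal example on simulated uniform p-values (matching the question's test data; the 99th-percentile threshold is carried over from the answer above as an assumption):

```r
set.seed(1)
log_p <- -log10(runif(500))  # simulated -log10(p) values

# First derivative via successive differences; a peak is where the
# slope changes from positive to negative
d1 <- diff(log_p)
peak_idx <- which(d1[-length(d1)] > 0 & d1[-1] < 0) + 1

# Keep only the peaks above the 99th-percentile threshold
thr <- quantile(log_p, 0.99)
peaks <- peak_idx[log_p[peak_idx] > thr]
```

Each element of peaks is an index into the original series, so on the real data you would look up test$chr and test$pos at those indices.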
Based on my experience, after plotting the graph you can use the following R code to find the peak coordinates interactively:
plot(x[,1], x[,2])
identify(x[,1], x[,2], labels=row.names(x))
Here x[,1] is the x coordinate (the genome coordinate) and x[,2] is the -log10(p) value. Use the mouse to select a point and hit Enter, which will give you the peak location; then run the following to get the coordinates:
coords <- locator(type="l")
coords
