Calculate pseudo-median for many columns in R

I am trying to calculate the pseudo-median (Hodges-Lehmann estimator), as returned by the wilcox.test function in R, across several different columns in my dataset (8 separate fields/columns).
> head(fantasyproj2)
V1 V2 V3 V4 V5 V6 V7 V8
1 25 25.87 20.35 26.65 27.20 27.0 17.970 27.70
2 24 27.48 19.81 19.57 22.20 27.5 16.350 20.04
3 22 19.89 17.62 21.99 19.12 26.0 22.484 23.70
4 21 17.72 16.09 15.55 18.60 18.5 17.450 14.59
5 21 21.56 17.90 18.46 20.80 23.0 16.540 19.76
6 20 NA 17.73 15.84 19.00 20.5 19.080 15.05
They are all numeric and include some missing values (the missing cells contain no data and have been changed to NA in R). When I run the code:
fantasyproj$hodges.lehmann <- apply(fantasyproj2[,c(1,2,3,4,5,6,7,8)],1, function(x) wilcox.test(x, conf.int=TRUE, na.action=na.exclude)$estimate)
I get the error:
Error in uniroot(wdiff, c(mumin, mumax), tol = 1e-04, zq = qnorm(alpha/2, : f() values at end points not of opposite sign
There is not much literature on this error out there, except where it says to add the argument exact = TRUE to the statement, but that does not help. Any help would be greatly appreciated!
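One possible workaround (a sketch I have not run on the original data; the helper name hl_estimate is mine): drop the NAs in each row and compute the Hodges-Lehmann estimate directly as the median of the pairwise Walsh averages, which sidesteps the uniroot search inside wilcox.test entirely.
hl_estimate <- function(x) {
  x <- x[!is.na(x)]                        # drop missing values row by row
  if (length(x) < 2) return(NA_real_)      # not enough data to estimate
  walsh <- outer(x, x, "+") / 2            # all pairwise (Walsh) averages
  median(walsh[upper.tri(walsh, diag = TRUE)])
}
fantasyproj2$hodges.lehmann <- apply(fantasyproj2[, 1:8], 1, hl_estimate)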

Related

read a text file in R

It sounds stupid, but I could not find a proper way to read this text file.
I have tried the read.table and fread functions, but without success; the columns do not match the data:
m = fread(meq,fill = T,sep = " ")
m = read.table(meq,fill = T,comment.char="-",sep = "")
That file contains some metadata at the top and isn't in a standard format that can be parsed as a data frame easily. One solution is to read it in as a character vector, do some manipulations, and then read the result back in:
meq <- "LS_FLS_YUN_IECED3_Tower_100pctAv.meq"
lines <- readLines(meq)
lines <- lines[-(1:5)]
lines <- gsub("\\|", "", lines)
lines <- gsub(" +", " ", lines)
file <- tempfile()
writeLines(lines, file)
data.table::fread(file, sep = " ", fill = TRUE)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
# 1: 1 t 279.1 307.1 351.1 429.2 524.8 539.1 540.5 550.3 558.5 NA
# 2: 2 v_hor_size 6.6 6.8 7.3 9.6 11.8 13.3 13.4 14.9 16.4 NA
# 3: 3 Elevation 4.1 4.1 4.3 5.1 6.2 6.7 6.7 7.3 7.9 NA
# 4: 4 Mf_x e.1n1 43.8 61.4 106.6 270.0 330.2 447.4 461.3 573.9 689.9
# 5: 5 Mf_y e.1n1 107.1 148.0 236.8 493.8 603.8 746.3 762.0 881.7 994.3
# ---
#766: 766 Mt_030 e13n2 5694.5 5524.4 5559.4 6850.9 8377.6 9381.7 9510.6 10592.5 11771.5
#767: 767 Mt_060 e13n2 9223.2 8757.3 8448.3 9210.8 11263.4 11821.5 11901.7 12606.0 13398.2
#768: 768 Mt_090 e13n2 11582.0 10912.8 10380.1 10898.0 13326.5 13686.1 13745.4 14298.0 14960.4
#769: 769 Mt_120 e13n2 11658.8 11015.8 10529.9 11142.4 13625.4 13989.1 14046.3 14564.8 15166.2
#770: 770 Mt_150 e13n2 9386.4 8973.3 8741.8 9551.2 11679.6 12116.0 12177.5 12712.0 13312.7
This is my solution:
read.table(text = mgsub::mgsub(readLines(meq), c("\\|", " e", " e"), c("", "e", "e")),
           fill = TRUE, comment.char = "-", sep = "", na.strings = "",
           stringsAsFactors = FALSE, skip = 2)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 W<f6>hler slope: 3.5 4.0 5.0 8.0 8.0 10.0 10.25 12.4 14.95
2 Reference cycles: 10000000.0 10000000.0 10000000.0 10000000.0 2000000.0 2000000.0 2000000.00 2000000.0 2000000.00
3 1 t 279.1 307.1 351.1 429.2 524.8 539.1 540.50 550.3 558.50
4 2 v_hor_size 6.6 6.8 7.3 9.6 11.8 13.3 13.40 14.9 16.40
5 3 Elevation 4.1 4.1 4.3 5.1 6.2 6.7 6.70 7.3 7.90
6 4 Mf_xe.1n1 43.8 61.4 106.6 270.0 330.2 447.4 461.30 573.9 689.90
7 5 Mf_ye.1n1 107.1 148.0 236.8 493.8 603.8 746.3 762.00 881.7 994.30
8 6 Mf_xye.1n1 71.7 98.6 156.9 324.7 397.0 488.6 498.60 574.4 644.80
9 7 Mf_000e.1n1 107.1 148.0 236.8 493.8 603.8 746.3 762.00 881.7 994.30
10 8 Mf_030e.1n1 99.6 137.2 219.7 460.4 563.0 698.3 713.30 828.3 938.10
11 9 Mf_060e.1n1 72.4 100.6 165.6 371.9 454.8 581.1 595.50 706.6 814.60
12 10 Mf_090e.1n1 43.8 61.4 106.6 270.0 330.2 447.4 461.30 573.9 689.90
13 11 Mf_120e.1n1 56.1 78.5 130.2 302.2 369.5 486.4 500.10 610.4 722.30
14 12 Mf_150e.1n1 90.6 125.8 202.9 429.6 525.4 653.7 668.00 777.4 882.00
15 13 Mf_xe.2n1 2591.1 2521.6 2510.2 2854.9 3491.1 3777.4 3819.20 4199.4 4652.10
16 14 Mf_ye.2n1 1407.2 1385.4 1606.3 2993.9 3661.1 4509.0 4603.60 5327.9 6016.10
17 15 Mf_xye.2n1 2337.5 2239.9 2157.8 2262.2 2766.3 3002.7 3045.10 3432.6 3845.10
18 16 Mf_000e.2n1 1407.2 1385.4 1606.3 2993.9 3661.1 4509.0 4603.60 5327.9 6016.10
19 17 Mf_030e.2n1 1692.9 1668.5 1798.8 2846.6 3480.9 4238.0 4325.60 5008.6 5674.50
20 18 Mf_060e.2n1 2298.1 2247.6 2275.5 2781.8 3401.7 3849.5 3909.90 4430.1 5003.00
21 19 Mf_090e.2n1 2591.1 2521.6 2510.2 2854.9 3491.1 3777.4 3819.20 4199.4 4652.10
22 20 Mf_120e.2n1 2410.8 2334.5 2300.9 2561.2 3132.0 3411.9 3457.50 3902.0 4461.90
23 21 Mf_150e.2n1 1862.8 1799.9 1810.7 2631.1 3217.5 3955.5 4040.70 4700.3 5339.30
24 22 Mf_xe.3n1 6414.2 6261.0 6252.6 7146.9 8739.6 9466.1 9570.50 10506.0 11580.60

na.omit seems to be removing negative values in my data frame?

I am trying to figure out different temperature ranges for specific locations (CB, HK, etc.) in my data frame, which looks like this:
head(join)
OTU_num location date otus Depth DO Temperature pH Secchi.Depth
1 Otu0001 CB 03JUN09 21 0.0 7.60 21.0 3.68 NA
2 Otu0001 CB 03JUN09 21 0.5 8.27 16.4 3.68 NA
3 Otu0001 CB 03JUN09 21 1.0 7.65 14.9 3.68 NA
4 Otu0001 CB 03JUN09 21 1.5 5.26 12.2 3.25 NA
5 Otu0001 CB 03JUN09 21 2.0 4.01 10.1 3.25 NA
I am calculating the range using:
ranges <- join %>%
group_by(location) %>%
na.omit %>%
mutate(min=min(Temperature), max=max(Temperature), subtract=min-max) %>%
arrange(subtract)
Some of the temperature values are NA, so I used na.omit; however, it appears to be taking out the negative values, so the ranges I get are wrong.
location min max subtract
MA 0.1 27.3 -27.2
I double checked using the range function for one of the locations (there are a lot and I did not want to use range for each location)
MA <- subset(join, location=="MA")
range(MA$Temperature, na.rm = TRUE)
[1] -2.2 27.6
Why are the values different? Any help is appreciated!!!
I think you should use join %>% filter(!is.na(Temperature)) instead, so only rows with a missing Temperature are removed: na.omit drops every row that contains an NA in any column (Secchi.Depth, for example, is NA in all the rows shown), which is why rows with valid negative temperatures are disappearing.
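A minimal sketch of how that fits into the original pipeline (column names taken from the question; summarise is used here instead of mutate so there is one row per location):
library(dplyr)
ranges <- join %>%
  filter(!is.na(Temperature)) %>%          # drop only rows with a missing temperature
  group_by(location) %>%
  summarise(min = min(Temperature),
            max = max(Temperature),
            subtract = min - max) %>%
  arrange(subtract)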

R generate bins from a data frame respecting blanks

I need to generate bins from a data.frame based on the values of one column. I have tried the function "cut".
For example: I want to create bins of air temperature values in the column "AirTDay" in a data frame:
AirTDay (oC)
8.16
10.88
5.28
19.82
23.62
13.14
28.84
32.21
17.44
31.21
I need the bin intervals to include all values in a range of 2 degrees centigrade from that initial value (i.e. 8-9.99, 10-11.99, 12-13.99...), to be labelled with the average value of the range (i.e. 9.5, 10.5, 12.5...), and to respect blank cells, returning "NA" in the bins column.
The output should look as:
Air_T (oC) TBins
8.16 8.5
10.88 10.5
5.28 NA
NA
19.82 20.5
23.62 24.5
13.14 14.5
NA
NA
28.84 28.5
32.21 32.5
17.44 18.5
31.21 32.5
I've gotten as far as:
setwd('C:/Users/xxx')
temp_data <- read.csv("temperature.csv", sep = ",", header = TRUE)
TAir <- temp_data$AirTDay
Tmin <- round(min(TAir, na.rm = FALSE), digits = 0) # start at the minimum value
Tmax <- round(max(TAir, na.rm = FALSE), digits = 0)
int <- 2 # bin ranges 2 degrees
mean_int <- int/2
int_range <- seq(Tmin, Tmax + int, int) # generate bin sequence
bin_label <- seq(Tmin + mean_int, Tmax + mean_int, int) # generate labels
temp_data$TBins <- cut(TAir, breaks = int_range, ordered_result = FALSE, labels = bin_label)
The output table looks correct, but for some reason it shows an additional sequential column, shifts the column names, and collapses all values, eliminating the blank cells. Something like this:
Air_T (oC) TBins
1 8.16 8.5
2 10.88 10.5
3 5.28 NA
4 19.82 20.5
5 23.62 24.5
6 13.14 14.5
7 28.84 28.5
8 32.21 32.5
9 17.44 18.5
10 31.21 32.5
Any ideas on where I am failing and how to solve it?
v <- ceiling(max(dat$V1, na.rm = TRUE))
breaks <- seq(8, v, 2)
labels <- seq(8.5, length.out = length(breaks) - 1, by = 2)
transform(dat, Tbins = cut(V1, breaks, labels))
V1 Tbins
1 8.16 8.5
2 10.88 10.5
3 5.28 <NA>
4 NA <NA>
5 19.82 18.5
6 23.62 22.5
7 13.14 12.5
8 NA <NA>
9 NA <NA>
10 28.84 28.5
11 32.21 <NA>
12 17.44 16.5
13 31.21 30.5
This result follows the logic given: we have
paste(seq(8,v,2),seq(9.99,v,by=2),sep="-")
[1] "8-9.99" "10-11.99" "12-13.99" "14-15.99" "16-17.99" "18-19.99" "20-21.99"
[8] "22-23.99" "24-25.99" "26-27.99" "28-29.99" "30-31.99"
From this we can tell that 19.82 lies between 18 and 20 and is thus given the value 18.5, just as 10.88 lies between 10 and 11.99 and is thus assigned the value 10.5.
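As a quick check of that pairing (a sketch using the same breaks as in the answer, not part of the original post), each label is just the lower bound of its (lower, upper] interval plus 0.5:
breaks <- seq(8, 32, 2)
labels <- seq(8.5, length.out = length(breaks) - 1, by = 2)
# cut() builds (lower, upper] intervals; lining them up with the labels
# shows why 19.82 falls in (18,20] and is assigned 18.5
data.frame(interval = levels(cut(numeric(0), breaks)), label = labels)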

Why subsetting rows with 'apply' in data frame doesn't work in R

I have a data that looks like this.
Name|ID|p72|p78|p51|p49|c36.1|c32.1|c32.2|c36.2|c37
hsa-let-7a-5p|MIMAT0000062|9.1|38|12.7|185|8|4.53333333333333|17.9|23|63.3
hsa-let-7b-5p|MIMAT0000063|11.3|58.6|27.5|165.6|20.4|8.5|21|30.2|92.6
hsa-let-7c|MIMAT0000064|7.8|40.2|9.6|147.8|11.8|4.53333333333333|15.4|17.7|62.3
hsa-let-7d-5p|MIMAT0000065|4.53333333333333|27.7|13.4|158.1|8.5|4.53333333333333|14.2|13.5|50.5
hsa-let-7e-5p|MIMAT0000066|6.2|4.53333333333333|4.53333333333333|28|4.53333333333333|4.53333333333333|5.6|4.7|12.8
hsa-let-7f-5p|MIMAT0000067|4.53333333333333|4.53333333333333|4.53333333333333|78.2|4.53333333333333|4.53333333333333|6.8|4.53333333333333|8.9
hsa-miR-15a-5p|MIMAT0000068|4.53333333333333|70.3|10.3|147.6|4.53333333333333|4.53333333333333|21.1|30.2|100.8
hsa-miR-16-5p|MIMAT0000069|9.5|562.6|60.5|757|25.1|4.53333333333333|89.4|142.9|613.9
hsa-miR-17-5p|MIMAT0000070|10.5|71.6|27.4|335.1|6.3|10.1|51|51|187.1
hsa-miR-17-3p|MIMAT0000071|4.53333333333333|4.53333333333333|4.53333333333333|17.2|4.53333333333333|4.53333333333333|9.5|4.53333333333333|7.3
hsa-miR-18a-5p|MIMAT0000072|4.53333333333333|14.6|4.53333333333333|53.4|4.53333333333333|4.53333333333333|9.5|25.5|29.7
hsa-miR-19a-3p|MIMAT0000073|4.53333333333333|11.6|4.53333333333333|42.8|4.53333333333333|4.53333333333333|4.53333333333333|5.5|17.9
hsa-miR-19b-3p|MIMAT0000074|8.3|93.3|15.8|248.3|4.53333333333333|6.3|44.7|53.2|135
hsa-miR-20a-5p|MIMAT0000075|4.53333333333333|75.2|23.4|255.7|6.6|4.53333333333333|43.8|38|130.3
hsa-miR-21-5p|MIMAT0000076|6.2|19.7|18|299.5|6.8|4.53333333333333|49.9|68.5|48
hsa-miR-22-3p|MIMAT0000077|40.4|128.4|65.4|547.1|56.5|33.4|104.9|84.1|248.3
hsa-miR-23a-3p|MIMAT0000078|58.3|99.3|58.6|617.9|36.6|21.4|107.1|125.5|120.9
hsa-miR-24-1-5p|MIMAT0000079|4.53333333333333|4.53333333333333|4.53333333333333|9.2|4.53333333333333|4.53333333333333|4.53333333333333|4.9|4.53333333333333
hsa-miR-24-3p|MIMAT0000080|638.2|286.9|379.5|394.4|307.8|240.4|186|234.2|564
What I want to do is simply pick the rows where all the values are greater than 10.
The data clearly show that there are more rows that satisfy this condition.
> dat<-read.delim("http://dpaste.com/1215552/plain/",sep="|",na.strings="",header=TRUE,blank.lines.skip=TRUE,fill=FALSE)
But why does this code of mine only report the last one?
> dat[apply(dat[, -1], MARGIN = 1, function(x) all(x > 10)), ]
Name ID p72 p78 p51 p49 c36.1 c32.1 c32.2 c36.2 c37
19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186 234.2 564
What is the right way to do it?
Update:
alexwhan's solution works, but I wonder how I can generalize his approach so that it can handle data with missing values (NA):
dat<-read.delim("http://dpaste.com/1215354/plain/",sep="\t",na.strings="",header=FALSE,blank.lines.skip=TRUE,fill=FALSE)
Since you're including your ID column (which is a factor) in the all(), it's getting messed up. Try:
dat[apply(dat[, -c(1,2)], MARGIN = 1, function(x) all(x > 10)), ]
# Name ID p72 p78 p51 p49 c36.1 c32.1 c32.2 c36.2 c37
# 16 hsa-miR-22-3p MIMAT0000077 40.4 128.4 65.4 547.1 56.5 33.4 104.9 84.1 248.3
# 17 hsa-miR-23a-3p MIMAT0000078 58.3 99.3 58.6 617.9 36.6 21.4 107.1 125.5 120.9
# 19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186.0 234.2 564.0
EDIT
For the case where you have NA, you can just use the na.rm argument for all(). Using your new data (from the comment):
dat<-read.delim("http://dpaste.com/1215354/plain/",sep="\t",na.strings="",header=FALSE,blank.lines.skip=TRUE,fill=FALSE)
dat[apply(dat[, -c(1,2)], MARGIN = 1, function(x) all(x > 10, na.rm = T)), ]
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
# 7 hsa-miR-15a-5p MIMAT0000068 NA 70.3 10.3 147.6 NA NA 21.1 30.2 100.8
# 16 hsa-miR-22-3p MIMAT0000077 40.4 128.4 65.4 547.1 56.5 33.4 104.9 84.1 248.3
# 17 hsa-miR-23a-3p MIMAT0000078 58.3 99.3 58.6 617.9 36.6 21.4 107.1 125.5 120.9
# 19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186.0 234.2 564.0
# 20 hsa-miR-25-3p MIMAT0000081 19.3 78.6 25.6 84.3 14.9 16.9 19.1 27.2 113.8
# 21 hsa-miR-26a-5p MIMAT0000082 NA 22.8 31.0 561.2 12.4 NA 67.0 55.8 48.9
Another idea is to transform your data to long (molten) format. I think it is even better for avoiding the missing-values problem:
library(reshape2)
dat.m <- melt(dat,id.vars=c('Name','ID'))
dat.m$value <- as.numeric(dat.m$value)
library(plyr)
res <- ddply(dat.m,.(Name,ID), summarise, keepme = all(value > 10))
res[res$keepme,]
# Name ID keepme
# 16 hsa-miR-22-3p MIMAT0000077 TRUE
# 17 hsa-miR-23a-3p MIMAT0000078 TRUE
# 19 hsa-miR-24-3p MIMAT0000080 TRUE
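If the data contain NAs, as in the updated file above, the same na.rm = TRUE trick can be used inside all() in this long-format version too (a sketch, not from the original answer):
# keep a group when every non-missing value exceeds 10
res <- ddply(dat.m, .(Name, ID), summarise, keepme = all(value > 10, na.rm = TRUE))
res[res$keepme, ]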

How to reshape a matrix

I have a matrix (d) that looks like:
d <-
as.matrix(read.table(text = "
month Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13
X10 10 7.04 8.07 9.4 8.17 9.39 8.13 9.43 9.06 8.59 9.37 9.79 8.47 8.86
X11 11 12.10 11.50 12.6 13.70 11.90 11.50 13.10 17.20 19.00 14.60 13.70 13.20 16.10
X12 12 24.00 22.00 22.2 20.50 21.60 22.50 23.10 23.30 30.50 34.10 36.10 37.40 28.90
X1 1 18.30 16.30 16.2 14.80 16.60 15.40 15.20 14.80 16.70 14.90 15.00 13.80 15.90
X2 2 16.70 14.40 15.3 14.10 15.50 16.70 15.20 16.10 18.00 26.30 28.00 31.10 34.20",
header=TRUE))
going from Q1 to Q31 (these are the days in each month). What I would like to get is:
month day Q
10 1 7.04
10 2 8.07
and so on for the 31 days and the 12 months.
I have tried using the following code:
reshape(d, direction="long", varying = list(colnames(d)[2:32]), v.names="Q", idvar="month", timevar="day")
but I get the error :
Error in d[, timevar] <- times[1L] : subscript out of bounds
Can anyone tell me what is wrong with the code? I don't really understand the help file on "reshape", it's a bit confusing... Thanks!
Almost there - you're just missing as.data.frame(d) to make your matrix into a data frame. Also you don't need the list in varying - just a vector, so
reshape(as.data.frame(d), varying=colnames(d)[2:32], v.names="Q",
direction="long", idvar="month", timevar="day")
The help file is confusing as heck, not least because (as I've learned) the necessary information almost always actually is in there --- somewhere.
As a prime example, midway through the help file, there is this bit:
The function will
attempt to guess the ‘v.names’ and ‘times’ from these names [i.e. the ones in the 'varying' argument]. The
default is variable names like ‘x.1’, ‘x.2’, where ‘sep = "."’
specifies to split at the dot and drop it from the name. To have
alphabetic followed by numeric times use ‘sep = ""’.
That last sentence is the one you need here: "Q1", "Q2", etc. are indeed "alphabetic followed by numeric", so you need to set sep = "" argument if reshape() is to know how to split apart those column names.
Try this:
res <- reshape(as.data.frame(d), idvar="month", timevar="day",
varying = -1, direction = "long", sep = "")
head(res[with(res, order(month,day)),])
# month day Q
# 1.1 1 1 18.3
# 1.2 1 2 16.3
# 1.3 1 3 16.2
# 1.4 1 4 14.8
# 1.5 1 5 16.6
# 1.6 1 6 15.4
The help file on reshape is not a bit confusing. It's a LOT confusing. Assuming your matrix has 12 rows (1 for each month) and 31 columns (I'm guessing you have NA values for months with fewer than 31 days), you could easily construct this by hand:
d <- data.frame(month = rep(d[,1], 31), day = rep(1:31, each = 12), Q = as.vector(d[,2:32]))
Now, back to your reshape... I'm guessing it's not parsing the names of your columns correctly. It might work better with Q.1, Q.2, etc. BTW, my reshaping above really depends on what you presented actually being a matrix and not a data.frame.
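To illustrate that renaming idea (a hedged sketch, not from the original answer; d2 is just a throwaway name): with dots in the column names, reshape()'s default sep = "." splits them without any extra arguments.
d2 <- as.data.frame(d)
names(d2)[-1] <- paste0("Q.", seq_len(ncol(d2) - 1))   # rename Q1, Q2, ... to Q.1, Q.2, ...
reshape(d2, varying = names(d2)[-1], direction = "long",
        idvar = "month", timevar = "day")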
