Is there a way to only show the first 2 lines of output from the describe command in Hmisc?
For data safety reasons I can only really show n, missing, unique and mean in my output and possibly a histogram.
This means that I would have to hide the output for lowest, highest as well as frequencies and percentiles.
Is this possible? If not I'll probably have to calculate the values myself.
library(Hmisc)
res <- describe(rnorm(400))
#Look at the structure.
str(res)
#It's a list! You can change the objects in it.
res$counts <- res$counts[1:4]
res$values <- NULL
print(res)
#rnorm(400)
# n missing unique Mean
# 400 0 400 0.05392
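If trimming the describe object turns out to be awkward, computing the four permitted values directly is a short fallback. The following is a minimal base R sketch; the name safe_summary is hypothetical, and "distinct" here is assumed to mean the number of distinct non-missing values.

```r
# Hypothetical helper: returns only n, missing, distinct and mean,
# so no lowest/highest values, frequencies or percentiles are exposed.
safe_summary <- function(x) {
  c(n        = sum(!is.na(x)),               # non-missing observations
    missing  = sum(is.na(x)),                # missing observations
    distinct = length(unique(x[!is.na(x)])), # distinct non-missing values
    mean     = mean(x, na.rm = TRUE))
}

x <- c(1, 2, 2, NA, 5)
res_safe <- safe_summary(x)
print(res_safe)
```

Because it returns a plain named vector, it can be applied to every column of a data frame with sapply().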
I do not understand the letter ordering in the function multcompLetters from multcompView. According to the documentation, the letters should be ordered by group mean. In the following example, the middle group got c (from abc) when it should have got b. Is this a bug?
require(multcompView)
# Data
datacol <- c(21.1,20.2,21.8,20.9,23.3,21.1,20.2,21.8,20.9,23.3,19.8,16.4,
16.9,16.0,17.6,17.5,16.9,13.3,18.0,17.6,13.5,12.2,15.2,15.1,15.2,14.0)
# Group
faccol <- c(rep(c(1,2),each=10),rep(3,6))
# Combined Dataframe
tukeyset <- data.frame(datacol,as.factor(faccol))
colnames(tukeyset)[2] <- "faccol"
# Tukeytest
tukeyres <- TukeyHSD(x=aov(lm(datacol~faccol,data=tukeyset)))
Tlevels <- tukeyres$faccol[,4]
multcompLetters(Tlevels) # WRONG ORDER, even reversed
# Boxplot
boxplot(tukeyset$datacol~tukeyset$faccol)
# adding the labels
text(x=c(1,2,3),y=c(aggregate(data=tukeyset,datacol~faccol,mean)$datacol),
labels=as.character(multcompLetters(Tlevels,reversed=TRUE)$Letters)[order(names(multcompLetters(Tlevels,reversed=TRUE)['Letters']$Letters))])
Oh no, I just ran into this problem too! I spent 2 hours finally sorting it out.
Don't call multcompLetters directly on the p-values. It never sees the group means, so the letter order can come out completely wrong.
Use multcompLetters2, multcompLetters3 or multcompLetters4 instead, which also take the data or the fitted model.
Another very important point: the input dataset has to be a data.frame, not a tibble. A tibble doesn't work for this.
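As a rough sketch of that advice applied to the data from the question, assuming multcompView is installed and that multcompLetters4 is given the fitted aov model plus its TukeyHSD result (the object names model, tukey and letters4 are made up for illustration):

```r
library(multcompView)

datacol <- c(21.1, 20.2, 21.8, 20.9, 23.3, 21.1, 20.2, 21.8, 20.9, 23.3,
             19.8, 16.4, 16.9, 16.0, 17.6, 17.5, 16.9, 13.3, 18.0, 17.6,
             13.5, 12.2, 15.2, 15.1, 15.2, 14.0)
faccol <- factor(c(rep(c(1, 2), each = 10), rep(3, 6)))

# A plain data.frame, not a tibble
tukeyset <- data.frame(datacol, faccol)

model <- aov(datacol ~ faccol, data = tukeyset)
tukey <- TukeyHSD(model)

# multcompLetters4() gets the model as well as the Tukey result,
# so it has access to the group means when assigning letters
letters4 <- multcompLetters4(model, tukey)
print(letters4$faccol$Letters)
```

The result is a list with one element per factor in the model; the Letters component is a named vector that can be matched to the boxplot groups by name rather than by position.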
I’m new to programming and I’m currently writing a function to go through hundreds of csv files in the working directory.
The files have tons of NA values in them.
The function (which I call corr) has two parameters: the directory and a threshold value (a numeric vector of length 1 indicating the number of complete cases).
The purpose of the function is to take the complete cases for two columns, sulfate and nitrate (the second and third columns in the spreadsheet), and calculate the correlation between them if the number of complete cases is greater than the threshold parameter.
The function should return a vector with the correlations for the files that met the threshold requirement (the default threshold value is 0).
When I run the code I get one of the following two results:
1. A + sign in the console
OR
2. The objects I created in the function can't be found.
Any help would be much appreciated. Thank you in advance!
corr <- function(directory, threshold=0){
filelist2<- data.frame(list.files(path=directory,
pattern=".csv", full.names=TRUE))
corvector <- numeric()
for(i in 1:length(filelist2)){
data <-data.frame(read.csv(filelist2[i]))
removedNA<-complete.cases(data)
newdata<-data[removedNA,2:3]
if(nrow(removedNA) > threshold){
corvector<-c(corvector, cor(data$sulfate, data$nitrate ))
}
}
corvector
}
I don't think your nrow(removedNA) does what you think it does. To replicate the example I use the mtcars dataset.
data <- mtcars # create dataset
data[2:4, 2] <- NA # create some missings in column 2
data[15:17, 3] <- NA # create some missing in column 3
removedNA <- complete.cases(data)
table(removedNA) # 6 missings indeed
nrow(removedNA) # NULL removedNA is no data.frame, so nrow() doesn't work
newdata <- data[removedNA, 2:3] # this works though
nrow(newdata) # and this shows the rows in 'newdata'
#---- therefore instead of nrow(removedNA) try
if(nrow(data)-nrow(newdata) < threshold) {
...
}
NB: I changed the > to < in the line with threshold. Which one you need depends on what the threshold means. If it is an absolute minimum number of complete rows, you can simply use nrow(newdata) > threshold. If it should instead reflect the difference in the number of rows between the original data and the 'new' data, use the version above.
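Putting this together, here is one possible corrected version of corr, as a sketch rather than the definitive fix. It assumes the threshold is a minimum number of complete rows, keeps the file names as a character vector instead of wrapping them in data.frame(), and computes the correlation on the filtered rows. The demo directory and the two small CSV files are made up so the example runs on its own.

```r
corr <- function(directory, threshold = 0) {
  # Character vector of file paths (not a data.frame)
  files <- list.files(path = directory, pattern = "\\.csv$", full.names = TRUE)
  corvector <- numeric()
  for (f in files) {
    data <- read.csv(f)
    complete <- data[complete.cases(data), ]  # rows without any NA
    # nrow() works here because 'complete' is a data.frame
    if (nrow(complete) > threshold) {
      corvector <- c(corvector, cor(complete$sulfate, complete$nitrate))
    }
  }
  corvector
}

# Made-up demo data in a temporary directory
demo_dir <- file.path(tempdir(), "corr_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Date = 1:5,
                     sulfate = c(1, 2, NA, 4, 5),
                     nitrate = c(2, 4, 6, NA, 10)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = 1:3,
                     sulfate = c(1, NA, 3),
                     nitrate = c(3, 2, NA)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

res_corr <- corr(demo_dir, threshold = 2)
print(res_corr)
```

Only the first file has more than two complete rows, so the result contains a single correlation.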
I have two cohorts, so I ran a linear regression separately per cohort and used a for loop so that my coefficients are saved per cohort. I now want to get a pooled estimate per SNP, but I have 53 SNPs, so I would prefer not to type out all the coefficients by hand. Is there a way to make a for loop to use in the rma command from metafor?
So far I've come as far as thinking that it's probably easiest to merge my two coefficient files together. I've called this coeffs. The first column has the SNP names, the 2nd and 6th columns have the betas from cohort 1 and 2, respectively, and columns 3 and 7 have the standard errors from the two cohorts.
So I want to make an item beta that includes my beta from cohort 1 and cohort 2 for one SNP, and then do the same with se. I then want to run rma(beta,se) per SNP so I can export the results to Excel.
So far I thought of doing the following (but it doesn't work)
output3 <- data.frame(matrix(nrow=84,ncol=3))
names(output3)=c("Pooled Estimate", "Pooled Std.Error", "P-value")
for(l in 3:84){
beta <-c(output3[l,2], output3[l,6])
se <-c(output3[l,3], output3[l,7])
pool <- rma(beta,se)
}
When I run the rma I get the following error message:
Error in [[<-.data.frame(*tmp*, l, value = list(b = -0.105507438518734, : replacement has 70 rows, data has 84
If I change nrow to 70, then I don't get the information. From the rma output I want the second row and columns 1,2 and 4. I think this is going wrong somewhere.
I figured out my mistake: I forgot to tell R which rows of data I needed and where the results should be saved.
For anyone else with this problem, here is my script which worked. I first created my data.frame where I wanted the data to be saved.
output3 <- data.frame(matrix(nrow=84,ncol=3))
names(output3)=c("Pooled Estimate", "Pooled Std.Error", "P-value")
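Next I made a for loop to extract the betas and standard errors for each SNP and store the pooled results:
# Row names only need to be set once, so do it before the loop
row.names(output3) <- colnames(qcwomenc[1:84])
for(l in 3:84){
beta <- c(coeffs[l,2], coeffs[l,6])
se <- c(coeffs[l,3], coeffs[l,7])
# rma() takes sampling variances as its second argument, hence se^2
pool <- rma(beta, se^2)
est <- coef(summary(pool))
output3[l,1] <- est[1,1] # pooled estimate
output3[l,2] <- est[1,2] # pooled standard error
output3[l,3] <- est[1,4] # p-value
}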
You say it is not working but you don't say what exactly. I think what you are missing is something to save your results. Like a list or data.frame. The variable pool gets updated on each iteration of the loop so after the loop is through it will contain only the last model. Also, your indices do not match the example data.frame as you are referring to column 6 and 7 which do not exist. But I guess they do exist in your actual data.frame. Also your example data.frame is full of NA values. Maybe try like this:
output3 <- data.frame(matrix(runif(84*4), nrow=84,ncol=4))
names(output3)=c("se1", "beta1", "se2", "beta2")
modellist <- list()
for(l in 3:84){
beta <-c(output3[l,2], output3[l,4])
se <-c(output3[l,1], output3[l,3])
pool <- sum(beta, se)
modellist[[l]] <- pool
}
modellist
Note: I used sum() instead of your rma() because I don't know this function or which package it comes from.
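The same "store one result per iteration" idea can also return a data frame with one row per SNP. The sketch below is only an illustration. The coeffs data frame, its column names and the helper pool_one are made up. pool_one uses plain fixed-effect inverse-variance pooling as a stand-in so that the example runs without metafor, and this is not the same model that rma() fits by default. In a real analysis, the rma() call from the script above would take its place.

```r
# Made-up stand-in data: betas and standard errors from two cohorts
set.seed(42)
coeffs <- data.frame(
  snp   = paste0("rs", 1:5),
  beta1 = rnorm(5), se1 = runif(5, 0.1, 0.3),
  beta2 = rnorm(5), se2 = runif(5, 0.1, 0.3)
)

# Hypothetical helper: fixed-effect inverse-variance pooling for one SNP
pool_one <- function(beta, se) {
  w      <- 1 / se^2                  # inverse-variance weights
  est    <- sum(w * beta) / sum(w)    # weighted mean of the betas
  est_se <- sqrt(1 / sum(w))          # standard error of the pooled estimate
  c(estimate = est, se = est_se, p = 2 * pnorm(-abs(est / est_se)))
}

# One row of results per SNP, collected without a growing loop variable
pooled <- t(sapply(seq_len(nrow(coeffs)), function(i) {
  pool_one(c(coeffs$beta1[i], coeffs$beta2[i]),
           c(coeffs$se1[i],   coeffs$se2[i]))
}))
pooled <- as.data.frame(pooled)
rownames(pooled) <- coeffs$snp
print(pooled)
```

The resulting data frame can be written out with write.csv() for use in Excel.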
I've been working on a project for a little bit for a homework assignment and I've been stuck on a logistical problem for a while now.
What I have at the moment is a list that returns 10000 values in the format:
[[10000]]
X-squared
0.1867083
(This is the 10000th value of the list)
What I really would like is to just have the chi-squared value alone so I can do things like create a histogram of the values.
Is there any way I can do this? I'm fine with repeating the test from the start if necessary.
My current code is:
nsims = 10000
for (i in 1:nsims) {cancer.cells <- c(rep("M",24),rep("B",13))
malig[i] <- sum(sample(cancer.cells,21)=="M")}
benign = 21 - malig
rbenign = 13 - benign
rmalig = 24 - malig
for (i in 1:nsims) {test = cbind(c(rbenign[i],benign[i]),c(rmalig[i],malig[i]))
cancerchi[i] = chisq.test(test,correct=FALSE) }
It gives me all I need, I just cannot perform follow-up analysis on it such as creating a histogram.
Thanks for taking the time to read this!
I'll provide an answer at the suggestion of @Dr. Mike.
hist requires a numeric vector as input. hist(cancerchi) will not work because cancerchi is a list, not a vector.
There are several ways to convert cancerchi from a list into a format that hist can work with. Three of them are shown below; the simplest is to unlist it:
hist(unlist(cancerchi))
Note that if you do not reassign cancerchi it will still be a list and cannot be passed directly to hist.
# i.e
class(cancerchi)
hist(cancerchi) # will still give you an error
If you reassign, it can be another type of object:
(class(cancerchi2 <- unlist(cancerchi)))
(class(cancerchi3 <- as.data.frame(unlist(cancerchi))))
# using the ldply function in the plyr package
library(plyr)
(class(cancerchi4 <- ldply(cancerchi)))
These new objects can be passed to hist, either directly or by selecting their first column:
hist(cancerchi2)
hist(cancerchi3[,1]) # specify column because cancerchi3 is a data frame, not a vector
hist(cancerchi4[,1]) # specify column because cancerchi4 is a data frame, not a vector
A little extra information: other useful commands for looking at your objects include str and attributes.
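Since the question mentions being fine with repeating the test from the start, another option is to never build the list at all and store only the test statistic. This is a sketch in base R under that assumption; chisq_stats is a made-up name, and nsims is reduced to 1000 to keep the example quick.

```r
set.seed(1)
nsims <- 1000
cancer.cells <- c(rep("M", 24), rep("B", 13))

# Number of malignant cells in each simulated sample of 21
malig   <- replicate(nsims, sum(sample(cancer.cells, 21) == "M"))
benign  <- 21 - malig
rbenign <- 13 - benign
rmalig  <- 24 - malig

# vapply() returns a plain numeric vector: one chi-squared statistic per table
chisq_stats <- vapply(seq_len(nsims), function(i) {
  tab <- cbind(c(rbenign[i], benign[i]), c(rmalig[i], malig[i]))
  unname(chisq.test(tab, correct = FALSE)$statistic)
}, numeric(1))

hist(chisq_stats)
```

Extracting $statistic inside the loop keeps only the X-squared value and drops the rest of the test object.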
I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a data frame that is more conducive to the functions I eventually want to run for each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1,ncol(df3)-1,2)) {
kud <- kernel.area(df3[,j:(j+1)],id=id,kern="bivnorm",unin=c("m"),unout=c("km2"))
print(kud)
}
My end goal is to calculate kernel.area for each resampling event (i.e. rows 1:100 for every pair of columns up to 200), and to be able to combine the results in a data frame. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I changed it and now get this error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be df3 <- t(df2), but this is most likely correct in your actual code and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since : has a higher precedence than + (see ?Syntax for the order in which operators are evaluated in R). To get the desired two columns, use j:(j+1) instead.
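The precedence point is easy to check in the console. A small illustration, with j chosen arbitrarily:

```r
j <- 3

wrong <- j:j+1     # parsed as (j:j) + 1, so a single number
right <- j:(j+1)   # the two consecutive indices j and j + 1

print(wrong)
print(right)

# Applied to column selection on a small made-up matrix
m <- matrix(1:12, nrow = 2)
print(ncol(m[, wrong, drop = FALSE]))  # one column selected
print(ncol(m[, right, drop = FALSE]))  # two columns selected
```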
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1,ncol(df3)-1,2)) {
kud <-kernelUD(SpatialPoints(df3[,j:(j+1)]),kern="bivnorm")
kernAr<-kernel.area(kud,unin=c("m"),unout=c("km2"))
print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints something, and as is mentioned in a comment below, kernelUD() is called before kernel.area().