Error: Can't select within an unnamed vector - r

I have a dataset named "cog_gpa" in R, subsetted from the Fragile Families dataset, that contains student GPA and some test scores used to measure cognitive abilities. I want to run a random forest to check which test scores are most important for predicting GPA.
My dataset (cog_gpa) is a tibble (4898*5) and looks somewhat like this:
ch5dsss ch5ppvts ch5wj9ss ch5wj10ss GPA
13.0000 98 104.0000 117.0000 3.7500
9.0000 76 84.0000 84.0000 3.5000
9.3524 92 92.6763 97.9623 4.0250
When I check str(cog_gpa), I see that all the predictor variables are of the type "dbl+lbl" whereas GPA is of type "num".
I start with the following where I split by GPA:
cog_gpa_split <- initial_split(cog_gpa$GPA, prop = .7)
However, I get the following error:
Error: Can't select within an unnamed vector.
Run `rlang::last_error()` to see where the error occurred.
When I run rlang::last_error(), I get the following:
<error/rlang_error>
Can't select within an unnamed vector.
Backtrace:
1. rsample::initial_split(cog_gpa$GPA, prop = 0.7)
2. rsample::mc_cv(...)
3. tidyselect::vars_select(names(data), !!enquo(strata))
4. tidyselect:::eval_select_impl(...)
Run `rlang::last_trace()` to see the full context.
When I run rlang::last_trace(), I get the following:
<error/rlang_error>
Can't select within an unnamed vector.
Backtrace:
x
1. \-rsample::initial_split(cog_gpa$GPA, prop = 0.7)
2. \-rsample::mc_cv(...)
3. \-tidyselect::vars_select(names(data), !!enquo(strata))
4. \-tidyselect:::eval_select_impl(...)
How do I go about resolving this error? I think I need to ensure that all of my variables are of the same type (i.e. all of them should be of the type numeric) but I am not sure how to do that.
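The backtrace points at names(data), which suggests the error comes from passing a bare vector (cog_gpa$GPA) where initial_split() expects a data frame; the "dbl+lbl" column types are probably not the cause of this particular message. A minimal sketch, assuming a small made-up data frame named toy in place of cog_gpa (the rsample call only runs if the package is installed):

```r
# Hypothetical stand-in for cog_gpa; the values are made up for illustration
set.seed(1)
toy <- data.frame(
  ch5ppvts = round(rnorm(100, mean = 95, sd = 10)),
  GPA      = round(runif(100, min = 1, max = 4), 2)
)

# A single column pulled out with $ is an unnamed vector,
# whereas the data frame itself carries column names
vector_names <- names(toy$GPA)  # NULL
frame_names  <- names(toy)      # "ch5ppvts" "GPA"

# Base R 70/30 split on the whole data frame
train_rows <- sample(nrow(toy), size = floor(0.7 * nrow(toy)))
train <- toy[train_rows, ]
test  <- toy[-train_rows, ]

# With rsample: pass the data frame and name the stratifying column via `strata`
if (requireNamespace("rsample", quietly = TRUE)) {
  toy_split <- rsample::initial_split(toy, prop = 0.7, strata = GPA)
  train <- rsample::training(toy_split)
  test  <- rsample::testing(toy_split)
}
```

If a later modelling step does complain about the labelled columns, stripping the labels (for example with haven::zap_labels()) is one option, though that is a separate issue from the error shown above.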

Related

Error during harmonisation in TwoSampleMR (R-package)

I am trying to perform Mendelian randomisation using the R package "TwoSampleMR".
As exposure data, I use instruments from the GWAS catalog (phenotype: Sphingolipid levels).
As outcome data, I use the GISCOME ischemic stroke outcome GWAS (http://www.kp4cd.org/index.php/node/391)
I get an error when I run the harmonisation with harmonise_data().
The text of the error is:
**Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 1, 0**.
I have noticed that the error is caused by specific lines in the outcome file. When I make a text file that contains only one line from the original file and use it as outcome data, some lines cause the error and others don't.
As an example this one causes an error:
MarkerName CHR POS Allele1 Allele2 Freq1 Effect StdErr P-value
rs10938494 4 47563448 a g 0.2139 0.0294 0.0519 0.5706
This one doesn’t:
rs1000778 11 61655305 a g 0.2559 0.0939 0.0493 0.05705
Here are all the commands that I use:
library(TwoSampleMR)
library(MRInstruments)
data(gwas_catalog)
exp <- subset(gwas_catalog, grepl("Sphingolipid levels", Phenotype))
exp_dat<-format_data(exp)
exp_dat<-clump_data(exp_dat)
exp_dat
out_dat <- read_outcome_data(
  snps = exp_dat$SNP,
  filename = 'giscome.012vs3456.age-gender-5PC.meta1.txt',
  sep = '\t',
  snp_col = 'MarkerName',
  beta_col = 'Effect',
  se_col = 'StdErr',
  effect_allele_col = 'Allele1',
  other_allele_col = 'Allele2',
  eaf_col = 'Freq1',
  pval_col = 'P-value'
)
dat <- harmonise_data(exposure_dat = exp_dat, outcome_dat = out_dat)
What would be the reason for this problem?
Thank you.
It is difficult to comment without looking at your sample input file, but you might encounter this sort of error when the exposure columns in your data frame are named inconsistently.
Please see this thread on GitHub:
https://github.com/MRCIEU/TwoSampleMR/issues/226

Why is my custom function printing my row names in addition to the data?

I have the following custom function that I am using to create a table of summary statistics in R.
regression.stats <- function(fit) {
  formula <- fit$call
  data <- eval(getCall(fit)$data)
  abserror <- abs(exp(fit$fitted.values) - data$bm) / exp(fit$fitted.values)
  QMLE <- exp((sigma(fit)^2) / 2)
  smear <- sum(exp(fit$residuals)) / nrow(data)
  RE <- mean(data$bm) / mean(exp(fit$fitted.values))
  CF <- (RE + smear + QMLE) / 3
  adjPE <- mean(abs((exp(fit$fitted.values) * CF) - data$bm)
                / (exp(fit$fitted.values) * CF))
  SEE <- exp(sigma(fit) + 4.6052) - 100
  summary <- summary(fit)
  statistics <- data.frame("df" = fit$df.residual,
                           "r2" = round(summary(fit)$r.squared, 4),
                           "adjr2" = round(summary(fit)$adj.r.squared, 4),
                           "AIC" = AIC(fit), "BIC" = BIC(fit),
                           "logLik" = logLik(fit),
                           "PE" = round(mean(abserror) * 100, 2), QMLE = round(QMLE, 3),
                           smear = round(smear, 3), RE = round(RE, 3), CF = round(CF, 3),
                           "adjPE" = round(mean(adjPE) * 100, 2),
                           "SEE" = round(SEE, 2), row.names = print(substitute(fit)))
  return(statistics)
}
I want to bind the resulting rows into a data.frame in order to produce a table of comparison statistics between regression analyses. For example, using the data from the mtcars dataset...
data(mtcars)
lm1 <- lm(cyl ~ mpg, data = mtcars)
lm2 <- lm(cyl ~ disp, data = mtcars)
lm3 <- lm(disp ~ mpg, data = mtcars)
rbind(regression.stats(lm1),regression.stats(lm2),regression.stats(lm3))
I am creating this for an R Markdown html file and I want readers to be able to tell which regression equation produced which statistics. However when I run the code it also ends up printing a list of the names of the lm functions in addition to the regression statistics in the resulting html document.
I have managed to track the problem down to the line row.names = print(substitute(fit))) in my function. If I remove that line it no longer prints the lm name when running the function. However, what happens then is my rows are no longer associated with the correct model name. How can I adjust my function so that it only prints the name of the model function as the row name of the summary function, rather than creating an additional list?
The line
...
row.names = print(substitute(fit))
...
should be
row.names = deparse(substitute(fit))
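Simply using substitute(fit) also works, because data.frame() converts it to character.
print() returns its argument invisibly, which is why the row names were still set correctly, but it also writes that argument to the console, and that is where the extra list of model names comes from.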
After the change in function
rbind(regression.stats(lm1),regression.stats(lm2),regression.stats(lm3))
#    df     r2  adjr2       AIC       BIC     logLik  PE  QMLE        smear RE CF adjPE          SEE
#lm1 30 0.7262 0.7171  91.46282  95.86003  -42.73141 NaN 1.570 1.443000e+00 NA NA   NaN 1.585700e+02
#lm2 30 0.8137 0.8075  79.14552  83.54273  -36.57276 NaN 1.359 1.317000e+00 NA NA   NaN 1.189600e+02
#lm3 30 0.7183 0.7090 363.71635 368.11356 -178.85818 NaN   Inf 1.861805e+65 NA NA   NaN 1.092273e+31
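The difference between the two calls can be seen in isolation. This is a minimal sketch in which label_with_print() and label_with_deparse() are hypothetical helpers that mimic only the row.names argument, not the full regression.stats() function:

```r
# Hypothetical helpers that mimic the row.names argument of regression.stats()
label_with_print   <- function(fit) print(substitute(fit))    # prints AND returns the symbol
label_with_deparse <- function(fit) deparse(substitute(fit))  # returns a string, prints nothing

lm1 <- lm(cyl ~ mpg, data = mtcars)

# capture.output() collects whatever each call writes to the console
printed <- capture.output(res_print <- label_with_print(lm1))
silent  <- capture.output(res_deparse <- label_with_deparse(lm1))

printed      # the stray console output produced by print()
silent       # character(0): deparse() writes nothing
res_deparse  # the model name as a character string
```

Both helpers yield something data.frame() can use as a row name; only the print() version has the console side effect that ends up in the rendered R Markdown document.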

R vector numeric expression warning

I created a simple vector in R to store the temperatures of 3 patients.
temperature <- c(98.1, 98.6, 101.4)
then later I tried to retrieve the temperature of the second and third patient.
temperature[2:3]
[1] 98.6 101.4
While trying to retrieve all three values I succeeded but then got this warning from RStudio
temperature[1:2:3]
[1] 98.1 98.6 101.4
Warning message:
In 1:2:3 : numerical expression has 2 elements: only the first used
What does this warning mean?
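The expression temperature[1:2:3] is valid R (it runs without an error), but it gives the same result as temperature[1:3].
The : operator is evaluated left to right, so 1:2:3 is treated as (1:2):3. The left operand 1:2 has two elements, and : uses only the first of them, hence the warning. In effect, only the first and the last numbers in the chain matter, so temperature[1:3:4:5:3] is the same as temperature[1:3].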
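The behaviour can be checked directly. A small sketch, assuming an R version in which a multi-element operand to : produces a warning rather than an error (the handler below just reports the warning text and carries on):

```r
temperature <- c(98.1, 98.6, 101.4)

# 1:2:3 is parsed as (1:2):3; only the first element of 1:2 is used as the start
idx <- withCallingHandlers(
  1:2:3,
  warning = function(w) {
    message("caught: ", conditionMessage(w))  # show the warning text
    invokeRestart("muffleWarning")            # then continue evaluating
  }
)

idx                    # the index vector that was actually built
identical(idx, 1:3)    # compares it with the plain 1:3 sequence
temperature[idx]
```

Writing temperature[1:3] (or just temperature) avoids the warning altogether.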

Error: impossible to replicate vector of size in mutate

I have been using the following code to determine diversity (using the vegan package) and it has been going well. In order to calculate diversity using vegan, you have to create a dataframe with only site by species. Then you calculate diversity and use dplyr's mutate to add your diversity metric as a new column to your original dataframe.
final_corrected %>% select(eu_density, para_density, bleph_density, colp_density, rot_density, vort_density) -> final_speciesonly
H.protists <- diversity(final_speciesonly)
final_corrected %>% mutate(diversity = H.protists) -> final_diversity
My problem is that I tried to do this analysis again with a summarized dataset, and when I try and mutate, an error pops up:
summary_diversity[3:8] -> summary_speciesonly
H.protists.sum <- diversity(summary_speciesonly)
summary_speciesonly %>% mutate(diversity_sum = H.protists.sum) -> summary_diversity_total
Error: impossible to replicate vector of size in mutate
When I look at the differences between H.protists and H.protists.sum I find that H.protists is a named num value, whereas H.protists.sum is just a num value. Here is the head of each:
head(H.protists)
1 2 3 4 5 6
0.3144922 0.8980537 0.8740576 0.2771206 0.5701381 0.3502690
head(H.protists.sum)
[1] 1.336860 1.331183 1.193013 1.192450 1.258912 1.412319
I think that this is the reason that I am getting an error message, but I am not sure how to fix it. Help?
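One way to test that suspicion is to check whether names or length is what matters when a vector is added as a column. The sketch below uses base R and a made-up data frame toy in place of summary_speciesonly; it is a diagnostic under those assumptions, not a confirmed explanation of the error. mutate() likewise expects a new column of length 1 or one value per row (per group, if the data are grouped), so if summary_diversity came out of group_by() and summarise(), leftover grouping is worth checking too.

```r
# Hypothetical stand-ins for summary_speciesonly and the diversity vectors
toy <- data.frame(eu_density = c(1, 2, 3), para_density = c(4, 5, 6))
named_div   <- c("1" = 0.31, "2" = 0.90, "3" = 0.87)
unnamed_div <- unname(named_div)

# Names make no difference to the stored values
same_values <- isTRUE(all.equal(unname(named_div), unnamed_div))

# Length does: the vector must have one value per row
length_ok <- length(unnamed_div) == nrow(toy)
toy$diversity_sum <- unnamed_div

# A vector of the wrong length is rejected
bad <- tryCatch(
  { toy$too_short <- c(0.1, 0.2); "no error" },
  error = function(e) conditionMessage(e)
)
bad
```

Comparing length(H.protists.sum) with nrow(summary_speciesonly) in the real data would show whether the lengths match.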

cummeRbund csHeatmap column user-defined order

I am using the R package cummeRbund (from Bioconductor) to visualize RNA-seq data. I created a CuffGeneSet instance called "DEG_genes" that contains 662 genes that are significantly differentially expressed between males and females. My goal is to create a heatmap using csHeatmap() in which the male and female samples (replicates) are separated but with a specific user-defined order within each sex category.
I used:
> DEG<-diffData(genes(cuff)) # take differentially expressed genes
> DEG_significant<-subset(DEG,significant=='yes') # retain only significant changes
> DEG_sign_IDs <- DEG_significant$gene_id # retrieve IDs
> DEG_genes<-getGenes(cuff,DEG_sign_IDs) # get CuffGeneSet instance
> hmap<-csHeatmap(DEG_genes,clustering='none',labRow=F,replicates=T)
This gives me ALMOST what I want: the heatmap shows Females on the left and Males on the right, but they are alphabetically ordered (Female_0,Female_1,Female_10,Female_11,Female_12...Female_19,Female_2,Female_20,Female_21..,Female_29 on the left and similarly for males Male_0,Male_1,Male_10...Male_19,Male_2,Male_20...etc on the right) and I want them to be in a specific order (clusterReps). I created a test vector with replicate names in a specific order (Males on the left with 0 and 6 exchanged, and Females on the right) as follows:
clusterReps<-c("Male_6","Male_1","Male_2","Male_3","Male_4","Male_5","Male_0","Male_7","Male_8","Male_9","Male_10","Male_11","Male_12","Male_13","Male_14","Male_15","Male_16","Male_17","Male_18","Male_19","Male_20","Male_21","Male_22","Male_23","Male_24","Male_25","Male_26","Male_27","Male_28","Male_29","Male_30","Male_31","Male_32","Male_33","Female_0","Female_1","Female_2","Female_3","Female_4","Female_5","Female_6","Female_7","Female_8","Female_9","Female_10","Female_11","Female_12","Female_13","Female_14","Female_15","Female_16","Female_17","Female_18","Female_19","Female_20","Female_21","Female_22","Female_23","Female_24","Female_25","Female_26","Female_27","Female_28")
I would like the data to be exactly the same except for the order of the columns, which must follow the order of the "clusterReps" vector. Knowing that the heatmap is a ggplot, I have looked everywhere for a solution over the last 2 days, but with no success (despite a closely resembling problem with heatmap.2() instead of csHeatmap() on stackoverflow; I tried to get a replicate fpkm matrix and use heatmap.2, but could only use heatmap_2 and some options were not accepted).
Using:
> hmap<-hmap+scale_x_discrete(limits=clusterReps)
Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
only changes the x-axis labels but not the actual data (the heatmap remains identical).
Is there a similar function that rearranges the columns and not just labels?
Thanks in advance for your help, I'm not familiar with handling ggplot objects, and in particular heatmaps from cummeRbund.
EDIT:
Here is what I can give as further information:
> DEG_genes
CuffGeneSet instance for 662 genes
Slots:
annotation
fpkm
repFpkm
diff
count
isoforms CuffFeatureSet instance of size 930
TSS CuffFeatureSet instance of size 785
CDS CuffFeatureSet instance of size 230
promoters CuffFeatureSet instance of size 662
splicing CuffFeatureSet instance of size 785
relCDS CuffFeatureSet instance of size 662
> summary(DEG_genes)
Length Class Mode
662 CuffGeneSet S4
I am afraid I can't give more information for the moment, please let me know if you want me to execute a command and report the output if it can help.
I am not very fluent in R, but I was having the same problem. To solve it I made a script that renames all my sample names in all the files inside the cuffdiff folder to something that will give the right order when sorted alphabetically, and then rebuilt the database.
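The renaming idea can be sketched for the name vector alone. This assumes replicate names of the form Prefix_N, and pad_replicate() is a hypothetical helper; rewriting the names inside the cuffdiff files and rebuilding the database is not shown. Zero-padding gives natural numeric order under alphabetical sorting; an arbitrary order such as clusterReps would need names chosen to sort in that order.

```r
# Hypothetical helper: zero-pad the numeric suffix so that alphabetical
# order matches numeric order (Female_2 sorts before Female_10)
pad_replicate <- function(x, width = 2) {
  prefix <- sub("_[0-9]+$", "", x)              # part before the last underscore
  index  <- as.integer(sub("^.*_", "", x))      # numeric suffix
  paste0(prefix, "_", formatC(index, width = width, flag = "0"))
}

reps <- c("Female_0", "Female_1", "Female_10", "Female_2",
          "Male_0", "Male_1", "Male_10", "Male_2")

padded <- pad_replicate(reps)
sorted <- sort(padded)
sorted
```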
