I have a problem I can't understand or solve. I downloaded GSE115262 from GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115262) and want to extract the gene names from GSM3172784HC$annotation.gene_name. When I do this, I get numbers instead of the gene names. How do I get the character values? If I run str(), this is what I get: $ annotation.gene_name : Factor w/ 56233 levels "5_8S_rRNA","5S_rRNA",..: 53514 52750 11836 48738, so the column shows up as numbers. If I run head() and look at GSM3172784HC$annotation.gene_name, I see the gene names, which is what I want. How do I get these?
#### Need to load in all libraries
#General Bioconductor packages
library("GEOquery");
library("Biobase");
# Loop Through Files for download
for(i in 1:length(tmp$V1)){
getGEOSuppFiles(tmp$V1[i])
};
######## Healthy Controls GSE115262 ##########
## May need to run the read several times to get the file into R
GSM3172784HC<-read.table(gzfile("FilePath.txt.gz"), header=T)
## New data-frame
HCData<- cbind(GSM3172784HC$annotation.gene_name, GSM3172784HC$expected_count);
HCData<- as.data.frame(HCData)
row.names(HCData) <- HCData$V1
colnames(HCData) <- c("HC1")
str(GSM3172784HC)
'data.frame': 57955 obs. of 11 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ annotation.gene_id : Factor w/ 57955 levels "ENSG00000000003",..: 1 2 3 4 5 6 7 8 9 10 ...
$ annotation.gene_biotype: Factor w/ 43 levels "3prime_overlapping_ncRNA",..: 20 20 20 20 20 20 20 20 20 20 ...
$ annotation.gene_name : Factor w/ 56233 levels "5_8S_rRNA","5S_rRNA",..: 53514 52750 11836 48738 5916 13731 7375 14125 14433 24521 ...
$ annotation.source : Factor w/ 4 levels "ensembl","ensembl_havana",..: 2 2 2 2 2 2 2 2 2 2 ...
$ transcript_id.s. : Factor w/ 57955 levels "ENST00000000233,ENST00000415666,ENST00000459680,ENST00000463733,ENST00000467281,ENST00000489673",..: 17666 17669 17397 16695 5799 17850 14301 7 1276 12553 ...
$ length : num 1749 940 1073 1538 2430 ...
$ effective_length : num 1623 814 947 1412 2304 ...
$ expected_count : num 0 0 1 1 0 2 2 0 1 1 ...
$ TPM : num 0 0 0.27 0.18 0 0.23 0.07 0 0.65 0.17 ...
$ FPKM : num 0 0 0.41 0.27 0 0.35 0.11 0 0.98 0.25 ...
head(GSM3172784HC)
X annotation.gene_id annotation.gene_biotype annotation.gene_name
1 1 ENSG00000000003 protein_coding TSPAN6
2 2 ENSG00000000005 protein_coding TNMD
3 3 ENSG00000000419 protein_coding DPM1
4 4 ENSG00000000457 protein_coding SCYL3
5 5 ENSG00000000460 protein_coding C1orf112
6 6 ENSG00000000938 protein_coding FGR
annotation.source
1 ensembl_havana
2 ensembl_havana
3 ensembl_havana
4 ensembl_havana
5 ensembl_havana
6 ensembl_havana
transcript_id.s.
1 ENST00000373020,ENST00000494424,ENST00000496771,ENST00000612152,ENST00000614008
2 ENST00000373031,ENST00000485971
3 ENST00000371582,ENST00000371584,ENST00000371588,ENST00000413082,ENST00000466152,ENST00000494752
4 ENST00000367770,ENST00000367771,ENST00000367772,ENST00000423670,ENST00000470238
5 ENST00000286031,ENST00000359326,ENST00000413811,ENST00000459772,ENST00000466580,ENST00000472795,ENST00000481744,ENST00000496973,ENST00000498289
6 ENST00000374003,ENST00000374004,ENST00000374005,ENST00000399173,ENST00000457296,ENST00000468038,ENST00000475472
length effective_length expected_count TPM FPKM
1 1749.40 1623.17 0 0.00 0.00
2 940.50 814.28 0 0.00 0.00
3 1073.00 946.77 1 0.27 0.41
4 1538.00 1411.77 1 0.18 0.27
5 2430.11 2303.88 0 0.00 0.00
6 2350.00 2223.77 2 0.23 0.35
We can convert the factor columns to character:
library(dplyr)
GSM3172784HC <- GSM3172784HC %>%
mutate_if(is.factor, as.character)
Or with mutate/across (dplyr >= 1.0.0):
GSM3172784HC <- GSM3172784HC %>%
mutate(across(where(is.factor), as.character))
In base R, we can do
i1 <- sapply(GSM3172784HC, is.factor)
GSM3172784HC[i1] <- lapply(GSM3172784HC[i1], as.character)
NOTE: With R >= 4.0.0, stringsAsFactors = FALSE is the default, so read.table() no longer creates factor columns unless you ask for them.
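The numbers in HCData most likely come from the cbind() step rather than from the factor itself: cbind() builds a matrix, and a factor placed in a numeric matrix is reduced to its integer codes. A minimal sketch of the effect and one way around it, using a made-up three-row stand-in for the GEO table (the values and the names toy, codes_matrix and hc_data are assumptions for illustration):

```r
# Toy stand-in for the GEO table (hypothetical values)
toy <- data.frame(
  gene_name = factor(c("TSPAN6", "TNMD", "DPM1")),
  expected_count = c(0, 0, 1)
)

# cbind() returns a numeric matrix, so the factor collapses to its integer codes
codes_matrix <- cbind(toy$gene_name, toy$expected_count)
print(codes_matrix)

# Building a data frame from the character labels keeps the gene names
hc_data <- data.frame(
  gene_name = as.character(toy$gene_name),
  HC1 = toy$expected_count,
  stringsAsFactors = FALSE
)
print(hc_data)
```

Note that the str() output in the question shows 56233 levels for 57955 rows, so gene names repeat and could not be used directly as row names; annotation.gene_id has one level per row and may be a safer key.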
I am doing exploratory data analysis on a tibble. I've never used tibbles before, so I'm experiencing some difficulties.
My tibble data frame has this structure:
spec_tbl_df [7,397 x 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ X1 : num [1:7397] 9617 12179 9905 5745 10067 ...
$ Administrative : num [1:7397] 5 26 4 3 7 16 4 3 2 0 ...
$ Administrative_Duration: num [1:7397] 408 1562 58 103 165 ...
$ Informational : num [1:7397] 2 9 2 0 1 3 4 5 0 0 ...
$ Informational_Duration : num [1:7397] 47.5 503.7 28.5 0 28.5 ...
$ ProductRelated : num [1:7397] 54 183 82 25 115 86 75 23 27 33 ...
$ ProductRelated_Duration: num [1:7397] 1547 9676 4729 1109 3428 ...
$ BounceRates : num [1:7397] 0 0.0111 0 0 0 ...
$ ExitRates : num [1:7397] 0.01733 0.0142 0.01454 0.00167 0.01629 ...
$ PageValues : num [1:7397] 0 19.57 9.06 61.3 4.97 ...
$ SpecialDay : num [1:7397] 0 0 0 0 0 0 0 0 0 0 ...
$ Month : Factor w/ 10 levels "Aug","Dec","Feb",..: 8 8 8 1 8 4 8 7 8 8 ...
$ OperatingSystems : Factor w/ 8 levels "1","2","3","4",..: 2 3 2 2 2 3 3 4 8 2 ...
$ Browser : Factor w/ 13 levels "1","2","3","4",..: 2 2 2 2 2 2 2 1 2 5 ...
$ Region : Factor w/ 9 levels "1","2","3","4",..: 3 2 1 6 4 8 1 1 7 3 ...
$ TrafficType : Factor w/ 19 levels "1","2","3","4",..: 2 12 2 5 10 4 2 4 2 1 ...
$ VisitorType : Factor w/ 3 levels "New_Visitor",..: 3 3 3 1 3 3 3 3 1 3 ...
$ Weekend : Factor w/ 2 levels "FALSE","TRUE": 2 1 1 1 1 1 1 1 1 1 ...
$ Revenue : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
Now if I use plot_bar (from the DataExplorer package) to plot the categorical data, I have no problem. I would like, for example, to create a boxplot for the categorical variable "Month", with one box per month showing how the values are distributed. The problem is that I can't find a way to access the frequencies. If I do the following:
boxplot(Month)
It creates a single boxplot for all the data (all the months), which is not helpful at all.
I would like the months on the x axis and the frequencies on the y axis and a boxplot for each month.
I've tried to "extract" the Month feature, transform it to a matrix and repeat the process, but it does not work.
Here is the Month variable taken on its own:
> summary(x_Month)
Aug Dec Feb Jul June Mar May Nov Oct Sep
258 1034 123 259 166 1125 2014 1814 327 277
What am I missing ?
Something like this would probably work to create a bar plot of the frequencies of Month:
library(dplyr)    # provides the %>% pipe
library(ggplot2)
# spec_tbl_df stands for the name of your tibble
spec_tbl_df %>%
  ggplot(aes(x = Month)) +
  geom_bar()
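A boxplot summarises the distribution of a numeric variable, so calling boxplot() on the Month factor alone yields the single box described in the question. If the goal is one box per month, a numeric column has to be supplied through a formula; if the goal is the frequencies, table() returns them. A small base R sketch with simulated data (the data frame sessions and its values are invented; only the column names Month and PageValues follow the question):

```r
set.seed(1)

# Simulated stand-in for the tibble in the question
sessions <- data.frame(
  Month = factor(sample(c("Feb", "Mar", "May"), size = 60, replace = TRUE)),
  PageValues = rexp(60, rate = 0.1)
)

# Frequencies per month: the same counts summary() prints for a factor
month_counts <- table(sessions$Month)
barplot(month_counts, xlab = "Month", ylab = "Number of sessions")

# One box per month for a numeric variable
boxplot(PageValues ~ Month, data = sessions,
        xlab = "Month", ylab = "PageValues")
```

The frequencies are a single number per month, so a bar plot suits them; a box per month only makes sense for a variable with many values within each month.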
I'm trying to do the following:
I have a .csv file with N rows and 2 columns that I need to import and convert to a list.
Example file from .csv:
First seven rows of data
I import with command: points <- read.csv("points.csv")
'data.frame': 42 obs. of 2 variables:
$ Firefly : int 0 1 0 1 0 1 0 1 0 1 ...
$ Hawkes_times: Factor w/ 42 levels "[ 0.03485687 0.20167375 0.20275073
I need it as a sorted "List of 2" (one for each Firefly) with the following structure:
> str(points)
List of 2
$ : num [1:33] 0.79 0.87 0.88 0.89 0.94 1.01 1.13 1.19 ...
$ : num [1:14] 0.00 0.10 0.56 0.67 1.27 1.31 1.37 1.42 ...
where the first element represents Firefly == 0 and the second represents Firefly == 1.
I attempt the following:
fy0 <- subset(points,Firefly == 0)
fy1 <- subset(points,Firefly == 1)
points.list <- list(fy0,fy1)
> str(points.list)
List of 2
$ :'data.frame': 21 obs. of 2 variables:
..$ Firefly : int [1:21] 0 0 0 0 0 0 0 0 0 0 ...
..$ Hawkes_times: Factor w/ 42 levels "[ 0.03485687 0.20167375 0.20275073 0.20941455 0.40515277 0.47026309\n 0.55714817 0.64789982 0.70749241 "| __truncated__,..: 30 29 28 31 39 40 33 37 25 24 ...
$ :'data.frame': 21 obs. of 2 variables:
..$ Firefly : int [1:21] 1 1 1 1 1 1 1 1 1 1 ...
..$ Hawkes_times: Factor w/ 42 levels "[ 0.03485687 0.20167375 0.20275073 0.20941455 0.40515277 0.47026309\n 0.55714817 0.64789982 0.70749241 "| __truncated__,..: 26 32 21 23 20 41 34 22 27 36 ...
I think I need an as.numeric(fy0$Hawkes_times) somewhere, but I want to avoid loops since I will have hundreds of rows and n Firefly values (fy0, fy1, fy2, ... fyn).
Thank you!
-Richard
points <- data.frame(firefly=rep(0:1, times=10), times=1:20)
split(points$times, points$firefly)
# $`0`
# [1] 1 3 5 7 9 11 13 15 17 19
# $`1`
# [1] 2 4 6 8 10 12 14 16 18 20
This does not rely on equally-sized groups:
set.seed(42)
points <- data.frame(firefly=sample(0:1, size=20, replace=TRUE), times=1:20)
split(points$times, points$firefly)
# $`0`
# [1] 3 8 11 14 15 18 19
# $`1`
# [1] 1 2 4 5 6 7 9 10 12 13 16 17 20
As you can see, the order within each group is preserved.
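The split() calls above assume the times are already numeric. In the question each Hawkes_times cell is a bracketed string such as "[ 0.03485687 0.20167375 ...", so it has to be parsed first. A sketch under that assumption, with invented values and a hypothetical helper parse_times():

```r
# Toy data shaped like the question's CSV (values are made up)
points <- data.frame(
  Firefly = c(0L, 1L, 0L, 1L),
  Hawkes_times = c("[ 0.94 0.79\n 0.88 ]", "[ 0.10 0.00 ]",
                   "[ 1.01 0.87 ]", "[ 0.56 ]"),
  stringsAsFactors = TRUE  # mimic the factor column from the question
)

# Hypothetical helper: strip the brackets, then let scan() read the numbers
parse_times <- function(cell) {
  no_brackets <- gsub("\\[|\\]", "", cell)
  scan(text = no_brackets, quiet = TRUE)
}

# One numeric vector per row, using the labels rather than the factor codes
times_by_row <- lapply(as.character(points$Hawkes_times), parse_times)

# Group the rows by Firefly, then merge and sort the times within each group
points_list <- lapply(split(times_by_row, points$Firefly),
                      function(rows) sort(unlist(rows)))
str(points_list)
```

Because split() groups by whatever values Firefly takes, the same code covers any number of fireflies without an explicit loop. Wrapping the result in unname() gives the unnamed "List of 2" shown in the question.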
I've got a CSV file from the link "Hearthstone Arena Card Pickup probability".
It's just a list of vectors now, and I want to convert it into a 9-column data frame.
My current code is as follows, but it's not working at all.
hsd <- read.csv("hearthstonedraw.csv", header = TRUE)
hsd1 <- as.data.frame(hsd,ncol = 9)
hsd1
Credit for this answer goes to Maurits Evers and Adam Sampson.
read.csv can read directly from the address you indicate. It works out the number of columns automatically and, in R versions before 4.0.0, converts character columns into factors by default.
hsd1 <- read.csv("https://bnetcmsus-a.akamaihd.net/cms/gallery/LN4X4GN4W59R1532566073433.csv", header = TRUE)
str(hsd1)
# 'data.frame': 3931 obs. of 9 variables:
# $ Draft.Class : Factor w/ 9 levels "Druid","Hunter",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Card.Name : Factor w/ 995 levels "Abominable Bowman",..: 716 813 646 500 263 964 549 186 509 984 ...
# $ Rarity : Factor w/ 5 levels "basic","common",..: 1 1 2 2 2 1 1 2 5 2 ...
# $ Type : Factor w/ 3 levels "Minion","Spell",..: 2 2 2 2 1 2 2 1 2 2 ...
# $ Card.Class : Factor w/ 10 levels "druid","hunter",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Average : num 1.47 1.45 1.44 1.17 1.03 ...
# $ P.1.or.more.: num 0.78 0.776 0.774 0.696 0.649 ...
# $ P.2.or.more.: num 0.436 0.431 0.428 0.327 0.273 ...
# $ P.3.or.more.: num 0.1784 0.1757 0.1724 0.1081 0.0819 ...
ncol(hsd1)
# [1] 9
# There are 9 columns in the data frame
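Whether the text columns come back as factors depends on the R version, so it can help to set stringsAsFactors explicitly. A small sketch using an inline CSV instead of the download (the rows are placeholder values, not taken from the real file; cards_chr and cards_fct are hypothetical names):

```r
# Inline CSV standing in for the downloaded file (placeholder rows)
csv_text <- "Draft Class,Card Name,Average
Druid,Card A,1.47
Druid,Card B,1.45"

# Keep text columns as character vectors
cards_chr <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# Or request factors explicitly, matching the str() output shown above
cards_fct <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

str(cards_chr)
str(cards_fct)
ncol(cards_chr)
```

In both cases read.csv() turns the header "Draft Class" into the syntactic name Draft.Class, which is why the real file's columns carry dots.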
I have a question about R.
I am using a test called levene.test to test homogeneity of variance.
I know that you need a factor variable with at least two levels in order for this to work. And from what I see, I do have at least two levels for the factor variable that I am using. But somehow I keep getting the error of:
> nocorlevene <- levene.test(geno1rs11809462$SIF1, geno1rs11809462$k, correction.method = "correction.factor")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I even tried generating a variable from a binomial distribution:
k<-rbinom(1304, 1, 0.5)
and then using that as a factor, but it is still not working.
Lastly, I created a variable with 3 levels:
k<-sample(c(1,0,2), 1304, replace=T)
but somehow it is still not working, and I get the same error:
nocorlevene <- levene.test(geno1rs11809462$SIF1, geno1rs11809462$k, correction.method="zero.removal")
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
This is the output of the type of the variable in the data:
> str(geno1rs11809462)
'data.frame': 1304 obs. of 16 variables:
$ id : chr "WG0012669-DNA_A03_K05743" "WG0012669-DNA_A04_K05752" "WG0012669-DNA_A05_K05761" "WG0012669-DNA_A06_K05785" ...
$ rs11809462 : Factor w/ 2 levels "2/1","2/2": 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "names")= chr "WG0012669-DNA_A03_K05743" "WG0012669-DNA_A04_K05752" "WG0012669-DNA_A05_K05761" "WG0012669-DNA_A06_K05785" ...
$ FID : chr "9370" "9024" "14291" "4126" ...
$ AGE_CALC : num 61 47 NA 62.5 55.6 59.7 46.6 41.2 NA 46.6 ...
$ MREFSUM : num 185 325 NA 211 212 ...
$ NORSOUTH : Factor w/ 3 levels "0","1","NA": 1 1 3 1 1 1 1 1 3 1 ...
$ smoke1 : Factor w/ 3 levels "0","1","NA": 2 2 3 1 1 1 2 1 3 1 ...
$ smoke2 : Factor w/ 3 levels "0","1","NA": 1 1 3 2 2 2 1 2 3 2 ...
$ ANYCG60 : num 0 0 NA 1 0 0 0 0 NA 1 ...
$ DCCT_HBA_MEAN: num 7.39 6.93 NA 7.37 7.56 7.86 6.22 8.88 NA 8.94 ...
$ EDIC_HBA : num 7.17 7.63 NA 8.66 9.68 7.74 6.59 9.34 NA 7.86 ...
$ HBAEL : num 7.3 8.82 NA 9.1 9.3 ...
$ ELDTED_HBA : num 7.23 7.76 NA 8.36 9.21 7.92 6.64 9.64 NA 9.09 ...
$ SIF1 : num 19.6 17 NA 23.8 24.1 ...
$ sex : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 1 1 1 1 ...
$ k : Factor w/ 3 levels "0","1","2": 1 1 2 3 1 3 3 3 1 2 ...
As you can see, the variables k and sex have 3 and 2 levels respectively, but somehow I still get that error message.
> head(geno1rs11809462)
id rs11809462 FID AGE_CALC MREFSUM NORSOUTH smoke1 smoke2 ANYCG60
1 WG0012669-DNA_A03_K05743 2/2 9370 61.0 184.5925 0 1 0 0
2 WG0012669-DNA_A04_K05752 2/2 9024 47.0 325.0047 0 1 0 0
3 WG0012669-DNA_A05_K05761 2/2 14291 NA NA NA NA NA NA
4 WG0012669-DNA_A06_K05785 2/2 4126 62.5 211.2557 0 0 1 1
5 WG0012669-DNA_A08_K05802 2/2 11280 55.6 212.2922 0 0 1 0
6 WG0012669-DNA_A09_K05811 2/2 11009 59.7 261.0116 0 0 1 0
DCCT_HBA_MEAN EDIC_HBA HBAEL ELDTED_HBA SIF1 sex k
1 7.39 7.17 7.30 7.23 19.6136 0 0
2 6.93 7.63 8.82 7.76 17.0375 0 0
3 NA NA NA NA NA 1 1
4 7.37 8.66 9.10 8.36 23.8333 1 2
5 7.56 9.68 9.30 9.21 24.1338 1 0
6 7.86 7.74 8.53 7.92 25.7272 1 2
If anyone can give me some hints as to why this is happening, that would be great. I just don't understand why k or sex, which both have several levels, give me an error when I run the test.
thank you
I think I may have solved the problem. I believe it is due to NA values in the data: after I removed them with
x<-na.omit(original_data)
and then applied the Levene test to x, the error message disappeared.
Hopefully this is the cause of the problem.
If your factor has only one level, you will get this error. To check the levels of your factor variables, use lapply(df, levels). It returns NULL for non-factor variables, which makes it easy to identify the offending variable. This is especially helpful if, like me, you have hundreds of variables.
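One way NA values can lead to this error, offered as a plausible explanation rather than a confirmed diagnosis of the question's data: a factor may list two or more levels in str(), yet once the rows with a missing response are dropped, only one level may still have observations. A base R sketch with made-up numbers (dat, y and g are hypothetical names):

```r
# Made-up data: every row of group "1" has a missing response
dat <- data.frame(
  y = c(19.6, 17.0, NA, 23.8, 24.1, NA),
  g = factor(c("0", "0", "1", "0", "0", "1"))
)

nlevels(dat$g)  # the factor itself declares two levels

# Keep only complete rows, then count observations per level
complete_rows <- dat[complete.cases(dat), ]
table(complete_rows$g)

# After dropping unused levels, a single level remains
complete_rows$g <- droplevels(complete_rows$g)
nlevels(complete_rows$g)

# The lapply() check from the answer above: NULL marks non-factor columns
levels_per_column <- lapply(complete_rows, levels)
print(levels_per_column)
```

Running table() on the grouping variable after removing incomplete rows is a quick way to see whether each group still has data.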
You need to actually convert your variable to a factor. Having only three (or any finite number of) distinct values does not by itself make it a factor.
Use x <- factor(x) to convert.
When you look at the output of str(), it shows you the type of each variable:
<..cropped..>
$ SIF1 : num 19.6 17 NA 23.8 24.1 ...
$ sex : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 1 1 1 1 ...
$ k : Factor w/ 3 levels "0","1","2": 1 1 2 3 1 3 3 3 1 2 ...
Notice that k is a factor but SIF1 is not.
Thus, use
geno1rs11809462$SIF1 <- factor(geno1rs11809462$SIF1)
I’m a newbie to R, and I’m having trouble with an R predict command.
I receive this error
Error in `[.data.frame`(newdata, , as.character(object$formula[[2]])) :
undefined columns selected
when I execute this command:
model.predict <- predict.boosting(model,newdata=test)
Here is my model:
model <- boosting(Y~x1+x2+x3+x4+x5+x6+x7, data=train)
And here is the structure of my test data:
str(test)
'data.frame': 343 obs. of 7 variables:
$ x1: Factor w/ 4 levels "Americas","Asia_Pac",..: 4 2 4 2 4 3 3 3 4 1 ...
$ x2: Factor w/ 5 levels "Fifth","First",..: 3 3 2 2 4 2 4 4 1 1 ...
$ x3: Factor w/ 3 levels "Best","Better",..: 2 3 1 1 3 2 2 1 3 3 ...
$ x4: Factor w/ 2 levels "Female","Male": 1 1 2 1 1 2 1 2 2 2 ...
$ x5: int 82 55 47 31 6 53 77 68 76 86 ...
$ x6: num 22.8 14.6 25.5 38.3 7.9 32.8 4.6 34.2 36.7 21.7 ...
$ x7: num 0.679 0.925 0.897 0.684 0.195 ...
And the structure of my training data:
$ RecordID: int 1 2 3 4 5 6 7 8 9 10 ...
$ x1 : Factor w/ 4 levels "Americas","Asia_Pac",..: 1 2 2 3 1 1 1 2 2 4 ...
$ x2 : Factor w/ 5 levels "Fifth","First",..: 5 5 3 2 5 5 5 4 3 2 ...
$ x3 : Factor w/ 3 levels "Best","Better",..: 2 3 2 2 3 1 2 3 1 1 ...
$ x4 : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 2 1 1 ...
$ x5 : int 1 67 75 51 84 33 21 80 48 5 ...
$ x6 : num 21 13.8 30.3 11.9 1.7 13.2 33.9 17 3.4 19.5 ...
$ x7 : num 0.35 0.85 0.73 0.39 0.47 0.13 0.2 0.12 0.64 0.11 ...
$ Y : Factor w/ 2 levels "Green","Yellow": 2 2 1 2 2 2 1 2 2 2 ..
I think there’s a problem with the structure of the test data, but I can’t find it, or I have a misunderstanding as to how the “predict” command works. Note that if I run the predict command on the training data, it works. Any suggestions as to where to look?
Thanks!
predict.boosting() expects to be given the actual labels for the test data, so it can calculate how well it did (as in the confusion matrix shown below).
library(adabag)
data(iris)
iris.adaboost <- boosting(Species~Sepal.Length+Sepal.Width+Petal.Length+
Petal.Width, data=iris, boos=TRUE, mfinal=10)
# make a 'test' dataframe without the classes, as in the question
iris2 <- iris
iris2$Species <- NULL
# replicates the error
irispred=predict.boosting(iris.adaboost, newdata=iris2)
#Error in `[.data.frame`(newdata, , as.character(object$formula[[2]])) :
# undefined columns selected
Here's a working example, drawn largely from the help file, which also demonstrates the confusion matrix.
# first create subsets of iris data for training and testing
sub <- c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25))
iris3 <- iris[sub,]
iris4 <- iris[-sub,]
iris.adaboost <- boosting(Species ~ ., data=iris3, mfinal=10)
# works
iris.predboosting<- predict.boosting(iris.adaboost, newdata=iris4)
iris.predboosting$confusion
# (illustrative output; iris4 holds 25 rows per class, so real counts sum to 75)
# Observed Class
#Predicted Class setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 50 0
# virginica 0 0 50
If your y is a factor, this error can show up; try as.vector(y) ~ . in the formula instead.
The column names of the data that you use to predict should be exactly the same as the column names of training data.
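If no true labels exist for the test set, one workaround that follows from the explanation above is to add a placeholder response column so that the column lookup succeeds. The reported confusion matrix and error are then meaningless, and only the predicted classes should be read. This is a sketch, not documented adabag behaviour: the placeholder trick and the names train, test, response and missing_cols are assumptions, and the model-fitting part only runs if adabag is installed.

```r
set.seed(42)

# Split iris as in the answer above; the test set has no Species column
train_rows <- c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25))
train <- iris[train_rows, ]
test <- iris[-train_rows, names(iris) != "Species"]

model_formula <- Species ~ .
response <- all.vars(model_formula)[1]  # name of the response column

# Columns used in training that the test data lacks
missing_cols <- setdiff(names(train), names(test))
print(missing_cols)

# Placeholder labels with the same levels as the training response
test[[response]] <- factor(
  rep(levels(train[[response]])[1], nrow(test)),
  levels = levels(train[[response]])
)

if (requireNamespace("adabag", quietly = TRUE)) {
  library(adabag)
  model <- boosting(model_formula, data = train, mfinal = 10)
  predicted <- predict.boosting(model, newdata = test)$class
  print(head(predicted))
}
```

The setdiff() check also covers the point about column names: any name it reports is a column the model saw in training but cannot find in the new data.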