In R scientific use of values (e.g. change 8e-04) [duplicate] - r

For some reason, when I convert a character column of numbers in scientific notation to numeric, the decimals aren't preserved.
> str(output)
'data.frame': 213950 obs. of 2 variables:
$ ColA : chr ".3370E+03" ".3375E+03" ".3380E+03" ".3385E+03" ...
$ ColB : chr ".4942E+00" ".5295E+00" ".5682E+00" ".6091E+00" ...
> output$ColA = as.numeric(output$ColA)
> str(output)
'data.frame': 213950 obs. of 2 variables:
$ ColA : num 337 338 338 338 339 ...
$ ColB : chr ".4942E+00" ".5295E+00" ".5682E+00" ".6091E+00" ...
I would expect it to read:
$ ColA : num 337 337.5 338 338.5 ...
I tried the solution from this SO question, but no luck:
> options(digits=9)
> str(output)
'data.frame': 213950 obs. of 2 variables:
$ ColA : num 337 338 338 338 339 ...
$ ColB : chr ".4942E+00" ".5295E+00" ".5682E+00" ".6091E+00" ...
What's going on?

You can turn off scientific notation when printing numbers with the option below:
options(scipen = 999)
This makes all numbers print in fixed (decimal) notation.
To revert to the default, use
options(scipen = 0)
Use getOption("scipen") to check the current value; see ?options for details.
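Note that scipen only affects how numbers are printed, not the values that are stored. In the question, as.numeric() has most likely kept the decimals, and it is str() that abbreviates numbers to 3 significant digits by default. A minimal sketch, assuming a small stand-in for the output data frame (the four strings are copied from the question):

```r
# Stand-in for the question's data frame: first four ColA strings only
output <- data.frame(
  ColA = c(".3370E+03", ".3375E+03", ".3380E+03", ".3385E+03"),
  stringsAsFactors = FALSE
)

# Convert the character column to numeric
output$ColA <- as.numeric(output$ColA)

# Printing the vector shows that the decimals are still there
print(output$ColA)

# str() abbreviates to 3 significant digits by default ...
str(output)

# ... and its digits.d argument raises that display limit
str(output, digits.d = 5)
```

If the goal is only to inspect the values, head(output$ColA) or print() is usually enough; the data themselves do not need fixing.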

Related

I have a column of characters that I can't use when creating visuals: I have tried multiple transformations and nothing is working. Please Assist

I'm on day two of working with an awesome chi-square analysis. I have all my data cleaned and my matrices and data frames created, and the tests run. However, I am having a heck of a time trying to create the visualizations.
The Frequency Table
Some of the code I've been running:
> M <- cor(Top25.ForVIZ)
Error in cor(Top25.ForVIZ) : 'x' must be numeric
> M.4.viz <- as.table(as.matrix(Top25.ForVIZ))
> M <- cor(M.4.viz)
Error in cor(M.4.viz) : 'x' must be numeric
> corrplot::corrplot(M.4.viz)
Error in corrplot::corrplot(M.4.viz) : The matrix is not in [-1, 1]!
> head(Top25.ForVIZ)
[Header For Data][1]
> cormat <- round(cor(Top25.ForVIZ),2)
Error in cor(Top25.ForVIZ) : 'x' must be numeric
> topic_name <- as.factor(Top25.ForVIZ$topic_name)
> str(Top25.ForVIZ)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 25 obs. of 4 variables:
$ topic_name : chr "Accenture (ACN)" "Amazon (AMZN)" "Belgacom (PROX)" "Checking Account" ...
$ TTL.Score.Pre.Disconnect.3.week.cons: num 1489 1519 1833 2441 1560 ...
$ Active.Segment.1 : num 1166 2115 1024 2383 2931 ...
$ TTL.Score.3.Week.Post.Diconnect : num 1546 1712 1401 1683 1587 ...
> cormat <- round(cor(Top25.ForVIZ),2)
Error in cor(Top25.ForVIZ) : 'x' must be numeric
> topic_name <- as.numeric(Top25.ForVIZ$topic_name)
Warning message:
NAs introduced by coercion
> str(Top25.ForVIZ)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 25 obs. of 4 variables:
$ topic_name : chr "Accenture (ACN)" "Amazon (AMZN)" "Belgacom (PROX)" "Checking Account" ...
$ TTL.Score.Pre.Disconnect.3.week.cons: num 1489 1519 1833 2441 1560 ...
$ Active.Segment.1 : num 1166 2115 1024 2383 2931 ...
$ TTL.Score.3.Week.Post.Diconnect : num 1546 1712 1401 1683 1587 ...
[1]: https://i.stack.imgur.com/DGdHh.jpg
My end goal is to be able to work with the data in corrplot, ggplot, etc. I would also like to understand, from a logic perspective, what I am doing wrong.
TIA.
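One possible explanation, offered as a sketch rather than a definitive answer: cor() accepts only numeric input, and topic_name is a character column of labels, so it has to be left out of the correlation rather than coerced (as.numeric() on text such as "Accenture (ACN)" can only give NA). Note also that the as.factor() and as.numeric() calls above assign to a separate variable named topic_name, so the data frame itself never changes. The example below uses a hypothetical four-row stand-in for Top25.ForVIZ with shortened column names (pre, seg, post); the values are the first four shown in the str() output.

```r
# Hypothetical stand-in for Top25.ForVIZ (shortened column names are assumptions)
Top25 <- data.frame(
  topic_name = c("Accenture (ACN)", "Amazon (AMZN)",
                 "Belgacom (PROX)", "Checking Account"),
  pre  = c(1489, 1519, 1833, 2441),
  seg  = c(1166, 2115, 1024, 2383),
  post = c(1546, 1712, 1401, 1683),
  stringsAsFactors = FALSE
)

# Identify the numeric columns and correlate only those
num_cols <- vapply(Top25, is.numeric, logical(1))
M <- round(cor(Top25[, num_cols]), 2)
print(M)

# M is now a correlation matrix with entries in [-1, 1], which is the kind of
# input corrplot expects (not run; assumes the corrplot package is installed):
# corrplot::corrplot(M)
```

The "not in [-1, 1]" error appeared because the raw score table, not a correlation matrix, was passed to corrplot().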

cramer.test: NAs introduced by coercion

I know there is a lot of information on Google about this problem, but I could not solve it.
I have a data frame:
> str(myData)
'data.frame': 1199456 obs. of 7 variables:
$ A: num 3064 82307 4431998 1354 193871 ...
$ B: num 6067 403916 2709997 2743 203434 ...
$ C: num 299 11752 33282 170 2748 ...
$ D: num 105 6676 7065 20 1593 ...
$ E: num 8 572 236 3 170 ...
$ F: num 0 21 95 0 13 ...
$ G: num 583 18512 961328 348 42728 ...
Then I convert it to a matrix in order to apply the Cramer-von Mises test from the "cramer" package:
> myData = as.matrix(myData)
> str(myData)
num [1:1199456, 1:7] 3064 82307 4431998 1354 193871 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:1199456] "8" "32" "48" "49" ...
..$ : chr [1:7] "A" "B" "C" "D" ...
After that, if I apply a "cramer.test(myData[x1:y1,], myData[x2:y2,])" I get the following error:
Error in rep(0, (RVAL$m + RVAL$n)^2) : invalid 'times' argument
In addition: Warning message:
In matrix(rep(0, (RVAL$m + RVAL$n)^2), ncol = (RVAL$m + RVAL$n)) :
NAs introduced by coercion
I also tried to convert the data frame to a matrix like this, but the error is the same:
> myData = as.matrix(sapply(myData, as.numeric))
> str(myData)
num [1:1199456, 1:7] 3064 82307 4431998 1354 193871 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:7] "A" "B" "C" "D" ...
Your problem is that your data set is too large for the algorithm that cramer.test uses (at least the way it's coded). The code tries to create a lookup table according to
lookup <- matrix(rep(0, (RVAL$m + RVAL$n)^2),
ncol = (RVAL$m + RVAL$n))
where RVAL$m and RVAL$n are the numbers of rows of the two samples. The standard maximum length of an R vector is 2^31-1 on a 32-bit platform. If your samples have equal numbers of rows N, you'll be trying to create a vector of length (2*N)^2, which in your case is 5.754779e+12: probably too big even if R would let you create the vector.
You may have to look for another implementation of the test, or another test.
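The size estimate can be checked with a few lines of arithmetic. This is only a sketch: it assumes, as the answer does, that both samples have N = 1199456 rows (the full data set), and that each cell of the lookup table is stored as an 8-byte double.

```r
# Number of rows per sample, taken from str(myData) in the question
N <- 1199456

# The lookup table has (m + n)^2 cells; with m = n = N that is (2 * N)^2
cells <- (2 * N)^2
print(cells)

# Compare with the classic vector length limit of 2^31 - 1
print(cells > .Machine$integer.max)

# Rough memory needed if every cell is an 8-byte double, in TiB
tib <- cells * 8 / 2^40
print(tib)
```

Even on a platform that allows long vectors, a table of that size runs to tens of terabytes, so testing on much smaller subsamples is the only way this implementation could work.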

Trouble converting list into factor in R

I am having problems creating a boxplot of my data, because one of my variables is in the form of a list.
I am trying to create a boxplot:
boxplot(dist~species, data=out)
and received the following error:
Error in model.frame.default(formula = dist ~ species, data = out) :
invalid type (list) for variable 'species'
I have been unsuccessful in forcing 'species' into the form of a factor:
out[species]<- as.factor(out[[out$species]])
and receive the following error:
Error in .subset2(x, i, exact = exact) : invalid subscript type 'list'
How can I convert my 'species' column into a factor which I can then use to create a boxplot? Thanks.
EDIT:
str(out)
'data.frame': 4570 obs. of 6 variables:
$ GridRef : chr "NT73" "NT80" "NT85" "NT86" ...
$ pred : num 154 71 81 85 73 99 113 157 92 85 ...
$ pred_bin : int 0 0 0 0 0 0 0 0 0 0 ...
$ dist : num 20000 10000 9842 14144 22361 ...
$ years_since_1990: chr "21" "16" "21" "20" ...
$ species :List of 4570
..$ : chr "C.splendens"
..$ : chr "C.splendens"
..$ : chr "C.splendens"
.. [list output truncated]
It's hard to imagine how you got the data into this form in the first place, but it looks like
out <- transform(out,species=unlist(species))
should solve your problem.
set.seed(101)
f <- as.list(sample(letters[1:5], replace = TRUE, size = 100))
## need I() to make a wonky data frame with a list column ...
d <- data.frame(y = runif(100), f = I(f))
str(d)
## 'data.frame': 100 obs. of 2 variables:
## $ y: num 0.125 0.0233 0.3919 0.8596 0.7183 ...
## $ f:List of 100
## ..$ : chr "b"
## ..$ : chr "a"
try(boxplot(y ~ f, data = d)) ## fails: invalid type (list) ...
d2 <- transform(d, f = unlist(f)) ## flatten the list column to a character vector
boxplot(y ~ f, data = d2)
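Since the question asks specifically for a factor, the same idea can be written with a direct assignment. A minimal sketch, assuming every list element holds exactly one string; the four-row data frame and the second species name "C.virgo" are made up for illustration:

```r
# Hypothetical miniature of 'out' with a list column, built with I()
out <- data.frame(
  dist    = c(20000, 10000, 9842, 14144),
  species = I(list("C.splendens", "C.splendens", "C.virgo", "C.virgo"))
)

# unlist() only lines up with the rows if each element has length 1
stopifnot(all(lengths(out$species) == 1))

# Flatten the list column and store it back as a factor
out$species <- factor(unlist(out$species))
str(out)

# The formula interface now accepts the column
boxplot(dist ~ species, data = out)
```

The lengths() check guards against elements that are empty or hold several values, in which case unlist() would return a vector of the wrong length.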

How to make sublist/extract expression data of candidate genes from normalized microarray list

I have several processed microarray data (normalized, .txt files) from which I want to extract a list of 300 candidate genes (ILMN_IDs). I need in the output not only the gene names, but also the expression values and statistics info (already present in the original file).
I have 2 dataframes:
normalizedData with the identifiers (gene names) in the first column, named "Name".
candidateGenes with a single column named "Name", containing the identifiers.
I've tried
1).
all=normalizedData
subset=candidateGenes
x=all%in%subset
2).
all[which(all$gene_id %in% subset)] #(as suggested in other bioinf. forum)#,
but it returns a data frame with 0 columns and >4000 rows. This is not correct, since normalizedData has 24 columns. I have tried other ways to compare them, but I always get an error.
The key is to be able to compare the first column of all ("Name") with subset. Here is the info:
> class(all)
[1] "data.frame"
> dim(all)
[1] 4312 24
> str(all)
'data.frame': 4312 obs. of 24 variables:
$ Name: Factor w/ 4312 levels "ILMN_1651253": 3401..
$ meanbgt:num 0 ..
$ meanbgc: num ..
$ cvt: num 0.11 ..
$ cvc: num 0.23 ..
$ meant: num 4618 ..
$ stderrt: num 314.6 ..
$ meanc: num 113.8 ...
$ stderrc: num 15.6 ...
$ ratio: num 40.6 ...
$ ratiose: num 6.21 ...
$ logratio: num 5.34 ...
$ tp: num 1.3e-04 ...
$ t2p: num 0.00476 ...
$ wilcoxonp: num 0.0809 ...
$ tq: num 0.0256 ...
$ t2q: num 0.165 ...
$ wilcoxonq: num 0.346 ...
$ limmap: num 4.03e-10 ...
$ limmapa: num 4.34e-06 ...
$ SYMBOL: Factor w/ 3696 levels "","A2LD1",..
$ ENSEMBL: Factor w/ 3143 levels "ENSG00000000003",..
and here is the info about subset:
> class(subset)
[1] "data.frame"
> dim(subset)
[1] 328 1
> str(subset)
'data.frame': 328 obs. of 1 variable:
$ V1: Factor w/ 328 levels "ILMN_1651429",..: 177 286 47 169 123 109 268 284 234 186 ...
I really appreciate your help!
What you need to do is
all[all$Name %in% subset$V1, ]
When using a data.frame, it's important to drill down to the correct column that has the data you actually want to use. You need to know which columns hold the matching IDs. That's the only way this solution really differs from the other suggestions and the things you've tried.
It's also important to note that when subsetting a data.frame by rows, you need to use the [,] syntax, where the vector before the comma indicates rows and the vector after it indicates columns. Here, since you want all columns, we leave it empty.
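The row-subsetting pattern can be tried on a toy example. This is a sketch with made-up identifiers and values; the names all_genes, candidates and hits are hypothetical stand-ins for all, subset and the result:

```r
# Toy version of the normalized data: IDs in 'Name', plus two value columns
all_genes <- data.frame(
  Name  = factor(c("ILMN_1", "ILMN_2", "ILMN_3", "ILMN_4")),
  meant = c(4618, 120, 87, 950),
  ratio = c(40.6, 1.2, 0.9, 3.3)
)

# Toy candidate list: a single factor column called V1, as in the question
candidates <- data.frame(V1 = factor(c("ILMN_4", "ILMN_2")))

# Compare column against column, and keep every column of the matching rows
hits <- all_genes[all_genes$Name %in% candidates$V1, ]
print(hits)
```

Rows come back in the order of all_genes, not in the order of the candidate list, and %in% compares factors by their labels, so differing level sets are not a problem.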
