Compute Column in R

What is the difference between the two statements below? They render different outcomes, and since I am trying to come to R from SPSS, I am a little confused.
ds$share.all <- ds[132]/ ds[3]
mean(ds$share.all, na.rm=T)
and
ds$share.all2 <- ds$col1/ ds$Ncol2
mean(ds$share.all2, na.rm=T)
They render the same mean, but the first prints the output as
col1
0.02669424
while the second prints only the 0.02xxxxx value.
Any help will be much appreciated.

Indicating a column of a data frame with single brackets (your first example) produces a data frame containing just that column, whereas the $ operator (as in your second example) gives just a vector. Printing an object prints any names associated with it (the col1 in your first example). The data frame you get with ds[132] has a names attribute, but the vector you get with ds$col1 does not. The bracket equivalent of ds$col1 is double brackets rather than single: ds[[132]]. For example:
> x <- data.frame(1:10)
> names(x) <- "var"
> class(x$var)
[1] "integer"
> class(x[1])
[1] "data.frame"
> identical(x[1],x$var)
[1] FALSE
> identical(x[[1]],x$var)
[1] TRUE
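To see the same effect with the division itself, here is a minimal sketch using a made-up two-column data frame standing in for ds (the column names and positions are illustrative, not from the original data):

```r
# Toy stand-in for ds; column names and positions are made up
ds <- data.frame(a = c(2, 4), b = c(8, 16))

share_df  <- ds[1] / ds[2]   # single brackets: a one-column data frame
share_vec <- ds$a / ds$b     # $: a plain numeric vector, no names attribute

class(share_df)      # "data.frame"
class(share_vec)     # "numeric"
mean(share_df[[1]])  # 0.25
mean(share_vec)      # 0.25 -- same value, but printed without a column name
```

Printing share_df shows the column header "a" above the values; printing share_vec shows only the numbers, which is exactly the difference between the two outputs in the question.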

Is there an R function to check for and remove empty strings in a data frame?

Recently I wanted to process some TCGA data and draw Kaplan-Meier survival plots, but I accidentally found something strange:
> dat3
                TUMOR_STAGE OS_MONTHS  OS_STATUS ADORA2B   ENTPD1
TCGA.HQ.A2OE.01                 38.57   0:LIVING 45.6397 643.4637
TCGA.FJ.A3Z9.01                 12.65 1:DECEASED 25.3327 982.4690
There were two empty strings in the tumor stage column of the case-list data, and they were indeed empty quotes ("").
> is.character(dat3$TUMOR_STAGE)
[1] TRUE
> is.na(dat3$TUMOR_STAGE)
[1] FALSE FALSE
> which(dat3$TUMOR_STAGE == "")
[1] 1 2
It's not difficult to remove them; I used filter():
#dat is the actual dataframe
dat <- dat %>% filter(!TUMOR_STAGE == "")
But the question is: for a large downloaded data frame, what if I don't know whether there are such empty quotes? Is there an R function that can check for this and remove the rows/columns containing such values?
I'd do something like:
if ('' %in% dat$TUMOR_STAGE) { dat_new <- dat[!dat$TUMOR_STAGE %in% '', ] }
To apply it to all columns, you can extend this with a for loop.
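Extending that idea to every character column at once, here is a minimal sketch (the toy data below is made up for illustration):

```r
# Made-up toy data with empty strings scattered across character columns
dat <- data.frame(stage  = c("I", "", "III"),
                  status = c("LIVING", "DECEASED", ""),
                  months = c(38.6, 12.7, 5.1),
                  stringsAsFactors = FALSE)

# TRUE for any row that has "" in at least one character column
has_empty <- Reduce(`|`, lapply(dat, function(col) is.character(col) & col == ""))
dat_clean <- dat[!has_empty, ]
nrow(dat_clean)  # 1 -- only the row with no empty strings survives
```

The lapply step checks each column; numeric columns yield all-FALSE, so only character columns can flag a row for removal.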

UseMethod("type") error: no applicable method for 'type' applied to an object of class "c('double', 'numeric')"

In a data frame, I have a column that holds numeric values with some character data mixed in on some rows. I want to remove all rows with character data and keep the rows with numeric values. My actual df has 6 million rows, so I made a small object to try to solve the issue before implementing it at larger scale.
Here is what I did:
a <- c("fruit", "love", 53)
b <- str_replace_all("^[:alpha:]", 0)
Reading answers to other UseMethod errors on here (about factors), I tried changing "a" to as.character(a) and attempting "b" again, but I get the same error. I'm simply trying to turn any alphabetic value into the number zero, and I'm fairly new at all this.
There are several issues here, even in these two lines of code. First, a is a character vector, because one of its elements is a character string, which means your numeric 53 is coerced to the character "53".
> print(a)
[1] "fruit" "love" "53"
You've got the wrong syntax for str_replace_all. See the documentation for how to use it correctly. But that's not what you want here, because you want numerics.
The first thing you need to do is convert a to a numeric. A crude way of doing this is simply
> b <- as.numeric(a)
Warning message:
NAs introduced by coercion
> b
[1] NA NA 53
And then subset to include only the numeric values in b:
> b <- b[!is.na(b)]
> b
[1] 53
But whether that's what you want to do with a 6 million row dataframe is another matter. Please think about exactly what you would like to do, supply us with better test data, and ask your question again.
There's probably a more efficient way of doing this on a large data frame (e.g. something column-wise instead of row-wise), but to answer your specific question for the vector a:
as.numeric(stringr::str_replace_all(a, "[a-z]+", "0"))
Note that the replacing value must be a character (the last argument in the function call, "0"). (You can look up the documentation from your R-console by: ?stringr::str_replace_all)
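At data-frame scale, the same filtering idea can be applied to a whole column at once. A sketch, with made-up column and data values:

```r
# Made-up data frame whose 'value' column mixes words and numbers
df <- data.frame(value = c("fruit", "love", "53", "7.2"),
                 stringsAsFactors = FALSE)

num <- suppressWarnings(as.numeric(df$value))  # non-numbers become NA
df_numeric <- df[!is.na(num), , drop = FALSE]  # keep rows that parsed
df_numeric$value <- num[!is.na(num)]           # store as true numerics
df_numeric$value  # 53.0 7.2
```

This drops the non-numeric rows entirely rather than replacing them with zeros, which matches the stated goal of keeping only rows with number values.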

R is changing my variable value by itself

I have a dataframe that has an id field with values as these two:
587739706883375310
587739706883375408
The problem is that, when I ask R to show these two numbers, the output that I get is the following:
587739706883375360
587739706883375360
which are not the real values of my ID field. How do I solve that?
For your information: I have executed options(scipen = 999) so that R does not convert my numbers to scientific notation.
This problem also happens in the R console; if I enter these example numbers I get the same printed output as shown above.
EDIT: someone asked
dput(yourdata$id)
I did that and the result was:
c(587739706883375360, 587739706883375360, 587739706883375488, 587739706883506560, 587739706883637632, 587739706883637632, 587739706883703040)
To compare, the original data in the csv file is:
587739706883375310,587739706883375408,587739706883375450,587739706883506509,587739706883637600,587739706883637629,587739706883703070
I also did the following test with one of these numbers:
> 587739706883375408
[1] 587739706883375360
> as.double(587739706883375408)
[1] 587739706883375360
> class(as.double(587739706883375408))
[1] "numeric"
> is.double(as.double(587739706883375408))
[1] TRUE
These IDs are too large to store exactly as R doubles: a double has a 53-bit mantissa, so integers above 2^53 (about 9.0e15) can no longer all be represented exactly, and your 18-digit values get rounded to the nearest representable double. You can use the bit64 package to represent such large numbers:
library(bit64)
as.integer64("587739706883375408")
# integer64
# [1] 587739706883375408
as.integer64("587739706883375408") + 1
# integer64
# [1] 587739706883375409
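If the IDs come from a CSV, the rounding can also be avoided at read time by never letting them become doubles. A sketch, where the CSV content and column names are made up for illustration:

```r
# Read the ID column as character so no precision is lost;
# the inline CSV and its column names are illustrative
csv_text <- "id,score\n587739706883375310,1\n587739706883375408,2\n"
dat <- read.csv(text = csv_text, colClasses = c(id = "character"))
dat$id
# [1] "587739706883375310" "587739706883375408"
```

Keeping IDs as character strings is usually safest, since they are labels rather than quantities you do arithmetic on.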

Strange unexpected tokens inside the string

I have two seemingly identical strings in two data frames. For example, both
df_cont$winner[20]
df_assist$winner[609]
return "ivarovskaya"
But the comparison
identical(df_cont$winner[20], df_assist$winner[609])
returns FALSE.
So dplyr joins don't work on them, and when I count characters in those strings I get different numbers.
Then I found out that copying those strings from the View() panel into an R script reveals hidden characters. The output of the problem variables looks like this:
> df_cont$winner[20]
[1] "iva­rov­ska­ya"
> df_assist$winner[609]
[1] "ivarovskaya"
> nchar(df_cont$winner[20])
[1] 14
> nchar(df_assist$winner[609])
[1] 11
The dput() function also prints what look like identical strings:
> dput(df_cont$winner[20])
"iva­rov­ska­ya"
> dput(df_cont$winner[20])
"iva­rov­ska­ya"
How can I get rid of those strange red dots?
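The 14- versus 11-character difference suggests three hidden characters. Assuming they are Unicode soft hyphens (U+00AD), which print invisibly but show up as dots in RStudio's View() panel, they can be stripped with gsub(); a minimal sketch:

```r
# Assumed problem character: the Unicode soft hyphen (U+00AD)
s <- "iva\u00adrov\u00adska\u00adya"
nchar(s)  # 14 -- three invisible soft hyphens inflate the count
clean <- gsub("\u00ad", "", s)
identical(clean, "ivarovskaya")  # TRUE -- now it matches the other frame
```

Applying the same gsub() to the whole column (e.g. df_cont$winner) would make the joins and comparisons work again, if soft hyphens are indeed the culprit.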

Data frame column naming

I am creating a simple data frame like this:
qcCtrl <- data.frame("2D6"="DNS00012345", "3A4"="DNS000013579")
My understanding is that the column names should be "2D6" and "3A4", but they are actually "X2D6" and "X3A4". Why are the X's being added and how do I make that stop?
I do not recommend working with column names starting with numbers, but if you insist, use the check.names=FALSE argument of data.frame:
qcCtrl <- data.frame("2D6"="DNS00012345", "3A4"="DNS000013579",
check.names=FALSE)
qcCtrl
2D6 3A4
1 DNS00012345 DNS000013579
One of the reasons I caution against this is that the $ operator becomes trickier to work with. For example, the following fails with an error:
> qcCtrl$2D6
Error: unexpected numeric constant in "qcCtrl$2"
To get round this, you have to enclose your column name in back-ticks whenever you work with it:
> qcCtrl$`2D6`
[1] DNS00012345
Levels: DNS00012345
The X is being added because data.frame() runs column names through make.names() when check.names = TRUE (the default), and a syntactically valid R name cannot start with a digit, so an X is prepended. To turn this off, pass check.names = FALSE to data.frame(), as shown above.
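The renaming itself comes from make.names(), which can be called directly to see the default behavior; a quick sketch:

```r
# make.names() forces syntactically valid names; a name may not start
# with a digit, so an "X" is prepended
make.names(c("2D6", "3A4"))
# [1] "X2D6" "X3A4"
```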
