I have a .csv file with yes/no answers in one column. I opened it in my R compiler and tried to run pairs() on it; however, I get an error message of "non-numeric argument to pairs." I have attempted to change the yes/no responses to 0/1 values, but as.numeric() and as.factor() don't seem to do anything. I have also tried changing the data type from character to numerical in the data editor window that appears when I use the fix() function. That results in a column full of "NA".
How can I change the yes/no responses into something that will work with pairs() and with plot()?
I am fairly new to R and would much appreciate your help.
Logical vectors can be cast fairly directly into numbers using the unary plus shortcut +(.). For instance,
x <- c("yes","no","yes")
(x == "yes")
# [1] TRUE FALSE TRUE
+(x == "yes")
# [1] 1 0 1
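The same shortcut can be applied to a column of a data frame so that every column is numeric before calling pairs(). This is only a sketch: the data frame survey and its column names are made up to stand in for the data read from the .csv file.

```r
# Hypothetical stand-in for the data read from the .csv file
survey <- data.frame(
  age    = c(23, 35, 41),
  income = c(40, 52, 61),
  answer = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# as.character() guards against the column having been read as a factor
survey$answer <- +(as.character(survey$answer) == "yes")

# pairs() needs every column to be numeric
all_numeric <- all(sapply(survey, is.numeric))
```

Once all_numeric is TRUE, pairs(survey) and plot(survey) should accept the data frame.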
Recently I wanted to work with some TCGA data and draw some KM survival plots.
But I accidentally found something strange:
> dat3
                TUMOR_STAGE OS_MONTHS  OS_STATUS ADORA2B   ENTPD1
TCGA.HQ.A2OE.01                 38.57   0:LIVING 45.6397 643.4637
TCGA.FJ.A3Z9.01                 12.65 1:DECEASED 25.3327 982.4690
There were 2 empty strings in the patients' tumor stage column of the case list data, and they were indeed empty quotes ("").
> is.character(dat3$TUMOR_STAGE)
[1] TRUE
> is.na(dat3$TUMOR_STAGE)
[1] FALSE FALSE
> which(dat3$TUMOR_STAGE == "")
[1] 1 2
They are not difficult to remove; I use filter():
#dat is the actual dataframe
dat <- dat %>% filter(!TUMOR_STAGE == "")
But the question is, for a large downloaded dataframe, what if I don't know whether there is such "empty quotes"? Is there any R function that can be used to check this and remove the rows/columns containing such values?
I'd do something like:
if('' %in% dat$TUMOR_STAGE){dat_new = dat[!dat$TUMOR_STAGE%in%'',]}
To apply it to all columns, you can extend it with a for loop.
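A minimal sketch of that loop is shown here. The example data frame is made up for illustration, and only character columns are checked, which is an assumption about where empty strings can occur.

```r
# Hypothetical example data with empty strings in two different columns
dat <- data.frame(
  TUMOR_STAGE = c("", "Stage I", "Stage II"),
  OS_STATUS   = c("0:LIVING", "", "1:DECEASED"),
  OS_MONTHS   = c(38.57, 12.65, 20.10),
  stringsAsFactors = FALSE
)

dat_new <- dat
for (col in names(dat_new)) {
  # Only character columns can hold the empty string ""
  if (is.character(dat_new[[col]])) {
    # Drop every row whose value in this column is ""
    dat_new <- dat_new[!dat_new[[col]] %in% "", , drop = FALSE]
  }
}
```

Using %in% rather than == keeps rows whose value is NA, because NA %in% "" is FALSE instead of NA.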
I have a data frame of hospital discharge data. It has 99,779 records with 263 variables, including up to 50 ICD-10-CM diagnosis codes per record. I loaded the file including the subcommand "stringsAsFactors= FALSE" and then copied just the diagnosis codes to another df to make it easier to look at the data in RStudio. My current goal is to assign injury severity codes using icdpictr. I ran that program successfully, then looked at the output. As documented in the author's site, when the 7th character of the ICD-10-CM code is "B" or "C", the program ignores it, although it should not. So I want to change the 7th character from "B" or "C" to the character that triggers the attention. Here is where I run into a problem. Setting aside that I don't know how to write a function that will do this for each of my 50 variables, I anticipate writing 50 nearly identical statements like this:
mutate(temp = if_else(substr(DIAG1,7,7) == 'B' | substr(DIAG1,7,7) == 'C',
paste(substr(DIAG1,1,6),'A',sep=""),
DIAG1),
DIAG1 = temp, ...
I ran the program with just this one mutate command. This is the error message that appears:
Error: Problem with mutate() input temp.
x false must be a character vector, not a factor object.
i Input temp is if_else(...).
Although I loaded the DIAG variables as character, when I copied them to the other table, R -- without my permission -- turned them into factors. That was very efficient, but now I can't handle them as character type.
How do I solve this problem?
This may be because of the comparison with a factor: if_else, unlike ifelse, checks that the 'true' and 'false' values have the same type. In the OP's code, if 'DIAG1' is a factor and the 'false' case returns 'DIAG1', the 'true' case is character (because substr automatically coerces the factor to character) while the 'false' case is a factor. We can convert 'DIAG1' to character with as.character and it should work:
library(dplyr)
df2 <- df1 %>%
mutate(DIAG1 = as.character(DIAG1),
temp = if_else(substr(DIAG1, 7, 7) %in% c("B", "C"),
paste0(substr(DIAG1,1,6),'A'), DIAG1))
NOTE: When there is more than one value to compare against, instead of calling substr(DIAG1, 7, 7) twice and combining elementwise == comparisons with |, we can use %in% with a single substr call.
NOTE2: From R 4.0, read.csv/read.table and data.frame() use stringsAsFactors = FALSE by default. Previously, the default was TRUE. So it is worth checking the R version as well.
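To avoid writing 50 nearly identical statements, the same replacement can be wrapped in a small helper and applied to every diagnosis column. The following is a base R sketch rather than the dplyr approach above; the helper name fix_seventh, the example codes, and the assumption that all diagnosis columns start with "DIAG" are illustrative only.

```r
# Hypothetical example with factor columns, as in the question
df1 <- data.frame(
  DIAG1 = factor(c("S72001B", "S72001A", "J189")),
  DIAG2 = factor(c("S82101C", "I10", "S82101D"))
)

# Replace a 7th character of "B" or "C" with "A"; leave other codes unchanged
fix_seventh <- function(x) {
  x <- as.character(x)                      # factors become plain strings
  hit <- substr(x, 7, 7) %in% c("B", "C")
  x[hit] <- paste0(substr(x[hit], 1, 6), "A")
  x
}

# Assumption: every diagnosis column name starts with "DIAG"
diag_cols <- grep("^DIAG", names(df1), value = TRUE)
df1[diag_cols] <- lapply(df1[diag_cols], fix_seventh)
```

Codes shorter than seven characters are left alone, because substr() returns an empty string for them.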
I have a large dataframe in R and am trying to do some stats tests on certain columns, but the non-programmers who made the csv file added a bunch of text notes that I need to ignore.
For example a column might have values: 12,20,40,missing,64,32,no input,45,10
How do I only select the numbers using the which statement?
I failed miserably trying:
my_data_frame$Column.Title[which(is.numeric(my_data_frame$Column.Title))]
What do I change in the which function to only select the numbers and ignore the text? Thanks!
You can use the built-in as.numeric() converter to do something like this:
x <- my_data_frame$Column.Title
xn <- as.numeric(x)
which(!is.na(xn))
This won't distinguish between NAs created by failed coercion and pre-existing (numeric) NA values.
If there's a small enough variety of "missing" values you could read the data in with read.csv(..., na.strings=c("NA","missing","no input"))
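As a sketch of the coercion approach using the example values from the question (plus one genuine NA added for illustration), the positions that failed coercion can be told apart from pre-existing NA values by comparing against the original vector:

```r
# Values from the question, plus one pre-existing NA (added for illustration)
x <- c("12", "20", "40", "missing", "64", "32", "no input", "45", "10", NA)

# as.character() first, in case the column was read as a factor;
# suppressWarnings() hides the "NAs introduced by coercion" warning
xn <- suppressWarnings(as.numeric(as.character(x)))

numeric_idx <- which(!is.na(xn))            # positions holding real numbers
failed_idx  <- which(is.na(xn) & !is.na(x)) # text notes that failed coercion
numbers     <- xn[numeric_idx]
```

The original attempt fails because is.numeric() returns a single TRUE or FALSE for the whole vector rather than one value per element.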
What is the equivalent of SQL Server's ISNUMERIC function in RStudio? I am trying to migrate some SQL logic to RStudio, and I have a column that holds both char values and int values. I want to take only the int values and update them to -1 in an R data.table. Please help me solve the problem.
I have attached the results as an image; column "A" holds the current values, and I expect to get values like those in column B.
R also has data type tools (as in SQL and other languages) such as is.numeric() and is.integer(). Normally these return boolean values, but you could use sub() or gsub() to turn the TRUE results into -1:
example <- list(123, 321, "not numeric", as.Date("2018/01/01"))
gsub(T, -1, sapply(example, is.numeric))
[1] "-1" "-1" "FALSE" "FALSE"
Also, note that in R numeric is different from integer.
example <- list(as.integer(123), 321, "not numeric", as.Date("2018/01/01"))
example[sapply(example, is.integer)] <- -1
example
[[1]]
[1] -1
[[2]]
[1] 321
[[3]]
[1] "not numeric"
[[4]]
[1] "2018-01-01"
You can convert them back and forth with as.numeric() and as.integer(). Also, note that in R data types in this sense are referred to as the class or classes of the data, whereas the type in R refers to the storage or R internal data type.
I think if you're specifically interested in integers, then the question above is a duplicate of the following:
Check if the number is integer
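Your if condition would be something like x == round(x, 0). This will be TRUE for whole-number values, whether they are stored as integer or double, and FALSE for values with a fractional part. It cannot be applied to non-numeric classes such as character.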
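A short illustration of that condition, with made-up values, shows that it tests the value rather than the storage type:

```r
x <- c(1, 2.5, 3, -4)                # stored as double

is_whole <- x == round(x, 0)         # whole-number test on the values
is_integer_storage <- is.integer(x)  # storage-type test on the vector
```

Here is_whole is TRUE for 1, 3 and -4, while is_integer_storage is FALSE because the vector is stored as double.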
I finally fixed this issue by following the steps below.
First, I captured all numeric values in a separate data table using the script below:
CustomDerivedL2AMID <- subset(DimCombinedEnduser$DRVDEUL2AMID, grepl('^\\d+$', DimCombinedEnduser$DRVDEUL2AMID))
library(data.table)
HandleDerivedL2AMID <-data.table(CustomDerivedL2AMID)
Then I matched the HandleDerivedL2AMID table results against the original data table and replaced all matching values with -1:
DCE$DRVDEUL2AMID <- replace(DCE$DRVDEUL2AMID,DCE$DRVDEUL2AMID %in% HandleDerivedL2AMID$CustomDerivedL2AMID,'-1')
Now I see only character values; there are no more numeric values in the data set under DRVDEUL2AMID.
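The two steps can also be collapsed into a single logical index. This is a sketch on a made-up character vector, not the original table, and it treats only digit-only strings as numeric, which is narrower than SQL Server's ISNUMERIC.

```r
# Hypothetical stand-in for the DRVDEUL2AMID column
ids <- c("12345", "ABC12", "987", "N/A", "00042")

# TRUE where the whole string consists of digits (same idea as '^\\d+$')
is_all_digits <- grepl("^[0-9]+$", ids)

# Replace the digit-only entries in place
ids[is_all_digits] <- "-1"
```

For a data frame or data.table column, the same logical vector can be used to index that column directly.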
A and B should be the same dataframe. A is generated in R; B is A exported and then imported back into R.
Both have dimensions 49 x 97, with the first column characters and all other columns numbers.
str() lists them as "chr" and "num" respectively.
Depending on how I look at the number columns, sometimes R finds them identical and sometimes not:
> identical(A,B)
FALSE
#The dataframes A and B are not the same
> identical(A[,1],B[,1])
TRUE
#The character-containing columns are the same
> identical(A[,-1],B[,-1])
FALSE
#The number-containing columns are not the same
> identical(matrix(A[,-1]),matrix(B[,-1]))
TRUE
#If the number-containing columns are converted into a matrix, they are the same
> identical(as.matrix(A[,-1]),as.matrix(B[,-1]))
FALSE
> identical(as.matrix(A[1:49,-1]),as.matrix(B[1:49,-1]))
TRUE
#But if they're converted into a matrix using as.matrix() instead of
# matrix() they're only the same if the 49 rows are explicitly indexed
My question:
What is the difference in how R interprets the numbers?
Are they sometimes treated as doubles and sometimes as floating points?
How do you know when R will do one or the other, and can I be sure that A and B really are the same?
EDIT: my advice after another 2 yrs experience in R:
use all.equal() instead of identical() to see an explanation of what's different and to ignore minute rounding errors
use saveRDS() and readRDS() to export and re-import with exact same format (and much faster)
remember that matrix() and as.matrix() can behave differently
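A small sketch of the first two points is shown here. The data frame and the temporary file paths are made up for illustration; the CSV round trip is an assumption about how the export and re-import were done.

```r
A <- data.frame(
  id    = c("a", "b", "c"),
  value = c(0.1 + 0.2, 1 / 3, 2),
  stringsAsFactors = FALSE
)

# Round trip through text: values are written with limited precision
csv_path <- tempfile(fileext = ".csv")
write.csv(A, csv_path, row.names = FALSE)
B <- read.csv(csv_path, stringsAsFactors = FALSE)

csv_identical <- identical(A, B)          # typically FALSE after a text round trip
csv_all_equal <- isTRUE(all.equal(A, B))  # tolerates minute rounding differences

# Round trip through RDS: the object is stored in R's own binary format
rds_path <- tempfile(fileext = ".rds")
saveRDS(A, rds_path)
C <- readRDS(rds_path)

rds_identical <- identical(A, C)
```

Calling all.equal(A, B) without isTRUE() prints a description of the differences when there are any.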
Please read the help page for identical. It applies a much more stringent and extensive set of tests than just checking whether numerical entries are the same. All of the attributes of R objects, including names and non-printing attributes, are checked for identity. If you want to check numerical equivalence, then perhaps you should first strip the objects of attributes, perhaps with as.vector or similar functions.
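As an illustration of how attributes alone can make identical() return FALSE, consider this made-up example in which two data frames hold the same numbers but differ in their row names:

```r
A <- data.frame(x = c(1, 2), y = c(3, 4))
B <- A
rownames(B) <- c("r1", "r2")   # same numbers, different attribute

same_object <- identical(A, B)

# Strip the attributes before comparing the numbers themselves
same_values <- identical(as.vector(as.matrix(A)), as.vector(as.matrix(B)))
```

Here same_object is FALSE and same_values is TRUE; comparing attributes(A) with attributes(B) shows where the two objects differ.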