I'm struggling to convert a character vector to numeric in R. I import a dataframe from csv with:
a = read.csv('myData.csv', header=T, stringsAsFactors=F)
One of my factors, fac1, is a vector of numbers but contains some instances of "na" and "nr". Hence, typeof(a$fac1) returns "character"
I create a new dataframe without "na" and "nr" entries
k = a[a$fac1 != "na" & a$fac1 != "nr", ]
I then try to convert fac1 to numeric with:
k$fac1_num = as.numeric(k$fac1)
The problem is that this doesn't work, as typeof(k$fac1_num) now returns "double" instead of "numeric"
Can anyone suggest a fix / point out what I'm doing wrong? Thanks in advance!
Try just coercing to numeric:
a = read.csv('myData.csv', header=T, stringsAsFactors=F)
a$fac1_num = as.numeric(a$fac1)
If you need to subset (which is generally not needed, and I would advise against doing it routinely, since the other column values may tell you something about the "reality" behind the data), then just:
k <- a[ !is.na(a$fac1_num) , ]
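That way you will still have the original character value in the a data-object and can examine its values if needed. The proper test for "numericy" is is.numeric(); typeof() reports the storage type, which is "double" for an ordinary numeric vector, so the conversion did work.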
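To make the typeof() versus is.numeric() point concrete, here is a minimal sketch. The fac1 vector is a made-up stand-in for the column in the question, not the asker's real data.

```r
# Hypothetical stand-in for the fac1 column from the question.
fac1 <- c("1.5", "na", "3", "nr", "10")

# Strings that are not numbers become NA (R warns "NAs introduced by coercion").
fac1_num <- suppressWarnings(as.numeric(fac1))

typeof(fac1_num)      # storage type: "double"
class(fac1_num)       # "numeric"
is.numeric(fac1_num)  # TRUE

# Keep only the values that converted cleanly.
clean <- fac1_num[!is.na(fac1_num)]
```

So "double" is not a failure; it is simply how R stores non-integer numeric vectors.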
Try to use sapply with mode :
sapply(your_df, mode)
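As a rough illustration of what that call reports, the sketch below compares mode, typeof and class on a small made-up data frame (the column names are assumptions for the example only).

```r
# Hypothetical data frame with one character, one integer and one double column.
your_df <- data.frame(id = c("a", "b"),
                      n  = c(1L, 2L),
                      x  = c(0.5, 1.5),
                      stringsAsFactors = FALSE)

modes   <- sapply(your_df, mode)    # integer and double both report "numeric"
types   <- sapply(your_df, typeof)  # distinguishes "integer" from "double"
classes <- sapply(your_df, class)
```

mode() is the coarsest of the three, which is why it is handy for a quick "is this column numeric at all?" check.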
I try to combine one column with a data.frame. I used both cbind() and data.frame(), but after that the character variable became a numeric one.
> is.character(new_listing_zip)
[1] TRUE
> new_race_disp_use2 <- cbind(new_listing_zip,opo_trans)
> is.character(new_race_disp_use2$new_listing_zip)
[1] FALSE
> is.character(new_listing_zip)
[1] TRUE
> new_race_disp_use2 <- data.frame(new_listing_zip,opo_trans)
> is.character(new_race_disp_use2$new_listing_zip)
[1] FALSE
Could anyone help me with this? Thank you.
if you check the help files for data.frame() I think you will find your answer
?data.frame
You'll want to set
options(stringsAsFactors = FALSE)
to change it globally, or just pass
stringsAsFactors = FALSE
when declaring your data.frame, assuming these are actual character strings that should stay character. Otherwise I would simply declare your variable as a factor when joining it
new_race_disp_use2 <- cbind(factor(new_listing_zip),opo_trans)
Now of course if your 'factor' is actually a numeric you want as a string (seemingly zip codes in your example) you'll want to either set your zip codes as strings to begin with using quotes (i.e. "12345") or set the data type after the data.frame is built
new_race_disp_use2$new_listing_zip <- as.character(new_race_disp_use2$new_listing_zip)
or
as.factor(varName)
or simply
factor() instead of as.character()
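Here is a small sketch of the difference, using invented zip codes and an invented opo_trans data frame (neither comes from the asker's data). Note that from R 4.0.0 the default is already stringsAsFactors = FALSE, so the factor conversion mainly shows up on older versions.

```r
# Hypothetical inputs standing in for the asker's objects.
new_listing_zip <- c("02139", "10001", "94105")
opo_trans <- data.frame(price = c(100, 200, 300))

# Keep the zip codes as character strings.
kept <- data.frame(new_listing_zip, opo_trans, stringsAsFactors = FALSE)
is.character(kept$new_listing_zip)   # TRUE

# Explicitly ask for factors instead.
fct <- data.frame(new_listing_zip, opo_trans, stringsAsFactors = TRUE)
is.factor(fct$new_listing_zip)       # TRUE

# Converting a factor column back to character after the fact.
back <- as.character(fct$new_listing_zip)
```

Keeping zip codes as character also preserves leading zeros, which a numeric column would lose.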
I am new in R, just have a couple of months using this software.
In a dataframe, I have some values with an apostrophe and I would like to change it to another word.
I tried this:
data$HomeTeam[data$HomeTeam=="M'Gladbach"]<-"Gladbach"
but I get a Warning Message:
In [<-.factor(*tmp*, data$HomeTeam == "M'Gladbach", value = c(2L, :
invalid factor level, NA generated
Any ideas?
Thanks!
You can try sub
data$HomeTeam <- sub("^[^']*'", "", data$HomeTeam)
data$HomeTeam
#[1] "Gladbach" "Sonja" "Henderson" "Marshall"
The sub output will be of class 'character'. If you need to retain the 'factor' class, you can apply sub to the levels of 'HomeTeam' and assign the output back to 'levels' (as shown in the comments by @thelatemail)
levels(data$HomeTeam) <- sub("^[^']*'","",levels(data$HomeTeam))
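If only one level needs renaming, the same levels<- idea can be applied to just that level. This is a minimal sketch on a made-up factor (the team names other than "M'Gladbach" are assumptions for illustration).

```r
# Hypothetical factor of team names.
teams <- factor(c("M'Gladbach", "Bayern", "M'Gladbach", "Dortmund"))

# Rename a single level in place; the vector stays a factor and no NA is produced.
levels(teams)[levels(teams) == "M'Gladbach"] <- "Gladbach"

as.character(teams)
```

This avoids the "invalid factor level" warning because the new value is introduced as a level rather than assigned to individual elements.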
If you want to replace only the word "M'Gladbach" with "Gladbach", as shown in the post, it is better to convert the "HomeTeam" column from factor to character class. It may be better still to read the dataset with the stringsAsFactors=FALSE option in read.table/read.csv or in data.frame.
As the "HomeTeam" column is already a factor, you can use as.character (from @rawr's comment)
data$HomeTeam <- as.character(data$HomeTeam)
data$HomeTeam[data$HomeTeam=="M'Gladbach"]<-"Gladbach"
Data used:
set.seed(22)
data <- data.frame(HomeTeam= c("M'Gladbach", "S'Sonja",
"HR'Henderson", "Marshall"), Value=rnorm(4))
I'm trying to read data from a *.txt or *.csv file into R with read.table or read.csv. However, my data is written as e.g. 1.4523e-9 in the file denoting 1.4523*10^{-9} though ggplot recognizes this as a string instead of a real. Is there some sort of eval( )-function to convert this to its correct value ?
Depending on the exact format of the csv file you import, read.csv and read.table often simply convert all columns to factors. Since a straightforward conversion to numeric has failed, I assume this is your problem. You can change this using the colClasses argument, like so:
# if every column should be numeric:
df <- read.csv("foobar.csv", colClasses = "numeric")
#if only some columns should be numeric, use a vector.
#to read the first as factor and the second as numeric:
read.csv("foobar.csv", colClasses = c("factor", "numeric"))
Of course, both of the above are barebones examples; you probably want to supply other arguments as well, e.g. header = TRUE.
If you don't want to supply the classes of each column when you read the table (maybe you don't know them yet!), you can convert after the fact using either of the following:
df$a <- as.numeric(as.character(df$a)) #as you already discovered
df$a <- as.numeric(levels(df$a)[df$a])
Yes, these are both clunky, but they are standard and frequently recommended.
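To show why the detour through as.character() or levels() is needed, here is a short sketch on a made-up factor holding numbers in scientific notation. The values are illustrative only.

```r
# Hypothetical factor whose labels are numbers written in scientific notation.
f <- factor(c("1.4523e-9", "2e-3", "10"))

codes <- as.numeric(f)                 # internal integer codes, NOT the values
vals  <- as.numeric(as.character(f))   # the actual numbers
vals2 <- as.numeric(levels(f))[f]      # same result, converting each level once
```

as.numeric() understands the 1.4523e-9 notation directly, so no eval()-style step is required once the column is character rather than factor.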
I have to import many datasets automatically with the first column being a name, so a character vector, and the second column being a numeric vector, so I was using these specifications with read.table: colClasses = c("character", "numeric").
This works great if I have a dataframe saved in a df_file like this:
df<- data.frame(V1=c("s1","s2","s3","s4"), V2=c("1e-04","1e-04","1e-04","1e-04"))
read.table(df_file, header = FALSE, comment.char="", colClasses = c("character", "numeric"), stringsAsFactors=FALSE)
The problem is in some cases I have dataframes with numeric values in the form of exponential in the second column, and in these cases the import does not work since it does not recognise the column as numeric (or it imports as "character" if I don't specify the colClasses), so my question is:
how can I specify a column to be imported as numeric even when the values are exponential?
For example:
df<- data.frame(V1=c("s1","s2","s3","s4"), V2=c("10^(-4)","10^(-4)","10^(-4)","10^(-4)"))
I want all the exponential values to be imported as numeric, but even when I try to convert from character to numeric after import, I get all NA: as.numeric(as.character(df$V2)) gives "Warning message: NAs introduced by coercion".
I have tried to use "real" or "complex" with colClasses too but it still imports the exponentials as character.
Please help,
thank you!
I think the problem is that the form your exponentials are written in doesn't match the R style. If you read them in as character vectors you can convert them to exponentials if you know they all are exponentials. Use gsub to strip out the "10^(" and the ")", leaving you with the "-4", convert to numeric, then convert back to an exponential. Might not be the fastest way, but it works.
From your example:
df<- data.frame(V1=c("s1","s2","s3","s4"), V2=c("10^(-4)","10^(-4)","10^(-4)","10^(-4)"))
df$V2 <- 10^(as.numeric(gsub("10\\^\\(|\\)", "", df$V2)))
df
# V1 V2
#1 s1 1e-04
#2 s2 1e-04
#3 s3 1e-04
#4 s4 1e-04
What's happening in detail: gsub("10\\^\\(|\\)", "", df$V2) is substituting 10^( and ) with an empty string (you need to escape the caret and the parentheses), as.numeric() is converting your -4 string into the number -4, then you're just running 10^ on each element of the numeric vector you just made.
If you read in your data.frame with stringsAsFactors=FALSE, the column in question should come in as a character vector, in which case you can simply do:
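# evaluate each string separately; eval(parse(text=V2)) on the whole vector would return only the last value
transform(df, V2=sapply(V2, function(s) eval(parse(text=s)), USE.NAMES=FALSE))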
You could use readLines to first load in the data and do all the operations required and then use read.table with textConnection as follows:
tt <- readLines("~/tmp.txt")
tt <- gsub("10\\^\\((.*)\\)$", "1e\\1", tt)
read.table(textConnection(tt), sep="\t", header=TRUE, stringsAsFactors=FALSE)
V1 V2
1 s1 1e-04
2 s2 1e-04
3 s3 1e-04
4 s4 1e-04
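The same rewrite-then-parse idea can be tried without a file on disk. In this sketch the lines in tt are invented sample input (tab-separated, with a header), and read.table's text argument stands in for textConnection.

```r
# Hypothetical file contents, one element per line, tab-separated.
tt <- c("V1\tV2",
        "s1\t10^(-4)",
        "s2\t10^(3)",
        "s3\t10^(0)")

# Rewrite a trailing 10^(k) as 1ek, which R parses as a number.
tt <- gsub("10\\^\\((.*)\\)$", "1e\\1", tt)

df <- read.table(text = tt, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
str(df)
```

Note that the pattern is anchored with $, so it only rewrites a power of ten at the end of a line; other layouts would need a different pattern.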
I am new to R and I am trying to convert a dataframe to a numeric matrix using the below code
expData <- read.table("GSM469176.txt",header = F)
expVec <- as.numeric(as.matrix(expData))
When I use as.matrix, without as.numeric, it returns some numbers (as below)
0.083531 0.083496 0.083464 0.083435 0.083406 0.083377 0.083348"
[9975] "-0.00285 -0.0028274 -0.0028046 -0.0027814 -0.0027574 -0.0027319 -0.0027042
but when I put in the as.numeric, they are all converted to "NA"
I apologize if someone has asked this question before but I can't find a post that solves my problem.
Thanks in advance
You have 2 issues. First, if you examine the structure of the data frame, you'll note that the first column is characters:
head(expData)[, 1:4]
V1 V2 V3 V4
1 YAL002W(cer) 6.1497e-02 6.2814e-02 6.4130e-02
2 YAL002W(par) 7.1352e-02 7.3262e-02 7.5171e-02
3 YAL003W(cer) 2.2428e-02 3.8252e-02 5.4078e-02
4 YAL003W(par) 2.6548e-02 3.6747e-02 4.6947e-02
5 YAL005C(cer) 2.4023e-05 2.3243e-05 2.2462e-05
6 YAL005C(par) 2.0252e-02 2.0346e-02 2.0440e-02
Therefore, trying to convert the complete data frame to numeric will not work as expected.
Second, you are running as.numeric() after as.matrix(), which is converting the matrix to a vector:
x <- as.numeric(as.matrix(expData))
# Warning message:
# NAs introduced by coercion
class(x)
# [1] "numeric"
dim(x)
# NULL not a matrix
length(x)
# [1] 14261302
I suggest you try this:
rownames(expData) <- expData$V1
expData$V1 <- NULL
expData <- as.matrix(expData)
dim(expData)
# [1] 7502 1900
class(expData[, 1])
# [1] "numeric"
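A scaled-down sketch of the same steps, on a three-row made-up data frame that borrows a few values from the output above, shows how the character column changes the result of as.matrix().

```r
# Small hypothetical version of expData: one name column, two numeric columns.
expData <- data.frame(V1 = c("YAL002W(cer)", "YAL002W(par)", "YAL003W(cer)"),
                      V2 = c(6.1497e-02, 7.1352e-02, 2.2428e-02),
                      V3 = c(6.2814e-02, 7.3262e-02, 3.8252e-02),
                      stringsAsFactors = FALSE)

# With the name column still present, everything is coerced to character.
mixed <- as.matrix(expData)
is.character(mixed)   # TRUE

# Move the names into the row names, drop the column, then convert.
rownames(expData) <- expData$V1
expData$V1 <- NULL
m <- as.matrix(expData)
is.numeric(m)         # TRUE
dim(m)
```

A matrix can hold only one type, so a single character column is enough to turn every cell into a string.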
You get the NA's when R doesn't know how to convert something to a number.
Specifically, the quotation mark in your output tells me that you have one (or several) long string(s) of numbers. To see why this is bad, try: as.numeric("-0.00285 -0.0028274")
I don't know what your raw data is like, but as @alexwhan mentioned, the culprit is probably in your call to read.table
To fix it, try explicitly setting the sep argument (i.e., next to where you have header)
I would suggest opening up the raw file in a simple text editor (TextEdit.app or Notepad, not Word) and seeing how the values are separated. My guess is
..., sep="\t"
should do the trick.
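Here is a rough sketch of how a wrong separator produces exactly this symptom. The two lines in raw are invented tab-separated input, and sep = "," is just an assumed example of a mismatched separator, not necessarily what happened with the asker's file.

```r
# A string holding several numbers cannot be converted to a single number.
suppressWarnings(as.numeric("-0.00285 -0.0028274"))   # NA

# Hypothetical tab-separated input lines.
raw <- c("g1\t0.083531\t0.083496",
         "g2\t-0.00285\t-0.0028274")

# Mismatched separator: each line ends up as one long character field.
wrong <- read.table(text = raw, header = FALSE, sep = ",", stringsAsFactors = FALSE)

# Matching separator: the numbers land in their own numeric columns.
right <- read.table(text = raw, header = FALSE, sep = "\t", stringsAsFactors = FALSE)
```

Comparing ncol() and str() of the two results is a quick way to confirm which separator the file really uses.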