I have a big time series dataset in which the numeric results are stored in General format in MS-Excel. I tried using gsub(",", "", dummy ), but it did not work. The dataset does not contain commas or any other visible special characters apart from the decimal point, yet R picks up the datatype as character. Values are positive or negative, with one N/A, and have varying numbers of decimal places.
How can I convert the values to numeric without ending up with NAs afterwards? One thing to note is that, once converted to numeric, some of the values are displayed in scientific notation (like 12.1 e+03) and other values with four decimal places.
dummy = c("12.1", "42000", "1.2145", "12.25", "N/A", "323.369", "-1.235", "335", "0")
# Convert to numeric
dummy = gsub(",", "", dummy )
dummy = as.numeric(dummy )
Warning message:
NAs introduced by coercion
as.numeric() cannot parse the string "N/A", so it is coerced to NA with a warning. Changing N/A to NA solves this issue:
# N/A to NA
dummy = c("12.1", "42000", "1.2145", "12.25", NA, "323.369", "-1.235", "335")
# Convert to numeric
dummy = gsub(",", "", dummy)
dummy = as.numeric(dummy)
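To find out which strings trigger the coercion warning, it can help to compare the missing values before and after conversion. The following is only a sketch; `dummy_raw`, `dummy_num` and `bad` are illustrative names that do not appear in the original post.

```r
# Illustrative input: "N/A" is a plain string here, as it might arrive from Excel
dummy_raw <- c("12.1", "42000", " 1.2145", "N/A", "-1.235", "0")

# Trim stray whitespace, then convert; suppressWarnings() hides the
# expected "NAs introduced by coercion" warning
dummy_num <- suppressWarnings(as.numeric(trimws(dummy_raw)))

# Entries that were not missing before conversion but are NA afterwards
bad <- dummy_raw[is.na(dummy_num) & !is.na(dummy_raw)]
print(bad)

# Scientific notation only affects printing; the stored values are unchanged
print(format(dummy_num, scientific = FALSE))
```

Anything listed in `bad` is a string that as.numeric() could not parse and is worth inspecting by hand.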
To do so for your entire dataset, you can use:
# Across columns (for matrices)
data <- apply(data, 2, function(x) {
  ifelse(x == "N/A", NA, x)
})
# Then convert characters to numeric (for matrices)
data <- apply(data, 2, as.numeric)
# Across columns (for data frames); assigning to data[] keeps the data frame
# structure, whereas a plain lapply() result would be a list
data[] <- lapply(data, function(x) {
  ifelse(x == "N/A", NA, x)
})
# Then convert characters to numeric (for data frames)
data[] <- lapply(data, as.numeric)
Update: *apply differences for object types in R -- thanks to user20650 for pointing this out
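If the data reaches R through read.csv() (an assumption, since the question does not say how the Excel sheet is imported), the "N/A" strings can be declared as missing at read time, so no cleanup is needed afterwards. A minimal sketch with made-up CSV content:

```r
# Hypothetical CSV content standing in for the exported Excel sheet
csv_text <- "a,b\n12.1,N/A\n42000,-1.235\nN/A,335"

# na.strings tells read.csv() which strings to treat as missing,
# so both columns are parsed as numeric directly
data <- read.csv(text = csv_text, na.strings = c("N/A", "NA", ""))
str(data)
```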
In my program, I webscraped a table from Yahoo finance. When I extract the values from the table they are listed as NAs. Is there a way I can turn them into numerics?
library(XML)
Symbol = "HD"
TableC <- readHTMLTable(getNodeSet(htmlTreeParse(readLines(paste0("https://finance.yahoo.com/quote/",Symbol,"/options?p=",Symbol,"&date=",1607644800), warn = FALSE), useInternalNodes = TRUE, asText = TRUE), "//table")[[1]])
TempVolCPosition <- grep("245.00", TableC[,3])
TempVolCVar <- TableC[TempVolCPosition, 11]
print(TempVolCVar)
Columns 3 to 11 are all originally of type character. Before converting to numeric, you have to get rid of the text that doesn't belong: plus signs, percent signs, and hyphens standing in for blanks. Note that the hyphen and the negative sign are the same character, so to avoid affecting your negative numbers be sure to use gsub with '^-$' and not "-", or you will lose the negative signs.
#replace cells containing only a hyphen with blank
TableC[,3:11] <- apply(TableC[,3:11], MARGIN = 2, function (x) gsub("^-$","", x))
#replace percent sign with blank
TableC[,3:11] <- apply(TableC[,3:11], MARGIN = 2, function (x) gsub("%","", x))
#replace plus sign for positive values with nothing
#(fixed = TRUE matches a literal "+", which is otherwise a regex quantifier)
TableC[,3:11] <- apply(TableC[,3:11], MARGIN = 2, function (x) gsub("+","", x, fixed = TRUE))
#convert to numeric
TableC[,3:11] <- apply(TableC[,3:11], MARGIN = 2, function (x) as.numeric(x))
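The separate cleanup passes can also be folded into one small helper. This is a sketch only: `clean_numeric` is a hypothetical function name, the sample values are made up rather than scraped, and removing commas as thousands separators is an extra assumption not mentioned in the answer.

```r
# clean_numeric() is a hypothetical helper, not part of any package
clean_numeric <- function(x) {
  x <- trimws(x)
  x[x %in% "-"] <- NA          # a lone hyphen marks an empty cell
  x <- gsub("[%+,]", "", x)    # drop percent signs, plus signs, commas
  as.numeric(x)
}

# Made-up values in the style of the scraped table
vals <- c("+1.25", "-0.50", "-", "42.82%", "1,234")
print(clean_numeric(vals))
```

Because only cells consisting of a single hyphen are set to NA, negative numbers such as "-0.50" keep their sign.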
You can convert it after stripping the percent sign:
TempVolCVar <- as.numeric(sub('%', '', TempVolCVar))
#[1] 42.82
We can also use parse_number from readr
TempVolCVar <- readr::parse_number(TempVolCVar)
Using the dplyr library you can convert selected columns to numeric:
library(dplyr)
TableC <- TableC %>%
  mutate(across(c(Strike, `Last Price`, Bid, Ask, Change, Volume, `Open Interest`), as.numeric))
There will still be a couple of NAs introduced where the original value was a dash. Andrea's solution is cleaner in that regard.
I imported a database from CSV file using the following command:
data.CGS <- read.csv("filepath", sep=";", na.strings=c(""," ","NA"), stringsAsFactors =F)
One column in the CSV file holds different types of data: numbers, integers, percentages and character strings.
Say, for simplicity, that this column has the following elements: col=[1,2,1, c, 2%, 4%, 15.5, 16.5]
So R will read this column as if one had created this variable:
col<-c("1","2", "c", "2%", "4%", "15.5", "16.5", "1980", "1/12/1950")
My purpose is to do some tabulations and compute some statistics based on the "truly" numerical data, which in this example are all values except the letter "c" and the dates, 1980 and 1/12/1950.
What is the easiest way to do this in R? Any help will be much appreciated.
Of course, the very simple thing to do is to coerce all elements to numeric, but in R this converts every character string into NA, which I do not like.
One way is to create a new vector that is separate from any text characters.
## Create new vector without any characters
col2 <- col[-grep("[a-zA-Z]", col)]
## To strip percentages (%)
strip_percents <- as.numeric(gsub("%", "", col2))
## All numbers except percentages
no_percents <- as.numeric(col2[-grep("%", col2)])
## Save strings in new vector
all_yo_strings <- col[grep("[a-zA-Z]", col)]
## Save percentages in a new vector
all_yo_percents <- col[grep("%", col)]
all_yo_percents <- as.numeric(gsub("%", "", all_yo_percents))/100
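Does that work for your purposes? It will preserve your text strings in the original col variable (which you can access by simply removing the - from col[-grep("[a-zA-Z]", col)]), while giving you a new, numeric vector. One caveat: if nothing matches, grep returns an empty index and col[-grep(...)] drops every element, so col[!grepl("[a-zA-Z]", col)] is the safer form.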
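A variation on the same idea classifies each element with grepl(), which returns a logical vector and therefore behaves sensibly even when nothing matches. This is a sketch under assumptions: the two patterns and the names `is_pct`, `is_num` and `leftover` are illustrative, and a value like "1980" cannot be told apart from an ordinary number by pattern alone.

```r
col <- c("1", "2", "c", "2%", "4%", "15.5", "16.5", "1980", "1/12/1950")

# Logical masks: percentages such as "2%", and plain numbers only
is_pct <- grepl("^-?[0-9.]+%$", col)
is_num <- grepl("^-?[0-9]+(\\.[0-9]+)?$", col)

percents <- as.numeric(sub("%", "", col[is_pct])) / 100
numbers  <- as.numeric(col[is_num])

# Everything else (text, dates with slashes, ...) is kept as character
leftover <- col[!(is_pct | is_num)]
print(leftover)
```

No coercion warning is raised here, because only strings that already look numeric are passed to as.numeric().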
Your question asks several things at once. Here is one example:
col<-data.frame(var = c("1","2", "c", "2%", "4%", "15.5", "16.5"))
col
library(dplyr)
With gsub you remove the % sign from the variable var; with filter you drop the value "c":
col %>% mutate(var1 = gsub("%", "", var)) %>% filter(var1 != "c") %>% summarise(m_n = mean(as.numeric(var1)))
m_n
1 6.833333
A package ('related') requires me to change some values within variables in a largish SNP data frame (385x12300). This is no doubt simple, but I can't find this particular question anywhere. Sample data:
binfrom<-c(1,1,1,1,0,NA)
x <- sample(binfrom, 100, replace = TRUE)
x<-data.frame(matrix(x,10,10))
I need the variable names X1,X2 etc to replace each "1" in that variable column. The values "0" and "NA" remain unchanged.
Another way is to use which (I'm assuming you have real NAs there; see @akrun's comment):
indx <- which(x == 1, arr.ind = TRUE)
x[indx] <- names(x)[indx[, 2]]
This basically identifies the locations of the ones and replaces them with the corresponding column names, using the column positions from the generated index.
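On a tiny data frame the index matrix is easy to inspect. The values below are made up for illustration (assuming real NA values, as in the answer), and the names `small` and `idx` are chosen here so they do not clash with the objects above:

```r
small <- data.frame(X1 = c(1, 0, NA), X2 = c(0, 1, 1))

# One row per matching cell: first column is the row, second the column
idx <- which(small == 1, arr.ind = TRUE)
print(idx)

# Look up the column name for each match and write it into that cell
small[idx] <- names(small)[idx[, 2]]
print(small)
```

Note that the affected columns become character, so the remaining zeros are stored as "0" while NA stays NA.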
We convert the columns of 'x' to character class from factor and use Map to replace 1 in each column with the corresponding column name.
x[] <- lapply(x, as.character)
x[] <- Map(function(y,z) replace(y, y==1, z), x, colnames(x))
In the OP's post, NA was created as the character string "NA". Because of that, the columns became factors when the data.frame was created (with stringsAsFactors=TRUE, the default option at the time). If we use real NA, then the first step, i.e. converting to character, is not needed.
If we work with data.table, another option is set, which should be fast when working with large datasets.
library(data.table)
setDT(x)
for(j in seq_along(x)){
  set(x, i=NULL, j= j, value= as.character(x[[j]]))
  set(x, i= which(x[[j]]==1 & !is.na(x[[j]])),
      j=j, value= names(x)[j])
}
NOTE: Assumption is that we are working with real NA values.
I've got a frame with a set of different variables - integers, factors, logicals - and I would like to recode all of the "NAs" as a numeric across the whole dataset while preserving the underlying variable class. For example:
frame <- data.frame("x" = rnorm(10), "y" = rep("A", 10))
frame[6,] <- NA
dat <- as.data.frame(apply(frame,2, function(x) ifelse(is.na(x)== TRUE, -9, x) ))
dat
str(dat)
However, here the integers turn into factors; when I include as.numeric(x) in the apply() function, this introduces errors. Thanks for any and all thoughts on how to deal with this.
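apply first converts the data frame to a matrix, which here is of type character. as.data.frame then turns the character columns into factors (the default before R 4.0.0). Instead, you could work column by column: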
dat <- as.data.frame(lapply(frame, function(x) ifelse(is.na(x), -9, x) ) )
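If factor columns should stay factors as well, each class needs its own treatment, because ifelse() returns the underlying integer codes for a factor. One possible sketch follows; `recode_na` is a hypothetical helper, and leaving logical columns untouched is a choice made here, since a logical vector cannot hold -9 without changing class.

```r
frame <- data.frame(x = c(1.5, NA, 3),
                    y = factor(c("A", NA, "A")),
                    z = c(TRUE, NA, FALSE))

# recode_na() is a hypothetical helper, not from the original answer
recode_na <- function(col, value = -9) {
  if (is.factor(col)) {
    # add the replacement as a level so the column stays a factor
    levels(col) <- c(levels(col), as.character(value))
    col[is.na(col)] <- as.character(value)
  } else if (is.numeric(col)) {
    col[is.na(col)] <- value
  }
  # other classes (e.g. logical) are returned unchanged
  col
}

dat <- frame
dat[] <- lapply(frame, recode_na)
str(dat)
```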
I have a column which contains numeric as well as non-numeric values. I want to find the mean of the numeric values, which I can then use to replace the non-numeric values. How can this be done in R?
Say your data frame is named df and the column you want to "fix" is called df$x. You could do the following.
You have to unfactor and then convert to numeric. This will give you NAs for all the character strings that cannot be coerced to numbers.
nums <- as.numeric(as.character(df$x))
As Richie Cotton pointed out, there is a "more efficient, but harder to remember" way to convert factors to numeric
nums <- as.numeric(levels(df$x))[as.integer(df$x)]
To get the mean, use mean() and pass na.rm = TRUE (safer than T, which can be reassigned):
m <- mean(nums, na.rm = TRUE)
Assign the mean to all the NA values.
nums[is.na(nums)] <- m
You could then replace the old data, but I don't recommend it. Instead just add a new column
df$new.x <- nums
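Put together on a small example, the steps look like this. The data frame contents are invented for illustration, and suppressWarnings() is added only to silence the expected coercion warning:

```r
# Made-up column mixing numbers and text
df <- data.frame(x = c("10", "abc", "20", "n/a", "30"),
                 stringsAsFactors = FALSE)

# as.character() is a no-op for character input but also covers factors
nums <- suppressWarnings(as.numeric(as.character(df$x)))

# Mean of the values that converted cleanly
m <- mean(nums, na.rm = TRUE)

# Fill the gaps with that mean and keep the original column intact
nums[is.na(nums)] <- m
df$new.x <- nums
print(df)
```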
This is a function I wrote yesterday to combat the non-numeric types. I have a data.frame with unpredictable type for each column. I want to calculate the means for numeric, and leave everything else untouched.
colMeans2 <- function(x) {
  # This function tries to guess the column type. Since all columns arrive
  # as characters, it first checks whether any value is "TRUE" or "FALSE".
  # If not, it tries to coerce the vector to integer. If that doesn't
  # work, it looks for a ' \" ' in the vector (meaning a character
  # column) and returns the first element. Finally, if nothing
  # else matched, the column is treated as numeric and its mean is
  # calculated.
  # browser()
  # try if logical (%in% works for both character vectors and factors)
  if (any(x %in% c("TRUE", "FALSE"))) return(NA)
  # try if integer
  try.int <- strtoi(x)
  if (all(!is.na(try.int))) return(try.int[1])
  # try if character
  if (any(grepl("\\\"", x))) return(x[1])
  # what's left is numeric
  mean(as.numeric(as.character(x)), na.rm = TRUE)
  # a possible warning about coerced NAs probably originates in the above line
}
You would use it like so:
apply(X = your.dataframe, MARGIN = 2, FUN = colMeans2)
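An alternative to guessing types by hand is to let base R's type.convert() infer them and then average only the columns that turn out numeric. This is a different technique from the function above, sketched on made-up data; `raw`, `typed` and `means` are illustrative names.

```r
# Made-up data frame in which every column arrives as character
raw <- data.frame(a = c("1.5", "2.5", "4"),
                  b = c("TRUE", "FALSE", "TRUE"),
                  c = c("x", "y", "z"),
                  stringsAsFactors = FALSE)

# Let R infer each column's type; as.is = TRUE keeps strings as character
typed <- type.convert(raw, as.is = TRUE)

# Mean for numeric columns, NA for everything else
means <- sapply(typed, function(col) {
  if (is.numeric(col)) mean(col, na.rm = TRUE) else NA
})
print(means)
```

Using sapply() over the data frame, rather than apply(), avoids the conversion to a character matrix, so each column keeps its inferred type.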
It sort of depends on what your data looks like.
Does it look like this?
data = list(1, 2, 'new jersey')
Then you could
data.numbers = sapply(data, as.numeric)
and get
c(1, 2, NA)
And you can find the mean with
mean(data.numbers, na.rm=T)
A compact conversion:
vec <- c(0:10,"a","z")
vec2 <- (as.numeric(vec))
vec2[is.na(vec2)] <- mean(vec2[!is.na(vec2)])
as.numeric will convert the non-numeric values to NA and print a warning message like the one below.
Warning message:
In mean(as.numeric(vec)) : NAs introduced by coercion