I imported a dataset from a CSV file using the following command:
data.CGS <- read.csv("filepath", sep=";", na.strings=c(""," ","NA"), stringsAsFactors =F)
One column in the CSV file mixes several types of data: integers, decimals, percentages and character strings.
Say, for simplicity, that this column has the following elements: col=[1,2,1, c, 2%, 4%, 15.5, 16.5]
R will then read this column as if one had created this variable:
col<-c("1","2", "c", "2%", "4%", "15.5", "16.5", "1980", "1/12/1950")
My purpose is to do some tabulations and compute some statistics based on the "truly" numerical data, which in this example are all values except the letter "c" and the dates, 1980 and 1/12/1950.
What is the easiest way to do this in R? Any help will be much appreciated.
Of course, the very simple thing to do is to coerce all elements to numeric, but in R this converts every non-numeric string into NA, which I do not like.
One way is to create a new vector that is separate from any text characters.
## Create new vector without any characters
col2 <- col[-grep("[a-zA-Z]", col)]
## To strip percentages (%)
strip_percents <- as.numeric(gsub("%", "", col2))
## All numbers except percentages
no_percents <- as.numeric(col2[-grep("%", col2)])
## Save strings in new vector
all_yo_strings <- col[grep("[a-zA-Z]", col)]
## Save percentages in a new vector
all_yo_percents <- col[grep("%", col)]
all_yo_percents <- as.numeric(gsub("%", "", all_yo_percents))/100
Does that work for your purposes? It will preserve your text strings in the original col variable (which you can access by simply removing the - from col[-grep("[a-zA-Z]", col)]), while giving you a new, numeric vector.
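A variant of the same idea, as a minimal sketch: it uses grepl() instead of -grep(), because negative indexing with an empty grep() result drops every element. The pattern num_pattern and the rule that whole numbers between 1900 and 2100 are years are assumptions made for illustration, not something stated in the question.

```r
col <- c("1", "2", "c", "2%", "4%", "15.5", "16.5", "1980", "1/12/1950")

# Assumed pattern: optional sign, digits, optional decimal part, optional trailing %
num_pattern <- "^-?[0-9]+(\\.[0-9]+)?%?$"
is_num <- grepl(num_pattern, col)
is_pct <- is_num & grepl("%$", col)

# Convert only the entries that look numeric; everything else stays NA
values <- rep(NA_real_, length(col))
values[is_num] <- as.numeric(sub("%$", "", col[is_num]))
values[is_pct] <- values[is_pct] / 100

# Hypothetical rule: whole numbers from 1900 to 2100 are treated as years
is_year <- is_num & !is_pct & values >= 1900 & values <= 2100 & values == round(values)

truly_numeric <- values[is_num & !is_year]
```

The original col is left untouched, so the strings and dates remain available through col[!is_num] and col[is_year].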
Your question covers several things at once. Here is one example:
col<-data.frame(var = c("1","2", "c", "2%", "4%", "15.5", "16.5"))
col
library(dplyr)
With gsub you remove the % sign from the variable var; with filter you drop the value "c":
col %>% mutate(var1 = gsub("%", "", var)) %>% filter(var1 != "c") %>% summarise(m_n = mean(as.numeric(var1)))
m_n
1 6.833333
I have a big time series dataset in which the numeric results are stored in General format in MS-Excel. I tried using gsub(",", "", dummy ), but it did not work. The dataset does not have any , or any other visible special character other than a decimal point, and R picks up the datatype as character. Values are either positive or negative with one NA and all values have different number of decimal places.
How can I convert the values to numeric without ending up with NAs after the conversion? One thing to note is that, once converted to numeric, some of the values are displayed in scientific notation, like 12.1 e+03, and others with four decimal places.
dummy = c("12.1", "42000", "1.2145", "12.25", "N/A", "323.369", "-1.235", "335", "0")
# Convert to numeric
dummy = gsub(",", "", dummy )
dummy = as.numeric(dummy )
Error
Warning message:
NAs introduced by coercion
Changing N/A to NA solves this issue:
# N/A to NA
dummy = c("12.1", "42000", "1.2145", "12.25", NA, "323.369", "-1.235", "335")
# Convert to numeric
dummy = gsub(",", "", dummy)
dummy = as.numeric(dummy)
To do so for your entire dataset, you can use:
# Across columns (for matrices)
data <- apply(data, 2, function(x){
ifelse(x == "N/A", NA, x)
})
# Then convert characters to numeric (for matrices)
data <- apply(data, 2, as.numeric)
# Across columns (for data frames); assigning to data[] keeps the data frame instead of returning a plain list
data[] <- lapply(data, function(x){
  ifelse(x == "N/A", NA, x)
})
# Then convert characters to numeric (for data frames)
data[] <- lapply(data, as.numeric)
Update: *apply differences for object types in R -- thanks to user20650 for pointing this out
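As a minimal sketch of the same fix on a single vector, and of declaring the marker at import time: the second half assumes the Excel sheet was exported to CSV and read with read.csv, which the question does not state, and csv_text is made-up sample data.

```r
dummy <- c("12.1", "42000", "1.2145", "12.25", "N/A", "323.369", "-1.235", "335", "0")

# Replace the "N/A" marker with a real missing value before converting
dummy[dummy == "N/A"] <- NA
dummy_num <- as.numeric(dummy)   # no coercion warning now

# Assumption: the data comes from a CSV export; na.strings handles the marker on import
csv_text <- "value\n12.1\n42000\nN/A\n-1.235\n0"
dat <- read.csv(text = csv_text, na.strings = c("N/A", "NA"))
```

With na.strings set, the column is read as numeric directly, so no later conversion step is needed.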
In my program, I webscraped a table from Yahoo finance. When I extract the values from the table they are listed as NAs. Is there a way I can turn them into numerics?
library(XML)
Symbol = "HD"
TableC <- readHTMLTable(getNodeSet(htmlTreeParse(readLines(paste0("https://finance.yahoo.com/quote/",Symbol,"/options?p=",Symbol,"&date=",1607644800), warn = FALSE), useInternalNodes = TRUE, asText = TRUE), "//table")[[1]])
TempVolCPosition <- grep("245.00", TableC[,3])
TempVolCVar <- TableC[TempVolCPosition, 11]
print(TempVolCVar)
Columns 3 to 11 are all originally of type character. Before converting to numeric you have to get rid of the text that doesn't belong. You have plus signs, percent signs and hyphens for blanks. Note that the hyphen and the negative sign are the same so to avoid impacting your negative numbers be sure to use gsub with '^-$' and not "-" or you will lose the negative signs.
#replace cells containing only a hyphen with blank
TableC[,3:11] <- apply(TableC[,3:11], MARGIN = 2, function (x) gsub("^-$","", x))
#replace percent sign with blank
TableC[,3:11] <- apply(TableC[,3:11], MARGIN = 2, function (x) gsub("%","", x))
#replace plus sign for positive values with nothing (fixed = TRUE because "+" is a regex metacharacter)
TableC[,3:11] <- apply(TableC[,3:11], MARGIN = 2, function (x) gsub("+","", x, fixed = TRUE))
#convert to numeric
TableC[,3:11] <- apply(TableC[,3:11], MARGIN = 2, function (x) as.numeric(x))
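The same cleaning steps on a small stand-alone vector, as a sketch. The values in raw are made up to mimic the kinds of cells described above; they are not taken from the scraped table.

```r
# Made-up sample cells: signed numbers, a lone hyphen for "blank", a percentage
raw <- c("+1.25", "-0.40", "-", "42.82%")

clean <- gsub("^-$", "", raw)                # a cell that is only a hyphen becomes empty
clean <- gsub("%", "", clean, fixed = TRUE)  # drop percent signs
clean <- gsub("+", "", clean, fixed = TRUE)  # drop plus signs, keep minus signs
nums  <- as.numeric(clean)                   # the empty string becomes NA
```

Anchoring the hyphen pattern with ^ and $ is what keeps the sign on "-0.40".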
You can strip the percent sign and convert the value directly:
TempVolCVar <- as.numeric(sub('%', '', TempVolCVar))
#[1] 42.82
We can also use parse_number from readr
TempVolCVar <- readr::parse_number(TempVolCVar)
Using the dplyr library you can convert selected columns to numeric:
library(dplyr)
TableC <- TableC %>%
  mutate(across(c(Strike, `Last Price`, Bid, Ask, Change, Volume, `Open Interest`), as.numeric))
There will still be a couple of NAs introduced where the original value was a dash. Andrea's solution is cleaner in that regard.
I have a data frame with factors and characters. I want to convert the columns whose names start with "ID_" from factors to characters.
I tried the code below, but it changes the whole data frame to characters; I only want to change the columns whose names contain "ID_". I don't know how many "ID_" columns will end up in the data frame (this is part of a larger function that loops across data frames with varying numbers of "ID_" columns).
###Changes the whole dataframe to character rather than only the intended columns
df.loc[] <- lapply(df.loc[, grepl("ID_", colnames(df.loc))], as.character)
The problem is you assign to the whole data frame with df.loc[] <-. Try this:
my_cols <- grepl("ID_", colnames(df.loc))
df.loc[my_cols] <- lapply(df.loc[my_cols], as.character)
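A small self-contained check of this approach. The data frame contents are invented for illustration, and the pattern is anchored with ^ on the assumption that "ID_" is always a prefix.

```r
# Invented example data: two ID_ columns and one other factor column
df.loc <- data.frame(
  ID_a  = factor(c("x", "y")),
  value = factor(c("p", "q")),
  ID_b  = factor(c("m", "n"))
)

# Logical index of the columns to convert
my_cols <- grepl("^ID_", colnames(df.loc))

# Assign back to the same subset so the other columns are untouched
df.loc[my_cols] <- lapply(df.loc[my_cols], as.character)

col_classes <- sapply(df.loc, class)
```

Because both sides of the assignment use the same index, this works for any number of "ID_" columns.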
Here is a tidyverse solution:
library(dplyr)
# tibble() replaces the deprecated data_frame()
food <- tibble(
  "ID_fruits" = factor(c("apple", "banana", "cherry")),
  "vegetables" = factor(c("asparagus", "broccoli", "cabbage")),
  "ID_drinks" = factor(c("absinthe", "beer", "cassis"))
)
food %>%
  mutate_at(vars(starts_with("ID_")), as.character)
You can also do this with ifelse:
df[] <- ifelse(grepl("^ID_", colnames(df)), lapply(df, as.character), df)
I've been profiting from SO, quite a while now and now decided to sign up and try to a) help others and b) get help from great guys :)
So coming to my question, I have vector extracted from a data frame that looks like this (just little subset of the data):
cho <- c("[M-H]: C4H4O2",
"[M+Hac-H]: C5H10O6",
"[M-H]: C6H4O3",
"[M+Fa-H]: C7H6O",
"[M-H]: C9H8O3",
"[M-H]: C18H30O3");
Now from this vector I want to extract the numbers in order to get the number of "C", "H", and "O" atoms:
temp <- strsplit(cho, "[^[:digit:]]");
temp <- as.numeric(unlist(temp));
#remove NAs
temp <- temp[!is.na(temp)];
#split into three column matrix and convert to df to merge with original df
temp <- as.data.frame(matrix(temp, ncol = 3, byrow = T));
For this small subset R recycles the data to fill the matrix. For my larger data set the temp vector happens to be long enough, so the matrix is generated, but it is a mess. This is due to cases such as "[M+Fa-H]: C7H6O", where only two numbers can be extracted. How can I get a "1" after the "O" so that three numbers are extracted instead of two? Is there a workaround for this?
Thanks a lot in advance for your help!
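We can use str_extract_all. Use a regex lookbehind to match one or more digits (\\d+) that follow a C, H or O, extract those numbers into a list, and convert them to integer. Note that an element written without a count, such as the O in "C7H6O", produces no match, so that entry still has only two numbers.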
library(stringr)
lst <- lapply(str_extract_all(cho, "(?<=C|H|O)\\d+"), as.integer)
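To get a 1 for an element written without a count, one possible sketch in base R follows. The helper count_atom is hypothetical, and it assumes that the formula is everything after ": " and that element symbols are a single capital letter.

```r
cho <- c("[M-H]: C4H4O2", "[M+Hac-H]: C5H10O6", "[M+Fa-H]: C7H6O", "[M-H]: C18H30O3")

# Keep only the formula part after the adduct label
formula <- sub("^.*: *", "", cho)

# Hypothetical helper: number of atoms of one element per formula.
# A symbol without digits counts as 1, an absent symbol as 0.
count_atom <- function(formula, symbol) {
  pattern <- paste0(symbol, "(?![a-z])([0-9]*)")
  m <- regmatches(formula, regexec(pattern, formula, perl = TRUE))
  vapply(m, function(x) {
    if (length(x) == 0) {
      0L
    } else if (x[2] == "") {
      1L
    } else {
      as.integer(x[2])
    }
  }, integer(1))
}

atoms <- data.frame(C = count_atom(formula, "C"),
                    H = count_atom(formula, "H"),
                    O = count_atom(formula, "O"))
```

Each row of atoms lines up with the corresponding element of cho, so it can be bound to the original data frame with cbind.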
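Or a base R option is the following. The element letters are required and the pattern is anchored at the end, so multi-digit counts such as C18H30O3 are captured whole; a missing count after O is read as NA:
read.csv(text=sub("^.*C(\\d+)H(\\d+)O(\\d*)$",
"\\1,\\2,\\3", cho), header=FALSE, fill=TRUE)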
I would like to plot a heatmap of a table imported from MATLAB. The table has explicit row names and column names. I loaded it into R with read.table, and I can run summary(i) and get the numeric summaries for each column:
i = read.table("file.txt",header=TRUE)
But when I try to run heatmap, it complains the converted matrix is not numeric, both with and without rownames.force=TRUE:
is.matrix(as.matrix(i,rownames.force=TRUE))
[1] TRUE
heatmap(as.matrix(i,rownames.force=TRUE))
Error in heatmap(as.matrix(i, rownames.force = TRUE)) :
'x' must be a numeric matrix
I think the problem is that as.matrix tries to convert the non-numeric rowname (or colname, I am not sure anymore :-():
as.matrix(i)[1]
[1] "cluster-594-walk-0161"
Any ideas?
Without a reproducible example we are left guessing what goes wrong, but the error suggests that the matrix does not contain numbers but (probably) characters. Does this work:
i[] = lapply(i, as.numeric) # as.numeric() cannot be applied to a whole data frame, so convert column by column
heatmap(as.matrix(i,rownames.force=TRUE))
and what is the output of:
is.numeric(as.matrix(i)[1])
(probably FALSE).
edit:
Your edit shows that the matrix contains characters, not numerics. It may be that in the text file the rownames are included as an additional column, probably the first one. In that case:
i = read.table("file.txt", header = TRUE, row.names = 1)
reads the first column as the rownames. So the problem is most likely in read.table, not in the conversion to a matrix.
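A small reproduction of the effect, as a sketch: the table in txt is invented and only mimics a file whose first column holds the row labels.

```r
# Invented stand-in for file.txt: labels in the first column, numbers in the rest
txt <- "name a b c
cluster-594-walk-0161 1.0 2.5 3.1
cluster-594-walk-0162 0.4 1.2 2.2
cluster-594-walk-0163 2.0 0.3 1.7"

# Without row.names, the label column is part of the data and the matrix is character
m_bad <- as.matrix(read.table(text = txt, header = TRUE))
bad_is_numeric <- is.numeric(m_bad)

# With row.names = 1, the labels become row names and the matrix is numeric
m_good <- as.matrix(read.table(text = txt, header = TRUE, row.names = 1))
good_is_numeric <- is.numeric(m_good)
```

Here bad_is_numeric is FALSE and good_is_numeric is TRUE, and m_good is the kind of matrix that heatmap() accepts.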
The solution is to store the row names first, convert the remaining columns of the data frame to a matrix, and then reattach the row names:
library(gplots) # provides heatmap.2() and redblue()
rnames <- data[,1] # assign labels in column 1 to "rnames"
mat_data <- data.matrix(data[,2:ncol(data)]) # convert the remaining columns to a numeric matrix
rownames(mat_data) <- rnames # assign row names
heatmap.2(mat_data, col=redblue(256), scale="row", key=T, keysize=1.5, trace="none",cexCol=0.9,srtCol=45) # your heatmap