I have an issue converting data to the numeric format.
str(DfFilter)
output
'data.frame': 32 obs. of 5 variables:
$ InstanceType : chr " c1.xlarge" " c1.xlarge" " c1.xlarge" " c1.xlarge" ...
$ ProductDescription: chr " Linux/UNIX" " Linux/UNIX" " Linux/UNIX" " Linux/UNIX" ...
$ SpotPrice : num 0.052 0.0739 0.0747 0.0751 0.0755 ...
$ ymd_hms(Timestamp): POSIXct, format: "2021-05-16 06:26:40" "2021-05-16 00:58:55" "2021-05-16 06:46:50" ...
$ Timestamp : 'times' num 06:26:40 00:58:55 06:46:50 14:17:55 19:07:09 ...
..- attr(*, "format")= chr "h:m:s"
But when I check for numeric values as follows:
is.numeric(DfFilter)
[1] FALSE
Why is that so? Kindly help me understand this issue. Thanks in advance.
With the purrr package, based on the comments:
library(purrr)  # also provides the %>% pipe
DfModel <- DfFilter %>%
  purrr::keep(.p = function(x) is.numeric(x))
This keeps only the numeric variables.
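is.numeric(DfFilter) returns FALSE because it tests the data frame as a whole, and a data frame is a list of columns rather than a numeric vector; the check has to be applied to each column. Filter with is.numeric can be used to get only the numeric columns.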
Filter(is.numeric, DfFilter)
# a c
#1 1 2.2
Another way to keep only the numeric columns of a data.frame is to apply is.numeric to each column with sapply and use the result for subsetting with [:
DfFilter[sapply(DfFilter, is.numeric)]
# a c
#1 1 2.2
Example dataset:
DfFilter <- data.frame(a=1, b="b", c=2.2)
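To see why is.numeric(DfFilter) is FALSE while the column-wise checks work, here is a minimal sketch using the example dataset from this answer. It uses only base R; the variable names whole, per_column and numeric_only are made up for the illustration.

```r
# Example dataset from this answer
DfFilter <- data.frame(a = 1, b = "b", c = 2.2)

# A data frame is a list of columns, so the object itself is not numeric
whole <- is.numeric(DfFilter)

# Applying the test to each column gives one logical value per column
per_column <- sapply(DfFilter, is.numeric)

# Use the logical vector to keep only the numeric columns
numeric_only <- DfFilter[per_column]

print(whole)
print(per_column)
print(names(numeric_only))
```

The same logical vector can be reused, for example negated with !per_column to pick the non-numeric columns instead.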
I am trying to create a table (in a Snowflake database) with exactly the same column names as those in my R data.frame object:
'data.frame': 1 obs. of 26 variables:
$ Ship_To : chr "0002061948"
$ Del_Coll_Indicator : chr "D"
$ Currency : chr "GBP"
$ Total_Volume : num 0
$ Total_Quantity : num 0
...
There is no problem with the table creation:
dbWriteTable(con = my_db$con, name = "test5", value = df)
but all column names in the database are converted to upper case:
'data.frame': 1 obs. of 26 variables:
$ SHIP_TO : chr "0002061948"
$ DEL_COLL_INDICATOR : chr "D"
$ CURRENCY : chr "GBP"
...
Is there any way to keep the original names from the R data frame in the table?
As covered by Snowflake's SQL reference docs, when identifiers (such as column names) are unquoted at creation, Snowflake stores them in upper case and treats them as case-insensitive. Any quoted identifier is kept as-is and treated as case-sensitive.
Alter the data frame column names (colnames(df)) to use a quoted identifier format via the dbQuoteIdentifier(my_db$con, each_column_name) DBI function. This should help preserve the casing.
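As a rough illustration of what quoting does to the names, the sketch below builds quoted identifiers by hand. quote_ident is a hypothetical helper written for this example and is not part of DBI; with a live connection, dbQuoteIdentifier(my_db$con, ...) would be the call to use. The column names are taken from the question.

```r
# Hypothetical helper: wrap each name in double quotes, doubling any
# embedded double quotes (standard SQL identifier quoting).
quote_ident <- function(x) {
  paste0('"', gsub('"', '""', x, fixed = TRUE), '"')
}

cols <- c("Ship_To", "Del_Coll_Indicator", "Currency")
quoted <- quote_ident(cols)

# Unquoted, Ship_To would be stored as SHIP_TO; quoted, the case is kept.
print(quoted)
```

Note that a table created with quoted mixed-case names has to be queried with the same quoted names afterwards, since those identifiers are then case-sensitive.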
I'm trying to import a csv with blanks read as "". Unfortunately they're all reading as "NA" now.
To better demonstrate the problem I'm also showing how NA, "NA", and "" are all mapping to the same thing (except in the very bottom example), which would prevent the easy workaround dt[is.na(dt)] <- ""
> write.csv(matrix(c("0","",NA,"NA"),ncol = 2),"MRE.csv")
Opened in Notepad, it looks like this:
"","V1","V2"
"1","0",NA
"2","","NA"
So reading that back...
> fread("MRE.csv")
V1 V1 V2
1: 1 0 NA
2: 2 NA NA
The documentation seems to suggest this, but it does not work as described:
> fread("MRE.csv",na.strings = NULL)
V1 V1 V2
1: 1 0 NA
2: 2 NA NA
I also tried this, which reads the NA as an actual NA, but the problem remains for the empty string, which is read as "NA":
> fread("MRE.csv",colClasses=c(V1="character",V2="character"))
V1 V1 V2
1: 1 0 <NA>
2: 2 NA NA
> fread("MRE.csv",colClasses=c(V1="character",V2="character"))[,V2]
[1] NA "NA"
data.table version 1.11.4
R version 3.5.1
A few possible things going on here:
Even though you wrote "0" as a string, the reading function (fread) infers column types from a sample of the file. This is not uncommon (readr does it, too), and is controllable (with colClasses=).
This might be unique to your question here (and not your real data), but your call to write.csv is implicitly putting the literal NA letters in the file (not to be confused with "NA" where you have the literal string). This might be confusing things, even when you override with colClasses=.
You might already know this, but because fread infers that those columns are integer, they cannot contain empty strings: once a column is determined to be numeric, anything non-number-like becomes NA.
Let's redo your first csv-generating side to make sure we don't confound the situation.
write.csv(matrix(c("0","",NA,"NA"),ncol = 2), "MRE.csv", na="")
(Below, I'm using magrittr's pipe operator %>% merely for presentation, it is not required.)
The first example demonstrates fread's inference. The second shows our overriding that behavior, and now we have blank strings in each NA spot that is not the literal string "NA".
fread("MRE.csv") %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: int 1 2
# $ V1: int 0 NA
# $ V2: logi NA NA
# - attr(*, ".internal.selfref")=<externalptr>
fread("MRE.csv", colClasses="character") %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: chr "1" "2"
# $ V1: chr "0" ""
# $ V2: chr "" "NA"
# - attr(*, ".internal.selfref")=<externalptr>
This can also be controlled on a per-column basis. One issue with this example is that fread is for some reason forcing the column of row names to be named V1, the same as the next column. This looks like a bug to me; perhaps you can look at Rdatatable's issues and potentially post a new one. (I might be wrong, perhaps this is intentional/known behavior.)
Because of this, per-column overriding seems to stop at the first occurrence of a column name.
fread("MRE.csv", colClasses=c(V1="character", V2="character")) %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: chr "1" "2"
# $ V1: int 0 NA
# $ V2: chr "" "NA"
# - attr(*, ".internal.selfref")=<externalptr>
One way around this is to go with an unnamed vector, requiring the same number of classes as the number of columns:
fread("MRE.csv", colClasses=c("character","character","character")) %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: chr "1" "2"
# $ V1: chr "0" ""
# $ V2: chr "" "NA"
# - attr(*, ".internal.selfref")=<externalptr>
Another way (thanks @thelatemail) is with a list:
fread("MRE.csv", colClasses=list(character=2:3)) %>% str
# Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
# $ V1: int 1 2
# $ V1: chr "0" ""
# $ V2: chr "" "NA"
# - attr(*, ".internal.selfref")=<externalptr>
Side note: if you need to preserve them as ints/nums, then:
If your concern is about how it affects follow-on calculations, then you can:
  fix the source of the data so that nulls are not provided;
  filter out the incomplete observations (rows); or
  fix the calculations to deal intelligently with missing data.
If your concern is about how it looks in a report, then whatever tool you are using to render your report should have a mechanism for how to display NA values; for example, setting options(knitr.kable.NA="") before knitr::kable(...) will present them as empty strings.
If your concern is about how it looks on your console, you have two options:
  interfere with the data by iterating over each (intended) column and changing NA values to ""; this only works on character columns, and is irreversible; or
  write your own subclass of data.frame that changes how it is displayed on the console; the benefit of this is that it is non-destructive; the problem is that you have to re-class each object where you want this behavior, and most (if not all) functions that output frames will likely inadvertently strip or omit that class from your input. (You'll need to write an S3 method of print for your subclass to do this.)
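The console option of overwriting NA with "" might look like the following sketch. It uses a plain data.frame with made-up columns (id, V1) for illustration and only touches character columns, since a numeric column cannot hold an empty string.

```r
# Illustrative data: one integer column, one character column with a missing value
dt <- data.frame(id = 1:2, V1 = c("0", NA), stringsAsFactors = FALSE)

# Find the character columns; numeric columns are left alone
chr_cols <- names(dt)[sapply(dt, is.character)]

# Replace NA with "" in each of them (this discards the NA/"" distinction)
dt[chr_cols] <- lapply(dt[chr_cols], function(x) {
  x[is.na(x)] <- ""
  x
})

print(dt)
```

Because the replacement is irreversible, it is probably best applied to a copy used only for display.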
I have a data frame which looks like this:
'data.frame': 3036 obs. of 751 variables:
$ X : chr "01.01.2002" "02.01.2002" "03.01.2002" "04.01.2002" ...
$ A: chr "na" "na" "na" "na" ...
$ B: chr "na" "1,827437365" "0,833922973" "-0,838923572" ...
$ C: chr "na" "1,825300613" "0,813299479" "-0,866639008" ...
$ D: chr "na" "1,820482187" "0,821374034" "-0,875963104" ...
...
I have converted the X column into a date format.
dates <- as.Date(dataFrame$X, '%d.%m.%Y')
Now I want to replace this column. The thing is, I cannot create a new data frame, because after D there are over 1000 more columns...
What would be a possible way to do that easily?
I think what you want is simply:
dataFrame$X <- dates
if what you want to do is replace column X with dates. If you want to remove column X, simply do the following:
dataFrame$X <- NULL
(edited with a more concise removal method provided by user @shujaa)
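A small sketch of the in-place replacement, using a two-row stand-in for the question's data frame (the values are copied from the str output in the question; only columns X and B are included):

```r
# Two-row stand-in for the question's data frame
dataFrame <- data.frame(
  X = c("01.01.2002", "02.01.2002"),
  B = c("na", "1,827437365"),
  stringsAsFactors = FALSE
)

# Parse the day.month.year strings and overwrite column X in place;
# all other columns are untouched, however many there are.
dataFrame$X <- as.Date(dataFrame$X, format = "%d.%m.%Y")

print(class(dataFrame$X))
```

Assigning to dataFrame$X changes only that column, so there is no need to rebuild the data frame.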
I am using the data.table fread() function to read some data with missing values. The data were generated in Excel, so the missing-value string is "#N/A". However, when I use the na.strings argument, the columns of the data read in are still character according to str. To replicate this, here is code and data.
Data:
Date,a,b,c,d,e,f,g
1/1/03,#N/A,0.384650146,0.992190069,0.203057232,0.636296656,0.271766148,0.347567706
1/2/03,#N/A,0.461486974,0.500702057,0.234400718,0.072789936,0.060900352,0.876749487
1/3/03,#N/A,0.573541006,0.478062582,0.840918789,0.061495666,0.64301024,0.939575302
1/4/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/5/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/6/03,#N/A,0.66678429,0.897482818,0.569609033,0.524295691,0.132941158,0.194114347
1/7/03,#N/A,0.576835985,0.982816576,0.605408973,0.093177815,0.902145012,0.291035649
1/8/03,#N/A,0.100952961,0.205491093,0.376410642,0.775917986,0.882827749,0.560508499
1/9/03,#N/A,0.350174456,0.290225065,0.428637309,0.022947911,0.7422805,0.354776101
1/10/03,#N/A,0.834345466,0.935128099,0.163158666,0.301310627,0.273928596,0.537167776
1/11/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/12/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/13/03,#N/A,0.325914633,0.68192633,0.320222677,0.249631582,0.605508964,0.739263677
1/14/03,#N/A,0.715104989,0.639040211,0.004186366,0.351412982,0.243570606,0.098312443
1/15/03,#N/A,0.750380716,0.264929325,0.782035411,0.963814327,0.93646428,0.453694758
1/16/03,#N/A,0.282389354,0.762102103,0.515151803,0.194083842,0.102386764,0.569730516
1/17/03,#N/A,0.367802161,0.906878948,0.848538256,0.538705673,0.707436236,0.186222899
1/18/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/19/03,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A,#N/A
1/20/03,#N/A,0.79933188,0.214688799,0.37011313,0.189503843,0.294051763,0.503147404
1/21/03,#N/A,0.620066341,0.329949446,0.123685075,0.69027192,0.060178071,0.599825005
(data saved in temp.csv)
Code:
library(data.table)
a <- fread("temp.csv", na.strings="#N/A")
gives (I have a larger dataset, so ignore the number of observations):
Classes ‘data.table’ and 'data.frame': 144 obs. of 8 variables:
$ Date: chr "1/1/03" "1/2/03" "1/3/03" "1/4/03" ...
$ a : chr NA NA NA NA ...
$ b : chr "0.384650146" "0.461486974" "0.573541006" NA ...
$ c : chr "0.992190069" "0.500702057" "0.478062582" NA ...
$ d : chr "0.203057232" "0.234400718" "0.840918789" NA ...
$ e : chr "0.636296656" "0.072789936" "0.061495666" NA ...
$ f : chr "0.271766148" "0.060900352" "0.64301024" NA ...
$ g : chr "0.347567706" "0.876749487" "0.939575302" NA ...
- attr(*, ".internal.selfref")=<externalptr>
This code works fine:
a <- read.csv("temp.csv", header=TRUE, na.strings="#N/A")
Is it a bug? Is there some smart workaround?
The documentation from ?fread for na.strings reads:
na.strings A character vector of strings to convert to NA_character_. By default for columns read as type character ",," is read as a blank string ("") and ",NA," is read as NA_character_. Typical alternatives might be na.strings=NULL or perhaps na.strings = c("NA","N/A","").
You should convert them to numeric yourself afterwards, I suppose. At least this is what I understand from the documentation.
Something like this?
cbind(a[, 1], a[, lapply(.SD[, -1], as.numeric)])
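A sketch of the same conversion in base R, assuming the fread result has been turned into a plain data.frame with as.data.frame (that step is an assumption, not part of the answer). The three rows are taken from the sample data, with the "#N/A" cells already read as NA; a_df and num_cols are names made up for the example.

```r
# Stand-in for as.data.frame(a): every column was read as character
a_df <- data.frame(
  Date = c("1/1/03", "1/2/03", "1/4/03"),
  a = c(NA_character_, NA_character_, NA_character_),
  b = c("0.384650146", "0.461486974", NA),
  stringsAsFactors = FALSE
)

# Convert everything except Date to numeric; NA stays NA
num_cols <- setdiff(names(a_df), "Date")
a_df[num_cols] <- lapply(a_df[num_cols], as.numeric)

str(a_df)
```

Selecting the columns by name rather than by position keeps the conversion correct if the Date column is not the first one.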