I am trying to read a data table into R. The data contains:
two columns with numeric values (continuous),
1556 columns with either 0 or 1, and
one last column with strings, representing two groups (group A and group B).
Some values are missing, and they appear as either ? or some spaces followed by ?. As a result, when I read the table into R, the numbers were read as characters.
For example, Data[1,1] is 125, but is.numeric(Data[1,1]) returns FALSE. I want to turn all the numbers back into numbers, and I want every ? (with or without spaces before it) turned into a missing value. I don't know how to do this. Thank you! (I have 3279 rows.)
You can set the na.strings argument of read.table() (see ?read.table) to na.strings = "?" when you read the data into R; the question marks should then be recognised as missing values. Since some of them are preceded by spaces in your data, additionally pass strip.white = TRUE inside the read.table() call so the surrounding whitespace is removed before values are matched.
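Putting the two together, a minimal sketch (the file name, separator, and header setting are assumptions about your data):
Data <- read.table("mydata.txt", header = TRUE, sep = "\t",
                   na.strings = "?",    # "?" becomes NA
                   strip.white = TRUE)  # so " ?" is stripped to "?" first
str(Data)  # the two continuous columns should now be numeric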
I am pretty new to R, and I wonder if I can replace NA values (which look like strings) with a blank, i.e. nothing.
It is easy when the entire table is character, but my table contains doubles as well, so when I try to run
f <- as.data.frame(replace(df, is.na(df), ""))
or
df[is.na(df)] <- ""
Neither works. The error is:
Assigned data `values` must be compatible with existing data.
Error occurred for column `ID`.
Can't convert <character> to <double>.
and I understand that, but I really need the IDs, like every other cell in the table (character, double or otherwise), to show a blank instead of NA. The table is later connected to a BI tool, and I can't present "NA" there, just a blank, for the sake of clarity.
If your column is of type double (numbers), you can't replace NAs (R's internal representation of missing values) with a character string. And "" IS a character string, even though it looks empty.
So you need to choose: convert your whole column to type character, or leave the missing values as NA.
EDIT:
If you really want to convert your numeric column to character, you can just use as.character(MYCOLUMN). But I think what you really want is:
Telling your exporting function how to treat NAs, which is easy, e.g. write.csv(df, na = ""). Also check the help page with ?write.csv.
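For example, a minimal sketch (the output file name is a placeholder):
write.csv(df, "output.csv", na = "", row.names = FALSE)  # NAs are written as empty cells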
df <- read.csv(
  text = '"2019-Jan","2019-Feb",
"3","1"',
  check.names = FALSE
)
OK, so I use check.names = FALSE and now my column names are not syntactically valid. What are the practical consequences?
df
#>   2019-Jan 2019-Feb
#> 1        3        1 NA
And why is this NA appearing in my data frame? I didn't put that in my code. Or did I?
Here's the check.names man page for reference:
check.names logical. If TRUE then the names of the variables in the
data frame are checked to ensure that they are syntactically valid
variable names. If necessary they are adjusted (by make.names) so that
they are, and also to ensure that there are no duplicates.
The only consequence is that you need to escape or quote the names to work with them. You either string-quote and use standard evaluation with the [[ column subsetting operator:
df[['2019-Jan']]
… or you escape the identifier name with backticks (R confusingly also calls this quoting), and use the $ subsetting:
df$`2019-Jan`
Both work, and can be used freely (as long as they don’t lead to exceedingly unreadable code).
To make matters more confusing, R allows using '…' and "…" instead of `…` in certain contexts:
df$'2019-Jan'
Here, '2019-Jan' is not a character string as far as R is concerned! It’s an escaped identifier name.¹
This last one is a really bad idea because it confuses names² with character strings, which are fundamentally different. The R documentation advises against this. Personally I’d go further: writing 'foo' instead of `foo` to refer to a name should become a syntax error in future versions of R.
¹ Kind of. The R parser treats it as a character string. In particular, both ' and " can be used, and are treated identically. But during the subsequent evaluation of the expression, it is treated as a name.
² “Names”, or “symbols”, in R refer to identifiers in code that denote a variable or function parameter. As such, a name is either (a) a function name, (b) a non-function variable name, (c) a parameter name in a function declaration, or (d) an argument name in a function call.
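One way to convince yourself of this is to capture the unevaluated expression with quote(): the parser accepts '2019-Jan' after $, but deparses it back with backticks because it has become a name:
e <- quote(df$'2019-Jan')
e
#> df$`2019-Jan`
class(e[[3]])  # the second argument of the `$` call
#> [1] "name"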
The NA issue is unrelated to the names. read.csv is expecting an input with no comma after the last column. You have a comma after the last column, so read.csv reads the blank space after "2019-Feb", as the column name of the third column. There is no data for this column, so an NA value is assigned.
Remove the extra comma and it reads properly. Of course, it may be easier to just remove the last column after using read.csv.
df <- read.csv(
  text = '"2019-Jan","2019-Feb"
"3","1"',
  check.names = FALSE
)
df
#   2019-Jan 2019-Feb
# 1        3        1
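If editing the raw text is not an option, the "remove the last column afterwards" route mentioned above might look like this (a sketch):
df <- read.csv(
  text = '"2019-Jan","2019-Feb",
"3","1"',
  check.names = FALSE
)
df[[ncol(df)]] <- NULL  # drop the empty, all-NA third column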
Consider df$foo, where foo is a column name: if the name is syntactically invalid, that form will not work without backticks.
As for the NA it’s a consequence of there being three columns in your first line and only two in your second.
I have about 30 columns, within a data frame of over 100 columns, where the file I am reading in stores its numbers as characters. In other words, 1300 is stored as 1,300, and R thinks it is a character.
I am trying to fix that issue by replacing the "," with nothing and turning the field into an integer. I do not want to use gsub on each affected column separately. I would rather store the affected columns in a vector and fix them all with one function call or loop.
I have tried using lapply, but am not sure what to put as the "x" variable.
Here is my attempt, with the error below it:
ItemStats_2014[intColList] <- lapply(ItemStats_2014[intColList],
as.integer(gsub(",", "", ItemStats_2014[intColList])) )
Error in [.data.table(ItemStats_2014, intColList) : When i is a
data.table (or character vector), the columns to join by must be
specified either using 'on=' argument (see ?data.table) or by keying x
(i.e. sorted, and, marked as sorted, see ?setkey). Keyed joins might
have further speed benefits on very large data due to x being sorted
in RAM.
The file I am reading in stores its numbers as characters [with commas as decimal separator]
Just read those columns in directly as numbers, not as strings:
data.table::fread() understands comma decimal separators when you pass dec = ','.
You might need to play with the fread(..., colClasses = ...) argument a bit to specify the integer columns:
myColClasses <- rep('character', 100) # for example...
myColClasses[intColList] <- 'integer'
# ...any other colClass fixup as needed...
ItemStats_2014 <- fread('your.csv', colClasses=myColClasses)
This approach is simpler and faster, and uses much less memory, than reading everything as character and converting afterwards.
Try using dplyr::mutate_at() to select multiple columns and apply a transformation to them.
ItemStats_2014 <- ItemStats_2014 %>%
  mutate_at(intColList, funs(as.integer(gsub(',', '', .))))
mutate_at selects columns from a list or using a dplyr selector function (see ?select_helpers) then applies one or more functions to each column. The . in gsub refers to each selected column that mutate_at passes to it. You can think of it as the x in function(x) ....
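The error in the question comes from [.data.table, so ItemStats_2014 is presumably a data.table; a data.table-native sketch of the same conversion (assuming intColList is a character vector of column names) would be:
library(data.table)
# convert the listed columns in place: strip commas, then cast to integer
ItemStats_2014[, (intColList) := lapply(.SD, function(x) as.integer(gsub(",", "", x))),
               .SDcols = intColList]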
I loaded my dataset (original.csv) to R:
original <- read.csv("original.csv")
str(original) showed that my dataset has 16 variables (14 factors, 2 integers), and 14 variables have missing values. That was fine, but 3 variables that are really numbers were read as factors.
I searched the web and found a command: as.numeric(as.character(original$Tumor_Size))
(Tumor_Size is one of the variables that was read as a factor.)
By the way, missing values in my dataset are marked with a dot (.).
After running as.numeric(as.character(original$Tumor_Size)), the values of Tumor_Size were listed, and at the end a warning message appeared: "NAs introduced by coercion".
I expected the command to convert the variable to numeric, but a second str(original) showed that my guess was wrong: Tumor_Size and the other two variables were still factors. Below is a sample of my dataset:
[screenshot: a piece of my dataset]
How can I solve my problem?
The crucial information here is how missing values are encoded in your data file; the corresponding argument of read.csv() is called na.strings. (As an aside, as.numeric(as.character(...)) returns a new vector, so str(original) was unchanged because the result was never assigned back to original$Tumor_Size.) So if dots are used:
original <- read.csv("original.csv", na.strings = ".")
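na.strings also accepts a vector, so if the file mixes several missing-value markers (an assumption about your data), you can list them all:
original <- read.csv("original.csv", na.strings = c(".", "", "NA"))
str(original)  # Tumor_Size should now be numeric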
I'm not 100% sure what your problem is but maybe this will help....
original<-read.csv("original.csv",header = TRUE,stringsAsFactors = FALSE)
original$Tumor_Size<-as.numeric(original$Tumor_Size)
This will introduce NAs because the dot (.) cannot be converted to a numeric value. If you then replace the NAs with a dot again, the column is returned as character; to do this you can use:
original$Tumor_Size[is.na(original$Tumor_Size)]<-"."
Hope this helps.
I'm using the write function in R with a matrix, and this is what I have:
write(my_mtx,file='mtx.tsv',sep='\t')
But this gives me a file with one column. I've also tried adding an ncolumns argument:
write(my_mtx, ncolumns = length(colnames(my_mtx)), file = 'mtx.tsv', sep = '\t')
But this just gives me a repeat of the one column, as opposed to the actual separated columns as they appear in the matrix. A little help?
Try using write.table instead
write.table(my_mtx, file = 'mtx.tsv', sep = '\t', col.names = FALSE, row.names = FALSE)
It will then default to the correct number of columns, and there is no need to transpose.
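If you do want to keep the matrix's column names in the file, one option (a sketch) is col.names = NA, which writes an empty header cell above the row-name column so the header lines up:
write.table(my_mtx, file = 'mtx.tsv', sep = '\t', quote = FALSE, col.names = NA)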
The default for write() is one column if the data are character and five columns if the data are numeric, and it fills by rows while R stores matrices by columns (see ?write), which is why the matrix must be transposed. Try this:
write(t(my_mtx), file = 'mtx.tsv', sep = '\t', ncolumns = ncol(my_mtx))