R - NAs turn columns into character class (should be integer/numeric)

I imported a huge dataset with a lot of missing values, written as N/A or NA.
This is how I import the data:
Databsp<-read.csv("C:/Users/adminfor/Desktop/Neuer Ordner/Pseudonymized-Genet-Treatment-Summary-20220201120538.csv", na.strings=TRUE)
Next, I converted all the NA or N/A strings to actual NAs using the following code:
# replace_with_na_all() and common_na_strings come from the naniar package
a <- Databsp %>% replace_with_na_all(condition = ~.x %in% common_na_strings)
Now my question: why are columns that include only numbers and NAs of class "character" and not integer/numeric? I tried several approaches, but nothing seems to help...

You never actually change your column classes. The column classes are set when you import your data, and nothing you do afterwards changes them. If a column in your CSV file has only numeric and NA values when you import it, it will be numeric. But if it has strings (including strings you haven't yet told R are NA-equivalent, like "N/A"), then read.csv must read the column as character class, because those values are not numeric. Later, you replace those NA-equivalent values with actual NAs, but that replaces values only; it does not change the class of the columns.
The band-aid solution would be to patch this after the fact: add an extra step after you replace the NA values, using the type.convert() function to re-assess the columns and convert them as necessary: a <- type.convert(a).
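For instance, a minimal sketch with made-up data (assuming R >= 4.0, where type.convert() works directly on data frames; as.is = TRUE keeps strings as character rather than converting them to factor):
a <- data.frame(x = c("1", "2", "N/A"), y = c("7.5", "8.1", "9.9"))
a[a == "N/A"] <- NA                 # stand-in for the replace_with_na_all step
str(a)                              # both columns are still character
a <- type.convert(a, as.is = TRUE)  # re-detect column types now that only real NAs remain
str(a)                              # x is now integer, y is now numeric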
The better solution is to give read.csv your list of NA-equivalent strings when you read in the data. This is exactly what the na.strings argument is for. From ?read.csv:
na.strings
a character vector of strings which are to be interpreted as NA values.
So change your import line to
Databsp <- read.csv(
"C:/Users/adminfor/Desktop/Neuer Ordner/Pseudonymized-Genet-Treatment-Summary-20220201120538.csv",
na.strings = common_na_strings
)
And then the columns should be classed appropriately when you read them in, and you can skip the replace_with_na_all step as it is already taken care of. Relatedly, your current na.strings = TRUE does nothing because TRUE is not a character vector.
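To see the difference, here is a minimal sketch with a made-up two-column file (tmp.csv is hypothetical, written just for the demonstration):
writeLines(c("id,score", "1,10", "2,N/A"), "tmp.csv")
str(read.csv("tmp.csv"))                      # score is character, because "N/A" is a string
str(read.csv("tmp.csv", na.strings = "N/A"))  # score is integer, with an NA in row 2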

Related

fread reading data structure wrong with quotes

I have a 5 GB data file to load. fread seems to be a fast way to load it, but it reads all my data structures wrong. It looks like it is the quotes that cause the problem.
# Code. I don't know how to put the raw CSV data here.
dt <- fread("data.csv", header = TRUE)
dt2 <- read.csv("data.csv", header = TRUE)
str(dt)
str(dt2)
In the str() output, all of fread's columns come back as character, regardless of whether they should be numeric or character.
It's curious that fread didn't use numeric for the id column; maybe some entries contain non-numeric values? The documentation suggests the use of the colClasses parameter:
dt <- fread("data.csv", header = T, colClasses = c("numeric", "character"))
The documentation has a warning for using this parameter:
A character vector of classes (named or unnamed), as read.csv. Or a named list of vectors of column names or numbers, see examples. colClasses in fread is intended for rare overrides, not for routine use. fread will only promote a column to a higher type if colClasses requests it. It won't downgrade a column to a lower type since NAs would result. You have to coerce such columns afterwards yourself, if you really require data loss.
It looks as if the fread command will detect the type in a particular column and then assign the lowest type it can to that column based on what the column contains. From the fread documentation:
A sample of 1,000 rows is used to determine column types (100 rows from 10 points). The lowest type for each column is chosen from the ordered list: logical, integer, integer64, double, character. This enables fread to allocate exactly the right number of rows, with columns of the right type, up front once. The file may of course still contain data of a higher type in rows outside the sample. In that case, the column types are bumped mid read and the data read on previous rows is coerced.
This means that if you have a column with mostly numeric type values it might assign the column as numeric, but then if it finds any character type values later on it will coerce anything read up to that point to character type.
The long and short of these type conversions is that converting a character column to numeric turns any non-numeric values into NA, and converting a double column to integer loses precision.
You might be okay with this loss of precision, but fread will not allow you to do this conversion using colClasses. You might want to go in and remove non-numeric values yourself.
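If you can accept that, one sketch of the coerce-it-yourself route (assuming the data.csv file and id column from the question, and that stray non-numeric entries may simply become NA):
library(data.table)
dt <- fread("data.csv", header = TRUE)  # id comes back as character
dt[, id := as.numeric(id)]              # non-numeric entries become NA (with a warning)
dt <- dt[!is.na(id)]                    # optionally drop the rows that failed to parse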

changing specific area from character to numeric in R programming

I use RStudio and imported a CSV file from the web.
data <- read.csv("http://databank.worldbank.org/data/download/GDP.csv", stringsAsFactors = FALSE)
In the file, column X.3 is of type character.
I want to convert rows 5 to 202 of that column from character to numeric so that I can calculate their mean. But when I use the line below, the values still remain character:
data[c(5:202),"X.3"] <- as.numeric(gsub(",","",data[c(5:202),"X.3"]))
When I type class(data[10, "X.3"]), the output is still "character".
I am able to convert the whole column to numeric using
data[,"X.3"] <- as.numeric(gsub(",","",data[,"X.3"]))
but I want to convert only specific rows, i.e. 5 to 202, because the other rows of the column become NA. I am not sure how to do it.
The following changes to your code will make the column numeric:
data <- read.csv("http://databank.worldbank.org/data/download/GDP.csv", header = TRUE, stringsAsFactors = FALSE, skip = 3)
# skip the first 3 rows, which are just empty space/junk, and take the next row as the header
data <- data[-1, ]
# remove the first line after the header, which is not a data row
data$US.dollars. <- as.numeric(gsub(",", "", data$US.dollars.))
# strip the thousands-separator commas so the character values can be converted to numeric
hist(data$US.dollars.)  # sample plot
As mentioned in the comment, you cannot keep part of a column character and part numeric: R doesn't allow that, and it forces type conversion to the higher order, in this case from numeric to character. This is R's implicit coercion.
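A quick illustration of that coercion with a throwaway vector:
x <- c(1, 2, "three")  # mixing numeric and character values...
class(x)               # ..."character": the whole vector is promoted
as.numeric(x)          # 1 2 NA -- "three" cannot be parsed, so it becomes NA (with a warning)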

Change data frame with factors to a big matrix R

I have a big data frame (22k rows, 400 columns) that was generated using read.csv from a CSV file. It appears that every column is a factor and all the row values are the levels of this factor.
I now want to do some analysis (like PCA), but I can't work with it unless it is a matrix; yet even when I cast it as a matrix, all I get is
> prcomp(as.matrix(my_data))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Is there a way of transforming this data frame with factors to a simple big matrix?
I am new to R, so forgive all the (maybe terrible) mistakes.
Thanks
You can do it this way:
df <- data.frame(a = as.factor(c(1, 2, 3)), b = as.factor(c(2, 3, 4)))
m <- apply(apply(df, 1, as.character), 1, as.numeric)
apply applies a function over the margins of the given data.frame. It is important not to skip the conversion to character first, because otherwise the factors would be converted to their internal integer codes rather than the values they display.
To add column names, do this:
m <- m[-1, ]  # removes the first 'empty' row
colnames(m) <- c("a", "b")  # replace the right-hand side with your desired column names, e.g. the first row of your data.frame
One more tip: you probably read the data.frame from a file. If you set the parameter header = TRUE when reading it, the first row will be used as column names instead of ending up in the data, and the column names of the data.frame will be correct.
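To see why the detour through as.character matters, here is a minimal illustration of the factor pitfall:
f <- factor(c("10", "20", "30"))
as.numeric(f)                 # 1 2 3 -- the internal level codes, not the values
as.numeric(as.character(f))   # 10 20 30 -- the values you actually want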

Why is an extra character ("X") added to the rownames of my data.frame?

Originally, the rownames of my data (a CSV file) were:
1
5
33
37
However, as I read in my data using:
data <- read.table(data, sep = ",", header = TRUE)
it seems that the rownames have been transformed to:
X1
X5
X33
X37
Are there any common reasons for why this is happening? It's entirely unintentional.
Are you sure that it is the row names that are changing, and not the column names?
R generally does not modify row names, it is happy with numbers (converted to character) as row names.
The standard functions for reading in data (read.table and relatives) will convert column names that are not valid R names into ones that are; for integers this means prepending the X that you describe. If you don't want this behavior (though skipping this helpful feature could lead to other complications down the line), look at the check.names argument to read.table.
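For illustration, a minimal sketch (tmp.csv is a made-up file whose header row consists of bare integers):
make.names("1")  # "X1" -- the transformation read.table applies to invalid column names
writeLines(c("1,5,33,37", "a,b,c,d"), "tmp.csv")
names(read.table("tmp.csv", sep = ",", header = TRUE))                       # "X1" "X5" "X33" "X37"
names(read.table("tmp.csv", sep = ",", header = TRUE, check.names = FALSE))  # "1" "5" "33" "37"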
If it really is the row names that are changing then we will need a reproducible example with a text file to read in and the exact command that you used to read in the file.

R Filtering Out Rows with missing values

I have a CSV file where (in one column) some values are missing, and I want to omit the corresponding rows of the data frame.
I thought that by writing
data <- read.csv(file="name.csv",head=TRUE,sep=";", na.strings = "NA")
the na.strings = "NA" option replaces missing values with NA, and then I can use
cleanData <- na.omit(data) or cleanData <- data[complete.cases(data), ]
to filter out the missing parts.
But even after applying the first part, i.e. including the na.strings = "NA" option, the resulting data frame still contains rows with empty entries and not with NA entries.
Does anybody know what went wrong?
To answer the question you raise in the comments:
I believe you have the purpose of the na.strings argument turned around. It doesn't tell R how to replace NAs. Rather, it tells R which values in the input file should be treated as NAs.
For example, you might run into a data source that uses -1 to indicate that the data is missing, in which case you would use na.strings = '-1'.
If you look at ?read.csv:
na.strings
a character vector of strings which are to be interpreted as NA values. Blank fields are also considered to be missing values in logical, integer, numeric and complex fields.
You are using na.strings incorrectly. To replace empty fields with NA after import, do data[data == ""] <- NA.
Then try data.frame.instance <- data.frame.instance[complete.cases(data.frame.instance), ] and you should be left with a data.frame without any NAs. (complete.cases() works row-wise; subsetting with !is.na() on a whole data frame returns a matrix and does not select rows correctly.)
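Putting it together, a minimal sketch with a made-up semicolon-separated file (tmp.csv is hypothetical): declare blanks as NA at import, then drop the incomplete rows.
writeLines(c("name;score", "anna;2", ";4", "carl;6"), "tmp.csv")
data <- read.csv("tmp.csv", sep = ";", na.strings = c("", "NA"))  # blank character fields become NA
cleanData <- na.omit(data)  # the middle row, whose name was blank, is dropped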
