I want to load an Excel file into a data frame in R.
It is a large file with a lot of numbers and some #NV values.
The import works well for the majority of columns (there are 4,000 columns in total). But for some columns, R converts the values to TRUE or FALSE, creating a logical (boolean) column.
I don't want that, since all of the columns are supposed to be numeric.
Do you know why R does that?
It would really help if you provided code snippets, because there are many different Excel-to-data-frame libraries/methods/behaviors.
But assuming that you are using readxl, its read_excel function has a guess_max parameter for exactly this kind of case; guess_max is 1000 by default.
Try df <- read_excel(path = filepath, sheet = sheet_name, guess_max = 100000)
Since a data frame column cannot mix data types, read_excel has to scan your Excel file and guess each column's type before actually filling the data frame. If a column happens to contain only NA values in the first 1000 rows, read_excel assumes it is a logical (boolean) column, and every value encountered in later rows is cast accordingly. Setting guess_max to something huge makes read_excel slower, but it can avoid numerics being cast to booleans.
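If you already know every column should be numeric, a simpler option is to skip type guessing entirely and force the column types. This is a minimal sketch assuming the readxl package, reusing the filepath and sheet_name placeholders from the snippet above; the na argument only helps if the #NV entries are stored as text rather than as Excel error cells.
library(readxl)
# A single col_types value is recycled across all columns, so every column
# is read as numeric; cells that cannot be parsed become NA (with a warning).
df <- read_excel(path = filepath, sheet = sheet_name,
                 col_types = "numeric", na = "#NV")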
Related
Complete noob here, especially with R.
For a school project I have to work with a specific dataset that doesn't come with column names in the dataset itself, but there is a .txt file with extra information about the dataset, including the column names. The problem I'm having is that when I load the dataset, RStudio assumes the first line of data is actually the column names. Initially I just substituted the names with colnames(), but by doing so I ended up ignoring/deleting the first line of data, and I'm sure that's not the right way of dealing with it.
How can I go about adding the correct column names without deleting the first line of data? (Preferably inside R due to school work requirements)
Thanks in advance!
When reading the data with read.table, use header = FALSE so that the first row is treated as data and default column names (V1, V2, ...) are assigned automatically
df1 <- read.table('file.txt', header = FALSE)
Then, we can assign the preferred column names from the other .txt file
colnames(df1) <- scan('names.txt', what = '', quiet = TRUE)
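Equivalently, the names can be passed straight to read.table via its col.names argument; a small sketch, assuming names.txt holds one name per whitespace-separated entry:
# Read the names first, then hand them to read.table directly
nms <- scan('names.txt', what = '', quiet = TRUE)
df1 <- read.table('file.txt', header = FALSE, col.names = nms)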
I have a 5 GB data file to load. fread seems to be a fast way to load it, but it reads all my data types wrong. It looks like it is the quotes that cause the problem.
# Code. I don't know how to put raw CSV data here.
dt <- fread("data.csv", header = TRUE)
dt2 <- read.csv("data.csv", header = TRUE)
str(dt)
str(dt2)
This is the output. All columns read by fread come out as character, regardless of whether they should be numeric or character.
It's curious that fread didn't use numeric for the id column; maybe some entries contain non-numeric values?
The documentation suggests using the colClasses parameter.
dt <- fread("data.csv", header = T, colClasses = c("numeric", "character"))
The documentation has a warning for using this parameter:
A character vector of classes (named or unnamed), as read.csv. Or a named list of vectors of column names or numbers, see examples. colClasses in fread is intended for rare overrides, not for routine use. fread will only promote a column to a higher type if colClasses requests it. It won't downgrade a column to a lower type since NAs would result. You have to coerce such columns afterwards yourself, if you really require data loss.
It looks as if the fread command will detect the type in a particular column and then assign the lowest type it can to that column based on what the column contains. From the fread documentation:
A sample of 1,000 rows is used to determine column types (100 rows from 10 points). The lowest type for each column is chosen from the ordered list: logical, integer, integer64, double, character. This enables fread to allocate exactly the right number of rows, with columns of the right type, up front once. The file may of course still contain data of a higher type in rows outside the sample. In that case, the column types are bumped mid read and the data read on previous rows is coerced.
This means that if you have a column with mostly numeric type values it might assign the column as numeric, but then if it finds any character type values later on it will coerce anything read up to that point to character type.
You can read about these type conversions here, but the long and short of it seems to be that trying to convert a character column to numeric for values that are not numeric will result in those values being converted to NA, or a double might be converted to an integer, leading to a loss of precision.
You might be okay with this loss of precision, but fread will not allow you to do this conversion using colClasses. You might want to go in and remove non-numeric values yourself.
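If you do need the column as numeric, one option is to inspect the offending entries and then coerce the column yourself afterwards. A rough sketch, assuming the troublesome column is the id column mentioned above and dt is the data.table read by fread earlier:
# Entries that are not missing but cannot be parsed as numbers
bad <- dt$id[!is.na(dt$id) & is.na(suppressWarnings(as.numeric(dt$id)))]
unique(bad)
# After inspecting or cleaning them, coerce the column yourself;
# anything non-numeric becomes NA
dt[, id := suppressWarnings(as.numeric(id))]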
I have a dataset with nearly 30,000 rows and 1,935 variables (columns). Many of these, around 350, are character variables. I can change the data type of an individual column with as.numeric, but it is painful to search for the character columns and apply it to each of them individually. I tried writing a function with a loop, but since the data are huge, my laptop crashes.
Please help.
Something like
take <- sapply(data, is.numeric)
which(take == FALSE)
identifies which variables are not numeric. I don't know how to extract them automatically off the top of my head, so put those column numbers in yourself, e.g.
data[, putcolumnsnumbershere] <- lapply(data[, putcolumnsnumbershere], as.numeric)
use
sapply(your.data, typeof)
to create a vector of variable types, then use this vector to identify the character columns to be converted.
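For example, a vectorised sketch that converts all character columns in one pass, which avoids the explicit row-by-row loop that is likely exhausting memory; your.data is the placeholder name from the line above:
# Logical vector flagging the character columns
char_cols <- sapply(your.data, typeof) == "character"
# Convert them all at once; entries that are not valid numbers become NA
your.data[char_cols] <- lapply(your.data[char_cols],
                               function(x) suppressWarnings(as.numeric(x)))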
I have two columns I am trying to subtract and put into a new one, but one of them contains values that read '#NULL!' after converting over from SPSS and Excel, so R reads the column as a factor and will not let me subtract. What is the easiest way to fix this, given that I have 19,000+ rows of data?
While reading the dataset with read.table/read.csv, we can use the na.strings argument to specify the values that need to be treated as NA, i.e. missing values. For your dataset it would be
dat <- read.table('yourfile.txt', na.strings=c("#NULL!", "-99999", "-88888"),
header=TRUE, stringsAsFactors=FALSE)
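If the data are already loaded and the column has come in as a factor, it can also be repaired in place. A sketch, with x and y standing in for your two column names; note the as.character step, since as.numeric applied directly to a factor returns its internal level codes:
# Convert the factor columns back to numeric; '#NULL!' entries become NA (with a warning)
dat$x <- as.numeric(as.character(dat$x))
dat$y <- as.numeric(as.character(dat$y))
# The subtraction now works, with NA propagating through the missing values
dat$diff <- dat$x - dat$y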
I have a set of data in R containing both characters and numerics. I know that I cannot save them in a data frame, since each column must have a single type. Therefore I saved them in a list.
Is there a way to save this list into an excel sheet so that the numerics are saved as numbers and the characters as characters?
I have been using XLConnect so far, and it works for a data frame where each column is either all numerics or all characters.
Thanks in advance
Example:
I have several measurements of different quantities (let's call them a and b).
These measurements can either be a number (numeric) or "n.d." (character). For example, three measurements:
First: a = 5, b ="n.d."
Second: a = 3, b = 4
Third: a = "n.d.", b = "n.d."
EDIT:
The solution given by Roland works great:
Encode the missing values as NA in the data frame so that all columns are numeric.
Then use
setMissingValue(wb, value = "n.d.")
to set the missing cells to "n.d." in the Excel sheet
Thanks a lot for your time and help
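For reference, a minimal end-to-end sketch of that XLConnect workflow; the file name and sheet name are made up, and the values match the example above:
library(XLConnect)
# Numeric data frame, with NA standing in for the "n.d." measurements
df <- data.frame(a = c(5, 3, NA), b = c(NA, 4, NA))
wb <- loadWorkbook("measurements.xlsx", create = TRUE)
createSheet(wb, name = "data")
# Write NA cells as the string "n.d." in the Excel sheet
setMissingValue(wb, value = "n.d.")
writeWorksheet(wb, df, sheet = "data")
saveWorkbook(wb)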
Given the example:
df <- data.frame(a = c(5, 'n.d.', 'n.d.'), b = c('n.d.', 4, 'n.d.'))
you can use
write.csv(df,file='data.csv')
Then you can import the CSV file into Excel manually.
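Alternatively, write.csv's na argument can write the "n.d." strings for you while keeping the data frame itself numeric; a sketch using the same example values:
# Keep the columns numeric and use NA for the missing measurements
df <- data.frame(a = c(5, 3, NA), b = c(NA, 4, NA))
# na= controls how missing values are written to the file
write.csv(df, file = 'data.csv', na = 'n.d.', row.names = FALSE)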