Importing data from csv into R, column not populated - r

I am new to R programming. I have imported a csv file using the following function
PivotTest <- read.csv(file.choose(), header=T)
The csv file has 7 columns: Meter_Serial_No, Reading_Date, Reading_Description, Reading_value, Entry_Processed, TS_Inserted, TS_LastUpdated.
After importing, the Meter_Serial_No column is filled with zeros even though that column contains data in the csv file. When I try to view the data in that particular column (PivotTest$Meter_Serial_No), it returns NULL. Can anyone assist me, please?
Furthermore, the csv that I'm importing has more than 127,000 rows. When I test with only 10 rows of data, the Meter_Serial_No column is not replaced with zeros.

It depends on the class of the values in that column (PivotTest$Meter_Serial_No). I believe there is a problem in the type conversion; try the following.
# Read Meter_Serial_No (the first column) as character so the serial
# numbers are not coerced; the remaining six columns are read as numeric
# (adjust the types to match your actual columns)
PivotTest <- read.csv("test.csv", header = TRUE,
                      colClasses = c("character", rep("numeric", 6)))

Related

read_csv (readr, R) populates entire column with NA if there are NA in the first 1000 + x observations in a simple and clean csv (parsing failure)

I was just going through a tremendous headache caused by read_csv messing up my data by substituting content with NA while reading simple and clean csv files.
I’m iterating over multiple large csv files that add up to millions of observations. Some columns contain quite a few NAs for some variables.
When reading a csv that contains NA in a certain column for the first 1000 + x observations, read_csv populates the entire column with NA, and the data in that column is lost for further operations.
The warning message “Warning: x parsing failures” is shown, but as I’m reading multiple files I cannot check them file by file. Even then, I would not know an automated fix for the parsing problem that problems(x) reports.
Using read.csv instead of read_csv avoids the problem, but it is slow and I run into encoding issues (handling different encodings requires too much memory for large files).
One option to work around this bug is to add a first row to the data that contains a value for every column, but I would still need to read the file somehow first.
See a simplified example below:
library(readr)

## create a data frame with an empty numeric column and an empty character column
df <- data.frame(id = numeric(), string = character(),
                 stringsAsFactors = FALSE)

## populate the columns
df[1:1500, 1] <- 1:1500
df[1500, 2] <- "something"

# variable string gets its first value in obs. 1500
df[1500, ]

## check the number of NAs in variable string
sum(is.na(df$string))  # 1499

## write the df
write_csv(df, "df.csv")

## read the df with read_csv and read.csv
df_readr <- read_csv("df.csv")
df_read_standard <- read.csv("df.csv")

## check the number of NAs in variable string
sum(is.na(df_readr$string))          # 1500
sum(is.na(df_read_standard$string))  # 1499

## the read_csv result is all NA for variable string
problems(df_readr)  ## What should that tell me? How can I fix it?
Thanks to MrFlick for answering in a comment on my question:
The whole reason read_csv can be faster than read.csv is because it can make assumptions about your data. It looks at the first 1000 rows to guess the column types (via guess_max) but if there is no data in a column it can't guess what's in that column. Since you seem to know what's supposed to be in the columns, you should use the col_types= parameter to tell read_csv what to expect rather than making it guess. See the ?readr::cols help page to see how to tell read_csv what it needs to know.
Also guess_max = Inf overcomes the problem, but the speed advantage of read_csv seems to be lost.
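For reference, here is a minimal sketch of that col_types approach applied to the example above (the column names id and string come from the example; adjust them and the types to your own files):
library(readr)

# Declare the column types up front so read_csv does not have to guess
# them from the first 1000 rows of each file
df_readr <- read_csv("df.csv",
                     col_types = cols(
                       id     = col_double(),
                       string = col_character()
                     ))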

How do I export a data frame created by the read_bulk function to a .csv file in R?

I would like help on how to export the data frame created by merging .csv files with the read_bulk function (readbulk package) in R. I have 9 .csv files with data that I merged using read_bulk. I tried to create an empty data frame first and then populate it with the merged data; my plan was then to export the data frame to a .csv file. This was unsuccessful, and I got stuck on populating the empty data frame. Is there a better approach, or an edit I can make to the code I am using? Here is the code:
>data1=data.frame(Row=numeric(),Title=character(),Color=character(),Value=numeric(),stringsAsFactors=FALSE)
>read_bulk(directory=directory, data=data1)
Data is read from the .csv files and displays correctly on my R Console
>data1
[1] Row Title Color Value
<0 rows> (or 0-length row.names)
Ensure your working directory is set to where your CSV files are. read_bulk returns the merged data frame, so assign its return value instead of trying to populate an empty data frame; this worked for me, with the combined data frame stored as data1:
library(readbulk)
data1 <- read_bulk(directory = getwd(), subdirectories = FALSE,
                   extension = ".csv", data = NULL)
Then you can write your merged data to a combined csv file:
write.csv(data1, "bulk.csv", row.names = FALSE, na = "")

Crosschecking Two Text Files with R

I need help initiating/writing an R script that can cross-check one .txt file against another to find where there are multiple instances. I have an excel file with a list of miRNAs that I want to cross-reference with another excel file that has a column containing the same miRNA names.
So is it a text file or excel file you are importing? Without more details, file structure, or a reproducible example, it will be hard to help.
You can try:
# Get the correct package
install.packages('readxl')
df1 <- readxl::read_excel('[directory name][file name].xlsx')
df2 <- readxl::read_excel('[directory name][file name].xlsx')
# Create a new flag variable indicating whether each miRNA in your first dataset is in the second dataset
df1$is_in_df2 <- ifelse(df1$miRNA %in% df2$miRNA, 'Yes', 'No')
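If you need the overlapping miRNAs themselves rather than a Yes/No flag, here is a base-R sketch (assuming the same df1, df2, and miRNA column as above):
# Names present in both files
shared_mirnas <- intersect(df1$miRNA, df2$miRNA)
# Rows of the first file whose miRNA also appears in the second
df1_matched <- df1[df1$miRNA %in% df2$miRNA, ]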

Is there any way to assign the col_types by column names in read_excel(readxl) in R

My application reads xls and xlsx files using the read_excel function of the readxl package.
The sequence and the exact number of columns are not known in advance when reading the xls or xlsx file. There are 15 predefined columns, of which 10 are mandatory and the remaining 5 are optional. So the file will always have at least 10 columns and at most 15.
I need to specify the col_types for the 10 mandatory columns. The only way I can think of is using the column names to specify the col_types, since I know the file always has all 10 mandatory columns, but in a random order.
I tried to find a way of doing this but failed.
Can anyone help me find a way to assign the col_types by column names?
I solved the problem with the workaround below. It is not the best way to solve it, though: I read the excel file twice, which will hurt performance if the file holds a very large volume of data.
First read: building the column data type vector. Read the file to retrieve the column information (the column names, the number of columns and their types) and build the column_data_types vector, which holds the data type for every column in the file.
# reading the .xlsx file (first pass, just to get the column names)
site_data_columns <- read_excel(paste(File$datapath, ".xlsx", sep = ""))
site_data_column_names <- colnames(site_data_columns)

# build the vector of column types
column_data_types <- character(length(site_data_column_names))
for (i in seq_along(site_data_column_names)) {
  # where "date" is a column name
  if (site_data_column_names[i] == "date") {
    column_data_types[i] <- "date"
  # where "result" is a column name
  } else if (site_data_column_names[i] == "result") {
    column_data_types[i] <- "numeric"
  } else {
    column_data_types[i] <- "text"
  }
}
Second read: reading the file content. Read the excel file again, supplying the col_types parameter with the column_data_types vector that holds the column data types.
# reading the .xlsx file (second pass, with the column types)
site_data <- read_excel(paste(File$datapath, ".xlsx", sep = ""), col_types = column_data_types)
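A possible refinement of the same idea, sketched below, avoids parsing the whole sheet twice: read zero data rows on the first pass so that only the header is parsed, then build the type vector from the names. The file name here is a placeholder, and the "date"/"result" column names are the ones from the workaround above.
library(readxl)

path <- "site_data.xlsx"  # placeholder file name

# First pass: n_max = 0 returns a zero-row tibble, i.e. just the header
header_only <- read_excel(path, n_max = 0)
col_names <- names(header_only)

# Default every column to "text", then override the known columns
column_data_types <- rep("text", length(col_names))
column_data_types[col_names == "date"] <- "date"
column_data_types[col_names == "result"] <- "numeric"

# Second pass: the real read, with types matched to the column names
site_data <- read_excel(path, col_types = column_data_types)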

replace NA with nothing in R changes variable types to categorical

Is there any way, without converting to character, to replace NA with blank or nothing?
I used
data_model <- sapply(data_model, as.character)
data_model[is.na(data_model)] <- " "
data_model=data.table(data_model)
however, it changes all the columns' types to categorical.
I want to save the data set and use it in SAS, and SAS does not understand NA.
Here's a somewhat belated answer (and shameless self-promotion from The R Primer) on how to export a data frame to SAS. It should handle your NAs correctly automatically:
First, use the foreign package to export the data frame as a SAS xport dataset. Here, I'll just export the trees data frame.
library(foreign)
data(trees)
write.foreign(trees, datafile = "toSAS.dat",
              codefile = "toSAS.sas", package = "SAS")
This gives you two files, toSAS.dat and toSAS.sas. It is easy to get the data into SAS, since the code file toSAS.sas contains a SAS script that SAS can read and run directly to load the data in toSAS.dat.
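If the goal is simply a csv file in which NA cells are written as empty fields (without converting every column to character), here is a sketch using base R's write.csv and its na argument ("data_model.csv" is a placeholder file name):
# Write the data set with NA cells left empty; the column types are untouched
write.csv(data_model, "data_model.csv", row.names = FALSE, na = "")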
