NULL values, regression / correlation switch - r

I have a dataset with, let's say, 2 variables. I want to do some regression testing, but quite a few of the numeric observations are "NULL". I would still like to use those observations, but I don't want to convert the NULLs to a specific number, e.g. 99999.
I have tried all the different approaches I found by googling, and none of them work.
Benny2 <- read_excel("C:/Users/EH9508/Desktop/Benny2.xlsx")
I have two variables, "Days" and "Amount"; both contain numeric values and "NULL" entries.
Any help would be appreciated.

You can convert the file/sheet to csv from Excel (save as > csv) and then:
mydata <- read.csv("path/to/file.csv")
If you don't have access to Excel, then this is how it goes with the xlsx library:
library("xlsx")
mydata <- read.xlsx("path/to/file.xlsx", sheetIndex = 1)
If you put the csv/xlsx file in the same folder as your R script, you can type the file name without the path as read.xlsx("file.xlsx").
If you already have your data in R and are wondering how to get the NULL converted to a given value, try this:
mydata <- matrix(rnorm(10),5,2) # Your data
mydata[2,1] <- NA # Some NA values
mydata[5,2] <- NA
mydata[is.na(mydata)] <- 99999 # Replace the NA entries in mydata with 99999
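Since the question explicitly asks not to recode the NULLs as a number like 99999, here is a minimal sketch of the readxl route instead: the na argument of read_excel tells it which strings to treat as missing (the path is the one from the question; adjust the sheet as needed).
library(readxl)
# Treat the literal text "NULL" as a missing value while importing
Benny2 <- read_excel("C:/Users/EH9508/Desktop/Benny2.xlsx", na = "NULL")
# "Days" and "Amount" then come in as numeric with NA where the cells said "NULL",
# and lm() simply drops those rows by default (na.action = na.omit)
fit <- lm(Amount ~ Days, data = Benny2)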

Related

read_csv (readr, R) populates an entire column with NA if there are NAs in the first 1000 + x observations of a simple and clean csv (parsing failure)

I was just going through a tremendous headache caused by read_csv messing up my data by substituting content with NA while reading simple and clean csv files.
I'm iterating over multiple large csv files that add up to millions of observations. Some columns contain quite a few NAs for some variables.
When reading a csv that contains NA in a certain column for the first 1000 + x observations, read_csv populates the entire column with NA, and thus the data is lost for further operations.
The warning message "Warning: x parsing failure" is shown, but as I'm reading multiple files I cannot check this file by file. Even so, I do not know of an automated fix for the parsing problem, which is also indicated by problems(x).
Using read.csv instead of read_csv does not cause the problem, but it is slow and I run into encoding issues (using different encodings requires too much memory for large files).
An option to overcome this bug is to add a first observation (first row) to your data that contains something for each column, but still I need to read the file first somehow.
See a simplified example below:
library(readr)
## create a data frame
df <- data.frame(id = numeric(), string = character(),
                 stringsAsFactors = FALSE)
## populate the columns
df[1:1500,1] <- seq(1:1500)
df[1500,2] <- "something"
# variable string contains the first value in obs. 1500
df[1500,]
## check the number of NAs in variable string
sum(is.na(df$string)) # 1499
##write the df
write_csv(df, "df.csv")
##read the df with read_csv and read.csv
df_readr <- read_csv('df.csv')
df_read_standard <- read.csv('df.csv')
##check the number of NA in variable string
sum(is.na(df_readr$string)) #1500
sum(is.na(df_read_standard$string)) #1499
## the read_csv version is all NA for variable string
problems(df_readr) ##What should that tell me? How to fix it?
Thanks to MrFlick for the answering comment on my question:
The whole reason read_csv can be faster than read.csv is because it can make assumptions about your data. It looks at the first 1000 rows to guess the column types (via guess_max) but if there is no data in a column it can't guess what's in that column. Since you seem to know what's supposed to be in the columns, you should use the col_types= parameter to tell read_csv what to expect rather than making it guess. See the ?readr::cols help page to see how to tell read_csv what it needs to know.
Also guess_max = Inf overcomes the problem, but the speed advantage of read_csv seems to be lost.
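Following MrFlick's suggestion, here is a minimal sketch of the col_types= approach for the toy example above (the column names match the dummy data frame; adjust them for the real files):
library(readr)
# Declare the column types up front so read_csv does not have to guess them
# from the first 1000 rows (where `string` is entirely NA)
df_readr <- read_csv("df.csv",
                     col_types = cols(id = col_double(),
                                      string = col_character()))
sum(is.na(df_readr$string)) # 1499, matching read.csv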

Parsing large XML to dataframe in R

I have large XML files that I want to turn into dataframes for further processing within R and other programs. This is all being done in macOS.
Each monthly XML is around 1gb large, has 150k records and 191 different variables. In the end I might not need the full 191 variables but I'd like to keep them and decide later.
The XML files can be accessed here (scroll to the bottom for the monthly zips, when uncompressed one should look at "dming" XMLs)
I've made some progress, but processing the larger files takes too long (see below).
The XML looks like this:
<ROOT>
<ROWSET_DUASDIA>
<ROW_DUASDIA NUM="1">
<variable1>value</variable1>
...
<variable191>value</variable191>
</ROW_DUASDIA>
...
<ROW_DUASDIA NUM="150236">
<variable1>value</variable1>
...
<variable191>value</variable191>
</ROW_DUASDIA>
</ROWSET_DUASDIA>
</ROOT>
I hope that's clear enough. This is my first time working with an XML.
I've looked at many answers here and in fact managed to get the data into a dataframe using a smaller sample (using a daily XML instead of the monthly ones) and xml2. Here's what I did
library(xml2)
raw <- read_xml(filename)
# Find all records
dua <- xml_find_all(raw,"//ROW_DUASDIA")
# Create empty dataframe
dualen <- length(dua)
varlen <- length(xml_children(dua[[1]]))
df <- data.frame(matrix(NA,nrow=dualen,ncol=varlen))
# For loop to enter the data for each record in each row
for (j in 1:dualen) {
  df[j, ] <- xml_text(xml_children(dua[[j]]), trim=TRUE)
}
# Name columns
colnames(df) <- c(names(as_list(dua[[1]])))
I imagine that's fairly rudimentary but I'm also pretty new to R.
Anyway, this works fine with daily data (4-5k records), but it's probably too inefficient for 150k records, and in fact I waited a couple hours and it hadn't finished. Granted, I would only need to run this code once a month but I would like to improve it nonetheless.
I tried to turn the elements for all records into a list using the as_list function within xml2 so I could continue with plyr, but this also took too long.
Thanks in advance.
While there is no guarantee of better performance on larger XML files, the ("old school") XML package maintains a compact data frame handler, xmlToDataFrame, for flat XML files like yours. Nodes that are missing in one record but present in sibling records result in NA for the corresponding fields.
library(XML)
doc <- xmlParse("/path/to/file.xml")
df <- xmlToDataFrame(doc, nodes=getNodeSet(doc, "//ROW_DUASDIA"))
You can even conceivably download the daily zips, unzip the needed XML, and parse it into a data frame, should the large monthly XMLs pose memory challenges. As an example, the code below extracts the December 2018 daily data into a list of data frames to be row-bound at the end. The process even adds a DDate field. The method is wrapped in a tryCatch due to missing days in the sequence or other URL or zip issues.
dec_urls <- paste0(1201:1231)   # "1201" ... "1231", i.e., MMDD strings for each December day
temp_zip <- "/path/to/temp.zip"
xml_folder <- "/path/to/xml/folder"
xml_process <- function(dt) {
  tryCatch({
    # DOWNLOAD ZIP FROM URL
    url <- paste0("ftp://ftp.aduanas.gub.uy/DUA%20Diarios%20XML/2018/dd2018", dt, ".zip")
    file <- paste0(xml_folder, "/dding2018", dt, ".xml")
    download.file(url, temp_zip)
    unzip(temp_zip, files=paste0("dding2018", dt, ".xml"), exdir=xml_folder)
    unlink(temp_zip)                       # DESTROY TEMP ZIP
    # PARSE XML TO DATA FRAME
    doc <- xmlParse(file)
    df <- transform(xmlToDataFrame(doc, nodes=getNodeSet(doc, "//ROW_DUASDIA")),
                    DDate = as.Date(paste("2018", dt), format="%Y%m%d", origin="1970-01-01"))
    unlink(file)                           # DESTROY TEMP XML
    # RETURN XML DF
    return(df)
  }, error = function(e) NA)
}
# BUILD LIST OF DATA FRAMES
dec_df_list <- lapply(dec_urls, xml_process)
# FILTER OUT "NAs" CAUGHT IN tryCatch
dec_df_list <- Filter(NROW, dec_df_list)
# ROW BIND TO FINAL SINGLE DATA FRAME
dec_final_df <- do.call(rbind, dec_df_list)
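If the daily data frames do not all end up with exactly the same set of columns, do.call(rbind, ...) will fail; a hedged alternative, assuming the data.table package is available, is rbindlist with fill = TRUE (note the result is a data.table rather than a plain data frame):
library(data.table)
# Row-bind the daily data frames, filling columns missing on some days with NA
dec_final_df <- rbindlist(dec_df_list, fill = TRUE)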
Here is a solution that processes the entire document at once as opposed to reading each of the 150,000 records in the loop. This should provide a significant performance boost.
This version can also handle cases where the number of variables per record is different.
library(xml2)
doc<-read_xml('<ROOT>
<ROWSET_DUASDIA>
<ROW_DUASDIA NUM="1">
<variable1>value1</variable1>
<variable191>value2</variable191>
</ROW_DUASDIA>
<ROW_DUASDIA NUM="150236">
<variable1>value3</variable1>
<variable2>value_new</variable2>
<variable191>value4</variable191>
</ROW_DUASDIA>
</ROWSET_DUASDIA>
</ROOT>')
#find all of the nodes/records
nodes<-xml_find_all(doc, ".//ROW_DUASDIA")
#find the record NUM and the number of variables under each record
nodenum<-xml_attr(nodes, "NUM")
nodeslength<-xml_length(nodes)
#find the variable names and values
nodenames<-xml_name(xml_children(nodes))
nodevalues<-trimws(xml_text(xml_children(nodes)))
#create dataframe
df<-data.frame(NUM=rep(nodenum, times=nodeslength),
variable=nodenames, values=nodevalues, stringsAsFactors = FALSE)
#dataframe is in long format.
#Use the function cast, or spread from tidyr, to convert it to wide format
# NUM variable values
# 1 1 variable1 value1
# 2 1 variable191 value2
# 3 150236 variable1 value3
# 4 150236 variable2 value_new
# 5 150236 variable191 value4
#Convert to wide format
library(tidyr)
spread(df, variable, values)
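As a follow-up sketch (assuming tidyr >= 1.0.0 and that numeric conversion is wanted): spread has since been superseded by pivot_wider, and type.convert can coerce the character columns afterwards.
library(tidyr)
wide <- pivot_wider(df, names_from = variable, values_from = values)
# the XML text comes in as character; type.convert guesses numeric columns where possible
wide[] <- lapply(wide, type.convert, as.is = TRUE)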

NA Values and Numeric Datatype

I'm currently using read.xlsx from the xlsx package to read data from an Excel spreadsheet into a data frame. My problem is that the data frame becomes type character because the first row read from the file has NA values. Converting the frame using as.numeric just screws up the formatting. So currently, I run a command like so:
CDF<- read.xlsx(wb, sheet=1, startRow=2,cols=c(2,3))
CDF then equals a data frame with the following values:
NA NA
1 3.1569948515638899E-3 4.2560545366418102E-2
2 4.6179211819458499E-2 0.43699596110695599
3 9.3875238651998996E-2 0.63041471352096301
4 7.1254813513786902E-2 0.76236994294326599
That's fine. But I need to run the command beginning from row 1, not row 2. If I run CDF<- read.xlsx(wb, sheet=1, startRow=1,cols=c(2,3)) then the data frame I get is
jobs.1000output.ratio earn.output.ratio
1 NA NA
2 3.1569948515638899E-3 4.2560545366418102E-2
3 4.6179211819458499E-2 0.43699596110695599
4 9.3875238651998996E-2 0.63041471352096301
5 7.1254813513786902E-2 0.76236994294326599
6 4.2305078854580701E-2 0.61710149253731295
But in this case the datatype of any value I choose from CDF is a string. I need it to be a numeric type. How can I keep the NA values in the data frame while still preserving the overall datatype of the frame? (I want to avoid using as.numeric because I want my data frame to keep being two columns)
Thanks for your help and patience!
Something like this?
CDF <- read.xlsx(wb, sheet=1, startRow=1, cols=c(2,3), colClasses = "numeric")
To follow up on my comments, I have created a function for you:
return_num <- function(dataframe){
  for(i in 1:ncol(dataframe)){
    if(!is.numeric(dataframe[,i])){
      dataframe[,i] <- as.numeric(dataframe[,i])
    }else{
      print(paste(names(dataframe)[i], "is already numeric"))
    }
  }
  return(dataframe)  # return the converted data frame
}
You could then call the function and keep the result:
CDF <- return_num(CDF)
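A loop-free variant of the same idea, as a minimal sketch (it assumes the non-numeric columns are character, as they are when read.xlsx returns strings):
# Convert every non-numeric column of CDF to numeric in one pass,
# keeping the two-column data frame structure intact
CDF[] <- lapply(CDF, function(x) if (is.numeric(x)) x else as.numeric(x))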

How to convert rows

I have uploaded a data set called "Obtained Dataset". It usually has 16 rows of numeric and character variable names; some other files of a similar nature have fewer than 16. Each of these rows is the header of one variable, and the data itself starts from the 17th row onwards in this specific file.
Obtained dataset & Required Dataset
For the data that follows, the 1st column is the x-axis, the 2nd column is the y-axis, and the 3rd column is depth (these are standard for all the files in the database); the 4th column is GR 1 LIN, the 5th column is CAL 1 LIN, and so forth, as given in the first 16 rows of the file.
Now I want R code which can convert it into the format shown in the required data set. Also, if a different data set has fewer than 16 lines of names, say GR 1 LIN and RHOB 1 LIN are missing, I want it to still create those columns with NA entries for 1:nrow.
Currently I have managed to export the file to Excel, manually clean the data, rename the columns correspondingly, save it as csv and then read.csv("filename"), etc., but it is simply not possible to do this for 400 files.
Any advice how to proceed will be of great help.
I have noticed that you have probably posted this question again, in a different format. This is a public forum, and people are happy to help. However, it's your job to simplify the life of others, and you are requested to put in some effort. Here is some advice on that.
Having said that, here is some code I have written to help you out.
Step0: Creating your first data set:
sink("test.txt") # This will `sink` all the output to the file "test.txt"
# Let's start with some dummy data
cat("1\n")
cat("DOO\n")
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
# Now a 10 x 16 dummy data matrix:
cat(paste(apply(matrix(sample(160),10),1,paste,collapse = "\t"),collapse = "\n"))
cat("\n")
sink() # This will stop `sink`ing.
I have created some dummy data in the first 6 lines, followed by a 10 x 16 data matrix.
Note: In principle you should have provided something like this, or a copy of your dataset. This would help other people help you.
Step1: Now we need to read the file, and we want to skip the first 6 rows with undesired info:
(temp <- read.table(file="test.txt", sep ="\t", skip = 6))
Step2: Data clean up:
We need a vector with names of the 16 columns in our data:
namesVec <- letters[1:16]
Now we assign these names to our data.frame:
names(temp) <- namesVec
temp
Looks good!
Step3: Save the data:
write.table(temp,file="test-clean.txt",row.names = FALSE,sep = "\t",quote = FALSE)
Check whether the solution is working. If it is, then move to the next step; otherwise make the necessary changes.
Step4: Automating:
First we need to create a list of all the 400 files.
The easiest way (also the easiest to explain) is to copy the 400 files into one directory and then set that as the working directory (using setwd).
Now first we'll create a vector with all file names:
fileNameList <- dir()
Once this is done, we'll need a function to repeat steps 1 through 3:
convertFiles <- function(fileName) {
  temp <- read.table(file=fileName, sep ="\t", skip = 6)
  names(temp) <- namesVec
  # write each cleaned file under its own name, e.g. "clean-test.txt" for "test.txt"
  write.table(temp, file=paste("clean", fileName, sep="-"),
              row.names = FALSE, sep = "\t", quote = FALSE)
}
Now we simply need to apply this function on all the files we have:
sapply(fileNameList,convertFiles)
Hope this helps!
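The question also asks for NA columns when some of the 16 variables are absent from a file. Here is a minimal sketch of one way to handle that; it assumes you have already named the columns of temp with the variable names actually present in that file (pad_missing is a hypothetical helper, and namesVec is the full 16-name vector from Step 2):
pad_missing <- function(temp, fullNames = namesVec) {
  # add an all-NA column for every expected name the file does not have,
  # then put the columns into the standard order
  missing <- setdiff(fullNames, names(temp))
  for (nm in missing) temp[[nm]] <- NA
  temp[fullNames]
}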

Making data out of one point in multiple csv files in R

I'm trying to create a variable out of a single point in 5 different data tables.
I.e. I have data on mental disorders for each year in separate CSV files. How do I track just one variable (e.g. Autism) in each file and put it into one variable?
Here is what I have so far:
d2000 <- read.table("C:/AL00.csv")
d2001 <- read.table("C:/AL01.csv")
d2002 <- read.table("C:/AL02.csv")
d2003 <- read.table("C:/AL03.csv")
rownames(d2000) <- d2000[,3]
rownames(d2001) <- d2001[,3]
rownames(d2002) <- d2002[,3]
rownames(d2003) <- d2003[,3]
ASD = c(d2000["Autism","Total"],d2001["Autism","Total"],d2002["Autism","Total"])
This isn't working. I tried typing in just one of the data points:
>d2000["Autism","Total"]
[1] 2,763
Levels: 1,075 1,480 2,763
It outputs the correct number, but what are these "Levels"? Are they my problem, and if so, how would I fix them?
I would do something like this:
ll <- lapply(list.files(pattern="AL[0-9]+.*csv",full.names=TRUE),
function(x) read.table(x, stringsAsFactors=FALSE))
res <- do.call(rbind,ll)[,'Autism']
This will give you the column Autism in one vector. (The "Levels" you saw mean the column was read in as a factor, because the comma-formatted numbers are not numeric; stringsAsFactors=FALSE avoids that.) Then, to convert it to numeric, you can use a regular expression:
as.numeric(gsub(',','',res))
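For example, applied to the comma-formatted values shown in the question, the conversion gives plain numbers:
as.numeric(gsub(',', '', c('2,763', '1,075', '1,480')))
# [1] 2763 1075 1480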
