NA Values and Numeric Datatype in R

I'm currently using read.xlsx from the xlsx package to read data from an Excel spreadsheet into a data frame. My problem is that the data frame comes back as type character because the first row read from the file has NA values, and converting the frame with as.numeric ruins the formatting. So currently I run a command like this:
CDF<- read.xlsx(wb, sheet=1, startRow=2,cols=c(2,3))
CDF is then a data frame with the following values:
NA NA
1 3.1569948515638899E-3 4.2560545366418102E-2
2 4.6179211819458499E-2 0.43699596110695599
3 9.3875238651998996E-2 0.63041471352096301
4 7.1254813513786902E-2 0.76236994294326599
That's fine. But I need to run the command beginning from row 1, not row 2. If I run CDF <- read.xlsx(wb, sheet=1, startRow=1, cols=c(2,3)) then the data frame I get is
jobs.1000output.ratio earn.output.ratio
1 NA NA
2 3.1569948515638899E-3 4.2560545366418102E-2
3 4.6179211819458499E-2 0.43699596110695599
4 9.3875238651998996E-2 0.63041471352096301
5 7.1254813513786902E-2 0.76236994294326599
6 4.2305078854580701E-2 0.61710149253731295
But in this case every value I pull from CDF is of type character. I need it to be numeric. How can I keep the NA values in the data frame while preserving the numeric datatype of the frame? (I want to avoid as.numeric because I want my data frame to remain two columns.)
Thanks for your help and patience!

Something like this?
CDF <- read.xlsx(wb, sheet=1, startRow=1, cols=c(2,3), colClasses = "numeric")
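If your read.xlsx variant doesn't accept a colClasses argument (support differs between the xlsx and openxlsx packages, and your call signature suggests the latter), coercing the columns in place after reading should also work. A minimal sketch:
# Coerce each column to numeric while keeping the two-column
# data frame structure (unlike as.numeric(CDF), which collapses it)
CDF[] <- lapply(CDF, as.numeric)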

To follow up on my comments, I have created a function for you:
return_num <- function(dataframe) {
  for (i in 1:ncol(dataframe)) {
    if (!is.numeric(dataframe[, i])) {
      # Coerce character columns to numeric in place
      dataframe[, i] <- as.numeric(dataframe[, i])
    } else {
      print(paste(names(dataframe)[i], "is already numeric"))
    }
  }
  # Return the modified copy so the caller can reassign it
  dataframe
}
You could then call the function, reassigning the result (R passes the data frame by value):
CDF <- return_num(CDF)
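As a one-liner alternative, base R's type.convert() guesses each column's natural type, so numeric-looking strings become numeric and "NA" strings become real NA values (a sketch; as.is = TRUE avoids factor conversion):
# Re-type every column of CDF in place
CDF[] <- lapply(CDF, type.convert, as.is = TRUE)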

Related

read_excel: Import empty cells as missing values (na)

I'm using read_excel to import data into R. Sometimes a specific cell in the Excel file is empty. I want R to record this as NA rather than ignore the cell; currently, R doesn't import anything at all.
I've read through the R documentation on read_excel and found that, by default, read_excel treats blank cells as missing data. But instead of importing nothing, I'd like to have the actual information that these data are missing. I couldn't find information on how to do that.
In the Excel file, A1 in sheet 1 does not contain any data.
x <- read_excel("file.xlsx", sheet = 1, range="A1", col_names = FALSE)
Expected result: x is 1 obs. of 1 variable, with value NA.
Actual result: x is 0 obs. of 0 variables.
A workaround that could work in the above example is to use:
x <- as.numeric(read_excel("file.xlsx", sheet = 1, range="A1", col_names = FALSE))
It is not so elegant: R translates the 0x0 tibble into NA. (Note: it will also change any strings into NA.) I also don't think it works for importing more than one cell.
If anybody finds a more elegant and more generally applicable solution, that would be helpful.
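One thing that may be worth trying (untested here, so treat it as an assumption): readxl lets you force the column type via col_types, which should make the blank cell come through as NA instead of being dropped.
# Force the cell to be read as numeric; a blank cell then becomes NA
x <- read_excel("file.xlsx", sheet = 1, range = "A1",
                col_names = FALSE, col_types = "numeric")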

NULL values, regression / correlation switch

I have a dataset with, let's say, 2 variables. I want to do some regression testing, but quite a few of the numeric observations are "NULL". I want to make use of these observations, but I don't want to convert them to a specific number, i.e. 99999.
I keep trying all the different ways I find after googling, and nothing works.
Benny2 <- read_excel("C:/Users/EH9508/Desktop/Benny2.xlsx")
I have two variables, "Days" and "Amount"; both contain numeric values and "NULL".
Any help would be appreciated.
You can convert the file/sheet to csv from Excel (save as > csv) and then:
mydata <- read.csv("path/to/file.csv")
If you don't have access to Excel, then this is how it goes with the xlsx library:
library(xlsx)
mydata <- read.xlsx("path/to/file.xlsx", sheetIndex = 1)
If you put the csv/xlsx file in the same folder as your R script, you can type the file name without the path, as in read.xlsx("file.xlsx").
If you already have your data in R and are wondering how to get the NULL converted to a given value, try this:
mydata <- matrix(rnorm(10), 5, 2)  # Your data
mydata[2,1] <- NA                  # Introduce some NAs
mydata[5,2] <- NA
mydata[is.na(mydata)] <- 99999     # Replace NA values with 99999
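Since the offending cells apparently contain the literal text "NULL", another option is to tell read_excel to treat that string as missing at import time (a sketch; the na argument is part of readxl's read_excel):
# Cells containing the string "NULL" are read as NA, so the
# Days and Amount columns can stay numeric
Benny2 <- read_excel("C:/Users/EH9508/Desktop/Benny2.xlsx", na = "NULL")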

cbind column in several csv files in r

I am new to R and don't know exactly how to write for loops.
Here is my problem: I have about 160 csv files in a folder, each with a specific name. Each file name contains a pattern "HL.X.Y.Z.", where X = Region, Y = cluster, and Z = point. What I need to do is read all these csv files, extract the strings from the names, create a column with those strings for each csv file, and bind all the files into a single data frame.
Here is some code showing what I am trying to do:
setwd("C:/Users/worddirect")
files.names<-list.files(getwd(),pattern="*.csv")
files.names
head(files.names)
>[1] "HL.1.1.1.2F31CA.150722.csv" "HL.1.1.2.2F316A.150722.csv"
[3] "HL.1.1.3.2F3274.150722.csv" "HL.1.1.4.2F3438.csv"
[5] "HL.1.10.1.3062CD.150722.csv" "HL.1.10.2.2F343D.150722.csv"
Reading all the files like this works just fine:
files.names
for (i in 1:length(files.names)) {
  assign(files.names[i], read.csv(files.names[i], skip = 18))
}
Adding an extra column for an individual csv files like this works fine:
test<-cbind("Region"=rep(substring(files.names[1],4,4),times=nrow(HL.1.1.1.2F31CA.150722.csv)),
"Cluster"=rep(substring(files.names[1],6,6),times=nrow(HL.1.1.1.2F31CA.150722.csv)),
"Point"=rep(substring(files.names[1],8,8),times=nrow(HL.1.1.1.2F31CA.150722.csv)),
HL.1.1.1.2F31CA.150722.csv)
head(test)
Region Cluster Point Date.Time Unit Value
1 1 1 1 6/2/14 11:00:01 PM C 24.111
2 1 1 1 6/3/14 1:30:01 AM C 21.610
3 1 1 1 6/3/14 4:00:01 AM C 20.609
However, a for loop version of the above doesn't work.
files.names
for (i in 1:length(files.names)) {
assign(files.names[i], read.csv(files.names[i],skip=18))
cbind("Region"=rep(substring(files.names[i],4,4),times=nrow(i)),
"Cluster"=rep(substring(files.names[i],6,6),times=nrow(i)),
"Point"=rep(substring(files.names[i],8,8),times=nrow(i)),
i)
}
Error in rep(substring(files.names[i], 4, 4), times = nrow(i)) :
invalid 'times' argument
The final step would be to bind all the csv files in a single data frame.
I appreciate any suggestions. If there is a simpler way to do what I did, I'd appreciate that too!
There are many ways to solve a problem in R. A more R-like way to solve this one is with the apply() family of functions, which act like an implied for loop, applying an operation to each item passed in via a function argument.
Another important feature of R is the anonymous function. Combining lapply() with an anonymous function, we can solve your multi-file read problem.
setwd("C:/Users/worddirect")
files.names<-list.files(getwd(),pattern="*.csv")
# read csv files and return them as items in a list()
theList <- lapply(files.names, function(x) {
  theData <- read.csv(x, skip = 18)
  # bind the region, cluster, and point data and return
  cbind("Region"  = rep(substring(x, 4, 4), times = nrow(theData)),
        "Cluster" = rep(substring(x, 6, 6), times = nrow(theData)),
        "Point"   = rep(substring(x, 8, 8), times = nrow(theData)),
        theData)
})
# rbind the data frames in theList into a single data frame
theResult <- do.call(rbind,theList)
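As a side note, dplyr::bind_rows(theList) is a common drop-in alternative to do.call(rbind, theList) (assuming you have dplyr installed); it also tolerates files whose columns don't match exactly, filling the gaps with NA:
library(dplyr)
# Bind all data frames in the list into one
theResult <- bind_rows(theList)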
regards,
Len
i is a number, which doesn't have an nrow property; you need the dimensions of the data frame you just read, not of the loop index. You can use the following code:
result <- data.frame()
for (i in 1:length(files.names)) {
  # Read the current file, then prepend the name-derived columns
  current <- read.csv(files.names[i], skip = 18)
  result <- rbind(result,
                  cbind("Region"  = rep(substring(files.names[i], 4, 4), times = nrow(current)),
                        "Cluster" = rep(substring(files.names[i], 6, 6), times = nrow(current)),
                        "Point"   = rep(substring(files.names[i], 8, 8), times = nrow(current)),
                        current))
}

How to remove empty columns in transaction data read with the arules package?

I have a dataset in the basket data format. I read that dataset into R using a package called arules, which has a built-in function for reading transactions. Following is the code I used:
trans = read.transactions("C:/Users/HARI/Desktop/Graph_mining/transactional_data_v3.csv", format = "basket", sep=",",rm.duplicates=TRUE)
inspect(trans[1:5])
items
1 {,
ANTIVERT,
SOFTCLIX}
2 {,
CEFADROXIL,
ESTROGEN}
3 {,
BENZAMYCIN,
BETAMETH,
KEFLEX,
PERCOCET}
4 {,
ACCUTANE(RXPAK;10X10),
BENZAMYCIN}
5 {,
ALBUTEROL,
BUTISOLSODIUM,
CLARITIN,
NASACORTAQ}
As you can see, when I use inspect(trans) it shows transactions with an empty column in each. My question is how can I remove those empty columns?
For a full dput of the trans object, please see this link.
I think I've found a solution to your problem. I took your csv file, opened it in Excel, and replaced all empty cells with NA. Then I pasted the whole thing into LibreOffice Calc and saved it back to csv, specifying that double quotes should be used for all cells (oddly enough, Excel won't do that except with a VBA macro; you could read the file directly in LibreOffice instead of Excel, but replacing empty cells with NAs would take forever). Then:
trans <- read.table("d:/downloads/transactional_data_2.csv", sep=",", stringsAsFactors = TRUE, na.strings="NA", header=TRUE)
trans2 <- as(trans, "transactions")
inspect(trans2[1:5])
RESULTS
inspect(trans2[1:5])
items transactionID
1 {X1=SOFTCLIX,
X2=ANTIVERT} 1
2 {X1=ESTROGEN,
X2=CEFADROXIL} 2
3 {X1=KEFLEX,
X2=BETAMETH,
X3=PERCOCET,
X4=BENZAMYCIN} 3
4 {X1=BENZAMYCIN,
X2=ACCUTANE(RXPAK;10X10)} 4
5 {X1=CLARITIN,
X2=ALBUTEROL,
X3=NASACORTAQ,
X4=BUTISOLSODIUM} 5
I think those are the results you're looking for...?
I'm not super familiar with the arules package. My best guess is to read the data in with read.csv and then convert it to the transactions format, instead of using the provided read.transactions:
tran2 <- read.csv("downloads/transactional_data.csv")
tran3 <- as(tran2, "transactions")
EDIT: I believe the blanks in your data are not being read in correctly; additionally, there are duplicates which should be filtered out. The code below deals with both. You will need the reshape2 package.
library(reshape2)
library(arules)  # needed for the as(..., "transactions") coercion
trans2 <- read.csv("downloads/transactional_data.csv", na.strings = "", stringsAsFactors = FALSE)
trans2$id <- seq(nrow(trans2))
# Reshape to long format: one row per (transaction id, item)
t2.long <- melt(trans2, id.vars = "id")
t2.long$variable <- NULL
# Split items by transaction, drop duplicates, coerce to transactions
t3 <- as(lapply(split(t2.long$value, t2.long$id), unique), "transactions")
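Alternatively, if the single empty item is the only problem, you may be able to drop it from the transactions object directly: transactions support [rows, items] subsetting through their itemMatrix representation. A sketch, assuming the blank item's label is the empty string:
library(arules)
# Keep only items whose label is non-empty
trans_clean <- trans[, itemLabels(trans) != ""]
inspect(trans_clean[1:5])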

Learning for loops in R and can't pull out a specific variable

I am having trouble figuring out for loops in R after learning Python for a while. What I want to do is pull out $nitrate or $sulfate from the list of data frames this code returns:
getpollutant <- function(id = 1:332, directory, pollutant) {
  # Build zero-padded file paths such as "specdata/001.csv"
  data <- c()
  for (i in id) {
    data[i] <- paste(directory, "/", formatC(i, width = 3, flag = 0), ".csv", sep = "")
  }
  # Read each file into a list of data frames
  df <- list()
  for (d in 1:length(data)) {
    df[[d]] <- read.csv(data[d])
  }
  df
}
I haven't included the for loop for the pollutant yet; I've tried many different approaches but can't get it to work quite right. With the code above I can put in getpollutant(1:10, "specdata") and it will give me all the csv files from the specdata directory with labels 001 through 010. It spits out each csv file in separate chunks with headers of the format [[i]]$columnname and the contents of the column listed below. What I want is to pull out a specific column name (the pollutant) and return the contents of that column from every csv file. I have read through the help pages and just can't seem to get my formatting right...
@RomanLuštrik
I don't know if this is what you're looking for, but here's a sample output if I put in getpollutant(1, "specdata"):
[[1]]
[[1]]$Date
[1] 2003-01-01 2003-01-02 2003-01-03
[[1]]$sulfate
[1] NA NA NA NA NA NA 7.210 NA NA NA 1.300
[[1]]$nitrate
[1] NA NA NA .474 NA NA NA .964 NA NA NA
Obviously this is a very small version of the output, but basically it takes the CSV files in the specified range id and prints them out like this.
Do you only want to read in a certain column from the files? And do you know which column it is by number (e.g. the 3rd column)? In that case you can use the colClasses argument to read.table/read.csv to specify reading in only the given column.
If you don't know which column it is ahead of time, then you may need to read in the entire file and only return the given column. In that case you probably want to use [[]] subsetting instead of $ subsetting.
You can also make your code more compact and possibly more efficient by using sprintf and lapply or sapply.
Consider this code:
lapply(1:332, function(id) {
  read.csv(sprintf("%s/%03d.csv", directory, id))
})
or
sapply(list.files(directory, pattern = '\\.csv$', full.names = TRUE),
       function(nm) read.csv(nm)[[pollutant]])
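Putting those pieces together, a complete sketch of getpollutant that returns just the requested column from each file might look like this (it assumes every file actually contains the pollutant column):
getpollutant <- function(id = 1:332, directory, pollutant) {
  lapply(id, function(i) {
    # Build the zero-padded path, read the file, keep one column
    df <- read.csv(sprintf("%s/%03d.csv", directory, i))
    df[[pollutant]]
  })
}
# Example: nitrate readings from files 001.csv through 010.csv
nitrate <- getpollutant(1:10, "specdata", "nitrate")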
