I have an Excel spreadsheet with NA values. What is the best way to copy the data into R? I usually use data=read.delim("clipboard"), but because of those missing values I keep getting this error:
Error in if (del == 0 && to == 0) return(to) :
missing value where TRUE/FALSE needed
What are the possible ways to get rid of this error? I tried putting zeros in place of the NA values, but that interferes with what the code is doing.
Here is the link to the code I'm using: R programming fixing error. It was really helpful for my data problems.
I was going to post the whole data set, but there is a 30,000-character limit.
You need to set the option fill to TRUE. When rows have unequal length, this pads the short ones with NA fields.
read.table(fileName, header = TRUE, fill = TRUE)
Here fileName is the path to your file, for example fileName <- 'c:/temp/myfile.csv'.
This should also work with read.delim, which is a wrapper around read.table. You can also give read.table a string, but then you set the text argument, not the file one. For example:
read.table(text = ' Time Speed Time Speed
0.8 2.9 0.3 2.7
1.3 2.8 0.9 2.7
1.7 2.3 2.5 3.1
2.0 0.6
2.3 1.7 13.6 3.3
3.0 1.4 15.1 3.5
3.5 1.3 17.5 3.3', header = TRUE, fill = TRUE)
Time Speed Time.1 Speed.1
1 0.8 2.9 0.3 2.7
2 1.3 2.8 0.9 2.7
3 1.7 2.3 2.5 3.1
4 2.0 0.6 NA NA
5 2.3 1.7 13.6 3.3
6 3.0 1.4 15.1 3.5
7 3.5 1.3 17.5 3.3
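For the clipboard workflow in the original question, the same arguments pass through read.delim to read.table. A minimal sketch (assuming the copied block has a header row and missing cells are either blank or literal NA):
data <- read.delim("clipboard", header = TRUE, fill = TRUE,
                   na.strings = c("NA", ""))
# fill = TRUE is already the default for read.delim; shown here for clarity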
Before my question, here is a little background.
I am creating a general purpose data shaping and charting library for plotting survey data of a particular format.
As part of my scripts, I use the subset function on my data frame. I keep a parameter file from which I pass the subsetting criteria into my functions (so I don't need to edit my main library directly). I do this as follows:
subset_criteria <- expression(variable1 != "" & variable2 == TRUE)
(where variable1 and variable2 are columns in my data frame, for example).
Then in my function, I call this as follows:
my.subset <- subset(my.data, eval(subset_criteria))
This part works exactly as I want. But now I want to augment the subsetting criteria inside the function, based on other calculations that can only be performed inside the function. So I am trying to find a way to combine these subsetting expressions.
Imagine inside my function I create some new column in my data frame automatically, and then I want to add a condition to my subsetting that says that this additional column must be TRUE.
Essentially, I do the following:
my.data$newcolumn <- with(my.data, ifelse(...some condition..., TRUE, FALSE))
Then I want my subsetting to end up being:
my.subset <- subset(my.data, eval(subset_criteria & newcolumn == TRUE))
But simply doing what I list above does not seem to be valid; I get the wrong result. So I'm looking for a way of combining these expressions using expression and eval so that I end up with the combination of all the conditions.
Thanks for any pointers. It would be great if I can do this without having to rewrite how I do all my expressions, but I understand that might be what is needed...
Bob
You should probably avoid two things: using subset in a non-interactive setting (see the warning in its help page) and eval(parse()). Here we go.
You can change the expression into a string and append to it whatever you want. The trick is to convert the string back to an expression, which is where the aforementioned parse comes in.
sub1 <- expression(Species == "setosa")
subset(iris, eval(sub1))
sub2 <- paste(sub1, '&', 'Petal.Width > 0.2')
subset(iris, eval(parse(text = sub2))) # your case
> subset(iris, eval(parse(text = sub2)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
18 5.1 3.5 1.4 0.3 setosa
19 5.7 3.8 1.7 0.3 setosa
20 5.1 3.8 1.5 0.3 setosa
22 5.1 3.7 1.5 0.4 setosa
24 5.1 3.3 1.7 0.5 setosa
27 5.0 3.4 1.6 0.4 setosa
32 5.4 3.4 1.5 0.4 setosa
41 5.0 3.5 1.3 0.3 setosa
42 4.5 2.3 1.3 0.3 setosa
44 5.0 3.5 1.6 0.6 setosa
45 5.1 3.8 1.9 0.4 setosa
46 4.8 3.0 1.4 0.3 setosa
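If you prefer to keep everything as language objects instead of round-tripping through strings, a hedged alternative on the same iris example is to splice the stored expression into a new call with bquote():
sub3 <- bquote(.(sub1[[1]]) & Petal.Width > 0.2)  # sub1[[1]] is the bare call inside the expression
subset(iris, eval(sub3))                          # should select the same rows as the parse() version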
I'm creating an R package with several files in /data. The way to load data files in an R package is to use system.file():
system.file(..., package = "base", lib.loc = NULL, mustWork = FALSE)
The file in /data I would like to load into an R data.table has the extension *.txt.gz, my_file.txt.gz. How do I load this into a data.table via read.table() or fread()?
Within the R script, I tried:
#' @import data.table
#' @export
my_function = function(){
my_table = read.table(system.file("data", "my_file.txt.gz", package = "FusionVizR"), header=TRUE)
}
This leads to an error via devtools document():
Error in read.table(system.file("data", "my_file.txt.gz", package = "FusionVizR"), header = TRUE) (from script1.R#7) :
no lines available in input
In addition: Warning message:
In file(file, "rt") :
file("") only supports open = "w+" and open = "w+b": using the former
I appear to get the same issue via fread()
#' @import data.table
#' @export
my_function = function(){
my_table = fread(system.file("data", "my_file.txt.gz", package = "FusionVizR"), header=TRUE)
}
This outputs the error:
Input is either empty or fully whitespace after the skip or autostart. Run again with verbose=TRUE.
So it appears that system.file() doesn't give me a path to the file that I can load into an R data.table. How do I do this?
Do yourself a HUGE favour and study fread() closely: it is one of the very best features in data.table. I have examples (at work) of reading from a pipe of other commands, reading compressed data, and more.
Here is a simple mock example:
R> write.csv(iris, file="/tmp/demo.csv")
R> system("gzip /tmp/demo.csv") # to be very plain
R> fread("zcat /tmp/demo.csv.gz")
V1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 1 5.1 3.5 1.4 0.2 setosa
2: 2 4.9 3.0 1.4 0.2 setosa
3: 3 4.7 3.2 1.3 0.2 setosa
4: 4 4.6 3.1 1.5 0.2 setosa
5: 5 5.0 3.6 1.4 0.2 setosa
---
146: 146 6.7 3.0 5.2 2.3 virginica
147: 147 6.3 2.5 5.0 1.9 virginica
148: 148 6.5 3.0 5.2 2.0 virginica
149: 149 6.2 3.4 5.4 2.3 virginica
150: 150 5.9 3.0 5.1 1.8 virginica
R>
It seems in my haste I wrote one column too many (the row names), but you get the idea.
Now, you don't even need fread (though it is still more powerful than the alternatives):
R> head(read.csv(file="/tmp/demo.csv.gz"))
X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1 5.1 3.5 1.4 0.2 setosa
2 2 4.9 3.0 1.4 0.2 setosa
3 3 4.7 3.2 1.3 0.2 setosa
4 4 4.6 3.1 1.5 0.2 setosa
5 5 5.0 3.6 1.4 0.2 setosa
6 6 5.4 3.9 1.7 0.4 setosa
R>
R figured out by itself that it needed to decompress the file.
Edit: I was editing this question earlier when it was deleted under me, which is about as de-motivating as it gets. In a nutshell:
system.file() works, e.g. file <- system.file("rawdata", "population.csv", package="gunsales") does contain the complete path as the file exists: "/usr/local/lib/R/site-library/gunsales/rawdata/population.csv". But this is easy to mess up. (Needless to say I do have the package and the file.)
Look into the data/ directory and what Writing R Extensions says about it. It is a good mechanism.
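For raw (non-.rda) files, the usual layout is to put them under inst/extdata/ in the package source rather than data/, which R reserves for objects loaded via data(). A hedged sketch, assuming my_file.txt.gz is moved there:
path <- system.file("extdata", "my_file.txt.gz",
                    package = "FusionVizR", mustWork = TRUE)
my_table <- data.table::fread(path)  # recent data.table reads .gz directly (may need R.utils);
                                     # otherwise use read.table(gzfile(path), header = TRUE)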
I have a 5GB csv with 2 million rows. The header is a row of comma-separated strings, and each data row consists of comma-separated doubles with no missing or corrupted values. It is rectangular.
My objective is to read a random 10% (with or without replacement, doesn't matter) of the rows into RAM as fast as possible. An example of a slow solution (but faster than read.csv) is to read in the whole matrix with fread and then keep a random 10% of the rows.
require(data.table)
X <- data.matrix(fread('/home/user/test.csv')) # reads the full data.matrix
X <- X[sample(1:nrow(X))[1:round(nrow(X)/10)],] # keep a random 10% of the rows
However I'm looking for the fastest possible solution (this is slow because I need to read the whole thing first, then trim it after).
The solution deserving of a bounty will give system.time() estimates of different alternatives.
Other:
I am using Linux
I don't need exactly 10% of the rows. Just approximately 10%.
I think this should work pretty quickly, but let me know, since I have not tried it with big data yet.
write.csv(iris,"iris.csv")
fread("shuf -n 5 iris.csv")
V1 V2 V3 V4 V5 V6
1: 37 5.5 3.5 1.3 0.2 setosa
2: 88 6.3 2.3 4.4 1.3 versicolor
3: 84 6.0 2.7 5.1 1.6 versicolor
4: 125 6.7 3.3 5.7 2.1 virginica
5: 114 5.7 2.5 5.0 2.0 virginica
This takes a random sample of N=5 for the iris dataset.
To avoid the chance of sampling the header row, this might be a useful modification:
fread("tail -n+2 iris.csv | shuf -n 5", header=FALSE)
Here's a file with 100000 lines in it like this:
"","a","b","c"
"1",0.825049088569358,0.556148858508095,0.591679535107687
"2",0.161556158447638,0.250450366642326,0.575034103123471
"3",0.676798462402076,0.0854280597995967,0.842135070590302
"4",0.650981109589338,0.204736212035641,0.456373531138524
"5",0.51552157686092,0.420454133534804,0.12279288447462
$ wc -l d.csv
100001 d.csv
So that's 100000 lines plus a header. We want to keep the header and sample each line if a random number from 0 to 1 is greater than 0.9.
$ awk 'NR==1 {print} ; rand()>.9 {print}' < d.csv >sample.csv
check:
$ head sample.csv
"","a","b","c"
"12",0.732729186303914,0.744814146542922,0.199768838472664
"35",0.00979996216483414,0.633388962829486,0.364802648313344
"36",0.927218825090677,0.730419414117932,0.522808947600424
"42",0.383301998255774,0.349473554175347,0.311060158303007
and it has 10027 lines:
$ wc -l sample.csv
10027 sample.csv
This took 0.033s of real time on my 4-year-old box; HD speed is probably the limiting factor here. It should scale linearly, since the file is dealt with strictly line by line.
You then read in sample.csv using read.csv or fread as desired:
> s = fread("sample.csv")
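If you don't want the intermediate sample.csv on disk, the same awk filter can be piped straight into fread (a hedged variant of the command above; the combined condition also avoids the small chance of printing the header line twice):
s <- fread("awk 'NR==1 || rand()>.9' d.csv")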
You could use sqldf::read.csv.sql and an SQL command to pull the data in:
library(sqldf)
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE) # write a csv file to test with
read.csv.sql("iris.csv","SELECT * FROM file ORDER BY RANDOM() LIMIT 10")
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1 6.3 2.8 5.1 1.5 virginica
2 4.6 3.1 1.5 0.2 setosa
3 5.4 3.9 1.7 0.4 setosa
4 4.9 3.0 1.4 0.2 setosa
5 5.9 3.0 4.2 1.5 versicolor
6 6.6 2.9 4.6 1.3 versicolor
7 4.3 3.0 1.1 0.1 setosa
8 4.8 3.4 1.9 0.2 setosa
9 6.7 3.3 5.7 2.5 virginica
10 5.9 3.2 4.8 1.8 versicolor
It doesn't calculate the 10% for you, but you can choose the absolute limit of rows to return.
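If you do want approximately 10% rather than a fixed number of rows, a hedged sketch is to count the rows first and compute the LIMIT from that:
n <- read.csv.sql("iris.csv", "SELECT COUNT(*) AS n FROM file")$n
read.csv.sql("iris.csv",
             paste("SELECT * FROM file ORDER BY RANDOM() LIMIT", round(n / 10)))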
I have a dataset with many missing values. Some of the missing values are NAs, some are Nulls, and others are blank spaces of varying lengths. I would like to use the fread function in R to read all of these values as missing.
Here is an example:
# Create some fake data
iris <- data.table(iris)[1:5]
#Add missing values non-uniformly
iris[1,Species:=' ']
iris[2,Species:=' ']
iris[3,Species:='NULL']
#Write to csv and read back in using fread
write.csv(iris,file="iris.csv")
fread("iris.csv",na.strings=c("NULL"," "))
V1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 1 5.1 3.5 1.4 0.2
2: 2 4.9 3.0 1.4 0.2 NA
3: 3 4.7 3.2 1.3 0.2 NA
4: 4 4.6 3.1 1.5 0.2 setosa
5: 5 5.0 3.6 1.4 0.2 setosa
From the above example, we see that I am unable to account for the first missing value, since it contains multiple blank spaces. Does anyone know of a way to account for this?
Thanks so much for the wonderful answer from @eddi:
fread("sed 's/ *//g' iris.csv",na.strings=c("",NA,"NULL"))
I asked a question like this before, but I decided to simplify my data format because I'm very new to R and didn't understand what was going on. Here is the link to that question: How to handle more than multiple sets of data in R programming?
I edited what my data should look like and decided to leave it in this format:
X1.0 X X2.0 X.1
0.9 0.9 0.2 1.2
1.3 1.4 0.8 1.4
As you can see, I have four columns of data. The real data I'm dealing with has up to 2000 data points. Columns "X1.0" and "X2.0" are time columns, so what I want is the average of "X" and "X.1" every 100 seconds, based on my two time columns "X1.0" and "X2.0". I can do it using this command:
cuts <- cut(data$X1.0, breaks=seq(0, max(data$X1.0)+400, 400))
by(data$X, cuts, mean)
But this only gives me the averages for one set of data, namely "X1.0" and "X". How can I do this so that I get averages from more than one data set? I also want to stop getting this kind of output:
cuts: (0,400]
[1] 0.7
------------------------------------------------------------
cuts: (400,800]
[1] 0.805
Note that the output above was done every 400 s. What I really want is a list of those cuts, i.e. the averages at the different intervals. Please help. I just used data=read.delim("clipboard") to get my data into the program.
It is a little bit confusing what output you want to get.
First I change the column names, but this is optional:
colnames(dat) <- c('t1','v1','t2','v2')
Then I use ave, which is like by but with better output. I use a matrix trick to index the columns:
matrix(1:ncol(dat), ncol = 2) ## matrix column 1 holds dat columns 1 and 2, column 2 holds columns 3 and 4
[,1] [,2]
[1,] 1 3
[2,] 2 4
Then I use this matrix with apply. Here is the entire solution:
cbind(dat,
apply(matrix(1:ncol(dat),ncol=2),2,
function(x,by=10){ ## by 10 seconds! you can replace this
## with 100 or 400 in you real data
t.col <- dat[,x][,1] ## txxx
v.col <- dat[,x][,2] ## vxxx
ave(v.col,cut(t.col,
breaks=seq(0, max(t.col),by)),
FUN=mean)})
)
EDIT: correct the cut and simplify the code:
cbind(dat,
apply(matrix(1:ncol(dat),ncol=2),2,
function(x,by=10)ave(dat[,x][,1], dat[,x][,1] %/% by)))
X1.0 X X2.0 X.1 1 2
1 0.9 0.9 0.2 1.2 3.3000 3.991667
2 1.3 1.4 0.8 1.4 3.3000 3.991667
3 2.0 1.7 1.6 1.1 3.3000 3.991667
4 2.6 1.9 2.2 1.6 3.3000 3.991667
5 9.7 1.0 2.8 1.3 3.3000 3.991667
6 10.7 0.8 3.5 1.1 12.8375 3.991667
7 11.6 1.5 4.1 1.8 12.8375 3.991667
8 12.1 1.4 4.7 1.2 12.8375 3.991667
9 12.6 1.8 5.4 1.2 12.8375 3.991667
10 13.2 2.1 6.3 1.3 12.8375 3.991667
11 13.7 1.6 6.9 1.1 12.8375 3.991667
12 14.2 2.2 9.4 1.3 12.8375 3.991667
13 14.6 1.8 10.0 1.5 12.8375 10.000000
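If you want a compact table with one mean per time bin instead of the mean repeated on every row, a hedged sketch (assuming the same dat with columns renamed to t1, v1, t2, v2) using aggregate:
by_sec <- 10  # use 100 or 400 for the real data
aggregate(v1 ~ cut(t1, breaks = seq(0, max(t1) + by_sec, by_sec)), data = dat, FUN = mean)
aggregate(v2 ~ cut(t2, breaks = seq(0, max(t2) + by_sec, by_sec)), data = dat, FUN = mean)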