Writing a small dataframe to CSV creates a huge file - R

I am trying to write the dataframe T_df to a CSV file, but the saved "TFile.csv" grows to approximately 50 GB on the Microsoft Azure / R server. Has anyone experienced something similar, and can you please advise?
Example:
write.csv(T_df,"TFile.csv")
creates a 50 GB file, while the dataframe is not that big:
object.size(T_df)
2449776 bytes
str(T_df)
'data.frame': 101994 obs. of 3 variables:

I don't know if there's something special about your particular data, but I don't see this when I run Microsoft R Server version 9.3.0:
> T_df <- data.frame(a = runif(101994), b = runif(101994), c = runif(101994))
> object.size(T_df)
2448752 bytes
> str(T_df)
'data.frame': 101994 obs. of 3 variables:
$ a: num 0.248 0.504 0.197 0.634 0.407 ...
$ b: num 0.226 0.686 0.556 0.629 0.412 ...
$ c: num 0.959 0.122 0.214 0.666 0.23 ...
>
> write.csv(T_df,"TFile.csv")
TFile.csv is 6.1 MB.
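If you do see a blow-up like this, it is worth checking what write.csv is actually being asked to serialize before blaming the write itself. A small diagnostic sketch (T_df stands in for your data frame; nothing here is specific to Azure or Microsoft R Server):
sapply(T_df, class)                      # an unexpected list-column can inflate the CSV
format(object.size(T_df), units = "MB")  # in-memory size, to compare against the file on disk
sapply(T_df, function(col) max(nchar(as.character(col))))  # longest printed value per column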

Related

Problems with full_join in R: no applicable method for "character"

I am new to the R world and I am struggling with the full_join function. I am pretty sure the problem is easy, since I got it working in other situations that I assume were the same as the present one. Anyhow, someone can probably help me. Let's go:
I have several datasets within a big list:
NDVI2003 <- ls(pattern = "x2003_meanNDVI_m.*$")
PixelQa2003 <- ls(pattern = "x2003_meanPixelQa_m.*$")
full_list <- do.call(c, list(NDVI2003,PixelQa2003))
The first two lines are just grabbing some files from a folder. These files look like:
> str(x2003_meanNDVI_m1)
'data.frame': 354 obs. of 5 variables:
$ date : chr "2001-12-03" "2001-12-10" "2001-12-19" "2001-12-26" ...
$ 2003_NDVI_1: num 0.441 0.518 0.322 0.311 0.499 0.319 0.163 0.134 0.452 0.536 ...
$ 2003_NDVI_2: num 0.377 0.446 0.075 0.1 0.006 0.279 0.368 0.135 0.423 0.522 ...
$ 2003_NDVI_3: num 0.332 0.397 0.07 0.093 0.006 0.236 0.469 0.127 0.411 0.535 ...
$ 2003_NDVI_4: num 0.653 0.621 0.536 0.064 0.652 0.576 0.52 0.158 0.666 0.663 ...
The third line simply puts all of these together:
> head(full_list,20)
[1] "x2003_meanNDVI_m1" "x2003_meanNDVI_m2" "x2003_meanNDVI_m3" "x2003_meanNDVI_m4" "x2003_meanNDVI_m5"
[6] "x2003_meanNDVI_m6" "x2003_meanPixelQa_m1" "x2003_meanPixelQa_m2" "x2003_meanPixelQa_m3" "x2003_meanPixelQa_m4"
[11] "x2003_meanPixelQa_m5" "x2003_meanPixelQa_m6"
So far, very simple. Now comes the problem... I want to join all these files by the column 'date'. This very same procedure works in other scripts I have built:
data2003 <- reduce(full_list, full_join, by="date")
But I keep getting an error:
> data2003 <- reduce(full_list, full_join, by="date")
Error in UseMethod("full_join") :
no applicable method for 'full_join' applied to an object of class "character"
So far, what I have tried:
- Changing the column type from character, to date, to number... nothing.
- Altering the order in which the dplyr and plyr packages are loaded when opening R.
- Changing variable names and so on.
- Using full_lst <- list(NDVI2003, PixelQa2003) instead of full_list <- do.call(c, list(NDVI2003, PixelQa2003)).
- Adding full_list <- mget(full_list).
- Googling for hours looking for an answer...
Any help will be really welcome.
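For what it's worth, the error message itself points at the likely cause: ls(pattern = ...) returns a character vector of object names, not the data frames themselves, so full_join ends up being applied to strings. The mget() attempt in the list above is on the right track; the key is to feed reduce() the result of mget(), not the character vector. A minimal sketch under that assumption:
library(dplyr)
library(purrr)

# ls() gave us *names*; mget() fetches the actual data frames behind them
full_list <- mget(c(NDVI2003, PixelQa2003))

# Now reduce() folds data frames instead of character strings
data2003 <- reduce(full_list, full_join, by = "date")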

How do you get structure of data frame with limited length for variable names?

I have a data frame for a raw data set where the variable names are extremely long. I would like to display the structure of the data frame using the str function, and impose a character limit on the displayed variable names, so that it is easier to read.
Here is a reproducible example of the kind of thing I am talking about.
#Data frame with long names
set.seed(1);
DATA <- data.frame(ID = 1:50,
                   Value = rnorm(50),
                   This_variable_has_a_really_long_and_annoying_name_to_illustrate_the_problem_of_a_data_frame_with_a_long_and_annoying_name = runif(50));
#Show structure of DATA
str(DATA);
> str(DATA)
'data.frame': 50 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Value : num -0.626 0.184 -0.836 1.595 0.33 ...
$ This_variable_has_a_really_long_and_annoying_name_to_illustrate_the_problem_of_a_data_frame_with_a_long_and_annoying_name: num 0.655 0.353 0.27 0.993 0.633 ...
I would like to use the str function but impose an upper limit on the number of characters to display in the variable names, so that I get output that is something like the one below. I have read the documentation, but I have not been able to identify if there is an option to do this. (There seem to be options to impose upper limits on the lengths of strings in the data, but I cannot see an option to impose a limit on the length of the variable name.)
'data.frame': 50 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Value : num -0.626 0.184 -0.836 1.595 0.33 ...
$ This_variable_has... : num 0.655 0.353 0.27 0.993 0.633 ...
Question: Is there a simple way to get the structure of the data frame, but imposing a limitation on the length of the variable names (to get output something like the above)?
As far as I can see you're right, there doesn't seem to be a built-in means to control this. You also can't do it after the fact, because str() doesn't return anything. So the easiest option seems to be renaming beforehand. Relying on setNames(), you could create a simple function to accomplish this:
short_str <- function(data, n = 20, ...) {
  name_vec <- names(data)
  # Truncate any name longer than n characters, reserving four characters
  # for the "... " suffix, then run str() on the renamed copy.
  str(setNames(data, ifelse(
    nchar(name_vec) > n, paste0(substring(name_vec, 1, n - 4), "... "), name_vec
  )), ...)
}
short_str(DATA)
'data.frame': 50 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Value : num -0.626 0.184 -0.836 1.595 0.33 ...
$ This_variable_has... : num 0.655 0.353 0.27 0.993 0.633 ...
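The cutoff is adjustable through the n argument, so for example a tighter limit:
short_str(DATA, n = 10)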

When I read in a large table using fread it slightly changes the numbers in one of the columns

I have a large file that looks like this
region type coeff p-value distance count
82365593523656436 A -0.9494 0.050 -16479472.5 8
82365593523656436 B 0.47303 0.526 57815363.0 8
82365593523656436 C -0.8938 0.106 42848210.5 8
When I read it in using fread, suddenly 82365593523656436 is not found anymore
correlations <- data.frame(fread('all_to_all_correlations.txt'))
> "82365593523656436" %in% correlations$region
[1] FALSE
I can find a slightly different number
> "82365593523656432" %in% correlations$region
[1] TRUE
but this number is not in the actual file
grep 82365593523656432 all_to_all_correlations.txt
gives no results, while
grep 82365593523656436 all_to_all_correlations.txt
does.
When I try to read in the small sample file I showed above instead of the full file I get
Warning message:
In fread("test.txt") :
Some columns have been read as type 'integer64' but package bit64 isn't loaded.
Those columns will display as strange looking floating point data.
There is no need to reload the data.
Just require(bit64) to obtain the integer64 print method and print the data again.
and the data looks like
region type coeff p.value distance count
1 3.758823e-303 A -0.94940 0.050 -16479472 8
2 3.758823e-303 B 0.47303 0.526 57815363 8
3 3.758823e-303 C -0.89380 0.106 42848210 8
So I think that, during reading, 82365593523656436 was changed into 82365593523656432. How can I prevent this from happening?
IDs (and that's apparently what the first column is) should usually be read as characters. Note that 82365593523656436 is larger than 2^53, beyond which a double can no longer represent every integer exactly, which is why the value silently comes back as 82365593523656432:
correlations <- setDF(fread('region type coeff p-value distance count
82365593523656436 A -0.9494 0.050 -16479472.5 8
82365593523656436 B 0.47303 0.526 57815363.0 8
82365593523656436 C -0.8938 0.106 42848210.5 8',
colClasses = c(region = "character")))
str(correlations)
#'data.frame': 3 obs. of 6 variables:
# $ region : chr "82365593523656436" "82365593523656436" "82365593523656436"
# $ type : chr "A" "B" "C"
# $ coeff : num -0.949 0.473 -0.894
# $ p-value : num 0.05 0.526 0.106
# $ distance: num -16479473 57815363 42848211
# $ count : int 8 8 8
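Alternatively, if you would rather not list the ID columns by name, fread has an integer64 argument controlling how it reads integers too large for R's native integer type; a sketch, assuming the full file from the question:
library(data.table)

# Read every integer64-sized column as character in one go,
# so no precision is lost and no column needs to be named
correlations <- setDF(fread("all_to_all_correlations.txt", integer64 = "character"))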

Accessing dataframes after splitting a dataframe

I'm splitting a dataframe into multiple dataframes using the command
data <- apply(data, 2, function(x) data.frame(sort(x, decreasing = F)))
I don't know how to access them all conveniently. I know I can access them individually using data$`1`, but I would have to do that for every dataframe:
df1 <- head(data$`1`, k)
df2 <- head(data$`2`, k)
Can I get these dataframes in one go (like storing them in some structure), without changing the indexes of these multiple dataframes?
str(data) gives
List of 2
$ 7:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.265 0.332 0.458 0.51 0.52 ...
$ 8:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.173 0.224 0.412 0.424 0.5 ...
str(data[1:2])
List of 2
$ 7:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.265 0.332 0.458 0.51 0.52 ...
$ 8:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.173 0.224 0.412 0.424 0.5 ...
Thanks to @r2evans I got it done; here is his code from the comments:
Yes. Two short demos: lapply(data, head, n=2), or more generically
sapply(data, function(df) mean(df$x)). – r2evans
and after that, fetching the indexes:
df <- lapply(df, rownames)
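Since the split pieces live in a single named list, the lapply() idiom from the comment generalizes: you can take the top k rows of every piece in one go without disturbing the list names or the row indexes. A small sketch using the data list from the question:
k <- 3  # rows to keep from each sorted piece

# head() every data frame at once; the list names ("7", "8", ...) and
# the row names within each piece are preserved
top_k <- lapply(data, head, n = k)

# the original row indexes of each piece
idx <- lapply(top_k, rownames)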

How do I read ragged/implied-do data into R

How do I read data like the example below? (My actual files are like ftp://ftp.aoml.noaa.gov/hrd/pub/hwind/Operational/2012/AL182012/1030/0730/AL182012_1030_0730.gz formatted per http://www.aoml.noaa.gov/hrd/Storm_pages/grid.html -- They look like fortran implied-do writes)
The issue I have is that there are multiple headers and vectors within the file, with differing numbers of values per line. scan() seems to start from the beginning each time for .gz files, while I want the reads to parse incrementally through the file.
This is a headerline with a name.
The fourth line has the number of elements in the first vector,
and the next vector is encoded similarly
7
1 2 3
4 5 6
7
8
1 2 3
4 5 6
7 8
This doesn't work as I'd like:
fh <- gzfile("junk.gz")
headers <- readLines(fh, 3)
nx <- as.numeric(readLines(fh, 1))
x <- scan(fh, n = nx)
ny <- as.numeric(readLines(fh, 1))
y <- scan(fh, n = ny)
This sort of works, but I have to then calculate the skip values:
...
x <- scan(fh, skip = 3, n = nx)
...
Ah... I discovered that opening with gzfile() does not allow seek operations on the data, so the scan()s all rewind and start at the beginning of the file. If I unzip the file and operate on the uncompressed data, I can read the various bits incrementally with readLines(fh, n) and scan(fh, n = n):
readVector <- function(fh, skip = 0, what = double()) {
  # Skip any leading lines, read the element count, then read the vector
  if (skip != 0) { junk <- readLines(fh, skip) }
  n <- scan(fh, n = 1)
  scan(fh, what = what, n = n)
}
fh <- file("junk", open = "r")  # open once so each read continues where the last one stopped
headers <- readLines(fh, 3)
x <- readVector(fh)
y <- readVector(fh)
xl <- readVector(fh)
yl <- readVector(fh)
... # still need to process a parenthesized complex array, but that is a different problem.
Looking at a few sample files, it looks like you only need to determine the number to be read once, and that can be used for processing all parts of the file.
As I mentioned in a comment, grep would be useful for helping automate the process. Here's a quick function I came up with:
ReadFunky <- function(myFile) {
  fh <- gzfile(myFile)
  myFile <- readLines(fh)
  # The element count on line 5 is repeated before each vector, so it can
  # be used both to size the reads and to locate each section's start.
  vecLen <- as.numeric(myFile[5])
  startAt <- grep(paste("^\\s+", vecLen), myFile)
  T1 <- lapply(startAt[-5], function(x) {
    scan(fh, n = vecLen, skip = x)
  })
  # The fifth section is the parenthesized complex array: strip the
  # parentheses, split the "(a,b)(c,d)..." runs apart, and read the
  # pairs as a two-column table.
  T2 <- gsub("\\(|\\)", "",
             unlist(strsplit(myFile[(startAt[5]+1):length(myFile)], ")(",
                             fixed = TRUE)))
  T2 <- read.csv(text = T2, header = FALSE)
  T2 <- split(T2, rep(1:vecLen, each = vecLen))
  T1[[5]] <- T2
  names(T1) <- myFile[startAt-1]
  T1
}
You can apply it to a downloaded file. Just replace the path with the actual location of the file on your system.
temp <- ReadFunky("~/Downloads/AL182012_1030_0730.gz")
The function returns a list. The first four items in the list are the vectors of coordinates.
str(temp[1:4])
# List of 4
# $ MERCATOR X COORDINATES ... KILOMETERS : num [1:159] -476 -470 -464 -458 -452 ...
# $ MERCATOR Y COORDINATES ... KILOMETERS : num [1:159] -476 -470 -464 -458 -452 ...
# $ EAST LONGITUDE COORDINATES ... DEGREES: num [1:159] -81.1 -81 -80.9 -80.9 -80.8 ...
# $ NORTH LATITUDE COORDINATES ... DEGREES: num [1:159] 36.2 36.3 36.3 36.4 36.4 ...
The fifth item is a set of 2-column data.frames that contain the data from your "parenthesized complex array". Not really sure what the best structure for this data was, so I just stuck it in data.frames. You'll get as many data.frames as the expected number of values for the given data set (in this case, 159).
length(temp[[5]])
# [1] 159
str(temp[[5]][1:4])
# List of 4
# $ 1:'data.frame': 159 obs. of 2 variables:
# ..$ V1: num [1:159] 7.59 7.6 7.59 7.59 7.58 ...
# ..$ V2: num [1:159] -1.33 -1.28 -1.22 -1.16 -1.1 ...
# $ 2:'data.frame': 159 obs. of 2 variables:
# ..$ V1: num [1:159] 7.66 7.66 7.65 7.65 7.64 ...
# ..$ V2: num [1:159] -1.29 -1.24 -1.19 -1.13 -1.07 ...
# $ 3:'data.frame': 159 obs. of 2 variables:
# ..$ V1: num [1:159] 7.73 7.72 7.72 7.71 7.7 ...
# ..$ V2: num [1:159] -1.26 -1.21 -1.15 -1.1 -1.04 ...
# $ 4:'data.frame': 159 obs. of 2 variables:
# ..$ V1: num [1:159] 7.8 7.8 7.79 7.78 7.76 ...
# ..$ V2: num [1:159] -1.22 -1.17 -1.12 -1.06 -1.01 ...
Update
If you want to modify the function so you can read directly from the FTP URL, change the first two lines to read as follows and continue from the 'myFile' line:
ReadFunky <- function(myFile, fromURL = TRUE) {
  if (isTRUE(fromURL)) {
    x <- strsplit(myFile, "/")[[1]]
    y <- download.file(myFile, destfile = x[length(x)])
    fh <- gzfile(x[length(x)])
  } else {
    fh <- gzfile(myFile)
  }
Usage would be like: temp <- ReadFunky("ftp://ftp.aoml.noaa.gov/hrd/pub/hwind/Operational/2012/AL182012/1023/1330/AL182012_1023_1330.gz") for a file that you are going to download directly, and temp <- ReadFunky("~/AL182012_1023_1330.gz", fromURL=FALSE) for a file that you already have saved on your system.
