Split data in R and perform operation

I have a very large file that simply contains wave heights for different tidal scenarios at different locations. My file is organized into 13 wave heights x 9941 events, for 5153 locations.
What I want to do is read in this very long data file, which looks like this:
0.0
0.1
0.2
0.4
1.2
1.5
2.1
.....
Then split it into segments of length 129,233 (13 tidal scenarios × 9941 events at a specific location). On this subset of the data I'd like to perform some statistical functions to calculate exceedance probability, among other things. I will then join it to the file containing location information and print some output files.
My code so far is not working, although I've tried many things. It seems to read the data just fine; however, it is having trouble with the split. I suspect it may have something to do with the format of the input data from the file.
# read files with return period wave heights at defense points
#Read wave heights for 13 tides per 9941 events, for 5143 points
WaveRP.file <- paste('waveheight_test.out')
WaveRPtable <- read.csv(WaveRP.file, head=FALSE)
WaveRP <- c(WaveRPtable)
#colnames(WaveRP) <- c("WaveHeight")
print(paste(WaveRP))
#Read X,Y information for defense points
DefPT.file <- paste('DefXYevery10thpt.out')
DefPT <- read.table(DefPT.file, head=FALSE)
colnames(DefPT) <- c("X_UTM", "Y_UTM")
#Split wave height data frame by defense point
WaveByDefPt <- split(WaveRP, 129233)
print(paste(length(WaveByDefPt[[1]])))
for (i in 1:length(WaveByDefPt)/129233){
print(paste("i",i))
}
I have also tried
#Split wave height data frame by defense point
WaveByDefPt <- split(WaveRP, ceiling(seq_along(WaveRP)/129233))
No matter how I seem to perform the split, I am simply getting the original data as one long subset. Any help would be appreciated!
Thanks :)
Kimberly

Try cut to build groups:
v <- as.numeric(readLines(n = 7))
0.0
0.1
0.2
0.4
1.2
1.5
2.1
groups <- cut(v, breaks = 3) # you want breaks = 129233
aggregate(x = v, by = list(groups), FUN = mean) # e.g. means per group
#           Group.1     x
# 1 (-0.0021,0.699] 0.175
# 2     (0.699,1.4] 1.200
# 3       (1.4,2.1] 1.800
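Note that cut() here builds groups by value range, not by position. If the aim is consecutive segments of a fixed length, as in the question, a small positional variant (same v as above, hypothetical chunk size 3):
# Group by position rather than value: every 3 consecutive values
# form one group (use 129233 instead of 3 for the real data)
groups <- ceiling(seq_along(v) / 3)
split(v, groups)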

You are kind of shuffling the data into various data types here.
When the file is originally read, it is a data frame with one column (V1). Then you pass it to c(), which results in a list containing a single vector. This means anything you try to do to WaveRP will probably fail, because WaveRP is the name of the list; the numeric vector is WaveRP[[1]].
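A quick toy illustration of what c() does to a one-column data frame:
df <- data.frame(V1 = c(0.0, 0.1, 0.2))
str(c(df))
# List of 1
#  $ V1: num [1:3] 0 0.1 0.2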
Instead, just extract the numeric vector using the $ operator and then you can work with it. Or just work with it inside the data frame. The fun part will be thinking of a way to create the grouping vector. I'll give an example.
Something like this:
WaveRP.file <- paste('waveheight_test.out')
WaveRPtable <- read.csv(WaveRP.file, head=FALSE)
WaveRPtable$group <- ceiling(seq_along(WaveRPtable$V1)/129233)
SplitWave <- split(WaveRPtable, WaveRPtable$group)
Now you will have a list of data frames, one per group of 129,233 rows (i.e. one per defense point). Look at each one using double-bracket indexing: SplitWave[[2]], for example, to look at the second group. You can merge the location information file with these data frames individually.
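As a sketch of the follow-on step the question describes (the threshold value, and the assumption that DefPT rows line up one-to-one with the split groups, are both illustrative):
# Exceedance probability per defense point: the fraction of the
# 129,233 wave heights above some threshold (value is illustrative)
threshold <- 1.0
exceed_prob <- sapply(SplitWave, function(d) mean(d$V1 > threshold))
# Assuming DefPT rows are in the same order as the split groups
result <- cbind(DefPT, exceed_prob)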

Related

Omitting NAs from Data

First time posting. Apologies if I'm not as clear as I intend.
I have an Excel (xlsx) spreadsheet of data; it's sequencing data, if that helps. Generally indexed as follows:
column 1 = organism families (hundreds of organisms down this column)
columns 2-x = specific samples
Many of the cells scattered throughout the data are zero values, or too low, and I want to omit them. I set my data such that anything under 5 becomes NA. Since different samples will have more, fewer, or different species omitted by that threshold, I want to separate by samples. Code so far is:
#Files work, I just omitted my directories to place online
library(readxl) # read_excel() comes from the readxl package
my_counts <- read_excel("...Family_120821.xlsx" , sheet = "family_Counts")
my_perc <- read_excel("...Family_120821.xlsx" , sheet = "family_Percentages")
my_counts[my_counts < 5] <- NA
my_counts
my_perc[my_perc < 0.05] <- NA
my_perc
S13 <- my_counts$Sample.13
S13A <- na.omit(S13)
S13A
S14 <- my_counts$Sample.14
S14A <- na.omit(S14)
S14A
S15 <- my_counts$Sample.15
S15A <- na.omit(S15)
S15A
...
First question: is there a better way to go about this, so that I can replicate it on different data without typing out each individual sample?
Most important question: when I do this, I get the values I want, with no NAs. But they are bare values, when I want another data frame so I can write it back to an xlsx. As I have it, I lose the association to the organism.
Ex: Before
All samples by associated organisms
Ex: After
Single sample, no NAs, but also no association to organism index
Essentially the following image, but broken into individual samples. With only the organisms that met my threshold of 5 for counts, 0.05 for percents.
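A minimal sketch of the loop-free approach the first question asks about, assuming the counts sheet has a family column plus one column per sample (names as in the code above):
# Keep the family column alongside each sample column, then drop NA
# rows, so each output data frame keeps the organism association
sample_cols <- setdiff(names(my_counts), "family")
per_sample <- lapply(sample_cols, function(s) {
  na.omit(my_counts[, c("family", s)])
})
names(per_sample) <- sample_cols
# per_sample[["Sample.13"]] is then a data frame that can be written
# back to xlsx (e.g. with the writexl or openxlsx packages)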

Writing a while loop for two sets of data for R

This is probably simple, but I'm new to R and it doesn't work like GrADS, so I've been searching high and low for examples, but to no avail.
I have two sets of data. Data A (1997) and Data B (2000)
Data A has 35 headings (apples, orange, grape etc). 200 observations.
Data B has 35 headings (apples, orange, grape, etc). 200 observations.
The only difference between the two datasets is the year.
So I would like to correlate the two datasets, i.e. the 200 values under Apples (1997) vs the 200 values under Apples (2000), so that one heading gives me exactly one value.
I've converted all the header names to V1,V2,V3...
So now I need to do this:
x<-1
while(x<35) {
new(x)=cor(1997$V(x),2000$V(x))
print(new(x))
}
and then i get this error:
Error in pptn26$V(x) : attempt to apply non-function.
Any advice is highly appreciated!
Your error comes directly from using parentheses where R isn't expecting them. You'll get the same type of error if you do 1(x): 1 is not a function, so if you put it right next to parentheses with no white space between, you're attempting to apply a non-function.
I'm also a bit surprised that you manage to get all the way to that error before running into several others, but I suppose that has something to do with when R evaluates what...
Here's how to get the behavior you're looking for:
mapply(cor, A, B)
# provided A is the name of your 1997 data frame and B the 2000
Here's an example with simulated data:
set.seed(123)
A <- data.frame(x = 1:10, y = sample(10), z = rnorm(10))
B <- data.frame(x = 4:13, y = sample(10), z = rnorm(10))
mapply(cor, A, B)
#          x          y          z
#  1.0000000  0.1393939 -0.2402058
In its typical usage, mapply takes an n-ary function and n objects that provide the n arguments for that function. Here the n-ary function is cor, and the objects are A, and B, each a data frame. A data frame is structured as a list of vectors, the columns of the data frame. So mapply will loop along your columns for you, making 35 calls to cor, each time with the next column of both A and B.
If you have managed to figure out how to name your data frames 1997 and 2000, kudos. It's not easy to do that. It's also going to cause you headaches. You'll want to have a syntactically valid name for your data frame(s). That means they should start with a letter (or a dot, but really a letter). See the R FAQ for the details.
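For what it's worth, if the data frames really did end up named 1997 and 2000 (e.g. via assign()), they can still be reached with backticks, though renaming is the saner fix:
# Non-syntactic names need backticks (or get()) to be referenced
assign("1997", A)
assign("2000", B)
mapply(cor, `1997`, `2000`) # same as mapply(cor, get("1997"), get("2000"))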

Can dplyr and data.table be used traditionally and inside loops to extract results from data frames?

Suppose I have a data frame of over 700,000 observations and four variables, and I would like to extract some values by first indexing on the district variable (shown here as Dist):
Date X Y Dist
2003/01 2.4 5.5 1
2003/02 2.3 4.0 1
2003/03 1.9 4.4 1
.
.
.
2004/11 3.7 2.9 700
2004/12 2.6 5.9 700
That is, a dataset of Xs and Ys for 700 districts, with each district having a yearly record of Xs and Ys. For each district, some values need to be extracted, so I thought I could use dplyr here instead of traditional loops and conditions; however, I'm new to it and not very used to its syntax, and in spite of passing some commands I thought were efficient, I'm not getting the proper results. The resulting data frame should look something like:
X Dist
Some avg. 5
Or even values for multiple districts, arranged in ascending order:
X Dist
Some avg. 4
" 5
" 6
At first, I 'sliced' off data for the districts and saved it as test, to extract the mean and the number of non-NA observations, but the operation produced warnings that I don't understand. For example, for districts 1 to 10:
test <- slice(df, Dist == c(1:10))
This gave a warning that the longer object length is not a multiple of the shorter object length. I could slice for each district and merge the results row-wise, but that is tedious. I actually used a for loop to come up with similar values, but that is simply incomparable to dplyr's efficiency and speed in extracting valuable insights through one-liners instead of lines of code and conditions. It speeds everything up, besides making markdown files cleaner and more readable. How can the chained operator %>% be used here to come up with similar results? Can it be used with traditional loops and conditions?
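A minimal dplyr sketch of the kind of pipeline the question is after; the statistic (mean of X) is assumed for illustration:
library(dplyr)
# Mean X per district for districts 1 to 10, in ascending order of Dist.
# filter() with %in% avoids the recycling warning that
# slice(df, Dist == c(1:10)) produced.
df %>%
  filter(Dist %in% 1:10) %>%
  group_by(Dist) %>%
  summarise(X = mean(X, na.rm = TRUE)) %>%
  arrange(Dist)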

Best way to combine netcdf files into one dataframe in R - nested for loops or mapply?

I am trying to combine multiple netcdf files with multiple variables:
- 6 types of parameters
- 36 years
- 12 months
- 31 days
- 6 Y coordinates
- 5 X coordinates
Each netcdf file contains data for one month of one year and one parameter; there are thus 432 * 6 = 2592 files.
How would I best combine this all in a dataframe? It would in the end have to generate something like this:
rowID Date year month day coord.X coord.Y par1 par2 par3 par4 par5 par6
1 1979-01-01 1979 01 01 176 428 3.2 0.005 233.5 0.1 12.2 4.4
..................... 402568 rows in between.................
402570 2014-12-31 2014 12 31 180 433 1.7 0.006 235.7 0.2 0.0 2.7
How would I best combine this? I have already been struggling with it for quite some time...
Excuse me for not knowing how to make this question reproducible, but there are so many elements involved.
This is where I got my files from:
ftp://rfdata:forceDATA#ftp.iiasa.ac.at/WFDEI/
This is what I have so far; I think this is what they call a nested loop, right? I usually just try and try, and in the end it works, but I find this a tough job. Any recommendations on first steps are welcome:
require(ncdf4)
require(plyr)
directory <- c("C:/folder/")  # general folder
parameter <- c("par1","par2","par3","par4","par5","par6") # names of 6 parameters
directory2 <- c("_folder2/")  # parameter-specific folder
directory3 <- c("name")       # last part of folder name
years <- c("1979","otheryears","2014") # years, which are also part of the netcdf file names
months <- c("01","othermonths","12")   # months, which are also part of the netcdf file names
x <- c(176:180) # X-coordinates
y <- c(428:433) # Y-coordinates
for (p in parameter){
  assign(paste0(p,"list"), list())
  for (i in years){
    for (j in months){
      for (k in x){
        for (l in y){
          fileloc <- paste(directory,p,directory2,p,directory3,i,j,".nc",sep="") # location to open
          ncin <- nc_open(fileloc)
          assign(paste0(p))<-ncvar_get(ncin,p) # extract the desired parameter from the netcdf list "ncin" and store it in a vector named after the parameter
          day <- ncvar_get(ncin,"day") # extract the day of month from the netcdf list "ncin"
          par.coord <- paste(p,"[",y,",",x,",","]",sep="") # string with function to select coordinates
          temp <- data.frame(i,j,day,p=par.coord) # store day and parameter in a data frame
          temp <- cbind(date=as.Date(with(temp,paste(i,j,day,sep="-")),"%Y-%m-%d"),temp,Y=y,X=x) # add date and coordinates to the df
          assign(paste0(p,"list"), list(temp) # store multiple data frames in a list.. I think?
        } assign(paste0(p,"list"), do.call(rbind,data) # something to bind the data frames by row in a list
      }}}}
There are many ways to skin a cat like that. Nested loops are perhaps a bit easier to debug if you're new to R. One question I think you want to ask yourself is whether the files have primacy or your conceptual structure has primacy. That is, if your conceptual structure specifies a location for which there isn't a file, what do you want your code to do? If you only want to parse extant files, I find it useful to use list.files(full.names = TRUE, recursive = TRUE) to find the files I want to parse, and then write a function that parses a single file (and its name) into the data structure I want. From there, it is an lapply or purrr::map.
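As a sketch of that file-first pattern (parse_one() is a hypothetical parser you would fill in for your variables):
library(ncdf4)
# Hypothetical parser: open one .nc file, pull out what you need,
# return a data frame (the details depend on your variables)
parse_one <- function(path) {
  nc <- nc_open(path)
  on.exit(nc_close(nc))
  day <- ncvar_get(nc, "day")
  data.frame(file = basename(path), day = day)
}
files <- list.files("C:/folder", pattern = "\\.nc$",
                    full.names = TRUE, recursive = TRUE)
combined <- do.call(rbind, lapply(files, parse_one))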
In order to extract and group all the netcdf files into one data frame by:
- 6 parameters
- 36 years
- 12 months
- 31 days
- 6 Y coordinates
- 5 X coordinates
First I made sure all *.nc files were in one folder.
Second, I simplified the multiple for loops into one, since variables such as year, month and parameter were available from the file names.
The variables day, X coord and Y coord could be extracted together as one array.
require(arrayhelpers); require(stringr); require(plyr); require(ncdf4)
# store all files from ftp://rfdata:forceDATA#ftp.iiasa.ac.at/WFDEI/ in the following folder:
setwd("C:/folder")
temp <- list.files(pattern = "*.nc") # list all the file names
param <- gsub("_\\S+", "", temp, perl = TRUE) # extract the parameter from each file name
xcoord <- seq(176, 180, by = 1) # the X-coordinates you are interested in
ycoord <- seq(428, 433, by = 1) # the Y-coordinates you are interested in
temp_year <- str_sub(temp, -9, -6)  # characters -9 to -6 of each file name hold the year
temp_month <- str_sub(temp, -5, -4) # characters -5 to -4 of each file name hold the month
list_var <- list() # make an empty list
for (t in 1:length(temp)){
  temp_netcdf <- nc_open(temp[t])
  temp_day <- rep(seq(1:length(ncvar_get(temp_netcdf,"day"))), length(xcoord)*length(ycoord)) # day numbers, repeated to match the number of values
  dim.order <- sapply(temp_netcdf[["var"]][[param[t]]][["dim"]], function(x) x$name) # the name of each dimension of the array
  start <- c(lon = 428, lat = 176, tstep = 1) # the starting value of each dimension
  count <- c(lon = 6, lat = 5, tstep = length(ncvar_get(temp_netcdf,"day"))) # how many values to read per dimension, starting from start
  tempstore <- ncvar_get(temp_netcdf, param[t], start = start[dim.order], count = count[dim.order]) # array with parameter values
  df_temp <- array2df(tempstore, levels = list(lon = ycoord, lat = xcoord, day = NA), label.x = "value") # convert the array to a data frame
  Add_date <- sort(as.Date(paste(temp_year[t],"-",temp_month[t],"-",temp_day,sep=""),"%Y-%m-%d"),decreasing=FALSE) # a vector with the dates
  list_var[t] <- list(data.frame(Add_date, df_temp, parameter = param[t])) # add dates to the data frame and store it in the list of output files
  nc_close(temp_netcdf) # close the nc file to prevent data loss and errors when working with many files
}
All_NetCDF_var_in1df <- do.call(rbind, list_var)
#### If you want to take a look at the netcdf files first, use:
list2env(
  lapply(setNames(temp, make.names(gsub("\\.nc$", "", temp))),
         nc_open),
  envir = .GlobalEnv) # import all parameter lists into the global environment

using ffdfdply to split data and get characteristics of each id in the split

Within R I'm using ffdf to work with a large dataset. I want to use ffdfdply from the ffbase package to split the data according to a certain variable (var) and then compute some characteristics for all the observations with a unique value for var (for example: the number of observations for each unique value of var). To see if this is possible using ffdfdply I executed the example described below.
I expected it to split on each Species, calculate the minimum Petal.Width for each Species, and return two columns, each with three entries, listing the Species and the minimum Petal.Width for that Species. Expected output:
Species min_pw
1 setosa 0.1
2 versicolor 1.0
3 virginica 1.4
However for BATCHBYTES=5000 it will use two splits, one containing two Species and the other containing one Species. This results in the following:
Species min_pw
1 setosa 0.1
2 virginica 1.4
When I change BATCHBYTES to 2000, this forces ffdfdply to use three splits and thus produces the expected output posted above. However, I want another way of enforcing a split for each unique value of the variable assigned to 'split'. Is there any way to make this happen? Or do you have any other suggestions to get the result I need?
require(ffbase) # provides ffdfdply; loads ff for as.ffdf
ffiris <- as.ffdf(iris)
result <- ffdfdply(x = ffiris,
                   split = ffiris$Species,
                   FUN = function(x) {
                     min_pw <- min(x$Petal.Width)
                     data.frame(Species = x$Species, min_pw = min_pw)
                   },
                   BATCHBYTES = 5000,
                   trace = TRUE)
dim(result)
dim(iris)
result
The function ffdfdply was designed for situations where you have a lot of split elements, e.g. 1000000 customers, and you want the data in memory split at least by customer, but possibly covering several customers at once if your RAM allows it, so that the internals do not need to do an ffwhich 1000000 times.
That is why the doc of ffdfdply states:
Please make sure your FUN covers the fact that several split elements can be in one chunk of data on which FUN is applied.
So the solution for your issue is to cover this in FUN, namely as follows:
FUN = function(x){
  require(doBy)
  summaryBy(Petal.Width ~ Species, data = x, keep.names = TRUE, FUN = min)
}
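Plugged back into the call from the question, a sketch of the full invocation (same data and BATCHBYTES as above):
result <- ffdfdply(x = ffiris,
                   split = ffiris$Species,
                   FUN = function(x) {
                     require(doBy)
                     # one row per Species present in this chunk, even
                     # when a chunk holds several split elements
                     summaryBy(Petal.Width ~ Species, data = x,
                               keep.names = TRUE, FUN = min)
                   },
                   BATCHBYTES = 5000,
                   trace = TRUE)
result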
