I'm cleaning a large dataset using xarray. Various variable attributes get set throughout the code, but not always in the same order. I'm wondering if it's possible to manually reorder the attributes for all variables at the end of the code, so that they're always in the same order.
For example, if various lines of code set:
ds['hurs'].attrs['auxilliary_variables'] = "hurs_qc"
ds['hurs'].attrs['long_name'] = "relative_humidity"
ds['hurs'].attrs['standard_name'] = "relative_humidity"
ds['hurs'].attrs['units'] = "percent"
How can I then reorganize these attributes in the following order: long_name, standard_name, units, auxilliary_variables?
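One possible approach, sketched here under the assumption that `attrs` behaves like an ordinary insertion-ordered Python dict (true for plain dicts since Python 3.7): build a new dict with the keys in the desired order and assign it back. The helper name `reorder_attrs` is made up for illustration, and a plain dict stands in for `ds['hurs'].attrs` so the snippet runs without a dataset.

```python
def reorder_attrs(attrs, preferred_order):
    """Return a new dict with the keys from preferred_order first, then the rest."""
    # Take the preferred keys first, skipping any that are not present.
    ordered = {key: attrs[key] for key in preferred_order if key in attrs}
    # Append attributes not named in preferred_order, keeping their existing order.
    ordered.update(
        (key, value) for key, value in attrs.items() if key not in ordered
    )
    return ordered


preferred_order = ["long_name", "standard_name", "units", "auxilliary_variables"]

# Stand-in for ds['hurs'].attrs, set in a different order (as in the question).
attrs = {
    "auxilliary_variables": "hurs_qc",
    "long_name": "relative_humidity",
    "standard_name": "relative_humidity",
    "units": "percent",
}

reordered = reorder_attrs(attrs, preferred_order)
print(list(reordered))
```

With an actual dataset, something like `for name in ds.variables: ds[name].attrs = reorder_attrs(ds[name].attrs, preferred_order)` at the end of the script should apply the same ordering to every variable, though this has not been checked against a specific xarray version.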
I have several numerical variables whose values I want to add (+) together into a new variable. Everything I tried gave me the error message "non-numeric argument to binary operator". How do I add the values?
I have also tried adding them with no_pp <- c("Suppliers" + "Producers" + "Buyers"), but it doesn't show up as a new variable.
I'm creating a variable dataset and assigning a value to it like this:
dataset = iris
Now I assign a different value to the same variable like this:
dataset = read.csv(filename, header = FALSE)
Does R override the previous value of dataset? Can anyone explain to me how this works, and can we assign more than one value to the same variable?
Yes, R will overwrite the value of the previously assigned variable. (Some examples to get started can be found here.)
On a side note: unlike many other languages, R conventionally uses <- as its assignment operator, so to make the code more readable to other R users, you should consider using that instead of =.
I am trying to do a subset within a function and it is not working out the way I was hoping. Here is the code I am actually trying to run:
plot.by = function(output, plot.warming, plot.baseline, apply.dim, attribute, selection){
  par(mfrow= c(3,3))
  for(site in 1:length(output)){
    zero = apply(get(output[site])[,,attribute], c(1), sum)/32
    zero = zero[selection]
    zero = as.data.frame(zero)
    zero = rownames(zero)
    two = get(plot.warming[site])[zero,,,]
    zero = get(plot.baseline[site])[zero,,,]
    boxplot(apply(two, apply.dim, mean) - apply(zero, apply.dim, mean), ylim = c(-200,300))
  }
}
plot.by(output= data.zero, plot.warming= aet.two, plot.baseline = aet.zero, apply.dim = c(1,3), attribute= "snowpack", selection= zero>2000)
and here is my example to play with:
select.by.attribute = function(data, selection){
  tmp = data[selection]
  plot(tmp)
}
select.by.attribute(data=data, selection= data>100)
I know that the example function works, however I believe it only works because it does the selection before the function is even run. If I run my actual code with a clean workspace it says "zero" is not found. If at all possible I would like the selection= >1000 rather than having the object in there.
In addition, any suggestions on how to search for this stuff in the future, or an information source, would be great. For example, I don't even know what the line that "calls" the function is called, or what the different attributes are called, which made searching for this question quite difficult.
To add more information about what I am trying to do: my data is from a hydrological model where the outputs are daily measurements of things like snowfall, precipitation, evaporation, etc. In the end I am trying to plot data from these sites based upon certain attributes such as precip>2000. To do this I first need to apply over some attribute, then I subset those specific row names (which are the site names), and then I plot those sites at the bottom (the plot(apply is there to collapse 4 dimensions into two so they can be plotted).
Essentially I need to do this a lot of times, so I want to be able to quickly do this for whatever attribute I want, whether that be precipitation or snowfall, as well as for a selection of >2000 or whatever number. Hence why I tried to make the function.
I am using something like this to filter my data frame:
d1 = data.frame(data[data$ColA == "ColACat1" & data$ColB == "ColBCat2", ])
When I print d1, it works as expected. However, when I type d1$ColB, it still prints everything from the original data frame.
> print(d1)
ColA ColB
-----------------
ColACat1 ColBCat2
ColACat1 ColBCat2
> print(d1$ColA)
Levels: ColACat1 ColACat2
Maybe this is expected, but when I pass d1 to ggplot, it messes up my graph and does not use the filter. Is there any way I can filter the data frame and get only the records that match the filter? I want d1 to not know about the existence of data.
As you allude to, the default behavior in R (in versions before 4.0.0) is to treat character columns in data frames as a special data type, called a factor. This is a feature, not a bug, but like any useful feature, if you're not expecting it and don't know how to properly use it, it can be quite confusing.
Factors are meant to represent categorical (rather than numerical, or quantitative) variables, which come up often in statistics.
The subsetting operations you used do in fact work normally. Namely, they will return the correct subset of your data frame. However, the levels attribute of that variable remains unchanged, and still has all the original levels in it.
This means that any method written in R that is designed to take advantage of factors will treat that column as a categorical variable with a bunch of levels, many of which just aren't present. In statistics, one often wants to track the presence of 'missing' levels of categorical variables.
I actually also prefer to work with stringsAsFactors = FALSE, but many people frown on that since it can reduce code portability. (TRUE was the default before R 4.0.0, so sharing your code with someone on an older version may be risky unless you preface every single script with a call to options.)
A potentially more convenient solution, particularly for data frames, is to combine the subset and droplevels functions:
subsetDrop <- function(...){
droplevels(subset(...))
}
and use this function to extract subsets of your data frames in a way that is assured to remove any unused levels in the result.
This was such a pain! ggplot messes up if you don't do this right. Using this option at the beginning of my script solved it:
options(stringsAsFactors = FALSE)
Looks like it is the intended behavior but unfortunately I had turned this feature on for some other purpose and it started causing trouble for all my other scripts.
I have a data frame in R with a POSIXct variable sessionstarttime. Each row is identified by an integer ID variable for a specified location. The number of rows is different for each location. I plot the overall graph simply by:
myplot <- ggplot(bigMAC, aes(x = sessionstarttime)) + geom_freqpoly()
Is it possible to create a loop that will create and save such plot for each location separately?
Preferably with a file name the same as value of ID variable?
And preferably with the same time scale for each plot?
Not entirely sure what you're asking but you can do one of two things.
a) You can save each individual plot in a loop with a unique name based on ID like so:
ggsave(myplot,filename=paste("myplot",ID,".png",sep="")) # ID is the unique identifier; change the extension from .png to whatever you like (eps, pdf, etc.)
b) Just assign each plot to an element of a list. Then write that list to disk using save.
That would make it very easy to load and access any individual plot at a later time.
I am not sure if I get what you want to do. From what I can guess, I suggest writing a simple function that saves the plot and then using lapply(yourdata,yourfunction,...). Since lapply can be used on lists, it's not necessary for the number of rows to be equal.
HTH
Use something like this in your function:
ggsave(filename,scale=1.5)