R: Avoid repeating lines of code using R subsets in scripts

R: Avoid repeating lines of code using R subsets in scripts - r

I'm very new to R - but have been developing SAS-programs (and VBA) for some years. Well, the thing is that I have 4 lines of R-code (scripts?) that I would like to repeat 44 times. Two times for each of 22 different train stations, indicating whether the train is in- or out-going. The four lines of code are:
dataGL_FLIin <- subset( dataGL_all, select = c(Tidsinterval, Dag, M.ned, Ugenr.,Kode, Ugedag, FLIin))
names(dataGL_FLIin)[names(dataGL_FLIin)=='FLIin'] <- 'GL_Antal'
dataGL_FLIin$DIR<-"IN"
dataGL_FLIin$STATION<-"FLI
To avoid repeating the 4 lines 44 times I need 2 "macro variables" (yes, I'm aware, that this is a SAS-thing only, sorry). One "macro variable" indicating the train station and one indicating the direction. In the example above the train station is FLI and the direction is in. Below the same 4 lines are demonstrated for the train station FBE, this time in out-going direction.
dataGL_FBEout <- subset( dataGL_all, select = c(Tidsinterval, Dag, M.ned, Ugenr.,Kode, Ugedag, FBEout))
names(dataGL_FBEout)[names(dataGL_FBEout)=='FBEout'] <- 'GL_Antal'
dataGL_FBEout$DIR<-"OUT"
dataGL_FBEout$STATION<-"FBE"
I have looked many places and tried many combinations of R-functions and R-lists, but I can't make it work. Quite possible I'm getting it all wrong. I apologize in advance if the question is (too) stupid, but will however be very grateful for any help on the matter.
Pls. notice that I, in the end, want 44 different data-frames created:
1) dataGL_FLIin
2) dataGL_FBEout
3) Etc. ...
ADDED: 2 STATION 2 DIRECTIONS EXAMPLE OF MY PROBLEM
'The one data frame I have'
Date<-c("01-01-15 04:00","01-01-15 04:20","01-01-15 04:40")
FLIin<-c(96,39,72)
FLIout<-c(173,147,103)
FBEin<-c(96,116,166)
FBEout<-c(32,53,120)
dataGL_all<-data.frame(Date, FLIin, FLIout, FBEin, FBEout)
'The four data frames I would like'
GL_antal<-c(96,39,72)
Station<-("FLI")
Dir<-("IN")
dataGL_FLIin<-data.frame(Date, Station, Dir, GL_antal)
GL_antal<-c(173,147,103)
Station<-("FLI")
Dir<-("OUT")
dataGL_FLIout<-data.frame(Date, Station, Dir, GL_antal)
GL_antal<-c(96,116,166)
Station<-("FBE")
Dir<-("IN")
dataGL_FBEin<-data.frame(Date, Station, Dir, GL_antal)
GL_antal<-c(32,53,120)
Station<-("FBE")
Dir<-("OUT")
dataGL_FBEout<-data.frame(Date, Station, Dir, GL_antal)
Thanks,
lars

With your example, it is now clearer what you want and I give it a second try. I use dataGL_all as defined in your question and the define
stations <- rep(c("FLI","FBE"),each=2)
directions <- rep(c("in","out"),times=length(stations)/2)
You could also extract the stations and directions from your data frame. Using your example, the following would work
stations <- substr(names(dataGL_all)[-1],1,3)
directions <- substr(names(dataGL_all)[-1],4,6)
Then, I define the function that will work on the data:
dataGLfun <- function(station,direction) {
name <- paste0(station,direction)
dataGL <- dataGL_all[,c("Date", name)]
names(dataGL)[names(dataGL)==name] <- 'GL_Antal'
dataGL$DIR<-direction
dataGL$STATION<-station
dataGL
}
And now I apply this function to all stations with both directions:
dataGL <- mapply(dataGLfun,stations,directions,SIMPLIFY=FALSE)
names(dataGL) <- paste0(stations,directions)
Now, you can get the data frames for each combination of station and direction. For instance, the two examples in your question, you get with dataGL$FLIin and dataGL$FBEout. The reason that there is a $ instead of a _ is that I did not actually create a separate variable for each data frame. Instead, I created a list, where each element of the list is one of the data frames. This has the advantage that it will be easier to do something to all the data frames later. With your solution, you would have to type all the various variable names, but if the data frames are in a list, you can work with them using functions like lapply.
If you prefer to have many different variables, you could do the following
for (i in seq_along(stations)) {
assign(paste0("dataGL_",stations[i],directions[i]), dataGLfun(stations[i],directions[i]))
}
However, in my opinion, this is not how you should solve this problem in R.

Related

Performing HCPC on the columns (i.e. variables) instead of the rows (i.e. individuals) after (M)CA

I would like to perform a HCPC on the columns of my dataset, after performing a CA. For some reason I also have to specify at the start, that all of my columns are of type 'factor', just to loop over them afterwards again and convert them to numeric. I don't know why exactly, because if I check the type of each column (without specifying them as factor) they appear to be numeric... When I don't load and convert the data like this, however, I get an error like the following:
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : infinite or
missing values in 'x'
Could this be due to the fact that there are columns in my dataset that only contain 0's? If so, how come that it works perfectly fine by reading everything in first as factor and then converting it to numeric before applying the CA, instead of just performing the CA directly?
The original issue with the HCPC, then, is the following:
# read in data; 40 x 267 data frame
data_for_ca <- read.csv("./data/data_clean_CA_complete.csv",row.names=1,colClasses = c(rep('factor',267)))
# loop over first 267 columns, converting them to numeric
for(i in 1:267)
data_for_ca[[i]] <- as.numeric(data_for_ca[[i]])
# perform CA
data.ca <- CA(data_for_ca,graph = F)
# perform HCPC for rows (i.e. individuals); up until here everything works just fine
data.hcpc <- HCPC(data.ca,graph = T)
# now I start having trouble
# perform HCPC for columns (i.e. variables); use their coordinates that are stocked in the CA-object that was created earlier
data.cols.hcpc <- HCPC(data.ca$col$coord,graph = T)
The code above shows me a dendrogram in the last case and even lets me cut it into clusters, but then I get the following error:
Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w =
res.sauv$call$row.w.init) : object 'data.clust' not found
It's worth noting that when I perform MCA on my data and try to perform HCPC on my columns in that case, I get the exact same error. Would anyone have any clue as how to fix this or what I am doing wrong exactly? For completeness I insert a screenshot of the upper-left corner of my dataset to show what it looks like:
Thanks in advance for any possible help!

I know this is old, but because I've been troubleshooting this problem for a while today:
HCPC says that it accepts a data frame, but any time I try to simply pass it $col$coord or $colcoord from a standard ca object, it returns this error. My best guess is that there's some metadata it actually needs/is looking for that isn't in a data frame of coordinates, but I can't figure out what that is or how to pass it in.
The current version of FactoMineR will actually just allow you to give HCPC the whole CA object and tell it whether to cluster the rows or columns. So your last line of code should be:
data.cols.hcpc <- HCPC(data.ca, cluster.CA = "columns", graph = T)

Calling a specific column in a subset of data that has been binned and stored in a list

I have a very large data set that I have binned, and stored each bin (subset) as a list so that I can easily call any given subset. My problem is in calling for a specific column within a subset.
For example my data (which has diameters and strengths as the columns), is broken up into 20 bins, by diameter. I manually binned the data, like so:
subset.1 <- subset(mydata, Diameter <= 0.01)
Similar commands were used, to make 20 bins. Then I stored the names (subset.1 through subset.20) into a list:
diameter.bin<-list(subset.1, ... , subset.20)
I can successfully call each diameter bin using:
diameter.bin[x]
Now, if I only want to see the strength values for a given diameter bin, I can use the original name (that is store in the list):
subset.x$Strength
But I cannot get this information using the list call:
diameter.bin[x]$Strength
This command returns NULL
Note that when I call any subset (either by diameter.bin[x], subset.x or even subset.x$Strength) my column headers do show up. When I use:
names(subset.1)
This returns "Diameter" and "Strength"
But when I use:
names(diameter.bin[1])
This returns NULL.
I'm assuming that the column header is part of the problem, but I'm not sure how to fix it, other than take the headers off of the original data file. I would prefer not to do this if at all possible.
The end goal is to look at the distribution of strength values for each diameter bin, so I will be doing things like drawing histograms, calculating parameters etc. I was hoping to do something along these lines to produce the histograms:
n=length(diameter.bin)
for(i in (1:n))
{
hist(diameter.bin[i]$Strength)
}
And do something similar to this to store median values for each bin in a new vector.
Any tips are greatly appreciated, as right now I'm doing it all 1 bin at a time, and I know a loop (or something similar) would really speed up my analysis.

You need two square brackets. Here is a reproducible example demonstrating the issue:
> diam <- data.frame(x=rnorm(5), y=rnorm(5))
>
> diam.l <- list(diam, diam)
> diam.l[1]$x
NULL
> diam.l[[1]]$x
[1] -0.5389441 -0.5155441 -1.2437108 -2.0044323 -0.6914124

Simple Function to normalize related objects

I'm quite new to R and I'm trying to write a function that normalizes my data in diffrent dataframes.
The normalization process is quite easy, I just divide the numbers I want to normalize by the population size for each object (that is stored in the table population).
To know which object relates to one and another I tried to use IDs that are stored in each dataframe in the first column.
I thought to do so because some objects that are in the population dataframe have no corresponding objects in the dataframes to be normalized, as to say, the dataframes sometimes have lesser objects.
Normally one would built up a relational database (which I tried) but it didn't worked out for me that way. So I tried to related the objects within the function but the function didn't work. Maybe someone of you has experience with this and can help me.
so my attempt to write this function was:
# Load Tables
# Agriculture, Annual Crops
table.annual.crops <-read.table ("C:\\Users\\etc", header=T,sep=";")
# Agriculture, Bianual and Perrenial Crops
table.bianual.crops <-read.table ("C:\\Users\\etc", header=T,sep=";")
# Fishery
table.fishery <-read.table ("C:\\Users\\etc", header=T,sep=";")
# Population per Municipality
table.population <-read.table ("C:\\Users\\etc", header=T,sep=";")
# attach data
attach(table.annual.crops)
attach(table.bianual.crops)
attach(table.fishery)
attach(table.population)
# Create a function to normalize data
# Objects should be related by their ID in the first column
# Values to be normalized and the population appear in the second column
funktion.norm.percapita<-function (x,y){if(x[,1]==y[,1]){x[,2]/y[,2]}else{return("0")}}
# execute the function
funktion.norm.percapita(table.annual.crops,table.population)

Lets start with the attach steps... why? Its usually unecessary and can get you into trouble! Especially since both your population data.frame and your crops data.frame have Geocode as a column!
as suggested in the comments, you can use merge. This will by default combine data.frames using columns of the same name. You can specify which columns on which to merge with the by parameters.
dat <- merge(table.annual.crops, table.population)
dat$crop.norm <- dat$CropValue / dat$Population
The reason your function isn't working? Look at the results of your if statemnt.
table.annual.crops[,1] == table.population[,1]
Gives a vector of booleans that will recycle the shorter vector. If your data is quite large (on the order of millions of rows) the merge function can be slow. if this is the case, take a look at the data.table package and use its merge function instead.

performing a calculation with a `paste`d vector reference

So I have some lidar data that I want to calculate some metrics for (I'll attach a link to the data in a comment).
I also have ground plots that I have extracted the lidar points around, so that I have a couple hundred points per plot (19 plots). Each point has X, Y, Z, height above ground, and the associated plot.
I need to calculate a bunch of metrics on the plot level, so I created plotsgrouped with split(plotpts, plotpts$AssocPlot).
So now I have a data frame with a "page" for each plot, so I can calculate all my metrics by the "plot page". This works just dandy for individual plots, but I want to automate it. (yes, I know there's only 19 plots, but it's the principle of it, darn it! :-P)
So far, I've got a for loop going that calculates the metrics and puts the results in a data frame called Results. I pulled the names of the groups into a list called groups as well.
for(i in 1:length(groups)){
Results$Plot[i] <- groups[i]
Results$Mean[i] <- mean(plotsgrouped$PLT01$Z)
Results$Std.Dev.[i] <- sd(plotsgrouped$PLT01$Z)
Results$Max[i] <- max(plotsgrouped$PLT01$Z)
Results$75%Avg.[i] <- mean(plotsgrouped$PLT01$Z[plotsgrouped$PLT01$Z <= quantile(plotsgrouped$PLT01$Z, .75)])
Results$50%Avg.[i] <- mean(plotsgrouped$PLT01$Z[plotsgrouped$PLT01$Z <= quantile(plotsgrouped$PLT01$Z, .50)])
...
and so on.
The problem arises when I try to do something like:
Results$mean[i] <- mean(paste("plotsgrouped", groups[i],"Z", sep="$")). mean() doesn't recognize the paste as a reference to the vector plotsgrouped$PLT27$Z, and instead fails. I've deduced that it's because it sees the quotes and thinks, "Oh, you're just some text, I can't get the mean of you." or something to that effect.
Btw, groups is a list of the 19 plot names: PLT01-PLT27 (non-consecutive sometimes) and FTWR, so I can't simply put a sequence for the numeric part of the name.
Anyone have an easier way to iterate across my test plots and get arbitrary metrics?
I feel like I have all the right pieces, but just don't know how they go together to give me what I want.
Also, if anyone can come up with a better title for the question, feel free to post it or change it or whatever.

Try with:
for(i in seq_along(groups)) {
Results$Plot[i] <- groups[i] # character names of the groups
tempZ = plotsgrouped[[groups[i]]][["Z"]]
Results$Mean[i] <- mean(tempZ)
Results$Std.Dev.[i] <- sd(tempZ)
Results$Max[i] <- max(tempZ)
Results$75%Avg.[i] <- mean(tempZ[tempZ <= quantile(tempZ, .75)])
Results$50%Avg.[i] <- mean(tempZ[tempZ <= quantile(tempZ, .50)])
}

perform function on pairs of columns

I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a dataframe that is more condusive to eventual functions I want to run for each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1,ncol(df3)-1,2)) {
kud <- kernel.area(df3[,j:(j+1)],id=id,kern="bivnorm",unin=c("m"),unout=c("km2"))
print(kud)
}
My end goal is to calculate kernel.area for each resampling event (ie rows 1:100 for every pair of columns up to 200), and be able to combine the results in a dataframe. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I change it and now have the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy

First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be
df3 <- t(df2), but this is most likely correct in your actual code
and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the
loop. j:j+1 is just a single number, since the : has a higher
precedence than + (see ?Syntax for the order in which
mathematical operations are conducted in R). To get the desired two
columns, use j:(j+1) instead.
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1,ncol(df3)-1,2)) {
kud <-kernelUD(SpatialPoints(df3[,j:(j+1)]),kern="bivnorm")
kernAr<-kernel.area(kud,unin=c("m"),unout=c("km2"))
print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints something, and as is mentioned in a comment below, kernelUD() is called before kernel.area().

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

R: Avoid repeating lines of code using R subsets in scripts - r

Related

Performing HCPC on the columns (i.e. variables) instead of the rows (i.e. individuals) after (M)CA

Calling a specific column in a subset of data that has been binned and stored in a list

Simple Function to normalize related objects

performing a calculation with a `paste`d vector reference

perform function on pairs of columns

Categories

Resources