Simple Function to normalize related objects

I'm quite new to R and I'm trying to write a function that normalizes my data in different dataframes.
The normalization process is quite easy, I just divide the numbers I want to normalize by the population size for each object (that is stored in the table population).
To know which object relates to one and another I tried to use IDs that are stored in each dataframe in the first column.
I did this because some objects in the population dataframe have no corresponding objects in the dataframes to be normalized; that is, those dataframes sometimes have fewer objects.
Normally one would build a relational database (which I tried), but that didn't work out for me. So I tried to relate the objects within the function, but the function didn't work. Maybe someone here has experience with this and can help me.
So my attempt to write this function was:
# Load Tables
# Agriculture, Annual Crops
table.annual.crops <-read.table ("C:\\Users\\etc", header=T,sep=";")
# Agriculture, Bianual and Perrenial Crops
table.bianual.crops <-read.table ("C:\\Users\\etc", header=T,sep=";")
# Fishery
table.fishery <-read.table ("C:\\Users\\etc", header=T,sep=";")
# Population per Municipality
table.population <-read.table ("C:\\Users\\etc", header=T,sep=";")
# attach data
attach(table.annual.crops)
attach(table.bianual.crops)
attach(table.fishery)
attach(table.population)
# Create a function to normalize data
# Objects should be related by their ID in the first column
# Values to be normalized and the population appear in the second column
funktion.norm.percapita<-function (x,y){if(x[,1]==y[,1]){x[,2]/y[,2]}else{return("0")}}
# execute the function
funktion.norm.percapita(table.annual.crops,table.population)

Let's start with the attach steps... why? It's usually unnecessary and can get you into trouble! Especially since both your population data.frame and your crops data.frame have Geocode as a column!
As suggested in the comments, you can use merge. By default this combines data.frames using the columns that share a name. You can specify which columns to merge on with the by parameters.
dat <- merge(table.annual.crops, table.population)
dat$crop.norm <- dat$CropValue / dat$Population
The reason your function isn't working? Look at the result of the condition in your if statement.
table.annual.crops[,1] == table.population[,1]
This gives a vector of booleans (recycling the shorter vector if the lengths differ), whereas if expects a single TRUE or FALSE.
If your data is quite large (on the order of millions of rows), the merge function can be slow. If this is the case, take a look at the data.table package and use its merge function instead.
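To make the merge approach concrete, here is a minimal sketch on made-up data. The column names Geocode, CropValue and Population follow the answer above; the values themselves are invented for illustration, and the real tables may of course look different.

```r
# Toy stand-ins for the real tables (values are invented)
table.annual.crops <- data.frame(Geocode   = c(101, 102, 104),
                                 CropValue = c(500, 1200, 300))
table.population   <- data.frame(Geocode    = c(101, 102, 103, 104),
                                 Population = c(1000, 4000, 2500, 600))

# Inner join on the shared ID column: municipality 103 has no crop row and drops out
dat <- merge(table.annual.crops, table.population, by = "Geocode")
dat$crop.norm <- dat$CropValue / dat$Population

# Keep every municipality instead; those without crop data get NA
dat_all <- merge(table.population, table.annual.crops, by = "Geocode", all.x = TRUE)
dat_all$crop.norm <- dat_all$CropValue / dat_all$Population
```

Whether to drop or keep unmatched municipalities depends on what the normalized table is used for; all.x = TRUE makes the missing rows visible as NA rather than silently removing them.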

Related

Create a new data frame of the means of randomly selected rows - looped

Question:
I have a data.frame (hlth) that consists of 49 vectors - a mix of numeric(25:49) and factor(1:24). I am trying to randomly select 50 rows, then calculate column means only for the numeric columns (dropping the other values), and then place the random row mean(s) into a new data.frame (beta). I would then like to iterate this process 1000 times.
I have attempted this process, but the values that get returned are identical and the new means will not enter the new data.frame.
Here are a few rows and columns of the data.frame (hlth):
DateIn adgadj Sex VetMedCharges pwtfc
1/01/2006 3.033310 STEER 0.00 675.1151
1/10/1992 3.388245 STEER 2540.33 640.2261
1/10/1995 3.550847 STEER 572.78 607.6200
1/10/1996 2.893707 HEIFER 549.42 425.5217
1/10/1996 3.647233 STEER 669.18 403.8238
The code I have used thus far:
set.seed[25]
beta<-data.frame()
net.row<-function(n=50){
netcol=sample(1:nrow(hlth),size=n ,replace=TRUE)
rNames <- row.names(hlth)
subset(hlth,rNames%in%netrow,select=c(25:49))
colMeans(s1,na.rm=TRUE,dims=1)
}
beta$net.row=replicate(1000,net.row()); net.row
The two issues that I have detected are:
1) Returns the same value(s) each iteration
2) "Error during wrap-up: object of type 'closure' is not subsettable" when the beta$netrow
Any suggestions would be appreciated!!!

Just adding to my comment (and firstly pasting it):
netcol=sample(1:nrow(hlth),size=n ,replace=TRUE) should presumably be netrow = ..., and the error is a scoping problem: R is trying to subset the function beta, presumably because it can't find netRowMeans in the data.frame you've defined, so it moves on to the global environment and throws an error there.
There are also a couple of other things. You don't assign subset(hlth,rNames%in%netrow,select=c(25:49)) to a variable, which I think you mean to assign to s1, so colMeans is probably running on something you've set in the global environment.
If you want to pass a variable directly into the data frame beta in that manner, you'll have to initialise beta with the right number of columns and rows. The column means you've passed out will be a vector of length 25 (1 x 25), so they won't fit in a single column. You would probably be better off initialising a matrix called mat or something (to avoid scoping confusion masking the actual error messages) with 25 columns and 1000 rows.
EDIT: Question has been edited slightly since I posted this, but most points still stand.
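Putting these points together, a minimal sketch might look like the following. The toy hlth below has only one factor and two numeric columns (the real one has 49), and net_row and num_cols are names chosen here for illustration. Note also that set.seed[25] in the question indexes the function instead of calling it, which on its own produces an "object of type 'closure' is not subsettable" error; it needs parentheses.

```r
set.seed(25)  # parentheses, not square brackets

# Toy stand-in for hlth: one factor column, two numeric columns (values invented)
hlth <- data.frame(
  Sex    = factor(sample(c("STEER", "HEIFER"), 200, replace = TRUE)),
  adgadj = rnorm(200, mean = 3, sd = 0.3),
  pwtfc  = rnorm(200, mean = 550, sd = 100)
)

# Pick out the numeric columns instead of hard-coding 25:49
num_cols <- sapply(hlth, is.numeric)

# Sample n rows and return the column means of the numeric columns
net_row <- function(n = 50) {
  rows <- sample(nrow(hlth), size = n, replace = TRUE)
  colMeans(hlth[rows, num_cols, drop = FALSE], na.rm = TRUE)
}

# replicate() returns one column per iteration, so transpose to get one row each
beta <- as.data.frame(t(replicate(1000, net_row())))
dim(beta)
```

Building the whole result in one step avoids having to pre-allocate beta and assign into it inside a loop.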

R: Avoid repeating lines of code using R subsets in scripts

I'm very new to R - but have been developing SAS-programs (and VBA) for some years. Well, the thing is that I have 4 lines of R-code (scripts?) that I would like to repeat 44 times. Two times for each of 22 different train stations, indicating whether the train is in- or out-going. The four lines of code are:
dataGL_FLIin <- subset( dataGL_all, select = c(Tidsinterval, Dag, M.ned, Ugenr.,Kode, Ugedag, FLIin))
names(dataGL_FLIin)[names(dataGL_FLIin)=='FLIin'] <- 'GL_Antal'
dataGL_FLIin$DIR<-"IN"
dataGL_FLIin$STATION<-"FLI"
To avoid repeating the 4 lines 44 times I need 2 "macro variables" (yes, I'm aware, that this is a SAS-thing only, sorry). One "macro variable" indicating the train station and one indicating the direction. In the example above the train station is FLI and the direction is in. Below the same 4 lines are demonstrated for the train station FBE, this time in out-going direction.
dataGL_FBEout <- subset( dataGL_all, select = c(Tidsinterval, Dag, M.ned, Ugenr.,Kode, Ugedag, FBEout))
names(dataGL_FBEout)[names(dataGL_FBEout)=='FBEout'] <- 'GL_Antal'
dataGL_FBEout$DIR<-"OUT"
dataGL_FBEout$STATION<-"FBE"
I have looked in many places and tried many combinations of R functions and R lists, but I can't make it work. Quite possibly I'm getting it all wrong. I apologize in advance if the question is (too) stupid, but I will be very grateful for any help on the matter.
Pls. notice that I, in the end, want 44 different data-frames created:
1) dataGL_FLIin
2) dataGL_FBEout
3) Etc. ...
ADDED: 2 STATION 2 DIRECTIONS EXAMPLE OF MY PROBLEM
'The one data frame I have'
Date<-c("01-01-15 04:00","01-01-15 04:20","01-01-15 04:40")
FLIin<-c(96,39,72)
FLIout<-c(173,147,103)
FBEin<-c(96,116,166)
FBEout<-c(32,53,120)
dataGL_all<-data.frame(Date, FLIin, FLIout, FBEin, FBEout)
'The four data frames I would like'
GL_antal<-c(96,39,72)
Station<-("FLI")
Dir<-("IN")
dataGL_FLIin<-data.frame(Date, Station, Dir, GL_antal)
GL_antal<-c(173,147,103)
Station<-("FLI")
Dir<-("OUT")
dataGL_FLIout<-data.frame(Date, Station, Dir, GL_antal)
GL_antal<-c(96,116,166)
Station<-("FBE")
Dir<-("IN")
dataGL_FBEin<-data.frame(Date, Station, Dir, GL_antal)
GL_antal<-c(32,53,120)
Station<-("FBE")
Dir<-("OUT")
dataGL_FBEout<-data.frame(Date, Station, Dir, GL_antal)
Thanks,
lars

With your example, it is now clearer what you want, so I'll give it a second try. I use dataGL_all as defined in your question and then define
stations <- rep(c("FLI","FBE"),each=2)
directions <- rep(c("in","out"),times=length(stations)/2)
You could also extract the stations and directions from your data frame. Using your example, the following would work
stations <- substr(names(dataGL_all)[-1],1,3)
directions <- substr(names(dataGL_all)[-1],4,6)
Then, I define the function that will work on the data:
dataGLfun <- function(station,direction) {
name <- paste0(station,direction)
dataGL <- dataGL_all[,c("Date", name)]
names(dataGL)[names(dataGL)==name] <- 'GL_Antal'
dataGL$DIR<-direction
dataGL$STATION<-station
dataGL
}
And now I apply this function to all stations with both directions:
dataGL <- mapply(dataGLfun,stations,directions,SIMPLIFY=FALSE)
names(dataGL) <- paste0(stations,directions)
Now, you can get the data frames for each combination of station and direction. For instance, the two examples in your question, you get with dataGL$FLIin and dataGL$FBEout. The reason that there is a $ instead of a _ is that I did not actually create a separate variable for each data frame. Instead, I created a list, where each element of the list is one of the data frames. This has the advantage that it will be easier to do something to all the data frames later. With your solution, you would have to type all the various variable names, but if the data frames are in a list, you can work with them using functions like lapply.
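As a rough illustration of why the list is convenient, the sketch below rebuilds the list from the example data in the question and then works on all data frames at once. The final do.call(rbind, ...) step and the name dataGL_long are additions made here for illustration, not part of the original answer.

```r
# Example data from the question
Date <- c("01-01-15 04:00", "01-01-15 04:20", "01-01-15 04:40")
dataGL_all <- data.frame(Date,
                         FLIin  = c(96, 39, 72),   FLIout = c(173, 147, 103),
                         FBEin  = c(96, 116, 166), FBEout = c(32, 53, 120))

stations   <- substr(names(dataGL_all)[-1], 1, 3)
directions <- substr(names(dataGL_all)[-1], 4, 6)

dataGLfun <- function(station, direction) {
  name <- paste0(station, direction)
  dataGL <- dataGL_all[, c("Date", name)]
  names(dataGL)[names(dataGL) == name] <- "GL_Antal"
  dataGL$DIR <- direction
  dataGL$STATION <- station
  dataGL
}

dataGL <- mapply(dataGLfun, stations, directions, SIMPLIFY = FALSE)
names(dataGL) <- paste0(stations, directions)

# Apply something to every data frame in one go
sapply(dataGL, nrow)

# Or stack them into a single long data frame
dataGL_long <- do.call(rbind, dataGL)
```

If the upper-case IN/OUT labels from the question are wanted, toupper(direction) inside dataGLfun would be one way to get them.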
If you prefer to have many different variables, you could do the following
for (i in seq_along(stations)) {
assign(paste0("dataGL_",stations[i],directions[i]), dataGLfun(stations[i],directions[i]))
}
However, in my opinion, this is not how you should solve this problem in R.

Loop and clear the basic function in R

I've got this dataset
install.packages("combinat")
install.packages("quantmod")
library(quantmod)
library(combinat)
library(utils)
getSymbols("AAPL",from="2012-01-01")
data<-AAPL
p1<-4
dO<-data[,1]
dC<-data[,4]
emaO<-EMA(dO,n=p1)
emaC<-EMA(dC,n=p1)
Pos_emaO_dO_UP<-emaO>dO
Pos_emaO_dO_D<-emaO<dO
Pos_emaC_dC_UP<-emaC>dC
Pos_emaC_dC_D<-emaC<dC
Pos_emaC_dO_D<-emaC<dO
Pos_emaC_dO_UP<-emaC>dO
Pos_emaO_dC_UP<-emaO>dC
Pos_emaO_dC_D<-emaO<dC
Profit_L_1<-((lag(dC,-1)-lag(dO,-1))/(lag(dO,-1)))*100
Profit_L_2<-(((lag(dC,-2)-lag(dO,-1))/(lag(dO,-1)))*100)/2
Profit_L_3<-(((lag(dC,-3)-lag(dO,-1))/(lag(dO,-1)))*100)/3
Profit_L_4<-(((lag(dC,-4)-lag(dO,-1))/(lag(dO,-1)))*100)/4
Profit_L_5<-(((lag(dC,-5)-lag(dO,-1))/(lag(dO,-1)))*100)/5
Profit_L_6<-(((lag(dC,-6)-lag(dO,-1))/(lag(dO,-1)))*100)/6
Profit_L_7<-(((lag(dC,-7)-lag(dO,-1))/(lag(dO,-1)))*100)/7
Profit_L_8<-(((lag(dC,-8)-lag(dO,-1))/(lag(dO,-1)))*100)/8
Profit_L_9<-(((lag(dC,-9)-lag(dO,-1))/(lag(dO,-1)))*100)/9
Profit_L_10<-(((lag(dC,-10)-lag(dO,-1))/(lag(dO,-1)))*100)/10
which are given to this frame
frame<-data.frame(Pos_emaO_dO_UP,Pos_emaO_dO_D,Pos_emaC_dC_UP,Pos_emaC_dC_D,Pos_emaC_dO_D,Pos_emaC_dO_UP,Pos_emaO_dC_UP,Pos_emaO_dC_D,Profit_L_1,Profit_L_2,Profit_L_3,Profit_L_4,Profit_L_5,Profit_L_6,Profit_L_7,Profit_L_8,Profit_L_9,Profit_L_10)
colnames(frame)<-c("Pos_emaO_dO_UP","Pos_emaO_dO_D","Pos_emaC_dC_UP","Pos_emaC_dC_D","Pos_emaC_dO_D","Pos_emaC_dO_UP","Pos_emaO_dC_UP","Pos_emaO_dC_D","Profit_L_1","Profit_L_2","Profit_L_3","Profit_L_4","Profit_L_5","Profit_L_6","Profit_L_7","Profit_L_8","Profit_L_9","Profit_L_10")
There is a vector with the variables for later use
vector<-c("Pos_emaO_dO_UP","Pos_emaO_dO_D","Pos_emaC_dC_UP","Pos_emaC_dC_D","Pos_emaC_dO_D","Pos_emaC_dO_UP","Pos_emaO_dC_UP","Pos_emaO_dC_D")
I made all possible combinations of 4 variables from the vector (there are no dependent variables)
comb<-as.data.frame(combn(vector,4))
comb
and removed the "nonsense" combinations (those containing both possible values of the same variable)
rc<-comb[!sapply(comb, function(x) any(duplicated(sub('_D|_UP', '', x))))]
rc
Then I prepare the first combination for later subsetting
var<-paste(rc[,1],collapse=" & ")
var
and subset the frame (with all DVs)
kr<-eval(parse(text=paste0('subset(frame,' , var,')' )))
kr
Now I have the df subsetted by the first combination of 4 variables.
Then I used the evaluation function on it
evaluation<-function(x){
s_1<-nrow(x[x$Profit_L_1>0,])/nrow(x)
s_2<-nrow(x[x$Profit_L_2>0,])/nrow(x)
s_3<-nrow(x[x$Profit_L_3>0,])/nrow(x)
s_4<-nrow(x[x$Profit_L_4>0,])/nrow(x)
s_5<-nrow(x[x$Profit_L_5>0,])/nrow(x)
s_6<-nrow(x[x$Profit_L_6>0,])/nrow(x)
s_7<-nrow(x[x$Profit_L_7>0,])/nrow(x)
s_8<-nrow(x[x$Profit_L_8>0,])/nrow(x)
s_9<-nrow(x[x$Profit_L_9>0,])/nrow(x)
s_10<-nrow(x[x$Profit_L_10>0,])/nrow(x)
n_1<-nrow(x[x$Profit_L_1>0,])/nrow(frame)
n_2<-nrow(x[x$Profit_L_2>0,])/nrow(frame)
n_3<-nrow(x[x$Profit_L_3>0,])/nrow(frame)
n_4<-nrow(x[x$Profit_L_4>0,])/nrow(frame)
n_5<-nrow(x[x$Profit_L_5>0,])/nrow(frame)
n_6<-nrow(x[x$Profit_L_6>0,])/nrow(frame)
n_7<-nrow(x[x$Profit_L_7>0,])/nrow(frame)
n_8<-nrow(x[x$Profit_L_8>0,])/nrow(frame)
n_9<-nrow(x[x$Profit_L_9>0,])/nrow(frame)
n_10<-nrow(x[x$Profit_L_10>0,])/nrow(frame)
pr_1<-sum(kr[,"Profit_L_1"])/nrow(kr[,kr=="Profit_L_1"])
pr_2<-sum(kr[,"Profit_L_2"])/nrow(kr[,kr=="Profit_L_2"])
pr_3<-sum(kr[,"Profit_L_3"])/nrow(kr[,kr=="Profit_L_3"])
pr_4<-sum(kr[,"Profit_L_4"])/nrow(kr[,kr=="Profit_L_4"])
pr_5<-sum(kr[,"Profit_L_5"])/nrow(kr[,kr=="Profit_L_5"])
pr_6<-sum(kr[,"Profit_L_6"])/nrow(kr[,kr=="Profit_L_6"])
pr_7<-sum(kr[,"Profit_L_7"])/nrow(kr[,kr=="Profit_L_7"])
pr_8<-sum(kr[,"Profit_L_8"])/nrow(kr[,kr=="Profit_L_8"])
pr_9<-sum(kr[,"Profit_L_9"])/nrow(kr[,kr=="Profit_L_9"])
pr_10<-sum(kr[,"Profit_L_10"])/nrow(kr[,kr=="Profit_L_10"])
mat<-matrix(c(s_1,n_1,pr_1,s_2,n_2,pr_2,s_3,n_3,pr_3,s_4,n_4,pr_4,s_5,n_5,pr_5,s_6,n_6,pr_6,s_7,n_7,pr_7,s_8,n_8,pr_8,s_9,n_9,pr_9,s_10,n_10,pr_10),ncol=3,nrow=10,dimnames=list(c(1:10),c("s","n","pr")))
df<-as.data.frame(mat)
return(df)
}
result<-evaluation(kr)
result
And I need help with several things.
1, In the evaluation function, the way the matrix is built is wrong (s_1, n_1, pr_1 all end up in the first column, but I need the values to be filled in row by row).
2, I need some loop/lapply function to go through all possible combinations (not only the first one, as in var<-paste(rc[,1],collapse=" & ")) and produce understandable output in which the evaluation function is applied to every combination, so that I can see which combination of variables each evaluation belongs to and compare the evaluation results across combinations.
3, This is not the main point, BUT I generally want to evaluate all possible combinations (that is, for 2:n variables, and all combinations within each size) and then get the best possible combination according to a specific DV (Profit_L_1 or Profit_L_2 and so on). I am still weak at looping, so, if possible, keep in mind what I am going to do with it later.
Thanks, feel free to update, repair or improve the question (if there is something which could be done more easily or effectively, do it; I am open to any sensible advice).
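One possible direction for points 1 and 2, sketched on a small invented frame rather than the AAPL data: matrix() fills column by column unless byrow = TRUE is given, and a named list combined with lapply keeps track of which combination each result belongs to. The toy columns (A_UP, A_D, B_UP, B_D), the reduced evaluation function, and the use of logical indexing in place of eval(parse()) are all assumptions made here for illustration.

```r
# Toy frame: two up/down indicator pairs and two profit columns (values invented)
frame <- data.frame(
  A_UP = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  A_D  = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  B_UP = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  B_D  = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  Profit_L_1 = c(1.2, -0.5, 0.3, 2.0, -1.0, 0.7),
  Profit_L_2 = c(0.4, 0.1, -0.2, 1.1, -0.6, -0.3)
)
vars        <- c("A_UP", "A_D", "B_UP", "B_D")
profit_cols <- c("Profit_L_1", "Profit_L_2")

# All pairs, minus those that contain both directions of the same indicator
comb <- combn(vars, 2, simplify = FALSE)
rc   <- Filter(function(x) !any(duplicated(sub("_D$|_UP$", "", x))), comb)
names(rc) <- sapply(rc, paste, collapse = " & ")

# One row per profit column, with columns s, n and pr
evaluation <- function(x, n_total) {
  t(sapply(profit_cols, function(p) {
    v <- x[[p]]
    c(s  = mean(v > 0, na.rm = TRUE),
      n  = sum(v > 0, na.rm = TRUE) / n_total,
      pr = mean(v, na.rm = TRUE))
  }))
}

# Evaluate every combination; the list names say which combination is which
results <- lapply(rc, function(v) {
  keep <- Reduce(`&`, frame[v])   # rows where all chosen indicators are TRUE
  evaluation(frame[keep, , drop = FALSE], nrow(frame))
})
```

results[["A_UP & B_UP"]] then holds the evaluation for that combination, and the same pattern extends to other combination sizes by changing the second argument of combn.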

perform function on pairs of columns

I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a dataframe that is more conducive to the functions I eventually want to run for each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1,ncol(df3)-1,2)) {
kud <- kernel.area(df3[,j:(j+1)],id=id,kern="bivnorm",unin=c("m"),unout=c("km2"))
print(kud)
}
My end goal is to calculate kernel.area for each resampling event (ie rows 1:100 for every pair of columns up to 200), and be able to combine the results in a dataframe. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I changed it and now get the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy

First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be df3 <- t(df2), but this is most likely correct in your actual code and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since : has a higher precedence than + (see ?Syntax for the order in which operators are evaluated in R). To get the desired two columns, use j:(j+1) instead.
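The precedence point is easy to check at the console. In this small sketch the variable names single and pair are chosen here for illustration:

```r
j <- 3

# : binds tighter than +, so this is (j:j) + 1, a single number
single <- j:j+1

# Parentheses give the intended two-element index
pair <- j:(j+1)

length(single)  # 1
length(pair)    # 2
```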
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1,ncol(df3)-1,2)) {
kud <-kernelUD(SpatialPoints(df3[,j:(j+1)]),kern="bivnorm")
kernAr<-kernel.area(kud,unin=c("m"),unout=c("km2"))
print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints something, and as is mentioned in a comment below, kernelUD() is called before kernel.area().

How to attach a simple data.frame to a SpatialPolygonDataFrame in R?

I have (again) a problem with combining data frames in R. But this time, one is a SpatialPolygonsDataFrame (SPDF) and the other one is a usual data.frame (DF). The SPDF has around 1000 rows, the DF only 400. Both have a common column, QDGC.
Now, I tried
oo <- merge(SPDF,DF, by="QDGC", all=T)
but this only results in a normal data.frame, not a spatial polygon data frame any more.
I read somewhere else that this does not work, but I did not understand what to do in such a case (it has something to do with the ID columns that merge uses).
Oooh, such a hard question, I guess...
Thanks!
Jens

Let df = data frame, sp = spatial polygon object and by = name or column number of common column. You can then merge the data frame into the sp object using the following line of code
sp@data = data.frame(sp@data, df[match(sp@data[,by], df[,by]),])
Here is how the code works. The match function inside aligns the rows of df with the row order of sp@data, so the order is preserved when the two are combined. A quick check to see if the code has worked is to inspect the two columns corresponding to the common column and see if they are identical (the common column gets duplicated and it is easy to remove the copy, but I keep it as it is a good check).
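The alignment that match performs can be seen without any spatial objects. In the sketch below a plain data frame, spdf_data, stands in for the attribute table in the data slot; it, the column names area and value, and all values are invented for illustration.

```r
# Stand-in for the attribute table of the spatial object (4 polygons)
spdf_data <- data.frame(QDGC = c("A", "B", "C", "D"),
                        area = c(10, 20, 30, 40))

# The ordinary data frame: fewer rows, different order
DF <- data.frame(QDGC = c("C", "A"), value = c(3.5, 1.2))

# For each polygon, the row of DF with the same key (NA if there is none)
idx <- match(spdf_data$QDGC, DF$QDGC)

# Same number and order of rows as spdf_data; unmatched polygons get NA
merged <- data.frame(spdf_data, DF[idx, ])
merged
```

Because the result has exactly one row per polygon in the original order, it can replace the attribute table without disturbing the link to the geometries. The duplicated key shows up as QDGC.1 and can be compared with QDGC as the check described above.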

It is as easy as this:
require(sp) # the trick is that this package must be loaded!
oo <- merge(SPDF,DF, by="QDGC")
I've tested it myself. But it only works if you use merge from package sp, which is the default when the sp package is loaded: the merge function is then overloaded, and sp::merge is used if the first argument is a spatial structure.

merge can produce a dataframe with more rows than the originals if there's not a simple 1-1 mapping of the two dataframes. In which case, it would have to copy all the geometry and create multiple polygons, which is probably not a good thing.
If you have a dataframe which is the same number of rows as a SpatialPointsDataFrame, then you can just directly replace the @data slot.
library(sp)
example(overlay) # to get the srdf object
srdf@data
spplot(srdf)
srdf@data=data.frame(x=runif(3),xx=rep(0,3))
spplot(srdf)
If you get the number of rows wrong:
srdf@data=data.frame(x=runif(2),xx=rep(0,2))
spplot(srdf)
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 3, 2

Maybe the function joinCountryData2Map in the rworldmap package can give inspiration. (But I may be wrong, as I was last time.)

One more solution is to use append_data function from the tmaptools package. It is called with these arguments:
append_data(shp, data, key.shp = NULL, key.data = NULL,
ignore.duplicates = FALSE, ignore.na = FALSE,
fixed.order = is.null(key.data) && is.null(key.shp))
It's a bit unfortunate that it's called append, since I'd understand append more in the sense of rbind, and we want something like join or merge here.
Ignoring that fact, the function is really useful for making sure you got your joins correct and for spotting rows that are present on only one side of the join. From the docs:
Under coverage (shape items that do not correspond to data records),
over coverage (data records that do not correspond to shape items
respectively) as well as the existence of duplicated key values are
automatically checked and reported via console messages. With
under_coverage and over_coverage the under and over coverage key
values from the last append_data call can be retrieved,

If two shapefiles need to be merged into a single object, just use rbind().
When using rbind(), make sure that both arguments are spatial data frames of the same class. You can check this using class(sf). If an object is not yet an sf data frame, use st_as_sf() to convert it before you rbind them.
Note: You can also use this to append to NULLs, which is handy when you are using a result from a loop and want to accumulate the results.
