Easiest way to apply a series of calculations to similar data frames in R

The following is an example of how I want to treat my data sets. It might be a bit difficult to understand how my data frame is structured, but I hope it makes sense:
First, density must be calculated for columns A, B, and C using raw data from columns ADry, AEthanol, BDry, and so on. (Since these were defined earlier as vectors too, I used the vectors instead of the data frame columns because it was shorter: ADry_1_0 instead of Sample_1_0$ADry_1_0.)
Sample_1_0$ADensi_1_0=(ADry_1_0/(ADry_1_0-AEthanol_1_0))*(peth-pair)+pair
Sample_1_0$BDensi_1_0=(BDry_1_0/(BDry_1_0-BEthanol_1_0))*(peth-pair)+pair
Sample_1_0$CDensi_1_0=(CDry_1_0/(CDry_1_0-CEthanol_1_0))*(peth-pair)+pair
This yields 10 densities each for A, B, and C. What we are interested in is the mean density:
Mean_1_0=apply(Sample_1_0[7:9],2,mean)
Next, standard deviations are found. We are mainly interested in the standard deviations of our raw data columns (ADry and AEthanol), since error propagation calculations are carried out afterwards to find out how the deviations add up when calculating the densities:
StdAfv_1_0=apply(Sample_1_0,2,sd)
Error propagation (same for B and C)
ASd_1_0=(sqrt((sd(Sample_1_0$ADry_1_0)/mean(Sample_1_0$ADry_1_0))^2+(sqrt((sd(Sample_1_0$ADry_1_0)^2+sd(Sample_1_0$AEthanol_1_0)^2))/(mean(Sample_1_0$ADry_1_0)-mean(Sample_1_0$AEthanol_1_0)))^2))*mean(Sample_1_0$ADensi_1_0)
In the end we semi-manually gathered the final information (mean density and its deviation) in a plottable data frame. Some of the code might be a tad long, and maybe we could have achieved the same results with shorter code, but bear with us, we are rookies.
So now to the actual problem:
This was for A_1_0, B_1_0, and C_1_0. We would like to apply the same series of commands to 15 other data frames. The dimensions are the same, and they will be named A_1_1, A_1_2, A_2_0 and so on.
Is it possible to use some kind of loop or make a loadable script containing x and y placeholders, where we can easily insert A_1_1, for instance?
Thanks in advance. I tried to keep the amount of confusion to a minimum, although it's tough!
Data list

If, instead of individual vectors, you combine the raw data into data frames (or even better, data.tables) and then store the data frames for all runs in a list, as @Gregor suggested, you can use the function below together with lapply.
library(data.table)

# Compute the densities for A, B and C, then summarise them in a single row.
# Assumes columns 1-3 hold the Dry values for A, B, C and columns 4-6 the
# matching Ethanol values.
my_func <- function(dataset, peth, pair){
  names <- names(dataset)
  setDT(dataset)[, `:=` (ADens = (get(names[1])/(get(names[1])-get(names[4])))*(peth-pair)+pair,
                         BDens = (get(names[2])/(get(names[2])-get(names[5])))*(peth-pair)+pair,
                         CDens = (get(names[3])/(get(names[3])-get(names[6])))*(peth-pair)+pair)
  ][, .(ADens_mean = mean(ADens),
        ADens_sd = sd(ADens),
        # error propagation: both relative terms sit under one square root,
        # as in the formula given in the question
        AErr = sqrt((sd(get(names[1]))/mean(get(names[1])))^2 +
                    (sqrt(sd(get(names[1]))^2 + sd(get(names[4]))^2)/
                     (mean(get(names[1])) - mean(get(names[4]))))^2) * mean(ADens),
        BDens_mean = mean(BDens),
        BDens_sd = sd(BDens),
        BErr = sqrt((sd(get(names[2]))/mean(get(names[2])))^2 +
                    (sqrt(sd(get(names[2]))^2 + sd(get(names[5]))^2)/
                     (mean(get(names[2])) - mean(get(names[5]))))^2) * mean(BDens),
        CDens_mean = mean(CDens),
        CDens_sd = sd(CDens),
        CErr = sqrt((sd(get(names[3]))/mean(get(names[3])))^2 +
                    (sqrt(sd(get(names[3]))^2 + sd(get(names[6]))^2)/
                     (mean(get(names[3])) - mean(get(names[6]))))^2) * mean(CDens))
  ]
}

# One summary row per data set, stacked into a single data.table
rbindlist(lapply(list_datasets, my_func, peth = 2, pair = 1))
Now, this assumes that you put your raw vectors into data frames with the columns in the order in which they appeared in your example (and that they are the only columns in the data set). If this is not the case, you may just have to edit the indices in the names[x] calls. If you want a little more flexibility, you could also define a list of lists with the column names for each of your raw data sets, add that as an argument to my_func, and then replace every get(names[x]) with get(list_column_names[x]).
This function should output a data.table with one row per data set (1-16) and 9 columns (ADens_mean, ADens_sd, AErr, ...).
NOTE: since there was no actual data to work with, I can't say for sure that this function does exactly what you want, but I think it will be close. This will also require you to install the data.table package.
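For comparison, the same "one function, many data frames" idea can be sketched in base R without data.table. This is only an illustration under assumptions: the helper name density_summary, the suffix-free column names (ADry, AEthanol, ...) and the random numbers are all made up, and peth = 2, pair = 1 are just the placeholder values from the call above.

```r
# Hypothetical helper: summarise one data frame whose columns are assumed to be
# named ADry, BDry, CDry, AEthanol, BEthanol, CEthanol.
density_summary <- function(df, peth, pair) {
  rows <- lapply(c("A", "B", "C"), function(s) {
    dry <- df[[paste0(s, "Dry")]]
    eth <- df[[paste0(s, "Ethanol")]]
    dens <- dry / (dry - eth) * (peth - pair) + pair
    # error propagation as written in the question
    err <- sqrt((sd(dry) / mean(dry))^2 +
                (sqrt(sd(dry)^2 + sd(eth)^2) / (mean(dry) - mean(eth)))^2) * mean(dens)
    data.frame(sample = s, mean_density = mean(dens), error = err)
  })
  do.call(rbind, rows)
}

# Made-up data standing in for the real measurements
set.seed(1)
fake_data <- function() {
  data.frame(ADry = rnorm(10, 5, 0.1), BDry = rnorm(10, 5, 0.1),
             CDry = rnorm(10, 5, 0.1), AEthanol = rnorm(10, 3, 0.1),
             BEthanol = rnorm(10, 3, 0.1), CEthanol = rnorm(10, 3, 0.1))
}
datasets <- list(Sample_1_0 = fake_data(), Sample_1_1 = fake_data())

# Apply the same calculation to every data frame and stack the results
results <- lapply(datasets, density_summary, peth = 2, pair = 1)
all_results <- do.call(rbind, Map(function(name, res) cbind(dataset = name, res),
                                  names(results), results))
print(all_results)
```

Keeping the data frames in a named list means a 17th data set only needs one more list entry; the calculation itself is not copied.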

Related

R apply multiple functions when large number of categories/types are present using case_when (R vectorization)

Suppose I have a dataset of the following form:
City=c(1,2,2,1)
Business=c(2,1,1,2)
ExpectedRevenue=c(35,20,15,19)
zz=data.frame(City,Business,ExpectedRevenue)
zz_new=do.call("rbind", replicate(zz, n=30, simplify = FALSE))
My actual dataset contains about 200K rows. Furthermore, it contains information for over 100 cities.
Suppose, for each city (which I also call "Type"), I have the following functions which need to be applied:
#Writing the custom functions for the categories here
Type1=function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
return(BusinessMax)
}
Type2=function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)-100*rnorm(1)
return(BusinessMax)
}
Once again, the above two functions are extremely simple ones that I use for illustration. The idea here is that for each City (or "Type") I need to run a different function for each row in my dataset. In the above two functions, I used rnorm in order to check and make sure that we are drawing different values for each row.
Now, for the entire dataset, I want to first divide the observations by City (or "Type"). I can do this using (zz_new[["City"]]==1) [also see below]. Then I want to run the respective function for each class. However, when I run the code below, I get -Inf.
Can someone help me understand why this is happening?
For the example data, I would expect to obtain 20 plus 10 times some random value (for Type =1) and 35 minus 100 times some random value (for Type=2). The values should also be different for each row since I am drawing them from a random normal distribution.
library(dplyr) #I use dplyr here
zz_new[,"AdjustedRevenue"] = case_when(
zz_new[["City"]]==1~Type1(full_data=zz_new,observation=zz_new[,]),
zz_new[["City"]]==2~Type2(full_data=zz_new,observation=zz_new[,])
)
Thanks a lot in advance.
Let's take a look at your code.
I would rewrite your code
library(dplyr)
zz_new[,"AdjustedRevenue"] = case_when(
zz_new[["City"]]==1~Type1(full_data=zz_new,observation=zz_new[,]),
zz_new[["City"]]==2~Type2(full_data=zz_new,observation=zz_new[,])
)
to
zz_new %>%
mutate(AdjustedRevenue = case_when(City == 1 ~ Type1(zz_new,zz_new),
City == 2 ~ Type2(zz_new,zz_new)))
since you are using dplyr but don't use the powerful tools provided by this package.
Besides the usage of mutate one key change is that I replaced zz_new[,] with zz_new. Now we see that both arguments of your Type-functions are the same dataframe.
Next step: Take a look at your function
Type1 <- function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
return(BusinessMax)
}
which is called by Type1(zz_new,zz_new). So the definition of NewSet gives us
NewSet=full_data[which(!full_data$City==observation$City),]
# replace the arguments
NewSet <- zz_new[which(!zz_new$City==zz_new$City),]
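Thus NewSet is always a data frame with zero rows, because the comparison zz_new$City==zz_new$City is TRUE for every row and its negation selects nothing. Applying max to an empty column of a data.frame yields -Inf.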
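A small base R sketch of the point above, followed by one possible way to get the per-row values the question expects. This is not the code from the question: other_max is a made-up name, and the two Type functions are replaced here by a vectorised ifelse with one random draw per row.

```r
City <- c(1, 2, 2, 1)
Business <- c(2, 1, 1, 2)
ExpectedRevenue <- c(35, 20, 15, 19)
zz <- data.frame(City, Business, ExpectedRevenue)

# Comparing the column with itself removes every row ...
empty <- zz[which(!zz$City == zz$City), ]
nrow(empty)
# ... and max() of an empty numeric vector is -Inf (with a warning)
suppressWarnings(max(empty$ExpectedRevenue))

# For each row: the maximum revenue among all *other* cities
other_max <- sapply(zz$City, function(ct) max(zz$ExpectedRevenue[zz$City != ct]))

# One random draw per row, with a different adjustment per city
set.seed(1)
zz$AdjustedRevenue <- ifelse(zz$City == 1,
                             other_max + 10 * rnorm(nrow(zz)),
                             other_max - 100 * rnorm(nrow(zz)))
print(zz)
```

With more than 100 cities, the ifelse would probably be replaced by a lookup of per-city functions, but the key change is the same: compare each row's city against the full column, not the column against itself.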

Is there a way to run a wilcoxon test for variables with different lengths?

I am trying to run a wilcox.test() on two subsets of data from a data frame. They are not of equal length (48 vs. 260). I want to see if there is a difference between the dbh (diameter at breast height) of live oak trees and water oak trees.
Pine_stand <- read.csv("Pine_stand.csv")
live_oaks <- subset(Pine_stand,Species=="live oak",select=c("dbh"));live_oaks
water_oaks <- subset(Pine_stand,Species=="water oak",select=c("dbh"));water_oaks
wilcox.test(live_oaks~water_oaks,conf.int=T,correct=F)
Error in model.frame.default(formula = live_oaks ~ water_oaks) :
invalid type (list) for variable 'live_oaks'
That was my first attempt; then I tried this
Pine_stand <- read.csv("Pine_stand.csv")
live_dbh <- subset(Pine_stand,Species=="live oak",select=c("dbh"));live_oaks
water_dbh <- subset(Pine_stand,Species=="water oak",select=c("dbh"));water_oaks
oaks<-c(live_dbh,water_dbh)
wilcox.test(dbh~Species,data=oaks)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 48, 260
and received that error. I have tried vectorizing the two groups and appending and tapply ... I know there is a simple answer I am overlooking, I just can't get it to work. All of the examples I am reading are comparing two vectors with the same length. I know I can do the Wilcoxon test by hand when there are different numbers, so there should be a way. Any advice is welcome.
Yes, you can run a wilcox.test for variables of different length. As stated in http://www.r-tutor.com/elementary-statistics/non-parametric-methods/mann-whitney-wilcoxon-test
“Using the Mann-Whitney-Wilcoxon Test, we can decide whether the
population distributions are identical without assuming them to follow
the normal distribution.”
Therefore it’s a non-parametric equivalent of the t-test that we can use, when the assumptions for the t-test are not met (for example distribution is not normal or variances in two samples are not equal).
The problem in your code is that with these two statements:
live_dbh <- subset(Pine_stand,Species=="live oak",select=c("dbh"))
water_dbh <- subset(Pine_stand,Species=="water oak",select=c("dbh"))
you are creating two objects that contain only dbh values, but you lose the information about the labels (Species). Therefore you should write:
live_dbh <- subset(Pine_stand,Species=="live oak",select=c("dbh", "Species"))
water_dbh <- subset(Pine_stand,Species=="water oak",select=c("dbh", "Species"))
Secondly, when you try to merge the two sets with this code:
oaks<-c(live_dbh,water_dbh)
instead of creating a data frame you create a list. Why does that happen? As the documentation for c() says, its name stands for "Combine Values into a Vector or List". You have probably already used it to merge two vectors into one. However, subset() returns a one-column data frame here, not a vector. Therefore our live_dbh and water_dbh sets are data frames (and now, with the label, they even have two columns).
For one-column data frames you can always use c() with the recursive parameter set to TRUE to merge them:
total<-c(one_column_df1, one_column_df2, recursive=TRUE)
However, it's usually safer to use rbind (it is also the only one of the two that works when merging data frames with more than one column). rbind stands for row bind.
oaks<-rbind(live_dbh,water_dbh)
Now you should be able to run a wilcox.test:
wilcox.test(dbh~Species,data=oaks)
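To check that unequal group sizes are not a problem, here is a minimal sketch with simulated data. The data frame below is a made-up stand-in for Pine_stand.csv (48 live oaks, 260 water oaks, plus a third species); the dbh values are random numbers, not real measurements.

```r
set.seed(42)
# Hypothetical replacement for read.csv("Pine_stand.csv")
Pine_stand <- data.frame(
  Species = rep(c("live oak", "water oak", "loblolly pine"), times = c(48, 260, 30)),
  dbh = c(rnorm(48, 40, 8), rnorm(260, 35, 8), rnorm(30, 50, 8))
)

live_dbh <- subset(Pine_stand, Species == "live oak", select = c("dbh", "Species"))
water_dbh <- subset(Pine_stand, Species == "water oak", select = c("dbh", "Species"))
oaks <- rbind(live_dbh, water_dbh)

# Formula interface: response ~ two-level grouping variable
res_formula <- wilcox.test(dbh ~ Species, data = oaks)

# Equivalent call with two plain numeric vectors of different lengths
res_vectors <- wilcox.test(live_dbh$dbh, water_dbh$dbh)

print(res_formula)
```

Both calls test the same hypothesis, so they should report the same statistic and p-value.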
How about
wilcox.test(dbh~Species, data=Pine_stand,
subset=(Species %in% c("live oak", "water oak")))
? (If these are the only two species in your data set, you don't need the subset argument.)

in R, countif efficiently with vectorization

I have a function theresults which takes a 71x2446 data frame and returns a 2x2446 double matrix. The first number in each of the 2446 pairs is an integer 1-6, which is basically the category the line fits into, and the second number is the profit or loss for that category. I want to calculate the sum of profits for each category while counting the frequency of each category. My question is whether the way I've written it is an efficient use of vectors:
vec<-as.data.frame(t(apply(theData,1,theresults)))
vec[2][vec[1]==1]->successCrossed
vec[2][vec[1]==2]->failCrossed
vec[2][vec[1]==3]->successFilled
vec[2][vec[1]==4]->failFilled
vec[2][vec[1]==5]->naCount
vec[2][vec[1]==6]->otherCount
Then there are a bunch of calls to length() and mean() while summarizing the results.
theresults references the original data frame in this sort of way:
theresults<-function(theVector)
{
if(theVector[['Aggressor']]=="Y")
{
if(theVector[['Side']]=="Sell")
{choice=6}
else
{choice=3}
if(!is.na(theVector[['TradePrice']])&&!is.na(theVector[['L1_BidPri_1']])&&!is.na(theVector[['L1_AskPri_1']])&&!is.na(theVector[['L2_BidPri_1']])&&!is.na(theVector[['L2_AskPri_1']]))
{
Profit<- switch(choice,
-as.numeric(theVector[['TradePrice']]) + 10000*as.numeric(theVector[['L1_AskPri_1']])/as.numeric(theVector[['L2_BidPri_1']]),
-as.numeric(theVector[['TradePrice']]) + 10000*as.numeric(theVector[['L1_BidPri_1']])/as.numeric(theVector[['L2_BidPri_1']]),
You can try combining the 2x2446 vector into a string vector representing the type and profit statuses...then calling "table" on it.
Here's an example:
data = cbind(sample(1:6, replace=T, 30),
sample (c("profit", "loss"), replace=T, 30))
x = apply(data, MARGIN=1, paste, collapse="")
table(x)
I'm pretty sure that for this type of operation, even if the data set were in the hundreds of thousands of rows, the correct answer would be to use Uwe's maxim; this code is fast enough and will not be a bottleneck in the program.
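If the goal is a count and a profit sum per category, one vectorised alternative to six separate subsetting lines is table() plus tapply(). The sketch below is only an illustration with simulated values; the column names category and profit are assumptions, since the real output of theresults is not shown in full.

```r
set.seed(1)
# Made-up stand-in for the transposed result of apply(theData, 1, theresults)
vec <- data.frame(category = sample(1:6, 30, replace = TRUE),
                  profit = round(rnorm(30), 2))

# Fix the levels so that categories with no rows still show up
cat_factor <- factor(vec$category, levels = 1:6)

counts <- table(cat_factor)                  # frequency of each category
sums <- tapply(vec$profit, cat_factor, sum)  # total profit per category (NA if empty)
means <- tapply(vec$profit, cat_factor, mean)

print(data.frame(category = 1:6, n = as.vector(counts),
                 total = as.vector(sums), mean = as.vector(means)))
```

This does the grouping in one pass per statistic and replaces the later calls to length() and mean() on the individual vectors.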
(in response to the other answer, cbind is slow and memory intensive relative to my current solution.)

Bandwidth selection using NP package

New to R and having a problem with a very simple task! I have read a few columns of .csv data into R; the variables take values in the natural numbers plus zero and have missing values. After trying to use the np (nonparametric) package, I have two problems. First, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error that "number of regression data and response data do not match". Why do I get this, as I have the same number of elements in each vector?
Second, I would like to treat the data as ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic? It is just a vector with natural numbers and NAs.
Any clarifications would be greatly appreciated!
1) You probably have a different number of NAs in y and x.
2) Can't be sure about this, since there is no example. If it is of the following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see 2.1.1 link and ?factor)
EDIT2: So the problem was the way of subsetting data. Note the difference in various ways of subsetting. data$x and data[,i] (where i = column number of column x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors, unless they call for data = (your data). In that case they work with column names
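The difference between the subsetting forms can be checked directly. The sketch below uses a small made-up data frame (the names dat, x and y are placeholders) and does not call npregbw itself; it only shows which forms give a vector, and one way to drop rows where either variable is missing so that both vectors keep the same length.

```r
dat <- data.frame(x = c(3, 4, NA, 2), y = c(1, NA, 5, 6))

# These return plain vectors
is.atomic(dat$x)
is.atomic(dat[, 1])

# These return one-column data frames, not vectors
is.data.frame(dat["x"])
is.data.frame(dat[1])

# Keep only the rows where both x and y are present
ok <- complete.cases(dat$x, dat$y)
x_clean <- dat$x[ok]
y_clean <- dat$y[ok]

# ordered() works on the vector form
x_ord <- ordered(x_clean)
print(x_ord)
```

After this step x_clean and y_clean have the same length, which addresses point 1), and x_ord is built from a vector instead of a data frame, which addresses point 2).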

perform function on pairs of columns

I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a data frame that is more conducive to the functions I eventually want to run for each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1,ncol(df3)-1,2)) {
kud <- kernel.area(df3[,j:(j+1)],id=id,kern="bivnorm",unin=c("m"),unout=c("km2"))
print(kud)
}
My end goal is to calculate kernel.area for each resampling event (ie rows 1:100 for every pair of columns up to 200), and be able to combine the results in a dataframe. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I changed it and now have the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be df3 <- t(df2), but this is most likely correct in your actual code and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since : has a higher precedence than + (see ?Syntax for the order in which mathematical operations are conducted in R). To get the desired two columns, use j:(j+1) instead.
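The precedence point is easy to verify in isolation. The sketch below uses a small made-up matrix m in place of df3 and does not call kernel.area; it only shows how successive pairs of columns can be picked out.

```r
j <- 3
j:j+1     # parsed as (j:j) + 1, a single number
j:(j+1)   # the two indices j and j + 1

# Made-up matrix with 6 columns, i.e. 3 pairs of x/y columns
m <- matrix(1:12, nrow = 2)

# Collect each successive pair of columns in a list
pairs <- lapply(seq(1, ncol(m) - 1, by = 2), function(j) m[, j:(j + 1)])
length(pairs)
pairs[[2]]   # columns 3 and 4 of m
```

Storing the pieces (or the results computed from them) in a list also makes it straightforward to combine them into one data frame afterwards, for example with do.call(rbind, ...).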
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1,ncol(df3)-1,2)) {
kud <-kernelUD(SpatialPoints(df3[,j:(j+1)]),kern="bivnorm")
kernAr<-kernel.area(kud,unin=c("m"),unout=c("km2"))
print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints something, and as is mentioned in a comment below, kernelUD() is called before kernel.area().

Resources