Custom function does not work in R 'ddply' function - r

I am trying to use a custom function inside 'ddply' in order to create a new variable (NormViability) in my data frame, based on values of a pre-existing variable (CelltiterGLO).
The function is meant to create a rescaled (%) value of 'CelltiterGLO' based on the mean 'CelltiterGLO' values at a specific sub-level of the variable 'Concentration_nM' (0.01).
So if the mean of 'CelltiterGLO' at 'Concentration_nM'==0.01 is set as 100, I want to rescale all other values of 'CelltiterGLO' over the levels of other variables ('CTSC', 'Time_h' and 'ExpType').
The normalization function is the following:
normalize.fun = function(CelltiterGLO) {
  idx = Concentration_nM==0.01
  jnk = mean(CelltiterGLO[idx], na.rm = T)
  out = 100*(CelltiterGLO/jnk)
  return(out)
}
and this is the code I try to apply to my dataframe:
library("plyr")
df.bis=ddply(df,
             .(CTSC, Time_h, ExpType),
             transform,
             NormViability = normalize.fun(CelltiterGLO))
The code runs, but when I double-check (with aggregate or tapply) whether the mean of 'NormViability' equals 100 at 'Concentration_nM'==0.01, I get different numbers instead of 100. However, if I subset my df by the two levels of the variable 'ExpType', the code returns the correct numbers on each separate subset. I tried making 'ExpType' either character or factor, but I got similar results. 'ExpType' has two levels/values, "Combinations" and "DoseResponse". I can't figure out why the code is not working on the entire df. I wonder if this is because the two levels of 'ExpType' do not contain the same number of levels for all the other variables, e.g. one of the levels of 'Time_h' is missing for the level "Combinations" of 'ExpType'.
Thanks very much for your help and I apologize in advance if the answer is already present in Stackoverflow and I was not able to find it.
Michele

I (the OP) found out that the function body used a variable that was missing from its arguments. Simply adding the variable Concentration_nM to the arguments of the custom function (and passing it in the call) solved the problem.
THANKS
m.
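The fix described above can be sketched as follows. The toy data frame is made up for illustration; only the column names come from the question. In the original function, Concentration_nM was not an argument, so R looked it up outside the function instead of using the per-group column that transform provides. The code most likely ran without error because a vector of that name existed in the global environment, which is an assumption, since the question does not show it.

```r
library(plyr)

# Pass the concentration column explicitly so that the function sees
# the per-group values supplied by transform().
normalize.fun <- function(CelltiterGLO, Concentration_nM) {
  idx <- Concentration_nM == 0.01
  ref <- mean(CelltiterGLO[idx], na.rm = TRUE)
  100 * CelltiterGLO / ref
}

# Hypothetical toy data using the column names from the question.
set.seed(1)
df <- expand.grid(CTSC = c("A", "B"),
                  Time_h = c(24, 48),
                  ExpType = c("Combinations", "DoseResponse"),
                  Concentration_nM = c(0.01, 1, 100),
                  rep = 1:2)
df$CelltiterGLO <- runif(nrow(df), 1000, 5000)

df.bis <- ddply(df,
                .(CTSC, Time_h, ExpType),
                transform,
                NormViability = normalize.fun(CelltiterGLO, Concentration_nM))

# Check: the mean at the reference concentration should be 100 in every group.
chk <- aggregate(NormViability ~ CTSC + Time_h + ExpType,
                 data = subset(df.bis, Concentration_nM == 0.01),
                 FUN = mean)
print(chk)
```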

Related

R apply multiple functions when large number of categories/types are present using case_when (R vectorization)

Suppose I have a dataset of the following form:
City=c(1,2,2,1)
Business=c(2,1,1,2)
ExpectedRevenue=c(35,20,15,19)
zz=data.frame(City,Business,ExpectedRevenue)
zz_new=do.call("rbind", replicate(zz, n=30, simplify = FALSE))
My actual dataset contains about 200K rows. Furthermore, it contains information for over 100 cities.
Suppose, for each city (which I also call "Type"), I have the following functions which need to be applied:
#Writing the custom functions for the categories here
Type1=function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
return(BusinessMax)
}
Type2=function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)-100*rnorm(1)
return(BusinessMax)
}
Once again, the above two functions are extremely simple ones that I use for illustration. The idea here is that for each City (or "Type") I need to run a different function for each row in my dataset. In the above two functions, I used rnorm in order to check that we are drawing different values for each row.
Now for the entire dataset, I want to first divide the observations by City (or "Type"). I can do this using (zz_new[["City"]]==1) [also see below]. I then want to run the respective function for each class. However, when I run the code below, I get -Inf.
Can someone help me understand why this is happening?
For the example data, I would expect to obtain 20 plus 10 times some random value (for Type =1) and 35 minus 100 times some random value (for Type=2). The values should also be different for each row since I am drawing them from a random normal distribution.
library(dplyr) #I use dplyr here
zz_new[,"AdjustedRevenue"] = case_when(
zz_new[["City"]]==1~Type1(full_data=zz_new,observation=zz_new[,]),
zz_new[["City"]]==2~Type2(full_data=zz_new,observation=zz_new[,])
)
Thanks a lot in advance.
Let's take a look at your code.
I would rewrite your code
library(dplyr)
zz_new[,"AdjustedRevenue"] = case_when(
zz_new[["City"]]==1~Type1(full_data=zz_new,observation=zz_new[,]),
zz_new[["City"]]==2~Type2(full_data=zz_new,observation=zz_new[,])
)
to
zz_new %>%
  mutate(AdjustedRevenue = case_when(City == 1 ~ Type1(zz_new,zz_new),
                                     City == 2 ~ Type2(zz_new,zz_new)))
since you are using dplyr but don't use the powerful tools provided by this package.
Besides the usage of mutate one key change is that I replaced zz_new[,] with zz_new. Now we see that both arguments of your Type-functions are the same dataframe.
Next step: Take a look at your function
Type1 <- function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
return(BusinessMax)
}
which is called by Type1(zz_new,zz_new). So the definition of NewSet gives us
NewSet=full_data[which(!full_data$City==observation$City),]
# replace the arguments
NewSet <- zz_new[which(!zz_new$City==zz_new$City),]
Thus NewSet is always a dataframe with zero rows. Applying max to an empty column of a data.frame yields -Inf.
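The reasoning above can be checked on the small example data from the question. The second half is only a sketch of one possible fix, not the asker's method: it assumes that the intended value for each row is the maximum ExpectedRevenue over all other cities, plus city-specific noise. The names max_other and noise are hypothetical. Note also that rnorm(1) in the original functions draws a single value, which is then recycled across rows, so this sketch draws one value per row instead.

```r
City <- c(1, 2, 2, 1)
Business <- c(2, 1, 1, 2)
ExpectedRevenue <- c(35, 20, 15, 19)
zz <- data.frame(City, Business, ExpectedRevenue)

# Comparing a column with itself is TRUE everywhere, so the negation
# selects no rows at all.
empty <- zz[which(!zz$City == zz$City), ]
nrow(empty)                                   # 0
suppressWarnings(max(empty$ExpectedRevenue))  # -Inf (max of an empty vector)

# Sketch of a fix: for each row, take the maximum revenue among rows
# that belong to a different city.
max_other <- sapply(zz$City, function(ct) {
  max(zz$ExpectedRevenue[zz$City != ct])
})
max_other  # 20 35 35 20

# One random draw per row, scaled differently for each city type.
set.seed(123)
noise <- rnorm(nrow(zz))
zz$AdjustedRevenue <- ifelse(zz$City == 1,
                             max_other + 10 * noise,
                             max_other - 100 * noise)
zz
```

With more than 100 city types, a lookup of per-city functions or parameters would probably scale better than one ifelse or case_when branch per city.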

How to decile bin continuous variables?

I am trying to decile my data into equal bins and summarise it to see if there are any existing patterns with respect to the Dependent Variable. While summarising the data, I also want to see the lower bound and the upper bound of a variable for each decile.
I have written the below code in R-
telecom_final_Analyse<-read.csv("sampletelecomfinal.csv")
col_name_final<-colnames(telecom_final_Analyse)
Variable_profile<-vector("list",79) #I have 79 variables
names(Variable_profile)<-col_name_final
for (j in 1:79) {
if(class(telecom_final_Analyse[,col_name_final[j]])=="numeric" || class(telecom_final_Analyse[,col_name_final[j]])=="integer"){
telecom_final_Analyse%>%mutate(dec=ntile(telecom_final_Analyse[,col_name_final[j]],10))->telecom_final_Analyse
z<-as.name(col_name_final[j])
telecom_final_Analyse%>%group_by(dec)%>%summarise(n=sum(churn),N=n(),churn_percentage=n/N,greaterthan = min(z,na.rm=TRUE),lessthan=max(z,na.rm=TRUE))->Variable_profile[[col_name_final[j]]]
}
else{
x<-as.name(col_name_final[j])
telecom_final_Analyse%>%group_by_(x)%>%summarise(n=sum(churn),N=n(),churn_percentage=n/N)->Variable_profile[[col_name_final[j]]]
}
}
I am getting the following error - Error in min(z, na.rm = TRUE) : invalid 'type' (symbol) of argument
The following is the code I used for one variable to get the desired output. I want to get the same output for all integer/numeric variables in the dataset:
telecom_final_Analyse%>%mutate(dec=ntile(telecom_final_Analyse$eqpdays ,10))->telecom_final_Analyse
telecom_final_Analyse%>%group_by(dec)%>%summarise(n=sum(churn),N=n(),churn_percentage=n/N,greaterthan=min(eqpdays,na.rm=TRUE),lessthan=max(eqpdays,na.rm=TRUE))
I am able to do it manually for 1 variable, this is the output I got. The same way I want for my other continuous variables as well
I've not run this (no reprex), but you can extend your code for the single variable with mutate_if(is.numeric,{a function},{some parameters})
See: https://dplyr.tidyverse.org/reference/mutate_all.html
So try...
telecom_final_Analyse%>%mutate_if(is.numeric, ntile, 10)
Note that this will mutate the existing columns. If you want to keep the old ones and create new ones, you can wrap multiple mutate functions in "list(first_function, second_function)", and then the output data set will be wider than before. It's all there in the online help.
Hope this works for you
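The error in the question arises because z is a symbol (created by as.name), so min() receives the symbol itself rather than the column. One possible way to get the per-variable decile summary, sketched here and not run against the asker's data, is to refer to the column by its name as a string through dplyr's .data pronoun. The toy data frame, the months column, the n_churn column name and the helper profile_by_decile are hypothetical; only eqpdays and churn come from the question.

```r
library(dplyr)

# Hypothetical stand-in for the telecom data.
set.seed(42)
telecom <- data.frame(
  eqpdays = sample(1:1000, 200, replace = TRUE),
  months  = sample(6:60, 200, replace = TRUE),
  churn   = rbinom(200, 1, 0.3)
)

# Summarise churn by decile of one variable, given its name as a string.
profile_by_decile <- function(data, var) {
  data %>%
    mutate(dec = ntile(.data[[var]], 10)) %>%
    group_by(dec) %>%
    summarise(
      n_churn = sum(churn),
      N = n(),
      churn_percentage = n_churn / N,
      greaterthan = min(.data[[var]], na.rm = TRUE),
      lessthan = max(.data[[var]], na.rm = TRUE)
    )
}

# Apply the helper to every numeric column except the dependent variable.
numeric_vars <- setdiff(names(telecom)[sapply(telecom, is.numeric)], "churn")
Variable_profile <- lapply(setNames(numeric_vars, numeric_vars),
                           function(v) profile_by_decile(telecom, v))

Variable_profile$eqpdays
```

Unlike mutate_if, this leaves the original columns untouched and returns one summary table per variable.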

R - Assign the mean of a column sub-sector to each row of that sub-sector

I am trying to create a column which has the mean of a variable according to subsectors of my data set. In this case, the mean is the crime rate of each state calculated from county observations, and I then want to assign this number to each county according to the state it is located in. Here is the code I wrote.
Create the new column
Data.Final$state_mean <- 0
Then calculate and assign the mean.
for (j in range[1:3136])
{
state <- Data.Final[j, "state"]
Data.Final[j, "state_mean"] <- mean(Data.Final$violent_crime_2009-2014,
which(Data.Final[, "state"] == state))
}
Here is the error I get:
Error in range[1:3137] : object of type 'builtin' is not subsettable
It would be very much appreciated if you could take a few minutes to help a beginner out.
You've got a few problems:
range[1:3136] isn't valid syntax. range(1:3136) is valid syntax, but the range() function just returns the minimum and maximum. You don't need anything more than 1:3136, just use
for (j in 1:3136) instead.
Because of the dash, violent_crime_2009-2014 isn't a standard column name. You'll need to use it in backticks, Data.Final$`violent_crime_2009-2014`, or in quotes with [: Data.Final[["violent_crime_2009-2014"]] or Data.Final[, "violent_crime_2009-2014"]
Also, your code is very inefficient: you re-calculate the mean on every single iteration. Try having a look at the Mean by Group R-FAQ. There are many faster and easier methods to get grouped means.
Without using extra packages, you could do
Data.Final$state_mean = ave(x = Data.Final[["violent_crime_2009-2014"]],
                            Data.Final$state,
                            FUN = mean)
For friendlier syntax and greater efficiency, the data.table and dplyr packages are popular. You can see examples using them at the link above.
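As a small self-contained illustration of the ave() approach, here is a sketch on made-up data. The state codes and values are hypothetical; only the column names come from the question.

```r
# Toy data: two states with two and three counties respectively.
Data.Final <- data.frame(state = c("AL", "AL", "AK", "AK", "AK"))
Data.Final[["violent_crime_2009-2014"]] <- c(10, 20, 3, 6, 9)

# ave() returns a vector as long as its input, holding the group mean
# for each row, so it can be assigned directly as a new column.
Data.Final$state_mean <- ave(x = Data.Final[["violent_crime_2009-2014"]],
                             Data.Final$state,
                             FUN = mean)

Data.Final$state_mean  # 15 15 6 6 6
```

If the crime column contains missing values, passing FUN = function(v) mean(v, na.rm = TRUE) should ignore them within each state.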
Here is one of many ways this can be done (I'm sure someone will post a tidyverse answer soon if not before I manage to post):
# Data for my example:
data(InsectSprays)
# Note I have a response column and a column I could subset on
str(InsectSprays)
# Take the averages with the by var:
mn <- with(InsectSprays,aggregate(x=list(mean=count),by=list(spray=spray),FUN=mean))
# Map the means back to your data using the by var as the key to map on:
InsectSprays <- merge(InsectSprays,mn,by="spray",all=TRUE)
Since you mentioned you're a beginner, I'll just mention that whenever you can, avoid looping in R. Vectorize your operations when you can. The nice thing about using aggregate, and merge, is that you don't have to worry about errors in your mapping because you get an index shift while looping and something weird happens.
Cheers!

Pairs in R - Re-order variables

I am trying to make a scatter-plot matrix from a data frame (available here: http://statweb.stanford.edu/~tibs/ElemStatLearn/). However, the order of the variables is not the one I want, and I would like to ignore the variable train.
Dataframe order:
lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45, lpsa,train
The order I wish:
lpsa, lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45
For the moment, here is my code:
prostate1 <- read.table("C:/Users/.../Desktop/prostate.data")
prostate=as.data.frame.matrix(prostate1)
pairs(prostate, col="purple")
I tried to add the arguments horInd and verInd, but I get the following warnings:
1: "horInd" is not a graphical parameter
2: "verInd" is not a graphical parameter
If anyone could help me, it would really be appreciated.
Try this:
prostate1 <- read.table("C:/Users/.../Desktop/prostate.data")
prostate = as.matrix(prostate1)
prostate.reordered = prostate[, c("lpsa", "lcavol", "lweight", "age", "lbph", "svi", "lcp", "gleason", "pgg45")]
pairs(prostate.reordered, col="purple")
The idea is to select the columns you want, in the order you want, using the column names for selection.
Of course, it would probably be even more efficient not to convert everything from the data frame into a matrix, but only the required columns...
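The column selection also works directly on the data frame, without any matrix conversion, since pairs() accepts a data frame. A minimal sketch follows; the random data is a hypothetical stand-in for the prostate file, and only the column names come from the question.

```r
# Stand-in data with the same column names as the prostate data set.
set.seed(1)
cols <- c("lcavol", "lweight", "age", "lbph", "svi",
          "lcp", "gleason", "pgg45", "lpsa", "train")
prostate <- as.data.frame(matrix(rnorm(20 * length(cols)),
                                 ncol = length(cols),
                                 dimnames = list(NULL, cols)))

# Put lpsa first, keep the remaining columns in their original order,
# and drop train.
wanted <- c("lpsa", setdiff(names(prostate), c("lpsa", "train")))
prostate.reordered <- prostate[, wanted]

pairs(prostate.reordered, col = "purple")
```

Building the order with setdiff() avoids typing every column name, at the cost of being less explicit than the full list.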

Bandwidth selection using NP package

New to R and having a problem with a very simple task! I have read a few columns of .csv data into R; the variables take values in the natural numbers plus zero and have missing values. After trying to use the non-parametric package, I have two problems. First, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error "number of regression data and response data do not match". Why do I get this, given that I have the same number of elements in each vector?
Second, I would like to call the data ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic, it is just a vector with natural numbers and NA's?
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If it is of following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see 2.1.1 link and ?factor)
EDIT2: So the problem was the way the data was subset. Note the difference between the various ways of subsetting: data$x and data[,i] (where i is the column number of column x) give you vectors, while data[c("x")] and data[i] give you a data frame. Functions expect vectors, unless they call for data = (your data). In that case they work with column names.
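The subsetting difference, and one way to keep x and y the same length when both contain NA's, can be sketched in base R. The data frame below is made up, and the npregbw call is left as a comment because it is taken from the question and was not run here.

```r
# Hypothetical data with missing values in different rows of x and y.
data <- data.frame(x = c(3, 4, NA, 2, 1),
                   y = c(1, NA, 2, 5, 7))

# $ and [, i] return plain vectors; single-bracket name or index
# subsetting returns a one-column data frame.
is.atomic(data$x)         # TRUE
is.atomic(data[, 1])      # TRUE
is.data.frame(data["x"])  # TRUE
is.data.frame(data[1])    # TRUE

# Drop every row where either variable is missing, so that both
# vectors keep the same length.
keep <- complete.cases(data[, c("x", "y")])
x <- data$x[keep]
y <- data$y[keep]

# ordered() works on the plain vector and gives an ordered factor.
x_ord <- ordered(x)
x_ord

# From the question (requires the np package, not run here):
# bw <- npregbw(ydat = y, xdat = x_ord)
```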

Resources