In R, fix an argument for use with the lapply function

This post contains two questions. The first is somewhat related to the second.
First, suppose that I want to define a function that receives two arguments, a data frame and a variable (column), and computes some counts or statistics. As a first step, I have to determine the variable's position. For example, suppose that the first two rows of my df are
> df
person age rent
1 23 1000
2 35 1.500
and my function is like this
> myfun<- function(df, var)
{
# determining the variable
ind<- which(names(df) %in% var )
# selecting the variable
v <- df[,ind]
# rest of function
....
}
I think there may be an easier way. Is there some way to determine v directly?
Second question: I have a large list of data frames (samples from one population). All data frames have the same variables, and one of these variables is the rent. I would like to calculate the mean of the rent variable for each sample using the lapply function. For one sample, I can use the following code
> mean(sample$rent , na.rm = T)
All I want is to do something like this
> apply(list, mean( , variablefix = rent))
One option is to create a new mean function with the rent argument fixed, or with only one argument, and apply the lapply function:
>mean_rent <- function(df){...}
>lapply(df, mean_rent)
But I want a way to use the lapply function directly, in only one line.
Any ideas?

Question One: you can index data frames (and vectors, matrices, etc.) by name, using either a character string or a variable containing the name, so you just have to do:
myfun<- function(df, var) {
# select the column
v <- df[,var]
# rest of function
}
However, it is more common to define the function on a vector and then call it with myfun(df[,var]).
Question Two: Instead of assigning the new function to a variable, you can pass an anonymous function directly, i.e.
lapply(list_of_dfs, function(df){ mean( df$rent ) })
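Both answers can be sketched together on made-up data. The list samples, its contents, and the helper col_mean are hypothetical names used only for illustration; the point is that a column can be selected by a name held in a variable, and that lapply forwards extra arguments such as na.rm to the function it applies.

```r
# Hypothetical sample data: two small data frames with the same columns
samples <- list(
  s1 = data.frame(person = 1:3, age = c(23, 35, 41), rent = c(1000, 1500, NA)),
  s2 = data.frame(person = 1:2, age = c(30, 52), rent = c(800, 1200))
)

# Question One: select a column by a name stored in a variable
col_mean <- function(df, var) mean(df[[var]], na.rm = TRUE)
col_mean(samples$s1, "rent")

# Question Two: one line with an anonymous function
rent_means <- lapply(samples, function(df) mean(df$rent, na.rm = TRUE))

# Alternative: extract the column first, then let lapply pass na.rm on to mean
rent_means2 <- lapply(lapply(samples, `[[`, "rent"), mean, na.rm = TRUE)
```

Using sapply instead of lapply in the last two lines would return a named numeric vector rather than a list.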

Related

Can I use names from a list/dataframe, to be recognised as list/dataframe name within R script for a loop function?

I'd like to use a loop function to recognise names from a list/dataframe as an actual list/dataframe name in the R script (for data analysis or manipulation).
I will create some pseudo data to try to help show what I'm trying to do.
Here is code to create 3 lists
height <- sample(120:200,200,TRUE)
weight <- sample(40:140,200,TRUE)
income <- sample(20000:200000,200, TRUE)
This code creates a list containing those list names
vars <- c("height","weight","income")
The code below doesn't run, but I would like to use a loop code like this, where it takes the name from the list position and uses it in script as a list name. Thus it's using the name to calculate the mean, and it's using the name to create a new object.
for (i in 1:3)
{mean_**vars[i]** = mean(**vars[i]**) }
The result should be 3 objects "mean_height", "mean_weight", "mean_income" which contain the mean scores
I'm not so much interested in the calculating of mean scores, I'm interested in the ability to use the names from the list. I want to be able to expand this to other analyses that are repetitive.
Apologies if above hasn't been articulated too well, I'm quite new to R, so I hope it makes some sense.
Any help will be most useful, or if you can point me in the right direction that would be great.
This may be what you're looking for, where lapply applies the mean function to each of the items in vars (a list of vectors). Note that you want to build the list from the objects themselves rather than from their names as strings.
height <- sample(120:200,200,TRUE)
weight <- sample(40:140,200,TRUE)
income <- sample(20000:200000,200, TRUE)
vars <- list(height, weight, income)
lapply(vars, function(x) mean(x))
Then create an output dataframe using that:
df1 <- data.frame(lapply(vars, function(x) mean(x)))
colnames(df1) <- c("mean_height", "mean_weight", "mean_income")
df1
From your additional comment, using vars <- list(height, weight, income) should allow you to do this:
mean(height)
mean(vars[[1]])
[1] 160.48
[1] 160.48
This should work to output dynamically named variables:
vars <- list(height = height, weight = weight, income = income)
for (i in names(vars)){
assign(paste("mean_", i, sep = ""), mean(vars[[i]]))
}
mean_height
mean_weight
mean_income
[1] 163.28
[1] 90.465
[1] 109686.5
However, I'd suggest not programming that way since it can cause issues and it's not very scalable. E.g., you could end up with 10000 variables.
I guess what you want is something like the code below, which creates three objects in your global environment holding the means of height, weight, and income from the list lst, i.e.,
list2env(setNames(Map(mean,lst),paste0("mean_",names(lst))),envir = .GlobalEnv)
DATA
height <- sample(120:200,200,TRUE)
weight <- sample(40:140,200,TRUE)
income <- sample(20000:200000,200, TRUE)
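# the list must be named, because names(lst) is used to build the new object names
lst <- list(height = height, weight = weight, income = income)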
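A minimal sketch of the same idea that writes into a dedicated environment instead of the global one. The name means_env is hypothetical, and the sketch assumes the list is named, since the new object names are built from names(lst).

```r
set.seed(1)
lst <- list(
  height = sample(120:200, 200, TRUE),
  weight = sample(40:140, 200, TRUE),
  income = sample(20000:200000, 200, TRUE)
)

# A separate environment keeps the generated objects out of the global workspace
means_env <- new.env()
list2env(setNames(Map(mean, lst), paste0("mean_", names(lst))), envir = means_env)

ls(means_env)                          # names of the objects that were created
get("mean_height", envir = means_env)  # retrieve one of them by name
```

Keeping the results in one environment (or simply in a list) makes them easy to iterate over and to remove later.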
A more common approach in R is to use lists of data, rather than separate variables.
Like this:
# make this reproducible
set.seed(123)
# make an empty list for the data
raw_data <- list()
# then fill the list. The data can be of varying length in a list.
raw_data$height <- sample(120:200,200,TRUE)
raw_data$weight <- sample(40:140,200,TRUE)
raw_data$income <- sample(20000:200000,200, TRUE)
Then looping becomes a one-liner and your names are preserved, using the *apply family of functions:
mean_data <- lapply(raw_data, mean)
# print that
mean_data
$height
[1] 159.06
$weight
[1] 90.83
$income
[1] 114000.7
Note what we didn't have to do:
know the number of variables.
have variables all the same length.
build a loop and keep track of names.
All handled automagically. Nice.
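If a named numeric vector is more convenient than a list, vapply can be used in the same way. This is a small sketch on made-up data with deliberately different lengths; mean_vec is a hypothetical name.

```r
set.seed(123)
raw_data <- list(
  height = sample(120:200, 200, TRUE),
  weight = sample(40:140, 150, TRUE),       # a different length is fine in a list
  income = sample(20000:200000, 50, TRUE)
)

# vapply checks that every result is a single numeric value
mean_vec <- vapply(raw_data, mean, numeric(1))

mean_vec[["height"]]  # look up one result by name
```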

In R, how do I concatenate multiple columns within a list of lists

I have a function in R which returns a list with N columns each of M rows of multiple types - date, numeric and char. I am using sapply to create multiple copies of these lists, which then end up in a top level list. I would like to concatenate the underlying lists together to produce a single list of N columns and M * number of list rows.
I've been trying different combinations of do.call, sapply, rbind, c, etc, but I think I'm missing something pretty fundamental. Below is a simple script that mimics the problem and shows the desired outcome. I've used 3 variables here, but the number of variables is arbitrary.
# Set up test function
testfun <- function(varName)
{
currDate = seq(as.Date('2018-12-31'), as.Date('2019-01-10'), "days")
t1 = runif(11)
t2 = runif(11)
groupNum = c(rep(1,5), rep(2,6))
varName = rep(varName, 11)
dataout= data.frame(currDate, t1, t2, groupNum, varName)
}
# create 3 test variables and run the data
varNames = c('test1', 'test2', 'test3')
tmp = sapply(varNames, testfun)
# I would like it to look like the following, but for any given number of variables
desiredAnswer <- rbind(as.data.frame(tmp[,1]),as.data.frame(tmp[,2]),as.data.frame(tmp[,3]))
The final answer will later be used to create a data table and feed ggplot with the varName as facets.
I'm happy to use any method to get the desired results, there's no reason the function needs to produce a list instead of say a data.frame. I'm certain I'm doing something dumb, any help appreciated.
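One possible approach, offered as a sketch rather than a definitive answer: since sapply tries to simplify its result into a matrix, switching to lapply keeps each result as a data frame, and do.call(rbind, ...) then stacks any number of them. The function below is a lightly restated version of testfun from the question; pieces and combined are hypothetical names.

```r
# Same idea as testfun above, returning the data frame explicitly
testfun <- function(varName) {
  currDate <- seq(as.Date("2018-12-31"), as.Date("2019-01-10"), "days")
  data.frame(
    currDate,
    t1 = runif(11),
    t2 = runif(11),
    groupNum = c(rep(1, 5), rep(2, 6)),
    varName = rep(varName, 11)
  )
}

varNames <- c("test1", "test2", "test3")

# lapply returns a plain list of data frames (no simplification to a matrix)
pieces <- lapply(varNames, testfun)

# rbind is called once with all pieces as arguments
combined <- do.call(rbind, pieces)
dim(combined)
```

The combined data frame keeps varName as a column, so it can be passed on for faceting.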

For loop within function to generate subsets of dataframes

I am attempting to write a function which accepts a dataframe, and then generates subset dataframes within a for() loop. As a first step, I tried the following:
dfcreator<-function(X,Z){
for(i in 1:Z){
df<-subset(X,Stratum==Z) #build dataframe from observations where index=value
assign(paste0("pop", Z),df) #name dataframe
}
}
This, however, does not save anything into memory, and when I try to specify a return() I am still not getting what I need. For reference, I am using the Sweden data set (which is native to RStudio).
EDIT Per Melissa's Advice!
I tried to implement the following code:
sampler <- function(df, n,...) {
return(df[sample(nrow(df),n),])
}
sample_list<-map2(data_list, stratumSizeVec, sampler)
where stratumSizeVec is a 1x7 data frame and data_list is a list of seven data frames. When I do this, I get seven samples in sample_list, all of the same size, equal to stratumSizeVec[1]. Why is map2 not passing the inputs in the following manner?
sampler(data_list$pop0,stratumSizeVec[1])
sampler(data_list$pop1,stratumSizeVec[2])
...
sampler(data_list$pop6,stratumSizeVec[7])
Furthermore, is there a way to "nest" the map2 function within lapply?
I'm confused as to why you never actually utilize i anywhere in your loop. It looks like you're creating Z copies of the data set where Stratum == Z - is that what you are after?
As for your code, I would use the following:
data_list <- split(df, df$Stratum)
names(data_list) <- paste0("pop", sort(unique(df$Stratum)))
This doesn't define a function; it calls the base R function split, which splits up a data frame based on some vector (here, df$Stratum). The result is a list of data frames, each with a single value of Stratum.
Random sampling from rows
sampled_data <- lapply(data_list, function(df, n,...) { # n is the number of rows to take, the dots let you send other information to the `sample` function.
df[sample(nrow(df), n, ...),]
},
n = 5,
replace = FALSE # this is default, but the purpose of using the ... notation is to allow this (and any other options in the `sample` function) to be changed.
)
You can also define the function separately:
sampler <- function(df, n,...) {
df[sample(nrow(df), n, ...),]
}
sampled_data <- lapply(data_list, sampler, n = 10) # replace 10 with however many samples you want.
purrr::map2 method
As defined, the sampler function does not need to be modified: each element of the first list (data_list) is passed as the first argument of sampler, and the corresponding element of the second "list" (sampleSizeVec) is passed as the second argument.
library(purrr)
map2(data_list, sampleSizeVec, sampler, replace = FALSE) # replace = FALSE not needed, there as an example only.
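The pairing of each data frame with its own sample size can also be sketched in base R with Map, which walks two inputs in parallel much like map2. The data frame, the stratum sizes, and the vector sizes below are made up for illustration; sizes is assumed to be a plain numeric vector with one entry per stratum.

```r
set.seed(1)

# Hypothetical data: three strata of different sizes
df <- data.frame(Stratum = rep(0:2, times = c(10, 20, 30)), value = rnorm(60))

data_list <- split(df, df$Stratum)
names(data_list) <- paste0("pop", names(data_list))

sampler <- function(df, n, ...) {
  df[sample(nrow(df), n, ...), ]
}

# One sample size per stratum, as a plain vector
sizes <- c(2, 4, 6)

# Map pairs data_list[[1]] with sizes[1], data_list[[2]] with sizes[2], and so on
sample_list <- Map(sampler, data_list, sizes)
sapply(sample_list, nrow)
```

If the sizes are stored in a one-row data frame, converting them first with unlist() gives a vector of this form.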

How to replace multiple nested for loops with apply family functions in R?

I have four main variables in my dataset (dat).
SubjectID
Group (can be Easy1, Easy2, Hard1, Hard2)
Object (x, y, z, w)
Reaction time
For each combination of variables 1, 2 and 3 I want to change the reaction time, so that all values above the 3rd Quartile + 1.5IQR are set to the value of 3rd Quartile + 1.5 IQR.
TUK <- function (a,b,c) {
....
}
Basically, the for loop logic would be:
for (i in dat$SubjectID):
for (j in dat$Group):
for (k in dat$Object) :
TUK(i,j,k)
How can I do this with apply function family?
Thank you!
Adding reproducible example:
SubjectID <- c(3772113,3772468)
Group <- c("Easy","Hard")
Object <- c("A","B")
dat <- data.frame(expand.grid(SubjectID,Group,Object))
dat$RT <- rnorm(8,1500,700)
colnames(dat) <- c("SubjectID","Group","Object","RT")
TUK <- function (SUBJ,GROUP,OBJECT){
p <- dat[dat$SubjectID==SUBJ & dat$Group== GROUP & dat$Object==OBJECT, "RT"]
p[p$RT< 1000 | p$RT> 2000,] <- NA
dat[dat$SubjectID==SUBJ & dat$Group== GROUP & dat$Object==OBJECT, "RT"]<<- p
}
A big part of your problem is that your TUK function is terrible. Here are some reasons why:
Problem: it depends on having a data frame named dat in the global environment. Change the name of your data and it breaks.
Solution: you should pass in all arguments needed. In this case, dat should be an argument.
Problem: Global assignment <<- should be avoided. There are certain advanced cases where it is necessary (e.g., sometimes in Shiny apps), but in general it makes a function behave in very un-R-like ways.
Solution: Simply return() a value and assign it like any other normal R function.
Problem: It's over-complicated. You're passing in SUBJ, GROUP, and OBJECT but only using them to subset; you're trying to do inside your function the "grouping" bit that dplyr or data.table or stats::ave excels at. It's as if you're trying to build your function in a way so that it could only possibly be used embedded in this particular for loop.
Solution: Functions should be simple building blocks. Make this a function of just a single vector. It will be much cleaner and easier to debug. When it works on a single vector, use dplyr or data.table or ave (or even a for loop) to do the split-apply-combining of it. This also makes your function more generally useful instead of being cemented to this one particular case.
With the above in mind, here's an attempted re-write:
TUK2 <- function (RT){
RT[RT < 1000 | RT > 2000] <- NA
return(RT)
}
See how much simpler! Now if we want to apply this function to each of the GROUP:SUBJ:OBJECT groupings in your data, and replace the RT column with the result, we do this with dplyr:
library(dplyr)
group_by(dat, Group, SubjectID, Object) %>%
mutate(new_RT = TUK2(RT))
dplyr does the grouping of data, the splitting of data, applies the simple function to each piece, and combines it all back together for us.
Now, in your question, you said
For each combination of variables 1, 2 and 3 I want to change the reaction time, so that all values above the 3rd Quartile + 1.5IQR are set to the value of 3rd Quartile + 1.5 IQR.
This doesn't sound much like what your function does. Based only on this description, I would code this as
group_by(dat, Group, SubjectID, Object) %>%
mutate(new_RT = pmin(RT, quantile(RT, probs = 0.75) + 1.5 * IQR(RT)))
pmin is for parallel minimum, it's a vectorized way to take the smaller of two vectors. Try, e.g., pmin(1:10, 7), to see what it does.
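A small sketch of the capping step on a single made-up vector (rt, upper, and capped are hypothetical names), showing how pmin replaces only the values above the upper fence:

```r
# Hypothetical reaction times with one extreme value
rt <- c(900, 1000, 1100, 1200, 5000)

# Upper fence: 3rd quartile + 1.5 * IQR (names = FALSE drops the "75%" label)
upper <- quantile(rt, probs = 0.75, names = FALSE) + 1.5 * IQR(rt)

# Values above the fence are replaced by the fence; the rest are unchanged
capped <- pmin(rt, upper)
capped

pmin(1:10, 7)  # every element larger than 7 becomes 7
```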
In both examples, the dplyr data frame won't be saved, of course, unless you re-assign it with dat <- group_by(dat, ...) etc. This is the functional programming way of doing things - no global assignment.
One additional note: with the re-written function you could still use loops instead of dplyr. I don't know why you would - surely the dplyr syntax is nicer - but I just want to illustrate that the small building-block function is generally useful, it's not "baking in" dplyr in the way that your original function was "baking in" a particular for loop.
for (sub in unique(dat$SubjectID)) {
for (obj in unique(dat$Object)) {
for (grp in unique(dat$Group)) {
dat[dat$SubjectID == sub &
dat$Object == obj &
dat$Group == grp, "RT"] <-
TUK2(
dat[dat$SubjectID == sub &
dat$Object == obj &
dat$Group == grp, "RT"]
)
}
}
}
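The same building-block function can also be applied per group with ave, the base R option mentioned earlier. This is a minimal sketch on made-up data with five observations per combination; the column name new_RT follows the dplyr example, and everything else about the data is an assumption.

```r
set.seed(42)

# Hypothetical data: every SubjectID / Group / Object combination, 5 rows each
dat <- expand.grid(
  SubjectID = c(3772113, 3772468),
  Group = c("Easy", "Hard"),
  Object = c("A", "B")
)
dat <- dat[rep(seq_len(nrow(dat)), each = 5), ]
dat$RT <- rnorm(nrow(dat), 1500, 700)

TUK2 <- function(RT) {
  RT[RT < 1000 | RT > 2000] <- NA
  RT
}

# ave splits RT by the grouping variables, applies FUN to each piece,
# and returns the results in the original row order
dat$new_RT <- ave(dat$RT, dat$SubjectID, dat$Group, dat$Object, FUN = TUK2)
head(dat)
```

Because TUK2 works element by element, grouping does not change its result here; grouping matters for functions such as the quantile-based cap, where each group gets its own threshold.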

ddply: Error in do.call("c", res) : variable names are limited to 10000 bytes

I have a function that labels strings in a dataset as spam. I use this function successfully by calling:
dtm_english.label <- getSpamLabel(comment$rawMessage, dictionary_english, 2) # 2 is the threshold level
But then when I call
dtm_english.label <- ddply(comment, .(rawMessage), getSpamLabel, dictionary_english, 2, .progress = "text")
after ddply completes the task without any output, I get
Error in do.call("c", res) : variable names are limited to 10000 bytes
I can post the function if relevant
I am not sure what you are attempting to do; next time, please describe exactly what you are trying to achieve. To me it looks like you are trying to apply a function to one column of your data.frame. ddply is meant to be used to apply a function to subsets of the data. It is described as "Split data frame, apply function, and return results in a data frame".
If what you want to do is split your column into sections before applying the function, you would need, for example, a factor in your data frame to tag the groups.
You would pass the "group" factor in the .variables argument to ddply (not the variable to which you would like to apply the function), set .fun = summarize, and then give your function call.
dtm_english <- ddply(comment, .(group), summarize,
label=getSpamLabel(rawMessage, dictionary_english, 2),
.progress = "text")
This will give as output a new dataframe with a row for each level of group.

Resources