lapply() changing global variable in R

Using R, I want to save values from each data subset while running lapply().
Below is what I have tried so far:
list_C <- list()
list_D <- list()
n <- 1
data_partition <- split(data, with(data, paste(A, B, sep=":")))
final_result <- lapply(data_partition,
function(dat) {
if(... condition ...) {
<Some R codes to run>
list_C[[n]] <- dat$C
list_D[[n]] <- dat$D
n <- n + 1
}
})
However, after running the code, 'n' is still '1' and nothing has changed. How can I update 'n' inside lapply() so that 'list_C' and 'list_D' are filled correctly?
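Assignments made with <- inside the function passed to lapply() create local variables, so the outer n, list_C and list_D are never touched. A minimal sketch of one way around this is to return the values from the function and split them afterwards. The example data and the condition nrow(dat) > 1 are assumptions made up for illustration; they are not from the question.

```r
# Hypothetical example data; the column names A, B, C, D mirror the question
data <- data.frame(
  A = c("a", "a", "b", "b", "c"),
  B = c("x", "x", "y", "y", "z"),
  C = 1:5,
  D = 6:10
)
data_partition <- split(data, with(data, paste(A, B, sep = ":")))

# Return the pieces to keep instead of assigning to outer variables
results <- lapply(data_partition, function(dat) {
  if (nrow(dat) > 1) {            # placeholder for the real condition
    list(C = dat$C, D = dat$D)
  } else {
    NULL                          # nothing to keep for this subset
  }
})

# Drop the subsets that did not meet the condition, then split the results
results <- Filter(Negate(is.null), results)
list_C <- lapply(results, `[[`, "C")
list_D <- lapply(results, `[[`, "D")
```

The superassignment operator <<- would also modify the outer variables, but returning values keeps the function free of side effects and makes the counter n unnecessary.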

Related

undefined columns selected and cannot xtfrm data frame error

I am trying to write code that checks for outliers based on the IQR and changes those values to NA. So I wrote this:
dt <- rnorm(200)
dg <- rnorm(200)
dh <- rnorm(200)
l <- c(1,3) #List of relevant columns
df <- data.frame(dt,dg,dh)
To check if the column contains any outliers and change their value to NA:
vector.is.empty <- function(x) return(length(x) ==0)
#Checks for empty values in vector and returns booleans.
for (i in 1:length(l)){
IDX <- l[i]
BP <- boxplot.stats(df[IDX])
OutIDX <- which(df[IDX] %in% BP$out)
if (vector.is.empty(OutIDX)==FALSE){
for (u in 1:length(OutIDX)){
IDX2 <- OutIDX[u]
df[IDX2,IDX] <- NA
}
}
}
When I run this code, I get the error messages named in the title: "undefined columns selected" and "cannot xtfrm data frame".
I've searched online for answers, but I'm not sure why R claims that the column is undefined. Any clues here?
I would do something like this to replace the outliers:
# Set a seed (to make the example reproducible)
set.seed(31415)
# Generate the data.frame
df <- data.frame(dt = rnorm(100), dg = rnorm(100), dh = rnorm(100))
# A list to save the result of boxplot.stats()
l <- list()
for (i in seq_len(ncol(df))){
l[[i]] <- boxplot.stats(df[,i])
# %in% matches every outlier value; == would recycle the shorter vector
df[which(df[,i] %in% l[[i]]$out),i] <- NA
}
# Which values have been replaced?
lapply(l, function(x) x$out)
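The errors in the question most likely come from df[IDX], which returns a one-column data.frame rather than a numeric vector; this is an inference from the code, since the full messages are not shown. A minimal sketch of the same replacement, restricted to selected columns, follows. The helper name replace_outliers and the toy values are hypothetical.

```r
# Hypothetical helper: set the boxplot outliers of a numeric vector to NA
replace_outliers <- function(x) {
  out <- boxplot.stats(x)$out   # values beyond 1.5 * IQR from the hinges
  x[x %in% out] <- NA
  x
}

# Toy data with one obvious outlier in the first column
df <- data.frame(
  dt = c(1, 2, 3, 4, 5, 100),
  dg = c(10, 11, 12, 13, 14, 15)
)

single_bracket_is_df <- is.data.frame(df[1])   # single brackets keep a data.frame
double_bracket_is_num <- is.numeric(df[[1]])   # double brackets extract the vector

# Apply the helper only to the relevant columns
cols <- c("dt", "dg")
df[cols] <- lapply(df[cols], replace_outliers)
```

Because a data.frame is a list of columns, lapply() hands each column to the helper as a plain vector, which is what boxplot.stats() expects.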

Applying a Function to a Data Frame : lapply vs traditional way

I have this data frame in R:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
I also have this function:
some_function <- function(x,y) { return(x+y) }
Basically, I want to create a new column in the data frame based on "some_function". I thought I could do this with the "lapply" function in R:
data_frame$new_column <-lapply(c(data_frame$x, data_frame$y),some_function)
This does not work:
Error in `$<-.data.frame`(`*tmp*`, f, value = list()) :
replacement has 0 rows, data has 8281
I know how to do this in a more "clunky and traditional" way:
data_frame$new_column = x + y
But I would like to know how to do this using "lapply" - in the future, I will have much more complicated and longer functions that will be a pain to write out like I did above. Can someone show me how to do this using "lapply"?
Thank you!
When working within a data.frame you could use apply instead of lapply:
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(x,y) { return(x+y) }
data_frame$new_column <- apply(data_frame, 1, \(x) some_function(x["Var1"], x["Var2"]))
head(data_frame)
To apply a function to rows set MARGIN = 1; to apply a function to columns set MARGIN = 2.
lapply, as the name suggests, is a list-apply. As a data.frame is a list of columns you can use it to compute over columns but within rectangular data, apply is often the easiest.
If some_function is written for that specific purpose, it can be written to accept a single row of the data.frame as in
x <- seq(1, 10,0.1)
y <- seq(1, 10,0.1)
data_frame <- expand.grid(x,y)
head(data_frame)
some_function <- function(row) { return(row[1]+row[2]) }
data_frame$yet_another <- apply(data_frame, 1, some_function)
head(data_frame)
Final comment: functions written for a single pair of values often turn out to be perfectly vectorized. Probably the best way to call some_function is without any function of the apply family, as in
some_function <- function(x,y) { return(x + y) }
data_frame$last_one <- some_function(data_frame$Var1, data_frame$Var2)
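For a function that is not vectorized, a minimal sketch using mapply() (the multi-argument relative of lapply) is shown below, next to the direct vectorized call for comparison. Note that expand.grid() names its columns Var1 and Var2 when given unnamed arguments, so data_frame$x in the question is NULL, which is why the original lapply() call returned an empty list. The column names via_mapply and vectorized are made up for this example.

```r
x <- seq(1, 10, 0.1)
y <- seq(1, 10, 0.1)
data_frame <- expand.grid(x, y)   # columns are named Var1 and Var2

some_function <- function(x, y) x + y

# mapply() calls some_function once per pair (Var1[i], Var2[i])
data_frame$via_mapply <- mapply(some_function, data_frame$Var1, data_frame$Var2)

# Same result without any apply-style function, since + is vectorized
data_frame$vectorized <- some_function(data_frame$Var1, data_frame$Var2)
```

Both columns hold the same values; mapply() is only worth its overhead when the function cannot take whole vectors.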

replacing nested for loop with lapply()

I have seen this example explaining how to replace a nested for loop with the lapply() function. However, I do not fully understand what is happening in the nested for loop.
As I understand it, for every country and every year the for loop creates two new variables called tempX and tempY, but what happens in the last line inside the for loop?
What is the purpose of variable1 and variable2?
# Generate random data:
allCountries <- LETTERS[1:10]
allYears <- 1990:2012
myData <- expand.grid(allCountries, allYears) # create a dataframe with all possible combinations
colnames(myData) <- c("Country", "Year")
myData$variable1 <- rnorm(nrow(myData))
myData$variable2 <- rnorm(nrow(myData))
# Silly function to perform
myFunction <- function(x, y){
x * y - x / y
}
### Doubly-nested loop ###
myData$computedFigure <- NA # Make an "empty" variable in my data.frame
for(ii in allCountries){
for(jj in allYears){
tempX <- myData[myData$Country == ii & myData$Year == jj, c("variable1")]
tempY <- myData[myData$Country == ii & myData$Year == jj, c("variable2")]
# "Save" results into appropriate location in my data.frame
myData[myData$Country == ii & myData$Year == jj, c("computedFigure")] <- myFunction(tempX, tempY)
}
}
### Simple lapply() approach ###
computedFigureList <- lapply(1:nrow(myData), function(x){
tempX <- myData[x, c("variable1")]
tempY <- myData[x, c("variable2")]
# "Save" results into appropriate location in my data.frame
myFunction(tempX, tempY)
})
myData$computedFigure2 <- unlist(computedFigureList)
with(myData, plot(computedFigure, computedFigure2))
In the last line of the loop,
myData[myData$Country == ii & myData$Year == jj, c("computedFigure")] <- myFunction(tempX, tempY)
myFunction is applied and its result is stored in the computedFigure column, in the row for that country and year.
variable1 and variable2 are random values that stand in for real data; they are the x and y passed to myFunction.
The for loops iterate over every combination of country and year.
The two approaches do not return the same kind of object: lapply returns a list containing only the results of myFunction (which is why it is unlisted before being stored), whereas the for loops write directly into the data frame.
Actually, you don't need nested *apply functions: you can use outer + diag to compute computedFigure, which gives the same result as the nested for loop.
myData$computedFigure <- diag(with(myData,outer(variable1,variable2,myFunction)))
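Since myFunction uses only vectorized arithmetic, it can also be called on the two columns directly. The sketch below compares that with the outer + diag version; the seed and the names direct and via_outer are chosen for this example only.

```r
set.seed(1)   # arbitrary seed so the comparison is reproducible

myData <- expand.grid(Country = LETTERS[1:10], Year = 1990:2012)
myData$variable1 <- rnorm(nrow(myData))
myData$variable2 <- rnorm(nrow(myData))

myFunction <- function(x, y) {
  x * y - x / y
}

# Element-wise call: one result per row
direct <- myFunction(myData$variable1, myData$variable2)

# outer() builds the full n x n matrix; diag() keeps only the matching pairs
via_outer <- diag(with(myData, outer(variable1, variable2, myFunction)))
```

The direct call avoids building the n x n matrix, which matters as the number of rows grows.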

The loop in my R function appears to be running twice

I need to add rows to a data frame. I have many files with many rows so I have converted the code to a function. When I go through each element of the code it works fine. When I wrap everything in a function each row from my first loop gets added twice.
My code looks for a string (xx or x). If xx is present, it replaces the xx with the numbers 00-99 and 0-9 (one row for each number). If x is present, it replaces it with the numbers 0-9.
Create DF
a <- c("1.x", "2.xx", "3.1")
b <- c("single", "double", "nothing")
df <- data.frame(a, b, stringsAsFactors = FALSE)
names(df) <- c("code", "desc")
My dataframe
code desc
1 1.x single
2 2.xx double
3 3.1 nothing
My function
newdf <- function(df){
# If I run through my code chunk by chunk it works as I want it.
df$expanded <- 0 # a variable to let me know if the loop was run on the row
emp <- function(){ # This function creates empty vectors for my loop
assign("codes", c(), envir = .GlobalEnv)
assign("desc", c(), envir = .GlobalEnv)
assign("expanded", c(), envir = .GlobalEnv)
}
emp()
# I want to expand xx with numbers 00 - 99 and 0 - 9.
#Note: 2.0 is different than 2.00
# Identifies the rows to be expanded
xd <- grep("xx", df$code)
# I used chr vs. numeric so I wouldn't lose the trailing zero
# Create a vector to loop through
tens <- formatC(c(0:99)); tens <- tens[11:100]
ones <- c("00","01","02","03","04","05","06","07","08","09")
single <- as.character(c(0:9))
exp <- c(single, ones, tens)
# This loop appears to run twice when I run the function: newdf(df)
# Each row is there twice: 2.00, 2.00, 2.01 2.01...
# It runs as I want it to if I just highlight the code.
for (i in xd){
for (n in exp) {
codes <- c(codes, gsub("xx", n, df$code[i])) #expanding the number
desc <- c(desc, df$desc[i]) # repeating the description
expanded <- c(expanded, 1) # assigning 1 to indicated the row has been expanded
}
}
# Binds the df with the new expansion
df <- df[-xd, ]
df <- rbind(as.matrix(df),cbind(codes,desc,expanded))
df <- as.data.frame(df, stringsAsFactors = FALSE)
# Empties the vector to begin another expansion
emp()
xs <- grep("x", df$code) # This is for the single digit expansion
# Expands the single digits. This part of the code works fine inside the function.
for (i in xs){
for (n in 0:9) {
codes <- c(codes, gsub("x", n, df$code[i]))
desc <- c(desc, df$desc[i])
expanded <- c(expanded, 1)
}
}
df <- df[-xs,]
df <- rbind(as.matrix(df), cbind(codes,desc,expanded))
df <- as.data.frame(df, stringsAsFactors = FALSE)
assign("out", df, envir = .GlobalEnv) # This is how I view my dataframe after I run the function.
}
Calling my function
newdf(df)
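One likely reason for the duplicated rows, judging from the code alone: emp() resets codes, desc and expanded in the global environment, but the first loop's codes <- c(codes, ...) creates local copies inside newdf(). The second emp() call does not clear those local copies, so the second loop appends to vectors that still hold the xx expansion, and it is bound in again. A minimal sketch of the expansion described above that keeps everything local is given below; expand_codes, expand_row and the variable names are hypothetical, and unlike the original it keeps expanded rows in the position of their source row.

```r
# Hypothetical rewrite: all state is local, nothing is assigned to .GlobalEnv
expand_codes <- function(df) {
  one_digit <- as.character(0:9)                      # "0" ... "9"
  two_digit <- c(one_digit, sprintf("%02d", 0:99))    # plus "00" ... "99"

  expand_row <- function(code, desc) {
    if (grepl("xx", code, fixed = TRUE)) {
      pattern <- "xx"
      digits <- two_digit
    } else if (grepl("x", code, fixed = TRUE)) {
      pattern <- "x"
      digits <- one_digit
    } else {
      # Nothing to expand: keep the row and flag it with 0
      return(data.frame(code = code, desc = desc, expanded = 0,
                        stringsAsFactors = FALSE))
    }
    # Codes stay character so that "2.0" and "2.00" remain distinct
    new_codes <- vapply(digits,
                        function(d) sub(pattern, d, code, fixed = TRUE),
                        character(1), USE.NAMES = FALSE)
    data.frame(code = new_codes, desc = desc, expanded = 1,
               stringsAsFactors = FALSE)
  }

  out <- do.call(rbind, Map(expand_row, df$code, df$desc))
  rownames(out) <- NULL
  out
}

df <- data.frame(code = c("1.x", "2.xx", "3.1"),
                 desc = c("single", "double", "nothing"),
                 stringsAsFactors = FALSE)
out <- expand_codes(df)
```

Returning the data frame from the function also removes the need for assign("out", ...) to inspect the result.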

Forcing the use of a for loop with group_by and mutate()

I have a list of data frames (generated by permuting the column order of an initial data frame) to which I would like to apply complicated calculations using group_by_at() and mutate(). This works well with a single data frame but fails in a for loop, since mutate requires the name of the data frame, and so do some of my calculations. So I thought, well, let's create a list of different data frames all having the same name and loop over the initial sequence of names. Unfortunately the trick does not work and I get the following message:
Error: object of type 'closure' is not subsettable.
Here is the self contained example showing all my steps. I think the problem comes from mutate. So, how could I force the use of for loop with mutate?
data <- read.table(text = 'obs gender ageclass weight year subdata income
1 F 1 10 yearA sub1 1000
2 M 2 25 yearA sub1 1200
3 M 2 5 yearB sub2 1400
4 M 1 11 yearB sub1 1350',
header = TRUE)
library(dplyr)
library(GiniWegNeg)
dataA <- select(data, gender, ageclass)
dataB <- select(data, -gender, -ageclass)
rm(data)
# Generate permutation of indexes based on the number of column in dataA
library(combinat)
index <- permn(ncol(dataA))
# Attach dataA to the previous list of index
res <- lapply(index, function(x) dataA[x])
# name my list keeping track of permutation order in dataframe name
names(res) <- unlist(lapply(res,function(x) sprintf('data%s',paste0(toupper(substr(colnames(x),1,1)),collapse = ''))))
# Create a list containing the name of each data.frame name
NameList <- unlist(lapply(res,function(x) sprintf('data%s',paste0(toupper(substr(colnames(x),1,1)),collapse = ''))))
# Define as N the number of columns/permutation/dataframes
N <- length(res)
# Merge res and dataB for all permutation of dataframes
res <- lapply(res,function(x) cbind(x,dataB))
# Change the name of res so that all data frames are named data
names(res) <- rep("data", N)
# APPLY FOR LOOP TO ALL DATAFRAMES
for (j in NameList){
runCalc <- function(data, y){
data <- data %>%
group_by_at(1) %>%
mutate(Income_1 = weighted.mean(income, weight))
data <- data %>%
group_by_at(2) %>%
mutate(Income_2 = weighted.mean(income, weight))
gini <- c(Gini_RSV(data$Income_1, data$weight), Gini_RSV(data$Income_2,data$weight))
Gini <- data.frame(gini)
colnames(Gini) <- c("Income_1","Income_2")
rownames(Gini) <- c(paste0("Gini_", y))
return(Gini)
}
runOtherCalc <- function(df, y){
Contrib <- (1/5) * df$Income_1 + df$Income_2
Contrib <- data.frame(Contrib)
colnames(Contrib) <- c("myresult")
rownames(Contrib) <- c(paste0("Contrib_", y))
return(Contrib)
}
# Run runCalc over dataframe data by year
df1_List <- lapply(unique(data$year), function(i) {
byperiod <- subset(data, year == i)
runCalc(byperiod, i)
})
# runCalc returns df which then passes to runOtherCalc, again by year
df1_OtherList <- lapply(unique(data$year), function(i) {
byperiod <- subset(data, year == i)
df <- runCalc(byperiod, i)
runOtherCalc(df, i)
})
# Run runCalc over dataframe data by subdata
df2_List <- lapply(unique(data$subdata), function(i) {
bysubdata <- subset(data, subdata == i)
runCalc(bysubdata, i)
})
# runCalc returns df which then passes to runOtherCalc, again by subdata
df2_OtherList <- lapply(unique(data$subdata), function(i) {
bysubdata <- subset(data, subdata == i)
df <- runCalc(bysubdata, i)
runOtherCalc(df, i)
})
# Return all results in separate frames, then append by row in 2 frames
Gini_df1 <- do.call(rbind, df1_List)
Contrib_df1 <- do.call(rbind,df1_OtherList)
Gini_df2 <- do.call(rbind, df2_List)
Contrib_df2 <- do.call(rbind,df2_OtherList)
Gini <- rbind(Gini_df1, Gini_df2)
Contrib <- rbind(Contrib_df1, Contrib_df2)
}
Admittedly, the R error below is a bit cryptic, but it usually means you are subsetting a function, typically because the data object you meant to use does not exist under that name.
Error: object of type 'closure' is not subsettable.
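Specifically, it comes from your lapply call: data is not defined anywhere globally (only within the runCalc function), because you removed it above with rm(data). The name data then resolves to the built-in function data(), a closure, which cannot be subset with $.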
dfList <- lapply(unique(data$year), function(i) {
byperiod <- subset(data, year == i)
runCalc(byperiod, i)
})
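A minimal sketch reproducing this situation: when no data frame named data exists, the name refers to the function utils::data, and subsetting it fails. Passing the data frame in as an argument avoids relying on a global name. The names failed, years_of and toy are hypothetical and used only for this illustration.

```r
# With no object called `data` in the workspace, the name resolves to utils::data
data_is_function <- is.function(utils::data)

# Subsetting a function with $ raises an error; capture it instead of stopping
failed <- inherits(
  tryCatch(utils::data$year, error = function(e) e),
  "error"
)

# Hypothetical helper: take the data frame as an argument, not from the global scope
years_of <- function(df) unique(df$year)

toy <- data.frame(year = c("yearA", "yearB", "yearA"),
                  stringsAsFactors = FALSE)
toy_years <- years_of(toy)
```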
By the way, the lapply...unique...subset pattern can be replaced with the underused base R grouping function, by().
Gathering from your text and code, I believe you intend to run a year grouping on each dataframe of your list, res. Then consider two by calls, wrapped in a larger function that receives as a parameter a dataframe, df. Then run lapply across all items of list to return a new list of nested dataframe pairs.
# SECONDARY FUNCTIONS
runCalc <- function(data) {
data <- data %>%
group_by_at(1) %>%
mutate(Income_1 = weighted.mean(income, weight))
data <- data %>%
group_by_at(2) %>%
mutate(Income_2 = weighted.mean(income, weight))
Gini <- data.frame(
year = data$year[[1]],
Income_1 = unname(Gini_RSV(data$Income_1, data$weight)),
Income_2 = unname(Gini_RSV(data$Income_2, data$weight)),
row.names = paste0("Gini_", data$year[[1]])
)
return(Gini)
}
runOtherCalc <- function(df){
Contrib <- data.frame(
myresult = (1/5) * df$Income_1 + df$Income_2,
row.names = paste0("Contrib_", df$year[[1]])
)
return(Contrib)
}
# PRIMARY FUNCTION
runDfOperations <- function(df) {
gList <- by(df, df$year, runCalc)
gTmp <- do.call(rbind, gList)
cList <- by(gTmp, gTmp$year, runOtherCalc)
cTmp <- do.call(rbind, cList)
gTmp$year <- NULL
return(list(gTmp, cTmp))
}
# RETURNS NESTED LIST OF TWO DFs FOR EACH ORIGINAL DF
new_res <- lapply(res, runDfOperations)
# SEPARATE LISTS IF NEEDED (EQUAL LENGTH)
Gini <- lapply(new_res, "[[", 1)
Contrib <- lapply(new_res, "[[", 2)
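As a smaller illustration of the by() plus do.call(rbind, ...) pattern used above, here is a sketch on toy data. The data frame toy and the column mean_income are made up for this example, and the per-group calculation is a plain weighted mean rather than the Gini computation.

```r
# Toy data loosely modelled on the question's columns
toy <- data.frame(
  year   = c("yearA", "yearA", "yearB", "yearB"),
  income = c(1000, 1200, 1400, 1350),
  weight = c(10, 25, 5, 11),
  stringsAsFactors = FALSE
)

# by() splits toy by year and calls the function once per group
per_year <- by(toy, toy$year, function(d) {
  data.frame(
    year = d$year[[1]],
    mean_income = weighted.mean(d$income, d$weight)
  )
})

# Stack the one-row data frames into a single result
result <- do.call(rbind, per_year)
```

Each group arrives as a complete data frame, so the function can use any of its columns without a separate subset() call.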
