Code Breaking When Turned Into Custom Function? - r

I am putting together a summary table from a larger data frame. I noticed that I was re-using the following code but with different %like% characters:
# This code creates a df of values where the row name matches the character
df <- (data[which(data$`col_name` %like% "Total"),])
df <- df[3:ncol(df)]
df[is.na(df)] <- 0
# This creates a row composed of the sum of each column
for (i in seq_along(df)) {
df[10, i] <- sum(df[i])
}
# This inserts the resulting values into a separate summary table
summary[1, 2:ncol(summary)] <- df[nrow(df),]
To keep the code DRY and avoid repetition, I thought it would be best to turn this into a custom function that I could then call with different strings:
create_row <- function(x) {
df <- (data[which(data$`Crop year` %like% as.character(x)),])
df <- df[3:ncol(df)]
df[is.na(df)] <- 0
for (i in seq_along(df)) {
df[10, i] <- sum(df[i])
}
}
# Then populate the summary table as before with the results
total <- create_row("Total")
summary[1, 2:ncol(summary)] <- total[nrow(total),]
However when attempting to run this, it simply returns an empty variable.
Through trial and error, I have found that the line of code causing this is:
df[is.na(df)] <- 0
The code works absolutely fine when run line by line outside of this custom function.

As mentioned in the comments, adding return(df) at the end of the function makes it work. This is needed because an R function returns the value of its last evaluated expression, and here that expression is the for loop, which evaluates to NULL (invisibly). So the function returns NULL instead of df.
Moreover, as mentioned in the comments by @alan, you can use colSums to get the sum of each column directly instead of looping over the columns with a for loop.
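A minimal sketch of both suggestions combined, under a few assumptions: the small data frame below is a made-up stand-in for the asker's data, the data frame is passed in as an argument rather than read from the global environment, and base R's grepl() stands in for %like% so that no package is required.

```r
# Hypothetical stand-in for the asker's `data`; column names are illustrative.
data <- data.frame(
  `Crop year` = c("Total 2019", "Total 2020", "Other"),
  region      = c("a", "b", "c"),
  wheat       = c(1, NA, 5),
  barley      = c(2, 3, 7),
  check.names = FALSE
)

create_row <- function(data, pattern) {
  # grepl() is used here in place of %like% (an assumption for portability)
  df <- data[grepl(pattern, data$`Crop year`), ]
  df <- df[3:ncol(df)]        # keep only the value columns
  df[is.na(df)] <- 0
  colSums(df)                 # last expression, so this is what the function returns
}

total <- create_row(data, "Total")
```

Because colSums() already yields one value per column, the result can be assigned straight into the summary row without picking out the last row of an intermediate data frame.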

Related

Looping function with left_join over multiple variables

I am working to loop a function that contains a left_join iteratively over a dataframe based on multiple variables in R. The function works when I run it line-by-line over the dataframe, but breaks down in the loop. I need to automate this process because I have to run it hundreds of times, but I am getting errors using foreach and mapply.
A portion of the original data set and the original function is this:
library(tidyverse)
ID <- c(22226820,22226820,22226814,22226814)
ID_US_1 <- c(22226830,22226818,22226816,22226832)
mydf <- data.frame(cbind(ID=as.character(ID),ID_US_1=as.character(ID_US_1)))
ID_key <- c(22226830,22226818,22226818,22226816,22226816,22226832,22226832,22226806,22226806,22226814,22226814,22226804)
ID_key_US <- c(0,22226806,22226814,22226804,22226802,22226840,22226842,22226798,22226796,22226816,22226832,22227684)
mykey <- data.frame(cbind(ID_key=as.character(ID_key),ID_key_US=as.character(ID_key_US)))
myfx <- function(iteration_prior,iteration){
# iteration_prior <- "1"
# iteration <- "2"
varnameprior <- paste0("ID_US","_",iteration_prior)
varname <- paste0("ID_US","_",iteration)
colnames(mykey) <- c(varnameprior,varname)
mydf <-mydf %>%
left_join(x=.,y=mykey,by=varnameprior)
mydf[,ncol(mydf)][is.na(mydf[,ncol(mydf)])] <- 0
mydf[,ncol(mydf)]<-as.character(mydf[,ncol(mydf)])
return(mydf)
}
prior <- c(1,2,3)
current <- c(2,3,4)
mylist <- data.frame(cbind(prior=prior,current=current))
mydf <- myfx(prior[1],current[1])
mydf <- myfx(prior[2],current[2])
This creates my desired output, which is iterative columns of data. ID_US_2 is calculated based on ID_US_1 using the mykey dataframe and ID_US_3 is calculated using ID_US_2 and mykey.
I need to carry out this operation hundreds of times, which means I need to automate the process. I have tried a foreach loop and get the error 'Join columns must be present in data'. I think this means the new column is not being appended to the dataframe between iterations. I got the same error with mapply.
library(foreach)
foreach(i=prior,j=current) %do% {myfx(i,j)}
I also considered a nested for loop, but was hung up on the multiple variables (and foreach/mapply seem better suited).
I think that your only issue is that you haven't reassigned mydf in the foreach command. Editing that, you have:
foreach(i=prior, j=current) %do% {mydf <- myfx(i,j)}
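The same reassignment idea can also be written as a plain for loop. The sketch below is an illustration only: step_join is a hypothetical variant of myfx that takes the data frame and the key as explicit arguments, and the toy values are made up so the result is easy to follow.

```r
library(dplyr)

# Toy stand-ins for mydf and mykey (values are illustrative only)
mydf  <- data.frame(ID = c("a", "b"), ID_US_1 = c("k1", "k2"),
                    stringsAsFactors = FALSE)
mykey <- data.frame(ID_key = c("k1", "k2", "k3"),
                    ID_key_US = c("k2", "k3", "0"),
                    stringsAsFactors = FALSE)

# Hypothetical variant of myfx: the data frame is an argument, not a global
step_join <- function(df, key, iteration_prior, iteration) {
  varnameprior <- paste0("ID_US_", iteration_prior)
  varname      <- paste0("ID_US_", iteration)
  colnames(key) <- c(varnameprior, varname)
  out <- left_join(df, key, by = varnameprior)
  out[[varname]][is.na(out[[varname]])] <- "0"   # unmatched keys become "0"
  out
}

prior   <- c(1, 2, 3)
current <- c(2, 3, 4)
for (k in seq_along(prior)) {
  # reassign on every pass so the next join can see the new column
  mydf <- step_join(mydf, mykey, prior[k], current[k])
}
```

Passing the data frame in and out explicitly makes the dependency between iterations visible, which is the part that was missing in the original foreach call.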

rownames on multiple dataframe with for loop in R

I have several data frames. I want the first column of each to become its row names.
I can do it for one data frame this way:
# Rename the row according the value in the 1st column
row.names(df1) <- df1[,1]
# Remove the 1st column
df1 <- df1[,-1]
But I want to do that on several data frames. I tried several strategies, including assign and get, but with no success. Here are the two main ways I've tried:
# Getting a list of all my dataframes
my_df <- list.files(path="data")
# 1st strategy, adapting what works for 1 dataframe
for (i in 1:length(files_names)) {
rownames(get(my_df[i])) <- get(my_df[[i]])[,1] # The problem seems to be in this line
my_df[i] <- my_df[i][,-1]
}
# The error is: could not find function "get<-"
# 2nd strategy using assign()
for (i in 1:length(my_df)) {
assign(rownames(get(my_df[[i]])), get(my_df[[i]])[,1]) # The problem seems to be in this line
my_df[i] <- my_df[i][,-1]
}
# The error is : Error in assign(rownames(my_df[i]), get(my_df[[i]])[, 1]) : first argument incorrect
I really don't see what I missed. When I type get(my_df[i]) and get(my_df[[i]])[,1], it works alone in the console...
Thank you very much to those who can help me :)
You may write the code that you have in a function, read the data and pass every dataframe to the function.
change_rownames <- function(df1) {
row.names(df1) <- df1[,1]
df1 <- df1[,-1]
df1
}
my_df <- list.files(path="data", full.names=TRUE) # full paths, so read.csv can find the files
list_data <- lapply(my_df, function(x) change_rownames(read.csv(x)))
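To see the idea without any files on disk, here is a small sketch that applies the same function to a list of in-memory data frames. The two data frames are invented for illustration, and drop = FALSE is an added safeguard (not in the answer above) so that a data frame with only one remaining column is not collapsed to a vector.

```r
change_rownames <- function(df) {
  row.names(df) <- df[, 1]      # first column becomes the row names
  df[, -1, drop = FALSE]        # drop it, but always return a data frame
}

# Made-up example data
dfs <- list(
  first  = data.frame(id = c("r1", "r2"), x = 1:2, y = 3:4),
  second = data.frame(id = c("s1", "s2"), z = 5:6)
)

renamed <- lapply(dfs, change_rownames)
```

The result is a named list, so each cleaned data frame is available as renamed$first, renamed$second, and so on, without any need for get or assign.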
We can use a loop function like lapply or purrr::map to iterate over all the files, then use tibble::column_to_rownames, which simplifies the procedure a lot. No explicit for loop is needed.
library(purrr)
library(tibble) # column_to_rownames() is exported by tibble
map(my_df, ~ .x %>% read.csv() %>% column_to_rownames(var = names(.)[1]))

How do I alias a column name in a for loop?

I'm making a function and I'd like to call a column in a particular way.
Initialize data
a <- c(1,2,3,4,5)
b <- c(6,7,8,9,10)
c <- c(1,2,3,4,5)
d <- c(6,7,8,9,10)
df <- as.data.frame(cbind(a,b,c,d))
Call column for the table function
Func <- function(df){
X <- df
Y <- names(M)
for(i in 1:2){
table(X$___,X$___)
}}
The trouble is I don't know how to call the columns.
I'd like it to be the equivalent to table(X$a, X$b) as it iterates through the loop.
I tried this and it didn't work
for(i in 1:2){
Q <- Y[i]
W <- Y[j]
table(X$Q,X$W)
}}
It is necessary for a function I'm using that I make a table with the form table(X$a, X$b) and I don't know quite how to achieve that in a for loop?
Instead of calling table using names of the column you could use column index and use it in the function so you don't have to worry about how to call the columns.
Replace your for loop and use
table(df[1:2])
which would give you the expected result.
You need to use double brackets, [[ ]], to get the content of a column:
df <- datasets::mtcars
for (i in 1:2) df[[i]]
This will also work for column names
for (i in names(df)) df[[i]]
Not sure what you are trying to achieve though. You can also just do:
lapply(df[1:2], table)
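Putting the [[ ]] idea back into the shape of the original function, a rough sketch might look like the following. It assumes the intent is to cross-tabulate consecutive column pairs, (a, b) and then (c, d); that pairing is a guess, and pair_tables is a hypothetical name.

```r
df <- data.frame(a = c(1, 2, 3, 4, 5), b = c(6, 7, 8, 9, 10),
                 c = c(1, 2, 3, 4, 5), d = c(6, 7, 8, 9, 10))

# Cross-tabulate consecutive column pairs: (a, b), then (c, d)
pair_tables <- function(X) {
  Y <- names(X)
  out <- list()
  for (i in seq(1, length(Y) - 1, by = 2)) {
    # X[[Y[i]]] is the programmatic equivalent of X$a
    out[[paste(Y[i], Y[i + 1], sep = "_")]] <- table(X[[Y[i]]], X[[Y[i + 1]]])
  }
  out
}

tabs <- pair_tables(df)
```

The key point is that X$Q looks for a column literally named Q, whereas X[[Q]] uses the string stored in Q as the column name.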
You can also loop over columns by index. The following code walks through the columns of the iris dataset, stopping one short of the last column so that i+1 is always valid:
for(i in seq_len(ncol(iris) - 1)){
print(iris[,i]) # a single column, returned as a vector
print(iris[,c(i,i+1)]) # two adjacent columns, returned as a data frame
}

dataframe is collapsed to a vector when given to function

I am trying to make use of the content of a dataframe in a function, here is a simplified example of my problem.
df <- data.frame(v1=1:10,v2=23:32)
df2 <- data.frame(v1=1:3,v2=3:5)
fxm <- function(x,y,q)
{
return(cbind(q[q[,2]==x,],y))
}
mapply(fxm,df[,1],df[,2],q=df2)
Error in q[, 2] : incorrect number of dimensions
if I add a print statement:
df <- data.frame(v1=1:10,v2=23:32)
df2 <- data.frame(v1=1:3,v2=3:5)
fxm <- function(x,y,q)
{
print(q)
return(cbind(q[q[,2]==x,],y))
}
mapply(fxm,df[,1],df[,2],q=df2)
I get:
[1] 1 2 3
Error in q[, 2] : incorrect number of dimensions
The data frame is converted to a vector of its first column for some reason. How can I stop this from happening, and have the whole dataframe accessible to my function?
I am trying to select a subset of the dataframe and returning it based on the other two parameters of the function, which is why I need the whole dataframe to be passed to the function.
If I understand you correctly, you want the whole of q = df2 passed to the fxm function you define, am I right?
The problem is that mapply iterates over q = df2 element by element (for a data frame, that means column by column), just as it does over df[,1] and df[,2], so each call receives a single column as q. To pass the whole data frame unchanged to every call, supply it through the MoreArgs argument of mapply like this:
df <- data.frame(v1=1:10,v2=23:32)
df2 <- data.frame(v1=1:3,v2=3:5)
fxm <- function(x,y,q)
{
print(q)
return(cbind(q[q[,2]==x,],y))
}
mapply(fxm,df[,1],df[,2], MoreArgs = list(q=df2))
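With this change the whole data.frame is printed on every call, which solves your original problem. The call still ends in an error for me, though, and that error comes from somewhere else: most likely from cbind being asked to combine a zero-row subset of q with a length-one y whenever x matches no row.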
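One possible way to get a complete run is sketched below. It makes two assumptions that are not in the original code: calls with no matching row should return NULL, and the results should stay in a list (SIMPLIFY = FALSE) rather than be simplified by mapply.

```r
df  <- data.frame(v1 = 1:10, v2 = 23:32)
df2 <- data.frame(v1 = 1:3, v2 = 3:5)

fxm <- function(x, y, q) {
  hit <- q[q[, 2] == x, , drop = FALSE]   # rows of q whose second column equals x
  if (nrow(hit) == 0) return(NULL)        # nothing to attach y to
  cbind(hit, y = y)
}

# q is passed whole to every call via MoreArgs; x and y are iterated over
res <- mapply(fxm, df[, 1], df[, 2],
              MoreArgs = list(q = df2), SIMPLIFY = FALSE)
```

Here res is a list with one entry per row of df; entries are NULL where nothing matched and small data frames where something did.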

Stepwise reducing the input dataframe in a loop

I need to iteratively evaluate the variance of a dataset while I reduce the data.frame row by row in each step. As an example:
data <- matrix(runif(100),10,10)
perc <- list("vector")
sums <- sum(data)
for (i in 1:nrow(data)) {
data <- data[-1,]
perc[[i]] <- sum(data)/sums # in reality, here are ~8 additonal lines of code
}
I don't like that data is re-assigned in every step, and that the loop breaks with an error when data is emptied.
So the questions are:
1. How can I express data <- data[-1,] in an incrementing way (something like tmp <- data[-c(1:i),], which doesn't work)?
2. Is there a way to stop the loop before the last row is removed from data?
You could try
set.seed(123)
data <- matrix(runif(100),10,10)
sums <- sum(data)
perc <- lapply(2:nrow(data),function(x) sum(data[x:nrow(data),]/sums))
The above code yields the same result as your original code, but without error message and without modifying data.
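perc1 <- list()
tmp <- data # work on a copy so that data itself stays intact
for (i in 1:(nrow(data) - 1)) { # stop before the last row is removed
tmp <- tmp[-1, , drop = FALSE]
perc1[[i]] <- sum(tmp)/sums
}
all.equal(perc,perc1)
#[1] TRUE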
If you wish to preserve the for loop in order to perform other calculations within the loop, you could try:
for (i in 2:nrow(data)) {
perc[[i-1]] <- sum(data[i:nrow(data),])/sums
# do more stuff here
}
identical(perc,perc1)
#[1] TRUE
If you are using the loop index i for other calculations within the loop, you will most probably need to replace it with i-1. It may depend on what is calculated.
You can use lapply
res <- lapply(2:nrow(data), function(i)sum(data[i:nrow(data),])/sums)
You can write the loop part like this:
for (i in 2:nrow(data)) {
perc[[i - 1]] <- sum(data[i:nrow(data),])/sums # in reality, here are ~8 additonal lines of code
}
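If only the running ratio is needed, the subsetting can be avoided altogether with a reversed cumulative sum over the row sums. This is a different technique from the loops above, offered as a sketch; it does not cover the extra lines of per-step code mentioned in the question, and the names tail_sums and perc_vec are made up.

```r
set.seed(123)
data <- matrix(runif(100), 10, 10)
sums <- sum(data)

rs        <- rowSums(data)
tail_sums <- rev(cumsum(rev(rs)))   # tail_sums[i] = sum of rows i..n
perc_vec  <- tail_sums[-1] / sums   # drop i = 1, which is the full matrix

# Reference values computed the same way as in the loops above
reference <- sapply(2:nrow(data), function(i) sum(data[i:nrow(data), ]) / sums)
```

The two results agree up to floating-point rounding, so all.equal(perc_vec, reference) is the appropriate comparison rather than identical().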
