Looping over multiple variables in R - r

I have been using Stata, where loops like this are easy to write. In R, however, I have run into errors when looping over variables. I tried some of the code posted here and it does not work. Basically, I am trying to clean the data by logging the values, and I have to convert negative values to positive before logging them.
I intend to loop over multiple firm statistics in the data frame, but I get an error when doing so.
varlist <- c("revenue", "profit", "cost")
for (v in varlist) {
data$log_v <- log(abs(ifelse(data$v>1, data$v, NA)))
data$log_v <- ifelse(data$v<0, data$log_v*-1,data$log_v)
}
Error in $<-.data.frame(tmp,"log_v", value = numeric(0)) : replacement has 0 rows, data has 9

It looks like you might be assuming that data$log_v is read as data$log_profit, but R takes it literally and reads it as "log_v" all three times. This example might not be quite everything you're trying to do, but it might help you. It takes a list of variables and references them via their string names.
df <- data.frame(x = rnorm(15), y = rnorm(15))
vars <- c("x", "y")
for (v in vars) {
df[paste0("log_", v)] <- log(abs(df[v]))
}
Here's roughly the same thing in data.table.
library(data.table)
dt <- data.table(x = rnorm(15), y = rnorm(15))
dt[, `:=`(log_x = log(abs(x)), log_y = log(abs(y)))]

Here is an explanation to the source of your confusion:
A data.frame is a special type of list; its elements are vectors of the same length (the columns). Normally, you access an element of a list using the [[ function, for example df[["revenue"]]. Instead of "revenue", you can also use a variable, such as df[[varlist[1]]]. So far, so good.
However, lists have a convenience operator, $, which allows you to access the elements with less typing: df$revenue. Unfortunately, you cannot use variables this way; this is by design. Since you don't have to use quotes with $, the operator cannot know whether you mean revenue as the literal name of the element or revenue as the variable that holds the literal name of the element, so it always takes the literal name.
Therefore, if you want to use variables, you need to use the [[ function, and not the $. Since programmers hate typing and want to make code as terse as possible, various ways around it have been invented, such as data.tables and tidyverse (I am exaggerating a bit here).
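To make the [[ point concrete, here is a minimal sketch of the original loop rewritten with [[ and paste0(). The sample values are made up, and the "signed log" (the log of the absolute value with the original sign restored) is only my reading of what the question is after; note that zeros would give NaN here.

```r
# Hypothetical data standing in for the asker's firm statistics
data <- data.frame(revenue = c(10, -20, 30),
                   profit  = c(-5, 15, 25),
                   cost    = c(2, 4, -8))
varlist <- c("revenue", "profit", "cost")

for (v in varlist) {
  x <- data[[v]]                                      # v is used as a variable here
  data[[paste0("log_", v)]] <- sign(x) * log(abs(x))  # creates log_revenue, log_profit, log_cost
}
```

Each pass creates a differently named column because the name is built as a string and passed to [[, rather than written literally after $.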
Also, here is a tidyverse solution.
library(tidyverse)
varlist <- c("revenue", "profit", "cost")
df <- data.frame(revenue=rnorm(100), profit=rnorm(100), cost=rnorm(100))
df <- df %>% mutate_at(varlist, list(log10 = ~ log10(abs(.))))
Explanation:
mutate_at applies log10(abs(.)) to every column named in varlist. The dot . is a temporary variable that holds the column values for each of the columns.
by default, mutate_at will replace the existing variables. However, if instead of providing a function (~ log10(abs(.))) you provide a named list (list(log10 = ~ log10(abs(.)))), it will add new columns using log10 as a suffix in the column name.
this method makes it easy to apply several functions to your columns, not only one.
See? No (obvious) loops at all!
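For comparison, roughly the same result can be had without any packages. This is only a sketch in base R; the column names are chosen to mimic the revenue_log10 style that the mutate_at call above produces.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
varlist <- c("revenue", "profit", "cost")
df <- data.frame(revenue = rnorm(100), profit = rnorm(100), cost = rnorm(100))

# lapply() returns one transformed vector per selected column;
# assigning the list to new names adds them as new columns
df[paste0(varlist, "_log10")] <- lapply(df[varlist], function(x) log10(abs(x)))
```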

Related

colnames and mutate on multiple dataframes

I have a problem with cleaning up my code. I understand I could type this all out but we don't want that obviously.
I have only dataframes in my global environment. They are all "data.frame".
I want to check the dimensions of all of them and put that in a tibble, which I managed somehow. I would also like to change their colnames() with tolower(), which is easy if I just type the name of the data.frame, but there are more than 2 and I want it done automatically. Then I also want to mutate all data.frames in the same way.
Small example of my code:
library(tidyverse)
x <- data.frame(letters[1:2]) #To create the data
y <- data.frame(letters[3:4])
dfs <- as.list(ls()) #I take whatever is in my environment
I managed below to get a tibble of the dimensions:
z <- as_tibble(lapply(seq_along(dfs),
function(j) dim(get(dfs[[j]]))), .name_repair = "unique")
colnames(z) <- dfs
Now for the colnames of all the data.frames stored in my list I basically want to perform this code:
colnames(dfs[[1]]) <- tolower(colnames(dfs[[1]]))
but that returns NULL as I found out earlier. So I used get() in there to make it work for the dimensions. But if I use get() to assign colnames it says it can't find function "get<-".
Since all colnames for all dataframes are the same (just different nrows()) I could save the lowercase colnames as value and use that, but that doesn't take away that it cant find the get<- function.
names <- tolower(colnames(x))
sapply(seq_along(dfs),
function(j) colnames(get(dfs[[j]])) <- names)
*Error in colnames(get(dfs[[j]])) <- names :
could not find function "get<-"*
as for the mutating part I tried a for loop:
for(i in seq_along(dfs)){
get(dfs[[i]]) <- get(dfs[[i]]) %>% mutate(cd = ab)
}
But it's the same issue.
Could anyone help me clear up this problem? (If cleaner code for the dimensions is available, that would be highly appreciated.)
I am just trying to up my coding skills. I would have been long done if I just typed it all out but that defeats the purpose.
Thanks!
-JK
Using base R, and assuming dfs is a list of the data frames themselves rather than a list of their names:
lapply(dfs, function(x) transform(setNames(x, tolower(names(x))), X = c('a', 'b')))
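A slightly fuller sketch of the same idea: collect the data frames with mget() so the list holds the objects, not their names, and then do everything on the list. The column name AB and the cd = ab step are stand-ins for the question's real columns.

```r
x <- data.frame(AB = letters[1:2])  # hypothetical column name
y <- data.frame(AB = letters[3:4])

dfs <- mget(c("x", "y"))            # named list of data frames (mget(ls()) would take everything)

dims <- sapply(dfs, dim)            # 2 x n matrix: row 1 = nrow, row 2 = ncol

dfs <- lapply(dfs, function(d) {
  names(d) <- tolower(names(d))     # lower-case the column names
  d$cd <- d$ab                      # stand-in for mutate(cd = ab)
  d
})
```

If the modified data frames are needed back as separate objects, list2env(dfs, envir = globalenv()) would write them out again, although keeping them in the list is usually easier to work with.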

How do you establish the column number that an apply function is working on to use that number to retrieve a variable indexed by that number

I am converting my for-loops in R for a model that has multiple input datasets. In the for-loop I use the current loop value to retrieve values from other datasets. I am looking to replicate this using an apply function (over columns in a dataset); however, I'm struggling to establish the index of the column the apply function is working on, in order to retrieve the appropriate variables from the other data.
The apply function refers to the column through the function's argument, which is fine. I've tried to use colname (after having named my columns by number) to recover the index, but have not had any joy. Below is an example dataset and for loop with what I'd like to achieve (simplified somewhat). The length of the vectors and the number of columns in the tabular dataset will always be equal.
iteration<-1:3
df <- data.frame("column1" = 6:10, "column2" = 12:16, "column3" = 31:35)
variable1<-rnorm(3,mean = 25)
variable2<-rnorm(3, mean = 0.21)
outcome<-numeric()
for (i in iteration) {
intermediate<-(mean(df[,i])*variable1[i])^variable2[i]
outcome<-c(outcome,intermediate)
}
outcome
The expected results are outcome above...trying this in apply
What I imagine it to be is this:
apply(df, 2, function(x) (mean(x)*variable1[colnumber(x)])^variable2[colnumber(x)])
or perhaps
apply(df, 2, function(x) (mean(x)*variable1[x])^variable2[x])
but these two obviously do not work.
First-time user, so apologies for any etiquette issues, but I found the answer to my own problem using the purrr package. Maybe this helps someone else:
pmap(list(df, variable1, variable2), function(df, variable1, variable2) (mean(df)*variable1)^variable2)
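For anyone who prefers to stay in base R, here are two sketches that should give the same numbers as the for loop in the question: one iterates over the column index, the other walks the columns and the two vectors in parallel with mapply(). The seed is an arbitrary addition for reproducibility.

```r
set.seed(1)  # arbitrary seed, not in the original question
df <- data.frame(column1 = 6:10, column2 = 12:16, column3 = 31:35)
variable1 <- rnorm(3, mean = 25)
variable2 <- rnorm(3, mean = 0.21)

# Option 1: iterate over the index, so i is available for every lookup
by_index <- sapply(seq_along(df), function(i) (mean(df[[i]]) * variable1[i])^variable2[i])

# Option 2: iterate over the columns and both vectors in parallel
by_mapply <- mapply(function(col, v1, v2) (mean(col) * v1)^v2, df, variable1, variable2)
```

Both return a numeric vector (the mapply() result is named after the columns), whereas pmap() returns a list.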

Split-apply-combine with aggregate : can the applied function accept multiple arguments that are specified variables of the original data?

Some context: On my quest to improve my R-code I'm trying to replace my for-loops whenever I can by R's apply-class functions.
The question: Are R's apply functions such as sapply, tapply, aggregate, etc. useful for applying functions that are more complicated in the sense that they take as arguments specified variables of the original data?
Simple examples of what works and what does not:
I have a dataframe with one time variable date.time and two numeric variables value.one and value.two:
Generate the data:
library(lubridate) # provides ymd_hms()
df <- data.frame(date.time = seq(ymd_hms("2000-01-01 00:00:00"), ymd_hms("2000-01-03 00:00:00"), length.out = 100), value.one = 1:100, value.two = 1:100 + 10)
I would like to apply a function to every 10 hour cut of the dataframe that has as its two arguments the two numeric variables of the dataframe. For example, if I want to compute the mean of each of the two values for each 10 hour cut the solution is the following:
A function that computes the mean of value.one and value.two for each time period of 10 hours:
work_on_subsets <- function(data, time.step = "10 hours"){
aggregate(data[,-1], list(cut(df$date.time, breaks = time.step)), function(x) mean(x))}
However, if I want to work with the two data values separately to run another function, say compute the sum of the two averages, I run into trouble. The function work_on_subsets_2 gives me the following error: Error in x$value.one : $ operator is invalid for atomic vectors
A function that computes the sum of the means of value.one and value.two for each 10 hour time period:
work_on_subsets_2 <- function(data, time.step = "10 hours"){
aggregate(data, list(cut(df$date.time, breaks = time.step)), function(x) mean(x$value.one) + mean(x$value.two)}
In the limit, I would like to be able to do something like this:
A function that runs another_function on value.one and value.two for each time period of 10 hours :
another_function <- function(a,b) {
# do something with a and b
}
work_on_subsets_3 <- function(data, time.step = "10 hours"){aggregate(data, list(cut(df$date.time, breaks = time.step)), another_function(x$value.one, x$value.two))}
Am I using the wrong tools for this job? I have already a working solution using for loops, but I'm trying to get a grip on the split-apply-combine strategy. If so, are there any viable alternatives to for-loops?
Hi, there are a few basic things you are doing wrong here. Your function takes data as its data.frame argument, but inside it you still reference df from the global environment. You're also missing at least one bracket. And I don't quite know why you have two layers of functions embedded.
My solution departs from your method but hopefully will help. I'd recommend using the plyr package when you want to split dataframes and apply functions, as I find it much more intuitive. Combining it with dplyr also helps in my opinion. Note: always load plyr before dplyr; otherwise plyr masks dplyr functions of the same name (such as mutate and summarise).
If I understand your question correctly the below should work, and you could create different functions to apply
library(plyr)
library(dplyr)
# create the function you want to apply
MeanFun <- function(data) mean(data[["value.one"]]) + mean(data[["value.two"]])
# add a grouping variable to your dataframe. You can chain this with pipes (%>%)
# if you don't want to create a new data.frame, but for testing purposes it
# shows more clearly what's happening
df1 <- df %>% mutate(Breaks = cut(date.time, breaks = "10 hours"))
# use plyr's ddply to split the dataframe on the "Breaks" column and apply the function
out <- ddply(df1, "Breaks", MeanFun)
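If you would rather avoid extra packages, the same split-apply-combine idea can be sketched in base R with split() and sapply(). The data is rebuilt here with as.POSIXct() instead of lubridate so the snippet stands alone, and sum_of_means is just an illustrative name.

```r
df <- data.frame(
  date.time = seq(as.POSIXct("2000-01-01 00:00:00", tz = "UTC"),
                  as.POSIXct("2000-01-03 00:00:00", tz = "UTC"),
                  length.out = 100),
  value.one = 1:100,
  value.two = 1:100 + 10
)

# The applied function receives a whole sub-data-frame, so it can use any column
sum_of_means <- function(d) mean(d$value.one) + mean(d$value.two)

groups <- cut(df$date.time, breaks = "10 hours")  # one factor level per 10-hour window
out <- sapply(split(df, groups), sum_of_means)    # split, apply, combine
```

This also shows why the aggregate() attempt failed: aggregate() hands the function one column at a time as a plain vector, while split() hands over data frames.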

Using a custom function (ifelse) with dcast

I'm interested in reshaping a dataframe, but instead of using standard dcast functions like the mean, I'd like to use a custom function. Specifically, I'm interested in using an ifelse statement to assign binary values.
Here's a reproducible example:
# dataframe that includes extraneous information
df <- data.frame(sale_id=c(1,1,1,2,2,2,3,3,4,5),project_id=c(501,502,503,501,502,503,501,502,504,505),
sale_year=c(1990,1991,1993,1990,1992,1990,1991,1993,1990,1992),
var1=c(5,4,3,6,5,4,4,7,2,9),var2=c(7,3,4,8,5,8,2,3,5,7))
# list of the variables I actually need (I don't need 'sale_year')
varlist <- c("var1","var2")
# selecting out id variables and variables I'm interested in manipulating
dfvars <- df[,c("sale_id","project_id",varlist)]
# melt dataframe
library(reshape2)
mdata <- melt(dfvars, id=c('sale_id','project_id'))
# create custom ifelse function, assign '1' if mean is above a critical value, and '0' if not
funx <- function(u){ifelse(mean(u)>5,1,0)}
# cast data using this function
cdata <- dcast(mdata, sale_id~variable, funx)
It works if I just use a standard function, like mean (ex):
cdata <- dcast(mdata, sale_id~variable, mean)
But with my ifelse() function, I get an error about data types (logical vs. double), which doesn't make sense to me, since the result of "mean(u) > 5" should be returning a logical result (TRUE or FALSE), to then be used by the ifelse() part.
I believe this has to do with the details of type coercion. The return of your custom function is being treated as double for some sets of observations, but logical in others. The code works when you make explicit the return type.
Example:
# Works
funx1 <- function(u){ifelse(mean(u)>5,TRUE,FALSE)}
funx2 <- function(u){as.logical(ifelse(mean(u)>5,1,0))}
funx3 <- function(u){as.numeric(ifelse(mean(u)>5,1,0))}
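To see where the mismatch comes from: per the reshape2 documentation, dcast's default fill value is obtained by applying the aggregation function to a zero-length vector. The sketch below needs no packages and checks what the question's function returns in that case.

```r
funx  <- function(u) ifelse(mean(u) > 5, 1, 0)
funx3 <- function(u) as.numeric(ifelse(mean(u) > 5, 1, 0))

typeof(funx(c(6, 7)))       # "double": the usual case returns 1 or 0
typeof(funx(numeric(0)))    # "logical": mean(numeric(0)) is NaN, NaN > 5 is NA,
                            # and ifelse() on an NA test returns a logical NA
typeof(funx3(numeric(0)))   # "double": as.numeric() turns that NA into NA_real_
```

So the original function yields a logical for the empty case and a double otherwise, which would explain the logical vs. double complaint; the explicit conversions make both cases the same type.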

How to write loop for creating multiple variables in R

I have a data set in R called data, and in this data set I have more than 600 variables. Among these variables I have 94 variables called data$sleep1,data$sleep2...data$sleep94, and another 94 variables called data$wakeup1,data$wakeup2...data$wakeup94.
I want to create new variables, data$total1-data$total94, each of which is the sum of sleep and wakeup for the same day.
For example, data$total64 <-data$sleep64 + data$wakeup64,data$total94<-data$sleep94+data$wakeup94.
Without a loop, I need to write this code 94 times. I hope someone could give me some tips on this. It doesn't have to be a loop, but an easier way to do this.
FYI, all variables are numeric and about 30% of the values are missing. The missing values are random and could be anywhere. A missing value is a blank, not 0.
I recommend storing your data in long form. To do this, use melt. I'll use data.table.
Sample data:
library(data.table)
set.seed(102943)
x <- setnames(as.data.table(matrix(runif(1880), nrow = 10)),
paste0(c("sleep", "wakeup"), rep(1:94, each = 2)))[ , id := 1:.N]
Melt:
long_data <-
melt(x, id.vars = "id",
measure.vars = list(paste0("sleep", 1:94),
paste0("wakeup", 1:94)))
#rename the output to look more familiar
#**note: this syntax only works in the development version;
# to install, follow instructions
# here: https://github.com/jtilly/install_github
# to install from https://github.com/Rdatatable/data.table
# (or, read ?setnames and figure out how to make the old version work)
setnames(long_data, -1L, c("day", "sleep", "wakeup"))
I hope you'll find it's much easier to work with the data in this form.
For example, your problem is now simple:
long_data[ , total := sleep + wakeup]
We could do this without a loop. Assuming that the columns are arranged in the sequence mentioned, we subset the 'sleep' columns and 'wakeup' columns separately using grep and then add the datasets together.
sleepDat <- data[grep('sleep', names(data))]
wakeDat <- data[grep('wakeup', names(data))]
nm1 <- paste0('total', 1:94)
data[nm1] <- sleepDat+wakeDat
If there are missing values and they are NA, we can replace the NA values with 0 and then add it together as before.
data[nm1] <- replace(sleepDat, is.na(sleepDat), 0) +
replace(wakeDat, is.na(wakeDat), 0)
If the missing value is '', then the columns would be either factor or character class (not clear from the OP's post). In that case, we may need to convert the dataset to numeric class so that the '' will be automatically converted to NA
sleepDat[] <- lapply(sleepDat, function(x)
as.numeric(as.character(x)))
wakeDat[] <- lapply(wakeDat, function(x)
as.numeric(as.character(x)))
and then proceed as before.
NOTE: If the columns are character, just omit the as.character step and use only as.numeric.
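If relying on column order feels fragile, a plain loop that pairs the columns by name is another option. This is a sketch on a two-day toy data set with made-up values; for the real data n_days would be 94.

```r
# Toy stand-in for the real data: two days instead of 94
data <- data.frame(sleep1 = c(7, NA), wakeup1 = c(1, 2),
                   sleep2 = c(6, 8),  wakeup2 = c(NA, 1))
n_days <- 2

for (i in seq_len(n_days)) {
  # build each column name as a string and access it with [[
  data[[paste0("total", i)]] <- data[[paste0("sleep", i)]] + data[[paste0("wakeup", i)]]
}
```

Here a missing sleep or wakeup value makes the total NA for that row; replace the NAs with 0 first, as shown above, if a partial sum is wanted instead.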
