Short syntax simplification with mutate and lapply - r

I have code that creates a new dataframe, df2 which is a copy of an existing dataframe, df but with four new columns a,b,c,d. The values of these columns are given by their own functions.
The code below works as intended but it seems repetitive. Is there a more succinct form that you would recommend?
df2 <- df %>% mutate(a = lapply(df[,c("value")], f_a),
b = lapply(df[,c("value")], f_b),
c = lapply(df[,c("value")], f_c),
d = lapply(df[,c("value")], f_d)
)
Example of cell contents in "value" column "-0.57(-0.88 to -0.26)".
I am applying a function to extract first number:
f_a <- function(x){
substring(x, 1, regexpr("\\(", x)[1] - 1)
}
This works fine when applied to a single string (-0.57 from the example). In the data frame I found that lapply gives correct values based on input from any cell in the "value" column. The code seems a bit repetitive but works.

We can use map
library(tidyverse)
df[c('a', 'b', 'c', d')] <- map(list(f_a, f_b, f_c, f_d), ~ lapply(df$value, .x))
Note: Without the functions or an example, not clear whether this is the optimal solution. Also, as noted in the comments, many of the functions can be applied directly on the column instead of looping through each element.

Related

Looping over multiple variables in R

I have been using Stata and the loops are easily executed there. However, in R I have faced some errors in looping over variables. I tried some of the codes over here and it does not work. Basically, I am trying to clean the data by logging the values. I had to convert negative values to positive first before logging them.
I intend to loop over multiple firm statistics on the dataframe but I faced errors in doing so.
varlist <- c("revenue", "profit", "cost")`
for (v in varlist) {
data$log_v <- log(abs(ifelse(data$v>1, data$v, NA)))
data$log_v <- ifelse(data$v<0, data$log_v*-1,data$log_v)
}
Error in $<-.data.frame(tmp,"log_v", value = numeric(0)) : replacement has 0 rows, data has 9
It looks like you might be assuming that data$log_v is getting read as data$log_profit, but R's going to take it own it's own and read it as "log_v" all 3 times. This example might not be quite everything you're trying to do but it might help you. It's taking a list of variables and referencing them via their string names.
df <- data.frame(x = rnorm(15), y = rnorm(15))
vars <- c("x", "y")
for (v in vars) {
df[paste0("log_", v)] <- log(abs(df[v]))
}
Here's roughly the same thing in data.table.
library(data.table)
dt <- data.table(x = rnorm(15), y = rnorm(15))
dt[, `:=`(log_x = log(abs(x)), log_y = log(abs(y)))]
Here is an explanation to the source of your confusion:
A data.frame is a special type of list, it's elements are vectors of the same length – columns. Normally, you access an element of a list using the [[ function, for example df[["revenue"]]. Instead of "revenue", you can also use a variable, such as df[[varlist[1]]]. So far, so good.
However, lists have a convenience operator, $, which allows you to access the elements with less typing: df$revenue. Unfortunately, you cannot use variables this way: this by design. Since you don't have to use quotes with $, the operator cannot know whether you mean revenue as the literal name of the element or revenue as the variable that holds the literal name of the element.
Therefore, if you want to use variables, you need to use the [[ function, and not the $. Since programmers hate typing and want to make code as terse as possible, various ways around it have been invented, such as data.tables and tidyverse (I am exaggerating a bit here).
Also, here is a tidyverse solution.
library(tidyverse)
varlist <- c("revenue", "profit", "cost")
df <- data.frame(revenue=rnorm(100), profit=rnorm(100), cost=rnorm(100))
df <- df %>% mutate_at(varlist, list(log10 = ~ log10(abs(.))))
Explanation:
mutate_all applies log10(abs(.)) to every column. The dot . is a temporary variable that hold the column values for each of the columns.
by default, mutate_all will replace the existing variables. However, if instead of providing a function (~ log10(abs(.))) you provide a named list (list(log10 = ~ log10(abs(.)))), it will add new columns using log10 as a suffix in column name.
this method makes it easy to apply several functions to your columns, not only the one.
See? No (obvious) loops at all!

Struggling with for-loop and mutate in R

I am trying to wrap my head around R, and I'm sure I'm doing something silly.
I have a dataframe that includes 30 brands (whose names I have separately in a list called "brands") and a list of new names that I wish to insert into the dataframe (called "known brands").
I am trying to populate the results of an if statement within new columns in an R dataframe (using the names within "known brands), but this keeps on generating an error message (unexpected '{' in "{")
I'm not sure where I'm going wrong - here's my code:
for(i in 1:length(brands)){
plot1a_df <- plot1a_df %>% mutate(known_brands[i] = ifelse(brands[i] >1, 1, 0))
}
To illustrate with data (assume 3 x2 columns):
plot1a_df = data.frame(brands = c(1,0,2), Misc = c(0,0,0))
The idea is to end up with a third column ("known_brands") with c(0,0,1)
To add a logical column with dplyr:
library(dplyr)
plot1a_df %>% mutate(is_brand_known = brands %in% brand_list)
Another example with iris dataset.
species_list = c('setosa')
iris %>% mutate(is_setosa = Species %in% species_list)
for (i in 1:30){
plot1a_df[, known_cols[i]] <- ifelse(plot1a_df[,brands[i]] >1, 1, 0)
print(plot1a_df[, known_cols[i]])
}
Found my solution could be achieved without mutate - although, still wonder if it is possible to combine for loops within dplyr (realise there's a lot of commentary on here, but nothing at a high level (for this simpleton to understand at least!)

using adist on two columns of data frame

I want to use adist to calculate edit distance between the values of two columns in each row.
I am using it in more-or-less this way:
A <- c("mad","car")
B <- c("mug","cat")
my_df <- data.frame(A,B)
my_df$dist <- adist(my_df$A, my_df$B, ignore.case = TRUE)
my_df <- my_df[order(dist),]
The last two rows are the same as in my case, but the actual data frame looks a bit different - columns of my original data frame are character type, not factor. Also, the dist column seems to be returned as 2-column matrix, I have no idea why it happens.
Update:
I have read a bit and found that I need to apply it over the rows, so my new code is following:
apply(my_df, 1, function(d) adist(d[1], d[2]))
It works fine, but for my original dataset calling it by column numbers is inpractical, how can I refer to column names in this function?
Using tidyverse approach, you may use the following code:
library(tidyverse)
A <- c("mad","car")
B <- c("mug","cat")
my_df <- data.frame(A,B)
my_df %>%
rowwise() %>%
mutate(Lev_dist=adist(x=A,y=B,ignore.case=TRUE))
You can overcome that problem by using mapply, i.e.
mapply(adist, df$A, df$B)
#[1] 2 1
As per adist function definition the x and y arguments should be character vectors. In your example the function is returning a 2x2 matrix because it is comparing also the cross words "mad" with "cat" and "car" with "mug".
Just look at the matrix master diagonal.

Finding the closest character string in a second data frame in R

I have a quite big data.frame with non updated names and I want to get the correct names that are stored in another data.frame.
I am using stringdist function to find the closest match between the two columns and then I want to put the new names in the original data.frame.
I am using a code based on sapply function, as in the following example :
dat1 <- data.frame("name" = paste0("abc", seq(1:5)),
"value" = round(rnorm(5), 1))
dat2 <- data.frame("name" = paste0("abd", seq(1:5)),
"other_info" = seq(11:15))
dat1$name2 <- sapply(dat1$name,
function(x){
char_min <- stringdist::stringdist(x, dat2$name)
dat2[which.min(char_min), "name"]
})
dat1
However, this code is too slow considering the size of my data.frame.
Is there a more optimized alternative solution, using for example data.table R package?
First convert the data frames into data tables:
dat1 <- data.table(dat1)
dat2 <- data.table(dat2)
Then use the ":=" and "amatch" command to create a new column that approximately matches the two names:
dat1[,name2 := dat2[stringdist::amatch(name, dat2$name)]$name]
This should be much faster than the sapply function. Hope this helps!

R: Apply function on specific columns preserving the rest of the dataframe

I'd like to learn how to apply functions on specific columns of my dataframe without "excluding" the other columns from my df. For example i'd like to multiply some specific columns by 1000 and leave the other ones as they are.
Using the sapply function for example like this:
a<-as.data.frame(sapply(table.xy[,1], function(x){x*1000}))
I get new dataframes with the first column multiplied by 1000 but without the other columns that I didn't use in the operation. So my attempt was to do it like this:
a<-as.data.frame(sapply(table.xy, function(x) if (colnames=="columnA") {x/1000} else {x}))
but this one didn't work.
My workaround was to give both dataframes another row with IDs and later on merge the old dataframe with the newly created to get a complete one. But I think there must be a better solution. Isn't it?
If you only want to do a computation on one or a few columns you can use transform or simply do index it manually:
# with transfrom:
df <- data.frame(A = 1:10, B = 1:10)
df <- transform(df, A = A*1000)
# Manually:
df <- data.frame(A = 1:10, B = 1:10)
df$A <- df$A * 1000
The following code will apply the desired function to the only the columns you specify.
I'll create a simple data frame as a reproducible example.
(df <- data.frame(x = 1, y = 1:10, z=11:20))
(df <- cbind(df[1], apply(df[2:3],2, function(x){x*1000})))
Basically, use cbind() to select the columns you don't want the function to run on, then use apply() with desired functions on the target columns.
In dplyr we would use mutate_at in which you can select or exclude (by preceding variable name with "-" minus sign) specific variables.
You can just name a function
df <- df %>%
mutate_at(vars(columnA), scale)
or create your own
df <- df %>%
mutate_at(vars(columnA, columnC), function(x) {do this})

Resources