I'm writing a program that calculates the difference between one element of a dataset and the rest of the elements. I'm using dplyr's mutate and I need to pass the entire row as an argument to a function which calculates the difference. Using iris as an example:
library(dplyr)
# Difference function
diff_func <- function (e1, e2) {
return(sum(e1-e2))
}
chosenElement <- iris[1,1:4] # Chosen element
elements <- iris[10:50,1:4] # Elements to compare to
elements %>%
rowwise() %>%
mutate(difference=diff_func(chosenElement, c(Petal.Width, Petal.Length, Sepal.Width, Sepal.Length)))
This works, but as I use the entire row, I would like to use something like "this" or "row" instead of specifying all the columns of the row:
elements %>%
rowwise() %>%
mutate(difference=diff_func(chosenElement, row))
Does anyone know if this can be done?
We can do this very easily in base R by replicating chosenElement so that its dimensions match those of elements:
elementsNew <- elements - chosenElement[col(elements)]
Note that mutate is meant for transforming one or more columns into a single new column (other types of objects can be stored in a list column). Assuming that 'difference' should be computed for each column of 'elements' against the corresponding element of 'chosenElement', the mutate call with diff_func does not do that: it returns a single sum per row.
Update
Based on the clarification, it seems we need to get the difference between each column and the corresponding value of chosenElement (replicated here) and then take the rowSums:
elements %>%
mutate(difference = rowSums(. - chosenElement[col(.)]))
A purrr base combination:
do.call(cbind,purrr::map2(elements,chosenElement,function(x,y) x-y))
Since (a - d) + (b - e) + (c - f) == (a + b + c) - (d + e + f), this is just the difference between the row sums of elements and the sum of chosenElement, which you can compute in base R:
elements$dfrnce <- rowSums(elements) - sum(chosenElement)
Or, in dplyr:
elements %>%
mutate(dfrnce = rowSums(.) - sum(chosenElement))
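For the original rowwise() form, dplyr 1.0.0 and later provide c_across(), which collects the current row's columns into a vector and so plays the role of the hypothetical "row" in the question. The following is a minimal sketch under that version assumption. Converting chosenElement with unlist() is a simplification introduced here, not part of the original code. Note that diff_func(chosenElement, row) computes chosen minus row, which has the opposite sign of the rowSums() expressions above.

```r
library(dplyr)

# Difference function from the question
diff_func <- function(e1, e2) {
  sum(e1 - e2)
}

# unlist() turns the one-row data frame into a plain numeric vector
# (a simplification assumed for this sketch)
chosenElement <- unlist(iris[1, 1:4])
elements <- iris[10:50, 1:4]

result <- elements %>%
  rowwise() %>%
  # c_across(everything()) gathers all columns of the current row
  mutate(difference = diff_func(chosenElement, c_across(everything()))) %>%
  ungroup()
```

The explicit ungroup() at the end drops the row-wise grouping so that later verbs operate on the whole table again.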
I'm trying to create some sort of loop to generate a % of age over the next few years, in months. I have two columns, age and term. Dividing them gets me the % I'm looking for, but I need an easy way to add 1 to age, and keep term consistent, and use that to create a new column. Something like:
for i = n
col_n<-data_set$term/(data_set$age + n)
n=30
library(tidyverse)
# create example data frame
df <- tribble(~age, ~term,
10, 5,
12, 6)
# create function to add new column
agePlusN <- function(df, n) {
mutate(df, "col.{n}" := term/(age+ n))
}
# iterate through 1:30 applying agePlusN()
walk(1:30, \(n) df <<- agePlusN(df, n))
This works, but the last step is a bit ugly. It should really use map instead of walk, but I couldn't quite figure out how to get it not to add new rows.
Attempt 2
# create function to add new column
agePlusN <- function(df, n) {
mutate(df, "col.{n}" := term/(age+n)) %>%
select(-term, -age)
}
# iterate through 1:30 applying agePlusN()
df2 <-
map_dfc(1:30, \(n) agePlusN(df, n)) %>%
bind_cols(df, .)
Notes:
The := in mutate allows you to use glue() syntax in the names on the left hand side argument (eg. "col.{n}")
map_dfc() means map and then use bind_cols to combine all of the outputs
\(n) is equivalent to function(n)
The . in the call to bind_cols() isn't necessary but makes sure the 'age' and 'term' columns are put at the beginning of the resulting dataframe.
I still think this could be done better without having to call bind_cols, but I'm not smart enough to figure it out.
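One possible way to avoid both the <<- assignment and bind_cols() is purrr::reduce(), which threads the data frame through successive calls of agePlusN(). This is only a sketch of that idea, using a plain data.frame in place of the tribble above, and not necessarily what the answer's author had in mind:

```r
library(dplyr)
library(purrr)

# Same example data as above, built with data.frame() for brevity
df <- data.frame(age = c(10, 12), term = c(5, 6))

# Add one new column named col.<n>
agePlusN <- function(df, n) {
  mutate(df, "col.{n}" := term / (age + n))
}

# reduce() calls agePlusN(accumulated_df, n) for n = 1, 2, ..., 30,
# starting from df and passing each result on to the next call
df2 <- reduce(1:30, agePlusN, .init = df)
```

Because each step receives the output of the previous one, 'age' and 'term' stay at the front and the new columns are appended in order.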
I have a very simple sample data frame df_test as:
df_test <- data.frame("A" = 1:5)
I would like to select the row containing 5. I know I can achieve it by using the filter() command as:
df_analysis <- df_test %>%
filter(A == 5)
However, I want to run a for loop (as the actual data set has many variables and is complex), thus instead of filtering columns manually one by one, I would like to run a for loop of columns that can pick one variable at a time and filter rows accordingly. For this example, I create a character vector v as v = c("A").
Now to filter, instead of using the column name, when I try to use this vector index as:
df_analysis <- df_test %>%
filter(v[1] == 5)
It produces 0 rows instead of 1.
How can I filter rows using vector index instead of column index or name?
Thanks!
With the addition of purrr, you can do:
map(.x = v,
~ df_test %>%
filter(across(all_of(.x)) == 5))
[[1]]
A
1 5
We can use base R
df_test[df_test[[v]] == 5, , drop = FALSE]
Or with dplyr, by converting the string to a symbol and evaluating it (!!)
library(dplyr)
df_test %>%
filter(!! rlang::sym(v) == 5)
# A
#1 5
Or with .data
df_test %>%
filter(.data[[v]] == 5)
In its current form, your filter operation compares the literal string "A" (i.e., the contents of v[1]) to the numeric 5, which is of course always false and therefore can't return any valid rows. Instead, you'd need to pass the variable A (contained in df_test) as the first argument to filter(). You can do this by using get() like so:
df_analysis <- df_test %>%
filter(get(v[1]) == 5)
The other solution here using purrr is honestly much better, but I wanted to point out why your code didn't work as expected.
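To connect this back to the for loop the question asks about, here is a minimal sketch that filters on each column named in v in turn, using the .data pronoun. The second column B and the results list are hypothetical additions for illustration only:

```r
library(dplyr)

# B is a hypothetical extra column so the loop has more than one variable
df_test <- data.frame(A = 1:5, B = c(5, 4, 3, 2, 1))
v <- c("A", "B")

results <- list()
for (nm in v) {
  # .data[[nm]] looks up the column whose name is stored in nm
  results[[nm]] <- df_test %>%
    filter(.data[[nm]] == 5)
}
```

Each element of results holds the rows where the corresponding column equals 5.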
I need to apply a function (which takes two arguments of different lengths) to each item in a vector. The function looks up the value in the first argument that ends with the characters in the second argument and outputs the index (the objective is to perform a left join on two tables using a fuzzy join, but regex_left_join crashed so this is the first step in a workaround solution).
Example input:
x <- c("492820UA665110", "492820UA742008", "493600N077751", "671884RB25355")
y <- c("RB25355", "S56890")
Function:
idx_endsWith <- function(.x, .y) {
return(ifelse(length(which(endsWith(.x, .y))) == 1,
which(endsWith(.x, .y)),
NA))
}
So for example,
> idx_endsWith(x, y[1])
[1] 4
How can I apply this function to each element in y without using a loop? I need to vectorize the function, but mapply doesn't work because the vectors need to be the same length. I'm looking for a solution in dplyr.
For dplyr, as you requested, this should work:
data.frame(y, stringsAsFactors = FALSE) %>%
rowwise() %>%
mutate(index = idx_endsWith(x, y))
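If a dplyr pipeline is not strictly required, the same per-element application can be sketched in base R with sapply(), which calls the function once for each element of y while keeping x fixed. This is an alternative offered for comparison, not the approach requested in the question:

```r
x <- c("492820UA665110", "492820UA742008", "493600N077751", "671884RB25355")
y <- c("RB25355", "S56890")

# Function from the question: index of the single match, otherwise NA
idx_endsWith <- function(.x, .y) {
  ifelse(length(which(endsWith(.x, .y))) == 1,
         which(endsWith(.x, .y)),
         NA)
}

# One call per suffix in y; the result is named by the elements of y
idx <- sapply(y, function(suffix) idx_endsWith(x, suffix))
```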
I have written several functions and want to only apply them to the last two columns of an input CSV file. The question is how to convert the last two columns to vectors and apply my functions to them?
myAvg <- function(anyVector){
average <- sum(anyVector) / length(anyVector)
return(average)
}
mySD <- function(anyVector){
std_Dev <- sqrt(sum((anyVector - mean(anyVector)) ^ 2 / (length(anyVector) - 1)))
return(std_Dev)
}
myRange <- function(anyVector){
myRange <- max(anyVector) - min(anyVector)
return(myRange)
}
data <- read.csv("CardioGoodnessFit.csv")
print(data)
As @Mako212 suggested, this can be achieved simply by using the apply function in R:
avg = apply(data[,c('Income','Miles')],MARGIN=2,FUN=myAvg)
sdev = apply(data[,c('Income','Miles')],MARGIN=2,FUN=mySD)
The function myAvg will be applied to each column of the subset of data. The columns of interest can be specified either by name or by column number, given as a vector. apply is generally used on matrix or data.frame objects. MARGIN controls whether FUN is applied column-wise (MARGIN = 2), row-wise (MARGIN = 1), or to each individual element (MARGIN = c(1, 2)).
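Since the question asks for the last two columns specifically, they can also be picked by position rather than by name. The sketch below uses a small made-up data frame in place of the CSV file, whose contents are not shown here, so the column names and values are assumptions:

```r
myAvg <- function(anyVector) {
  sum(anyVector) / length(anyVector)
}

mySD <- function(anyVector) {
  sqrt(sum((anyVector - mean(anyVector))^2 / (length(anyVector) - 1)))
}

# Hypothetical stand-in for read.csv("CardioGoodnessFit.csv")
data <- data.frame(Age    = c(18, 25, 40),
                   Income = c(30000, 45000, 60000),
                   Miles  = c(50, 100, 150))

# Select the last two columns by position
last_two <- data[, (ncol(data) - 1):ncol(data)]

# sapply() passes each column to the function as a plain vector
avgs <- sapply(last_two, myAvg)
sds  <- sapply(last_two, mySD)
```

Because a data frame is a list of columns, sapply() hands each column to the function as a vector, so no explicit conversion is needed.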
There is no need to convert to vectors (or in this case, even to write functions) if you use e.g. dplyr:
library(dplyr)
# means
data %>% summarise(avg = mean(Income))
data %>% summarise(avg = mean(Miles))
# standard deviations
data %>% summarise(sdev = sd(Income))
data %>% summarise(sdev = sd(Miles))
# range
data %>% summarise(range = max(Income) - min(Income))
data %>% summarise(range = max(Miles) - min(Miles))
Let's say I have a function that takes two vectors:
someFunction <- function(x,y){
return(mean(x+y));
}
And say I have some data
toy <- data.frame(a=c(1,1,1,1,1,2,2,2,2,2), b=rnorm(10), c=rnorm(10))
What I want to do is return the result of the function someFunction for each value of toy$a, i.e. I want to achieve the same result as the code
toy$d <- toy$b + toy$c
result <- aggregate(toy$d, list(toy$a), mean)
However, in real life, the function someFunction is way more complicated and it needs two inputs, so the workaround in this toy example is not possible. So, what I want to do is:
Group the data set according to one column.
For each value in the column (in the toy example, that's 1 and 2), take two vectors v1, v2, and return someFunction(v1,v2)
Check out the dplyr package, specifically the group_by and summarize functions.
Assuming that you want to compute someFunction(b, c) for each value of a, the syntax would look like
library(dplyr)
toy %>% group_by(a) %>% summarize(someFunction(b, c))
Or, with data.table:
library(data.table)
toy <- data.table(toy)
toy[, list(New_col = someFunction(b, c)), by = 'a']
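As a quick sanity check on the grouped approach, the sketch below compares the dplyr result with the aggregate() workaround from the question on seeded random data. The seed and the column name result are assumptions introduced for this example:

```r
library(dplyr)

someFunction <- function(x, y) {
  mean(x + y)
}

set.seed(1)  # arbitrary seed so the example is reproducible
toy <- data.frame(a = rep(c(1, 2), each = 5), b = rnorm(10), c = rnorm(10))

# Grouped summary: someFunction receives the b and c vectors of each group
res <- toy %>%
  group_by(a) %>%
  summarise(result = someFunction(b, c))

# Workaround from the question, for comparison
check <- aggregate(toy$b + toy$c, list(toy$a), mean)
```

Within each group, b and c refer only to that group's rows, which is what lets a two-argument function be used directly.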