I need to apply a function that takes two arguments of different lengths to each item in a vector. The function finds the value in the first argument that ends with the characters in the second argument and returns its index. (The goal is a left join on two tables using a fuzzy join, but regex_left_join crashed, so this is the first step of a workaround.)
Example input:
x <- c("492820UA665110", "492820UA742008", "493600N077751", "671884RB25355")
y <- c("RB25355", "S56890")
Function:
idx_endsWith <- function(.x, .y) {
  # Index of the single element of .x that ends with .y; NA if none or several match
  idx <- which(endsWith(.x, .y))
  if (length(idx) == 1) idx else NA
}
So for example,
> idx_endsWith(x, y[1])
[1] 4
How can I apply this function to each element in y without using a loop? I need to vectorize the function, but mapply doesn't work because the vectors need to be the same length. I'm looking for a solution in dplyr.
For dplyr, as you requested, this should work:
library(dplyr)

data.frame(y, stringsAsFactors = FALSE) %>%
  rowwise() %>%
  mutate(index = idx_endsWith(x, y))
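Outside dplyr, the same per-element lookup can be sketched in base R with vapply, which keeps x fixed and iterates over y. This is only a minimal sketch: the helper is redefined here so the block is self-contained, and it returns NA_integer_ (an assumption, not part of the original function) so the result type is always integer.

```r
x <- c("492820UA665110", "492820UA742008", "493600N077751", "671884RB25355")
y <- c("RB25355", "S56890")

# Index of the single element of .x that ends with .y, otherwise NA
idx_endsWith <- function(.x, .y) {
  idx <- which(endsWith(.x, .y))
  if (length(idx) == 1) idx else NA_integer_
}

# Apply the helper to each element of y while x stays fixed
idx <- vapply(y, function(suffix) idx_endsWith(x, suffix),
              FUN.VALUE = integer(1), USE.NAMES = FALSE)
idx
```

mapply can also be made to work here by passing x through MoreArgs = list(.x = x), so that only y is iterated over.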
Here is my problem: I have a function that takes one double and returns several doubles, and I would like to put the results directly into separate columns of a tibble. I tried several methods with mutate() and/or map().
I do not want to call the function n times and pick a different element of the returned list each time.
Here is a generic example of what I am trying to do:
library(tidyverse)
## a random function that return more than 1 element
f <- function(x){
return(list(x/2,x**2))
}
## a tibble with a column on which I apply the function
tib <- tibble( x = rep(100:120))
tib%>%
mutate(y = f(x))
## error:
#Error : Problem with `mutate()` input `y`.
#x Input `y` can't be recycled to size 21.
#ℹ Input `y` is `f(x)`.
#ℹ Input `y` must be size 21 or 1, not 2.
## what I want to avoid:
tib %>%
mutate(y = f(x)[[1]], z = f(x)[[2]])
I have been struggling with this issue for some time now. Apologies if it has been answered already; I searched for an answer on several forums but found nothing.
Thank you in advance !
You can solve this using nested data frames (or list columns depending on your preferred terminology).
The code below is a generic example of how to do this using mutate and map, which will create the nested data frame, and then unnest:
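library(tidyverse)

# f returns a one-row tibble for each input value
f <- function(x) {
  tibble(y = x * 2, z = x * 3)
}

tib <- tibble(x = 100:120) %>%
  mutate(data = map(x, f)) %>%  # list column of one-row tibbles
  unnest(cols = data)           # expand the list column into columns y and z

tib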
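If the function is vectorised and returns a data frame, a shorter variant may be possible: as far as I know, dplyr 1.0.0 and later unpack an unnamed data-frame result of mutate() into separate columns. The sketch below assumes that behaviour and reuses a hypothetical f similar to the one above.

```r
library(dplyr)

# Hypothetical helper: vectorised, returns one column per result
f <- function(x) {
  tibble(y = x * 2, z = x * 3)
}

tib <- tibble(x = 100:120) %>%
  mutate(f(x))  # unnamed data-frame result becomes columns y and z

tib
```

This calls f only once on the whole column, so it avoids both the list column and the unnest step.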
I'm writing a program that calculates the difference between one element of a dataset and the rest of the elements. I'm using dplyr mutate and I need to pass the entire row as an argument to a function that calculates the difference. Using iris as an example:
#Difference function
diff_func <- function (e1, e2) {
return(sum(e1-e2))
}
chosenElement <- iris[1,1:4] # Chosen element
elements <- iris[10:50,1:4] # Elements to compare to
elements %>%
rowwise() %>%
mutate(difference=diff_func(chosenElement, c(Petal.Width, Petal.Length, Sepal.Width, Sepal.Length)))
This works, but as I use the entire row, I would like to use something like "this" or "row" instead of specifying all the columns of the row:
elements %>%
rowwise() %>%
mutate(difference=diff_func(chosenElement, row))
Does anyone know if this can be done?
We can do this very easily in base R by replicating chosenElement so that its dimensions match those of elements:
elementsNew <- elements - chosenElement[col(elements)]
Note that mutate transforms the values of one or more columns into a single column (other types of objects can be stored in a list column). Assuming 'difference' should be computed for each column of 'elements' against the corresponding element of 'chosenElement', the mutate call with diff_func is not doing that.
Update
Based on the clarification, it seems we need to take the difference between each column and the corresponding value of chosenElement (replicated here) and then apply rowSums:
elements %>%
mutate(difference = rowSums(. - chosenElement[col(.)]))
A purrr base combination:
do.call(cbind,purrr::map2(elements,chosenElement,function(x,y) x-y))
Since (a - d) + (b - e) + (c - f) == (a + b + c) - (d + e + f), it's just a difference between row sums of the elements and sum of chosenElements, which you can do within base R:
elements$dfrnce <- rowSums(elements) - sum(chosenElement)
Or, in dplyr:
elements %>%
mutate(dfrnce = rowSums(.) - sum(chosenElement))
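For the original wish to refer to the whole row, a possible sketch uses c_across(everything()) inside rowwise(), which is available in dplyr 1.0.0 and later. The name chosen is introduced here for illustration; it holds chosenElement as a plain numeric vector. Note that diff_func(chosenElement, row) computes chosen minus row, which is the opposite sign of rowSums(elements) - sum(chosenElement).

```r
library(dplyr)

diff_func <- function(e1, e2) sum(e1 - e2)

chosen   <- unlist(iris[1, 1:4])  # chosen element as a numeric vector
elements <- iris[10:50, 1:4]      # elements to compare to

res <- elements %>%
  rowwise() %>%
  # c_across(everything()) gathers all columns of the current row into one vector
  mutate(difference = diff_func(chosen, c_across(everything()))) %>%
  ungroup()

res
```

The row-wise version is slower than the rowSums approach, so it is mainly useful when the function cannot be expressed with vectorised arithmetic.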
I have a data frame with 500 observations for each of 3106 US counties. I would like to merge that data frame with a SpatialPolygonsDataFrame.
I have tried a few approaches. I have found that if I filter the data by a variable iter_id, I can use sp::merge() to merge the datasets, and I presume I can then rbind the pieces back together. sp::merge() does not allow a right or full join, and the spatial data needs to be in the left position, so a many-to-one merge will not work. The really nasty way I have tried is:
(I am not sure how to represent the dataframe with the variables of interest here)
library(choroplethr)
data(continental_us_states)
us <- tigris::counties(continental_us_states)
gm_y_corr <- tribble(~GEOID,~iter_id,~neat_variable,
01001,1,"value_1",
01003,1,"value_2",
...
01001,2,"value_3",
01003,2,"value_4",
...
01001,500,"value_5",
01003,500,"value_6")
filtered <- gm_y_corr %>%
filter(iter_id ==1)
us.gm <- sp::merge(us, filtered ,by='GEOID')
for (j in 2:500) {
tmp2 <- gm_y_corr %>%
filter(iter_id == j)
tmp3 <- sp::merge(us, tmp2,by='GEOID')
us.gm <- rbind(us.gm,tmp3)
}
I know there must be a better way. I have tried group_by, but multiple matches are found, so I must not be understanding group_by correctly.
> geo_dat <- gm_y_corr %>%
+ group_by(iter_id)%>%
+ sp::merge(us, .,by='GEOID')
Error in .local(x, y, ...) : non-unique matches detected
I would like to merge the spatial data with the interesting data.
Here you can use the splitting functionality of base R in split or the more recent dplyr::group_split. This will separate your data frame according to your splitting variable and you can lapply or purrr::map a function such as merge to it and then dplyr::bind_rows to collapse the returned list back to a dataframe. Since I can't manage to get the us data I have just written what I suspect would work.
gm_y_corr %>%
  group_by(iter_id) %>%      # group
  group_split() %>%          # split into a list of data frames
  lapply(., function(x) {    # merge each list element with us
    merge(us, x, by = "GEOID")
  }) %>%
  bind_rows()                # collapse to data frame
Equivalently, the same can be done with base R's split. The newer group_by %>% group_split is a little more intuitive in my opinion.
gm_y_corr %>%
split(.$iter_id) %>%
lapply(., function(x){
merge(us, x, by = "GEOID")
}) %>%
bind_rows()
If you wanted to use only group_by, you would have to follow it with the dplyr::do function, which I believe does something similar to the above without you having to split the data yourself.
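The split, apply, combine pattern described above can be tried on ordinary data frames. The sketch below uses small made-up tables (regions and obs are hypothetical stand-ins, not the real county shapes) and base R only; whether the final combine step works the same way for SpatialPolygonsDataFrame objects is not verified here.

```r
# Hypothetical stand-in for the spatial table: one row per GEOID
regions <- data.frame(GEOID = c("01001", "01003"),
                      name  = c("Autauga", "Baldwin"))

# Hypothetical stand-in for gm_y_corr: several iterations per GEOID
obs <- data.frame(
  GEOID   = rep(c("01001", "01003"), times = 3),
  iter_id = rep(1:3, each = 2),
  value   = 1:6
)

pieces <- split(obs, obs$iter_id)                 # one data frame per iter_id
merged <- lapply(pieces, function(piece) {
  merge(regions, piece, by = "GEOID")             # GEOID is unique within each piece
})
result <- do.call(rbind, merged)                  # stack the pieces again

result
```

Within each piece GEOID is unique, which is why the per-piece merge avoids the non-unique matches error.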
In R, I would like to create a function that shows selected columns in the Viewer for comparison. Assuming my data frame is df1:
compare_col <- function(x){
select(df1, x) %>%
View()
}
If I define the function with a single argument x, I can only pass one column.
compare_col <- function(x)
compare_col("col_1")
Only if I define the function with two arguments, say x and y, can I pass two columns.
compare_col <- function(x, y)
compare_col("col_1", "col_2")
How can I create a function that is flexible enough to accept any number of columns?
You can use the rlang package to achieve this.
This lets you pass a character vector of column names: syms converts the strings to symbols, and the !!! operator splices them into select, where they are evaluated against the data frame.
library(dplyr)
#library(rlang)
compare_col <- function(x){
df1 %>% select(!!! syms(x)) %>%
View()
}
compare_col(c("col1", "col2"))
Just realised, all I actually needed to do was vectorise the inputs when calling the function.
compare_col(c("col1", "col2"))
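Another way to accept any number of columns is to collect them with ... and pass the resulting character vector to all_of(). This is a sketch under a few assumptions: the helper takes the data frame as an explicit argument instead of relying on a global df1, the toy df1 is made up, and the result is returned rather than sent to View() so it can be inspected non-interactively.

```r
library(dplyr)

# Hypothetical variant: data frame first, then any number of column names
compare_col <- function(data, ...) {
  cols <- c(...)                 # combine all names into one character vector
  select(data, all_of(cols))     # all_of() errors clearly if a name is missing
}

df1 <- data.frame(col1 = 1:3,
                  col2 = letters[1:3],
                  col3 = c(TRUE, FALSE, TRUE))

compare_col(df1, "col1", "col3")       # separate arguments
compare_col(df1, c("col1", "col2"))    # or a single vector
```

In an interactive session the result can be piped into View().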
I have written several functions and want to only apply them to the last two columns of an input CSV file. The question is how to convert the last two columns to vectors and apply my functions to them?
myAvg <- function(anyVector){
average <- sum(anyVector) / length(anyVector)
return(average)
}
mySD <- function(anyVector){
std_Dev <- sqrt(sum((anyVector - mean(anyVector)) ^ 2 / (length(anyVector) - 1)))
return(std_Dev)
}
myRange <- function(anyVector){
myRange <- max(anyVector) - min(anyVector)
return(myRange)
}
data <- read.csv("CardioGoodnessFit.csv")
print(data)
As @Mako212 suggested, this can be achieved simply with the apply function in R:
avg = apply(data[,c('Income','Miles')],MARGIN=2,FUN=myAvg)
sdev = apply(data[,c('Income','Miles')],MARGIN=2,FUN=mySD)
The function myAvg will be applied to each column of the subset of data. The columns of interest can be specified as a vector of either column names or column numbers. apply is generally used on a matrix or data.frame object. MARGIN controls whether FUN is applied column-wise (MARGIN = 2), row-wise (MARGIN = 1), or to each element of the data (MARGIN = c(1, 2)).
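Since the question asks for the last two columns specifically, they can also be picked by position rather than by name. The sketch below uses a small made-up data frame in place of the CSV (the values are hypothetical) and sapply, which applies a function to each column of a data frame.

```r
# Hypothetical stand-in for read.csv("CardioGoodnessFit.csv")
data <- data.frame(Product = c("A", "B", "C"),
                   Income  = c(30000, 45000, 60000),
                   Miles   = c(50, 100, 150))

myAvg <- function(anyVector) sum(anyVector) / length(anyVector)
mySD  <- function(anyVector) {
  sqrt(sum((anyVector - mean(anyVector))^2) / (length(anyVector) - 1))
}

n <- ncol(data)
last_two <- data[, (n - 1):n]     # last two columns, whatever their names are

avg  <- sapply(last_two, myAvg)   # named vector, one value per column
sdev <- sapply(last_two, mySD)
```

A single column can be pulled out as a plain vector with last_two[[1]] or data$Miles if a function needs one.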
There is no need to convert to vectors (or in this case, even to write functions) if you use e.g. dplyr:
library(dplyr)
# means
data %>% summarise(avg = mean(Income))
data %>% summarise(avg = mean(Miles))
# standard deviations
data %>% summarise(sdev = sd(Income))
data %>% summarise(sdev = sd(Miles))
# range
data %>% summarise(range = max(Income) - min(Income))
data %>% summarise(range = max(Miles) - min(Miles))
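The six summarise calls can probably be collapsed into one with across(), available in dplyr 1.0.0 and later. This is a sketch on a made-up data frame (the values are hypothetical); the output columns are named after the column and the function, for example Income_avg.

```r
library(dplyr)

# Hypothetical stand-in for the CSV data
data <- data.frame(Product = c("A", "B", "C"),
                   Income  = c(30000, 45000, 60000),
                   Miles   = c(50, 100, 150))

stats <- data %>%
  summarise(across(
    c(Income, Miles),
    list(avg   = mean,
         sdev  = sd,
         range = ~ max(.x) - min(.x))   # lambda computing max minus min
  ))

stats
```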