Use `dplyr` to avoid `for` loop: calculate distance to observations - r

I have two data sets A and B, and for each observation in A I want to calculate a distance (e.g. a Euclidean distance, L1 distance, or something else) to each observation in B (the calculation of the distance is based on the variables in the data sets). An observation from A should then be related to the observation in B for which this distance is minimal.
For example, if A has 5000 observations and B has 10000 observations then
for(i in 1:5000)
{
x = data.frame(x = numeric(), y = numeric())
for(j in 1:10000)
{
x[j,] = distance(A[i,], B[j,])
}
A[i,]$associated_row_B = x[which.min(x[,1]),2]
}
does basically what I want (I still have to handle the case where several observations have the same distance). But since I started using dplyr I have hardly ever had to use a for loop. My solution even needs two loops, so I wonder whether the for loops can be avoided with a solution from dplyr/tidyverse.
A very basic example:
A:
i a b
1 -0.5920377 a
2 0.4263199 b
3 0.6737029 a
4 1.3063658 c
5 0.1314103 d
B:
i a b
1 -0.30201541 a
2 -0.07093386 b
3 0.96317764 c
4 -0.33303061 d
5 -1.00834895 d
and the distance function:
distance = function(x,y) return(c((x[2] - y[2])^2 + abs(x[3] - y[3]), y[1]))
The first element of the return value is the actual distance, the second value is the identifier from B.

Fair warning: this is going to be pretty inefficient for large datasets!
You can accomplish this using crossing from tidyr and slice from dplyr.
First, let's create two dummy dataframes, A_df and B_df
A_df <- data.frame(
observation_A = runif(100),
id_A = 1:100
)
B_df <- data.frame(
observation_B = runif(50),
id_B = 1:50
)
For clarity, I've kept the column names unique between A_df and B_df. Next, we'll use tidyr::crossing to find every combination of rows between the two dataframes. Then we use mutate to calculate the distance (here I arbitrarily took the absolute value of their difference, but you can apply your custom distance function here). Finally, we group by id_A, and keep only the row with the minimum distance using slice (and base R which.min).
library(tidyverse)
full_df <- A_df %>%
crossing(B_df) %>%
mutate(distance = abs(observation_A-observation_B)) %>%
group_by(id_A) %>%
slice(which.min(distance))
Looking at full_df, we get what we were hoping for:
> full_df
# A tibble: 100 x 5
# Groups: id_A [100]
observation_A id_A observation_B id_B distance
<dbl> <int> <dbl> <int> <dbl>
1 0.826 1 0.851 44 0.0251
2 0.903 2 0.905 3 0.00176
3 0.371 3 0.368 18 0.00305
4 0.554 4 0.577 34 0.0232
5 0.656 5 0.654 10 0.00268
6 0.120 6 0.110 37 0.0101
7 0.991 7 0.988 6 0.00244
8 0.983 8 0.988 6 0.00483
9 0.325 9 0.318 45 0.00649
10 0.860 10 0.864 40 0.00407
# ... with 90 more rows
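The same approach can be sketched on the example data from the question. This is a minimal sketch under a few assumptions: since b is categorical, the abs(x[3] - y[3]) term is replaced by a 0/1 mismatch indicator, the `_A`/`_B` suffixes are only there to keep column names unique, and slice_min with with_ties = FALSE keeps the first row when several observations in B are equally close.

```r
library(dplyr)
library(tidyr)

A <- data.frame(i = 1:5,
                a = c(-0.5920377, 0.4263199, 0.6737029, 1.3063658, 0.1314103),
                b = c("a", "b", "a", "c", "d"),
                stringsAsFactors = FALSE)
B <- data.frame(i = 1:5,
                a = c(-0.30201541, -0.07093386, 0.96317764, -0.33303061, -1.00834895),
                b = c("a", "b", "c", "d", "d"),
                stringsAsFactors = FALSE)

matched <- crossing(
  rename_with(A, ~ paste0(.x, "_A")),   # i_A, a_A, b_A
  rename_with(B, ~ paste0(.x, "_B"))    # i_B, a_B, b_B
) %>%
  # Assumed distance: squared difference in a plus 1 if b differs
  mutate(distance = (a_A - a_B)^2 + as.numeric(b_A != b_B)) %>%
  group_by(i_A) %>%
  # Keep exactly one closest row of B per row of A, even with ties
  slice_min(distance, n = 1, with_ties = FALSE) %>%
  ungroup()

matched
```

The intermediate table has nrow(A) * nrow(B) rows, so for 5000 x 10000 observations this means 50 million rows in memory, which is the inefficiency the warning at the top refers to.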

Related

R, applying function on multiple elements

My function is defined as follows: I subset a dataframe to a specific name and return the first 5 elements.
Bestideas <- function(x) {
topideas <- subset(Masterall, Masterall$NAME == x) %>%
slice(1:5)
return(topideas)
}
I would then like to apply the function, to an entire df (with one column of Names), so that the function is applied to each name on the list and binds it into a new df, containing the first five ideas from all unique names. Through research - I have arrived at the following:
bestideas_collection = lapply(UNIQUE_NAMES_DF, Bestideas) %>% bind_rows()
However, it doesn't work. It returns a dataframe with only five ideas in total, from 5 different names. As there are 30 unique names in my list, I expected 30*5 = 150 ideas in the "bestideas_collection" variable. I get this error message:
"longer object length is not a multiple of shorter object length"
Further, if I do it manually for each name, it works just as intended - which makes me think that the function works fine, and that the issue is with the lapply function.
holder <- Bestideas("NAME 1")
bestideas_collection <- bind_rows(bestideas_collection,holder)
holder <- Bestideas("NAME 2")
bestideas_collection <- bind_rows(bestideas_collection,holder)
holder <- Bestideas("NAME 3")
bestideas_collection <- bind_rows(bestideas_collection,holder)
...
Can anyone help me if I am using the function wrong, or do you have alternative methods of doing it? I have already tried with a for-loop - but it gives me the same error as with the lapply function.
I don't have your data, so I tried to reproduce your problem on a fabricated set. I was unable to do so. With a very simple case, your function works as expected.
library(dplyr)
set.seed(123)
Masterall <- data.frame(NAME = rep(LETTERS, 10), value = rnorm(260)) %>%
group_by(NAME) %>% arrange(desc(value))
UNIQUE_NAMES_DF <- LETTERS
lapply(UNIQUE_NAMES_DF, Bestideas) %>% bind_rows()
# A tibble: 130 x 2
# Groups: NAME [26]
NAME value
<chr> <dbl>
1 A 1.65
2 A 1.44
3 A 0.838
4 A 0.563
5 A 0.181
6 B 1.37
7 B 0.452
8 B 0.153
9 B -0.0450
10 B -0.0540
# ... with 120 more rows
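Is your UNIQUE_NAMES_DF a data.frame? If so, that is the trouble. lapply iterates over the elements of a list, and a data.frame is a list of columns, so Bestideas is called once per column with the whole column as x, rather than once per name. The comparison Masterall$NAME == x then recycles x against NAME, which is likely where the "longer object length" message comes from. Here is an example: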
UNIQUE_NAMES_DF <- data.frame(NAME = LETTERS, other = sample(letters))
lapply(UNIQUE_NAMES_DF, Bestideas) %>% bind_rows()
# A tibble: 12 x 2
# Groups: NAME [11]
NAME value
<chr> <dbl>
1 C -0.785
2 D 0.385
3 E -0.371
4 F 1.13
5 I 1.10
6 N -0.641
7 P -1.02
8 Q -0.0341
9 U -1.07
10 X -0.0834
11 Z 1.26
12 Z -0.739
I do not know the structure of your UNIQUE_NAMES_DF, but if you just feed the column with the names into your lapply, it should work:
lapply(UNIQUE_NAMES_DF$NAME, Bestideas) %>% bind_rows()
# A tibble: 130 x 2
# Groups: NAME [26]
NAME value
<chr> <dbl>
1 A 1.65
2 A 1.44
3 A 0.838
4 A 0.563
5 A 0.181
6 B 1.37
7 B 0.452
8 B 0.153
9 B -0.0450
10 B -0.0540
# ... with 120 more rows
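To see the difference in isolation, here is a small sketch (names_df is a made-up data frame, not your data) that counts how often the function is called and how long its argument is in each case:

```r
names_df <- data.frame(NAME = c("A", "B", "C"), other = c("x", "y", "z"))

# A data frame is a list of columns, so the function runs once per column
# and receives the whole column as its argument.
per_column <- lapply(names_df, function(x) length(x))

# Passing the column itself runs the function once per name.
per_name <- lapply(names_df$NAME, function(x) length(x))

length(per_column)  # number of calls when iterating over the data frame
length(per_name)    # number of calls when iterating over the NAME column
```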

R grouped time series correlations with tidyverse

I want time series correlations in a grouped data frame. Here's a sample dataset:
x <- cbind(expand.grid(type = letters[1:4], time = seq(1:4), kind = letters[5:8]), value = rnorm(64)) %>% arrange(type, time, kind)
which produces 64 rows of the variables type, time, kind and value.
I want a time series correlation of the values for each kind grouped by type. Think of each type and time combination as an ordered vector of 4 values. I group by type and time, then arrange by kind, then remove kind.
y <- x %>% group_by(type) %>% arrange(type, time, kind) %>% select(-kind)
I can then group y by type and time and nest such that all the values are together in the data variable, regroup by type only and create a new variable which is the lead data.
z <- y %>% group_by(type, time) %>% nest(value) %>% group_by(type) %>% mutate(ahead = lead(data))
Now I want to run mutate(R = cor(data, ahead)), but I can't seem to get the syntax correct.
I've also tried mutate(R = cor(data$value, ahead$value)) and mutate(R = cor(data[1]$value, ahead[1]$value)), to no avail.
The error I get from cor is: supply both 'x' and 'y' or a matrix-like 'x'.
How do I reference the data and ahead variables as vectors to run with cor?
Ultimately, I'm looking for a 16 row data frame with columns type, time, and R where R is a single correlation value.
Thank you for your attention.
We can use map2_dbl from purrr to pass data and ahead to the cor function in parallel, one pair of list elements at a time.
library(dplyr)
z %>%
mutate(R = purrr::map2_dbl(data, ahead, cor)) %>%
select(-data, -ahead)
# type time R
# <fct> <int> <dbl>
# 1 a 1 0.358
# 2 a 2 -0.0498
# 3 a 3 -0.654
# 4 a 4 1
# 5 b 1 -0.730
# 6 b 2 0.200
# 7 b 3 -0.928
# 8 b 4 1
# 9 c 1 0.358
#10 c 2 0.485
#11 c 3 -0.417
#12 c 4 1
#13 d 1 0.140
#14 d 2 -0.448
#15 d 3 -0.511
#16 d 4 1
In base R, we can use mapply
z$R <- mapply(cor, z$data, z$ahead)
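As a possible alternative, the same correlations can be computed without nesting by taking the lead of value within each type/kind pair. This is only a sketch under assumptions: the data are regenerated with an arbitrary seed, and the last time point of each type, which has no following time point, is set to NA explicitly (the value 1 in the output above appears to come from cor being called with a NULL ahead).

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only for reproducibility of the sketch
x <- expand.grid(type = letters[1:4], time = 1:4, kind = letters[5:8]) %>%
  mutate(value = rnorm(n())) %>%
  arrange(type, time, kind)

res <- x %>%
  group_by(type, kind) %>%
  arrange(time, .by_group = TRUE) %>%
  # value of the same type and kind at the next time point
  mutate(ahead = lead(value)) %>%
  group_by(type, time) %>%
  # correlate the 4 kinds at this time with the 4 kinds at the next time
  summarise(R = if (all(is.na(ahead))) NA_real_ else cor(value, ahead),
            .groups = "drop")

res
```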

How to divide each of a range a variables by a second range of variables in R

I have a range of columns containing the numerators of certain diseases, and a range of columns containing the denominators of the same diseases. I want to loop through each of the numerator columns dividing by the appropriate denominator column creating a percentage column for each disease.
All my columns follow the same name format: disease1_num, disease2_num, disease1_den, disease2_den.
I want to divide disease1_num/disease1_den*100 to create disease1_perc, then disease2_num/disease2_den*100 to create disease2_perc etc.
There are approximately 20 diseases in my dataset.
I am mainly using tidyverse commands.
I have tried using gather to create two datasets, one with the numerators and one with the denominators. I extracted the disease name, joined them together, calculated the percentage, spread the dataset again, and joined the result back to the original dataset. This works, but it is long-winded; ideally I would like to do this in place in the original dataset.
# A tibble: 3 x 5
id disease1_num disease2_num disease1_den disease2_den
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 4 12 15
2 2 8 6 14 16
3 3 10 8 17 18
df_num <- df %>%
select(id,disease1_num:disease2_num) %>%
gather(key="num_indicator",value="num",disease1_num:disease2_num) %>%
mutate(indicator=str_remove(num_indicator,'_num'))
df_den <- df%>%
select(id, disease1_den:disease2_den) %>%
gather(key="den_indicator",value="den",disease1_den:disease2_den) %>%
mutate(indicator=str_remove(den_indicator,'_den'))
df_numden <- left_join(df_num,df_den,c('id','indicator'))
df_perc <- df_numden %>%
mutate(perc_indicator=str_replace(den_indicator,'den','perc'),
perc=num/den*100) %>%
select(id, perc_indicator:perc) %>%
spread(perc_indicator,perc)
df_final <- left_join(df,df_perc,'id')
We can just use grep to get the column names and divide directly.
num_cols <- grep("num$", names(df), value = TRUE)
den_cols <- grep("den$", names(df), value = TRUE)
df[sub("_num","_perc", num_cols)]<- df[num_cols]/df[den_cols] * 100
df
# id disease1_num disease2_num disease1_den disease2_den disease1_perc disease2_perc
#1 1 5 4 12 15 41.7 26.7
#2 2 8 6 14 16 57.1 37.5
#3 3 10 8 17 18 58.8 44.4
Note that this relies on num_cols and den_cols having the same length and listing the diseases in the same order, since the division pairs the columns by position.
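One way to guard against mismatched pairs, sketched here on the example data from the question, is to derive each denominator name from its numerator name instead of grepping for both, and to stop if a denominator is missing:

```r
df <- data.frame(id = 1:3,
                 disease1_num = c(5, 8, 10),
                 disease2_num = c(4, 6, 8),
                 disease1_den = c(12, 14, 17),
                 disease2_den = c(15, 16, 18))

num_cols <- grep("_num$", names(df), value = TRUE)
# Build the matching denominator name for every numerator, so the
# pairs line up regardless of the column order in df
den_cols <- sub("_num$", "_den", num_cols)
stopifnot(all(den_cols %in% names(df)))

df[sub("_num$", "_perc", num_cols)] <- df[num_cols] / df[den_cols] * 100
df
```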

Can I combine two strings into one, and use the combined name to assign a data frame to?

I want to create a function to take in a dataframe and a string assigned GENDER. The function will find the mean and sd of each variable in the df by GENDER and return a dataframe with all that info to a new df named "GENDERstats" that I could use in further analysis later on.
I can get everything I want up until I name the new "GENDERstats" df; then it throws an error.
Here's what I have so far, with dummy data:
df <- data.frame(GENDER=c("M","F","M","F","M","F"),HELP=c(5,4,2,7,5,5),CARE=c(6,4,7,8,5,4),TRUST=c(6,5,3,6,8,6),SERVE=c(6,5,7,8,7,6))
my.func <- function(dat, bias){
datFrame <- data.frame()
for(i in 2:5){
d1 <- aggregate(dat[,i],by=list(dat[,bias]),FUN=mean,na.rm=TRUE)
d2 <- aggregate(dat[,i],by=list(dat[,bias]),FUN=sd,na.rm=TRUE)
d1$sd <- d2$x
d1$Var <- i
datFrame <- rbind(datFrame,d1)
}
# paste(bias,"stats") <- datFrame
}
I get the df I want in "datFrame", but I want to paste the bias variable and "stats" to make a new data frame. I will be doing this with several different "biases"
I want the new df to look like this:
Group.1 x sd Var
1 F 5.333333 1.5275252 2
2 M 4.000000 1.7320508 2
3 F 5.333333 2.3094011 3
4 M 6.000000 1.0000000 3
5 F 5.666667 0.5773503 4
6 M 5.666667 2.5166115 4
7 F 6.333333 1.5275252 5
8 M 6.666667 0.5773503 5
and from there I can plot graphs or only focus on means or sds
I'm not quite sure how to fix your function (a couple of details are missing), but you can get the same results without a user-defined function or for loop. The following iterates over the non-GENDER columns, generates means and SDs with aggregate, and then rbinds the resulting dataframes with do.call:
do.call("rbind", lapply(2:ncol(df),
function(j) {
df_out <- aggregate(df[j], list(df$GENDER), "mean")
df_out[3] <-
aggregate(df[j], list(df$GENDER), "sd")[[2]]
df_out[4] <- j
`names<-`(df_out, c("gender", "x", "sd", "var"))
}))
#### OUTPUT ####
gender x sd var
1 F 5.33333 1.52753 2
2 M 4.00000 1.73205 2
3 F 5.33333 2.30940 3
4 M 6.00000 1.00000 3
5 F 5.66667 0.57735 4
6 M 5.66667 2.51661 4
7 F 6.33333 1.52753 5
8 M 6.66667 0.57735 5
I'm not sure whether there is a slicker way of doing this in base R. Personally, I would go with tidyr's gather plus dplyr's group_by + summarise, which is much cleaner and easier to understand. The output is pretty much the same as the above, just in a different order. The rounding only looks different because of how tibbles are printed:
library(dplyr)
library(tidyr)
df %>%
gather(var, val, -GENDER) %>%
group_by(GENDER, var) %>%
summarise(x = mean(val), sd = sd(val))
#### OUTPUT ####
# A tibble: 8 x 4
# Groups: GENDER [2]
GENDER var x sd
<chr> <chr> <dbl> <dbl>
1 F CARE 5.33 2.31
2 F HELP 5.33 1.53
3 F SERVE 6.33 1.53
4 F TRUST 5.67 0.577
5 M CARE 6 1
6 M HELP 4 1.73
7 M SERVE 6.67 0.577
8 M TRUST 5.67 2.52
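On the naming part of the question: the left-hand side of `<-` cannot be a paste() call, which is why the commented-out line fails. A minimal sketch of one way around this, assuming a named list is acceptable, is to return the data frame from the function and store it under the pasted name. Here group_stats and stats are hypothetical names chosen for illustration.

```r
df <- data.frame(GENDER = c("M", "F", "M", "F", "M", "F"),
                 HELP = c(5, 4, 2, 7, 5, 5), CARE = c(6, 4, 7, 8, 5, 4),
                 TRUST = c(6, 5, 3, 6, 8, 6), SERVE = c(6, 5, 7, 8, 7, 6))

# Hypothetical helper: mean and sd of every other column, by the bias column
group_stats <- function(dat, bias) {
  value_cols <- setdiff(names(dat), bias)
  out <- lapply(value_cols, function(v) {
    d <- aggregate(dat[[v]], by = list(dat[[bias]]), FUN = mean, na.rm = TRUE)
    d$sd <- aggregate(dat[[v]], by = list(dat[[bias]]), FUN = sd, na.rm = TRUE)$x
    d$Var <- v
    d
  })
  do.call(rbind, out)  # return the result instead of assigning inside
}

# Store each result under a name built from the bias variable
stats <- list()
bias <- "GENDER"
stats[[paste0(bias, "stats")]] <- group_stats(df, bias)
stats$GENDERstats
```

If a separate variable is really needed, base R's assign(paste0(bias, "stats"), group_stats(df, bias)) creates one in the calling environment, but a named list is usually easier to loop over when there are several biases.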

Function in R (Merge Bases)

I have the following tables in R.
table1<-data.frame(group=c(1,1,1,2,2,2),price=c(10,20,30,10,20,30),
visits=c(100,200,300,150,250,350))
table1<-table1 %>% arrange(price) %>% split(.$group)
$`1`
group price visits
1 1 10 100
3 1 20 200
5 1 30 300
$`2`
group price visits
2 2 10 150
4 2 20 250
6 2 30 350
group_1<-data.frame(case_1=c(0.2,0.3,0.4),case_2=c(0.22,0.33,0.44))
group_2<-data.frame(case_1=c(0.3,0.4,0.5),case_2=c(0.33,0.44,0.55))
So the question is: how can I do the following operation without repeating it four times? I suppose that an apply function, or something similar, would suit better.
sum(table1$`1`[,c("group")] * group_1[,c("case_1")])
sum(table1$`1`[,c("group")] * group_1[,c("case_2")])
sum(table1$`2`[,c("group")] * group_2[,c("case_1")])
sum(table1$`2`[,c("group")] * group_2[,c("case_2")])
After going through the data you provided step by step and working out what you are trying to do, here is a suggestion using mapply.
group_list <- list(group_1, group_2)
mapply(function(x, y) colSums(x * y),split(table1$group, table1$group),group_list)
# 1 2
#case_1 0.90 2.40
#case_2 0.99 2.64
We put the group dataframes in one list, group_list, split table1$group by group, multiply the corresponding pieces with mapply, and take the column-wise sums. If I have understood you correctly, this is what you need; let me know if it is otherwise.
Based on the initial (unsplit) dataset, we can also do this using group_by operations:
library(tidyverse)
bind_rows(group_1, group_2) %>%
bind_cols(table1['group'], .) %>%
mutate(case_1 = group*case_1, case_2 = group*case_2) %>%
group_by(group) %>%
summarise_each(funs(sum))
# A tibble: 2 × 3
# group case_1 case_2
# <dbl> <dbl> <dbl>
#1 1 0.9 0.99
#2 2 2.4 2.64
data
table1<-data.frame(group=c(1,1,1,2,2,2),price=c(10,20,30,10,20,30),
visits=c(100,200,300,150,250,350))
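summarise_each and funs are deprecated in recent dplyr releases. A sketch of the same computation with across, assuming dplyr 1.0 or later and the unsplit table1 shown under "data":

```r
library(dplyr)

table1 <- data.frame(group = c(1, 1, 1, 2, 2, 2),
                     price = c(10, 20, 30, 10, 20, 30),
                     visits = c(100, 200, 300, 150, 250, 350))
group_1 <- data.frame(case_1 = c(0.2, 0.3, 0.4), case_2 = c(0.22, 0.33, 0.44))
group_2 <- data.frame(case_1 = c(0.3, 0.4, 0.5), case_2 = c(0.33, 0.44, 0.55))

res <- bind_rows(group_1, group_2) %>%
  # rows of group_1 line up with group 1, rows of group_2 with group 2
  mutate(group = table1$group) %>%
  mutate(across(c(case_1, case_2), ~ .x * group)) %>%
  group_by(group) %>%
  summarise(across(c(case_1, case_2), sum))

res
```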
