I am working with two datasets that I would like to join based not on exact matches between them, but on approximate matches. My question is similar to this OP.
Here are examples of what my two dataframes look like.
df1 is this one:
x
4.8
12
4
3.5
12.5
18
df2 is this one:
x y
4.8 6.6
12 1
4.5 1
3.5 0.5
13 1.8
15 2
I am currently using inner_join(df1, df2, by = c("x")) to join the two together.
This gives me:
x y
4.8 6.6
12 1
3.5 0.5
However, what I really want to do is join the two dfs based on these conditions:
any exact matches are joined first (exactly like how inner_join() currently works)
BUT, if there are no exact matches, then join to any match ± 0.5
The kind of output I am trying to get would look like this:
x y
4.8 6.6
12 1
4 1 #the y value is from x=4.5 in df2
4 0.5 #the y value is from x=3.5 in df2
3.5 0.5
12.5 1 #the y value is from x=12 in df2
12.5 1.8 #the y value is from x=13 in df2
I typically work in dplyr, so a dplyr solution would be appreciated. But, I am also open to other suggestions because I don't know if dplyr will be flexible enough to do a "fuzzy" join.
(I am aware of the fuzzyjoin package, but it doesn't seem to get at exactly what I am trying to do here)
A possible solution, with no join:
library(tidyverse)
df1 %>%
rename(x1 = x) %>%
crossing(df2) %>%
mutate(diff = abs(x1-x)) %>%
filter(diff <= 0.5) %>%
group_by(x1) %>%
mutate(aux = any(diff == 0)) %>%
filter(aux*(diff == 0) | !aux) %>%
select(-diff, -aux) %>%
ungroup
#> # A tibble: 7 × 3
#> x1 x y
#> <dbl> <dbl> <dbl>
#> 1 3.5 3.5 0.5
#> 2 4 3.5 0.5
#> 3 4 4.5 1
#> 4 4.8 4.8 6.6
#> 5 12 12 1
#> 6 12.5 12 1
#> 7 12.5 13 1.8
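On dplyr 1.1.0 or later, the same exact-first rule can probably also be written as a non-equi join instead of a full cross join. This is only a sketch under assumptions: it uses the example data from the question, and the names tol, lo, hi and x2 are made up for illustration. Exact comparison of doubles with == works for these values but may not for arbitrary data.

```r
library(dplyr)

# Example data copied from the question
df1 <- tibble(x = c(4.8, 12, 4, 3.5, 12.5, 18))
df2 <- tibble(x = c(4.8, 12, 4.5, 3.5, 13, 15),
              y = c(6.6, 1, 1, 0.5, 1.8, 2))

tol <- 0.5  # hypothetical tolerance variable

# Give every df2 row an acceptance window [lo, hi]; x2 keeps the df2 key visible
df2_window <- df2 %>%
  mutate(lo = x - tol, hi = x + tol) %>%
  rename(x2 = x)

result <- df1 %>%
  # keep every pair where the df1 value falls inside the df2 window
  inner_join(df2_window, by = join_by(between(x, lo, hi))) %>%
  group_by(x) %>%
  # if a df1 value has an exact match, keep only that; otherwise keep all near matches
  filter(!any(x == x2) | x == x2) %>%
  ungroup() %>%
  select(x, x2, y)

result
```

Compared with crossing(), this avoids building every df1 x df2 pair first, which may matter on larger tables.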
You could use {powerjoin}
library(powerjoin)
power_left_join(
df1, df2,
by = ~ .x$x == .y$x | ! .x$x %in% .y$x & .x$x <= .y$x +.5 & .x$x >= .y$x -.5,
keep = "left")
#> x y
#> 1 4.8 6.6
#> 2 12.0 1.0
#> 3 4.0 1.0
#> 4 4.0 0.5
#> 5 3.5 0.5
#> 6 12.5 1.0
#> 7 12.5 1.8
Created on 2022-04-14 by the reprex package (v2.0.1)
I am trying to mutate a new variable depending on the mean function of each group. I tried running this code using the pre-loaded data frame "ToothGrowth" available in R.
The output is incorrect: it seems to cycle through the group means as one long list instead of assigning each mean to its own group.
A diagram showing what I am trying to achieve:
data("ToothGrowth")
head(ToothGrowth)
tg.tb01<-ToothGrowth %>%
group_by(supp, dose) %>% # mydata has 3 more variables.
summarise(mean = mean(len)) %>%
print()
ToothGrowth %>%
group_by(supp) %>%
mutate(submean2 = len - tg.tb01$mean/tg.tb01$dose)
You don't need another dataset to store the average. Look below:
library(dplyr)
library(datasets)
ToothGrowth %>%
group_by(supp, dose) %>%
mutate(lenmean = mean(len),
submean2 = len - lenmean/dose)
#> # A tibble: 60 x 5
#> # Groups: supp, dose [6]
#> len supp dose lenmean submean2
#> <dbl> <fct> <dbl> <dbl> <dbl>
#> 1 4.2 VC 0.5 7.98 -11.8
#> 2 11.5 VC 0.5 7.98 -4.46
#> 3 7.3 VC 0.5 7.98 -8.66
#> 4 5.8 VC 0.5 7.98 -10.2
#> 5 6.4 VC 0.5 7.98 -9.56
#> 6 10 VC 0.5 7.98 -5.96
#> 7 11.2 VC 0.5 7.98 -4.76
#> 8 11.2 VC 0.5 7.98 -4.76
#> 9 5.2 VC 0.5 7.98 -10.8
#> 10 7 VC 0.5 7.98 -8.96
#> # ... with 50 more rows
If I understand correctly, you should use mutate() instead of summarise(). Please specify your desired output with numbers.
tg.tb01<-ToothGrowth %>%
group_by(supp, dose) %>%
mutate(mean = mean(len)) %>%
ungroup() %>%
group_by(supp) %>%
mutate(submean2 = len - mean/dose)
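If you do want to keep a separate summary table as in the question, one option (a sketch, not the only way) is to join it back on supp and dose, so that each row picks up the mean of its own group instead of the whole vector of means. The names group_means and joined are placeholders.

```r
library(dplyr)

# One row per (supp, dose) group with its mean length
group_means <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(lenmean = mean(len), .groups = "drop")

# Attach the matching group mean to every original row, then compute the new variable
joined <- ToothGrowth %>%
  left_join(group_means, by = c("supp", "dose")) %>%
  mutate(submean2 = len - lenmean / dose)

head(joined)
```

Referring to tg.tb01$mean inside mutate() pulls in all six group means at once, which is why the question's version does not line up with the rows.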
I have data in wide format and I would like to have it summarized with e.g. median and IQR in long format.
The summarizing of the long format works fine but then I need to clean and convert the column names to make rows out of them.
How can this be achieved?
MWE with desired output:
require(data.table)
# Demo data
dt <- data.table("c" = c(1,2,3,1,1,2,3,2), "323.1" = c(1,2,3,4,1,3,4,8), "454.3" = c(1,3,2,4,5,6,7,8)) # Real data has plenty more columns in the style "number.number"
# Create summary
dt[, unlist(recursive=FALSE, lapply(
.(med = median, iqr = IQR),
function(f) lapply(.SD, f)
)), by=.(c)]
# Desired output:
(dt_output <- data.table("c" = c(1,1,2,2,3,3), "var" = c("323.1", "454.3"), "med" = c(1,3,3.5,4,6,4.5), "iqr" = c(1.5,3,0.5,2,2.5,2.5)))
Output after summary:
c med.323.1 med.454.3 iqr.323.1 iqr.454.3
1: 1 1.0 4.0 1.5 2.0
2: 2 3.0 6.0 3.0 2.5
3: 3 3.5 4.5 0.5 2.5
Desired output:
c var med iqr
1: 1 323.1 1.0 1.5
2: 1 454.3 3.0 3.0
3: 2 323.1 3.5 0.5
4: 2 454.3 4.0 2.0
5: 3 323.1 6.0 2.5
6: 3 454.3 4.5 2.5
Thanks!
An option is to melt first before performing your calculations:
melt(dt, id.vars="c")[, .(med = median(value), iqr = IQR(value)), .(c, variable)]
output:
c variable med iqr
1: 1 323.1 1.0 1.5
2: 2 323.1 3.0 3.0
3: 3 323.1 3.5 0.5
4: 1 454.3 4.0 2.0
5: 2 454.3 6.0 2.5
6: 3 454.3 4.5 2.5
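If the wide summary from the question already exists, its column names can probably also be split after the fact. The sketch below assumes tidyr is acceptable alongside data.table, and that every summary column looks like "med.<name>" or "iqr.<name>"; the names wide and long are placeholders.

```r
library(data.table)
library(tidyr)

# Demo data from the question
dt <- data.table("c" = c(1,2,3,1,1,2,3,2),
                 "323.1" = c(1,2,3,4,1,3,4,8),
                 "454.3" = c(1,3,2,4,5,6,7,8))

# Wide summary, as in the question: columns med.323.1, med.454.3, iqr.323.1, iqr.454.3
wide <- dt[, unlist(recursive = FALSE, lapply(
  list(med = median, iqr = IQR),
  function(f) lapply(.SD, f)
)), by = .(c)]

# ".value" turns the first captured piece (med / iqr) into output columns,
# while the second piece (the original column name) becomes the var column
long <- pivot_longer(
  as.data.frame(wide),
  cols = -all_of("c"),
  names_to = c(".value", "var"),
  names_pattern = "^(med|iqr)\\.(.+)$"
)

long
```

names_pattern is used rather than names_sep because the original column names contain dots themselves.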
You can use gather() to reshape your data, then summarise to get the statistics like this:
library(tidyverse)
dt %>% gather(., var, value, -c) %>%
group_by(c, var) %>% summarise(med = median(value),
iqr = IQR(value))
# A tibble: 6 x 4
# Groups: c [3]
c var med iqr
<dbl> <chr> <dbl> <dbl>
1 1 323.1 1 1.5
2 1 454.3 4 2
3 2 323.1 3 3
4 2 454.3 6 2.5
5 3 323.1 3.5 0.5
6 3 454.3 4.5 2.5
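In current tidyr, gather() is superseded by pivot_longer(), so the same idea can be sketched as follows. This assumes the demo data from the question, rebuilt here as a plain data.frame; long_summary is a placeholder name.

```r
library(dplyr)
library(tidyr)

# Demo data from the question; check.names = FALSE keeps the numeric-looking names
dt <- data.frame("c" = c(1,2,3,1,1,2,3,2),
                 "323.1" = c(1,2,3,4,1,3,4,8),
                 "454.3" = c(1,3,2,4,5,6,7,8),
                 check.names = FALSE)

long_summary <- dt %>%
  # every column except c becomes a (var, value) pair
  pivot_longer(cols = -all_of("c"), names_to = "var", values_to = "value") %>%
  group_by(c, var) %>%
  summarise(med = median(value), iqr = IQR(value), .groups = "drop")

long_summary
```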
I want to select cols using colnames and their values in a single pipe chain without referring to other objects, such as NAMES <- names(d). Can I do it with select_if() ?
For example,
I can use colnames to select cols.
(select(matches(...)) is the smarter choice when only colnames are involved).
library(dplyr)
d <- iris %>% select(-Species) %>% tibble::as_tibble()
d %>% select_if(stringr::str_detect(names(.), "Petal"))
And I can use the values.
d %>% select_if(~ mean(.) > 5)
But how to use both of them ? (especially OR)
The code below is what I want (of course, it doesn't run).
d %>% select_if(stringr::str_detect(names(.), "Petal") | ~ mean(.) > 5)
Any help would be greatly appreciated.
A workaround that is not too complicated is:
d %>% select_if(stringr::str_detect(names(.), "Petal") | sapply(., mean) > 5)
# or
d %>% select_if(grepl("Petal",names(.)) | sapply(., mean) > 5)
Which gives:
# A tibble: 150 x 3
Sepal.Length Petal.Length Petal.Width
<dbl> <dbl> <dbl>
1 5.1 1.4 0.2
2 4.9 1.4 0.2
3 4.7 1.3 0.2
4 4.6 1.5 0.2
5 5.0 1.4 0.2
6 5.4 1.7 0.4
7 4.6 1.4 0.3
8 5.0 1.5 0.2
9 4.4 1.4 0.2
10 4.9 1.5 0.1
# ... with 140 more rows
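With dplyr 1.0.0 or later, select() itself can likely combine both kinds of condition, because name helpers and where() predicates can be joined with | inside one selection. This is a sketch assuming the same d as in the question; picked is a placeholder name.

```r
library(dplyr)

d <- iris %>% select(-Species) %>% tibble::as_tibble()

picked <- d %>%
  # name-based condition OR value-based condition, in a single selection
  select(matches("Petal") | where(~ mean(.x) > 5))

names(picked)
```

Using & instead of | would give the intersection of the two conditions.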
I can't find the answer anywhere.
I would like to calculate a new variable in a data frame that is based on the mean of each row.
For example:
data <- data.frame(id=c(101,102,103), a=c(1,2,3), b=c(2,2,2), c=c(3,3,3))
I want to use mutate to make a variable d which is the mean of a, b and c. I would like to be able to do that by selecting columns, as in d=mean(a,b,c), and I also need to be able to use a range of variables (like in dplyr), as in d=mean(a:c).
And of course
mutate(data, c=mean(a,b))
or
mutate(data, c=rowMeans(a,b))
doesn't work.
Can you give me some tips?
Regards
You're looking for
data %>%
rowwise() %>%
mutate(c=mean(c(a,b)))
# id a b c
# (dbl) (dbl) (dbl) (dbl)
# 1 101 1 2 1.5
# 2 102 2 2 2.0
# 3 103 3 2 2.5
or
library(purrr)
data %>%
rowwise() %>%
mutate(c=lift_vd(mean)(a,b))
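For the range form asked about in the question (a:c), dplyr 1.0.0 and later provide c_across(), which works together with rowwise(). A minimal sketch, assuming the example data from the question; with_d is a placeholder name.

```r
library(dplyr)

data <- data.frame(id = c(101, 102, 103), a = c(1, 2, 3), b = c(2, 2, 2), c = c(3, 3, 3))

with_d <- data %>%
  rowwise() %>%
  # c_across() collects the values of columns a through c for the current row
  mutate(d = mean(c_across(a:c))) %>%
  ungroup()

with_d
```

The ungroup() at the end drops the row-wise grouping so later verbs behave as usual.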
dplyr is badly suited to operate on this kind of data because it assumes tidy data format and, for the problem in question, your data is untidy.
You can of course tidy it first:
tidy_data = tidyr::gather(data, name, value, -id)
Which looks like this:
id name value
1 101 a 1
2 102 a 2
3 103 a 3
4 101 b 2
5 102 b 2
6 103 b 2
…
And then:
tidy_data %>% group_by(id) %>% summarize(mean = mean(value))
id mean
(dbl) (dbl)
1 101 2.000000
2 102 2.333333
3 103 2.666667
Of course this discards the original data. You could use mutate instead of summarize to avoid this. Finally, you can then un-tidy your data again:
tidy_data %>%
group_by(id) %>%
mutate(mean = mean(value)) %>%
tidyr::spread(name, value)
id mean a b c
(dbl) (dbl) (dbl) (dbl) (dbl)
1 101 2.000000 1 2 3
2 102 2.333333 2 2 3
3 103 2.666667 3 2 3
Alternatively, you could summarise and then merge the result with the original table:
tidy_data %>%
group_by(id) %>%
summarize(mean = mean(value)) %>%
inner_join(data, by = 'id')
The result is the same in either case. I conceptually prefer the second variant.
I think the answer suggesting using data.frame or slicing on . is the best, but could be made simpler and more dplyr-ish like so:
data %>% mutate(c = rowMeans(select(., a,b)))
Or if you want to avoid ., with the penalty of having two inputs to your pipeline:
data %>% mutate(c = rowMeans(select(data, a,b)))
And yet another couple of ways, useful if you have the numeric positions or vector names of the columns to be summarised:
data %>% mutate(d = rowMeans(.[, 2:4]))
or
data %>% mutate(d = rowMeans(.[, c("a","b","c")]))
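On dplyr 1.1.0 or later, pick() gives a way to select the columns inside mutate() without relying on the . placeholder, and it accepts ranges such as a:c. A minimal sketch, assuming the example data from the question; with_d is a placeholder name.

```r
library(dplyr)

data <- data.frame(id = c(101, 102, 103), a = c(1, 2, 3), b = c(2, 2, 2), c = c(3, 3, 3))

with_d <- data %>%
  # pick() returns the selected columns as a data frame, which rowMeans() accepts
  mutate(d = rowMeans(pick(a:c)))

with_d
```

Because this stays vectorised through rowMeans(), it avoids the per-row overhead of rowwise() on larger data.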
I think this is the dplyr-ish way. First, I'd create a function:
my_rowmeans = function(...) Reduce(`+`, list(...))/length(list(...))
Then, it can be used inside mutate:
data %>% mutate(rms = my_rowmeans(a, b))
# id a b c rms
# 1 101 1 2 3 1.5
# 2 102 2 2 3 2.0
# 3 103 3 2 3 2.5
# or
data %>% mutate(rms = my_rowmeans(a, b, c))
# id a b c rms
# 1 101 1 2 3 2.000000
# 2 102 2 2 3 2.333333
# 3 103 3 2 3 2.666667
To deal with the possibility of NAs, the function must be uglified:
my_rowmeans = function(..., na.rm=TRUE){
x =
if (na.rm) lapply(list(...), function(x) replace(x, is.na(x), as(0, class(x))))
else list(...)
d = Reduce(function(x,y) x+!is.na(y), list(...), init=0)
Reduce(`+`, x)/d
}
# alternately...
my_rowmeans2 = function(..., na.rm=TRUE) rowMeans(cbind(...), na.rm=na.rm)
# new example
data$b[2] <- NA
data %>% mutate(rms = my_rowmeans(a,b,na.rm=FALSE))
id a b c rms
1 101 1 2 3 1.5
2 102 2 NA 3 NA
3 103 3 2 3 2.5
data %>% mutate(rms = my_rowmeans(a,b))
id a b c rms
1 101 1 2 3 1.5
2 102 2 NA 3 2.0
3 103 3 2 3 2.5
The downside to the my_rowmeans2 is that it coerces to a matrix. I'm not certain that this will always be slower than the Reduce approach, though.
Another simple possibility with little code is:
data %>%
mutate(c= rowMeans(data.frame(a,b)))
# id a b c
# 1 101 1 2 1.5
# 2 102 2 2 2.0
# 3 103 3 2 2.5
As rowMeans needs something like a matrix or a data.frame, you can use data.frame(var1, var2, ...) instead of c(var1, var2, ...). If you have NAs in your data you'll need to tell R what to do, for example to remove them: rowMeans(data.frame(a,b), na.rm=TRUE)
If you'd like to use a pivot_longer()-style solution:
data%>%
pivot_longer(cols=-id)%>%
group_by(id)%>%
mutate(mean=mean(value))%>%
pivot_wider(names_from=name, values_from=value)
Note that this requires the tidyr package.
This is my preference for the fact that I only need to type the name of my ID column, and don't have to worry about column indices or names otherwise. Good for a quick copy-and-point-this-at-different-data solution, though the same can be said of other answers here. Also good for cases where you might have more than one column with categorical information and haven't created a single unique identifier column.
For what it's worth, I found that this solution is very easily modified to ignore NA values with simple addition of na.rm=TRUE in the mean calculation.
For example:
data <- data.frame(id=c(101,102,103), a=c(NA,2,3), b=c(2,2,2), c=c(3,3,3))
data%>%
pivot_longer(cols=-id)%>%
group_by(id)%>%
mutate(mean=mean(value,na.rm=TRUE))%>%
pivot_wider(names_from = name, values_from=value)
You can use a wrapper function around rowMeans() to make it easier to work with. The one below lets you specify na.rm, and you can use tidyselect to choose your columns if you want.
# This is the wrapper function
means <- function(..., na.rm = FALSE) {
rowMeans(data.frame(...), na.rm = na.rm)
}
library(dplyr)
# Example data
iris2 <- iris %>%
head() %>%
transmute(Sepal.Length = replace(Sepal.Length,
sample(c(TRUE, FALSE), nrow(.),
replace = TRUE),
NA),
Sepal.Width,
Petal.Length,
Petal.Width) %>%
print()
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 NA 3.5 1.4 0.2
#> 2 NA 3.0 1.4 0.2
#> 3 NA 3.2 1.3 0.2
#> 4 4.6 3.1 1.5 0.2
#> 5 NA 3.6 1.4 0.2
#> 6 5.4 3.9 1.7 0.4
# Basic usage
iris2 %>%
mutate(mean_sepal = means(Sepal.Length, Sepal.Width))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width mean_sepal
#> 1 NA 3.5 1.4 0.2 NA
#> 2 NA 3.0 1.4 0.2 NA
#> 3 NA 3.2 1.3 0.2 NA
#> 4 4.6 3.1 1.5 0.2 3.85
#> 5 NA 3.6 1.4 0.2 NA
#> 6 5.4 3.9 1.7 0.4 4.65
# If you want to exclude NAs
iris2 %>%
mutate(mean_sepal = means(Sepal.Length, Sepal.Width, na.rm = TRUE))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width mean_sepal
#> 1 NA 3.5 1.4 0.2 3.50
#> 2 NA 3.0 1.4 0.2 3.00
#> 3 NA 3.2 1.3 0.2 3.20
#> 4 4.6 3.1 1.5 0.2 3.85
#> 5 NA 3.6 1.4 0.2 3.60
#> 6 5.4 3.9 1.7 0.4 4.65
# You can also use select() and choose columns using tidyselect
iris2 %>%
mutate(mean_sepal = means(select(., contains("Sepal")), na.rm = TRUE))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width mean_sepal
#> 1 NA 3.5 1.4 0.2 3.50
#> 2 NA 3.0 1.4 0.2 3.00
#> 3 NA 3.2 1.3 0.2 3.20
#> 4 4.6 3.1 1.5 0.2 3.85
#> 5 NA 3.6 1.4 0.2 3.60
#> 6 5.4 3.9 1.7 0.4 4.65
Created on 2022-01-13 by the reprex package (v2.0.1)