R - apply a function (combinevar) across all rows of a dataset

I have a dataset where each row contains the data needed for the combinevar function (package = fishmethods; combinevar combines info from two distributions to come up with the combined variance).
xbar1 = c(2,2,1,4,3)
xbar2 = c(0,0,0,0,0)
var1 = c(0,1,3,2,1)
var2 = c(0,0,0,0,0)
n1 = c(50,10,30,40,50)
n2 = c(3,4,50,32,20)
df <- data.frame(xbar1, xbar2, var1, var2, n1, n2)
xbar1 xbar2 var1 var2 n1 n2
2 0 0 0 50 3
2 0 1 0 10 4
1 0 3 0 30 50
4 0 2 0 40 32
3 0 1 0 50 20
How would I apply the function across the rows? I can do it in a for loop like this:
for (i in 1:nrow(df)) {
combined_var <- combinevar(xbar = c(df$xbar1[i], df$xbar2[i]),
s_squared = c(df$var1[i], df$var2[i]),
n = c(df$n1[i], df$n2[i]))[2]
print(combined_var)
}
[1] 0.2177068
[1] 1.571429
[1] 1.338608
[1] 5.104851
[1] 2.573499
But I'm sure there's a better way. I think I can probably do it with an apply function but I can't figure out how.

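You can use apply() over the rows. Wrap combinevar in a function that reads each row as a vector:
library(fishmethods)
# vec is one row of df as a numeric vector, in column order:
# xbar1, xbar2, var1, var2, n1, n2
my_function <- function(vec) {
combinevar(xbar = c(vec[1], vec[2]), s_squared = c(vec[3], vec[4]), n = c(vec[5], vec[6]))
}
# apply() returns one column per row of df; [2, ] picks the combined variances
apply(df, 1, my_function)[2, ]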
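As a dependency-free illustration of the same row-wise pattern, here is a minimal sketch that uses mapply to walk the columns in parallel. Note that combine_two_var is a hypothetical stand-in written for this example, not the fishmethods implementation; it assumes the usual formula for the variance of two pooled samples, which appears to reproduce the numbers printed by the for loop in the question.

```r
# Data from the question
df <- data.frame(
  xbar1 = c(2, 2, 1, 4, 3), xbar2 = c(0, 0, 0, 0, 0),
  var1  = c(0, 1, 3, 2, 1), var2  = c(0, 0, 0, 0, 0),
  n1    = c(50, 10, 30, 40, 50), n2 = c(3, 4, 50, 32, 20)
)

# Hypothetical stand-in for fishmethods::combinevar (assumption: the usual
# formula for the variance of two pooled samples)
combine_two_var <- function(xbar1, xbar2, var1, var2, n1, n2) {
  n <- n1 + n2
  xbar <- (n1 * xbar1 + n2 * xbar2) / n
  ((n1 - 1) * var1 + (n2 - 1) * var2 +
     n1 * (xbar1 - xbar)^2 + n2 * (xbar2 - xbar)^2) / (n - 1)
}

# mapply() calls the function once per row, taking the i-th element
# of every column as the arguments of the i-th call
combined_var <- mapply(combine_two_var,
                       df$xbar1, df$xbar2, df$var1, df$var2, df$n1, df$n2)
combined_var
```

Compared with apply(df, 1, ...), this refers to the columns by name rather than by position, and it does not coerce the data frame to a matrix first.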

We can nest the data by row and then map the function for each row.
library(tidyverse)
library(fishmethods)
df %>%
rownames_to_column("row") %>%
nest(-row) %>%
mutate(combined_var = map(data, ~combinevar(xbar = c(.x$xbar1, .x$xbar2),
s_squared = c(.x$var1, .x$var2),
n = c(.x$n1, .x$n2))[2])) %>%
unnest()
#> row combined_var xbar1 xbar2 var1 var2 n1 n2
#> 1 1 0.2177068 2 0 0 0 50 3
#> 2 2 1.5714286 2 0 1 0 10 4
#> 3 3 1.3386076 1 0 3 0 30 50
#> 4 4 5.1048513 4 0 2 0 40 32
#> 5 5 2.5734990 3 0 1 0 50 20
Or we can just apply the function rowwise:
df %>%
rowwise() %>%
mutate(combined_var = combinevar(xbar = c(xbar1, xbar2),
s_squared = c(var1, var2),
n = c(n1, n2))[2])
#> Source: local data frame [5 x 7]
#> Groups: <by row>
#>
#> # A tibble: 5 x 7
#> xbar1 xbar2 var1 var2 n1 n2 combined_var
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 0 0 0 50 3 0.218
#> 2 2 0 1 0 10 4 1.57
#> 3 1 0 3 0 30 50 1.34
#> 4 4 0 2 0 40 32 5.10
#> 5 3 0 1 0 50 20 2.57
Created on 2018-08-19 by the reprex package (v0.2.0).

Related

Retaining values from one row to the next

I have a data frame with one variable, x. I want to create a new variable y which is equal to 1 when x decreases by 2 from its previous value and equal to 0 otherwise. Then I want to create a variable z which holds the value of x when y was last equal to 1. I want the initial value of z to be 0. I haven't been able to figure out how to make z. Any advice?
Here's what I'm trying to obtain (but for about 1000 rows):
x y z
9 0 0
8 0 0
6 1 6
9 0 6
7 1 7
5 1 5
I've tried lags, cum functions in dplyr to no avail.
library(dplyr)
library(tidyr)
df <- data.frame(x = c(9,8,6,10,9,7,5))
df %>%
mutate(y = +(lag(x, default = x[1]) - x == 2),
z = ifelse(cumsum(y) > 0 & y == 0, NA, x * y)) %>%
fill(z, .direction = "down")
#> x y z
#> 1 9 0 0
#> 2 8 0 0
#> 3 6 1 6
#> 4 10 0 6
#> 5 9 0 6
#> 6 7 1 7
#> 7 5 1 5
Created on 2022-11-07 by the reprex package (v2.0.1)
One option:
df$y = 0L
df$y[-1] = (diff(df$x) == -2L)
df$z = data.table::nafill(ifelse(df$y == 1L, df$x, NA), "locf", fill = 0L)
# x y z
# 1 9 0 0
# 2 8 0 0
# 3 6 1 6
# 4 9 0 6
# 5 7 1 7
# 6 5 1 5
Reproducible data (please provide next time)
df = data.frame(x = c(9L,8L,6L,9L,7L,5L))
Here's a simple way to do it using dplyr.
library(dplyr)
tmp = data.frame(x = c(9,8,6,9,7,5))
tmp %>%
mutate(y = ifelse(lag(x, default = first(x)) - x == 2, 1, 0)) %>%
# index of the most recent row with y == 1 (0 if none yet), used to look up x
mutate(z = c(0, x)[cummax(y * row_number()) + 1])
# output
# x y z
# 1 9 0 0
# 2 8 0 0
# 3 6 1 6
# 4 9 0 6
# 5 7 1 7
# 6 5 1 5
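The idea of retaining a value from one row to the next can also be written as an explicit fold in base R. This is only a sketch under the question's assumptions (z starts at 0 and is overwritten with x whenever y is 1); the names prev_z and i are chosen for illustration.

```r
x <- c(9, 8, 6, 9, 7, 5)

# y is 1 where x dropped by exactly 2 from the previous row, else 0
y <- c(0L, as.integer(diff(x) == -2))

# Fold over the row indices, carrying z forward from one row to the next.
# init = 0 is the starting value of z; [-1] drops that initial element.
z <- Reduce(
  function(prev_z, i) if (y[i] == 1L) x[i] else prev_z,
  seq_along(x),
  init = 0,
  accumulate = TRUE
)[-1]

data.frame(x, y, z)
```

Reduce is slower than a vectorised fill on long vectors, but it makes the row-to-row state explicit, which can help when the update rule gets more complicated.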

Counting of conditional frequency in R

I have a table with only one column and more than 200 rows. It contains three values: 0, 1 and 3. I'm interested only in the cases where a 1 follows a 0. Can R count all rows where X[i] = 1 and X[i-1] = 0, where X[i] is the value in row i?
It would be great if someone could help!
Best, Anna
Do you mean something like this?
# Create some sample data
set.seed(2020)
df <- data.frame(incident = sample(c(0, 1, 3), 10, replace = TRUE))
# incident
#1 3
#2 1
#3 0
#4 0
#5 1
#6 1
#7 0
#8 0
#9 1
#10 1
sum(c(df$incident[-1] == 1, FALSE) * (df$incident == 0))
# Or: with(df, sum(c(incident[-1] == 1, FALSE) * (incident == 0)))
#[1] 2
Here, c(incident[-1] == 1, FALSE) * (incident == 0) is the logical AND of x[i] = 0 and x[i+1] = 1, i.e. row i holds a 0 and the next row holds a 1. sum then counts the occurrences (in this case there are 2: one in rows 4/5 and one in rows 8/9).
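The same pairwise comparison can be spelled out with head() and tail(), which some readers find easier to follow. This is a small sketch using the sample values printed above; the names prev_val and next_val are only illustrative.

```r
# Sample values copied from the printed data above
incident <- c(3, 1, 0, 0, 1, 1, 0, 0, 1, 1)

prev_val <- head(incident, -1)  # rows 1 .. n-1
next_val <- tail(incident, -1)  # rows 2 .. n

# TRUE wherever a 0 is immediately followed by a 1
is_pair <- prev_val == 0 & next_val == 1

n_pairs <- sum(is_pair)           # number of 0 -> 1 transitions
ends_at <- which(is_pair) + 1     # row numbers holding the 1 of each pair
```

Here n_pairs is 2 and ends_at is 5 and 9, matching the rows 4/5 and 8/9 mentioned above.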
library(tidyverse)
set.seed(123)
(df <- tibble(value = sample(c(0, 1, 3),size = 200, replace = TRUE)))
#> # A tibble: 200 x 1
#> value
#> <dbl>
#> 1 3
#> 2 3
#> 3 3
#> 4 1
#> 5 3
#> 6 1
#> 7 1
#> 8 1
#> 9 3
#> 10 0
#> # … with 190 more rows
count <- 0
#use map instead of walk to view the process row by row
walk(2:nrow(df), ~ {
if (df$value[[.x - 1]] == 0 && df$value[[.x]] == 1) count <<- count + 1
})
count
#> [1] 26
#some rows where the pattern is happening
df[86:87, ]
#> # A tibble: 2 x 1
#> value
#> <dbl>
#> 1 0
#> 2 1
df[93:94, ]
#> # A tibble: 2 x 1
#> value
#> <dbl>
#> 1 0
#> 2 1
Created on 2021-06-28 by the reprex package (v2.0.0)
Using dplyr (a difference of 1 can only come from a 0 followed by a 1, because the only values are 0, 1 and 3):
transmute(df, dif = c(NA, diff(value))) %>%
count(dif) %>%
filter(dif == 1)
#> # A tibble: 1 x 2
#> dif n
#> <dbl> <int>
#> 1 1 26
Created on 2021-06-28 by the reprex package (v2.0.0)
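One caveat worth illustrating: counting differences of 1 only works because no other pair of values in this data differs by 1. The sketch below uses hypothetical data containing a 2 (which the question's data does not have) to show where the two counts diverge.

```r
# Hypothetical data containing a 2, which the original question does not have
x <- c(0, 1, 2, 0, 1)

# Counts every step of +1, including 1 -> 2
by_diff <- sum(diff(x) == 1)

# Counts only a 0 immediately followed by a 1
by_pair <- sum(head(x, -1) == 0 & tail(x, -1) == 1)

c(by_diff = by_diff, by_pair = by_pair)
```

If the set of possible values might change, comparing the pair of values explicitly is the safer choice.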

The Most Efficient Way of Forming Groups using R

I have a tibble dt given as follows:
library(tidyverse)
dt <- tibble(x=as.integer(c(0,0,1,0,0,0,1,1,0,1))) %>%
mutate(grp = as.factor(c(rep("A",3), rep("B",4), rep("C",1), rep("D",2))))
dt
As one can observe, the rule for grouping is:
a group starts with 0 and ends with 1 (e.g., groups A, B, D), or
it consists solely of a 1 (e.g., group C).
Problem: Given a tibble with an integer column x of zeros and ones that starts with 0 and ends in 1, what is the most efficient way to obtain a grouping using R? (You can use any grouping symbols/factors.)
We can take the cumulative sum of 'x' (assuming it is binary), lag it, add 1, and use the result as an index into LETTERS. (Note that LETTERS is used only to match the expected output; it has 26 elements, so it covers at most 26 groups.)
library(dplyr)
dt %>%
mutate(grp2 = LETTERS[lag(cumsum(x), default = 0)+ 1])
-output
# A tibble: 10 x 3
x grp grp2
<int> <fct> <chr>
1 0 A A
2 0 A A
3 1 A A
4 0 B B
5 0 B B
6 0 B B
7 1 B B
8 1 C C
9 0 D D
10 1 D D
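For reference, the same lagged cumulative sum can be written in base R, with labels that are not limited to 26 groups. This is a sketch only; the "g" prefix and the names grp_id and grp are arbitrary choices for the example.

```r
x <- c(0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L)

# Number of 1s seen strictly before each row, plus 1.
# Every 1 closes a group, so the next row starts a new one.
grp_id <- c(0L, head(cumsum(x), -1)) + 1L

# Build labels from the integer id instead of indexing LETTERS
grp <- factor(paste0("g", grp_id))

data.frame(x, grp_id, grp)
```

The integer grp_id can be used directly in group_by() or split(), so the labels are optional.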
The strategy proposed by Akrun is fantastic, but to show that it can also be done with accumulate:
library(tidyverse)
dt <- tibble(x=as.integer(c(0,0,1,0,0,0,1,1,0,1))) %>%
mutate(grp = as.factor(c(rep("A",3), rep("B",4), rep("C",1), rep("D",2))))
dt %>%
mutate(GRP = accumulate(lag(x, default = 0),.init =1, ~ if(.y != 1) .x else .x+1)[-1])
#> # A tibble: 10 x 3
#> x grp GRP
#> <int> <fct> <dbl>
#> 1 0 A 1
#> 2 0 A 1
#> 3 1 A 1
#> 4 0 B 2
#> 5 0 B 2
#> 6 0 B 2
#> 7 1 B 2
#> 8 1 C 3
#> 9 0 D 4
#> 10 1 D 4
Created on 2021-06-13 by the reprex package (v2.0.0)

Filter Data completely user defined r - multiple columns and filters

I am attempting to create a function that will allow a user to specify an arbitrary number of columns and apply matching filters to those columns.
df <- data.frame(a=1:10, b=round(runif(10)), c=round(runif(10)))
|a| b|c|
|1| 1|1|
|2| 0|0|
|3| 0|1|
|4| 1|0|
|5| 1|0|
|6| 1|0|
|7| 1|1|
|8| 1|1|
|9| 1|0|
|10|1|1|
I would like the user to be able to filter the data based off either column, and apply different filters to each column. I know the following does not work. But this would be the general idea.
test <- function(df, fCol, fParam){
df %>% filter(fCol[1] %in% fParam[1] | fCol[2] %in% fParam[2])
}
test(df, c("b","c"),c(1,0))
# Which I would want it to return
|a|b|c|
|4|1|0|
|5|1|0|
|6|1|0|
|9|1|0|
The issue that I run into is that I won't know how many columns the user will want to filter, nor will I know the column names.
Any help at all would be greatly appreciated. Please ask questions if you have them. I tried my best to give a reprex.
I believe this should satisfy what you want
library(tidyr)
library(dplyr)
test <- function(df,
fCol,
fParam,
match_type = "any")
{
if (length(match_type) != 1 || !is.element(match_type, c("any", "all"))) {
stop("match_type must be either 'any' or 'all'")
}
df <- df %>% ungroup() %>%
mutate(..id..=1:n())
meta <- data.frame(fCol=fCol,fParam=fParam)
logi <- df %>%
select("..id..",fCol) %>%
gather(key = "key", value = "value", -..id..) %>%
left_join(., y = meta, by = c("key"="fCol")) %>%
mutate(match = value==fParam) %>%
select(-key,-value, -fParam) %>%
group_by_at(setdiff(names(.),"match")) %>%
summarise(match = ifelse(match_type%in%"any",any(match), all(match)))
df2 <- left_join(df, logi, by = intersect(colnames(df),colnames(logi))) %>%
filter(match)%>%
select(-match, -..id..)
return(df2)
}
df <- data.frame(a=1:10, b=round(runif(10)), c=round(runif(10)))
df
# a b c
#1 1 0 1
#2 2 1 0
#3 3 0 0
#4 4 0 1
#5 5 0 1
#6 6 0 1
#7 7 1 0
#8 8 1 1
#9 9 1 0
#10 10 1 0
#use "any" to do an | match
test(df, c("b","c"),c(1,0), match_type = "any")
# a b c
#1 2 1 0
#2 3 0 0
#3 7 1 0
#4 8 1 1
#5 9 1 0
#6 10 1 0
#use "all" to do an & match
test(df, c("b","c"),c(1,0), match_type = "all")
# a b c
#1 2 1 0
#2 7 1 0
#3 9 1 0
#4 10 1 0
You can also specify the same colname for fCol multiple times if you want to match multiple values
test(df, c("b","b"),c(1,0)) #matches everything but you get the point
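The same any/all logic can be sketched in base R without reshaping the data: build one logical vector per column/value pair, then combine them with | or &. Note that filter_rows is a hypothetical helper written for this example, and the data frame is a fixed copy of the values printed above so that the result is reproducible.

```r
# Fixed copy of the example data printed above
df <- data.frame(
  a = 1:10,
  b = c(0, 1, 0, 0, 0, 0, 1, 1, 1, 1),
  c = c(1, 0, 0, 1, 1, 1, 0, 1, 0, 0)
)

# filter_rows is a hypothetical helper, not part of any package
filter_rows <- function(df, cols, values, match_type = c("any", "all")) {
  match_type <- match.arg(match_type)
  stopifnot(length(cols) == length(values), all(cols %in% names(df)))
  # One logical vector per (column, value) pair
  hits <- Map(function(col, val) df[[col]] %in% val, cols, values)
  # Combine the vectors with OR for "any" and AND for "all"
  keep <- Reduce(if (match_type == "any") `|` else `&`, hits)
  df[keep, , drop = FALSE]
}

any_rows <- filter_rows(df, c("b", "c"), c(1, 0), "any")
all_rows <- filter_rows(df, c("b", "c"), c(1, 0), "all")
```

Passing values as a list, for example list(c(0, 1), 0), would allow several accepted values per column, since each element is matched with %in%.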
(my original response):
I am not sure this quite gives you the process you want, but here's my best attempt before running out of patience!!! :-)
I am sure there is a good way to make this an AND filter not an OR but I can't quite get there myself. (Maybe a combination of map_dfc and inner_join?)
Edit: got there in the end! Improved code below (original code deleted).
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tibble))
suppressPackageStartupMessages(library(purrr))
my_df <- tibble(
a=1:10,
b=round(runif(10)),
c=round(runif(10))
)
my_df
#> # A tibble: 10 x 3
#> a b c
#> <int> <dbl> <dbl>
#> 1 1 1 0
#> 2 2 1 0
#> 3 3 0 1
#> 4 4 0 0
#> 5 5 1 1
#> 6 6 0 1
#> 7 7 0 0
#> 8 8 0 1
#> 9 9 1 0
#> 10 10 1 0
col_names <- c("b", "c")
tests <- c(1, 0)
# option 1: with a named function:
make_test_frame <- function(col_name, test) {
tibble({{col_name}} := test)
}
my_df1 <- map2_dfc(col_names, tests, make_test_frame) %>%
inner_join(x = my_df)
#> Joining, by = c("b", "c")
my_df1
#> # A tibble: 4 x 3
#> a b c
#> <int> <dbl> <dbl>
#> 1 1 1 0
#> 2 2 1 0
#> 3 9 1 0
#> 4 10 1 0
# option 2: with an anonymous function:
my_df1 <- map2_dfc(
col_names, tests,
function(col_name, test) {
tibble({{col_name}} := test)
}
) %>%
inner_join(x = my_df)
#> Joining, by = c("b", "c")
my_df1
#> # A tibble: 4 x 3
#> a b c
#> <int> <dbl> <dbl>
#> 1 1 1 0
#> 2 2 1 0
#> 3 9 1 0
#> 4 10 1 0
# option 3: as one big, hairy function:
filter_df <- function(df, col_names, tests) {
map2_dfc(
col_names, tests,
function(col_name, test) {
tibble({{col_name}} := test)
}
) %>%
inner_join(x = df)
}
my_df1 <- filter_df(my_df, col_names = c("b", "c"), tests = c(1, 0))
#> Joining, by = c("b", "c")
my_df1
#> # A tibble: 4 x 3
#> a b c
#> <int> <dbl> <dbl>
#> 1 1 1 0
#> 2 2 1 0
#> 3 9 1 0
#> 4 10 1 0
Created on 2020-02-28 by the reprex package (v0.3.0)
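The filter-as-join idea above also works in base R with merge(), which performs an inner join by default. The sketch below is an assumption-laden illustration: it uses a fixed copy of the tibble printed above as a plain data frame, and the name wanted is hypothetical.

```r
# Fixed copy of the example data printed above
my_df <- data.frame(
  a = 1:10,
  b = c(1, 1, 0, 0, 1, 0, 0, 0, 1, 1),
  c = c(0, 0, 1, 0, 1, 1, 0, 1, 0, 0)
)
col_names <- c("b", "c")
tests <- c(1, 0)

# One-row data frame holding the wanted value for each column
wanted <- as.data.frame(as.list(setNames(tests, col_names)))

# merge() keeps only the rows of my_df that match `wanted` on every key column,
# so this behaves like an AND filter
kept <- merge(my_df, wanted, by = col_names)
kept
```

Note that merge() moves the key columns to the front of the result, so the column order differs from the dplyr output.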

right join with dplyr make rows columns

I would like to right join data1 and data2 by ProductCode to get the desired output table below.
data1=data.frame(ProductCode=c(1,1,1,2,2,3),region=c("A","A","A","B","B","C"))
data1
ProductCode region
1 A
1 A
1 A
2 B
2 B
3 C
data2=data.frame(ProductCode=c(1,1,1,2,2,3),Period=c("promo1","promo2"
,"promo3","promo2","promo3","promo1"),promosales=c(15,12,7,18,20,2))
data2
ProductCode Period promosales
1 promo1 15
1 promo2 12
1 promo3 7
2 promo2 18
2 promo3 20
3 promo1 2
Desired output table
ProductCode region Promo1_sales Promo2_sales Promo3_sales
1 A 15 12 7
2 B 18 20 0
3 C 2 0 0
If I do it with SQL, I then have to group by ProductCode and take the maximum of each column:
sqldf("select a.*,
case when Period='promo1' then b.promosales else 0 end as
Promo1_sales1,
case when Period='promo2' then b.promosales else 0 end as
Promo1_sales2,
case when Period='promo3' then b.promosales else 0 end as
Promo1_sales3,
case when Period='promo4' then b.promosales else 0 end as
Promo1_sales4
from data1 a
left join data2 b on a.ProductCode=b.ProductCode
")
Can I do it with dplyr or anything else?
Thank you.
Not sure this will work in your general case, but you can do:
data1 <- data.frame(ProductCode=c(1,1,1,2,2,3),
region=c(rep('A', 3), rep('B', 2),'C'))
data2 <- data.frame(ProductCode=c(1,1,1,2,2,3),
Period=c("promo1","promo2","promo3","promo2","promo3","promo1"),
promosales=c(15,12,7,18,20,2))
library(dplyr)
library(tidyr)
data1 %>%
distinct() %>%
inner_join(data2, by = 'ProductCode') %>%
group_by(ProductCode) %>%
mutate(rownr = paste0('Promo', row_number(), '_sales')) %>%
select(-Period) %>%
spread(rownr, promosales, fill = 0)
#> # A tibble: 3 x 5
#> # Groups: ProductCode [3]
#> ProductCode region Promo1_sales Promo2_sales Promo3_sales
#> <dbl> <fct> <dbl> <dbl> <dbl>
#> 1 1 A 15 12 7
#> 2 2 B 18 20 0
#> 3 3 C 2 0 0
A simpler, better approach is to spread by Period directly:
data1 %>%
distinct() %>%
inner_join(data2, by = 'ProductCode') %>%
group_by(ProductCode) %>%
spread(Period, promosales, fill = 0)
#> # A tibble: 3 x 5
#> # Groups: ProductCode [3]
#> ProductCode region promo1 promo2 promo3
#> <dbl> <fct> <dbl> <dbl> <dbl>
#> 1 1 A 15 12 7
#> 2 2 B 0 18 20
#> 3 3 C 2 0 0
Created on 2018-05-23 by the reprex package (v0.2.0).
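For comparison, the Period-based reshape can be sketched in base R with xtabs() and merge(). This is a rough equivalent rather than a drop-in replacement: xtabs() sums promosales, so it assumes each ProductCode/Period pair occurs at most once, and the name wide is only illustrative.

```r
data1 <- data.frame(ProductCode = c(1, 1, 1, 2, 2, 3),
                    region = c("A", "A", "A", "B", "B", "C"))
data2 <- data.frame(ProductCode = c(1, 1, 1, 2, 2, 3),
                    Period = c("promo1", "promo2", "promo3",
                               "promo2", "promo3", "promo1"),
                    promosales = c(15, 12, 7, 18, 20, 2))

# Cross-tabulate sales by product and period; missing combinations become 0
wide <- as.data.frame.matrix(xtabs(promosales ~ ProductCode + Period, data = data2))
wide$ProductCode <- as.numeric(rownames(wide))

# Attach the region, using one row per product from data1
result <- merge(unique(data1), wide, by = "ProductCode")
result
```

The result has one row per ProductCode with columns promo1, promo2 and promo3, matching the second dplyr output above.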

Resources