I have a df (test) like this:
Now, if you look at the data, the 6 to 10 combination is present in the second period but not in the first. Hence, when I use this code
a_summary <- test %>%
  group_by(from, to) %>%
  summarize(avg = mean(share, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(from, avg, fill = 0)
The output comes out like this:
Now look at the 10 to 6 cell. It has a value of 1 because the 10 to 6 combination exists only once. But when I take the average, I would like to consider all combinations in each period. Hence the expected value of that 10 to 6 cell is 0.5, and overall the matrix columns and rows should each sum to 1.
a_summary <- test %>%
  group_by(from, to) %>%
  summarize(count = sum(n, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(from, count, fill = 0)
This will give you the counts of all combinations. Now you can normalize this matrix by dividing by sum(test$n), or use prop.table().
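For example, a minimal sketch of that normalization; treating the first column of a_summary as the `to` labels is an assumption about its layout:
# drop the `to` column so only the numeric count columns remain
mat <- as.matrix(a_summary[, -1])
mat / sum(test$n)  # manual normalization; assumes n has no NAs
prop.table(mat)    # equivalent: divides every cell by sum(mat)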
I have a fairly large dataset, and I am running a for loop that removes one line per transect and calculates the frequency of each category. I am now trying to make it so that instead of one line per transect, it removes a whole transect every iteration. Is it possible to do this?
Here is a sample dataset with the same columns as mine:
Transect <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
Category <- c("S","S","S","C","T","S","SP","T","C","T","S","SP","T","S","C")
dat <- data.frame(Transect, Category)
So the current code below removes one line per transect. How could I change it so that it randomly deletes a whole transect each iteration (e.g. in the first iteration all of transect 3 is removed, and in the second all of transect 1)?
for (q in 1:2) {
  for (i in 0:5) {
    # drop i rows per transect (at random), then tally category frequencies
    df <- dat %>%
      group_by(Transect) %>%
      sample_n(n() - i, replace = TRUE) %>%
      ungroup()
    c <- df %>%
      group_by(Category) %>%
      summarise(n = n()) %>%
      mutate(freq = n / sum(n),
             total = 55 - i)
    if (i == 0) {
      tot_1 <- c
    } else {
      tot_1 <- bind_rows(tot_1, c)
    }
  }
  tot_1$rep <- q
  if (q == 1) {
    dftot <- tot_1
  } else {
    dftot <- bind_rows(dftot, tot_1)
  }
}
It seems your goal is to iteratively assess increasingly small subsamples of your data, to gauge how much representation of the whole is lost. This code drops a random 1, then 2, then 3... transects and reports the distribution of categories. The last few lines normalize counts to fractions of the total for easy comparison between iterations.
Note that I used set.seed() for reproducibility, since the random sampling would otherwise return a different result each time.
To break down this answer a bit:
It's important that Category is a factor so that table() won't drop any Category values that have no count in a particular iteration. Without that, the code would run up to a point, but the row-binding operation within map_dfr() would then fail. (See the short illustration after this list.)
First I just enumerate the numbers of Transect to leave out (0:3 in this example) using 0:length(unique(d$Transect)). I included 0 so that we can see what it looks like with the full dataset.
I used set_names() so that it becomes a named vector. This allows us to use .id inside map_dfr() so that we get an extra column which stores the value of the leaveout.
purrr::map_dfr() will iteratively apply a function over some list. In this case I piped in the list of leaveout values (which we just named) and the function we apply is given as an rlang-style lambda function which begins with ~ and operates on the argument .x.
Working from the inside of the filter operation outward, the function first randomly samples .x values of Transect to exclude, and then removes rows with those values of Transect. Here we use %in% and negate the whole result with ! at the beginning.
Then we just use dplyr::pull() to take the Category column as a vector and run table() on it to tabulate the occurrence of each value.
The rest just calculates the total count for each iteration and then divides the values by that to get a fraction.
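As a quick aside, here is a minimal illustration of the factor point; the levels are taken from the sample data:
# with a factor, zero-count levels are kept in the table
table(factor(c("S", "C"), levels = c("C", "S", "SP", "T")))
#>  C  S SP  T
#>  1  1  0  0
# with a plain character vector, SP and T would simply be absent
table(c("S", "C"))
#> C S
#> 1 1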
library(tidyverse)

d <- tibble(
  Transect = as.character(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)),
  Category = factor(c("S","S","S","C","T","S","SP","T","C","T","S","SP","T","S","C"))
)

set.seed(1)

0:length(unique(d$Transect)) %>%
  set_names() %>%
  map_dfr(~ d %>%
            filter(!Transect %in% sample(unique(d$Transect), size = .x)) %>%
            pull(Category) %>%
            table(),
          .id = "leaveout_transects") %>%
  rowwise() %>%
  mutate(total_count = sum(c_across(-1)), .after = 1) %>%
  mutate(across(-c(1:2), ~ .x / total_count))
#> # A tibble: 4 × 6
#> # Rowwise:
#> leaveout_transects total_count C S SP T
#> <chr> <int> <table> <table> <table> <table>
#> 1 0 15 0.2 0.4 0.1333333 0.2666667
#> 2 1 10 0.2 0.3 0.2000000 0.3000000
#> 3 2 5 0.2 0.2 0.2000000 0.4000000
#> 4 3 0 NaN NaN NaN NaN
It would probably be more rigorous to simulate each leaveout condition multiple times and look at the distribution of performance you get at each value to assess what's likely to happen in the future with a given subsample.
Base R has the built-in function replicate(), which is great for this purpose. Here I'm just running the code above inside replicate(), correlating each subsample's category distribution with the full dataset's, and then reformatting the data a bit to graph it.
# use replicate to make many simulations
n_reps <- 20
replicate(
n_reps,
0:length(unique(d$Transect)) %>% set_names() %>%
map_dfr(
~ d %>%
filter(!Transect %in% sample(unique(d$Transect), size = .x)) %>%
pull(Category) %>%
table(),
.id = "leaveout_transects"
) %>%
rowwise() %>%
mutate(total_count = sum(c_across(-1)), .after = 1) %>%
mutate(across(-c(1:2), ~ .x / total_count)) %>%
select(3:6) %>%
t() %>%
cor() %>%
.[, 1]) %>%
as_tibble(.name_repair = "unique") %>%
mutate("leaveout_transects" = factor(0:length(unique(d$Transect)))) %>%
pivot_longer(-leaveout_transects, values_to = "correlation") %>%
select(-name) %>%
ggplot(aes(leaveout_transects, correlation)) +
geom_boxplot()
Created on 2022-09-22 by the reprex package (v2.0.1)
I want to compute a rolling mean over the last X days. rollmean() does that using rows. Since I am using loggers that sometimes fail, and the data were also cleaned, the time series is not continuous (rows do not necessarily represent a constant time difference).
A colleague suggested the solution below, which works great, except that my data need to be grouped (in the example, by treatment). For each day, I want the rolling mean of the last X days for each treatment.
Thanks
# making some example data
library(dplyr)

# vector with days since the beginning of the experiment
days <- 0:30

# random values
df1 <- tibble::tibble(
  days_since_beginning = days,
  value_to_used = rnorm(length(days)),
  treatment = sample(letters[1], 31, replace = TRUE)
)
df2 <- tibble::tibble(
  days_since_beginning = days,
  value_to_used = rnorm(length(days)),
  treatment = sample(letters[2], 31, replace = TRUE)
)
df <- full_join(df1, df2)
# how long the averaging window should be
time_period <- 10 # calculate over the last 10 days

df_mean <- df %>%
  dplyr::mutate(
    # calculate rolling mean
    roll_mean = purrr::map_dbl(
      .x = days_since_beginning,
      .f = ~ df %>%
        # select only data from the last `time_period` days
        dplyr::filter(days_since_beginning >= .x - time_period &
                        days_since_beginning <= .x) %>%
        purrr::pluck("value_to_used") %>%
        mean()
    )
  )
This takes the mean over the last 10 days by treatment. The width argument includes a computation of how many rows back to use so that it corresponds to 10 days rather than 10 rows. This uses the fact that width can be a vector.
library(dplyr)
library(zoo)
df %>%
  group_by(treatment) %>%
  mutate(roll = rollapplyr(value_to_used,
                           seq_along(days_since_beginning) -
                             findInterval(days_since_beginning - 10, days_since_beginning),
                           mean)) %>%
  ungroup()
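To illustrate how that width vector adapts to gaps, here is a minimal sketch on hypothetical, non-continuous days (the values are made up):
# hypothetical days with gaps, e.g. from a logger that failed
days <- c(0, 1, 3, 7, 12)
# for each row: how many rows back fall within the last 10 days
seq_along(days) - findInterval(days - 10, days)
#> [1] 1 2 3 4 3
# e.g. for day 12 the width is 3, covering days 3, 7 and 12,
# i.e. the half-open window (2, 12]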
The same colleague came up with his own solution:
df_mean <- df %>%
  dplyr::group_by(treatment) %>%
  tidyr::nest() %>%
  dplyr::mutate(
    data_with_mean = purrr::map(
      .x = data,
      .f = ~ {
        dataset <- .x
        dataset %>%
          dplyr::mutate(
            # calculate rolling mean within this treatment's data
            roll_mean = purrr::map_dbl(
              .x = days_since_beginning,
              .f = ~ dataset %>%
                # select only data from the last `time_period` days
                dplyr::filter(days_since_beginning >= .x - time_period &
                                days_since_beginning <= .x) %>%
                purrr::pluck("value_to_used") %>%
                mean()
            )
          )
      }
    )
  ) %>%
  dplyr::select(-data) %>%
  tidyr::unnest(data_with_mean) %>%
  dplyr::ungroup()
I compared the results with G. Grothendieck's idea, and they only match if I use time_period in my colleague's code and time_period + 1 in G. Grothendieck's code. So there is a difference in how time_period is used, and I am confused about why it happens. (A likely explanation: the filter-based version keeps days in the closed interval [d - time_period, d], which spans time_period + 1 distinct days, while the findInterval() width only counts rows in the half-open interval (d - time_period, d], excluding the left endpoint.)
I am trying to create an index from a set of variables by taking the mean of the selected variables, using the following code:
data <- data %>%
  group_by(country) %>%
  # Standardize each component/measure
  mutate(
    std_var1 = standardize(var1, Z),
    std_var2 = standardize(var2, Z),
    std_var3 = standardize(var3, Z),
    std_var4 = standardize(var4, Z)
  ) %>%
  ungroup() %>%
  dplyr::select(std_var1, std_var2, std_var3, std_var4) %>%
  # Average all z scores for an individual
  mutate(index = pmap_dbl(., ~ mean(c(...), na.rm = TRUE))) %>%
  cbind(data, .) %>%
  unnest()
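For comparison, here is a minimal, self-contained sketch of that row-wise averaging step; the toy data and the use of scale() as a stand-in for standardize() are assumptions, not the original setup:
library(dplyr)
toy <- tibble(var1 = c(1, 2, NA), var2 = c(4, 5, 6))
toy %>%
  # standardize each component (scale() as a stand-in for standardize())
  mutate(across(everything(), ~ as.numeric(scale(.x)), .names = "std_{.col}")) %>%
  # average the z scores for each row, ignoring missing values
  mutate(index = rowMeans(across(starts_with("std_")), na.rm = TRUE))
Note that how missing values are treated (na.rm = TRUE averages over whatever is non-missing) is one place where two index-building approaches can legitimately diverge.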
I also use idx_mean(), which takes the following syntax:
mutate(data, idx_var = idx_mean(std_var1, std_var2, std_var3, std_var4))
and obtain similar, but not exactly the same, index values (not just a matter of rounding).
Is there one approach that seems more accurate here?
The 4th and 5th columns display the index values created by the idx function (4th column) and the other approach (5th column).
I can't find any help on the internet.
I have 3 columns in a .sav file loaded into RStudio.
There is M, with values 1,2,3,4,5,6,7 and label weight, and N, with values 1,2,3 and label diet.
I want to group by these columns, but for the N column I only want to pick rows where the value is 1. I also have a last column, A, with age data.
I wrote this:
library(dplyr)
df %>%
  group_by(M, N) %>%
  summarize(values = mean(A, na.rm = TRUE))
And I got the grouping, but for all values of N.
I tried something like this:
library(dplyr)
df %>%
  group_by(M, N == 1) %>%
  summarize(values = mean(A, na.rm = TRUE))
but again I got groups for all categories of N, with NAs etc.
Expected: I want to group by M (all values) and by N only where the value is 1.
What should that group_by look like?
We can do a group by 'M' and summarise the filtered 'A':
library(dplyr)
df %>%
  group_by(M) %>%
  summarise(values = mean(A[N == 1], na.rm = TRUE))
Another option is to add a filter step in between, but this would also remove the groups where there are no 'N' values equal to 1:
df %>%
  filter(N == 1) %>%
  group_by(M) %>%
  summarise(values = mean(A, na.rm = TRUE))
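A hypothetical toy example (the data are made up) shows the difference between the two approaches:
library(dplyr)
df <- tibble(M = c(1, 1, 2, 2), N = c(1, 2, 2, 2), A = c(10, 20, 30, 40))
# indexing inside summarise(): M = 1 gives 10; M = 2 gives NaN, but the group is kept
df %>%
  group_by(M) %>%
  summarise(values = mean(A[N == 1], na.rm = TRUE))
# filtering first: M = 2 disappears from the output entirely
df %>%
  filter(N == 1) %>%
  group_by(M) %>%
  summarise(values = mean(A, na.rm = TRUE))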
In R, when I run this group_by code, I obtain this result.
df <- tibble(y=c('a','a','a', 'b','b','b','b','b'), z=c(1,1,1,1,1,1,2,2))
df %>% group_by(z,y) %>% summarise(n())
z y n()
1 a 3
1 b 3
2 b 2
Is there a way to make it look like this?
z y n()
1 a 3
b 3
2 b 2
My goal is to have the formatting look the way it does in pandas, where the multilevel index isn't repeated each time (see below).
Here's one possibility:
df <- tibble(y = c('a','a','a','b','b','b','b','b','a','b'),
             z = c(1,1,1,1,1,1,2,2,3,3))

df2 <- df %>%
  group_by(z, y) %>%
  summarise(n = n()) %>%
  group_by(z) %>%
  mutate(z2 = if_else(row_number() == 1, as.character(z), " ")) %>%
  ungroup() %>%
  transmute(z = z2, y, n)

df2 %>%
  knitr::kable()
I'm having trouble thinking of ways to do this that don't involve grouping by the z column and finding the first row. Unfortunately, that means you need to add a couple of extra steps, because a grouping variable can't be modified inside the mutate() call.