In my code below, I want to first remove any Variable for which I do not have any value (i.e., F should be removed because all of its values are NA). Then I am trying to find the accumulated values of each Variable. I tried the following code, but I am not getting anything out of it.
library(tidyverse)
set.seed(50)
DF <- data.frame(Days = 1:5, A = runif(5,0,3), S = runif(5,1,6), F = matrix(NA, 5,1), C = runif(5,2,4))
DF_1 <- gather(DF, -Days, key = "Variable", value = "Value")
DF_2 <- DF_1 %>%
  filter(Variable == "NA") %>%
  mutate(cumulative_Sum = cumsum(Value))
Expected output
For Variable A, I should get something like the result below (and similarly for the other variables):
> A <- cumsum(DF$A)
> A
[1] 2.126181 3.439161 4.039176 6.340374 7.879859
After grouping by 'Variable', filter out the groups where 'Value' is all NA, then take the cumulative sum of 'Value' after replacing each NA with 0:
library(dplyr)
library(tidyr)
DF_1 %>%
  group_by(Variable) %>%
  filter(!all(is.na(Value))) %>%
  mutate(Value = cumsum(replace_na(Value, 0)))
# A tibble: 15 x 3
# Groups: Variable [3]
# Days Variable Value
# <int> <chr> <dbl>
# 1 1 A 2.13
# 2 2 A 3.44
# 3 3 A 4.04
# 4 4 A 6.34
# 5 5 A 7.88
# 6 1 S 1.22
# 7 2 S 5.72
# 8 3 S 9.95
# 9 4 S 11.2
#10 5 S 12.7
#11 1 C 2.78
#12 2 C 5.32
#13 3 C 8.60
#14 4 C 10.8
#15 5 C 13.3
If we use the 'wide' format 'DF', then use mutate_at:
DF %>%
  mutate_at(-1, cumsum)
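In dplyr 1.0.0 and later, across() supersedes mutate_at(); a minimal sketch of the equivalent call (the all-NA F column simply stays NA, just as with mutate_at):
# dplyr >= 1.0.0: across() replaces the superseded mutate_at()
DF %>%
  mutate(across(-Days, cumsum))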
Let's say df presents an aggregated metric from an A/B test with groups A and B. x is, for example, the number of page visits, and n is the number of users with that number of visits. (In reality there are far more users and the differences are small.) Note that the number of users per group differs.
library(tidyverse)
df <- bind_rows(
  tibble(group = "A", x = rpois(100, 1)),
  tibble(group = "B", x = rpois(200, 2))
) %>%
  count(group, x)
I want to compare tiles of users. By a tile, I mean the users in group A that share the same x value.
For example, if 34.17% of users in group A have the value 0, I want to compare them to the average x of the lowest 34.17% of users in group B. Next, for example, users with 1 visit in group A fall between the 34.17th and 74.8th percentiles; I want to compare them with users in the same percentile range (but it should be more precise) in group B. And so on.
Here's my try:
n_fake <- 1000
df_agg_per_imp <- df %>%
  group_by(group) %>%
  mutate(
    p_max = n_fake * cumsum(n) / sum(n),
    p_min = lag(p_max, default = 0),
    p = map2(p_min + 1, p_max, seq)
  ) %>%
  ungroup()
df_agg_per_imp %>%
  unnest(p) %>%
  pivot_wider(id_cols = p, names_from = group, values_from = x) %>%
  group_by(A) %>%
  summarise(
    p_min = min(p) / n_fake,
    p_max = max(p) / n_fake,
    rel_uplift = mean(B) / mean(A)
  )
#> # A tibble: 6 × 4
#> A p_min p_max rel_uplift
#> <int> <dbl> <dbl> <dbl>
#> 1 0 0.001 0.34 Inf
#> 2 1 0.341 0.74 1.92
#> 3 2 0.741 0.91 1.57
#> 4 3 0.911 0.96 1.33
#> 5 4 0.961 0.99 1.21
#> 6 5 0.991 1 1.2
What I don't like is that I have to create a row for each user (and this could be millions) to get the results I want. Is there a simpler/better way to do it?
You may be able to do something like this:
Extend the creation of your initial frame to get the proportion in A and B, and pivot wider:
set.seed(123)
df <- bind_rows(
  tibble(group = "A", x = rpois(100, 1)),
  tibble(group = "B", x = rpois(200, 2))
) %>%
  count(group, x) %>%
  group_by(group) %>%
  mutate(prop = n / sum(n)) %>%
  pivot_wider(id_cols = x, names_from = group, values_from = prop)
With the seed above, this gives you a frame like this:
# A tibble: 7 x 3
x A B
<int> <dbl> <dbl>
1 0 0.35 0.095
2 1 0.38 0.33
3 2 0.21 0.285
4 3 0.04 0.14
5 4 0.02 0.085
6 5 NA 0.055
7 6 NA 0.01
Create a function that estimates the rel_uplift, while also returning an updated set of group B proportions and group B values (i.e. x values):
f <- function(a, aval, bvec, bvals) {
  # B entries whose cumulative proportion reaches the size a of the A tile
  cindex = which(cumsum(bvec) >= a)
  if (length(cindex) == 0) bindex = seq_along(bvec)
  else bindex = 1:min(cindex)
  # proportion of the last consumed B entry left over for the next tile
  rem = sum(bvec[bindex]) - a
  # x mass of exactly proportion a of B users (last entry only partially counted)
  bmean = sum(bvals[bindex] * (bvec[bindex] - c(rep(0, length(bindex) - 1), rem)))
  if (length(bindex) > 1) {
    # drop the fully consumed entries, keeping the partially consumed one
    if (rem != 0) bindex = bindex[1:(length(bindex) - 1)]
    bvec = bvec[-bindex]
    bvals = bvals[-bindex]
  }
  bvec[1] = rem
  # rel_uplift = (bmean / a) / aval, i.e. mean x in the B tile over the A value
  list("rel_uplift" = bmean / (a * aval), "bvec" = bvec, "bvals" = bvals)
}
Initialize a data frame, and a list called fres which contains the initial bvec and initial bvals:
result = data.frame()
fres = list("bvec" = df$B, "bvals" = df$x)
Use a for loop over the values of df$A, each time getting the rel_uplift and preparing an updated set of bvec and bvals to be used in the next call:
for (a in df %>% filter(!is.na(A)) %>% pull(A)) {
  x = df %>% filter(A == a) %>% pull(x)
  fres = f(a, x, fres[["bvec"]], fres[["bvals"]])
  result = rbind(result, data.frame(x = x, A = a, rel_uplift = fres[["rel_uplift"]]))
}
result
x A rel_uplift
1 0 0.35 Inf
2 1 0.38 1.855263
3 2 0.21 1.726190
4 3 0.04 1.666667
5 4 0.02 1.375000
If I understand right, you want to compare counts by two parameters simultaneously, i.e. by $group and by $x.
From the example in the initial post I see that not all values of $x may be available for each group.
Summarizing by two co-variables can be done with base R.
Here is a simple function (assuming that you're always looking at $group and $x):
countnByGroup <- function(xx, asPercent = FALSE) {
  lev <- unique(xx$x)
  grp <- unique(xx$group)
  out <- sapply(grp, function(x) {
    z <- rep(NA, length(lev))
    names(z) <- lev
    w <- which(xx$group == x)
    if (length(w) > 0) z[match(xx$x[w], lev)] <- xx$n[w]
    z
  })
  if (asPercent) out <- 100 * apply(out, 2, function(x) x / sum(x, na.rm = TRUE))
  out
}
Note that in the function above the main argument is called 'xx' to avoid confusion with $x.
df # produced using the code from your example
# A tibble: 13 x 3
# group x n
# <chr> <int> <int>
# 1 A 0 36
# 2 A 1 38
# 3 A 2 19
# 4 A 3 6
# 5 A 4 1
# 6 B 0 27
# 7 B 1 44
# 8 B 2 55
# 9 B 3 44
#10 B 4 21
#11 B 5 6
#12 B 6 2
#13 B 8 1
One gets:
countnByGroup(df)
# A B
#0 36 27
#1 38 44
#2 19 55
#3 6 44
#4 1 21
#5 NA 6
#6 NA 2
#8 NA 1
## and
countnByGroup(df, asPercent=TRUE)
# A B
#0 36 13.5
#1 38 22.0
#2 19 27.5
#3 6 22.0
#4 1 10.5
#5 NA 3.0
#6 NA 1.0
#8 NA 0.5
As long as you don't apply any rounding, the results are as precise as they get.
Since group A happens to contain exactly 100 users, the percent values for A come out as whole numbers.
Another interesting option may be to consider two-way tables in R using table().
But in that case you need your entries as separate rows, not already aggregated into counts as in your example above.
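For illustration, a minimal sketch of that table() route, assuming tidyr's uncount() to expand the counts back to one row per user:
library(tidyr)
# expand counts to one row per user, then cross-tabulate x by group
df %>%
  uncount(n) %>%
  with(table(x, group))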
I have a df with a repeated sequence in the first column, and I want to take the values that share the same number (in column 1) and create columns from them.
Note: my df has 25,502,100 rows and the sequence is formed by 845 values.
See one simple example of my df below:
df <- data.frame(x = c(1,2,3,4,1,2,3,4), y = c(0.1,-2,-3,1,0,10,6,9))
I would like a function to transform this df into:
df_new
x y z
1 1 0.1 0
2 2 -2.0 10
3 3 -3.0 6
4 4 1.0 9
Does anyone have a solution?
An option with pivot_wider:
library(tidyr)
library(data.table)
library(dplyr)
df %>%
  mutate(rn = c('y', 'z')[rowid(x)]) %>%
  pivot_wider(names_from = rn, values_from = y)
Output:
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 0.1 0
#2 2 -2 10
#3 3 -3 6
#4 4 1 9
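Since the real data repeats a sequence of 845 values, hard-coding c('y', 'z') won't scale. A minimal sketch of a generic variant (assuming every x value occurs the same number of times) builds the names from rowid() directly; the columns come out as col1, col2, ... rather than y and z:
# generic sketch: one output column per repetition of the sequence
df %>%
  mutate(rn = paste0("col", rowid(x))) %>%
  pivot_wider(names_from = rn, values_from = y)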
Consider the following data frame obtained after a cbind operation on two lists
> fl
x meanlist
1 1 48.5
2 2 32.5
3 3 28.0
4 4 27.0
5 5 25.5
6 6 20.5
7 7 27.0
8 8 24.0
class_median <- list(0, 15, 25, 35, 45)
class_list <- list(0:10, 10:20, 20:30, 30:40, 40:50)
The values in class_median represent the classes -10 to +10, 10 to 20, 20 to 30, etc.
First, I am trying to group the values in fl$meanlist according to the classes in class_list. Second, I am trying to return, for each class, the one value closest to the class median, as follows:
> fl_subset
x meanlist cm
1 1 48.5 45
2 2 32.5 35
3 5 25.5 25
I tried using loops to compare, but the code becomes long and unmanageable, and the result is not correct.
Here's an approach with dplyr:
library(dplyr)
# do a little prep--name classes, extract breaks, put medians in a data frame
names(class_list) = letters[seq_along(class_list)]
breaks = c(min(class_list[[1]]), sapply(class_list, max))
med_data = data.frame(median = unlist(class_median), class = names(class_list))
fl %>%
# assign classes
mutate(class = cut(meanlist, breaks = breaks, labels = names(class_list))) %>%
# get medians
left_join(med_data) %>%
# within each class...
group_by(class) %>%
# keep the row with the smallest absolute difference to the median
slice(which.min(abs(meanlist - median))) %>%
# sort in original order
arrange(x)
# Joining, by = "class"
# # A tibble: 3 x 4
# # Groups: class [3]
# x meanlist class median
# <int> <dbl> <fct> <dbl>
# 1 1 48.5 e 45
# 2 2 32.5 d 35
# 3 5 25.5 c 25
One approach utilizing purrr and dplyr could be:
map2(.x = class_list,
     .y = class_median,
     ~ fl %>%
       mutate(cm = between(meanlist, min(.x), max(.x))) %>%
       filter(any(cm)) %>%
       mutate(cm = cm * .y)) %>%
  bind_rows(.id = "ID") %>%
  group_by(ID) %>%
  slice(which.min(abs(meanlist - cm)))
ID x meanlist cm
<chr> <int> <dbl> <dbl>
1 3 5 25.5 25
2 4 2 32.5 35
3 5 1 48.5 45
I am trying to select all rows in a repeated measures dataset that belong to a randomly selected group of people. I am trying to do it entirely in the tidyverse (for my own edification) but find myself having to fall back on base R functions. Here is how I do it with a combination of base R and dplyr commands.
set.seed(145)
df <- data.frame(id = rep(letters[1:10], each = 4),
                 score = rnorm(40))
ids <- sample(unique(df$id), 3)
smallDF <- df %>% dplyr::filter(id %in% ids)
smallDF
# id score
# 1 a 0.6869129
# 2 a 1.0663631
# 3 a 0.5367006
# 4 a 1.9060287
# 5 c 1.1677516
# 6 c 0.7926794
# 7 c -1.2135038
# 8 c -1.0056141
# 9 d 0.2085696
# 10 d 0.4461776
# 11 d -0.6208060
# 12 d 0.4413429
I can sample randomly from the id identifier using dplyr...
df %>% distinct(id) %>% sample_n(3)
# id
# 1 e
# 2 c
# 3 b
...but the fact that the output is a data frame/tibble makes it difficult for me to get to the next step, where I filter the original df by the randomly selected id identifiers.
Can anyone help?
You can do a left_join with the original df to get all the rows for the randomly selected ids:
library(dplyr)
set.seed(123)
df %>% distinct(id) %>% sample_n(3) %>% left_join(df)
#Joining, by = "id"
# id score
#1 b 1.063
#2 b 1.370
#3 b 0.528
#4 b 0.403
#5 f 0.343
#6 f -1.286
#7 f -0.534
#8 f 0.597
#9 c 1.168
#10 c 0.793
#11 c -1.214
#12 c -1.006
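As a hedged alternative, a semi_join keeps the rows of df whose id appears in the sampled set without carrying any columns across:
# semi_join() filters df by matching ids; no columns are joined in
df %>% semi_join(df %>% distinct(id) %>% sample_n(3), by = "id")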
Or sample inline and filter directly:
# unique() instead of levels(), so this works whether id is a factor or a character column
df %>% filter(id %in% sample(unique(id), 3))
I'm sure this question has been asked before, but I can't find the answer.
Here's my data:
df <- data.frame(group=c("a","a","a","b","b","c"), value=c(1,2,3,4,5,7))
df
#> group value
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 4
#> 5 b 5
#> 6 c 7
I'd like a 3rd column which has the sum of "value" for each "group", like so:
#> group value group_sum
#> 1 a 1 6
#> 2 a 2 6
#> 3 a 3 6
#> 4 b 4 9
#> 5 b 5 9
#> 6 c 7 7
How can I do this with dplyr?
Using dplyr:
df %>%
  group_by(group) %>%
  mutate(group_sum = sum(value))
Nobody mentioned data.table yet:
library(data.table)
dat <- data.table(df)
dat[, sums := sum(value), by = group]
Which transforms dat into:
group value sums
1: a 1 6
2: a 2 6
3: a 3 6
4: b 4 9
5: b 5 9
6: c 7 7
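As a side note, setDT() converts a data.frame by reference and avoids the copy that data.table(df) makes; a minimal sketch:
# setDT() converts df in place, then adds the grouped sum by reference
setDT(df)[, sums := sum(value), by = group]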
left_join(
df,
df %>% group_by(group) %>% summarise(group_sum = sum(value)),
by = c("group")
)
I don't know how to do it in one step, but
df_avg <- df %>% group_by(group) %>% summarize(group_sum=sum(value))
df %>% full_join(df_avg,by="group")
works. (This is basically equivalent to @KeqiangLi's answer.)
ave(), from base R, is useful here too:
df %>% mutate(group_sum=ave(value,group,FUN=sum))
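The same ave() call also works as a pure base R one-liner, without dplyr:
# base R only: ave() returns the per-group sum aligned to each row
df$group_sum <- ave(df$value, df$group, FUN = sum)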