dplyr sample by groups of values

I want to sample whole groups of values with dplyr, keeping every row that belongs to each sampled group.
What I tried:
id <- c(1, 1, 1, 2, 3, 3, 4, 5, 5, 5, 6, 6, 7, 8, 8, 8, 8, 8)
id <- as.data.frame(id)
sample <- id %>%
  group_by(id) %>%
  sample_n(2, replace = FALSE) %>%
  ungroup(id)
sample
Expected result (n = 2 sampled groups):
1, 1, 1, 2
or
1, 1, 1, 3, 3
or
5, 5, 5, 6, 6
etc.
I have got an error:
Error: `size` must be less or equal than 1 (size of data), set `replace` = TRUE to use sampling with replacement

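Perhaps this helps. After group_by(id), sample_n(2) tries to draw 2 rows inside each group, which fails for any group that has a single row. To sample groups instead, draw 2 distinct id values and join them back to the original data:
id %>%
  distinct(id) %>%
  sample_n(2, replace = FALSE) %>%
  inner_join(id, ., by = "id")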
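The same idea can be sketched in base R without a join: draw the group values first, then keep every row whose id was drawn. The names picked and sampled are illustrative, and set.seed is only there to make the draw repeatable.

```r
id <- data.frame(id = c(1, 1, 1, 2, 3, 3, 4, 5, 5, 5, 6, 6, 7, 8, 8, 8, 8, 8))

set.seed(1)  # only for a repeatable draw
# Draw 2 of the distinct group values, not 2 rows per group
picked <- sample(unique(id$id), 2)
# Keep all rows that belong to the drawn groups
sampled <- id[id$id %in% picked, , drop = FALSE]
sampled
```

Because the draw happens on unique(id$id), groups with a single row (such as 2, 4 or 7) no longer trigger the size error.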

Related

Calculate the mean after filtering and groupby

I have a large dataframe of message exchanges that looks like this:
structure(list(from = c(1, 8, 3, 3, 8, 1, 4, 5, 8, 3, 1, 8, 4,
1, 4, 8, 1, 4, 5, 8, 3, 1, 8, 1, 4, 8), to = c(8, 3, 8, 54, 3,
4, 1, 6, 7, 1, 4, 3, 8, 8, 1, 3, 4, 1, 6, 7, 1, 4, 3, 8, 1, 3
), time = c(63200, 81282, 81543, 81548, 81844, 82199, 82514,
82711, 82739, 82814, 82936, 83889, 84207, 84427, 85523, 85545,
86883, 87187, 87701, 89004, 89619, 92662, 93384, 93443, 94042,
94203), month = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6), day = c(1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 15, 15, 15, 15, 15, 15
)), class = "data.frame", row.names = c(NA, -26L))
I want to calculate the average time difference between the first and the last message someone is involved in on a given day.
That is, I want to filter the dataset by an index if it is present in either the "to" or the "from" column, group by day using both the month ("month") and the day of the month ("day"), calculate the difference between the first and the last message in each day, and then average those differences.
At the end I should get a dataframe with the indexes and the daily average for each index, like this:
  index      avg
1     1 9429.333
2     3 2590.667
3     4 1982.000
4     8 7338.000
The value for 1 is the average of the differences between the max and min of time for each day: 19614 (day 1 of month 2), 4251 (day 2 of month 4) and 4423 (day 15 of month 6). Note: when the difference is equal to 0, that day should be excluded from the average, as with day 3 of month 4 for index 8.
Right now I'm trying this, but it does not work:
# Function to calculate the difference; in other cases I need to use other functions of my own
dur <- function(x) max(x) - min(x)
# index holds the names of the indexes for which I want the calculation
index <- c(1, 3, 4, 8)
names(index) <- index
index %>%
  map_dfr(~ df %>%
            filter(from == .x | to == .x) %>%
            group_by(month, day) %>%
            summarize(result = dur(time)) %>%
            summarize(mdur = mean(result)),
          .id = "index")
The one below works to calculate the time difference over all messages, but I also need the daily average:
index %>%
  map_dfr(~ df %>%
            filter(from == .x | to == .x) %>%
            summarize(result = dur(time)),
          .id = "index")

library(dplyr)
df = data.frame(from = c(1, 8, 3, 3, 8, 1, 4, 5, 8, 3, 1, 8, 4, 1, 4, 8, 1, 4, 5, 8, 3, 1, 8, 1, 4, 8, 2 ,3),
to = c(8, 3, 8, 54, 3, 4, 1, 6, 7, 1, 4, 3, 8, 8, 1, 3, 4, 1, 6, 7, 1, 4, 3, 8, 1, 3, 5, 8),
time = c(63200, 81282, 81543, 81548, 81844, 82199, 82514, 82711, 82739, 82814, 82936, 83889, 84207, 84427, 85523, 85545, 86883, 87187, 87701, 89004, 89619, 92662, 93384, 93443, 94042, 94203, 12402, 24932),
month = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 9, 9),
day = c(1, 1, 1, 15, 15, 22, 22, 22, 25, 25, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 15, 15, 15, 18, 18, 18, 9, 9))
df2 <- df %>%
  group_by(day, month) %>%
  summarise(f = first(time), l = last(time)) %>%
  mutate(diff = l - f) %>%
  group_by(month) %>%
  # average over the days whose difference is not 0
  summarise(mt = sum(diff) / sum(diff != 0))
This gives:
> df2
# A tibble: 4 × 2
  month      mt
  <dbl>   <dbl>
1     2  4806.5
2     4  1834.5
3     6  2262.5
4     9 12530.0
Is this what you are after?
Although you mention a person, your data does not include a person column, so I assume all of it comes from the same person. If you have multiple people, apply this code to each person separately.
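For the per-index averages the question asks for, one possible base-R sketch follows. It assumes an index counts as involved in a message when it appears in either from or to, uses the question's 26-row data (month and day are written with rep() for brevity), and drops zero-length days before averaging. The helper name daily_avg is hypothetical.

```r
df <- data.frame(
  from = c(1, 8, 3, 3, 8, 1, 4, 5, 8, 3, 1, 8, 4,
           1, 4, 8, 1, 4, 5, 8, 3, 1, 8, 1, 4, 8),
  to = c(8, 3, 8, 54, 3, 4, 1, 6, 7, 1, 4, 3, 8,
         8, 1, 3, 4, 1, 6, 7, 1, 4, 3, 8, 1, 3),
  time = c(63200, 81282, 81543, 81548, 81844, 82199, 82514,
           82711, 82739, 82814, 82936, 83889, 84207, 84427,
           85523, 85545, 86883, 87187, 87701, 89004, 89619,
           92662, 93384, 93443, 94042, 94203),
  month = rep(c(2, 4, 6), times = c(10, 10, 6)),
  day = c(rep(1, 10), rep(2, 9), 3, rep(15, 6))
)

dur <- function(x) max(x) - min(x)

# Hypothetical helper: mean of the non-zero daily results for one index
daily_avg <- function(i, data, fun = dur) {
  # messages the index sent or received
  msgs <- data[data$from == i | data$to == i, ]
  # one value per (month, day) combination
  per_day <- aggregate(time ~ month + day, data = msgs, FUN = fun)$time
  # days with a difference of 0 are excluded from the average
  mean(per_day[per_day != 0])
}

index <- c(1, 3, 4, 8)
result <- data.frame(index = index,
                     avg = sapply(index, daily_avg, data = df))
result
```

On the question's data this reproduces the expected averages (9429.333, 2590.667, 1982 and 7338). Passing a different function as fun covers the case where dur is swapped for another summary.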

Subset every 5 rows by group?

I have a dataset with multiple groups and want to keep, within each group, the first row plus every row at a multiple of 5 (so rows 1, 5, 10, 15, etc. within every group).
Right now my dataset has a column named "Group ID" and a few other columns (e.g. time, date, etc), but nothing indicating row numbers of any kind.
Any help would be appreciated! I was thinking maybe something compatible with dplyr? I was trying things using the function slice but no luck so far.

You need to create a row index within each group and then just use filter:
library(dplyr)
df <- data.frame(id = c(1, 2, 1, 2, 2, 3, 4, 3, 1, 2, 4, 4, 4, 3, 1, 1, 1, 2, 2),
                 b = c(6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6))
df <- df %>%
  group_by(id) %>%
  # position of each row within its group
  mutate(group_index = row_number()) %>%
  # keep the first row and every 5th row of each group
  filter(group_index == 1 | group_index %% 5 == 0)
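If dplyr is not available, the same within-group index can be sketched in base R with ave(). This is an alternative under the same assumptions as above (rows are counted in their original order inside each id); the name kept is illustrative.

```r
df <- data.frame(id = c(1, 2, 1, 2, 2, 3, 4, 3, 1, 2, 4, 4, 4, 3, 1, 1, 1, 2, 2),
                 b = 6)

# Position of each row within its id group, in original row order
group_index <- ave(seq_along(df$id), df$id, FUN = seq_along)
# Keep the first row and every 5th row of each group
kept <- df[group_index == 1 | group_index %% 5 == 0, ]
kept
```

Groups with fewer than 5 rows contribute only their first row.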

R tidyverse warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0

I get a warning when selecting rows depending on the median of one of the variables in a tibble. See the details and the warning below. I wonder if there is a more tidyverse solution to this.
Example data:
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
z <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
xyz <- tibble(x, y, z)
group1 <- xyz[xyz[2] < stats::median(purrr::as_vector(xyz$y), na.rm = TRUE), ]
Warning message:
The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
Convert to a vector.
Thanks in advance

library(dplyr)
xyz %>%
  filter(y < stats::median(y, na.rm = TRUE))
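The warning comes from the row index itself: comparing a one-column selection such as xyz[2] with a number yields a logical matrix, whereas xyz[[2]] (or xyz$y) yields a plain logical vector. A minimal sketch of the difference, using a plain data.frame as a stand-in for the tibble (an assumption made to keep the example dependency-free):

```r
xyz <- data.frame(x = 1:10, y = 1:10, z = 1:10)
m <- stats::median(xyz$y, na.rm = TRUE)

mask_matrix <- xyz[2] < m    # single-bracket selection: one-column logical matrix
mask_vector <- xyz[[2]] < m  # double-bracket selection: logical vector

is.matrix(mask_matrix)
is.matrix(mask_vector)

# Indexing rows with the vector avoids passing a matrix as `i`
group1 <- xyz[mask_vector, ]
group1
```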

Make boxplots of columns in R

I am a beginner in R, and have a question about making boxplots of columns in R. I just made a dataframe:
SUS <- data.frame(RD = c(4, 3, 4, 1, 2, 2, 4, 2, 4, 1), TK = c(4, 2, 4, 2, 2, 2, 4, 4, 3, 1),
WK = c(3, 2, 4, 1, 3, 3, 4, 2, 4, 2), NW = c(2, 2, 4, 2, NA, NA, 5, 1, 4, 2),
BW = c(3, 2, 4, 1, 4, 1, 4, 1, 5, 1), EK = c(2, 4, 3, 1, 2, 4, 2, 2, 4, 2),
AN = c(3, 2, 4, 2, 3, 3, 3, 2, 4, 2))
rownames(SUS) <- c('Pleasant to use', 'Unnecessary complex', 'Easy to use',
'Need help of a technical person', 'Different functions well integrated','Various function incohorent', 'Imagine that it is easy to learn',
'Difficult to use', 'Confident during use', 'Long duration untill I could work with it')
I have tried a number of times, but I have not succeeded in making boxplots for all rows. Can someone help me out here?

You can also do it using the tidyverse:
library(tidyverse)
SUS %>%
  # create a new column and save the row names in it
  mutate(variable = row.names(.)) %>%
  # convert the data from wide to long
  tidyr::gather("var", "value", 1:7) %>%
  # plot it using ggplot2
  ggplot(., aes(x = variable, y = value)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1))

As @blondeclover says in the comment, boxplot() should work fine for doing a boxplot of each column.
If what you want is a boxplot for each row, then actually your current rows need to be your columns. If you need to do this, you can transpose the data frame before plotting:
SUS.new <- as.data.frame(t(SUS))
boxplot(SUS.new)

Returning index of vector

I have a vector that looks like this:
c(1,1,1,1,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5..)
I want to get the indices at which the element changes, i.e. (1,5,9,...)
I know how to do it with a for loop, but I am looking for a faster way as my vector is very large.
Thanks,

Try
which(c(TRUE,diff(v1)!=0))
Or
match(unique(v1), v1)
Or if the vector is sorted
head(c(1, findInterval(unique(v1), v1)+1),-1)
data
v1 <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4,
4, 4, 5, 5, 5, 5, 5)
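One caveat worth checking before picking between these: diff() marks the start of every run, while match(unique(v1), v1) marks only the first occurrence of each value. They agree on v1 above, but not when a value comes back later. A small sketch with a hypothetical vector v2:

```r
v2 <- c(1, 1, 2, 2, 1, 1)  # hypothetical vector in which the value 1 comes back

# Start of every run of equal values
run_starts <- which(c(TRUE, diff(v2) != 0))
# First position at which each distinct value appears
first_seen <- match(unique(v2), v2)

run_starts  # 1 3 5
first_seen  # 1 3
```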

Another fun approach:
v1 <- c(1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 8)
head(c(1, cumsum(rle(v1)$lengths) + 1), -1)
Or if you have magrittr then it can become
library(magrittr)
v1 %>%
  rle %>%
  .$lengths %>%
  cumsum %>%
  add(1) %>%
  c(1, .) %>%
  head(-1)
Result: 1 3 4 5 7 8 9 12
Might look weird but it's fun to think that through :)
Explanation: cumsum(rle(v1)$lengths) gets you almost all the way there, but it gives the index where each run ends rather than where the next run starts. That is why we add one to each element, prepend the index 1 for the first run, and remove the last element, which points one past the end of the vector.
