My data is grouped, but I would like to split each group in two, as illustrated in the example below. It doesn't really matter what the content of group_half is; it can be anything like 1a/1b or 1.1/1.2. Any recommendations on how to do this using dplyr? Thanks!
col_1 <- c(23,31,98,76,47,65,23,76,3,47,54,56)
group <- c(1,1,1,1,2,2,2,2,3,3,3,3)
group_half <- c(1.1, 1.1, 1.2, 1.2, 2.1, 2.1, 2.2, 2.2, 3.1, 3.1, 3.2, 3.2)
df1 <- data.frame(col_1, group, group_half)
# col_1 group group_half
# 23 1 1.1
# 31 1 1.1
# 98 1 1.2
# 76 1 1.2
# 47 2 2.1
# 65 2 2.1
# 23 2 2.2
# 76 2 2.2
# 3 3 3.1
# 47 3 3.1
# 54 3 3.2
# 56 3 3.2
Here are two options.
If you always have an even number of rows in each group:
library(dplyr)
df1 %>%
  group_by(group) %>%
  mutate(group_half = paste(group, rep(1:2, each = n()/2), sep = '.')) %>%
  ungroup()
# col_1 group group_half
# <dbl> <dbl> <chr>
# 1 23 1 1.1
# 2 31 1 1.1
# 3 98 1 1.2
# 4 76 1 1.2
# 5 47 2 2.1
# 6 65 2 2.1
# 7 23 2 2.2
# 8 76 2 2.2
# 9 3 3 3.1
#10 47 3 3.1
#11 54 3 3.2
#12 56 3 3.2
This will work irrespective of the number of rows in each group:
df1 %>%
  group_by(group) %>%
  mutate(group_half = paste(group, as.integer(row_number() > n()/2) + 1, sep = '.')) %>%
  ungroup()
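As a quick check of the second approach, here is a sketch on a hypothetical group with an odd number of rows (df_odd is made up for illustration): the first floor(n/2) rows land in half .1 and the remainder in half .2.

```r
library(dplyr)

# Hypothetical 3-row group to show the odd-sized case
df_odd <- data.frame(col_1 = c(5, 9, 2), group = c(1, 1, 1))

res <- df_odd %>%
  group_by(group) %>%
  mutate(group_half = paste(group, as.integer(row_number() > n()/2) + 1, sep = '.')) %>%
  ungroup()

res$group_half
# "1.1" "1.2" "1.2"
```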
I have the data frame below and I want to keep everything unchanged except for the last cell, which I want rounded to the nearest integer. Is there any way to target one or more specific cells using dplyr?
df <- data.frame(id = c(rep(101, 4), rep(202, 2), "tot"),
                 status = c("a", "b", "c", "d", "a", "b", "cccc"),
                 wt = c(100, 200, 100, 105, 20, 22, 10000),
                 ht = c(5.3, 5.2, 5, 5.1, 4.3, 4.2, 4.9))
> df
id status wt ht
1 101 a 100 5.3
2 101 b 200 5.2
3 101 c 100 5.0
4 101 d 105 5.1
5 202 a 20 4.3
6 202 b 22 4.2
7 tot cccc 10000 4.9
My desired output is:
df[df$id=="tot", 4] <- round(df[df$id=="tot", 4])
> df
id status wt ht
1 101 a 100 5.3
2 101 b 200 5.2
3 101 c 100 5.0
4 101 d 105 5.1
5 202 a 20 4.3
6 202 b 22 4.2
7 tot cccc 10000 5.0
A possible solution, based on dplyr:
library(dplyr)
df %>%
  mutate(ht = if_else(row_number() == n(), round(ht, 0), ht))
#> id status wt ht
#> 1 101 a 100 5.3
#> 2 101 b 200 5.2
#> 3 101 c 100 5.0
#> 4 101 d 105 5.1
#> 5 202 a 20 4.3
#> 6 202 b 22 4.2
#> 7 tot cccc 10000 5.0
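If the "tot" row is not guaranteed to be the last row, matching on id instead of row position is more robust. A sketch of the same mutate keyed on the id value:

```r
library(dplyr)

df <- data.frame(id = c(rep(101, 4), rep(202, 2), "tot"),
                 status = c("a", "b", "c", "d", "a", "b", "cccc"),
                 wt = c(100, 200, 100, 105, 20, 22, 10000),
                 ht = c(5.3, 5.2, 5, 5.1, 4.3, 4.2, 4.9))

# Round ht only on the row(s) whose id is "tot"
res <- df %>%
  mutate(ht = if_else(id == "tot", round(ht), ht))

res$ht[res$id == "tot"]
# 5
```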
So I have a dataframe (my.df) which I have grouped by the variable "strat". An example of what it looks like is below; I've simplified my.df for this example since the real one is quite large. What I want to do next is draw a simple random sample from each group. If I wanted to draw 5 observations from each group I would use this code:
new_df <- my.df %>% group_by(strat) %>% sample_n(5)
However, I have a different specified sample size that I want to sample for each group. I have these sample sizes in a vector nj.
nj <- c(3, 4, 2)
So ideally, I would want 3 observations from my first stratum, 4 observations from my second stratum and 2 observations from my last stratum. I'm not sure if I can sample by group using each group's own sample size (without having to write out "sample" however many times I need to)? Thanks in advance!
my.df looks like:
var1 var2 strat
15 3 1
13 5 3
8 6 2
12 70 3
11 10 1
14 4 2
You can use stratified from my "splitstackshape" package.
Here's some sample data:
set.seed(1)
my.df <- data.frame(var1 = sample(100, 20, TRUE),
                    var2 = runif(20),
                    strat = sample(3, 20, TRUE))
table(my.df$strat)
#
# 1 2 3
# 5 9 6
Here's how you can use stratified:
library(splitstackshape)
# nj needs to be a named vector
nj <- c("1" = 3, "2" = 4, "3" = 2)
stratified(my.df, "strat", nj)
# var1 var2 strat
# 1: 72 0.7942399 1
# 2: 39 0.1862176 1
# 3: 50 0.6684667 1
# 4: 21 0.2672207 2
# 5: 69 0.4935413 2
# 6: 91 0.1255551 2
# 7: 78 0.4112744 2
# 8: 7 0.3403490 3
# 9: 27 0.9347052 3
table(.Last.value$strat)
#
# 1 2 3
# 3 4 2
Since your sample data is too small to draw these sample sizes from, let us consider this example on the iris dataset:
library(tidyverse)
nj <- c(3, 5, 6)
set.seed(1)
iris %>% group_split(Species) %>% map2_df(nj, ~sample_n(.x, size = .y))
# A tibble: 14 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 4.6 3.1 1.5 0.2 setosa
2 4.4 3 1.3 0.2 setosa
3 5.1 3.5 1.4 0.2 setosa
4 6 2.7 5.1 1.6 versicolor
5 6.3 2.5 4.9 1.5 versicolor
6 5.8 2.6 4 1.2 versicolor
7 6.1 2.9 4.7 1.4 versicolor
8 5.8 2.7 4.1 1 versicolor
9 6.4 2.8 5.6 2.2 virginica
10 6.9 3.2 5.7 2.3 virginica
11 6.2 3.4 5.4 2.3 virginica
12 6.9 3.1 5.1 2.3 virginica
13 6.7 3 5.2 2.3 virginica
14 7.2 3.6 6.1 2.5 virginica
You can bring the nj values into the dataframe and then use sample_n by group:
library(dplyr)
df %>%
  mutate(nj = nj[strat]) %>%
  group_by(strat) %>%
  sample_n(size = min(first(nj), n()))
Note that the above works because strat has the values 1, 2, 3. For a general solution when the groups do not have such values, you could use:
df %>%
  mutate(nj = nj[match(strat, unique(strat))]) %>%
  group_by(strat) %>%
  sample_n(size = min(first(nj), n()))
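To see the match() lookup at work, here is a sketch on made-up data with character groups (df_chr and its nj values are hypothetical): nj is paired with groups in their order of first appearance.

```r
library(dplyr)

df_chr <- data.frame(var1 = 1:6, strat = c("a", "c", "b", "c", "a", "b"))
nj <- c(2, 1, 1)  # a -> 2, c -> 1, b -> 1 (order of first appearance)

set.seed(1)
res <- df_chr %>%
  mutate(nj = nj[match(strat, unique(strat))]) %>%  # look up each row's sample size
  group_by(strat) %>%
  sample_n(size = min(first(nj), n())) %>%          # never ask for more rows than exist
  ungroup()

table(res$strat)
# a: 2, b: 1, c: 1
```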
I have the following dataframe, and I would prefer a dplyr solution. For each zone I want at least two values; values > 4.0 are preferred. Therefore, for zone 10 all values (being > 4.0) are kept. For zone 20, the top two values are picked, and similarly for zone 30.
zone <- c(rep(10,4), rep(20, 4), rep(30, 4))
set.seed(1)
value <- c(4.5,4.3,4.6, 5,5, rep(3,7)) + round(rnorm(12, sd = 0.1),1)
df <- data.frame(zone, value)
> df
zone value
1 10 4.4
2 10 4.3
3 10 4.5
4 10 5.2
5 20 5.0
6 20 2.9
7 20 3.0
8 20 3.1
9 30 3.1
10 30 3.0
11 30 3.2
12 30 3.0
The desired output is as follows:
> df
zone value
1 10 4.4
2 10 4.3
3 10 4.5
4 10 5.2
5 20 5.0
6 20 3.1
7 30 3.1
8 30 3.2
I thought of using top_n but it picks the same number for each zone.
You could dynamically calculate n in top_n:
library(dplyr)
df %>% group_by(zone) %>% top_n(max(sum(value > 4), 2), value)
# zone value
# <dbl> <dbl>
#1 10 4.4
#2 10 4.3
#3 10 4.5
#4 10 5.2
#5 20 5
#6 20 3.1
#7 30 3.1
#8 30 3.2
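Note that top_n() keeps ties, so it can occasionally return more rows per zone than requested. If exact counts matter, the same cutoff can be sketched with filter() and row_number(), which breaks ties by position (the data below recreates the printed df values):

```r
library(dplyr)

zone <- c(rep(10, 4), rep(20, 4), rep(30, 4))
value <- c(4.4, 4.3, 4.5, 5.2, 5.0, 2.9, 3.0, 3.1, 3.1, 3.0, 3.2, 3.0)
df <- data.frame(zone, value)

# Keep the top max(count above 4, 2) values per zone, ties broken by position
res <- df %>%
  group_by(zone) %>%
  filter(row_number(desc(value)) <= max(sum(value > 4), 2)) %>%
  ungroup()

table(res$zone)
# 10: 4, 20: 2, 30: 2
```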
Alternatively, you can do so with filter():
library(tidyverse)
df %>%
  group_by(zone) %>%
  filter(row_number(-value) <= 2 | value > 4)
I am stuck on this question: how to sum consecutive duplicate odd rows and remove all but the first row of each run. I have found how to sum all consecutive duplicate rows and remove all but the first row (link: https://stackoverflow.com/a/32588960/11323232). But for this project, I would like to sum only the consecutive duplicate odd rows, not all of the consecutive duplicate rows.
ia <- c(1, 1, 2, NA, 2, 1, 1, 1, 1, 2, 1, 2)
time <- c(4.5, 2.4, 3.6, 1.5, 1.2, 4.9, 6.4, 4.4, 4.7, 7.3, 2.3, 4.3)
a <- data.frame(ia, time)
a
ia time
1 1 4.5
2 1 2.4
3 2 3.6
5 2 1.2
6 1 4.9
7 1 6.4
8 1 4.4
9 1 4.7
10 2 7.3
11 1 2.3
12 2 4.3
which I want to turn into:
a
ia time
1 1 6.9
3 2 3.6
5 2 1.2
6 1 20.4
10 2 7.3
11 1 2.3
12 2 4.3
How should I edit the following code (from the linked answer) so that it sums consecutive duplicate odd rows and removes all but the first row of each run?
library(dplyr)
library(zoo)  # for na.locf()

result <- a %>%
  filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
  mutate(ia = na.locf(ia)) %>%
  mutate(change = ia != lag(ia, default = FALSE)) %>%
  group_by(group = cumsum(change), ia) %>%
  # this part
  summarise(time = sum(time))
One dplyr possibility could be:
a %>%
  group_by(grp = with(rle(ia), rep(seq_along(lengths), lengths))) %>%
  mutate(grp2 = ia %% 2 == 1,  # TRUE for runs of odd ia values
         time = sum(time)) %>%
  filter(!grp2 | (grp2 & row_number() == 1)) %>%
  ungroup() %>%
  select(-grp, -grp2)
ia time
<dbl> <dbl>
1 1 6.9
2 2 3.6
3 2 1.2
4 1 20.4
5 2 7.3
6 1 2.3
7 2 4.3
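The rle() expression is the part doing the grouping: it gives every run of consecutive identical ia values its own id (dplyr >= 1.1 offers consecutive_id() for the same purpose). A minimal sketch of just that step:

```r
# One id per run of consecutive identical values
ia_demo <- c(1, 1, 2, 2, 1, 1)
grp <- with(rle(ia_demo), rep(seq_along(lengths), lengths))
grp
# 1 1 2 2 3 3
```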
You could try the following using data.table:
library(data.table)
ia <- c(1,1,2,NA,2,1,1,1,1,2,1,2)
time <- c(4.5,2.4,3.6,1.5,1.2,4.9,6.4,4.4, 4.7, 7.3,2.3, 4.3)
a <- data.table(ia, time)
a[, sum(time), by=.(ia, rleid(!ia %% 2 == 0))]
Gives
## ia rleid V1
##1: 1 1 6.9
##2: 2 2 3.6
##3: NA 3 1.5
##4: 2 4 1.2
##5: 1 5 20.4
##6: 2 6 7.3
##7: 1 7 2.3
##8: 2 8 4.3
I have a data frame data with var1, .., var100. They are all numeric and have the same length.
I put them into a list: list1<-list(var1, .., var100).
Now, I'd like to duplicate the variables under extended names (var1_trunc, .., var100_trunc) while also keeping the original variables (var1, .., var100). I do not simply want to rename them, as I want to run different statistics later for var1 vs. var1_trunc and so on.
I tried:
lapply(list1, function(x){
  paste(substitute(x), "trunc", sep = "_")[x < mean(x)] <<- x
  paste(substitute(x), "trunc", sep = "_")[x >= mean(x)] <<- mean(x)
})
My problem is that the new variables (var1_trunc, .., var100_trunc) are not created.
Maybe I'm taking the wrong approach?
I don't understand why you put them in a list when you already have a dataframe, which makes it easy to perform your stats comparisons later.
As @akrun mentioned, you can use that command to change the names.
# example dataframe
df1 = data.frame(var1 = 1:5,
                 var2 = 11:15)
df1
# var1 var2
# 1 1 11
# 2 2 12
# 3 3 13
# 4 4 14
# 5 5 15
# your function
ff = function(x){ ifelse(x < mean(x), x, mean(x)) }
# create new dataset by applying function to prevous dataset
df2 = data.frame(sapply(df1, ff))
df2
# var1 var2
# 1 1 11
# 2 2 12
# 3 3 13
# 4 3 13
# 5 3 13
# change names and combine datasets
names(df2) = paste0(names(df1),"_trunc")
df_full = cbind(df1,df2)
df_full
# var1 var2 var1_trunc var2_trunc
# 1 1 11 1 11
# 2 2 12 2 12
# 3 3 13 3 13
# 4 4 14 3 13
# 5 5 15 3 13
Use the approach above as a function that updates your datasets:
# your function to update dataset
UpdateDataset = function(df1){
  ff = function(x){ ifelse(x < mean(x), x, mean(x)) }  # your function to update columns
  df2 = data.frame(sapply(df1, ff))
  names(df2) = paste0(names(df1), "_trunc")
  df_full = cbind(df1, df2)
  return(df_full)
}
}
# try a new dataset
df = data.frame(var1 = 1:10,
                var2 = 41:50)
df
# var1 var2
# 1 1 41
# 2 2 42
# 3 3 43
# 4 4 44
# 5 5 45
# 6 6 46
# 7 7 47
# 8 8 48
# 9 9 49
# 10 10 50
UpdateDataset(df)
# var1 var2 var1_trunc var2_trunc
# 1 1 41 1.0 41.0
# 2 2 42 2.0 42.0
# 3 3 43 3.0 43.0
# 4 4 44 4.0 44.0
# 5 5 45 5.0 45.0
# 6 6 46 5.5 45.5
# 7 7 47 5.5 45.5
# 8 8 48 5.5 45.5
# 9 9 49 5.5 45.5
# 10 10 50 5.5 45.5
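Since the rest of the thread leans on dplyr, the same truncation can also be sketched with mutate(across()) in dplyr >= 1.0, which appends the _trunc copies in one step; pmin(x, mean(x)) is equivalent to the ifelse() used above:

```r
library(dplyr)

df1 <- data.frame(var1 = 1:5, var2 = 11:15)

# Cap every column at its mean, writing results to new *_trunc columns
res <- df1 %>%
  mutate(across(everything(), ~ pmin(.x, mean(.x)), .names = "{.col}_trunc"))

names(res)
# "var1" "var2" "var1_trunc" "var2_trunc"
```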