I would like to create a column, by group, with a certain week's value from another column.
In this example New_column is created with the Number from the 2nd week for each group.
Group Week Number New_column
A 1 19 8
A 2 8 8
A 3 21 8
A 4 5 8
B 1 4 12
B 2 12 12
B 3 18 12
B 4 15 12
C 1 9 4
C 2 4 4
C 3 10 4
C 4 2 4
I've used this method, which works, but it feels like a really messy way to do it:
library(dplyr)
df <- df %>%
group_by(Group) %>%
mutate(New_column = ifelse(Week == 2, Number, NA))
df <- df %>%
group_by(Group) %>%
mutate(New_column = sum(New_column, na.rm = T))
There are several possible solutions, depending on what exactly you need. With your sample data, however, all of them give the same result.
1) This picks the row where Week equals 2, so it works even if the data frame is not sorted:
df %>%
group_by(Group) %>%
mutate(New_column = Number[Week == 2])
However, it only ever looks for rows where Week == 2, even if the weeks in a group do not start from 1.
2) If df is already sorted by Week inside each group, you could use
df %>%
group_by(Group) %>%
mutate(New_column = Number[2])
This solution does not take the Number from the row where Week == 2, but rather the Number from the second row within each group, regardless of its actual Week value.
3) If df is not sorted by week, you could do it with
df %>%
group_by(Group) %>%
arrange(Week, .by_group = TRUE) %>%
mutate(New_column = Number[2])
This uses the same rationale as solution 2).
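To compare the approaches, here is a minimal sketch that rebuilds the sample data from the question, reverses the row order so it is no longer sorted by Week, and runs solutions 1) and 3). The names `shuffled`, `by_value` and `by_position` are made up for this illustration.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  Group  = rep(c("A", "B", "C"), each = 4),
  Week   = rep(1:4, times = 3),
  Number = c(19, 8, 21, 5, 4, 12, 18, 15, 9, 4, 10, 2),
  stringsAsFactors = FALSE
)

# Reverse the rows so the data is no longer sorted by Week
shuffled <- df[nrow(df):1, ]

# Solution 1: look up the row where Week == 2
by_value <- shuffled %>%
  group_by(Group) %>%
  mutate(New_column = Number[Week == 2]) %>%
  ungroup()

# Solution 3: sort within each group, then take the second row
by_position <- shuffled %>%
  group_by(Group) %>%
  arrange(Week, .by_group = TRUE) %>%
  mutate(New_column = Number[2]) %>%
  ungroup()

distinct(by_value, Group, New_column)
distinct(by_position, Group, New_column)
```

Both calls should report 8 for A, 12 for B and 4 for C, because every group in this data has exactly one row with Week == 2 and that row is also the second week of the group.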
I'm trying to get the first n rows of an object, but with a different n per group, according to values I have in another object.
I have the following reproducible example:
a<- tibble(id = c(1,2,3,4,5,6,7,8,9,10),
group = c(1,1,1,1,1,2,2,2,2,2))
b<- tibble(group=c(1,2),
n = c(3,4))
where what I want is to get the first 3 rows of a when the group is 1, and the first 4 rows of a when the group is 2.
I've tried doing this:
cob<- a %>% group_by(group) %>% arrange(id, .by_group = TRUE) %>%
group_map(~head(.x, b$n))
But I just get the first 3 rows in both groups, not a different number of rows for each group.
We can do a join and then filter
library(dplyr)
a %>%
left_join(b, by = "group") %>%
group_by(group) %>%
filter(row_number() <= first(n)) %>%
ungroup() %>%
select(-n)
or another option is
a %>%
group_by(group) %>%
slice(seq_len(b$n[match(cur_group(), b$group)]))
Here is a data.table solution.
library(data.table)
setDT(a) # only needed because you started with a tibble
setDT(b) # same
a[b, on=.(group)][, .(id=id[1:n]), by=.(group, n)]
group n id
1: 1 3 1
2: 1 3 2
3: 1 3 3
4: 2 4 6
5: 2 4 7
6: 2 4 8
7: 2 4 9
The first clause: a[b, on=.(group)] joins b to a creating a data.table with columns group, id, and n. The second clause: [, .(id=id[1:n]), by=.(group, n)] groups by group, taking the first n elements of id in each group.
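For comparison, the same idea can be sketched in base R without a join: split `a` by group, look up each group's `n`, and apply `head()` pairwise. This is only an illustration under the assumption that every group in `a` also appears in `b`; the names `pieces`, `n_by_group` and `result` are made up here, and plain data frames are used instead of tibbles.

```r
# Example data from the question, as plain data frames
a <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2))
b <- data.frame(group = c(1, 2),
                n = c(3, 4))

# One data frame per group, named by the group value ("1", "2")
pieces <- split(a, a$group)

# Named lookup: group value -> number of rows to keep
n_by_group <- setNames(b$n, b$group)

# Pair each piece with its own n and keep the first n rows
result <- do.call(rbind, Map(head, pieces, n_by_group[names(pieces)]))
result
```

The key difference from the attempt in the question is that each group receives a single number for `n` rather than the whole vector `b$n`.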
I have a dataset with multiple rows of the same individual
set.seed(420)
df <- data.frame(ind = c(rep("A", 3), rep("B", 5), rep("C", 4)),
                 value = 1:12,
                 location = sample(c("first", "second", "third"), 12, replace = TRUE))
df
ind value location
1 A 1 first
2 A 2 first
3 A 3 second
4 B 4 second
5 B 5 first
6 B 6 first
7 B 7 first
8 B 8 first
9 C 9 first
10 C 10 first
11 C 11 first
12 C 12 third
I would like to find the location for each individual for which the value column is highest.
So the final dataset would look like:
ind value location
A 3 second
B 8 first
C 12 third
Is this possible to do with group_by and summarize or mutate in dplyr?
There are a few ways to do it using tidyverse.
library(tidyverse)
df %>% group_by(ind) %>% slice_max(value)
Or
df %>% group_by(ind) %>% filter(value == max(value))
Addressing your question
mutate()
With mutate(), extra steps are needed to reduce the data to one row per group (i.e. make the rows unique), since mutate() does not collapse the groups.
1) First group_by as usual.
2) Arrange by value, since we'll use the position to extract the location associated with the maximum value.
3) Set the value column to max(value).
4) Set the location column to last(location). Since the rows are sorted by value, last(location) is the location of max(value).
5) Keep only the distinct rows.
df %>% group_by(ind) %>%
arrange(value) %>%
mutate(value = max(value),
location = last(location)) %>%
distinct(value, .keep_all = T)
summarise()
The same logic can be applied to summarise(), but without the distinct() step, since summarise() already collapses the data to one row per group. We still need arrange(value) so that last(location) picks the row with the highest value.
df %>% group_by(ind) %>%
arrange(value) %>%
summarize(value = max(value), location = last(location))
Output
# A tibble: 3 x 3
# Groups: ind [3]
ind value location
<chr> <int> <chr>
1 A 3 second
2 B 8 first
3 C 12 third
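If sorting feels indirect, one alternative is to index location by the position of the maximum inside summarise(). The sketch below is an illustration rather than part of the original answer: it types in the data shown in the question by hand (instead of relying on sample()), and the name `top` is made up.

```r
library(dplyr)

# Data as printed in the question, entered by hand
df <- data.frame(
  ind      = c(rep("A", 3), rep("B", 5), rep("C", 4)),
  value    = 1:12,
  location = c("first", "first", "second", "second", "first", "first",
               "first", "first", "first", "first", "first", "third"),
  stringsAsFactors = FALSE
)

top <- df %>%
  group_by(ind) %>%
  summarise(
    # Compute location first: once value is overwritten below, it is a
    # single number and which.max(value) would always return 1
    location = location[which.max(value)],
    value    = max(value)
  )
top
```

Note that which.max() returns only the first maximum, so with ties this keeps one row per group, whereas filter(value == max(value)) keeps all tied rows.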
I am looking to filter and retrieve all rows from all groups where a specific row meets a condition, in my example when the value is more than 3 on the highest day per group. This is obviously simplified, but it breaks the problem down to the essentials.
# Dummy data
id = rep(letters[1:3], each = 3)
day = rep(1:3, 3)
value = c(2,3,4,2,3,3,1,2,4)
my_data = data.frame(id, day, value, stringsAsFactors = FALSE)
My approach works, but it seems somewhat clumsy:
require(dplyr)
foo <- my_data %>%
group_by(id) %>%
slice(which.max(day)) %>% # gets the highest day
filter(value>3) # filters the rows with value >3
## semi_join with the original data frame gives the required result:
semi_join(my_data, foo, by = 'id')
id day value
1 a 1 2
2 a 2 3
3 a 3 4
4 c 1 1
5 c 2 2
6 c 3 4
Is there a more succinct way to do this?
my_data %>% group_by(id) %>% filter(value[which.max(day)] > 3)
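The one-liner works because, inside a grouped filter(), an expression that evaluates to a single TRUE or FALSE is recycled across the whole group, so each group is kept or dropped as a unit. A minimal sketch on the dummy data from the question (the name `kept` is made up for this illustration):

```r
library(dplyr)

# Dummy data from the question
my_data <- data.frame(
  id    = rep(letters[1:3], each = 3),
  day   = rep(1:3, 3),
  value = c(2, 3, 4, 2, 3, 3, 1, 2, 4),
  stringsAsFactors = FALSE
)

kept <- my_data %>%
  group_by(id) %>%
  # value[which.max(day)] is the value on the group's highest day;
  # comparing it with 3 yields one logical per group
  filter(value[which.max(day)] > 3) %>%
  ungroup()
kept
```

Groups a and c have a value of 4 on day 3 and are kept in full; group b has a value of 3 on day 3 and is dropped, which matches the semi_join result shown in the question.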
I've tried searching a number of posts on SO but I'm not sure what I'm doing wrong here, and I imagine the solution is quite simple. I'm trying to group a dataframe by one variable and figure the mean of several variables within that group.
Here is what I am trying:
head(airquality)
target_vars = c("Ozone","Temp","Solar.R")
airquality %>% group_by(Month) %>% select(target_vars) %>% summarise(rowSums(.))
But I get an error that my lengths don't match. I've tried variations using mutate to create the column, or summarise_all, but neither of these seems to work. I need the row sums within each group, and then to compute the mean within each group (yes, it's nonsensical here).
Also, I want to use select because I'm trying to do this over just certain variables.
I'm sure this could be a duplicate, but I can't find the right one.
EDIT FOR CLARITY
Sorry, my original question was not clear. Imagine the grouping variable is the calendar month, and we have v1, v2, and v3. I'd like to know, within month, what was the average of the sums of v1, v2, and v3. So if we have 12 months, the result would be a 12x1 dataframe. Here is an example if we just had 1 month:
Month v1 v2 v3 Sum
1 1 1 0 2
1 1 1 1 3
1 1 0 0 1
Then the result would be:
Month Average
1 6/3
You can try:
library(tidyverse)
airquality %>%
select(Month, all_of(target_vars)) %>%
gather(key, value, -Month) %>%
group_by(Month) %>%
summarise(n=length(unique(key)),
Sum=sum(value, na.rm = T)) %>%
mutate(Average=Sum/n)
# A tibble: 5 x 4
Month n Sum Average
<int> <int> <int> <dbl>
1 5 3 7541 2513.667
2 6 3 8343 2781.000
3 7 3 10849 3616.333
4 8 3 8974 2991.333
5 9 3 8242 2747.333
The idea is to convert the data from wide to long using tidyr::gather(), then group by Month and calculate the sum and the average.
This seems to deliver what you want, using base R. sapply keeps the months separate by name. Note that sum applied to each data frame adds up all of its values, so the column sums are not kept separate. (Correction #2: this now uses only target_vars.)
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)
5 6 7 8 9
7541 8343 10849 8974 8242
If you wanted the result per variable, you would divide by the number of variables:
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)/
(length(target_vars))
5 6 7 8 9
2513.667 2781.000 3616.333 2991.333 2747.333
Perhaps this is what you're looking for
library(dplyr)
library(purrr)
library(tidyr) # forgot this in original post
airquality %>%
group_by(Month) %>%
nest(Ozone, Temp, Solar.R, .key=newcol) %>%
mutate(newcol = map_dbl(newcol, ~mean(rowSums(.x, na.rm=TRUE))))
# A tibble: 5 x 2
# Month newcol
# <int> <dbl>
# 1 5 243.2581
# 2 6 278.1000
# 3 7 349.9677
# 4 8 289.4839
# 5 9 274.7333
I've never encountered a situation where all the answers disagreed. Here's some validation (at least I think) for the 5th month
airquality %>%
filter(Month == 5) %>%
select(Ozone, Temp, Solar.R) %>%
mutate(newcol = rowSums(., na.rm=TRUE)) %>%
summarise(sum5 = sum(newcol), mean5 = mean(newcol))
# sum5 mean5
# 1 7541 243.2581
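The check above can be extended to every month in one pipeline: compute the row sums first, then average them within each group. This is a sketch, assuming dplyr 1.0 or later for across() and all_of(); the names `row_sum` and `monthly` are made up for the illustration.

```r
library(dplyr)

target_vars <- c("Ozone", "Temp", "Solar.R")

monthly <- airquality %>%
  # Row-wise sum over the chosen columns, treating NA as missing
  mutate(row_sum = rowSums(across(all_of(target_vars)), na.rm = TRUE)) %>%
  group_by(Month) %>%
  # Mean of the row sums within each month
  summarise(Average = mean(row_sum))
monthly
```

Computing the row sums before grouping sidesteps the length mismatch from the question, because rowSums() returns one value per row while summarise() expects one value per group.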
I am trying to add a sum column to a large file that has dates in it. I want to sum every month and add a column to the right of the last column of that month.
Below is a reproducible example:
df <- data.frame("6Jun06" = c(4, 5, 9),
"13Jun06" = c(4, 5, 9),
"20Jun06" = c(4, 5, 9),
"03Jul16" = c(1, 2, 3),
"09Jul16" = c(1, 2, 3),
"01Aug16" = c(1, 2, 5))
So in this case I would need to have three columns (after Jun, Jul, and Aug).
X6.Jun.06 X13.Jun.06 X20.Jun.06 Jun.Sum X03.Jul.16 X09.Jul.16 Jul.Sum X01.Aug.16 Aug.Sum
1 4 4 4 Sum 1 1 Sum 1 Sum
2 5 5 5 Sum 2 2 Sum 2 Sum
3 9 9 9 Sum 3 3 Sum 5 Sum
I am not sure how to sum every month individually. I know there are built-in sum functions, but the ones I tried do not fit my problem because they just do a general sum.
If you are new to R, a good start is taking a look at the dplyr ecosystem (as well as other packages by Hadley Wickham).
library(dplyr)
library(tidyr)
df %>%
mutate(id = 1:nrow(df)) %>%
gather(date, value, -id) %>%
mutate(Month = month.abb[apply(sapply(month.abb, function(mon) {grepl(mon, .$date)}), 1, which)]) %>%
group_by(id, Month) %>%
summarize(sum = sum(value)) %>%
spread(Month, sum) %>%
left_join(mutate(df, id = 1:nrow(df)), .) %>%
select(-id)
You're making life slightly hard for yourself by using variable names that start with a numeral, as R will insert an X in front of them. However, here's one way you could get the sums you want.
#1. Use the package `reshape2`:
library(reshape2)
dfm <- melt(df)
#2. Get rid of the X in the dates, then convert to a date using the package `lubridate` and extract the month:
library(lubridate)
dfm$Date <- dmy(substring(dfm$variable, 2))
dfm$Month <- month(dfm$Date)
#3. Then calculate the sum for each month using the `dplyr` package:
library(dplyr)
dfm %>% group_by(Month) %>% summarise(sum(value))
Here is one way which adds the new columns at the end of the data frame,
cbind(df, sapply(unique(gsub('\\d+', '', names(df))), function(i)
rowSums(df[grepl(i, sub('\\d+', '', names(df)))])))
# 6Jun06 13Jun06 20Jun06 03Jul16 09Jul16 01Aug16 Jun Jul Aug
#1 4 4 4 1 1 1 12 2 1
#2 5 5 5 2 2 2 15 4 2
#3 9 9 9 3 3 5 27 6 5
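If each sum column should sit directly to the right of its month's columns, as in the layout requested in the question, one possible base R sketch is to build the result month by month and bind the pieces together. It assumes the data frame is created with check.names = FALSE so the original column names are kept, and that the month is whatever remains after stripping digits from a name; `months`, `pieces` and `result` are names made up for this illustration.

```r
df <- data.frame("6Jun06"  = c(4, 5, 9),
                 "13Jun06" = c(4, 5, 9),
                 "20Jun06" = c(4, 5, 9),
                 "03Jul16" = c(1, 2, 3),
                 "09Jul16" = c(1, 2, 3),
                 "01Aug16" = c(1, 2, 5),
                 check.names = FALSE)

# Month label for every column: "Jun" "Jun" "Jun" "Jul" "Jul" "Aug"
months <- gsub("\\d+", "", names(df))

# For each month (in order of appearance), take its columns and
# append a row-wise sum column
pieces <- lapply(unique(months), function(m) {
  block <- df[months == m]
  block[[paste0(m, ".Sum")]] <- rowSums(block)
  block
})

result <- do.call(cbind, pieces)
result
```

Because unique() preserves the order in which the months first appear, the columns stay in their original order with Jun.Sum, Jul.Sum and Aug.Sum interleaved after each month.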