Sum column in a DataFrame in R

I am trying to add a sum column to a large file that has dates in it. I want to sum every month and add a column to the right of the last column of that month.
Below is a reproducible example:
df <- data.frame("6Jun06" = c(4, 5, 9),
                 "13Jun06" = c(4, 5, 9),
                 "20Jun06" = c(4, 5, 9),
                 "03Jul16" = c(1, 2, 3),
                 "09Jul16" = c(1, 2, 3),
                 "01Aug16" = c(1, 2, 5))
So in this case I would need to have three columns (after Jun, Jul, and Aug).
  X6.Jun.06 X13.Jun.06 X20.Jun.06 Jun.Sum X03.Jul.16 X09.Jul.16 Jul.Sum X01.Aug.16 Aug.Sum
1         4          4          4     Sum          1          1     Sum          1     Sum
2         5          5          5     Sum          2          2     Sum          2     Sum
3         9          9          9     Sum          3          3     Sum          5     Sum
I am not sure how to sum each month individually. I know there are built-in sum functions, but the ones I tried do not fit my problem because they just compute one overall sum.

If you are new to R, a good start is taking a look at the dplyr ecosystem (as well as other packages by Hadley Wickham).
library(dplyr)
library(tidyr)
df %>%
  mutate(id = 1:nrow(df)) %>%              # add a row id so we can reshape and join back
  gather(date, value, -id) %>%             # wide -> long
  # look up which month abbreviation each column name contains
  mutate(Month = month.abb[apply(sapply(month.abb,
                                        function(mon) grepl(mon, .$date)),
                                 1, which)]) %>%
  group_by(id, Month) %>%
  summarize(sum = sum(value)) %>%          # one sum per row and month
  spread(Month, sum) %>%                   # back to wide: one sum column per month
  left_join(mutate(df, id = 1:nrow(df)), .) %>%
  select(-id)

You're making life slightly hard for yourself by using variable names that start with a numeral, as R will insert an X in front of them. However, here's one way you could get the sums you want.
#1. Use the package `reshape2`:
library(reshape2)
dfm <- melt(df)
#2. Get rid of the X in the dates, then convert to a date using the package `lubridate` and extract the month:
library(lubridate)
dfm$Date <- dmy(substring(dfm$variable, 2))
dfm$Month <- month(dfm$Date)
#3. Then calculate the sum for each month using the `dplyr` package:
library(dplyr)
dfm %>% group_by(Month) %>% summarise(sum = sum(value))
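If you need the per-row, per-month sums the question asks for (rather than one total per month), here is a sketch along the same lines, carrying a row id through the reshape:
# keep a row id so the sums stay per row (reshape2, lubridate, dplyr as above)
dfm <- melt(cbind(id = seq_len(nrow(df)), df), id.vars = "id")
dfm$Month <- month(dmy(substring(dfm$variable, 2)), label = TRUE)
dfm %>% group_by(id, Month) %>% summarise(sum = sum(value))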

Here is one way which adds the new columns at the end of the data frame. (The names below match the printed output only if the data frame was built with check.names = FALSE; with R's default X-prefixed names the sum columns come out as XJun, XJul, and XAug instead.)
cbind(df, sapply(unique(gsub('\\d+', '', names(df))), function(i)
  rowSums(df[grepl(i, sub('\\d+', '', names(df)))])))
#  6Jun06 13Jun06 20Jun06 03Jul16 09Jul16 01Aug16 Jun Jul Aug
#1      4       4       4       1       1       1  12   2   1
#2      5       5       5       2       2       2  15   4   2
#3      9       9       9       3       3       5  27   6   5
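If you also want each sum column placed immediately after its month (as in the expected output), here is a sketch that reorders the result by month pattern; grep() matches the month abbreviation in both the date columns and the sum columns, so it works with either naming scheme:
res <- cbind(df, sapply(unique(gsub('\\d+', '', names(df))), function(i)
  rowSums(df[grepl(i, sub('\\d+', '', names(df)))])))
ord <- unlist(lapply(c("Jun", "Jul", "Aug"),
                     function(m) grep(m, names(res), value = TRUE)))
res[ord]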

Interpolation of values from list

I have a dataframe containing the results of a competition. In this example competitors b and c have tied for second place. The actual dataframe is very large and could contain multiple ties.
df <- data.frame(name = letters[1:4],
                 place = c(1, 2, 2, 4))
I also have point values for the respective places, where first place gets 4 points, 2nd gets 3, 3rd gets 1 and 4th gets 0.
points <- c(4, 3, 1, 0)
names(points) <- 1:4
I can match points to place to get each competitor's score
df %>%
mutate(score = points[place])
  name place score
1    a     1     4
2    b     2     3
3    c     2     3
4    d     4     0
What I would like to do though is award points to b and c that are the mean of the point values for 2nd and 3rd, such that each receives 2 points like this:
  name place score
1    a     1     4
2    b     2     2
3    c     2     2
4    d     4     0
How can I accomplish this programmatically?
A solution using nested data frames and purrr.
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(name = letters[1:4],
                 place = c(1, 2, 2, 4))
points <- c(4, 3, 1, 0)
names(points) <- 1:4
# a function to help expand the dataframe based on the number of ties
expand_all <- function(x, n) {
  x:(x + n - 1)
}
df %>%
  group_by(place) %>%
  tally() %>%                                       # n = number of competitors tied at each place
  mutate(new_place = purrr::map2(place, n, expand_all)) %>%
  unnest(new_place) %>%                             # one row per position the tie occupies
  mutate(score = points[new_place]) %>%
  group_by(place) %>%
  summarize(score = mean(score)) %>%                # average points over the tied positions
  inner_join(df)
Robert Wilson's answer gave me an idea. Rather than mapping over nested dataframes, the rank() function from base R gets to the same result:
df %>%
  mutate(new_place = rank(place, ties.method = "first")) %>%
  mutate(score = points[new_place]) %>%
  group_by(place) %>%
  summarize(score = mean(score)) %>%
  inner_join(df)
  place score name
  <dbl> <dbl> <chr>
1     1     4 a
2     2     2 b
3     2     2 c
4     4     0 d
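The trick is that ties.method = "first" breaks ties in order of appearance, so tied competitors are spread over the consecutive positions their tie occupies:
rank(c(1, 2, 2, 4), ties.method = "first")
# [1] 1 2 3 4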
This can be accomplished in a few lines with an ifelse() statement inside of a mutate() (a caveat and a generalized sketch follow the output):
df %>%
  group_by(place) %>%
  mutate(n_ties = n()) %>%
  ungroup() %>%
  mutate(score = (points[place] + ifelse(n_ties > 1, 1, 0)) / n_ties)
# A tibble: 4 x 4
  name  place n_ties score
  <chr> <dbl>  <int> <dbl>
1 a         1      1     4
2 b         2      2     2
3 c         2      2     2
4 d         4      1     0
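Caveat: the +1 shortcut reproduces the mean here only because the points for the place just below the tie happen to equal 1. A generalized sketch (not from the original answer) that averages the actual point values over the run of places each tie occupies:
df %>%
  group_by(place) %>%
  # a k-way tie at place p scores mean(points[p], ..., points[p + k - 1])
  mutate(score = mean(points[place[1]:(place[1] + n() - 1)])) %>%
  ungroup()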

Create column with a certain week value by group

I would like to create a column, by group, with a certain week's value from another column.
In this example New_column is created with the Number from the 2nd week for each group.
Group Week Number New_column
    A    1     19          8
    A    2      8          8
    A    3     21          8
    A    4      5          8
    B    1      4         12
    B    2     12         12
    B    3     18         12
    B    4     15         12
    C    1      9          4
    C    2      4          4
    C    3     10          4
    C    4      2          4
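For reference, the example data can be reconstructed like this (values copied from the table above):
df <- data.frame(Group  = rep(c("A", "B", "C"), each = 4),
                 Week   = rep(1:4, times = 3),
                 Number = c(19, 8, 21, 5, 4, 12, 18, 15, 9, 4, 10, 2))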
I've used this method, which works, but it feels like a really messy way to do it:
library(dplyr)
df <- df %>%
  group_by(Group) %>%
  mutate(New_column = ifelse(Week == 2, Number, NA))
df <- df %>%
  group_by(Group) %>%
  mutate(New_column = sum(New_column, na.rm = T))
There are several possible solutions, depending on what you need specifically. With your specific sample data, however, all of them give the same result.
1) This identifies the target week by its value in column Week, and works even if the dataframe is not sorted:
df %>%
  group_by(Group) %>%
  mutate(New_column = Number[Week == 2])
However, this solution looks only for rows where Week == 2; if the weeks do not start from 1 (or a group has no Week == 2 row at all), Number[Week == 2] comes back empty and the mutate() fails.
2) If df is already sorted by Week inside each group, you could use
df %>%
  group_by(Group) %>%
  mutate(New_column = Number[2])
This solution does not take the Number from the row where Week == 2, but rather from the second row within each group, regardless of its actual Week value.
3) If df is not sorted by week, you could first sort within each group:
df %>%
  group_by(Group) %>%
  arrange(Week, .by_group = TRUE) %>%
  mutate(New_column = Number[2])
This uses the same rationale as solution 2).
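For comparison, here is a base-R sketch of solution 1) (not from the original answers), matching each row's Group against the rows where Week == 2:
# for every row, look up the Number of its group's Week == 2 row
df$New_column <- with(df, Number[Week == 2][match(Group, Group[Week == 2])])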

how to count repetitions of first occurring value with dplyr

I have a dataframe with groups that essentially looks like this
DF <- data.frame(state = c(rep("A", 3), rep("B",2), rep("A",2)))
DF
  state
1     A
2     A
3     A
4     B
5     B
6     A
7     A
My question is how to count the number of consecutive rows where the first value is repeated in its first "block". So for DF above, the result should be 3. The first value can appear any number of times, with other values in between, or it may be the only value appearing.
The following naive attempt fails in general, as it counts all occurrences of the first value.
DF %>%
  mutate(is_first = as.integer(state == first(state))) %>%
  summarize(count = sum(is_first))
The result in this case is 5. So, hints on a (preferably) dplyr solution to this would be appreciated.
You can try:
rle(as.character(DF$state))$lengths[1]
[1] 3
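For context, rle() computes the run-length encoding of a vector, i.e. the lengths and values of its consecutive runs, and the first length is exactly the size of the first block:
rle(as.character(DF$state))
# Run Length Encoding
#   lengths: int [1:3] 3 2 2
#   values : chr [1:3] "A" "B" "A"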
In your dplyr chain that would just be:
DF %>% summarize(count_first = rle(as.character(state))$lengths[1])
#   count_first
# 1           3
Or to be overzealous with piping, using dplyr and magrittr:
library(dplyr)
library(magrittr)
DF %>% summarize(count_first = state %>%
                   as.character %>%
                   rle %$%
                   lengths %>%
                   first)
#   count_first
# 1           3
This also works for grouped data:
DF <- data.frame(group = c(rep(1, 4), rep(2, 3)),
                 state = c(rep("A", 3), rep("B", 2), rep("A", 2)))
#   group state
# 1     1     A
# 2     1     A
# 3     1     A
# 4     1     B
# 5     2     B
# 6     2     A
# 7     2     A
DF %>% group_by(group) %>% summarize(count_first = rle(as.character(state))$lengths[1])
# # A tibble: 2 x 2
#   group count_first
#   <dbl>       <int>
# 1     1           3
# 2     2           1
No need for dplyr here, but you can modify this example to use it with dplyr. The key is the function rle:
state <- c(rep("A", 3), rep("B", 2), rep("A", 2))
x <- rle(state)
DF <- data.frame(len = x$lengths, state = x$values)
DF
# get the longest run of consecutive "A"
max(DF[DF$state == "A", ]$len)
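Note the last line finds the longest run of "A"; the first block's length, which is what the question asks for, is simply the first element of the run table:
DF$len[1]
# [1] 3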

r: Summarise for rowSums after group_by

I've tried searching a number of posts on SO, but I'm not sure what I'm doing wrong here, and I imagine the solution is quite simple. I'm trying to group a dataframe by one variable and compute the mean of several variables within that group.
Here is what I am trying:
head(airquality)
target_vars = c("Ozone","Temp","Solar.R")
airquality %>% group_by(Month) %>% select(target_vars) %>% summarise(rowSums(.))
But I get the error that my lengths don't match. I've tried variations using mutate to create the column, or summarise_all, but neither of these seems to work. I need the row sums within group, and then to compute the mean within group (yes, it's nonsensical here).
Also, I want to use select because I'm trying to do this over just certain variables.
I'm sure this could be a duplicate, but I can't find the right one.
EDIT FOR CLARITY
Sorry, my original question was not clear. Imagine the grouping variable is the calendar month, and we have v1, v2, and v3. I'd like to know, within month, what was the average of the sums of v1, v2, and v3. So if we have 12 months, the result would be a 12x1 dataframe. Here is an example if we just had 1 month:
Month v1 v2 v3 Sum
    1  1  1  0   2
    1  1  1  1   3
    1  1  1  1   3
Then the result would be:
Month Average
    1     8/3
You can try:
library(tidyverse)
airquality %>%
  select(Month, target_vars) %>%
  gather(key, value, -Month) %>%
  group_by(Month) %>%
  summarise(n = length(unique(key)),
            Sum = sum(value, na.rm = T)) %>%
  mutate(Average = Sum / n)
# A tibble: 5 x 4
  Month     n   Sum  Average
  <int> <int> <int>    <dbl>
1     5     3  7541 2513.667
2     6     3  8343 2781.000
3     7     3 10849 3616.333
4     8     3  8974 2991.333
5     9     3  8242 2747.333
The idea is to convert the data from wide to long using tidyr::gather(), then group by Month and calculate the sum and the average.
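Note that this Average is the monthly total divided by the number of variables. If you instead want the mean of the per-row sums (what the nest() answer below computes), here is a sketch that divides by the number of rows per month:
airquality %>%
  group_by(Month) %>%
  summarise(Average = sum(Ozone, Temp, Solar.R, na.rm = TRUE) / n())
# Month 5 gives 7541 / 31 = 243.2581, matching the validation at the end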
This seems to deliver what you want, in regular R. The sapply() over split() keeps the months separated by name; sum() applied to each sub-dataframe does not keep the column sums separate, it returns one total per month (correction # 2: used only target_vars):
sapply(split(airquality[target_vars], airquality$Month), sum, na.rm = TRUE)
    5     6     7     8     9
 7541  8343 10849  8974  8242
If you wanted the per-variable averages, you would divide by the number of variables:
sapply(split(airquality[target_vars], airquality$Month), sum, na.rm = TRUE) /
  length(target_vars)
       5        6        7        8        9
2513.667 2781.000 3616.333 2991.333 2747.333
Perhaps this is what you're looking for
library(dplyr)
library(purrr)
library(tidyr) # forgot this in original post
airquality %>%
  group_by(Month) %>%
  nest(Ozone, Temp, Solar.R, .key = newcol) %>%
  mutate(newcol = map_dbl(newcol, ~mean(rowSums(.x, na.rm = TRUE))))
# A tibble: 5 x 2
#   Month   newcol
#   <int>    <dbl>
# 1     5 243.2581
# 2     6 278.1000
# 3     7 349.9677
# 4     8 289.4839
# 5     9 274.7333
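Side note: nest(..., .key =) is the pre-1.0 tidyr interface. On tidyr >= 1.0, the equivalent call (a sketch, with the same libraries loaded) would be:
airquality %>%
  group_by(Month) %>%
  nest(newcol = c(Ozone, Temp, Solar.R)) %>%
  mutate(newcol = map_dbl(newcol, ~mean(rowSums(.x, na.rm = TRUE))))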
I've never encountered a situation where all the answers disagreed. Here's some validation (at least I think) for the 5th month
airquality %>%
  filter(Month == 5) %>%
  select(Ozone, Temp, Solar.R) %>%
  mutate(newcol = rowSums(., na.rm = TRUE)) %>%
  summarise(sum5 = sum(newcol), mean5 = mean(newcol))
#   sum5    mean5
# 1 7541 243.2581

Remove duplicated rows using dplyr

I have a data.frame like this -
set.seed(123)
df = data.frame(x = sample(0:1, 10, replace = T),
                y = sample(0:1, 10, replace = T),
                z = 1:10)
> df
   x y  z
1  0 1  1
2  1 0  2
3  0 1  3
4  1 1  4
5  1 0  5
6  0 1  6
7  1 0  7
8  1 0  8
9  1 0  9
10 0 1 10
I would like to remove duplicate rows based on the first two columns. Expected output -
df[!duplicated(df[, 1:2]), ]
  x y z
1 0 1 1
2 1 0 2
4 1 1 4
I am specifically looking for a solution using dplyr package.
Here is a solution using dplyr >= 0.5.
library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = T),
  y = sample(0:1, 10, replace = T),
  z = 1:10
)
> df %>% distinct(x, y, .keep_all = TRUE)
  x y z
1 0 1 1
2 1 0 2
3 1 1 4
Note: dplyr now contains the distinct function for this purpose.
Original answer below:
library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = T),
  y = sample(0:1, 10, replace = T),
  z = 1:10
)
One approach would be to group, and then only keep the first row:
df %>% group_by(x, y) %>% filter(row_number(z) == 1)
## Source: local data frame [3 x 3]
## Groups: x, y
##
## x y z
## 1 0 1 1
## 2 1 0 2
## 3 1 1 4
(In dplyr 0.2 you won't need the dummy z variable and will just be
able to write row_number() == 1)
I've also been thinking about adding a slice() function that would
work like:
df %>% group_by(x, y) %>% slice(from = 1, to = 1)
Or maybe a variation of unique() that would let you select which
variables to use:
df %>% unique(x, y)
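All three ideas have since landed in dplyr; with a current version each of the following works:
df %>% group_by(x, y) %>% filter(row_number() == 1)  # no dummy variable needed
df %>% group_by(x, y) %>% slice(1)                   # slice() exists now
df %>% distinct(x, y, .keep_all = TRUE)              # distinct() covers the unique() idea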
For completeness’ sake, the following also works:
df %>% group_by(x) %>% filter(!duplicated(y))
However, I prefer the solution using distinct, and I suspect it’s faster, too.
Most of the time, the best solution is using distinct() from dplyr, as has already been suggested.
However, here's another approach that uses the slice() function from dplyr.
# Generate fake data for the example
library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = T),
  y = sample(0:1, 10, replace = T),
  z = 1:10
)
# In each group of rows formed by combinations of x and y
# retain only the first row
df %>%
  group_by(x, y) %>%
  slice(1)
Difference from using the distinct() function
The advantage of this solution is that it makes it explicit which rows are retained from the original dataframe, and it can pair nicely with the arrange() function.
Let's say you had customer sales data and you wanted to retain one record per customer, and you want that record to be the one from their latest purchase. Then you could write:
customer_purchase_data %>%
  arrange(desc(Purchase_Date)) %>%
  group_by(Customer_ID) %>%
  slice(1)
When selecting columns in R for a reduced data-set you can often end up with duplicates.
These two lines give the same set of rows. Each outputs a unique data-set with only the two selected columns:
distinct(mtcars, cyl, hp)
summarise(group_by(mtcars, cyl, hp))
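A quick sketch to confirm they contain the same rows once both are sorted the same way (summarise() sorts by the grouping variables, while distinct() keeps first-appearance order):
library(dplyr)
a <- mtcars %>% distinct(cyl, hp) %>% arrange(cyl, hp)
b <- mtcars %>% group_by(cyl, hp) %>% summarise() %>% ungroup()
all.equal(as.data.frame(a), as.data.frame(b))
# [1] TRUE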
If you want to find the rows that are duplicated you can use find_duplicates from hablar:
library(dplyr)
library(hablar)
df <- tibble(a = c(1, 2, 2, 4),
             b = c(5, 2, 2, 8))
df %>% find_duplicates()
