How to sum a substring reference - r

I'm attempting to select the correct column of a data frame and sum it using ddply:
df2 <- ddply(df1,'col1', summarise, total = sum(substr(variable,1,3)))
It appears not to be working because you can't sum a character, but I am trying to pass the reference to the column, not sum the literal result of the substring. Is there a way to get around this?
Example Data & Desired output:
variable = "Aug 2017"
col1 Jun Jul Aug
1 A 1 2 3
2 A 1 2 3
3 A 1 2 3
4 A 1 2 3
5 A 1 2 3
6 B 2 3 4
7 B 2 3 4
8 B 2 3 4
9 C 3 4 5
10 C 3 4 5
Desired Output:
1 A 15
2 B 12
3 C 10

This works with dplyr instead of plyr.
# create data
df1 <- data.frame(
  col1 = c(rep('A', 5), rep('B', 3), rep('C', 2)),
  Jun = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  Jul = c(2, 2, 2, 2, 2, 3, 3, 3, 4, 4),
  Aug = c(3, 3, 3, 3, 3, 4, 4, 4, 5, 5))
variable = 'Aug 2017'
# load dplyr library
library(dplyr)
# summarize each column that matches some string
df1 %>%
  select(col1, matches(substr(variable, 1, 3))) %>%
  group_by(col1) %>%
  summarize_each(funs = 'sum')
# A tibble: 3 × 2
col1 Aug
<fctr> <dbl>
1 A 15
2 B 12
3 C 10
I also highly recommend reading about nonstandard and standard evaluation, here:
http://adv-r.had.co.nz/Computing-on-the-language.html
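In current dplyr (>= 1.0) there is also a direct way to look a column up by a string, without select()/matches(): the `.data` pronoun. A minimal sketch, reusing df1 and variable from above:

```r
library(dplyr)

# Rebuild the example data from the question
df1 <- data.frame(
  col1 = c(rep('A', 5), rep('B', 3), rep('C', 2)),
  Jun = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  Jul = c(2, 2, 2, 2, 2, 3, 3, 3, 4, 4),
  Aug = c(3, 3, 3, 3, 3, 4, 4, 4, 5, 5))
variable <- "Aug 2017"

# Extract the column name from the string, then index with .data[[ ]]
month <- substr(variable, 1, 3)
out <- df1 %>%
  group_by(col1) %>%
  summarise(total = sum(.data[[month]]))
```

`.data[[month]]` treats `month` as a plain character index into the current group's data, which sidesteps the nonstandard-evaluation problem the question ran into.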

Related

Select rows up to certain value in R

I have the following dataframe:
df1 <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
                  var1 = c(0, 2, 3, 4, 2, 5, 6, 10, 11, 0, 1, 2, 1, 5, 7, 10))
I want to select only the rows with values up to 5; once 5 is reached, move on to the next ID and again keep only the values up to 5 for that group, so that the final result looks like this:
ID var1
1 0
1 2
1 3
1 4
1 2
1 5
2 0
2 1
2 2
2 1
2 5
I would like to try something with dplyr as it is what I am most familiar with.
You could use which.max() to find the first occurrence of var1 >= 5, and then extract those rows whose row numbers are before it.
library(dplyr)
df1 %>%
  group_by(ID) %>%
  filter(row_number() <= which.max(var1 >= 5)) %>%
  ungroup()
or
df1 %>%
  group_by(ID) %>%
  slice(1:which.max(var1 >= 5)) %>%
  ungroup()
# # A tibble: 11 × 2
# ID var1
# <dbl> <dbl>
# 1 1 0
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 2
# 6 1 5
# 7 2 0
# 8 2 1
# 9 2 2
# 10 2 1
# 11 2 5
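Why which.max() finds the cutoff: applied to a logical vector, it returns the position of the first TRUE, because TRUE (1) beats FALSE (0) and which.max() returns the first maximum. A small sketch on ID 1's values:

```r
v <- c(0, 2, 3, 4, 2, 5, 6, 10, 11)  # var1 for ID 1

# which.max() on a logical vector gives the index of the first TRUE
cutoff <- which.max(v >= 5)

kept <- v[seq_len(cutoff)]  # the rows kept for this ID
```

Here the first value >= 5 sits at position 6, so everything up to and including it is kept.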

Calculate mean by group among observations without NA

Updated:
Hi! I have data like this.
structure(list(V1QB10 = c(1, 1, 1, 2, 1, 3, 3, 1, 4, 2),
               V1QB12A = c(2, 1, 2, 3, NA, 2, 2, 3, 2, 2),
               V1QB12B = c(NA, 2, 2, 2, 2, 1, 2, 2, 2, 2),
               V1QB12C = c(NA, 1, 2, 2, 2, 1, 2, 2, 2, 2),
               sum = c(NA, 4, 6, 7, NA, 4, 6, 7, 6, 6)),
          row.names = c(NA, 10L), class = "data.frame")
This is what the data looks like:
V1QB10 V1QB12A V1QB12B V1QB12C sum
1 1 2 NA NA NA
2 1 1 2 1 4
3 1 2 2 2 6
4 2 3 2 2 7
5 1 NA 2 2 NA
6 3 2 1 1 4
7 3 2 2 2 6
8 1 3 2 2 7
9 4 2 2 2 6
10 2 2 2 2 6
Variable "sum" is the sum of "V1QB12*".
Now I'm trying to calculate the mean of the "sum" by "V1QB10":
dt %>%
  group_by(V1QB10) %>%
  dplyr::summarise(n = n(), mean = mean(sum), sd = sd(sum)) %>%
  as.data.frame()
I expect the calculation to work like this: for V1QB10 == 1, n is 3 (removing the 2 observations with NA in "V1QB12*"), and the "sum" values add up to 4 + 6 + 7 = 17, so the mean should be 17/3, with the sd computed on those same 3 values.
But I keep getting a mean of 17/5. Replacing the code with n = n(V1QB12A) also didn't work.
Maybe I'm overthinking this problem. What should I do to fix it?
Thank you!
I'm not completely sure I follow what you're looking for, but the tidyr package has a nifty drop_na() function that will remove the NAs if you use it like this:
library(dplyr)
library(tidyr)
dt <- dt %>%
  drop_na() %>%
  dplyr::mutate(sum = rowSums(dplyr::select(., contains("V1QB12")), na.rm = TRUE))
dt %>%
  group_by(V1QB10) %>%
  dplyr::summarise(n = n(), mean = mean(sum), sd = sd(sum)) %>%
  as.data.frame()
Result:
V1QB10 n mean sd
1 1 3 5.666667 1.5275252
2 2 2 6.500000 0.7071068
3 3 2 5.000000 1.4142136
4 4 1 6.000000 NA
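If you'd rather not drop the whole rows, an alternative sketch (not from the answer above): keep the NAs in the data and exclude them only inside the summary with na.rm = TRUE, counting the non-missing values explicitly.

```r
library(dplyr)

# The data from the question, NAs included
dt <- structure(list(V1QB10 = c(1, 1, 1, 2, 1, 3, 3, 1, 4, 2),
                     V1QB12A = c(2, 1, 2, 3, NA, 2, 2, 3, 2, 2),
                     V1QB12B = c(NA, 2, 2, 2, 2, 1, 2, 2, 2, 2),
                     V1QB12C = c(NA, 1, 2, 2, 2, 1, 2, 2, 2, 2),
                     sum = c(NA, 4, 6, 7, NA, 4, 6, 7, 6, 6)),
                row.names = c(NA, 10L), class = "data.frame")

# n counts only non-missing sums; mean/sd skip NAs via na.rm
out <- dt %>%
  group_by(V1QB10) %>%
  summarise(n = sum(!is.na(sum)),
            mean = mean(sum, na.rm = TRUE),
            sd = sd(sum, na.rm = TRUE)) %>%
  as.data.frame()
```

This reproduces the 17/3 the asker expected for V1QB10 == 1 without modifying dt itself.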

Grouped non-dense rank without omitted values

I have the following data.frame:
df <- data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                 id = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1))
And I want to add a new column grp which, for each date, ranks the IDs. Ties should have the same value, but there should be no omitted values. That is, if there are two values which are equally minimum, they should both get rank 1, and the next lowest values should get rank 2.
The expected result would therefore look like this. Note that, as mentioned, the groups are for each date, so the operation must be grouped by date.
data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
           id = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1),
           grp = c(2, 2, 1, 2, 1, 2, 3, 1, 2, 2, 1, 1))
I'm sure there's a trivial way to do this but I haven't found it: none of the options for tie.method behave in this way (data.table::frank also doesn't help, since it only adds a dense rank).
I thought of doing a normal rank and then using data.table::rleid, but that doesn't work if there are duplicate values separated by other values during the same day.
I also thought of grouping by date and id and then using a group-ID, but the lowest values each day must start at rank 1, so that won't work either.
The only functional solution I've found is to create another table with the unique ids per day and then join that table to this one:
suppressPackageStartupMessages(library(dplyr))
df <- data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                 id = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1))
uniques <- df %>%
  group_by(date) %>%
  distinct(id) %>%
  mutate(grp = rank(id))
df <- df %>%
  left_join(uniques) %>%
  print()
#> Joining, by = c("date", "id")
#> date id grp
#> 1 1 4 2
#> 2 1 4 2
#> 3 1 2 1
#> 4 1 4 2
#> 5 2 1 1
#> 6 2 2 2
#> 7 2 3 3
#> 8 2 1 1
#> 9 3 2 2
#> 10 3 2 2
#> 11 3 1 1
#> 12 3 1 1
Created on 2020-05-08 by the reprex package (v0.3.0)
However, this seems quite inelegant and convoluted for what seems like a simple operation, so I'd rather see if other solutions are available.
Curious to see data.table solutions if available, but unfortunately the solution must be in dplyr.
We can use dense_rank
library(dplyr)
df %>%
  group_by(date) %>%
  mutate(grp = dense_rank(id))
# A tibble: 12 x 3
# Groups: date [3]
# date id grp
# <dbl> <dbl> <int>
# 1 1 4 2
# 2 1 4 2
# 3 1 2 1
# 4 1 4 2
# 5 2 1 1
# 6 2 2 2
# 7 2 3 3
# 8 2 1 1
# 9 3 2 2
#10 3 2 2
#11 3 1 1
#12 3 1 1
Or with frank
library(data.table)
setDT(df)[, grp := frank(id, ties.method = 'dense'), date]
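For intuition, within each group a dense rank is the same as matching each id against the sorted unique ids. A base-R sketch (the dense() helper is just illustrative, not part of either answer):

```r
df <- data.frame(date = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                 id = c(4, 4, 2, 4, 1, 2, 3, 1, 2, 2, 1, 1))

# match() against the sorted unique values reproduces a dense rank
dense <- function(x) match(x, sort(unique(x)))

# ave() applies the helper within each date group
df$grp <- ave(df$id, df$date, FUN = dense)
```

Ties share a rank and no ranks are skipped, which is exactly the behavior the question asks for.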

get the maximum and minimum values of a sub group of columns in a dataframe in ddply in R

I am trying to select the maximum and minimum values of a group of variables from within a data frame using the ddply function from the plyr package. However, it does not seem to work.
a1 = c(1, 2, 3, 4, 5)
a2 = c(6, 7, 8, 9, 10)
a3 = c(11, 12, 13, 14, 15)
f = letters[1:5]
d = data.frame(f, a1, a2, a3)
t = ddply(d, .(f), summarize,
          minima = apply(f[, c(1:3)], 1, min),
          maxima = apply(f[, c(1:3)], 1, max))
Thanks!
This dplyr approach produces mins and maxes. You may need to reshape the resulting data frame, depending on what you are using it for.
library(dplyr)
# Create dataframe
a1 = c(1, 2, 3, 4, 5)
a2 = c(6, 7, 8, 9, 10)
a3 = c(11, 12, 13, 14, 15)
f = letters[1:5]
d = data.frame(f, a1, a2, a3)
# Get min and max values for a1, a2, a3
d %>%
  group_by(f) %>%
  summarise_at(vars(a1, a2, a3), funs(min = min(.), max = max(.)))
#> # A tibble: 5 × 7
#> f a1_min a2_min a3_min a1_max a2_max a3_max
#> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 6 11 1 6 11
#> 2 b 2 7 12 2 7 12
#> 3 c 3 8 13 3 8 13
#> 4 d 4 9 14 4 9 14
#> 5 e 5 10 15 5 10 15
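funs() and summarise_at() still run but are superseded in dplyr >= 1.0; the same table can be produced with across(). A sketch, assuming the data frame d from above:

```r
library(dplyr)

d <- data.frame(f = letters[1:5],
                a1 = c(1, 2, 3, 4, 5),
                a2 = c(6, 7, 8, 9, 10),
                a3 = c(11, 12, 13, 14, 15))

# across() applies a named list of functions to each selected column
out <- d %>%
  group_by(f) %>%
  summarise(across(a1:a3, list(min = min, max = max)))
```

across() names the new columns "{column}_{function}", so this yields a1_min, a1_max, a2_min, and so on (interleaved per column, unlike the summarise_at output above).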

dplyr how to lag by group

I have a data frame of orders and receivables with lead times.
Can I use dplyr to fill in the receive column according to the group's lead time?
df <- data.frame(team = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b"),
                 order = c(2, 4, 3, 5, 6, 7, 8, 5, 4, 5),
                 lead_time = c(3, 3, 3, 3, 3, 2, 2, 2, 2, 2))
>df
team order lead_time
a 2 3
a 4 3
a 3 3
a 5 3
a 6 3
b 7 2
b 8 2
b 5 2
b 4 2
b 5 2
And adding a receive column like so:
dfb <- data.frame(team = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b"),
                  order = c(2, 4, 3, 5, 6, 7, 8, 5, 4, 5),
                  lead_time = c(3, 3, 3, 3, 3, 2, 2, 2, 2, 2),
                  receive = c(0, 0, 0, 2, 4, 0, 0, 7, 8, 5))
>dfb
team order lead_time receive
a 2 3 0
a 4 3 0
a 3 3 0
a 5 3 2
a 6 3 4
b 7 2 0
b 8 2 0
b 5 2 7
b 4 2 8
b 5 2 5
I was thinking along these lines but ran into an error:
dfc <- df %>%
  group_by(team) %>%
  mutate(receive = if_else(row_number() < lead_time, 0, lag(order, n = lead_time)))
Error in mutate_impl(.data, dots) :
could not convert second argument to an integer. type=SYMSXP, length = 1
Thanks for the help!
This looks like a bug; there may be an unintended masking of the lag() function between the dplyr and stats packages. Try this workaround:
df %>%
  group_by(team) %>%
  # explicitly specify the source of the lag function here
  mutate(receive = dplyr::lag(order, n = unique(lead_time), default = 0))
#Source: local data frame [10 x 4]
#Groups: team [2]
# team order lead_time receive
# <fctr> <dbl> <dbl> <dbl>
#1 a 2 3 0
#2 a 4 3 0
#3 a 3 3 0
#4 a 5 3 2
#5 a 6 3 4
#6 b 7 2 0
#7 b 8 2 0
#8 b 5 2 7
#9 b 4 2 8
#10 b 5 2 5
We can also use shift from data.table
library(data.table)
setDT(df)[, receive := shift(order, n = lead_time[1], fill=0), by = team]
df
# team order lead_time receive
# 1: a 2 3 0
# 2: a 4 3 0
# 3: a 3 3 0
# 4: a 5 3 2
# 5: a 6 3 4
# 6: b 7 2 0
# 7: b 8 2 0
# 8: b 5 2 7
# 9: b 4 2 8
#10: b 5 2 5
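For completeness, the same grouped lag can be written in base R by padding each group's order with n zeros and dropping its last n values. A sketch (lag_n is a hypothetical helper, and the unlist() step relies on the rows already being sorted by team, as they are in the example data):

```r
df <- data.frame(team = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b"),
                 order = c(2, 4, 3, 5, 6, 7, 8, 5, 4, 5),
                 lead_time = c(3, 3, 3, 3, 3, 2, 2, 2, 2, 2))

# Pad with n zeros in front, then drop the last n values
lag_n <- function(x, n) c(rep(0, n), head(x, length(x) - n))

# Apply per team, using each group's own lead time
groups <- split(df[c("order", "lead_time")], df$team)
df$receive <- unlist(lapply(groups, function(g) lag_n(g$order, g$lead_time[1])),
                     use.names = FALSE)
```

This reproduces the receive column from dfb without loading any packages.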
