I have data for multiple products, their launch date, and sales; there are other variables as well, but these are the ones I am using for the manipulation. I want to add a column and a row to my dataset. The new variable/column, launch, indicates months after the product's launch: launch 1 is the first month for each product, 2 is the second month, and so on. I also want to add an observation (row) for each product with launch 0 and sales 0.
months <- as.Date(c("2011-04-01", "2011-05-01", "2011-06-01",
                    "2012-10-01", "2012-11-01", "2012-12-01",
                    "2011-04-01", "2011-05-01", "2011-06-01",
                    "2013-06-01", "2013-07-01", "2013-08-01"))
product <- c("A", "A", "A",
             "B", "B", "B",
             "C", "C", "C",
             "D", "D", "D")
sales <- c(75, 78, 80,
           67, 65, 75,
           86, 87, 87,
           90, 92, 94)
# This is how the data looks right now:
input_data <- data.frame(months, product, sales)
Right now, I can add the launch column by assigning it row_number() after group_by(product), which populates launch as 1, 2, 3, etc. based on the months. However, I don't know how to add the additional observation for each product.
Currently, I identify the entry date of each product, create a data frame with the 0 launch and 0 sales, and bind it to the dataset. But it is tedious, and I am sure it could be done more efficiently.
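For reference, here is a minimal sketch of that manual approach (added for illustration, not code from the original question; it assumes dplyr >= 1.0 and lubridate, and simply takes the launch-0 month to be one month before each product's first observed month):
library(dplyr)
library(lubridate)

# one launch-0 row per product, dated one month before its entry month
zero_rows <- input_data %>%
  group_by(product) %>%
  summarise(months = min(months) %m-% months(1),
            sales  = 0,
            .groups = "drop")

input_data %>%
  bind_rows(zero_rows) %>%
  arrange(product, months) %>%
  group_by(product) %>%
  mutate(launch = row_number() - 1) %>%  # 0 for the added row, then 1, 2, 3
  ungroup()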
# Expected outcome:
# I don't care much about the date in the additional row; it can remain NA. I filled it in here only to build the data frame.
months1 <- as.Date(c("2011-03-01", "2011-04-01", "2011-05-01", "2011-06-01",
                     "2012-09-01", "2012-10-01", "2012-11-01", "2012-12-01",
                     "2011-03-01", "2011-04-01", "2011-05-01", "2011-06-01",
                     "2013-05-01", "2013-06-01", "2013-07-01", "2013-08-01"))
launch <- c(0, 1, 2, 3,
            0, 1, 2, 3,
            0, 1, 2, 3,
            0, 1, 2, 3)
product1 <- c("A", "A", "A", "A",
              "B", "B", "B", "B",
              "C", "C", "C", "C",
              "D", "D", "D", "D")
sales1 <- c(0, 75, 78, 80,
            0, 67, 65, 75,
            0, 86, 87, 87,
            0, 90, 92, 94)
output_data <- data.frame(months1, launch, product1, sales1)
We may use complete to expand the data after grouping by 'product'
library(lubridate)
library(dplyr)
library(tidyr)
input_data %>%
  group_by(product) %>%
  # expand each product to one month before its first observed month through
  # two months after it, filling sales with 0 for the newly added month
  complete(months = first(months) %m+% months(-1:2),
           fill = list(sales = 0)) %>%
  mutate(launch = row_number() - 1) %>%  # 0 for the added month, then 1, 2, 3
  ungroup() %>%
  select(months, launch, product, sales)
-output
# A tibble: 16 × 4
months launch product sales
<date> <dbl> <chr> <dbl>
1 2011-03-01 0 A 0
2 2011-04-01 1 A 75
3 2011-05-01 2 A 78
4 2011-06-01 3 A 80
5 2012-09-01 0 B 0
6 2012-10-01 1 B 67
7 2012-11-01 2 B 65
8 2012-12-01 3 B 75
9 2011-03-01 0 C 0
10 2011-04-01 1 C 86
11 2011-05-01 2 C 87
12 2011-06-01 3 C 87
13 2013-05-01 0 D 0
14 2013-06-01 1 D 90
15 2013-07-01 2 D 92
16 2013-08-01 3 D 94
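As an added illustration of the key step (not part of the original answer), adding a sequence of month offsets to a single date with lubridate produces the expanded set of months that complete() then fills in per product:
library(lubridate)
# one month before the first observed month, plus the three observed months
as.Date("2011-04-01") %m+% months(-1:2)
# "2011-03-01" "2011-04-01" "2011-05-01" "2011-06-01"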
Suppose I have the following data in wide format:
data = tibble::tribble(
~ID, ~Time, ~Value, ~ValueX,
"A", 1, 11, 41,
"A", 2, 12, 42,
"A", 3, 13, 43,
"B", 1, 21, 41,
"B", 2, 22, 42,
"B", 3, 23, 43,
"C", 1, 31, 41,
"C", 2, 32, 42,
"C", 3, 33, 43
)
Since ValueX is a repeated variable that does not vary within the ID grouping variable, I just want to add it as new rows identified by ID. This is the desired output:
data.desired = tibble::tribble(
  ~ID, ~Time, ~Value,
  "A", 1, 11,
  "A", 2, 12,
  "A", 3, 13,
  "B", 1, 21,
  "B", 2, 22,
  "B", 3, 23,
  "C", 1, 31,
  "C", 2, 32,
  "C", 3, 33,
  "ValueX", 1, 41,
  "ValueX", 2, 42,
  "ValueX", 3, 43
)
Here is a way via base R. You can aggregate ValueX per Time, taking the first observation of each, then create a data frame with the same names as your original data and simply rbind, i.e.
rbind(data[-ncol(data)],
setNames(data.frame('ValueX', aggregate(ValueX ~ Time, data, head, 1)),
names(data[-ncol(data)])))
# A tibble: 12 x 3
# ID Time Value
# <chr> <dbl> <dbl>
# 1 A 1 11
# 2 A 2 12
# 3 A 3 13
# 4 B 1 21
# 5 B 2 22
# 6 B 3 23
# 7 C 1 31
# 8 C 2 32
# 9 C 3 33
#10 ValueX 1 41
#11 ValueX 2 42
#12 ValueX 3 43
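As an added illustration (not part of the answer), the intermediate aggregate() call collapses the data to one ValueX per Time:
aggregate(ValueX ~ Time, data, head, 1)
#   Time ValueX
# 1    1     41
# 2    2     42
# 3    3     43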
Using the tidyverse:
# rows to append: one row per Time, with ID = "ValueX" and the value in Value
addCase <- distinct(data, Time, ValueX) %>%
  pivot_longer(-Time, names_to = "ID", values_to = "Value")

# drop the repeated column and append the new rows
data %>%
  select(-ValueX) %>%
  add_case(addCase)
# A tibble: 12 x 3
ID Time Value
<chr> <dbl> <dbl>
1 A 1 11
2 A 2 12
3 A 3 13
4 B 1 21
5 B 2 22
6 B 3 23
7 C 1 31
8 C 2 32
9 C 3 33
10 ValueX 1 41
11 ValueX 2 42
12 ValueX 3 43
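An equivalent sketch (added here, not from the answers above) appends the reshaped rows with bind_rows(), which matches columns by name:
library(dplyr)
library(tidyr)

data %>%
  select(-ValueX) %>%
  bind_rows(
    distinct(data, Time, ValueX) %>%
      pivot_longer(-Time, names_to = "ID", values_to = "Value")
  )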
I have a data table of labelled coordinates that are aligned between two groups (A and B). For example:
library(data.table)
dt_long <- data.table(LABEL_A = c(rep("A", 20), rep("A", 15), rep("A", 10), rep("A", 15), rep("A", 10)),
                      SEQ_A = c(11:30, 61:75, 76:85, 86:100, 110:119),
                      LABEL_B = c(rep("C", 20), rep("D", 15), rep("F", 10), rep("G", 15), rep("D", 10)),
                      SEQ_B = c(1:20, 25:11, 16:25, 15:1, 1:5, 8:12))
How can I reduce this information to a short format where the start and end coordinates of each aligned sequence are given? For example:
dt_short <- data.table(LABEL_A = c("A", "A", "A", "A", "A", "A"),
Start_A = c(11, 61, 76, 86, 110, 115),
End_A = c(30, 75, 85, 100, 114, 119),
LABEL_B= c("C", "D", "F", "G", "D", "D"),
Start_B = c(1, 25, 16, 15, 1, 8),
End_B = c(20, 11, 25, 1, 5, 12))
The length of each aligned sequence should be identical. For example:
identical(abs(dt_short$End_A - dt_short$Start_A), abs(dt_short$End_B - dt_short$Start_B))
You can make use of rleid, incorporating Frank's comment to remove the grouping column:
dt_long[, .(
    LABEL_A = LABEL_A[1L], Start_A = SEQ_A[1L], End_A = SEQ_A[.N],
    LABEL_B = LABEL_B[1L], Start_B = SEQ_B[1L], End_B = SEQ_B[.N]),
  # a new run starts whenever either label changes or either sequence jumps forward by more than 1
  by = rleid(LABEL_A, LABEL_B,
             c(0L, cumsum(diff(SEQ_A) > 1L)),
             c(0L, cumsum(diff(SEQ_B) > 1L)))][, (1) := NULL]
output:
LABEL_A Start_A End_A LABEL_B Start_B End_B
1: A 11 30 C 1 20
2: A 61 75 D 25 11
3: A 76 85 F 16 25
4: A 86 100 G 15 1
5: A 110 114 D 1 5
6: A 115 119 D 8 12
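As a quick illustration (added here, not part of the answer), rleid() numbers consecutive runs, which is what keeps the two separate "D" alignments apart:
library(data.table)
rleid(c("C", "C", "D", "D", "F", "D"))
# 1 1 2 2 3 4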
A straightforward way is to group by the two labels and take the first and last value of each group, i.e.
library(data.table)
dt_long[, .(Start_A = first(SEQ_A), End_A = last(SEQ_A), Start_B = first(SEQ_B), End_B = last(SEQ_B)), by = .(LABEL_A, LABEL_B)][]
# LABEL_A LABEL_B Start_A End_A Start_B End_B
#1: 1 3 11 30 1 20
#2: 1 4 61 75 25 11
#3: 1 6 76 85 16 25
#4: 1 7 86 100 15 1
We can just subset and dcast. This would also work seamlessly when there are many different groups of columns:
dcast(dt_long[, .SD[c(1, .N)], .(LABEL_A, LABEL_B)],
LABEL_A + LABEL_B ~ c("Start", "End")[rowid(LABEL_A, LABEL_B)],
value.var = c("SEQ_A", "SEQ_B"))
# LABEL_A LABEL_B SEQ_A_End SEQ_A_Start SEQ_B_End SEQ_B_Start
#1: 1 3 30 11 20 1
#2: 1 4 75 61 11 25
#3: 1 6 85 76 25 16
#4: 1 7 100 86 1 15
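As a side note (an added illustration), rowid() numbers rows within each group, which is what maps the first and last row of each group to "Start" and "End" in the dcast formula:
library(data.table)
rowid(c("A", "A", "B", "B", "C", "C"))
# 1 2 1 2 1 2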
I would like to return all the observations within a group if at least one of the group's observations meets a filtering criterion.
For example, below I would like only the groups "shoe" and "ship" returned, with all of their values, since both of those groups have at least one value under 50.
I tried using group_by, but it seems to only return the observations where the filter criterion is met, not the whole group.
library(dplyr)
test <- data.frame(prod_id = c("shoe", "shoe", "shoe", "shoe", "shoe", "shoe",
                               "boat", "boat", "boat", "boat", "boat", "boat",
                               "ship", "ship", "ship", "ship", "ship", "ship"),
                   seller_id = c("a", "b", "c", "d", "e", "f",
                                 "a", "g", "h", "r", "q", "b",
                                 "qe", "dj", "d3", "kk", "dn", "de"),
                   Dich = c(1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                   price = c(12, 200, 10, 4, 3, 4, 99, 55, 86, 88, 75, 64,
                             82, 21, 44, 34, 22, 33))
Here is what I tried:
test2 <- test%>%
group_by(prod_id) %>%
(filter = price < 50)
You need filter with any:
library(dplyr)
test %>%
  group_by(prod_id) %>%
  filter(any(price < 50))
# prod_id seller_id Dich price
# <fct> <fct> <dbl> <dbl>
# 1 shoe a 1 12
# 2 shoe b 0 200
# 3 shoe c 0 10
# 4 shoe d 0 4
# 5 shoe e 0 3
# 6 shoe f 0 4
# 7 ship qe 0 82
# 8 ship dj 0 21
# 9 ship d3 0 44
#10 ship kk 0 34
#11 ship dn 0 22
#12 ship de 0 33
Or the base R approach using ave:
# keep rows whose group contains at least one price below 50
test[with(test, ave(price < 50, prod_id, FUN = any)), ]
For completeness' sake, one with data.table:
library(data.table)
setDT(test)[, if(any(price < 50)) .SD, prod_id]
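An equivalent dplyr sketch (added here, not from the answers above) avoids group_by() entirely by keeping every row whose prod_id appears among the rows that meet the condition:
library(dplyr)
test %>%
  semi_join(filter(test, price < 50), by = "prod_id")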
The R script below creates a data frame a123 with three columns. Column a1 contains three values ("A", "B", "C") occurring at different positions, with corresponding a2 and a3 values.
a1 = c("A", "B", "C", "A", "B", "B", "A", "C", "A", "C", "B")
a2 = c(10, 8, 11, 6, 4, 7, 9, 1, 3, 2, 7)
a3 = c(55, 34, 33, 23, 78, 33, 123, 34, 85, 76, 74)
a123 = data.frame(a1, a2, a3)
My need is that the a3 values corresponding to the a1 values be arranged in ascending order based on the order of the a2 values. Also, if common a2 values are encountered, the corresponding a3 values should be arranged in ascending order. For example, say the value "A" in column a1 has the following values in a2 and a3:
a2 = c(10, 6, 9, 3)
a3 = c(55, 23, 123, 85)
The rearranged a3 values could then be:
a3 = c(123, 23, 85, 55)
Expected Outcome:
a1 = c("A", "B", "C", "A", "B", "B", "A", "C", "A", "C", "B")
a2 = c( 10, 8, 11, 6, 4, 7, 9, 1, 3, 2, 7)
a3 = c( 123, 78, 76, 23, 33, 34, 85, 33, 55, 34, 74)
a123 = data.frame(a1, a2, a3)
Thanks, and please help. Note: please try to avoid loops and conditionals, as they might slow the computation on large data.
A solution using dplyr, sort, and rank. I do not fully understand your logic, but this is probably something like what you are looking for. Notice that I assume the rearranged elements of a3 for group A are 123, 55, 85, 23.
library(dplyr)
a123_r <- a123 %>%
  group_by(a1) %>%
  # sort a3 in descending order, then place each value according to the
  # descending rank of a2 (the tied a2 values are resolved last-first)
  mutate(a3 = sort(a3, decreasing = TRUE)[rank(-a2, ties.method = "last")]) %>%
  ungroup() %>%
  as.data.frame()
a123_r
# a1 a2 a3
# 1 A 10 123
# 2 B 8 78
# 3 C 11 76
# 4 A 6 55
# 5 B 4 33
# 6 B 7 34
# 7 A 9 85
# 8 C 1 33
# 9 A 3 23
# 10 C 2 34
# 11 B 7 74
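To make the indexing trick concrete (an illustration added here, not part of the original answer), take group "B", where a2 = c(8, 4, 7, 7) and a3 = c(34, 78, 33, 74):
rank(-c(8, 4, 7, 7), ties.method = "last")
# 1 4 3 2   (positions in descending a2 order; the tied 7s are resolved last-first)
sort(c(34, 78, 33, 74), decreasing = TRUE)[c(1, 4, 3, 2)]
# 78 33 34 74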
Essentially, I need to calculate means of values in rows under certain conditions.
Name = c("A", "A", "A", "A", "B", "B", "B", "B")
temp = c(22, 22, 26, 23, 18, 20, 18, 17)
peak = c(0, 0, 1, 0, 0, 1, 0, 0)
new = NA
d<- data.frame(Name, temp, peak, new)
When peak = 1, calculate the average of temp at rows i-1 and i+1 and place that value in the 'new' column. Otherwise, the value in new should be the same as temp. I would like to do this only within "Name" groups, so that group A temp values are not mixed with group B's.
Then, the output will look like this:
Name temp peak new
1 A 22 0 22.0
2 A 22 0 22.0
3 A 26 1 22.5
4 A 23 0 23.0
5 B 18 0 18.0
6 B 20 1 18.0
7 B 18 0 18.0
8 B 17 0 17.0
I started writing an ifelse statement, which might look something like this:
d$new<-ifelse(d$peak==1, mean(peak[i-1, i+1]), d$temp)
I also thought about lapply, but I think this needs a loop. Any suggestions?
This should do the trick, with no loops:
Name = c("A", "A", "A", "A", "B", "B", "B", "B")
temp = c(22, 22, 26, 23, 18, 20, 18, 17)
peak = c(0, 0, 1, 0, 0, 1, 0, 0)
d <- data.frame(Name, temp, peak)
d$new = temp                                           # default: copy temp
ind = which(d$peak == 1)                               # rows where peak == 1
d$new[ind] = (d$temp[ind - 1] + d$temp[ind + 1]) / 2   # mean of the neighbouring temps
Try rollapply from the zoo package:
library(zoo)
rollfun <- function(i) with(d[i, ], if (peak[2]) mean(temp[-2]) else temp[2])
transform(d, temp.new = rollapply(seq(0, nrow(d)+1), 3, rollfun))
Note that this assumes that there are no peaks at boundaries (which is the case in the question).
REVISED: Some simplifications.
Here is the output:
> Name = c("A", "A", "A", "A", "B", "B", "B", "B")
> temp = c(22, 22, 26, 23, 18, 20, 18, 17)
> peak = c(0, 0, 1, 0, 0, 1, 0, 0)
> new = NA
> d<- data.frame(Name, temp, peak, new)
> library(zoo)
>
> rollfun <- function(i) with(d[i, ], if (peak[2]) mean(temp[-2]) else temp[2])
> transform(d, temp.new = rollapply(seq(0, nrow(d)+1), 3, rollfun))
Name temp peak new temp.new
1 A 22 0 NA 22.0
2 A 22 0 NA 22.0
3 A 26 1 NA 22.5
4 A 23 0 NA 23.0
5 B 18 0 NA 18.0
6 B 20 1 NA 18.0
7 B 18 0 NA 18.0
8 B 17 0 NA 17.0
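To keep the calculation within each Name group, as the question asks, here is a hedged dplyr sketch (added here, not part of either answer above): lag() and lead() operate within each group, so a peak at a group boundary would get NA rather than a neighbour from another group.
library(dplyr)

d %>%
  group_by(Name) %>%
  mutate(new = ifelse(peak == 1, (lag(temp) + lead(temp)) / 2, temp)) %>%
  ungroup()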