I have data structured as below:
ID Day Desired Output
1 1 1
1 1 1
1 1 1
1 2 2
1 2 2
1 3 3
2 4 1
2 4 1
2 5 2
3 6 1
3 6 1
Is it possible to create a sequence for the desired output without using a loop? The dataset is quite large, so a loop won't work. Can this be done with the dplyr package, or perhaps with a combination of cumsum/diff?
An option is to group by 'ID' and then match 'Day' against the unique values of 'Day' within each group:
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(desired = match(Day, unique(Day)))
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L), Day = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 5L, 6L, 6L)), row.names = c(NA,
-11L), class = "data.frame")
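The question also asks about cumsum/diff. A minimal base R sketch of both ideas, using ave() to apply a function within each ID, might look like this. The column names by_match and by_cumsum are made up for illustration, and the cumsum/diff variant assumes that equal Day values sit next to each other within an ID.

```r
# Same data as above
df1 <- data.frame(
  ID  = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  Day = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 5L, 6L, 6L)
)

# Variant 1: position of each Day among the unique Days of its ID
df1$by_match <- ave(df1$Day, df1$ID,
                    FUN = function(d) match(d, unique(d)))

# Variant 2: cumsum/diff; start at 1 and add 1 whenever Day changes.
# Assumes equal Day values are contiguous within each ID.
df1$by_cumsum <- ave(df1$Day, df1$ID,
                     FUN = function(d) cumsum(c(1L, diff(d) != 0)))

df1
```

If the days of an ID can be interleaved (for example 1, 2, 1), the match variant is the safer choice, since it does not depend on row order.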
I have a data frame of this form:
ID Var1 Var2
1 1 1
1 2 2
1 3 3
1 4 2
1 5 2
2 1 4
2 2 8
2 3 10
2 4 10
2 5 7
For each group, I would like to get the maximum of Var1, but only over the rows that come before Var2 reaches its maximum. The result should be a new data frame containing one row per ID, something like this:
ID Var1
1 2
2 2
In other words, only the rows before Var2 reaches its maximum should be considered. The row containing the maximum itself should not be included, and neither should the rows after it.
I tried building something with a while loop, but it didn't work out. I'd also be thankful if the solution doesn't use data.table.
Thanks in advance
Maybe you could do something like this:
DF <- structure(list(
ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
Var1 = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L),
Var2 = c(1L, 2L, 3L, 2L, 2L, 4L, 8L, 10L, 10L, 7L)),
class = "data.frame", row.names = c(NA, -10L))
library(dplyr)
DF %>% group_by(ID) %>%
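# keep the rows before the first maximum of Var2; seq_len() keeps no rows if the maximum is in row 1
slice(seq_len(which.max(Var2) - 1)) %>%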
slice_max(Var1) %>%
select(ID, Var1)
#> # A tibble: 2 x 2
#> # Groups: ID [2]
#> ID Var1
#> <int> <int>
#> 1 1 2
#> 2 2 2
Created on 2020-08-04 by the reprex package (v0.3.0)
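For comparison, a base R sketch of the same idea without dplyr could look like the following. The helper name max_before_peak is hypothetical, and returning NA when the maximum of Var2 is already in the first row of a group is an assumption about the desired behaviour, not something the question specifies.

```r
DF <- data.frame(
  ID   = rep(1:2, each = 5),
  Var1 = rep(1:5, times = 2),
  Var2 = c(1L, 2L, 3L, 2L, 2L, 4L, 8L, 10L, 10L, 7L)
)

# Hypothetical helper: max of Var1 over the rows before the first maximum of Var2
max_before_peak <- function(d) {
  n_before <- which.max(d$Var2) - 1L
  if (n_before < 1L) return(NA_integer_)  # peak is in the first row: nothing to look at
  max(d$Var1[seq_len(n_before)])
}

# Apply the helper to each ID and collect one row per ID
res <- sapply(split(DF, DF$ID), max_before_peak)
out <- data.frame(ID = as.integer(names(res)), Var1 = unname(res))
out
```

Note that which.max() returns the position of the first maximum, so ties in Var2 (as in ID 2) are cut at the earliest one.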
I'm trying to delete some repeating information in my data set and replace it with NA. Here's an example of the data:
DataTable1
ID Day x y
1 1 1 3
1 2 1 3
2 1 2 5
2 2 2 5
3 1 3 4
3 2 3 4
4 1 4 6
4 2 4 6
I'm trying to replace "x" and "y" values with "NA" when Day=1. This is what I want:
ID Day x y
1 1 NA NA
1 2 1 3
2 1 NA NA
2 2 2 5
3 1 NA NA
3 2 3 4
4 1 NA NA
4 2 4 6
I'm not really sure where to start or how to go about this. I tried using the replace_with_na_if function from the naniar library. Otherwise, I am unsure what to try.
replace_with_na_if(data.frame=DataTable1$x,
condition=DataTable1$Day== 2)
I received an error message that reads:
Error in replace_with_na_if(data.frame = DataTable1$x, condition = DataTable1$Day == :
unused argument (data.frame = DataTable1$x)
In base R, one option is to create a logical vector from 'Day', use it to select the matching rows of the 'x' and 'y' columns, and assign NA to them:
i1 <- df1$Day == 1
df1[i1, c('x', 'y')] <- NA
Here's a data.table solution. Since you may be new to R: the data.table package has to be installed first (install.packages("data.table")). On large data sets, data.table is often faster than a plain data frame, and I find the syntax easy to read and understand.
#Create the data frame:
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Day = c(1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), x = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), y = c(3L, 3L, 5L, 5L,
4L, 4L, 6L, 6L)), class = "data.frame", row.names = c(NA, -8L))
library(data.table)
dt <- setDT(df) # convert the data frame to a data.table
dt[Day == 1, c("x","y") := NA] # where Day equals 1, make the columns x and y equal NA
Good luck and welcome to stackoverflow!
Using dplyr, we can use mutate_at and replace:
library(dplyr)
df %>% mutate_at(vars(x, y), ~replace(., Day == 1, NA))
# ID Day x y
#1 1 1 NA NA
#2 1 2 1 3
#3 2 1 NA NA
#4 2 2 2 5
#5 3 1 NA NA
#6 3 2 3 4
#7 4 1 NA NA
#8 4 2 4 6
data
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Day = c(1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), x = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), y = c(3L, 3L, 5L, 5L,
4L, 4L, 6L, 6L)), class = "data.frame", row.names = c(NA, -8L))
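Since dplyr 1.0.0, mutate_at has been superseded by across(). A sketch of the same replacement in that style, assuming dplyr 1.0.0 or later is installed, might be:

```r
library(dplyr)

# Same data as above
df <- data.frame(
  ID  = rep(1:4, each = 2),
  Day = rep(1:2, times = 4),
  x   = rep(1:4, each = 2),
  y   = rep(c(3L, 5L, 4L, 6L), each = 2)
)

# For columns x and y, set the value to NA wherever Day is 1
out <- df %>%
  mutate(across(c(x, y), ~ replace(.x, Day == 1, NA)))
out
```

The result is returned rather than modified in place, so it has to be assigned (here to out) to be kept.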
My data:
data=structure(list(v1 = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
v2 = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L), x = c(10L,
1L, 2L, 3L, 4L, 3L, 2L, 30L, 3L, 5L)), .Names = c("v1", "v2",
"x"), class = "data.frame", row.names = c(NA, -10L))
There are 3 variables.
For each category of v1, I need only the row in which x has its maximum value.
For example, take the first category of v1 and find the category of v2 at which x is largest.
That is
v1=1 and v2=1 x=10
Take the second category of v1 and do the same.
That is v1=2, v2=3, x=30
So the desired output is
v1 v2 x
1 1 10
2 3 30
How can I do this?
Here is a solution using data.table:
library(data.table)
setDT(data)
data[, .SD[which.max(x)], keyby = v1]
v1 v2 x
1: 1 1 10
2: 2 3 30
And for completeness an ugly base-R solution:
t(sapply(split(data, data[["v1"]]), function(s) s[which.max(s[["x"]]),]))
v1 v2 x
1 1 1 10
2 2 3 30
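A somewhat tidier base R sketch that keeps the result as a data frame uses ave() to compute the group maximum for every row and then filters on it. The name group_max is made up for illustration. Unlike which.max(), this keeps all rows that tie for the maximum within a group.

```r
# Same data as in the question
data <- data.frame(
  v1 = rep(1:2, each = 5),
  v2 = rep(1:5, times = 2),
  x  = c(10L, 1L, 2L, 3L, 4L, 3L, 2L, 30L, 3L, 5L)
)

# ave() returns the group maximum repeated for every row of the group
group_max <- ave(data$x, data$v1, FUN = max)

# Keep the rows where x equals the maximum of its v1 group
res <- data[data$x == group_max, ]
res
```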
Using dplyr:
library(dplyr)
data %>%
group_by(v1) %>%
filter(x == max(x))
# A tibble: 2 x 3
# Groups: v1 [2]
v1 v2 x
<int> <int> <int>
1 1 1 10
2 2 3 30
I'm trying to determine the number of unique customers per week per store.
I have a piece of code that accomplishes this task but the tabulation is not what I am looking for.
I have the following table:
store week customer_ID
1 1 1
1 1 1
1 1 2
1 2 1
1 2 2
1 2 3
2 1 1
2 1 1
2 1 2
2 2 2
2 2 3
2 2 3
So for every week I need to count how many unique customers there were.
For example, if customer 1 visited in week 1 and then revisited in week 2, the second visit would not count as a unique visit.
If that same customer visited store 2 in week 1 or any other week, that would count as a unique visit for store 2.
The outcome would look like the following:
store week unique Customers
1 1 2
1 2 1
2 1 2
2 2 1
I used the following, but it's not correct:
agg <- aggregate(data=df, customer_ID~ week+store, function(x) length(unique(x)))
structure(list(store = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), week = c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,
2L, 2L), customer_ID = c(1L, 1L, 2L, 1L, 2L, 3L, 1L, 1L, 2L,
2L, 3L, 3L)), .Names = c("store", "week", "customer_ID"), class = "data.frame", row.names = c(NA,
-12L))
Here is a base R method. The idea is to split the data into a list of data.frames, one for each store. Assuming the observations are ordered by week, dropping the duplicated customer IDs within each store keeps only each customer's first visit. The remaining rows are then aggregated by counting the customer IDs per week with length. Finally, do.call and rbind put the results into a single data.frame:
do.call(rbind, lapply(split(df, df$store),
function(i) aggregate(data=i[!duplicated(i$customer_ID),],
customer_ID ~ week+store, length)))
week store customer_ID
1.1 1 1 2
1.2 2 1 1
2.1 1 2 2
2.2 2 2 1
To make sure that your data.frame is ordered properly before attempting this, you could use order:
df <- df[order(df$store, df$week), ]
In case it is of interest, I put together a data.table solution as well.
library(data.table)
setDT(df)
df[df[, !duplicated(customer_ID), by=store]$V1,
.(newCust=length(customer_ID)), by=.(store, week)]
store week newCust
1: 1 1 2
2: 1 2 1
3: 2 1 2
4: 2 2 1
This method uses the logical vector df[, !duplicated(customer_ID), by=store]$V1 to subset the data to the first occurrence of each customer ID within a store, and then counts the new customers by store and week.
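Both approaches above rely on the rows being ordered by week within each store. A sketch that avoids this assumption is to compute each customer's first week per store explicitly and then count by that week. The names first_visit and new_cust are made up for illustration. It also shows why the aggregate call in the question overcounts: that call counts distinct customers within each week, so store 1 in week 2 gives 3 instead of 1.

```r
# Same data as in the question
df <- data.frame(
  store       = rep(1:2, each = 6),
  week        = rep(rep(1:2, each = 3), times = 2),
  customer_ID = c(1L, 1L, 2L, 1L, 2L, 3L, 1L, 1L, 2L, 2L, 3L, 3L)
)

# Step 1: first week in which each customer appears at each store
first_visit <- aggregate(week ~ store + customer_ID, data = df, FUN = min)

# Step 2: count customers by the store and week of their first visit
new_cust <- aggregate(customer_ID ~ store + week, data = first_visit, FUN = length)
new_cust <- new_cust[order(new_cust$store, new_cust$week), ]
new_cust
```

One caveat: a store/week combination with no new customers does not appear in the result at all, rather than appearing with a count of 0.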
I have the following data frame:
Event Scenario Year Cost
1 1 1 10
2 1 1 5
3 1 2 6
4 1 2 6
5 2 1 15
6 2 1 12
7 2 2 10
8 2 2 5
9 3 1 4
10 3 1 5
11 3 2 6
12 3 2 5
I need to produce a pivot table/data frame that sums the total cost per year for each scenario. The result would be:
Scenario Year Cost
1 1 15
1 2 12
2 1 27
2 2 15
3 1 9
3 2 11
I need to produce a ggplot line graph that plots the cost of each scenario per year. I know how to do that; I just can't get the right data frame.
Try
library(dplyr)
df %>% group_by(Scenario, Year) %>% summarise(Cost=sum(Cost))
Or
library(data.table)
setDT(df)[, list(Cost=sum(Cost)), by=list(Scenario, Year)]
Or
aggregate(Cost~Scenario+Year, df,sum)
data
df <- structure(list(Event = 1:12, Scenario = c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L), Year = c(1L, 1L, 2L, 2L, 1L, 1L,
2L, 2L, 1L, 1L, 2L, 2L), Cost = c(10L, 5L, 6L, 6L, 15L, 12L,
10L, 5L, 4L, 5L, 6L, 5L)), .Names = c("Event", "Scenario", "Year",
"Cost"), class = "data.frame", row.names = c(NA, -12L))
The following does it:
library(plyr)
ddply(df, .(Scenario, Year), summarize, Cost = sum(Cost))
#Scenario Year Cost
#1 1 1 15
#2 1 2 12
#3 2 1 27
#4 2 2 15
#5 3 1 9
#6 3 2 11
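Since the question mentions a pivot table, here is a base R sketch that builds both shapes: the long form (one row per Scenario/Year, as in the desired output) and a wide cross-tabulation via xtabs(). The names long and wide are made up for illustration.

```r
# Same data as above
df <- data.frame(
  Event    = 1:12,
  Scenario = rep(1:3, each = 4),
  Year     = rep(rep(1:2, each = 2), times = 3),
  Cost     = c(10L, 5L, 6L, 6L, 15L, 12L, 10L, 5L, 4L, 5L, 6L, 5L)
)

# Long format: one row per Scenario/Year combination
long <- aggregate(Cost ~ Scenario + Year, data = df, FUN = sum)

# Wide format: Scenario in rows, Year in columns, summed Cost in the cells
wide <- xtabs(Cost ~ Scenario + Year, data = df)
wide
```

The long form is the shape ggplot2 generally expects for a line graph, with Year on the x axis and Scenario mapped to colour or group.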