Here is my data:
df <- tibble::tribble(
~Group, ~Year, ~given, ~required,
"A", 2017L, 3L, 1L,
"A", 2017L, 4L, 2L,
"A", 2017L, 8L, 6L,
"A", 2018L, 1L, 7L,
"A", 2018L, 4L, 10L,
"B", 2018L, 8L, 1L,
"B", 2019L, 3L, 4L,
"B", 2019L, 4L, 5L)
I want to calculate "required" such that, for any "Group":
The first entry of 'required' gets the value 1.
The change in 'required' from one row to the next must equal the change in 'given'.
Within any Year, 'given' ranges from a minimum of 1 to a maximum of 8, so a drop in 'given' means it has wrapped around past 8.
How should I calculate the 'required' variable using the 'Group', 'Year', and 'given' variables?
library(tidyverse)
df2 <- df %>%
  group_by(Group) %>%
  mutate(required = c(1, diff(given)),
         required = ifelse(required < 0, max(given) - abs(required), required),
         required = cumsum(required)) %>%
  ungroup()
df2
# # A tibble: 8 x 4
# Group Year given required
# <chr> <int> <int> <dbl>
# 1 A 2017 3 1
# 2 A 2017 4 2
# 3 A 2017 8 6
# 4 A 2018 1 7
# 5 A 2018 4 10
# 6 B 2018 8 1
# 7 B 2019 3 4
# 8 B 2019 4 5
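Because 'given' is stated to run from 1 to 8, the wrap-around can also be written with modular arithmetic. The following is a minimal base R sketch rather than the answer's own method; the helper name wrap_steps and the fixed period of 8 are assumptions taken from the range stated in the question, and the data is retyped from the question.

```r
# Data retyped from the question (Year is not needed for the calculation).
df <- data.frame(
  Group = c("A", "A", "A", "A", "A", "B", "B", "B"),
  given = c(3L, 4L, 8L, 1L, 4L, 8L, 3L, 4L)
)

# Hypothetical helper: start at 1, then add each step in 'given',
# wrapping negative steps around an assumed period of 8.
wrap_steps <- function(given, period = 8L) {
  cumsum(c(1L, diff(given) %% period))
}

# Apply the helper within each Group.
df$required <- ave(df$given, df$Group, FUN = wrap_steps)
df$required
```

Unlike max(given), which is computed per group, a fixed period still works when a group never reaches 8.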
I have longitudinal data in which respondents were recruited as a cohort. Right now, I have the year in which each respondent took the survey, but I want to create a new column that counts whether it is the first, second, or third time that person took the survey.
Original Table
PersonID SurveyYear SurveyQ1Rating SurveyQ2Rating Gender
12       2013       5              4              f
12       2012       4              4              f
12       2010       3              3              f
2        2007       4              4              m
2        2008       3              3              m
2        2009       3              5              m
2        2010       5              5              m
2        2013       2              2              m
5        2013       4              4              f
5        2014       5              5              f
Target Table (where I created a new column, SurveyTime, to mark the ith time a person took the survey)
PersonID SurveyYear SurveyTime SurveyQ1Rating SurveyQ2Rating Gender
12       2013       3          5              4              f
12       2012       2          4              4              f
12       2010       1          3              3              f
2        2007       1          4              4              m
2        2008       2          3              3              m
2        2009       3          3              5              m
2        2010       4          5              5              m
2        2013       5          2              2              m
5        2013       1          4              4              f
5        2014       2          5              5              f
A base solution:
df |>
  transform(SurveyTime = ave(SurveyYear, PersonID, FUN = rank))
Its dplyr equivalent:
library(dplyr)
df %>%
  group_by(PersonID) %>%
  mutate(SurveyTime = dense_rank(SurveyYear)) %>%
  ungroup()
Data
df <- structure(list(PersonID = c(12L, 12L, 12L, 2L, 2L, 2L, 2L, 2L,
5L, 5L), SurveyYear = c(2013L, 2012L, 2010L, 2007L, 2008L, 2009L,
2010L, 2013L, 2013L, 2014L), SurveyQ1Rating = c(5L, 4L, 3L, 4L,
3L, 3L, 5L, 2L, 4L, 5L), SurveyQ2Rating = c(4L, 4L, 3L, 4L, 3L,
5L, 5L, 2L, 4L, 5L), Gender = c("f", "f", "f", "m", "m", "m",
"m", "m", "f", "f")), class = "data.frame", row.names = c(NA, -10L))
Using data.table
library(data.table)
setDT(df)[, SurveyTime := frank(SurveyYear), by = PersonID]
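The answers above agree as long as a person has at most one survey per year. If a PersonID could appear twice in the same year (an assumption; the sample data has no such case), rank() and frank() average tied ranks by default, while dense_rank() does not. A small base R illustration with made-up years, where match() on the sorted unique values stands in for a dense rank:

```r
# Hypothetical survey years for one person, with a repeated year.
years <- c(2010L, 2010L, 2012L)

# Default rank(): tied values share the average of their positions.
avg_rank <- rank(years)

# Dense rank in base R: position among the sorted unique values.
dense <- match(years, sort(unique(years)))

avg_rank  # 1.5 1.5 3.0
dense     # 1 1 2
```

Which behaviour is appropriate depends on whether two surveys in the same year should count as the same occasion.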
My dataframe looks like this:
Index Year Renovation
1 2012 1
1 2018 1
2 2012 1
2 2018 1
3 2012 0
3 2018 0
I would like to set the Renovation variable to 0 for 2012 if, for the same Index, the Renovation variable for 2018 is 1. So two conditions have to hold at once. How can I do this in R?
You can use ifelse() to check both conditions within each Index group.
library(dplyr)
df %>%
  group_by(Index) %>%
  mutate(Renovation = ifelse(Year == 2012 &
                             Renovation[match(2018, Year)] == 1, 0, Renovation))
# Index Year Renovation
# <int> <int> <dbl>
#1 1 2012 0
#2 1 2018 1
#3 2 2012 0
#4 2 2018 1
#5 3 2012 0
#6 3 2018 0
data
df <- structure(list(Index = c(1L, 1L, 2L, 2L, 3L, 3L), Year = c(2012L,
2018L, 2012L, 2018L, 2012L, 2018L), Renovation = c(1L, 1L, 1L,
1L, 0L, 0L)), class = "data.frame", row.names = c(NA, -6L))
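For readers who prefer to avoid dplyr, the same two-part condition can be sketched in base R with ave(). This is an illustrative alternative, not the answer's method; the variable name renovated_2018 is made up here, and the data is retyped from the question.

```r
df <- data.frame(
  Index      = c(1L, 1L, 2L, 2L, 3L, 3L),
  Year       = c(2012L, 2018L, 2012L, 2018L, 2012L, 2018L),
  Renovation = c(1L, 1L, 1L, 1L, 0L, 0L)
)

# TRUE for every row of an Index whose 2018 row has Renovation == 1.
renovated_2018 <- ave(df$Year == 2018 & df$Renovation == 1,
                      df$Index, FUN = any)

# Reset the 2012 rows of those Index groups.
df$Renovation[df$Year == 2012 & renovated_2018] <- 0L
df
```

An Index without any 2018 row simply yields FALSE here, so its 2012 value is left unchanged.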
I have two lists of data frames. Each list has 6 data frames.
The data frames have the same columns, but the data frames in list1 hold data from 2015 to 2017 and those in list2 hold data for 2018, as shown below:
List1$A
Name Value Year
AAA 123 2015
BBB 456 2016
CCC 789 2017
AAA 543 2018
List2$A
Name Value Year
AAA 543 2018
BBB 248 2018
I want to merge the data frames from both lists, so that I end up with a single list of data frames containing the data for all years.
Some data frames in list1 already contain 2018 data, so when I merge them I want those 2018 values to be replaced by the ones from list2.
Newlist$A
Name Value Year
AAA 123 2015
BBB 456 2016
CCC 789 2017
AAA 543 2018
BBB 248 2018
I tried this, but it didn't work:
data<- lapply(list1,list2, function (x,y) merge(x,y))
How can I do this?
It's always helpful to include a sample of your data with dput, but here's an attempt that has not been checked against your data:
library(tidyverse)
map2(list1, list2, ~bind_rows(.y, .x) %>% group_by(Name, Year) %>% slice(1))
We bind the rows (with list2 first), then group by Name and Year and take the first row of each group with slice(1). Because list2 comes first, any Name/Year combination that appears in both data frames keeps the value from list2.
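The same idea can be sketched in base R with Map() and duplicated(). This is only an illustration: the helper name merge_keep_new is hypothetical, the data follows the question's example, and the value 500 for AAA in 2018 in list1 is made up so that the replacement is visible.

```r
list1 <- list(A = data.frame(
  Name  = c("AAA", "BBB", "CCC", "AAA"),
  Value = c(123, 456, 789, 500),   # 500 is a made-up stale 2018 value
  Year  = c(2015, 2016, 2017, 2018)
))
list2 <- list(A = data.frame(
  Name  = c("AAA", "BBB"),
  Value = c(543, 248),
  Year  = c(2018, 2018)
))

# Hypothetical helper: put the newer rows first, drop later duplicates
# of the same Name/Year, then sort for readability.
merge_keep_new <- function(old, new) {
  both <- rbind(new, old)
  both <- both[!duplicated(both[c("Name", "Year")]), ]
  both[order(both$Year, both$Name), ]
}

Newlist <- Map(merge_keep_new, list1, list2)
Newlist$A
```

Map() pairs the data frames by position, so this assumes both lists hold their data frames in the same order.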
We could first bind everything into one long data frame and then drop the "2018" entries from list 1 wherever list 2 also has an entry for that year.
To do this, we put both lists into a list, add an ID column to each data frame, and rbind them. Using by/ave, the ID column then helps to remove the duplicated "2018" rows that stem from list 1, while keeping those that have no counterpart in list 2.
The trick in the latter step is to use rev(seq_along(x)).
To demonstrate, I have created sample data that probably resembles yours.
# list the lists
L <- list(L1=L1, L2=L2)
# add id column to sublists
L <- lapply(seq(L), function(x)
  Map(`[<-`, L[[x]], "list", value=substr(names(L)[x], 2, 2)))
# rbind lists to long data frame
d <- do.call(rbind, unlist(L, recursive=FALSE))
# remove 2018 duplicates of list L1, keep if no 2018 in list L2
do.call(rbind, by(d, d$name, function(y) {
  i <- cbind(y, id=ave(y$year, y$year, FUN=function(z) rev(seq_along(z))))
  i[!i$id == 2, ]
}))
Result
# name value year list id
# A.A.1 A 998 2015 1 1
# A.A.4 A 456 2016 1 1
# A.A.7 A 312 2017 1 1
# A.A.13 A 478 2018 2 1
# B.A.2 B 1592 2015 1 1
# B.A.5 B 1072 2016 1 1
# B.A.8 B 673 2017 1 1
# B.A.21 B 445 2018 2 1
# C.A.3 C 957 2015 1 1
# C.A.6 C 199 2016 1 1
# C.A.9 C 2165 2017 1 1
# C.A.31 C 342 2018 2 1
# D.B.1 D 877 2015 1 1
# D.B.4 D 876 2016 1 1
# D.B.7 D 482 2017 1 1
# D.B.13 D 1077 2018 2 1
# E.B.2 E 370 2015 1 1
# E.B.5 E 1475 2016 1 1
# E.B.8 E 768 2017 1 1
# E.B.11 E 385 2018 1 1 <- this stems from list 1!
# F.B.3 F 421 2015 1 1
# F.B.6 F 930 2016 1 1
# F.B.9 F 1105 2017 1 1
# F.B.31 F 1836 2018 2 1
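To see why rev(seq_along(z)) does the job, it helps to look at a single name in isolation. In the sketch below the year vector is made up: it has one row per year from list 1, followed by the 2018 row from list 2 (rows from list 1 come first because L1 is bound first).

```r
# Hypothetical years for one name: four rows from list 1, then one from list 2.
year <- c(2015L, 2016L, 2017L, 2018L, 2018L)

# Within each year, number the rows from last to first.
id <- ave(year, year, FUN = function(z) rev(seq_along(z)))
id  # 1 1 1 2 1

# Only the earlier of two 2018 rows (the one from list 1) gets id 2.
keep <- id != 2
year[keep]
```

When a name has only one 2018 row, every id is 1 and nothing is dropped, which is how the row marked in the result above survives.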
Data
L1 <- list(A = structure(list(name = structure(c(1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"),
value = c(1371, 565, 363, 633, 404, 106, 1512, 95, 2018,
63, 1305, 2287), year = c(2015L, 2015L, 2015L, 2016L, 2016L,
2016L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L)), class = "data.frame", row.names = c(NA,
-12L)), B = structure(list(name = structure(c(1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("D", "E", "F"), class = "factor"),
value = c(1389, 279, 133, 636, 284, 2656, 2440, 1320, 307,
1781, 172, 1215), year = c(2015L, 2015L, 2015L, 2016L, 2016L,
2016L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L)), class = "data.frame", row.names = c(NA,
-12L)))
L2 <- list(A = structure(list(name = structure(1:3, .Label = c("A",
"B", "C"), class = "factor"), value = c(1895, 430, 257), year = c(2018,
2018, 2018)), class = "data.frame", row.names = c(NA, -3L)),
B = structure(list(name = structure(c(1L, 3L), .Label = c("D",
"E", "F"), class = "factor"), value = c(1763, 640), year = c(2018,
2018)), row.names = c(1L, 3L), class = "data.frame"))
# The row for "E" is already absent from L2$B above (row.names 1 and 3), so no further removal is needed
I've got monthly year over year data in a long format that I'm trying to spread with two columns. The only examples I've seen include a single key.
> dput(df)
structure(list(ID = c("a", "a", "a", "a", "a", "a", "a", "a",
"a", "b", "b", "b", "b", "b", "b", "b", "b", "b"), Year = c(2015L,
2015L, 2015L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L, 2015L,
2015L, 2015L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L), Month = c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L), Value = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 13L, 14L)), .Names = c("ID", "Year", "Month",
"Value"), class = "data.frame", row.names = c(NA, -18L))
I'm trying to get it into a format with the years as columns 3:5 and one row per month per ID:
ID Month 2015 2016 2017
a 1 1 2 3
a 2 1 2 3
a 3 1 2 3
b 1 6 9 12
b 2 7 10 13
b 3 8 11 14
I tried the following and got this error:
by_month_over_years = spread(df,key = c(Year,Month), Value)
Error: `var` must evaluate to a single number or a column name, not an integer vector
library(tidyr)
library(dplyr)
df %>% group_by(ID) %>% spread(Year, Value)
# A tibble: 6 x 5
# Groups: ID [2]
#   ID    Month `2015` `2016` `2017`
#   <chr> <int>  <int>  <int>  <int>
# 1 a         1      1      2      3
# 2 a         2      1      2      3
# 3 a         3      1      2      3
# 4 b         1      6      9     12
# 5 b         2      7     10     13
# 6 b         3      8     11     14
library(reshape2) # or data.table, for dcast
dcast(df, ID + Month ~ Year, value.var = "Value")
# ID Month 2015 2016 2017
# 1 a 1 1 2 3
# 2 a 2 1 2 3
# 3 a 3 1 2 3
# 4 b 1 6 9 12
# 5 b 2 7 10 13
# 6 b 3 8 11 14
Here is a base R option with reshape
reshape(df, idvar = c('ID', 'Month'), direction = 'wide', timevar = 'Year')
# ID Month Value.2015 Value.2016 Value.2017
#1 a 1 1 2 3
#2 a 2 1 2 3
#3 a 3 1 2 3
#10 b 1 6 9 12
#11 b 2 7 10 13
#12 b 3 8 11 14
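Another base R route, offered here only as a sketch, is to cross-tabulate with xtabs() and flatten the result with ftable(). The data frame below is rebuilt with rep() to match the dput in the question.

```r
df <- data.frame(
  ID    = rep(c("a", "b"), each = 9),
  Year  = rep(rep(2015:2017, each = 3), times = 2),
  Month = rep(1:3, times = 6),
  Value = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L,
            6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L)
)

# Sum Value over every ID x Month x Year combination (a 3-d table).
tab <- xtabs(Value ~ ID + Month + Year, data = df)

# Show ID and Month as rows and Year as columns.
ftable(tab, row.vars = c("ID", "Month"))
```

Note that xtabs() sums values, so a missing ID/Month/Year combination would show as 0 rather than NA, and the result is a table object rather than a data frame.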
I am trying to migrate this activity from Excel/SQL to R and I am stuck. Any help is very much appreciated. Thanks!
Format of Data:
There are unique customer ids. Each customer has purchases in different groups in different years.
Objective:
For each customer ID, produce one row of output. Use the names stored in the Variable_Name column to create new columns, and assign each one the sum of Amount. Then create a matching set of columns (suffix _P) set to 1 or 0 depending on the presence or absence of revenue.
SOURCE:
Cust_ID Group Year Variable_Name Amount
1 1 A 2009 A_2009 2000
2 1 B 2009 B_2009 100
3 2 B 2009 B_2009 300
4 2 C 2009 C_2009 20
5 3 D 2009 D_2009 299090
6 3 A 2011 A_2011 89778456
7 1 B 2011 B_2011 884
8 1 C 2010 C_2010 34894
9 3 D 2010 D_2010 389849
10 2 A 2013 A_2013 742
11 1 B 2013 B_2013 25661
12 2 C 2007 C_2007 393
13 3 D 2007 D_2007 23
OUTPUT:
Cust_ID A_2009 B_2009 C_2009 D_2009 A_2011 …. A_2009_P B_2009_P
1 sum of amount .. 1 0 ….
2
3
dput of original data:
structure(list(Cust_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 3L,
2L, 1L, 2L, 3L), Group = c("A", "B", "B", "C", "D", "A", "B",
"C", "D", "A", "B", "C", "D"), Year = c(2009L, 2009L, 2009L,
2009L, 2009L, 2011L, 2011L, 2010L, 2010L, 2013L, 2013L, 2007L,
2007L), Variable_Name = c("A_2009", "B_2009", "B_2009", "C_2009",
"D_2009", "A_2011", "B_2011", "C_2010", "D_2010", "A_2013", "B_2013",
"C_2007", "D_2007"), Amount = c(2000L, 100L, 300L, 20L, 299090L,
89778456L, 884L, 34894L, 389849L, 742L, 25661L, 393L, 23L)), .Names = c("Cust_ID",
"Group", "Year", "Variable_Name", "Amount"), class = "data.frame", row.names = c(NA,
-13L))
One option:
intm <- as.data.frame.matrix(xtabs(Amount ~ Cust_ID + Variable_Name,data=dat))
result <- data.frame(aggregate(Amount~Cust_ID, data=dat,sum),intm,(intm > 0)+0 )
Result (abridged):
Cust_ID Amount A_2009 A_2011 ... A_2009.1 A_2011.1
1 1 63539 2000 0 ... 1 0
2 2 1455 0 0 ... 0 0
3 3 90467418 0 89778456 ... 0 1
If the names are a concern, they can easily be fixed via:
names(result) <- gsub("\\.1$", "_P", names(result))
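As a variation on the same approach, the presence columns can be given their _P names before the pieces are combined, which avoids the renaming step. This is a sketch under the assumption that "presence" means a summed Amount greater than zero; the intermediate names sums and flags are made up here, and only the three columns needed are retyped from the dput.

```r
dat <- data.frame(
  Cust_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 3L, 2L, 1L, 2L, 3L),
  Variable_Name = c("A_2009", "B_2009", "B_2009", "C_2009", "D_2009",
                    "A_2011", "B_2011", "C_2010", "D_2010", "A_2013",
                    "B_2013", "C_2007", "D_2007"),
  Amount = c(2000L, 100L, 300L, 20L, 299090L, 89778456L, 884L,
             34894L, 389849L, 742L, 25661L, 393L, 23L)
)

# One row per customer, one column per Variable_Name, cells = summed Amount.
sums <- as.data.frame.matrix(xtabs(Amount ~ Cust_ID + Variable_Name, data = dat))

# 1/0 presence flags, named up front with the _P suffix.
flags <- (sums > 0) + 0
colnames(flags) <- paste0(colnames(sums), "_P")

result <- data.frame(Cust_ID = as.integer(rownames(sums)), sums, flags,
                     check.names = FALSE)
result[, c("Cust_ID", "A_2009", "B_2009", "A_2009_P", "B_2009_P")]
```

If zero or negative amounts can occur, the presence test would need to count rows instead of checking the sum.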