I have a set of time series, and I want to scale each of them relative to their value in a specific interval. That way, each series will be at 1.0 at that time and change proportionally.
I can't figure out how to do that with dplyr.
Here's a working example using a for loop:
library(dplyr)
data = expand.grid(
category = LETTERS[1:3],
year = 2000:2005)
data$value = runif(nrow(data))
# the first time point in the series
baseYear = 2002
# for each category, divide all the values by the category's value in the base year
for(category in as.character(levels(factor(data$category)))) {
  data[data$category == category,]$value =
    data[data$category == category,]$value /
    data[data$category == category & data$year == baseYear,]$value[[1]]
}
Edit: Modified the question such that the base time point is not indexable. Sometimes the "time" column is actually a factor, which isn't necessarily ordinal.
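(For illustration only: a hypothetical sketch of what a non-indexable time column could look like; the period column and basePeriod value below are made up, and the answers that follow handle this case by selecting the base point with a condition rather than a position.)
# hypothetical variant: the "time" column is an unordered factor, so the base
# point cannot be located by sorting or positional indexing
data$period = factor(paste0("P", data$year - 1999))
basePeriod = "P3"   # corresponds to 2002, but only a value match identifies it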
This solution is very similar to @thelatemail's, but I think it's sufficiently different to merit its own answer because it chooses the index based on a condition:
data %>%
group_by(category) %>%
mutate(value = value/value[year == baseYear])
# category year value
#... ... ... ...
#7 A 2002 1.00000000
#8 B 2002 1.00000000
#9 C 2002 1.00000000
#10 A 2003 0.86462789
#11 B 2003 1.07217943
#12 C 2003 0.82209897
(Data output has been truncated. To replicate these results, set.seed(123) when creating data.)
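(As a hedged placement sketch, the seed call simply goes right before the random draw in the setup above:)
set.seed(123)                     # makes the runif() draw reproducible
data$value = runif(nrow(data))    # same line as in the question's setup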
Use first() in dplyr, ensuring you use order_by:
data %>%
group_by(category) %>%
mutate(value = value / first(value, order_by = year))
Something like this:
data %>%
group_by(category) %>%
mutate(value=value/value[1]) %>%
arrange(category,year)
Result:
# category year value
#1 A 2000 1.0000000
#2 A 2001 0.2882984
#3 A 2002 1.5224308
#4 A 2003 0.8369343
#5 A 2004 2.0868684
#6 A 2005 0.2196814
#7 B 2000 1.0000000
#8 B 2001 0.5952027
I want to delete duplicates with multiple grouping conditions but always get far fewer results than expected.
The data frame compares two companies per year, like this:
year c1 c2
2000 a  b
2000 a  c
2000 a  d
2001 a  b
2001 b  d
2001 a  c
For every c1, I want to look at c2 and delete rows whose pair already appears in the previous year.
I found a similar problem, but with just one column. Here are some of my tries so far:
library(purrr)   # map_dfr() comes from purrr
df <- df %>%
  group_by(c1, c2) %>%
  mutate(dup = n() > 1) %>%
  group_split() %>%
  map_dfr(~ if(unique(.x$dup) & (.x$year[2] - .x$year[1]) == 1) {
    .x %>% slice_head(n = 1)
  } else {
    .x
  }) %>%
  select(-dup) %>%
  arrange(year)
df<- sqldf("select a.*
from df a
left join df b on b.c1=a.c1 and b.c2 = a.c2 and b.year = a.year - 1
where b.year is null")
The desired output for the example would be:
year c1 c2
2000 a  b
2000 a  c
2000 a  d
2001 b  d
Assuming you want to check for duplicates in the previous year only, here is a demonstration on a modified sample:
library(tidyverse)
df <- read.table(header = T, text = 'year c1 c2
2000 a b
2000 a c
2000 a d
2001 a b
2001 b d
2001 a c
2002 a d')
df %>%
filter(map2_lgl(df$year, paste(df$c1, df$c2), ~ !paste(.x -1, .y) %in% paste(df$year, df$c1, df$c2)))
#> year c1 c2
#> 1 2000 a b
#> 2 2000 a c
#> 3 2000 a d
#> 4 2001 b d
#> 5 2002 a d
Created on 2021-07-08 by the reprex package (v2.0.0)
Some of the other solutions won't work because I think they ignore the fact that you will probably have many years and want to eliminate duplicates from only the prior year.
Here is something fairly simple. You could do this in some map function or whatnot, but sometimes a simple loop does just fine. For each year of data, use anti_join() to return only those values from the current year which are not in the prior year. Then just restack the data.
df_split <- df %>%
  group_split(year)
for (this_year in 2:length(df_split)) {
  df_split[[this_year]] <- df_split[[this_year]] %>%
    anti_join(df_split[[this_year - 1]], by = c("c1", "c2"))
}
bind_rows(df_split)
# # A tibble: 4 x 3
# year c1 c2
# <int> <chr> <chr>
# 1 2000 a b
# 2 2000 a c
# 3 2000 a d
# 4 2001 b d
Edit
Another approach is to add a dummy column for the prior year and just use an anti_join() with that. This is probably what I would do.
df %>%
mutate(prior_year = year - 1) %>%
anti_join(df, by = c(prior_year = "year", "c1", "c2")) %>%
select(-prior_year)
You can also use the following solution.
library(dplyr)
library(purrr)
df %>%
  filter(pmap_int(list(df$c1, df$c2, df$year), ~ df %>%
           filter(year %in% c(..3, ..3 - 1)) %>%
           rowwise() %>%
           mutate(output = all(c(..1, ..2) %in% c_across(c1:c2))) %>%
           pull(output) %>%
           sum) < 2)
# AnilGoyal's modified data set
year c1 c2
1 2000 a b
2 2000 a c
3 2000 a d
4 2001 b d
5 2002 a d
This will only keep the data you want. Here, data is your data frame.
data[!duplicated(data[,2:3]),]
I think this is pretty simple with base duplicated() using the fromLast option to get the last rather than the first entry. (It does assume the data are ordered by year.)
dat[!duplicated(dat[2:3], fromLast=TRUE), ] # negate logical vector in i-position
year c1 c2
3 2000 a d
4 2001 a b
5 2001 b d
6 2001 a c
I do get a different result than you said was expected so maybe I misunderstood the specifications?
Assuming that you indeed wanted to keep the last year, as stated in the question (but contrary to your example table), you could simply use slice:
library(dplyr)
df = data.frame(year=c("2000","2000","2000","2001","2001","2001"),
c1 = c("a","a","a","a","b","a"),c2=c("b","c","d","b","d","c"))
df %>% group_by(c1,c2) %>%
slice_tail() %>%arrange(year,c1,c2)
Use slice_head() if you want the first year.
Here is the documentation: slice
I am trying to find the first occurrence of a FALSE in a data frame for each row. My rows are specific occurrences and the columns are dates. I would like to be able to find the date of the first FALSE so that I can use that value to find a return date.
An example structure of my dataframe:
df <- data.frame(ID = c(1,2,3), '2001' = c(TRUE, TRUE, TRUE),
                 '2002' = c(FALSE, TRUE, FALSE), '2003' = c(TRUE, FALSE, TRUE),
                 check.names = FALSE)  # keep the year column names as "2001" etc.
I want to end up with a second dataframe or list that contains the ID and the column name that identifies the first instance of a FALSE.
For example :
ID | Date
1 | 2002
2 | 2003
3 | 2002
I do not know the mechanism to find such a result.
The actual dataframe contains a couple thousand rows so I unfortunately can't do it by hand.
I am a new R user so please don't refrain from suggesting things you might expect a more experienced R user to have already thought about.
Thanks in advance
Try this using tidyverse functions. You can reshape the data to long format and then filter for FALSE values. If there are duplicated rows, the second filter can avoid them. Here is the code:
library(dplyr)
library(tidyr)
#Code
newdf <- df %>% pivot_longer(-ID) %>%
group_by(ID) %>%
filter(value==F) %>%
filter(!duplicated(value)) %>% select(-value) %>%
rename(Myname=name)
Output:
# A tibble: 3 x 2
# Groups: ID [3]
ID Myname
<dbl> <chr>
1 1 2002
2 2 2003
3 3 2002
Another option that avoids duplicated() is to use row_number() to extract the first value (row_number()==1):
library(dplyr)
library(tidyr)
#Code 2
newdf <- df %>% pivot_longer(-ID) %>%
group_by(ID) %>%
filter(value==F) %>%
mutate(V=ifelse(row_number()==1,1,0)) %>%
filter(V==1) %>%
select(-c(value,V)) %>% rename(Myname=name)
Output:
# A tibble: 3 x 2
# Groups: ID [3]
ID Myname
<dbl> <chr>
1 1 2002
2 2 2003
3 3 2002
Or using base R with apply() and an anonymous function:
#Code 3
out <- data.frame(df[,1,drop=F],Res=apply(df[,-1],1,function(x) names(x)[min(which(x==F))]))
Output:
ID Res
1 1 2002
2 2 2003
3 3 2002
We can use max.col with ties.method = 'first' after inverting the logical values.
cbind(df[1], Date = names(df[-1])[max.col(!df[-1], ties.method = 'first')])
# ID Date
#1 1 2002
#2 2 2003
#3 3 2002
I am trying to do something very similar to Scale relative to a value in each group (via dplyr), though that solution seems to crash R for me. I would like to replicate a single value for each group and add a new column with this value repeated. As an example I have:
library(dplyr)
data = expand.grid(
category = LETTERS[1:2],
year = 2000:2003)
data$value = runif(nrow(data))
data
category year value
1 A 2000 0.6278798
2 B 2000 0.6112281
3 A 2001 0.2170495
4 B 2001 0.6454874
5 A 2002 0.9234604
6 B 2002 0.9311204
7 A 2003 0.5387899
8 B 2003 0.5573527
And I would like a dataframe like
data
category year value value2
1 A 2000 0.6278798 0.6278798
2 B 2000 0.6112281 0.6112281
3 A 2001 0.2170495 0.6278798
4 B 2001 0.6454874 0.6112281
5 A 2002 0.9234604 0.6278798
6 B 2002 0.9311204 0.6112281
7 A 2003 0.5387899 0.6278798
8 B 2003 0.5573527 0.6112281
i.e. the value for each category is the value from year 2000. I was trying to think of a general solution extensible to a given filtering criterion, i.e. something like
data %>% group_by(category) %>% mutate(value = filter(data, year==2002))
however this does not work because of incorrect length in the assignment.
Do this:
data %>% group_by(category) %>%
mutate(value2 = value[year == 2000])
You could also do it this way:
data %>% group_by(category) %>%
arrange(year) %>%
mutate(value2 = value[1])
or
data %>% group_by(category) %>%
arrange(year) %>%
mutate(value2 = first(value))
or
data %>% group_by(category) %>%
mutate(value2 = nth(value, n = 1, order_by = "year"))
or probably several other ways.
Your attempt with mutate(value = filter(data, year==2002)) doesn't make sense for a few reasons.
When you explicitly pass in data again, it's not part of the chain that got grouped earlier, so it doesn't know about the grouping.
All dplyr verbs take a data frame as first argument and return a data frame, including filter. When you do value = filter(...) you're trying to assign a full data frame to the single column value.
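If you do want a version driven by an arbitrary filtering criterion, one possible sketch (an alternative pattern, not taken from the answers above; the baseline name and the year == 2000 condition are just placeholders) is to compute the per-group baseline separately and join it back:
library(dplyr)
# build one baseline row per category using whatever condition you need
baseline <- data %>%
  filter(year == 2000) %>%
  group_by(category) %>%
  summarise(value2 = first(value))
# join it back so every row in a category gets that category's baseline value
data %>% left_join(baseline, by = "category")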
I want to convert one of my SAS Proc SQL programs to Revolution R / Microsoft R.
Here is my sample code:
proc sql;
  create table GENDER_YEAR as
  select YEAR, GENDER,
         count(distinct CARD_NO) as CM_COUNT,
         sum(SPEND) as TOTAL_SPEND,
         sum(case when SPEND GT 0 then 1 else 0 end) as NO_OF_TRANS
  from ABC
  group by YEAR, GENDER;
quit;
I'm trying the code below in Revolution R:
library("RevoPemaR")
byGroupPemaObj <- PemaByGroup()
GENDER_cv_grouped <- pemaCompute(pemaObj = byGroupPemaObj, data = Merchant_Trans,
                                 groupByVar = "GENDER", computeVars = c("LOCAL_SPEND"),
                                 fnList = list(sum = list(FUN = sum, x = NULL)))
It calculates only one thing at a time, but I need the distinct count of CARD_NO, the sum of SPEND, and the number of non-zero SPEND rows (as NO_OF_TRANS) for each segment of YEAR and GENDER.
The output should look like this:
YEAR GENDER CM_COUNT TOTAL_SPEND NO_OF_TRANS
YEAR1 M 23 120 119
YEAR1 F 21 110 110
YEAR2 M 20 121 121
YEAR2 F 35 111 109
Looking forward to your help on this.
The easiest way to go about this is to concatenate the columns into a single column and use that. It seems that both dplyrXdf and RevoPemaR do not support grouping by 2 variables yet.
The way to do this would be to add an rxDataStep on top which creates this variable first and then groups by it. Some approximate code for this is:
library("RevoPemaR")
byGroupPemaObj <- PemaByGroup()
rxDataStep(inData = Merchant_Trans, outFile = Merchant_Trans_Groups,
transform = list(year_gender = paste(YEAR, GENDER,))
GENDER_cv_grouped <- pemaCompute(pemaObj = byGroupPemaObj,
data = Merchant_Trans_Groups, groupByVar = "GENDER",
computeVars = c("LOCAL_SPEND"),
fnList = list(sum = list(FUN = sum, x = NULL)))
Note that overall there are 3 methods of doing a group-by in Revolution R as far as I know. Each has its pros and cons.
rxSplit - This actually creates a different XDF file for each group that you want. It can be used with the splitByFactor argument, where the factor specifies which groups should be created (see the sketch after this list).
RevoPemaR's PemaByGroup - This assumes that each group's data can be stored in RAM, which is a fair assumption. It also needs the original XDF file to be sorted by the group-by column, and it only supports grouping by 1 column.
dplyrXdf's group_by - This is a spin on the popular dplyr package. It has many variable manipulation methods (so a different way to write rxSplit and rxDataStep) using dplyr-like syntax. It also only supports 1 column to group with.
All three methods currently only support a single-variable group operation, hence they all require some pre-processing of the data to work.
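As a rough illustration of the first route, here is a hedged sketch (untested, and not from any of the answers): the file names and the year_gender column are assumptions, and depending on the RevoScaleR version the grouping column may need an explicit factor conversion (for example via rxFactors) before rxSplit will accept it.
library(RevoScaleR)
# build a single combined grouping column, since splitByFactor takes one variable
rxDataStep(inData = Merchant_Trans, outFile = "merchant_groups.xdf",
           transforms = list(year_gender = paste(YEAR, GENDER, sep = "_")),
           overwrite = TRUE)
# write one small xdf file per year/gender group
groupXdfs <- rxSplit("merchant_groups.xdf", outFilesBase = "merchant_by_group",
                     splitByFactor = "year_gender")
# each piece should now fit in memory, so read it back and summarise as a data frame
do.call(rbind, lapply(groupXdfs, function(xdf) {
  d <- rxDataStep(xdf)   # with no outFile, rxDataStep returns a data frame
  data.frame(year_gender = d$year_gender[1],
             cm_count    = length(unique(d$CARD_NO)),
             total_spend = sum(d$SPEND),
             no_of_trans = sum(d$SPEND > 0))
}))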
Here's a simple solution using dplyrXdf. Unlike with data frames, the n_distinct() summary function provided by dplyr doesn't work with xdf files, so this does a two-step summarisation: first including card_no as a grouping variable, and then counting the number of card_no's.
First, generate some example data:
library(dplyrXdf) # also loads dplyr
set.seed(12345)
df <- expand.grid(year=2000:2005, gender=c("F", "M")) %>%
group_by(year, gender) %>%
do(data.frame(card_no=sample(20, size=10, replace=TRUE),
spend=rbinom(10, 1, 0.5) * runif(10) * 100))
xdf <- rxDataStep(df, "ndistinct.xdf", overwrite=TRUE)
Now call summarise twice, taking advantage of the fact that the first summarise will remove card_no from the list of grouping variables:
smry <- xdf %>%
mutate(trans=spend > 0) %>%
group_by(year, gender, card_no) %>%
summarise(n=n(), total_spend=sum(spend), no_of_trans=sum(trans)) %>%
summarise(cm_count=n(), total_spend=sum(total_spend), no_of_trans=sum(no_of_trans))
as.data.frame(smry)
#year gender cm_count total_spend no_of_trans
#1 2000 F 10 359.30313 6
#2 2001 F 8 225.89571 3
#3 2002 F 7 332.58365 6
#4 2003 F 5 333.72169 5
#5 2004 F 7 280.90448 5
#6 2005 F 9 254.37680 5
#7 2000 M 8 309.77727 6
#8 2001 M 8 143.70835 2
#9 2002 M 8 269.64968 5
#10 2003 M 8 265.27049 4
#11 2004 M 9 99.73945 3
#12 2005 M 8 178.12686 6
Verify that this is the same result (modulo row ordering) as you'd get by running a dplyr chain on the original data frame:
df %>%
group_by(year, gender) %>%
summarise(cm_count=n_distinct(card_no), total_spend=sum(spend), no_of_trans=sum(spend > 0)) %>%
arrange(gender, year)
#year gender cm_count total_spend no_of_trans
#<int> <fctr> <int> <dbl> <int>
#1 2000 F 10 359.30313 6
#2 2001 F 8 225.89571 3
#3 2002 F 7 332.58365 6
#4 2003 F 5 333.72169 5
#5 2004 F 7 280.90448 5
#6 2005 F 9 254.37680 5
#7 2000 M 8 309.77727 6
#8 2001 M 8 143.70835 2
#9 2002 M 8 269.64968 5
#10 2003 M 8 265.27049 4
#11 2004 M 9 99.73945 3
#12 2005 M 8 178.12686 6