I have a data frame with Dates and Values:
library(dplyr)
library(lubridate)
df <- tibble(DateTime = ymd(c("2018-01-01", "2018-01-01", "2018-01-02",
                              "2018-01-02", "2018-01-03", "2018-01-03")),
             Value = c(5, 10, 12, 3, 9, 11),
             Rank = rep(0, 6))
I would like to rank the values of the two last rows, each compared against the four Value rows of the previous dates.
I have managed to do this:
dfReference <- df %>% filter(DateTime != max(DateTime))
dfTarget <- df %>% filter(DateTime == max(DateTime))
for (i in 1:nrow(dfTarget)) {
  tempDf <- rbind(dfReference, dfTarget[i, ]) %>%
    mutate(Rank = rank(Value, ties.method = "first"))
  dfTarget$Rank[i] <- filter(tempDf, DateTime == max(df$DateTime))$Rank
}
Desired output:
> dfTarget
# A tibble: 2 x 3
DateTime Value Rank
<date> <dbl> <dbl>
1 2018-01-03 9 3
2 2018-01-03 11 4
But I am looking for a more elegant way.
Thanks
This is basically the same idea as your for loop, but instead of a loop it uses map_int, and instead of creating a new data frame using rbind it creates a new vector with c().
library(tidyverse)
is.max <- with(df, DateTime == max(DateTime))
df[is.max, ] %>%
  mutate(Rank = map_int(Value, ~
    c(df$Value[!is.max], .x) %>%
      rank(ties.method = "first") %>%
      tail(1)))
# # A tibble: 2 x 3
# DateTime Value Rank
# <date> <dbl> <int>
# 1 2018-01-03 9 3
# 2 2018-01-03 11 4
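If you'd rather avoid re-ranking the full vector for every target row, the same result can be had by counting: with ties.method = "first", the appended value ranks after any equal reference values, so its rank is simply the number of reference values less than or equal to it, plus one. A minimal sketch of that idea, reusing is.max from above:
ref <- df$Value[!is.max]
df[is.max, ] %>%
  mutate(Rank = map_int(Value, ~ sum(ref <= .x) + 1L))  # count of ref values <= .x, plus one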
I would like to identify all rows of a tibble that have been altered by mutate.
My real data has multiple columns, and the mutate call changes more than one column at once.
# library
library(tidyverse)
# get df
df <- tibble(name = c("A", "B", "C", "D"), value = c(1, 2, 3, 4))
# mutate df
dfnew <- df %>%
  mutate(value = case_when(name == "A" ~ value + 1, TRUE ~ value)) %>%
  mutate(name = case_when(name == "B" ~ "K", TRUE ~ name))
Created on 2020-04-26 by the reprex package (v0.3.0)
Now I am looking for a way to compare all rows of df with dfnew and identify all rows with at least one change.
The desired output would be:
# desired output:
#
# # A tibble: 2 x 2
# name value
# <chr> <dbl>
# 1 A 2
# 2 K 2
You can do:
anti_join(dfnew, df)
name value
<chr> <dbl>
1 A 2
2 K 2
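By default anti_join() matches on all columns the two data frames have in common (and prints a message about which columns it joined by); making the key explicit silences that message:
anti_join(dfnew, df, by = c("name", "value"))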
@tmfmnk's answer does the trick, but if you'd like to use a loop (e.g., for some flexibility in issuing different kinds of messages or warnings depending on what you're checking) you could do:
output <- list()
for (i in 1:nrow(dfnew)) {
  # keep only the rows of dfnew that differ from df in at least one column
  if (all(df[i, ] == dfnew[i, ])) {
    next
  }
  output[[i]] <- dfnew[i, ]
}
bind_rows(output)
# A tibble: 2 x 2
name value
<chr> <dbl>
1 A 2
2 K 2
We can also use setdiff from dplyr
library(dplyr)
setdiff(dfnew, df)
# A tibble: 2 x 2
# name value
# <chr> <dbl>
#1 A 2
#2 K 2
Or using fsetdiff from data.table
library(data.table)
fsetdiff(setDT(dfnew), setDT(df))
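One caveat: setDT() converts dfnew and df to data.tables by reference, so the originals are no longer tibbles afterwards. If you want to leave them untouched, convert copies instead:
library(data.table)
# as.data.table() makes a copy, so dfnew and df keep their original class
fsetdiff(as.data.table(dfnew), as.data.table(df))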
Essentially, I have a dataset with variables indicating group, date, and value. I need to take the difference between the value and the end-of-previous-year value within each group. Since the data is balanced, I was trying to do that with dplyr::lag, setting the lag from the month of the observation:
x <- x %>% group_by(g) %>% mutate(y = v - lag(v, n = month(d)))
This, however, does not work: lag()'s n argument must be a single integer, not a vector.
Mock dataset:
x <- data.frame(g = c('B','B','B','C','A','A','A','A','A','A'),
                d = c('2018-11-30','2018-12-31','2019-01-31','2019-12-31','2016-12-31',
                      '2017-11-30','2017-12-31','2018-12-31','2019-01-31','2019-02-28'),
                v = c(300, 200, 250, 100, 400, 150, 200, 500, 400, 500))
Desired variable:
y <- c(NA, NA, -50, NA, NA, -250, -200, 300, -100, 0)
The resulting dataset should be:
cbind(x, y)
An idea via dplyr is to locate the end-of-year row within each group, get its index, subtract that row's value, and then convert the values at or before it to NA, i.e.
library(dplyr)
x %>%
  group_by(g) %>%
  mutate(new = which(sub('^[0-9]+-([0-9]+-[0-9]+)$', '\\1', d) == '12-31'),
         y = v - v[new],
         y = replace(y, row_number() <= new, NA)) %>%
  select(-new)
which gives,
# A tibble: 7 x 4
# Groups: g [3]
g d v y
<fct> <fct> <dbl> <dbl>
1 B 2018-11-30 300 NA
2 B 2018-12-31 200 NA
3 B 2019-01-31 250 50
4 C 2017-12-31 400 NA
5 A 2018-12-31 500 NA
6 A 2019-01-31 400 -100
7 A 2019-02-28 500 0
In the end I decided to create an auxiliary variable ('eoy') that stores, for each row, the row index of the corresponding end-of-year observation within its group. Building it requires a loop and is inefficient, but it simplifies the remaining computations that depend on it. The desired computation then becomes:
mutate(y = v - v[eoy])
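For reference, the same 'eoy' index can be built without a loop: within each group, cummax() over the row numbers of the Dec-31 rows tracks the most recent year-end seen so far, and lagging it makes a Dec-31 row compare against the previous year's close. A sketch, assuming rows are sorted by date within each group:
library(dplyr)
x %>%
  group_by(g) %>%
  mutate(eoy = cummax(if_else(endsWith(as.character(d), "12-31"), row_number(), 0L)),
         eoy = lag(eoy),                                       # reference the *previous* year-end
         y = v - v[if_else(eoy == 0L, NA_integer_, eoy)]) %>%  # 0 means no prior year-end, so NA
  select(-eoy) %>%
  ungroup()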
I have a longitudinal data set and would like to extract the latest, non-missing complete set of observations for each variable in the data set where id is a unique identifier, yr is year, and x1 and x2 are variables with missing values. The actual data set has 100s of variables over the course of 60 years.
data <- data.frame(id = rep(1:3, 3),
                   yr = rep(1:3, times = 1, each = 3),
                   x1 = c(1, 3, 7, NA, NA, NA, 9, 4, 10),
                   x2 = c(NA, NA, NA, 3, 9, 6, NA, NA, NA))
Below are my expected results. For x1, the latest complete set of observations is year 3. For x2, the latest complete set of observations is year 2.
Using base R
subset(data, yr %in% names(tail(which(sapply(split(data[c('x1', 'x2')], data$yr),
  function(x) any(colSums(!is.na(x)) == nrow(x)))), 2)))
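The one-liner is dense; unpacked, the same steps read as follows (intermediate names are hypothetical):
by_year <- split(data[c('x1', 'x2')], data$yr)    # one slice of x1/x2 per year
complete <- sapply(by_year, function(x) any(colSums(!is.na(x)) == nrow(x)))  # any fully observed column?
latest <- names(tail(which(complete), 2))         # last two such years, as names
subset(data, yr %in% latest)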
Here's a tidyverse solution. First, I create the data frame.
# Create data frame
df <- data.frame(id = rep(1:3, 3),
                 yr = rep(1:3, times = 1, each = 3),
                 x1 = c(1, 3, 7, NA, NA, NA, 9, 4, 10),
                 x2 = c(NA, NA, NA, 3, 9, 6, NA, NA, NA))
Next, I load the required libraries.
# Load library
library(dplyr)
library(tidyr)
I then go from wide to long format, group by yr and key (i.e., the variable name), and keep only the year/variable combinations with no NA values. Next I group by key, keep the data from the latest such year, switch back to wide format, and arrange to make the printed result look pretty.
df %>%
  gather("key", "val", x1, x2) %>%
  group_by(yr, key) %>%
  filter(all(!is.na(val))) %>%
  group_by(key) %>%
  filter(yr == max(yr)) %>%
  spread(key, val) %>%
  arrange(yr)
#> # A tibble: 6 x 4
#> id yr x1 x2
#> <int> <int> <dbl> <dbl>
#> 1 1 2 NA 3
#> 2 2 2 NA 9
#> 3 3 2 NA 6
#> 4 1 3 9 NA
#> 5 2 3 4 NA
#> 6 3 3 10 NA
Created on 2019-05-29 by the reprex package (v0.3.0)
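gather() and spread() still work but have been superseded; assuming tidyr >= 1.0.0 is available, the same pipeline can be sketched with pivot_longer() and pivot_wider():
df %>%
  pivot_longer(c(x1, x2), names_to = "key", values_to = "val") %>%
  group_by(yr, key) %>%
  filter(all(!is.na(val))) %>%    # keep year/variable pairs with no missing values
  group_by(key) %>%
  filter(yr == max(yr)) %>%       # latest complete year per variable
  ungroup() %>%
  pivot_wider(names_from = key, values_from = val) %>%
  arrange(yr)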
I'm building a dataset and would like to add a week count to it, starting from the first date and ending on the last. I'm using it to summarize a much larger dataset, which I'd like summarized by week eventually.
Using this sample:
library(dplyr)
df <- tibble(Date = seq(as.Date("1944/06/01"), as.Date("1944/09/01"), "days"),
             Week = seq_along(Date) / 7)
# A tibble: 93 x 2
Date Week
<date> <dbl>
1 1944-06-01 0.143
2 1944-06-02 0.286
3 1944-06-03 0.429
4 1944-06-04 0.571
5 1944-06-05 0.714
6 1944-06-06 0.857
7 1944-06-07 1
8 1944-06-08 1.14
9 1944-06-09 1.29
10 1944-06-10 1.43
# … with 83 more rows
Which definitely isn't right. Also, my real dataset isn't structured sequentially, there are many days missing between weeks so a straight sequential count won't work.
An ideal end result is an additional "week" column, based upon the actual dates (rather than hard-coded with a seq_along() type of result).
Similar solution to Ronak's but with lubridate:
library(lubridate)
(df <- tibble(Date = seq(as.Date("1944/06/01"), as.Date("1944/09/01"), "days"),
              week = interval(min(Date), Date) %>%
                as.duration() %>%
                as.numeric("weeks") %>%
                floor() + 1))
You could subtract all the Date values with the first Date and calculate the difference using difftime in "weeks", floor all the values and add 1 to start the counter from 1.
df$week <- floor(as.numeric(difftime(df$Date, df$Date[1], units = "weeks"))) + 1
df
# A tibble: 93 x 2
# Date week
# <date> <dbl>
# 1 1944-06-01 1
# 2 1944-06-02 1
# 3 1944-06-03 1
# 4 1944-06-04 1
# 5 1944-06-05 1
# 6 1944-06-06 1
# 7 1944-06-07 1
# 8 1944-06-08 2
# 9 1944-06-09 2
#10 1944-06-10 2
# … with 83 more rows
To use this in your dplyr pipe you could do
library(dplyr)
df %>%
  mutate(week = floor(as.numeric(difftime(Date, first(Date), units = "weeks"))) + 1)
data
df <- tibble::tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"))
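Since the stated goal is a weekly summary of a larger dataset, the week column then just becomes a grouping key. A minimal sketch, counting days per week as a stand-in for the real summary:
df %>%
  mutate(week = floor(as.numeric(difftime(Date, first(Date), units = "weeks"))) + 1) %>%
  group_by(week) %>%
  summarise(days = n())  # replace with whatever weekly summary you need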
I have two data sets with one common variable, ID (there are duplicate ID numbers in both data sets). I need to link dates from one data set to the other, but I can't use a left join as-is, because the first (left) file needs to stay exactly as it is; I don't want the join to return all combinations and add rows. I also don't want it to behave like VLOOKUP in Excel, which finds the first match and returns it, so that duplicate ID numbers all get the first match. I need it to return the first match, then the second, then the third (the dates are sorted so that the newest date is always first for every ID number), and so on, but without added rows. Is there any way to do this? Since I don't know how else to show you, I included an example picture of what I need [image: data joining]. Not sure if I made myself clear, but thank you in advance!
You can add a second column creating sub-IDs that follow the row order within each ID. Then you can use an inner_join to join everything together.
Since you don't have example data sets I created two to show the principle.
df1 <- df1 %>%
  group_by(ID) %>%
  mutate(follow_id = row_number())

df2 <- df2 %>%
  group_by(ID) %>%
  mutate(follow_id = row_number())

outcome <- df1 %>% inner_join(df2)
# A tibble: 7 x 3
# Groups: ID [?]
  ID follow_id var1
<dbl> <int> <fct>
1 1 1 a
2 1 2 b
3 2 1 e
4 3 1 f
5 4 1 h
6 4 2 i
7 4 3 j
data:
df1 <- data.frame(ID = c(1, 1, 2, 3, 4, 4, 4))
df2 <- data.frame(ID = c(1, 1, 1, 1, 2, 3, 3, 4, 4, 4, 4),
                  var1 = letters[1:11])
You need a secondary id column. Since you want the first n matches, group by the id, create an auto-incrementing id within each group, then join as usual:
df1 <- data.frame(id = c(1, 1, 2, 3, 4, 4, 4))
d1 <- sample(seq(as.Date('1999/01/01'), as.Date('2012/01/01'), by = "day"), 11)
df2 <- data.frame(id = c(1, 1, 1, 1, 2, 3, 3, 4, 4, 4, 4), d1, d2 = d1 + sample.int(50, 11))
library(dplyr)
df11 <- df1 %>%
  group_by(id) %>%
  mutate(id2 = 1:n()) %>%
  ungroup()
df21 <- df2 %>%
  group_by(id) %>%
  mutate(id2 = 1:n()) %>%
  ungroup()
left_join(df11, df21, by = c("id", "id2"))
# A tibble: 7 x 4
id id2 d1 d2
<dbl> <int> <date> <date>
1 1 1 2009-06-10 2009-06-13
2 1 2 2004-05-28 2004-07-11
3 2 1 2001-08-13 2001-09-06
4 3 1 2005-12-30 2006-01-19
5 4 1 2000-08-06 2000-08-17
6 4 2 2010-09-02 2010-09-10
7 4 3 2007-07-27 2007-09-05