I have two data frames as follows:
data1
Type date
1 A 2011-10-21
2 A 2011-11-18
3 A 2011-12-16
4 B 2011-10-20
5 B 2011-11-17
6 B 2011-12-15
and
data2
Date Type value
1 2011-10-25 A 1
2 2011-10-15 A 3
3 2011-11-10 A 4
4 2011-10-23 B 12
5 2011-10-27 B 1
6 2011-11-18 B 1
I want to loop through each Type (A, B) in data1 and, for each date, look at all the entries of that Type in data2, find the data2 dates that fall within a two-week gap, and then sum their values to produce the output.
My ideal output would be
Type date Value
1 A 2011-10-21 4 (3+1)
2 A 2011-11-18 4
3 A 2011-12-16 NA (no values for A within two weeks)
4 B 2011-10-20 13 (12+1)
5 B 2011-11-17 1
6 B 2011-12-15 NA (no values for B within two weeks)
I can write a loop in R and run through it, but it takes a long time. I guess there should be a better way to do this in dplyr; I have been trying but am not able to complete it. Can anybody help me with this?
Thanks
How does this look? Assuming data1 is df1 and data2 is df2:
library(dplyr)
library(lubridate)
# join every data2 row onto the data1 rows of the same Type
df3 <- full_join(df1, df2, by = "Type")
# week number of each date
df3 <- df3 %>% mutate(date1 = week(date), Date1 = week(Date))
# flag pairs whose week numbers differ by at most two
df4 <- df3 %>% mutate(Key = ifelse(((date1 - Date1) %in% c(-2:2)), T, F))
# keep the flagged pairs and sum the values per Type/date
df5 <- df4 %>% filter(Key == T) %>% group_by(Type, date) %>%
  summarise(Value = sum(value))
# join back to data1 so dates with no match get an NA Value
full_join(df1, df5, by = c("Type", "date"))
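If the week-number comparison is a concern near year boundaries (week numbers wrap around in December/January), here is a minimal sketch that compares the dates directly instead; it assumes the same df1/df2 objects and that date/Date are, or can be coerced to, Date class:
library(dplyr)
df1 %>%
  left_join(df2, by = "Type") %>%
  # TRUE when the data2 date is within 14 days (either side) of the data1 date
  mutate(within = !is.na(Date) &
           abs(as.numeric(as.Date(date) - as.Date(Date))) <= 14) %>%
  group_by(Type, date) %>%
  summarise(Value = if (any(within)) sum(value[within]) else NA_real_,
            .groups = "drop")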
I have a data frame like this:
data.frame(id = c(1, 2, 3),
           first_value = c("A", "B", "NA"), second_values = c("A", "NA", "D"),
           first_date = c("2001", 2010, 2003), second_date = c("2003", 2014, "2007"))
id first_value second_values first_date second_date
1 1 A A 2001 2003
2 2 B NA 2010 2014
3 3 NA D 2003 2007
I'm looking to transform it to a longer data frame like this, in the simplest way:
id timing value date
1 1 first A 2001
2 1 second A 2003
3 2 first B 2010
4 2 second NA 2014
5 3 first NA 2003
6 3 second D 2007
I wasn't successful with tidyr's pivot_longer.
You can do:
library(tidyverse)
df %>%
pivot_longer(cols = -id,
names_pattern = '(.*)_(.*)',
names_to = c('timing', '.value'))
Which gives:
# A tibble: 6 x 4
id timing value date
<dbl> <chr> <chr> <chr>
1 1 first A 2001
2 1 second A 2003
3 2 first B 2010
4 2 second NA 2014
5 3 first NA 2003
6 3 second D 2007
NOTE: this only works if you rename your second_values column to second_value. I assume the "values" was just a typo?
An alternative, and a bit simpler, based on @Maël's suggestion:
df %>%
pivot_longer(cols = -id,
names_sep = '_',
names_to = c('timing', '.value'))
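If the real column is indeed named second_values, a minimal sketch that renames it first (so names_sep can split the names cleanly) might look like this, assuming the df from the question:
library(dplyr)
library(tidyr)
df %>%
  rename(second_value = second_values) %>%   # make the _value suffix consistent
  pivot_longer(cols = -id,
               names_sep = '_',
               names_to = c('timing', '.value'))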
You could try something like this:
library(tidyverse)
df <- data.frame(id = c(1,2,3),
first_value=c("A","B","NA"),
second_values=c("A","NA","D"),
first_date=c("2001",2010,2003),
second_date=c("2003",2014,"2007"))
bind_rows(
  df %>%
    select(id, first_value, date = first_date) %>%
    pivot_longer(cols = "first_value", names_to = "timing"),
  df %>%
    select(id, second_values, date = second_date) %>%
    pivot_longer(cols = "second_values", names_to = "timing")
) %>%
  mutate(timing = sub("_.*$", "", timing)) %>%  # "first_value" -> "first", "second_values" -> "second"
  relocate(id, timing, value, date) %>%
  arrange(id)
The mutate() strips the column-name suffix so timing reads "first"/"second"; the relocate() and arrange() calls are just there to get the same formatting / order as you posted, so they could probably be omitted.
I have a method for replacing values in a data frame by matching ID values. This works well for small data sets but not on large ones. Does anyone have a suggestion on how I might make this process more computationally efficient?
Below is an example of my R code. I am using the tidyverse package.
# Delta Array small test
test_df <- data.frame(ID = c(1,2,3,4,5,6,7,8,8,9),
val = c(1,NA,3,4,5,6,7,8,NA,9))
delta_test <- data.frame(ID = c(2,8,9),
val = c(2,100,50))
test_df$val <- ifelse(is.na(delta_test$val[match(test_df$ID, delta_test$ID)]),
test_df$val,
delta_test$val[match(test_df$ID, delta_test$ID)])
test_df
You can try to join test_df with delta_test and select the first non-NA value using coalesce.
library(dplyr)
test_df <- test_df %>%
left_join(delta_test, by = 'ID') %>%
mutate(val = coalesce(val.y, val.x)) %>%
select(ID, val)
test_df
# ID val
#1 1 1
#2 2 2
#3 3 3
#4 4 4
#5 5 5
#6 6 6
#7 7 7
#8 8 100
#9 8 100
#10 9 50
In base R this can be implemented as:
test_df <- transform(merge(test_df, delta_test, by = 'ID', all.x = TRUE),
val = ifelse(is.na(val.y), val.x, val.y))
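For very large data, a data.table update join avoids building the intermediate merged columns; a minimal sketch, assuming the original test_df and delta_test from the question:
library(data.table)
setDT(test_df)
setDT(delta_test)
# overwrite val in place for matching IDs, using only the non-NA replacement values
test_df[delta_test[!is.na(val)], on = "ID", val := i.val]
test_df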
data1=data.frame("Grade"=c(1,2,3,1,2,3),
"Group"=c(A,A,A,B,B,B),
"Score"=c(5,7,10,7,7,8))
data2=data.frame("Grade"=c(1,2,3),
"Combine"=c(12,14,18),
"A"=c(5,7,10),
"B"=c(7,7,8))
I have 'data1' and want 'data2', where the Group values from 'data1' become columns 'A' and 'B', and a final 'Combine' column sums 'A' and 'B'.
You tagged this with data.table, so here's a data.table approach.
library(data.table)
data1 <- as.data.table(data1)
data2 <- dcast(data1, Grade ~ Group, value.var = "Score")
data2[, Combine := A + B]
data2
Grade A B Combine
1: 1 5 7 12
2: 2 7 7 14
3: 3 10 8 18
We can use pivot_wider from tidyr
library(dplyr)
library(tidyr)
data1 %>%
pivot_wider(names_from = Group, values_from = Score) %>%
mutate(Combine = A + B)
You can do
library(tidyverse)
data1 %>%
spread(Group, Score) %>%
mutate(Combine = A+B)
Grade A B Combine
1 1 5 7 12
2 2 7 7 14
3 3 10 8 18
In base R:
data2 <- data.frame("Grade" = 1:3)
grade.locations <- lapply(1:3,grep,data1$Grade)
for(i in 1:3){
data2$Combine[i] <- sum(data1[grade.locations[[i]],3])
data2$A[i] <- data1[grade.locations[[i]][1],3]
data2$B[i] <- data1[grade.locations[[i]][2],3]
}
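A shorter base R alternative is reshape(); a sketch, assuming data1 is a plain data.frame with Group stored as character (as in the corrected definition above):
# wide format via reshape(): one Score column per Group level (Score.A, Score.B)
data2 <- reshape(data1, idvar = "Grade", timevar = "Group", direction = "wide")
names(data2) <- sub("^Score\\.", "", names(data2))  # Score.A -> A, Score.B -> B
data2$Combine <- data2$A + data2$B
data2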
I have a data frame like this:
id start end
1 20/06/88 24/07/89
1 27/07/89 13/04/93
1 14/04/93 6/09/95
2 3/01/92 11/02/94
2 30/03/94 16/04/96
2 17/04/96 18/08/97
that I would like to merge with this other dataframe
id date
1 26/08/88
2 10/05/96
The resulting merged dataframe should look like this
id start end date
1 20/06/88 24/07/89 26/08/88
1 27/07/89 13/04/93 NA
1 14/04/93 6/09/95 NA
2 3/01/92 11/02/94 NA
2 30/03/94 16/04/96 NA
2 17/04/96 18/08/97 10/05/96
In practice, I want to merge the two data frames based on id and on the condition that date lies within the interval spanned by the start and end variables of the first data frame.
Do you have any suggestions on how to do this? I tried the fuzzyjoin package, but I ran into memory issues.
Many thanks to everyone
Might be a dupe; I'll remove this when I find a good target. In the meantime, we could use fuzzyjoin:
library(tidyverse)
library(fuzzyjoin)
df1 %>%
  mutate_at(2:3, as.Date, "%d/%m/%y") %>%              # parse start and end
  fuzzy_left_join(
    df2 %>% mutate(date = as.Date(date, "%d/%m/%y")),  # parse date
    by = c("id" = "id", "start" = "date", "end" = "date"),
    match_fun = list(`==`, `<`, `>`))                  # id equal, start < date, end > date
# id.x start end id.y date
#1 1 1988-06-20 1989-07-24 1 1988-08-26
#2 1 1989-07-27 1993-04-13 NA <NA>
#3 1 1993-04-14 1995-09-06 NA <NA>
#4 2 1992-01-03 1994-02-11 NA <NA>
#5 2 1994-03-30 1996-04-16 NA <NA>
#6 2 1996-04-17 1997-08-18 2 1996-05-10
All that remains is tidying up the id columns.
Sample data
df1 <- read.table(text = "
id start end
1 20/06/88 24/07/89
1 27/07/89 13/04/93
1 14/04/93 6/09/95
2 3/01/92 11/02/94
2 30/03/94 16/04/96
2 17/04/96 18/08/97", header = T)
df2 <- read.table(text = "
id date
1 26/08/88
2 10/05/96 ", header = T)
You can use sqldf for complex joins. (Convert start, end, and date to Date class first; comparing them as 'dd/mm/yy' strings will not order correctly.)
require(sqldf)
sqldf("SELECT df1.*,df2.date,df2.id as id2
FROM df1
LEFT JOIN df2
ON df1.id = df2.id AND
df1.start < df2.date AND
df1.end > df2.date")
I have a large data frame with approximately 500,000 observations (identified by "ID") and 150+ variables. Some observations appear only once; others appear multiple times (up to 10 or so). I would like to "collapse" these multiple observations so that there is only one row per unique ID and all the information in columns 2:150 is concatenated. I do not need any calculations run on these observations, just quick munging.
I've tried:
df.new <- group_by(df,"ID")
and also:
library(data.table)
dt = data.table(df)
dt.new <- dt[, lapply(.SD, na.omit), by = "ID"]
and unfortunately neither have worked. Any help is appreciated!
Using base R:
df = data.frame(ID = c("a","a","b","b","b","c","d","d"),
day = c("1","2","3","4","5","6","7","8"),
year = c(2016,2017,2017,2016,2017,2016,2017,2016),
stringsAsFactors = F)
> df
ID day year
1 a 1 2016
2 a 2 2017
3 b 3 2017
4 b 4 2016
5 b 5 2017
6 c 6 2016
7 d 7 2017
8 d 8 2016
Do:
z = aggregate(df[,2:3],
by = list(id = df$ID),
function(x){ paste0(x, collapse = "/") }
)
Result:
> z
id day year
1 a 1/2 2016/2017
2 b 3/4/5 2017/2016/2017
3 c 6 2016
4 d 7/8 2017/2016
EDIT
If you want to avoid "collapsing" NA do:
z = aggregate(df[,2:3],
by = list(id = df$ID),
function(x){ paste0(x[!is.na(x)],collapse = "/") })
For a data frame like:
> df
ID day year
1 a 1 2016
2 a 2 NA
3 b 3 2017
4 b 4 2016
5 b <NA> 2017
6 c 6 2016
7 d 7 2017
8 d 8 2016
The result is:
> z
id day year
1 a 1/2 2016
2 b 3/4 2017/2016/2017
3 c 6 2016
4 d 7/8 2017/2016
I have had a similar problem in the past, although I wasn't dealing with many copies of the same data: it was in most cases just 2 instances and in some cases 3. Below is my approach; hopefully it will help.
# index of all duplicated entries, including the first occurrence of each key
idx <- duplicated(df$key) | duplicated(df$key, fromLast = TRUE)
dupes <- df[idx, ]       # the duplicated rows
non_dupes <- df[!idx, ]  # the rows that appear only once

library(dplyr)
library(tidyr)
# roll up the duplicates: fill missing values within each key, then keep one row
temp <- dupes %>%
  group_by(key) %>%
  fill(everything(), .direction = "down") %>%
  fill(everything(), .direction = "up") %>%
  slice(1)
Then it is easy to merge back the temp and the non_dupes.
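A hedged sketch of that last step, assuming the temp and non_dupes objects above and that row order does not matter:
library(dplyr)
# stack the rolled-up rows back onto the rows that were never duplicated
result <- bind_rows(non_dupes, temp) %>%
  arrange(key)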
EDIT
I would highly recommend filtering the df down to only the population that is relevant for your end goal, as this process could take some time.
What about?
df %>%
group_by(ID) %>%
summarise_each(funs(paste0(., collapse = "/")))
Or, reproducibly with the built-in iris data:
iris %>%
group_by(Species) %>%
summarise_each(funs(paste0(., collapse = "/")))
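summarise_each() and funs() are superseded in current dplyr; a sketch of the modern equivalent (same idea, assuming dplyr >= 1.0):
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ paste0(.x, collapse = "/")))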