Joins in R while also spreading out information from one data frame - r

I am attempting to join together two data frames. One contains records of when certain events happened. The other contains daily information on values that occurred for a given organization.
My current challenge is how to join the information in the records ("when certain events happened") data frame fully into the daily data frame. Most of dplyr's joins appear to simply match one line to one line. I need to fully spread out the record information based on start and end dates.
In other words, I need to spread out information from one line into many lines, while simultaneously joining to the daily data table. It is important that I do this in R because the alternative is quite a bit of filtering and dragging in Excel (the information covers thousands of rows).
Below is a representation of the daily data table
value year month day org link
12 1 1 1 AA AA-1-1
45 1 1 2 AA AA-1-2
31 1 1 3 AA AA-1-3
10 1 1 4 AA AA-1-4
Below is a representation of the records table
year month day org link end_link event event_info
1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
1 2 7 BB BB-1-2-7 BB-1-2-10 Sell Yes
And finally, here is what I am aiming for in the end:
value month day org link event event_info
12 1 1 AA AA-1-1-1
45 1 2 AA AA-1-1-2 Buy Yes
31 1 3 AA AA-1-1-3 Buy Yes
10 1 4 AA AA-1-1-4
Is there any way to accomplish this in R? I have tried using dplyr joins but usually am only able to join together the initial link.
Edit: The second "end" link refers to an end date. In the records table this is all in one line, while the second data frame has daily information.
Edit: Below I have put together a cleaner look at my real data. The first image is of DAILY DATA while the second is of RECORDS OF EVENTS. The third is what I would like to see (ideally).
Daily data, which will have multiple orgs present
Records data, note org id AA and the audience
Ideal combined data

First build the start and end dates from link and end_link, expand each record into one row per day of its date range, and then right join that long version of df2 onto df1:
library(tidyverse)
df2 %>%
  # split both link columns into their org/year/month/day parts
  separate(link, c("org1", "year1", "month1", "day1")) %>%
  separate(end_link, c("org2", "year2", "month2", "day2")) %>%
  rowwise() %>%
  # `:` on two Dates gives the sequence of day numbers between them
  transmute(org, event, event_info, date = list(
    as.Date(paste0(year1, "-", month1, "-", day1)):as.Date(paste0(year2, "-", month2, "-", day2)))) %>%
  ungroup() %>%
  unnest(date) %>%
  # keep every daily row; days without an event get NA
  right_join(df1 %>% mutate(date = as.numeric(as.Date(paste0(year, "-", month, "-", day)))),
             by = c("org", "date")) %>%
  select(value, month, day, org, link, event, event_info)
# # A tibble: 4 x 7
# value month day org link event event_info
# <int> <int> <int> <chr> <chr> <chr> <chr>
# 1 12 1 1 AA AA-1-1 <NA> <NA>
# 2 45 1 2 AA AA-1-2 Buy Yes
# 3 31 1 3 AA AA-1-3 Buy Yes
# 4 10 1 4 AA AA-1-4 <NA> <NA>
data
df1 <- read.table(text="value year month day org link
12 1 1 1 AA AA-1-1
45 1 1 2 AA AA-1-2
31 1 1 3 AA AA-1-3
10 1 1 4 AA AA-1-4", header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text="year month day org link end_link event event_info
1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
1 2 7 BB BB-1-2-7 BB-1-2-10 Sell Yes", header = TRUE, stringsAsFactors = FALSE)

I would use the data.table package; for me it is the best R package for data analysis. I hope I have understood the problem properly; let me know if it does not work.
The first part creates the data set. I created the two data.table objects in two different ways just to show both alternatives. You could also read your data directly from Excel, .txt, .csv or similar; let me know if you want to know how to do this.
library(data.table)
value<-c(12,45,31,10)
year<-c(1,1,1,1)
month<-c(1,1,1,1)
day<-c(1,2,3,4)
org<-c("AA","AA","AA","AA")
link<-c("AA-1-1","AA-1-2","AA-1-3","AA-1-4")
Daily_dt<-data.table(value, year,month,day,org,link)
Records_dt<-data.table(year=c(1,1),month=c(1,1),day=c(2,3),org=c("AA","BB"),link=c("AA-1-1-2","BB-1-2-7"),end_link=c("AA-1-1-3","BB-1-2-10"),
event=c("Buy","Buy"),event_info=c("Yes","Yes"))
Daily_dt[,Date:=as.Date(paste(year,"-",month,"-",day,sep=""))]
To achieve what you want you need these lines
Records_dt=rbind(Records_dt[,c("org","link","event","event_info")],
Records_dt[,list(org,link=end_link,event,event_info)])
Record_Dates<-as.data.table(tstrsplit(Records_dt$link,"-")[-1])
Record_Dates[,Dates:=as.Date(paste(V1,"-",V2,"-",V3,sep=""))]
Records_dt[,Date:=Record_Dates$Dates]
setkey(Records_dt,Date)
setkey(Daily_dt,Date)
Records_dt<-Records_dt[,c("Date","event","event_info")][Daily_dt,]
Records_dt<-Records_dt[,c("value","month","day","org","link","event","event_info")]
and this is the result
> Records_dt
value month day org link event event_info
1: 12 1 1 AA AA-1-1 NA NA
2: 45 1 2 AA AA-1-2 Buy Yes
3: 31 1 3 AA AA-1-3 Buy Yes
4: 10 1 4 AA AA-1-4 NA NA
If your input data had more than one event in the same day (with or without the same org) something like:
> Records_dt
year month day org link end_link event event_info
1: 1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
2: 1 1 3 BB BB-1-2-7 BB-1-2-10 Buy Yes
3: 1 1 2 AA AA-1-1-2 AA-1-1-3 Buy Yes
4: 1 1 3 AA AA-1-2-7 AA-1-2-10 Buy Yes
some tweaks may be required, but I am not sure whether you need this, so I did not add it.
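For date ranges longer than two days, or for several orgs in the same table, one possible tweak is a non-equi update join, which tags every daily row that falls between a record's start and end date for the same org. This is only a sketch under assumptions: the column names start and end, and the use of 2001 as the calendar year, are made up for illustration and are not in the original data.

```r
library(data.table)

# Illustrative data; the year 2001 and the start/end columns are assumptions
daily <- data.table(
  value = c(12, 45, 31, 10),
  org   = "AA",
  date  = as.Date("2001-01-01") + 0:3
)
records <- data.table(
  org        = c("AA", "BB"),
  start      = as.Date(c("2001-01-02", "2001-02-07")),
  end        = as.Date(c("2001-01-03", "2001-02-10")),
  event      = c("Buy", "Sell"),
  event_info = "Yes"
)

# For each record, find the daily rows of the same org whose date lies in
# [start, end] and copy the event columns onto them by reference.
# Daily rows with no matching record keep NA in both new columns.
daily[records,
      on = .(org, date >= start, date <= end),
      `:=`(event = i.event, event_info = i.event_info)]

print(daily)
```

If two records of the same org overlap on a given day, the later record in the table overwrites the earlier one, so overlapping events would need a different layout (for example one row per day and event).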

Related

How to iterate one dataframe based on a mapping file in R?

Serial No.  Company 1  Company 2  Company 3
01          NA         2          NA
02          2          NA         5
03          NA         NA         4
04          1          NA         NA
05          NA         4          NA
I have a data structure like this where the column headings represent some companies and the row headings represents consumers who buy the products. 'NA' representing no purchase for that company's products by the consumer.
I have a second mapping file where the companies are represented as row headings as follows -
Company    Country  Category
Company 1  UK       FMCG
Company 2  UK       FMCG
Company 3  India    FMCG
Company 4  US       Nicotine
The data set is for over 10000 consumers and 1000 companies. I'm getting the market share for different countries and categories using the aggregate function and mapping file.
I want to make a loop that iterates over values in the first data frame to change the share for different countries and categories. The idea is that I can choose which country's (or category's) share needs to be changed, along with the new share, and then use the mapping file to adjust the values for companies in that country (or category). The values need to be changed only for those consumers who buy products from companies belonging to that country (or category).
Can someone suggest how this can be done in R (preferably) or Python?
Edit:
Before iteration I will use the aggregate function in R to get the shares for a country (or category) like this -
Country  Share
UK       0.33
US       0.02
IN       0.41
IR       0.11
PK       0.13
In the loop I want to be able to specify the share for some country (say UK) to whatever is required (say 0.5). The mapping file will be used to iterate values to the first data structure where people have bought products from companies in UK.
The final output will be something like this.
Country  Share
UK       0.50
US       0.00
IN       0.38
IR       0.11
PK       0.01
Here's a guess: ultimately, this is a combination of reshape from wide to long, then merge/join, and finally aggregation/summarizing by group. If you need more information for either operation, using those key-words (on SO) will provide very useful information.
base R (and reshape2)
## reshape
dat1melted <- reshape2::melt(dat1, "Serial No.", variable.name = "Company")
dat1melted$Company <- as.character(dat1melted$Company)
dat1melted <- dat1melted[!is.na(dat1melted$value),]
dat1melted
# Serial No. Company value
# 2 02 Company 1 2
# 4 04 Company 1 1
# 6 01 Company 2 2
# 10 05 Company 2 4
# 12 02 Company 3 5
# 13 03 Company 3 4
## merge
dat1merged <- merge(dat1melted, dat2, by = "Company", all.x = TRUE)
dat1merged
# Company Serial No. value Country Category
# 1 Company 1 02 2 UK FMCG
# 2 Company 1 04 1 UK FMCG
# 3 Company 2 01 2 UK FMCG
# 4 Company 2 05 4 UK FMCG
# 5 Company 3 02 5 India FMCG
# 6 Company 3 03 4 India FMCG
## aggregate by group
aggregate(value ~ Country, data = dat1merged, FUN = sum)
# Country value
# 1 India 9
# 2 UK 9
dplyr
library(dplyr)
# library(tidyr) # pivot_longer
dat1 %>%
  ## reshape
  tidyr::pivot_longer(-`Serial No.`, names_to = "Company") %>%
  filter(!is.na(value)) %>%
  ## merge
  left_join(., dat2, by = "Company") %>%
  ## aggregate by group
  group_by(Country) %>%
  summarize(value = sum(value))
# # A tibble: 2 x 2
# Country value
# <chr> <int>
# 1 India 9
# 2 UK 9
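The sums above can be turned into shares, and one possible way to move a country to a chosen share is to rescale only that country's purchases. This is a rough sketch, not the asker's method: the helper share_by_country, the 0.6 target, and the rule "scale the target group and leave everyone else untouched" are all assumptions. If U is the target country's total and R the total of all others, the factor f that gives a share t solves f*U / (f*U + R) = t, so f = t*R / ((1 - t)*U).

```r
# Long, merged data as produced by the reshape + merge steps (values from the example)
merged <- data.frame(
  Country = c("UK", "UK", "UK", "UK", "India", "India"),
  value   = c(2, 1, 2, 4, 5, 4)
)

# Hypothetical helper: share of the total value per country
share_by_country <- function(d) {
  totals <- aggregate(value ~ Country, data = d, FUN = sum)
  setNames(totals$value / sum(totals$value), totals$Country)
}

before <- share_by_country(merged)

# Assumed adjustment rule: rescale only the target country's purchases
target_country <- "UK"
target_share   <- 0.6
in_target <- merged$Country == target_country
U    <- sum(merged$value[in_target])    # total of the target country
rest <- sum(merged$value[!in_target])   # total of everything else
f    <- target_share * rest / ((1 - target_share) * U)

adjusted <- merged
adjusted$value[in_target] <- adjusted$value[in_target] * f

after <- share_by_country(adjusted)
print(before)
print(after)
```

The same idea works for categories by swapping the grouping column. Wrapping the adjustment in a function of target_country and target_share would give the loop body described in the question.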

How to view all rows of an output that's not in table form

Problem
I started with an ungrouped data set, which I then grouped. The printed output of the grouping, however, does not show all 427 rows. I need the full output in order to put that data into a table.
So initially the data was ungrouped and appears as follows:
Occupation Education Age Died
1 household Secondary 39 no
2 farming primary 83 yes
3 farming primary 60 yes
4 farming primary 73 yes
5 farming Secondary 51 no
6 farming iliterate 62 yes
The data is then grouped as follows:
occu %>% group_by(Occupation, Died, Age) %>% count() ## group on the occupation of the suicide victims
which gives the following output:
Occupation Died Age n
<fct> <fct> <int> <int>
1 business/service no 20 2
2 business/service no 30 1
3 business/service no 31 2
4 business/service no 34 1
5 business/service no 36 2
6 business/service no 41 1
7 business/service no 44 1
8 business/service no 46 1
9 business/service no 84 1
10 business/service yes 21 1
# ... with 417 more rows
The problem is that I need all the rows in order to put the grouped data into a table using:
dt <- read.table(text="full output from above")
If any more code would be useful to solving this let me know.
It is not really clear what you want, but try the following code:
grouped <- occu %>% group_by(Occupation, Died, Age) %>% count()
dt <- as.data.frame(grouped)
It seems you simply want to convert the tibble to a data frame. There is no need to print all the output and then copy-paste it into read.table().
If you need to, you can also save the output with write.table(dt,"filename.txt"), which creates a .txt file with your data.
If what you really want is to print the whole tibble in the console, you can use the following option, as suggested by Akrun's link:
options(dplyr.print_max = 1e9)
It allows R to print the full tibble to the console, which I think is not an efficient way to do what you are asking.
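As a small self-contained sketch of both routes, the snippet below rebuilds the six sample rows from the question (the object names grouped and dt are just illustrative), prints every row with print(n = Inf), and converts the grouped result to a plain data frame without any copy and paste.

```r
library(dplyr)

# The six sample rows shown in the question
occu <- tibble(
  Occupation = c("household", "farming", "farming", "farming", "farming", "farming"),
  Education  = c("Secondary", "primary", "primary", "primary", "Secondary", "iliterate"),
  Age        = c(39L, 83L, 60L, 73L, 51L, 62L),
  Died       = c("no", "yes", "yes", "yes", "no", "yes")
)

# Keep the grouped counts in an object instead of only printing them
grouped <- occu %>% count(Occupation, Died, Age)

# Print every row of the tibble, however many there are
print(grouped, n = Inf)

# A plain data frame holds all rows and can be used wherever a table is needed
dt <- as.data.frame(grouped)
```

Assigning the result first matters: calling as.data.frame() on occu itself would convert the ungrouped data rather than the counts.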

Count non-NA values per row

library(data.table)
family_id<-c(1,2,3)
age_mother<-c(30,27,29)
dob_child1<-c("1998-11-12","1999-12-12","1996-04-12")##child one birth day
dob_child2<-c(NA,"1997-09-09",NA)##if no child,NA
dob_child3<-c(NA,"1999-09-01","1996-09-09")
DT<-data.table(family_id,age_mother,dob_child1,dob_child2,dob_child3)
Now that I have DT, how can I use this table to find out how many children each family has, using syntax like this:
DT[,apply..,keyby=family_id]##this code is wrong
This may also work:
> DT$total_child <- as.vector(rowSums(!is.na(DT[, c("dob_child1",
"dob_child2", "dob_child3")])))
> DT
family_id age_mother dob_child1 dob_child2 dob_child3 total_child
1 1 30 1998-11-12 <NA> <NA> 1
2 2 27 1999-12-12 1997-09-09 1999-09-01 3
3 3 29 1996-04-12 <NA> 1996-09-09 2
You can use the sqldf package to run a SQL query in R.
I recreated your DT.
family_id<-c(1,2,3)
age_mother<-c(30,27,29)
dob_child1<-c("1998-11-12","1999-12-12","1996-04-12")##child one birth day
dob_child2<-c(NA,"1997-09-09",NA)##if no child,NA
dob_child3<-c(NA,"1999-09-01","1996-09-09")
DT<-data.table(family_id,age_mother,dob_child1,dob_child2,dob_child3)
library(sqldf)
sqldf('select distinct (count(dob_child3)+count(dob_child2)+count(dob_child1)) as total_child,
family_id from DT group by family_id')
The result is the following:
total_child family_id
1 1 1
2 3 2
3 2 3
Is this the result you were after?
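For a version closer to the data.table syntax in the question, a minimal sketch is to count the non-missing dob_child columns through .SD. Selecting the columns with grep on the "dob_child" prefix is an assumption about how the real table is named; the result column names total_child and n_children are illustrative.

```r
library(data.table)

DT <- data.table(
  family_id  = c(1, 2, 3),
  age_mother = c(30, 27, 29),
  dob_child1 = c("1998-11-12", "1999-12-12", "1996-04-12"),
  dob_child2 = c(NA, "1997-09-09", NA),
  dob_child3 = c(NA, "1999-09-01", "1996-09-09")
)

# Columns holding a child's date of birth (assumed to share this prefix)
child_cols <- grep("^dob_child", names(DT), value = TRUE)

# Row-wise count of non-missing birth dates, added by reference
DT[, total_child := rowSums(!is.na(.SD)), .SDcols = child_cols]

# The same count written as a grouped summary keyed by family
per_family <- DT[, .(n_children = sum(!is.na(unlist(.SD)))),
                 keyby = family_id, .SDcols = child_cols]

print(DT)
print(per_family)
```

The grouped form also works if a family is spread over several rows, since it counts all non-missing dates within each family_id.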

New column from non-standard date factor in R

I have a dataframe with an oddly formatted dates column. I'd like to create a column just showing the year from the original date column and I am having trouble coming up with a way to do this because the current date column is being treated as a factor. Any advice on how to do this efficiently would be appreciated.
Example
starting with:
org <- c("a","b","c","d")
country <- c("1","2","3","4")
date <- c("01-09-14","01-10-07","11-31-99","10-31-12")
toy <- data.frame(cbind(org,country,date))
toy
org country date
1 a 1 01-09-14
2 b 2 01-10-07
3 c 3 11-31-99
4 d 4 10-31-12
str(toy$date)
Factor w/ 4 levels "01-09-14","01-10-07",..: 1 2 4 3
Desired result:
org country Year
1 a 1 2014
2 b 2 2007
3 c 3 1999
4 d 4 2012
This should work:
transform(toy,Year=format(strptime(date,"%m-%d-%y"),"%Y"))
This produces
## org country date Year
## 1 a 1 01-09-14 2014
## 2 b 2 01-10-07 2007
## 3 c 3 11-31-99 <NA>
## 4 d 4 10-31-12 2012
I initially thought that the NA value was because the %y format indicator wasn't smart enough to handle previous-century dates, but ?strptime says:
‘%y’ Year without century (00-99). On input, values 00 to 68 are
prefixed by 20 and 69 to 99 by 19 - that is the behaviour
specified by the 2004 and 2008 POSIX standards, but they do
also say ‘it is expected that in a future version the default
century inferred from a 2-digit year will change’.
implying that it should be able to handle it.
The problem is actually that 31 November doesn't exist ...
(You can drop the date column at your leisure ...)
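To find such impossible dates up front, and still recover a year for them, one option is to check which strings fail to parse and to read the year from the text itself. This is only a sketch: the fallback applies the same 00-68 / 69-99 century rule quoted from ?strptime by hand, and the variable names are illustrative.

```r
date <- c("01-09-14", "01-10-07", "11-31-99", "10-31-12")

# Strings that are not real calendar dates parse to NA
parsed <- as.Date(date, format = "%m-%d-%y")
bad    <- date[is.na(parsed)]
print(bad)

# Fallback: take the two-digit year from the end of the string and
# apply the 00-68 -> 20xx, 69-99 -> 19xx rule manually
yy   <- as.integer(sub(".*-", "", date))
Year <- ifelse(yy <= 68, 2000 + yy, 1900 + yy)
print(Year)
```

This gives a year for every row, including the invalid 31 November entry, at the cost of not validating the month and day at all.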

R finding date intervals by ID

I have the following table, whose key columns are: customer ID | order ID | product ID | Quantity | Amount | Order Date.
All this data is in LONG format, so there are multiple line items for a single customer ID.
I can get the difference between the first and last date using R DateDiff, but after converting the file to WIDE format using plyr I still end up with the same problem of multiple orders per customer, just with fewer rows and more columns.
Is there an R function that extends R DateDiff to work out the time interval between purchases by customer ID? That is, the time between order 1 and 2, order 2 and 3, and so on, assuming these orders exist.
CID Order.Date Order.DateMY Order.No_ Amount Quantity Category.Name Locality
1 26/02/13 Feb-13 zzzzz 1 r MOSMAN
1 26/05/13 May-13 qqqqq 1 x CHULLORA
1 28/05/13 May-13 wwwww 1 r MOSMAN
1 28/05/13 May-13 wwwww 1 x MOSMAN
2 19/08/13 Aug-13 wwwwww 1 o OAKLEIGH SOUTH
3 3/01/13 Jan-13 wwwwww 1 x CURRENCY CREEK
4 28/08/13 Aug-13 eeeeeee 1 t BRISBANE
4 10/09/13 Sep-13 rrrrrrrrr 1 y BRISBANE
4 25/09/13 Sep-13 tttttttt 2 e BRISBANE
It is not clear what you want to do, since you don't give the expected result, but I guess you want to get the intervals between consecutive orders.
library(data.table)
DT <- as.data.table(DF)
DT[, list(Order.Date,
diff = c(0,diff(sort(as.Date(Order.Date,'%d/%m/%y')))) ),CID]
CID Order.Date diff
1: 1 26/02/13 0
2: 1 26/05/13 89
3: 1 28/05/13 2
4: 1 28/05/13 0
5: 2 19/08/13 0
6: 3 3/01/13 0
7: 4 28/08/13 0
8: 4 10/09/13 13
9: 4 25/09/13 15
Split the data frame and find the intervals for each Customer ID.
df <- data.frame(customerID=as.factor(c(rep("A",3),rep("B",4))),
                 OrderDate=as.Date(c("2013-07-01","2013-07-02","2013-07-03","2013-06-01","2013-06-02",
                                     "2013-06-03","2013-07-01")))
dfs <- split(df,df$customerID)
lapply(dfs,function(x){
  tmp <-diff(x$OrderDate)
  tmp
})
Or use plyr
library(plyr)
dfs <- dlply(df,.(customerID),function(x)return(diff(x$OrderDate)))
I know this question is very old, but I just figured out another way to do it and wanted to record it:
> library(dplyr)
> library(lubridate)
> df %>% group_by(customerID) %>%
mutate(SinceLast=(interval(ymd(lag(OrderDate)),ymd(OrderDate)))/86400)
# A tibble: 7 x 3
# Groups: customerID [2]
customerID OrderDate SinceLast
<fct> <date> <dbl>
1 A 2013-07-01 NA
2 A 2013-07-02 1.
3 A 2013-07-03 1.
4 B 2013-06-01 NA
5 B 2013-06-02 1.
6 B 2013-06-03 1.
7 B 2013-07-01 28.
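The same "days since the previous order" column can also be sketched in base R with ave(), which applies a function within each customer and returns a vector aligned with the rows. The data is the example df from the answers above; the column name SinceLast is reused for comparison, and sorting first is an assumption that the intervals should follow date order.

```r
df <- data.frame(
  customerID = c(rep("A", 3), rep("B", 4)),
  OrderDate  = as.Date(c("2013-07-01", "2013-07-02", "2013-07-03",
                         "2013-06-01", "2013-06-02", "2013-06-03", "2013-07-01"))
)

# Sort so that consecutive rows of a customer are consecutive orders
df <- df[order(df$customerID, df$OrderDate), ]

# Within each customer: NA for the first order, then days since the previous one
df$SinceLast <- ave(as.numeric(df$OrderDate), df$customerID,
                    FUN = function(d) c(NA, diff(d)))

print(df)
```

Customers with a single order simply get NA, so no special handling is needed for them.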
