I have two data.tables: claims and SC. They have one column in common: subCoverageKey. In claims I want to create a new column, subCoverageKeyClaims. For every row in claims I want to take the corresponding subCoverageKeyClaims value from SC, matching by subCoverageKey. In case there are multiple subCoverageKeyClaims values for a subCoverageKey, a random choice should be taken.
How can this be done?
I tried using sample() but couldn't get it to work.
The resulting data.table should look something like this:
claims
clientID claimID claimYear amount clDate subCoverageKey subCoverageKeyClaims
1: 1 OP_a19517b1-5c66-47ca-92de-40c1b1a0b16b 2019 50.01 2019-04-26 IP_accommodation b83f2a41-64c3-4571-97e7-6534f9629104
2: 1 OP_a19517b1-5c66-47ca-92de-40c1b1a0b16b 2019 50.01 2019-04-26 IP_bundle f0a9ee55e-31b1-46f8-a0d4-91154e6c0998
3: 1 OP_a19517b1-5c66-47ca-92de-40c1b1a0b16b 2019 50.01 2019-04-26 IP_accommodation f0a9ee55e-31b1-46f8-a0d4-91154e6c0998
4: 1 OP_064c03aa-f2d5-4768-9c4e-51b54a725e56 2019 78.25 2019-06-09 IP_upgrade 74390be79-dc1e-4f7a-a0c0-f548c0b9ffcb
5: 1 OP_064c03aa-f2d5-4768-9c4e-51b54a725e56 2019 78.25 2019-06-09 Daily_cash 7a61bcf3-9e6d-4c4b-be2b-1381527dedd6
---
2637586: 130999 OP_b165c233-cd77-461b-b37d-704ac647d878 2019 8.66 2019-09-13 IP_upgrade ffdef3f3-2996-4d1a-bf51-a78b43029079
2637587: 130999 OP_0a11b09d-fd4c-427e-ad7b-8c67c2fa70e5 2019 61.16 2019-09-17 Daily_cash 0a9ee55e-31b1-46f8-a0d4-91154e6c0998
2637588: 131000 OP_3fb03980-8642-48bf-8967-55e410243868 2019 12.64 2019-05-10 IP_upgrade 4390be79-dc1e-4f7a-a0c0-f548c0b9ffcb
2637589: 131000 OP_64d85cc6-db73-408a-a02a-6b0c811ee06d 2019 8.44 2019-05-02 IP_bundle ffdef3f3-2996-4d1a-bf51-a78b43029079
2637590: 131000 OP_8b5585d8-d8e0-47ed-9005-3584062d4103 2019 3.57 2019-03-10 IP_accommodation ffdef3f3-2996-4d1a-bf51-a78b43029079
The data.tables I am planning to join are quite large (~300,000 observations), so I am looking for something that won't take much time.
I found the answer I was looking for:
Convert the SC table into a list, split by the shared key:
listSC <- split(SC, SC$subCoverageKey)
Then use sapply() to randomly pick the corresponding claim ID (subCoverageKeyClaims) for each row:
claims$subCoverageKeyClaims <- sapply(claims$subCoverageKey, function(x) {
  sample(listSC[[x]]$claimID, 1)
})
Here is one tidyverse way:
library(tidyverse)
SC %>%
  select(claimID, subCoverageKey) %>%
  group_by(subCoverageKey) %>%
  nest() %>%
  ungroup() %>%
  right_join(claims, by = 'subCoverageKey') %>%
  mutate(subCoverageKeyClaims = map_chr(data, ~ sample(.x$claimID, 1)))
Since subCoverageKey values are duplicated, we nest the SC rows for each subCoverageKey into a list column and join it onto the claims dataset. We can then select one random claimID value from each row's respective nested data.
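If speed matters at this scale, a data.table variant may also be worth trying. This is only a sketch: it assumes SC has the claimID and subCoverageKey columns used above, and that every subCoverageKey in claims also occurs in SC. It still draws one independent sample per claims row, but does the work per key group rather than per row:
library(data.table)
setDT(SC)
setDT(claims)
# pre-split the candidate claim IDs once, keyed by subCoverageKey
idsByKey <- split(SC$claimID, SC$subCoverageKey)
claims[, subCoverageKeyClaims := {
  pool <- idsByKey[[.BY$subCoverageKey]]              # all candidate IDs for this key
  pool[sample.int(length(pool), .N, replace = TRUE)]  # one independent draw per row
}, by = subCoverageKey]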
I'm running the code below to retrieve a data set, which unfortunately uses "." instead of NA to represent missing data. After much wrangling and searching SO and other fora, I still cannot make the code replace all instances of "." with NA so I can convert the columns to numeric and go on with my life. I'm pretty sure the problem is between the screen and the chair, so I don't see a need to post sessionInfo, but please let me know otherwise. Help in solving this would be greatly appreciated. The first four columns are integers setting out the date and the unique ID, so I would only need to correct the other columns. Thanks in advance, all!
library(data.table)
google_mobility_data <- data.table(read.csv("https://github.com/OpportunityInsights/EconomicTracker/raw/main/data/Google%20Mobility%20-%20State%20-%20Daily.csv", stringsAsFactors = FALSE))
# The following line is the one where I can't make it work properly.
google_mobility_data[, .SD := as.numeric(sub("^\\.$", NA, .SD)), .SDcols = -c(1:4)]
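As an aside, that line fails because := cannot assign to .SD itself, and the substitution has to be applied column-by-column. A sketch of a working in-place version, assuming the same column layout, might be:
cols <- names(google_mobility_data)[-(1:4)]
# replace "." with NA, then coerce each remaining column to numeric
google_mobility_data[, (cols) := lapply(.SD, function(x)
  as.numeric(replace(x, x == ".", NA_character_))), .SDcols = cols]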
I downloaded your data and changed the last entry on the first row to "." to test NA in the final column.
1. Use readLines to read the file as a character vector.
2. Use gsub to change "." fields to NA.
3. Use fread to read the result as a data.table.
library(data.table)
gmd <- readLines("Google Mobility - State - Daily.csv")
gmd[c(2,3120)]
# [1] "2020,2,24,1,.00286,-.00714,.0557,.06,.0129,.00857,."
# [2] "2020,4,25,10,-.384,-.191,.,-.479,-.441,.179,-.213"
gmd <- gsub(",\\.,",",NA,",gmd)
gmd <- gsub(",\\.$",",NA",gmd)
gmd[c(2,3120)]
# [1] "2020,2,24,1,.00286,-.00714,.0557,.06,.0129,.00857,NA"
# [2] "2020,4,25,10,-.384,-.191,NA,-.479,-.441,.179,-.213"
google_mobility_data <- fread(text=gmd)
google_mobility_data[c(1,3119)]
# year month day statefips gps_retail_and_recreation gps_grocery_and_pharmacy gps_parks gps_transit_stations gps_workplaces gps_residential gps_away_from_home
#1: 2020 2 24 1 0.00286 -0.00714 0.0557 0.060 0.0129 0.00857 NA
#2: 2020 4 25 10 -0.38400 -0.19100 NA -0.479 -0.4410 0.17900 -0.213
summary(google_mobility_data)
EDIT: You mentioned using na.strings with fread didn't work for you, so I suggested the above approach.
However, at least with the data file downloaded as I did, this worked in one line - as suggested by #MichaelChirico:
google_mobility_data <- fread("Google Mobility - State - Daily.csv",na.strings=".")
google_mobility_data[c(1,3119)]
# year month day statefips gps_retail_and_recreation gps_grocery_and_pharmacy gps_parks gps_transit_stations gps_workplaces gps_residential gps_away_from_home
#1: 2020 2 24 1 0.00286 -0.00714 0.0557 0.060 0.0129 0.00857 NA
#2: 2020 4 25 10 -0.38400 -0.19100 NA -0.479 -0.4410 0.17900 -0.213
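Either way, it's worth confirming the conversion afterwards; for example:
# check that the measure columns are now numeric, and count the NAs produced
str(google_mobility_data)
colSums(is.na(google_mobility_data))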
I have a dataframe called Deals, and one of the variables is $DealYear. This contains 3 factor levels (2013, 2014, 2015). I have another column in Deals called $GDPDeflator which is currently unpopulated. For example:
#Deals table
DealID DealAmt DealYear Name GDPDeflator Website
100101 200 2013 ABC 0 www.abc.com
120022 3000 2014 EFG 0 www.efg.com
300012 650 2013 HIJ 0 www.hij.com
I have a small second table called Deflator which contains the values I need for Deals$GDPDeflator:
#Deflator table
Year Defl
2012 1.10
2013 1.08
2014 1.055
2015 1.046
2016 1.03
How can I look up the values for Deals$GDPDeflator from Deflator$Defl based on Deals$DealYear and Deflator$Year?
This is a typical use of the merge function, as suggested in the comments.
Since you did not provide a minimal reproducible example, I have prepared a toy example.
deals.df <- data.frame(DealID = abs(rnorm(3)),
DealYear = c(2013,2014,2015),
DealAmt = abs(rnorm(3)))
deflator.df <- data.frame(Year=c(2012:2016),
Defl=c(1.1,1.08,1.055,1.046,1.03))
At this point you could rename DealYear to Year in deals.df (or vice versa) or, as I show below, use by.x and by.y to tell merge the name of the key column in each data frame:
NEW.deals.df <- merge(deals.df, deflator.df, by.x = "DealYear", by.y = "Year")
NEW.deals.df
Output:
DealYear DealID DealAmt Defl
1 2013 2.4428505 0.8423267 1.080
2 2014 0.7864217 1.7308812 1.055
3 2015 1.2319621 0.7857849 1.046
Now you can rearrange the columns if needed.
Please note that you can match rows between data frames using more than one column as identifier with by = c("colname1", "colname2", ...), as in the sketch below.
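For illustration, a self-contained toy example with invented Year and Quarter columns shared by both frames:
d1 <- data.frame(Year = c(2013, 2013, 2014), Quarter = c(1, 2, 1), Amt = c(10, 20, 30))
d2 <- data.frame(Year = c(2013, 2013, 2014), Quarter = c(1, 2, 1), Defl = c(1.08, 1.07, 1.055))
# rows are matched on the combination of Year and Quarter
merge(d1, d2, by = c("Year", "Quarter"))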
I'm having trouble applying a simple data.table join example to a larger (10GB) data set. merge() works just fine on data.frames with the larger dataset, although I'd love to take advantage of the speed in data.table. Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?
Here is the simple example (derived from this thread: Join of two data.tables fails).
# The data of interest.
(DT <- data.table(id = c(rep(1154:1155, 2), 1160),
price = c(1.99, 2.50, 15.63, 15.00, 0.75),
key = "id"))
id price
1: 1154 1.99
2: 1154 15.63
3: 1155 2.50
4: 1155 15.00
5: 1160 0.75
# Lookup table.
(lookup <- data.table(id = 1153:1160,
version = c(1,1,3,4,2,1,1,2),
yr = rep(2006, 4),
key = "id"))
id version yr
1: 1153 1 2006
2: 1154 1 2006
3: 1155 3 2006
4: 1156 4 2006
5: 1157 2 2006
6: 1158 1 2006
7: 1159 1 2006
8: 1160 2 2006
# The desired table. Note: lookup[DT] works as well.
DT[lookup, allow.cartesian = T, nomatch=0]
id price version yr
1: 1154 1.99 1 2006
2: 1154 15.63 1 2006
3: 1155 2.50 3 2006
4: 1155 15.00 3 2006
5: 1160 0.75 2 2006
The larger data set consists of two data.frames: temp.3561 (the dataset of interest) and temp.versions (the lookup dataset). They have the same structure as DT and lookup (above), respectively. Using merge() works well, however my application of data.table is clearly flawed:
# Merge data.frames: works just fine
long.merged <- merge(temp.versions, temp.3561, by = "id")
# Convert the data.frames to data.tables
DTtemp.3561 <- as.data.table(temp.3561)
DTtemp.versions <- as.data.table(temp.versions)
# Merge the data.tables: doesn't work
setkey(DTtemp.3561, id)
setkey(DTtemp.versions, id)
DTlong.merged <- merge(DTtemp.versions, DTtemp.3561, by = "id")
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate
key values in i, each of which join to the same group in x over and over again. If that's ok,
try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the
large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-
help for advice.
DTtemp.versions has the same structure as lookup (in the simple example), and the key "id" consists of 779,473 unique values (no duplicates).
DTtemp.3561 has the same structure as DT (in the simple example) plus a few other variables, but its key "id" only has 829 unique values despite the 7,946,667 observations (lots of duplicates).
Since I'm just trying to add version numbers and years from DTtemp.versions to each observation in DTtemp.3561, the merged data.table should have the same number of observations as DTtemp.3561 (7,946,667). Specifically, I don't understand why merge() generates "excess" observations when using data.table but not when using data.frame.
Likewise
# Same error message, but with 12,055,777 observations
altDTlong.merged <- DTtemp.3561[DTtemp.versions]
# Same error message, but with 11,277,332 observations
alt2DTlong.merged <- DTtemp.versions[DTtemp.3561]
Including allow.cartesian=T and nomatch=0 doesn't drop the "excess" observations.
Oddly, if I truncate the dataset of interest to have 10 observations, merge() works fine on both data.frames and data.tables.
# Merge short DF: works just fine
short.3561 <- temp.3561[-(11:7946667),]
short.merged <- merge(temp.versions, short.3561, by = "id")
# Merge short DT
DTshort.3561 <- data.table(short.3561, key = "id")
DTshort.merged <- merge(DTtemp.versions, DTshort.3561, by = "id")
I've been through the FAQ (http://datatable.r-forge.r-project.org/datatable-faq.pdf, and 1.12 in particular). How would you suggest thinking about this?
Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?
Taking your question directly. The error message
Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate key values in i...
states that the result of your join has more rows than either input table, which is more than the usual cases expect. That happens when the join key has duplicates: each duplicated id in one table matches every row with that id in the other, so the matches multiply. And since the result (11,277,332 rows) exceeds nrow(DTtemp.3561) (7,946,667), DTtemp.versions must in fact contain some duplicated id values as well, despite the expectation that it is unique; a quick way to check is sketched below.
If that doesn't answer your question, you should restate it.
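A minimal check for duplicated keys, using the table names from the question:
# ids with more than one row in the lookup table (should be empty if ids are unique)
DTtemp.versions[, .N, by = id][N > 1]
# the same check on the table of interest
DTtemp.3561[, .N, by = id][N > 1]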
I have several .csv files, each one corresponding to a monthly list of customers and some information about them. Each file consists of the same information about customers such as:
names(data.jan)
ID AGE CITY GENDER
names(data.feb)
ID AGE CITY GENDER
To simplify, I will consider only two months, January and February, but my real set of csv files goes from January to November:
Considering a "customer X",I have three possible scenarios:
1- Customer X is listed in the January database, but he left and is not listed in February
2- Customer X is listed in both the January and February databases
3- Customer X entered the database in February, so he is not listed in January
I am stuck on the following problem: I need to create a single database with all customers and their respective information from both dataframes. However, for a customer listed in both dataframes, I want to pick his information from his first entry, that is, January.
When I use merge, I have four options, according to http://www.dummies.com/how-to/content/how-to-use-the-merge-function-with-data-sets-in-r.html
data <- merge(data.jan,data.feb, by="ID", all=TRUE)
Regardless of which all, all.x or all.y I choose, I get the same undesired output called data:
data[1,]
ID AGE.x CITY.x GENDER.x AGE.y CITY.y GENDER.y
123 25 NY M 25 NY M
I think that what would work here is to merge both databases with this type of join:
[join diagram from the original question, not reproduced here]
Then, merge the resulting dataframe with data.jan using a full outer join. But I don't know how to code this in R.
Thanks,
Bernardo
d1 <- data.frame(x=1:9,y=1:9,z=1:9)
d2 <- data.frame(x=1:10,y=11:20,z=21:30) # example data
d3 <- merge(d1,d2, by="x", all=TRUE) #merge
# keep the original columns from January (i.e. y.x, z.x),
# but replace the NAs in those columns with the data from February (i.e. y.y, z.y)
d3[is.na(d3$y.x), 2:3] <- d3[is.na(d3$y.x), 4:5]
#> d3[, 1:3]
# x y.x z.x
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 5 5 5
#6 6 6 6
#7 7 7 7
#8 8 8 8
#9 9 9 9
#10 10 20 30
This may be tiresome for more than 2 months though; perhaps you should consider #flodel's comments. Also note there are pitfalls when your original January data has NAs (and you still want the first month's data, NA or not, retained), although you never mentioned NAs in your question.
Try:
data <- merge(data.jan, data.frame(ID = data.feb$ID), by = "ID")
I haven't tested it since there is no sample data, but if you just join the ID column from February, it should filter out anything that isn't in both frames.
#user1317221_G's solution is excellent. If your tables are large (lots of customers), data.tables might be faster:
library(data.table)
# some sample data
jan <- data.table(id=1:10, age=round(runif(10,25,55)), city=c("NY","LA","BOS","CHI","DC"), gender=rep(c("M","F"),each=5))
new <- data.table(id=11:16, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
feb <- rbind(jan[6:10,],new)
new <- data.table(id=17:22, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
mar <- rbind(jan[1:5,],new)
setkey(jan,id)
setkey(feb,id)
join <- data.table(merge(jan, feb, by="id", all=T))
join[is.na(age.x) , names(join)[2:4]:= join[is.na(age.x),5:7,with=F]]
Edit: This adds processing for multiple months.
f <- function(x,y) {
setkey(x,id)
setkey(y,id)
join <- data.table(merge(x,y,by="id",all=T))
join[is.na(age.x) , names(join)[2:4]:= join[is.na(age.x),5:7,with=F]]
join[,names(join)[5:7]:=NULL] # get rid of extra columns
setnames(join,2:4,c("age","city","gender")) # rename columns that remain
return(join)
}
Reduce("f",list(jan,feb,mar))
Reduce(...) applies the function f(...) to the elements of the list in turn, so first to jan and feb, and then to the result and mar, etc.
DF2
Date EMMI ACT NO2
2011/02/12 12345 21 11
2011/02/14 43211 22 12
2011/02/19 12345 21 13
2011/02/23 43211 13 12
2011/02/23 56341 13 12
2011/03/03 56431 18 20
I need to find the difference between values in a column, grouped by EMMI. For example, for EMMI 12345, the difference between the ACT values on 2011/02/19 and 2011/02/12 is 21 - 21 = 0. I want to do that for the entire ACT column, adding a new column DifACT to hold the values. Can anybody let me know how to do it?
This is the output i want
DF3
Date EMMI ACT NO2 DifACT
2011/02/12 12345 21 11 NA
2011/02/14 43211 22 12 NA
2011/02/19 12345 21 13 0
2011/02/23 43211 13 12 -9
2011/02/23 56341 13 12 5
Try this:
DF3 <- DF2
DF3$difACT <- ave( DF3$ACT, DF3$EMMI, FUN= function(x) c(NA, diff(x)) )
As long as the dates are sorted (within EMMI) this will work; if they are not sorted, we would need to sort within EMMI first. I would probably sort the entire data frame on date first (saving the result of order()), then run the above. If you then need the data back in the original order, you can run order() on the saved ordering to "unorder" the data frame.
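As an alternative, here is a data.table sketch of the same grouped difference, sorting by Date within EMMI first (note this reorders the rows, unlike ave):
library(data.table)
DT2 <- as.data.table(DF2)
setorder(DT2, EMMI, Date)  # dates in yyyy/mm/dd format sort correctly as strings
DT2[, DifACT := c(NA, diff(ACT)), by = EMMI]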
This is based on the plyr package (not tested); note that diff() returns n-1 values, so an NA is prepended to keep the column length right:
library(plyr)
DF3 <- ddply(DF2, .(EMMI), mutate, difACT = c(NA, diff(ACT)))