I have a list of marked individuals (column Mark) which have been captured in various years (column Year) within a range of the river (LocStart and LocEnd). Location on the river is in meters.
I would like to know if a marked individual has used overlapping ranges between years, i.e. if the individual has gone to the same segment of the river from year to year.
Here is an example of the original data set:
ID  Mark  Year  LocStart  LocEnd
1   1081  1992  21,729    22,229
2   1081  1992  21,203    21,703
3   1081  2005  21,508    22,008
4   1126  1994  19,222    19,522
5   1126  1994  18,811    19,311
6   1283  2005  21,754    22,254
7   1283  2007  22,025    22,525
Here is what I would like the final answer to look like:
Mark  Year1  Year2  IDs
1081  1992   2005   1, 3
1081  1992   2005   2, 3
1283  2005   2007   6, 7
In this case, individual 1126 would not be in the final output as the only two ranges available were the same year. I realize it would be easy to remove all the records where Year1 = Year2.
I would like to do this in R and have looked into the IRanges package, but I have not been able to group by Mark or to extract the Year1 and Year2 information.
Using the foverlaps() function from the data.table package:
require(data.table)
setkey(setDT(dt), Mark, LocStart, LocEnd) ## (1)
olaps = foverlaps(dt, dt, type="any", which=TRUE) ## (2)
olaps = olaps[dt$Year[xid] != dt$Year[yid]] ## (3)
olaps[, `:=`(Mark = dt$Mark[xid],
Year1 = dt$Year[xid],
Year2 = dt$Year[yid],
xid = dt$ID[xid],
yid = dt$ID[yid])] ## (4)
olaps = olaps[xid < yid] ## (5)
# xid yid Mark Year1 Year2
# 1: 2 3 1081 1992 2005
# 2: 1 3 1081 1992 2005
# 3: 6 7 1283 2005 2007
(1) We first convert the data.frame to a data.table by reference using setDT. Then we key the data.table on columns Mark, LocStart and LocEnd, which allows us to perform overlapping range joins.
(2) We calculate self overlaps (dt with itself) with any type of overlap, but return only the matching indices by using which = TRUE.
(3) Remove all index pairs where the Year corresponding to xid and yid is identical.
(4) Add all the other columns and replace xid and yid with the corresponding ID values, by reference.
(5) Remove all index pairs where xid >= yid. If row 1 overlaps with row 3, then row 3 also overlaps with row 1. We don't need both. foverlaps() doesn't have a way to remove this by default yet.
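For comparison, the same pairing can be sketched in base R without foverlaps(), assuming the table is small enough for a per-Mark self-merge. The data frame below re-types the example from the question; the "1"/"2" suffixes and the names pairs, keep and res are illustrative choices, not part of the answer above. Two closed intervals overlap when each one starts no later than the other ends.

```r
# Example data re-typed from the question (locations in meters).
dt <- data.frame(
  ID       = 1:7,
  Mark     = c(1081, 1081, 1081, 1126, 1126, 1283, 1283),
  Year     = c(1992, 1992, 2005, 1994, 1994, 2005, 2007),
  LocStart = c(21729, 21203, 21508, 19222, 18811, 21754, 22025),
  LocEnd   = c(22229, 21703, 22008, 19522, 19311, 22254, 22525)
)

# All pairs of records that share a Mark (self-merge; assumed small data).
pairs <- merge(dt, dt, by = "Mark", suffixes = c("1", "2"))

# Keep each unordered pair once, require different years,
# and require the two closed ranges to overlap.
keep <- with(pairs,
             ID1 < ID2 &
             Year1 != Year2 &
             LocStart1 <= LocEnd2 & LocStart2 <= LocEnd1)

res <- pairs[keep, c("Mark", "Year1", "Year2", "ID1", "ID2")]
res
```

The self-merge builds every within-Mark pair before filtering, so on large data the keyed foverlaps() approach is likely the better fit.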
I have a tibble with a list of stocks, each of which has a sector ID. Each sector ID is a string of 8 characters (it is a level 4 GICS sector, see https://en.wikipedia.org/wiki/Global_Industry_Classification_Standard):
tabl <- tibble(Stock=c("A","B","C","D"), SectorId=c("30101010", "30101010", "20103015", "55102010"))
I also have a tibble that maps a SectorId to a ClusterId:
map_tabl <- tibble(ClusterId=c("C1","C1", "C2","C3"), SectorId=c("3010", "3020", "201030", "551020"))
Note that the cluster mapping mixes sectors defined at the 4 different levels (see https://en.wikipedia.org/wiki/Global_Industry_Classification_Standard), i.e. sector "3010" contains sector "30101010". The first 2 characters correspond to Level 1, the first 4 to Level 2, the first 6 to Level 3, and all 8 characters to Level 4. So in this case, for example, "30101010" belongs to the higher-level sector "3010", which is in ClusterId="C1". Note that "30101010" is not specified at all in map_tabl, so I should probably use a function that looks at substrings, like grepl.
The resulting tibble should be:
tibble(Stock=c("A","B","C","D"), SectorId=c("30101010", "30101010", "20103015", "55102010"), ClusterId=c("C1", "C1", "C2", "C3"))
I think we can use a regex (fuzzy) join for this:
library(dplyr)
library(fuzzyjoin) # regex_left_join
map_tabl %>%
mutate(SectorId = paste0("^", SectorId)) %>%
regex_left_join(tabl, ., by = "SectorId")
# # A tibble: 4 x 4
# Stock SectorId.x ClusterId SectorId.y
# <chr> <chr> <chr> <chr>
# 1 A 30101010 C1 ^3010
# 2 B 30101010 C1 ^3010
# 3 C 20103015 C2 ^201030
# 4 D 55102010 C3 ^551020
fuzzyjoin always keeps both versions of the join variables around. It's easy enough to clean up with mutate(SectorId = SectorId.x, SectorId.x = NULL, SectorId.y = NULL) or similar (dropping SectorId.y with select() and renaming SectorId.x also works).
The ^ is prepended to SectorId so that matches only occur at the beginning of the string.
This does not attempt to limit the number of matches, so if multiple rows in map_tabl might match an entry (e.g., SectorId=c("3010", "301010")), then you will need to define a clear way to choose which of these to retain. For this, I assume either that Stock is a unique ID of sorts or, if not, that you can add one yourself to make sure you end the operation with the same rows (no dupes) as before the join.
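One possible tie-breaking rule, sketched here in base R rather than with fuzzyjoin, is to keep the longest matching prefix, i.e. the most specific GICS level. The helper name longest_prefix_cluster is hypothetical, and plain data frames stand in for the tibbles from the question.

```r
tabl <- data.frame(
  Stock    = c("A", "B", "C", "D"),
  SectorId = c("30101010", "30101010", "20103015", "55102010"),
  stringsAsFactors = FALSE
)
map_tabl <- data.frame(
  ClusterId = c("C1", "C1", "C2", "C3"),
  SectorId  = c("3010", "3020", "201030", "551020"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the ClusterId of the longest map prefix
# that the 8-character sector starts with, or NA if none matches.
longest_prefix_cluster <- function(sector, map) {
  hits <- which(startsWith(sector, map$SectorId))
  if (length(hits) == 0) return(NA_character_)
  map$ClusterId[hits[which.max(nchar(map$SectorId[hits]))]]
}

tabl$ClusterId <- vapply(tabl$SectorId, longest_prefix_cluster,
                         character(1), map = map_tabl, USE.NAMES = FALSE)
tabl
```

Because each stock receives exactly one value, the row count of tabl cannot change, which sidesteps the duplicate-match issue described above.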
I have a data frame made up of several columns, each corresponding to a different industry per country. I have 56 industries and 43 countries, and I'd like to select only industries 5 to 22 per country (18 industries). The big issue is that each industry per country is named as: AUS1, AUS2 ..., AUS56. What I need to select is AUS5 to AUS22, AUT5 to AUT22 ....
A viable solution could be to select columns according to the following algorithm: the first column of interest, i.e., AUS5, corresponds to column 10, and then I select up to AUS22 (corresponding to column 27). Then I should skip all the remaining columns for AUS (i.e. AUS23 to AUS56) and the first 4 columns for the next country (from AUT1 to AUT4). Then I select, as before, industries 5 to 22 for AUT. Basically, the algorithm, starting from column 10, should select 18 columns (including column 10), skip the next 38 columns, and then select the next 18 columns. This process should be repeated for all 43 countries.
How can I code that?
UPDATE, Example:
df=data.frame(industry = c("C10","C11","C12","C13"),
country = c("USA"),
AUS3 = runif(4),
AUS4 = runif(4),
AUS5 = runif(4),
AUS6 = runif(4),
DEU5 = runif(4),
DEU6 = runif(4),
DEU7 = runif(4),
DEU8 = runif(4))
# I'm interested only in C10-C11 (filter() and %>% come from dplyr):
library(dplyr)
df_a=df %>% filter(grepl('C10|C11',industry))
df_a
# Thus, how can I select columns AUS10, AUS11, DEU10, DEU11 efficiently, considering that I have a huge dataset?
Demonstrating the paste0 approach.
ctr <- unique(gsub('\\d', '', names(df[-(1:2)])))
# ctr <- c("AUS", "DEU") ## alternatively hard-coded
ind <- c(10, 11)
subset(df, industry %in% paste0('C', 10:11), # %in% avoids relying on vector recycling
select=c('industry', 'country', paste0(rep(ctr, each=length(ind)), ind)))
# industry country AUS10 AUS11 DEU10 DEU11
# 1 C10 USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2 C11 USA 0.7421840 0.6808892 0.9050158 0.3689741
Or, since you appear to like grep, you could do:
df[grep('10|11', df$industry), grep('industry|country|[A-Z]{3}1[01]', names(df))]
# industry country AUS10 AUS11 DEU10 DEU11
# 1 C10 USA 0.3376674 0.1568496 0.5033433 0.7327734
# 2 C11 USA 0.7421840 0.6808892 0.9050158 0.3689741
If you have a big data set in memory, data.table could be ideal and much faster than alternatives. Something like the following could work, though you will need to play with select_ind and select_ctr as desired on the real dataset.
It might be worth giving us a slightly larger toy example, if possible.
library(data.table)
setDT(df)
select_ind <- paste0(c("C"), c("11","10"))
select_ctr <- paste0(rep(c("AUS", "DEU"), each = 2), c("10","11"))
df[grepl(paste0(select_ind, collapse = "|"), industry), # select rows
..select_ctr] # select columns
AUS10 AUS11 DEU10 DEU11
1: 0.9040223 0.2638725 0.9779399 0.1672789
2: 0.6162678 0.3095942 0.1527307 0.6270880
For more information, see Introduction to data.table.
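The positional scheme described in the question (start at column 10, take 18 columns, skip 38, repeat for 43 countries) can also be sketched directly in base R. The constants come from the question. The variable names, the toy vector nm and the commented-out selection line are illustrative assumptions.

```r
first_col   <- 10   # position of AUS5, as stated in the question
n_keep      <- 18   # industries 5..22
n_per_ctry  <- 56   # columns per country
n_countries <- 43

# First selected column of every country block: 10, 66, 122, ...
starts <- first_col + n_per_ctry * (seq_len(n_countries) - 1)

# For each start, the 18 consecutive positions start + 0:17.
idx <- as.vector(outer(0:(n_keep - 1), starts, `+`))

# df_sel <- df[, idx]   # hypothetical: apply to the real wide data frame

# Name-based alternative that does not depend on column order:
# three capital letters followed by a number from 5 to 22.
nm  <- c("industry", "country", "AUS4", "AUS5", "AUS22", "AUS23", "DEU5", "DEU56")
sel <- grep("^[A-Z]{3}([5-9]|1[0-9]|2[0-2])$", nm, value = TRUE)
```

The name-based pattern is usually the safer choice, since it keeps working if columns are reordered or a country is missing.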
I have two data frames, df1 has information about a publication's year, outlet name, total articles in this publication in a year, and a cumulative sum of articles over the period of time I'm studying. df2 has a random sample of article IDs, with potential values ranging from 1 to the total number of articles given by df1$cumsum.
What I need to do is take each article ID in df2 and identify which publication and year it falls under, using the information contained in df1.
Here's a minimally reproducible example:
set.seed(890)
df1 <- NULL
df1$year <- c(2000:2009, 2000:2009)
df1$outlet <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,2,2,2,2,2,2,2,2,2)
df1$article_total <- sample(1:200, 20, replace = T)
df1$cumsum <- cumsum(df1$article_total)
df1 <- as.data.frame(df1)
df2 <- NULL
df2$art_num <- sample(1:2102, 100, replace = T) # get random sample of article IDs for the total number of articles I have in this db
df2 <- as.data.frame(df2)
Ideally, I would also like to calculate an article's ID in each year. For example, in the data above, outlet 1 has 14 articles in the year 2000 and 168 in 2001 (cumsum = 183). If I have an article ID of 156, I would like to know that it is the 142nd article in the year 2001 of publication 1. And so on and so forth for every article ID I have in this database.
I was thinking I should do this with a for loop, but I'm 100% lost in writing it. Here's what I began writing, but I have a feeling I'm not on the right track with it:
for (i in seq_along(df2$art_num)) {
article_number <- df2$art_num[i]
if (article_number %in% df1$cumsum){ # note: cumsum should be an interval before doing this?
# get article number, year, publication in new df
# also calculate article ID in each year/publication
}
}
Thanks in advance for any help! I'm still lost with writing loops in R...
#######################
EDITED EXAMPLE as per Frank's suggestion
set.seed(890)
df1 <- NULL
df1$year <- c(2000:2002, 2000:2002)
df1$outlet <- c(1, 1, 1, 2,2,2)
df1$article_total <- sample(1:50, 6, replace = T)
df1$cumsum <- cumsum(df1$article_total)
df1 <- as.data.frame(df1)
df2 <- NULL
df2$art_id <- c(66, 120, 77, 156, 24)
df2 <- as.data.frame(df2)
Here's the output I'm looking for:
art_id outlet year article_number
1 66 1 2002 19
2 120 2 2000 35
3 77 1 2002 30
4 156 2 2001 35
5 24 1 2001 20
This example shows my ideal output, df3, which I built by hand. It has one column with the article's ID, the appropriate outlet, the year, and a new variable art_number. This differs from the article ID in that I calculated it from df1$cumsum and df3$art_id. In this example, the first row shows that the first article in my database has an ID of 66. I obtain an art_number value of 19 because this article (id = 66) is the 19th article published in the year 2002 by outlet 1. I calculated this value by looking at the article ID, locating the year and outlet based on df1$cumsum, and then subtracting the df1$cumsum value for the previous year from the art_id value. So for this specific article, I calculated df3$art_number[1] = df3$art_id[1] - df1$cumsum[2].
I need to do this calculation for every article in my data base so I don't do this process by hand forever.
I think your data structure makes sense, though it would be easier with one additional column, for the first article in a year and outlet:
library(data.table)
setDT(df1); setDT(df2)
df1[, art_cstart := shift(cumsum(article_total), fill=0L) + 1L]
year outlet article_total cumsum art_cstart
1: 2000 1 4 4 1
2: 2001 1 43 47 5
3: 2002 1 38 85 48
4: 2000 2 36 121 86
5: 2001 2 39 160 122
6: 2002 2 8 168 161
Now we can do a rolling update join, "rolling" each art_id to the next cumsum at or above it (roll = -Inf) and computing each desired column. The values in j must be listed in the same order as the column names on the left of :=:
df2[, c("outlet", "year", "art_num") := df1[df2, on=.(cumsum = art_id), roll=-Inf, .(
x.outlet,
x.year,
i.art_id - x.art_cstart + 1L
)]]
art_id outlet year art_num
1: 66 1 2002 19
2: 120 2 2000 35
3: 77 1 2002 30
4: 156 2 2001 35
5: 24 1 2001 20
How it works
x[i, on=, roll=, j] is the syntax for a join, looking up each row of i in x.
In this join j evaluates to a list of columns, .(...) shorthand for list(...).
Column assignment is done with (colnames) := .(...).
The assignment is to the existing table df2 instead of unnecessarily creating a new table.
For details on how data.table syntax works, see the startup messages...
> library(data.table)
data.table 1.10.4
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
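The same interval lookup can be sketched in base R with findInterval(), assuming df1 is sorted so that cumsum is increasing. The article_total values are copied from the table printed in the answer above. The names hit, prev_cumsum and res are illustrative.

```r
df1 <- data.frame(
  year          = rep(2000:2002, 2),
  outlet        = rep(1:2, each = 3),
  article_total = c(4, 43, 38, 36, 39, 8)   # values from the printed table
)
df1$cumsum <- cumsum(df1$article_total)

art_id <- c(66, 120, 77, 156, 24)

# left.open = TRUE counts the cumsum values strictly below each art_id,
# so adding 1 gives the row whose range (previous cumsum, cumsum] holds it.
hit <- findInterval(art_id, df1$cumsum, left.open = TRUE) + 1

# Cumulative count before that row starts (0 for the first row).
prev_cumsum <- c(0, df1$cumsum)[hit]

res <- data.frame(
  art_id  = art_id,
  outlet  = df1$outlet[hit],
  year    = df1$year[hit],
  art_num = art_id - prev_cumsum
)
res
```

An art_id larger than the last cumsum would index past the end of df1 and yield NA, so real data may need a range check first.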
This is the code you need I think:
df3 <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(df3) <- c("articleNumber", "year", "publication")
# lower bound of each interval: the previous row's cumsum (0 for the first row)
lower <- c(0, head(df1$cumsum, -1))
for (i in seq_along(df2$art_num)) {
  for (j in seq_len(nrow(df1))) {
    # article i falls in row j when lower[j] < art_num <= cumsum[j]
    if ((df2$art_num[i] > lower[j]) && (df2$art_num[i] <= df1$cumsum[j])) {
      # note: cumsum should be an interval before doing this? NOT REALLY SURE
      # WHAT YOU NEED HERE
      # get article number, year, publication in new df
      df3[i, 1] <- df2$art_num[i]
      df3[i, 2] <- df1$year[j]
      df3[i, 3] <- df1$outlet[j]
      # also calculate article ID in each year/publication ISN'T THIS
      # art_num?
    }
  }
}
I'm having trouble applying a simple data.table join example to a larger (10GB) data set. merge() works just fine on data.frames with the larger dataset, although I'd love to take advantage of the speed in data.table. Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?
Here is the simple example (derived from this thread: Join of two data.tables fails).
# The data of interest.
(DT <- data.table(id = c(rep(1154:1155, 2), 1160),
price = c(1.99, 2.50, 15.63, 15.00, 0.75),
key = "id"))
id price
1: 1154 1.99
2: 1154 15.63
3: 1155 2.50
4: 1155 15.00
5: 1160 0.75
# Lookup table.
(lookup <- data.table(id = 1153:1160,
version = c(1,1,3,4,2,1,1,2),
yr = rep(2006, 4),
key = "id"))
id version yr
1: 1153 1 2006
2: 1154 1 2006
3: 1155 3 2006
4: 1156 4 2006
5: 1157 2 2006
6: 1158 1 2006
7: 1159 1 2006
8: 1160 2 2006
# The desired table. Note: lookup[DT] works as well.
DT[lookup, allow.cartesian = T, nomatch=0]
id price version yr
1: 1154 1.99 1 2006
2: 1154 15.63 1 2006
3: 1155 2.50 3 2006
4: 1155 15.00 3 2006
5: 1160 0.75 2 2006
The larger data set consists of two data.frames: temp.3561 (the dataset of interest) and temp.versions (the lookup dataset). They have the same structure as DT and lookup (above), respectively. Using merge() works well, however my application of data.table is clearly flawed:
# Merge data.frames: works just fine
long.merged <- merge(temp.versions, temp.3561, by = "id")
# Convert the data.frames to data.tables
DTtemp.3561 <- as.data.table(temp.3561)
DTtemp.versions <- as.data.table(temp.versions)
# Merge the data.tables: doesn't work
setkey(DTtemp.3561, id)
setkey(DTtemp.versions, id)
DTlong.merged <- merge(DTtemp.versions, DTtemp.3561, by = "id")
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate
key values in i, each of which join to the same group in x over and over again. If that's ok,
try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the
large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-
help for advice.
DTtemp.versions has the same structure as lookup (in the simple example), and the key "id" consists of 779,473 unique values (no duplicates).
DTtemp.3561 has the same structure as DT (in the simple example) plus a few other variables, but its key "id" only has 829 unique values despite the 7,946,667 observations (lots of duplicates).
Since I'm just trying to add version numbers and years from DTtemp.versions to each observation in DTtemp.3561, the merged data.table should have the same number of observations as DTtemp.3561 (7,946,667). Specifically, I don't understand why merge() generates "excess" observations when using data.table but not when using data.frame.
Likewise
# Same error message, but with 12,055,777 observations
altDTlong.merged <- DTtemp.3561[DTtemp.versions]
# Same error message, but with 11,277,332 observations
alt2DTlong.merged <- DTtemp.versions[DTtemp.3561]
Including allow.cartesian=T and nomatch=0 doesn't drop the "excess" observations.
Oddly, if I truncate the dataset of interest to 10 observations, merge() works fine on both data.frames and data.tables.
# Merge short DF: works just fine
short.3561 <- temp.3561[-(11:7946667),]
short.merged <- merge(temp.versions, short.3561, by = "id")
# Merge short DT
DTshort.3561 <- data.table(short.3561, key = "id")
DTshort.merged <- merge(DTtemp.versions, DTshort.3561, by = "id")
I've been through the FAQ (http://datatable.r-forge.r-project.org/datatable-faq.pdf, and 1.12 in particular). How would you suggest thinking about this?
Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?
Answering your question directly, the error message
Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate key values in i...
states that the result of your join has more rows than a typical join is expected to produce. This means the lookup table key has duplicates, which result in multiple matches per row in the join.
If this doesn't answer your question, you should restate it.
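A small base R sketch of that mechanism, using made-up toy data (the names dat, lookup and merged are illustrative): when a key value is repeated on both sides, every copy on one side pairs with every copy on the other, so the result can exceed the row count of either input. anyDuplicated() gives a quick check on the lookup key.

```r
# Toy lookup table in which id 2 appears twice.
lookup <- data.frame(id = c(1, 2, 2), version = c(1, 3, 4))

# Toy data of interest: id 1 twice, id 2 three times.
dat <- data.frame(id    = c(1, 1, 2, 2, 2),
                  price = c(1.99, 15.63, 2.50, 15.00, 0.75))

merged <- merge(dat, lookup, by = "id")

# Rows per id multiply: 2 * 1 (id 1) + 3 * 2 (id 2).
nrow(merged)

# A non-zero value means the key is not unique
# (it is the index of the first duplicated entry).
anyDuplicated(lookup$id)
```

Running the same check on the real lookup key, and also comparing column types and stray whitespace in the ids, would be a reasonable first diagnostic.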
I have several .csv files, each one corresponding to a monthly list of customers and some information about them. Each file consists of the same information about customers such as:
names(data.jan)
ID AGE CITY GENDER
names(data.feb)
ID AGE CITY GENDER
To simplify, I will consider only two months, January and February, but my real set of csv files goes from January to November.
Considering a "customer X", I have three possible scenarios:
1- Customer X is listed in the january database, but he left and now is not listed in february
2- Customer X is listed in both january and february databases
3- Customer X entered the database in february, so he is not listed in january
I am stuck on the following problem: I need to create a single database with all customers and their respective information that are listed in both dataframes. However, considering a customer that is listed in both dataframes, I want to pick his information from his first entry, that is, january.
When I use merge, I have four options, according to http://www.dummies.com/how-to/content/how-to-use-the-merge-function-with-data-sets-in-r.html
data <- merge(data.jan,data.feb, by="ID", all=TRUE)
Regardless of which all, all.x or all.y I choose, I get the same undesired output called data:
data[1,]
ID AGE.x CITY.x GENDER.x AGE.y CITY.y GENDER.y
123 25 NY M 25 NY M
I think that what would work here is to merge both databases with this type of join:
Then, merge the resulting dataframe with data.jan with the full outer join. But I don't know how to code this in R.
Thanks,
Bernardo
d1 <- data.frame(x=1:9,y=1:9,z=1:9)
d2 <- data.frame(x=1:10,y=11:20,z=21:30) # example data
d3 <- merge(d1,d2, by="x", all=TRUE) #merge
# keep the original columns from January (i.e. y.x, z.x)
# but replace the NAs in those columns with the data from February (i.e. y.y, z.y)
d3[is.na(d3[,2]), 2:3] <- d3[is.na(d3[,2]), 4:5]
#> d3[, 1:3]
# x y.x z.x
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 5 5 5
#6 6 6 6
#7 7 7 7
#8 8 8 8
#9 9 9 9
#10 10 20 30
This may get tiresome for more than 2 months though, so perhaps you should consider @flodel's comments. Also note that there are demons when your original January data has NAs (and you still want the first month's data retained, NA or not), although you never mentioned them in your question.
Try:
data <- merge(data.jan,data.frame(ID=data.feb$ID), by="ID")
I haven't tested this since no data was provided, but if you join only the ID column from February, it should filter out anything that isn't in both frames.
@user1317221_G's solution is excellent. If your tables are large (lots of customers), data.table might be faster:
library(data.table)
# some sample data
jan <- data.table(id=1:10, age=round(runif(10,25,55)), city=c("NY","LA","BOS","CHI","DC"), gender=rep(c("M","F"),each=5))
new <- data.table(id=11:16, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
feb <- rbind(jan[6:10,],new)
new <- data.table(id=17:22, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
mar <- rbind(jan[1:5,],new)
setkey(jan,id)
setkey(feb,id)
join <- data.table(merge(jan, feb, by="id", all=T))
join[is.na(age.x) , names(join)[2:4]:= join[is.na(age.x),5:7,with=F]]
Edit: This adds processing for multiple months.
f <- function(x,y) {
setkey(x,id)
setkey(y,id)
join <- data.table(merge(x,y,by="id",all=T))
join[is.na(age.x) , names(join)[2:4]:= join[is.na(age.x),5:7,with=F]]
join[,names(join)[5:7]:=NULL] # get rid of extra columns
setnames(join,2:4,c("age","city","gender")) # rename columns that remain
return(join)
}
Reduce("f",list(jan,feb,mar))
Reduce(...) applies the function f(...) to the elements of the list in turn, so first to jan and feb, and then to the result and mar, etc.
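The same fold can be sketched in base R without merging at all, under the assumption that all monthly tables share the same columns: each step appends only those customers whose ID has not been seen yet, so the earliest record wins. The toy data and the helper name keep_first are illustrative, and this swaps the merge-and-fill technique above for a simpler append-new-rows one.

```r
jan <- data.frame(ID = c(1, 2, 3), CITY = c("NY", "LA", "BOS"), stringsAsFactors = FALSE)
feb <- data.frame(ID = c(2, 3, 4), CITY = c("SF", "BOS", "DC"), stringsAsFactors = FALSE)
mar <- data.frame(ID = c(1, 4, 5), CITY = c("NY", "CHI", "DC"), stringsAsFactors = FALSE)

# Hypothetical helper: keep everything accumulated so far and add only
# the rows of the next month whose ID is not already present.
keep_first <- function(acc, nxt) rbind(acc, nxt[!(nxt$ID %in% acc$ID), ])

# Reduce folds left to right: (jan, feb) first, then (result, mar).
all_customers <- Reduce(keep_first, list(jan, feb, mar))
all_customers
```

Here customer 2 keeps the January city ("LA") and customer 4 keeps the February one ("DC"), since those are their first appearances. Unlike the NA-filling approach, a genuine NA in the first month's record is retained as-is.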