I want to create a co-occurrence matrix based on the recommended code here (also see below). It works fine for most of the dataframes I work with. However, for larger dataframes I get the following error messages, either when using data.table::melt ...
negative length vectors are not allowed
... or later on using base::crossprod
error in crossprod: attempt to make a table with >=2^31 elements
Both are related to the size of the dataframe: the first error relates to the number of rows, while in the second the size of the resulting table exceeds the limit.
I'm aware of the solutions for the first issue (data.table::melt) proposed by [2], [3] and [4], as well as those for the second issue (base::crossprod) by [5] and [6], and I've seen [7], but I'm not sure how to adapt them properly to my situation; a sketch of my understanding follows below, after the conditions. I have tried splitting the dataframe by ID into several dataframes, merging them, and calculating the co-occurrence matrix, but this only produced additional error messages (e.g., cannot allocate vector of size 17.8 GB).
Reproducible Example
I have an assembled dataframe created by plyr::join that looks like this (but, of course, a lot larger):
df <- data.frame(ID = c(1, 2, 3, 200000),
                 C1 = c("England", "England", "England", "China"),
                 C2 = c("England", "China", "China", "England"),
                 C5850 = c("England", "China", "China", "England"),
                 SC1 = c("FOO", "BAR", "EAT", "FOO"),
                 SC2 = c("MERCI", "EAT", "EAT", "EAT"),
                 SC5850 = c("FOO", "MERCI", "FOO", "FOO"))
ID      C1      C2      ... C5850   SC1 SC2   ... SC5850
1       England England     England FOO MERCI     FOO
2       England China       China   BAR EAT       MERCI
3       England China       China   EAT EAT       EAT
200000  China   England     England FOO EAT       FOO
Original Code
# rename so that all measure columns share the SCCOUNTRY prefix used below
colnames(df) <- c("ID", paste0("SCCOUNTRY", 1:6))
library(data.table)
# reshape to long format, dropping empty and missing values
foo <- melt(setDT(df), id.vars = "ID", measure = patterns("^SCCOUNTRY"))[
  nchar(value) > 0 & complete.cases(value)]
# keep one row per ID/country combination
foo2 <- unique(foo, by = c("ID", "value"))
# country-by-country co-occurrence counts from the ID-by-country table
mymat <- crossprod(table(foo2[, c(1, 3)]))
# a country appearing in at most one row gets no self-combination
diag(mymat) <- ifelse(diag(mymat) <= 1, 0, diag(mymat))
Conditions (for the calculation of the co-occurrence matrix)
Single observations without additional observations in the same ID/row are not considered, i.e. a row containing just one country once is counted as 0.
A combination/co-occurrence is counted as 1.
Being part of a combination also counts as a self-combination (USA-USA), i.e. a value of 1 is assigned.
No value above 1 is assigned to any combination within a row/ID.
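For reference, here is how I understand the sparse-matrix idea from [5] and [6] would translate to the table/crossprod step, building a sparse ID-by-country incidence matrix instead of a dense table (a sketch only; I have not verified it at full scale):
library(Matrix)
# foo2 as above: one row per unique (ID, value) pair
ids <- factor(foo2$ID)
vals <- factor(foo2$value)
# sparse incidence matrix: IDs in rows, countries in columns
inc <- sparseMatrix(i = as.integer(ids), j = as.integer(vals), x = 1,
                    dimnames = list(levels(ids), levels(vals)))
# country-by-country co-occurrence; the result is small enough to densify
mymat <- as.matrix(Matrix::crossprod(inc))
diag(mymat) <- ifelse(diag(mymat) <= 1, 0, diag(mymat))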
Related
I have some GIS data with origins and destinations (OD) and information about the time of day of each OD. I intend to make a map of this and to color the ODs by the time-of-day information.
One issue is that some ODs appear in the data set with both Day and Night, possibly with origin and destination in a different order. I would like to mark those differently, e.g. as "Day/Night".
Is there an easy way to do this? My MWE contains just one such OD, but I would need to identify it among several others. I can manage to find the duplicates regardless of order, but I don't know how to find out whether both time cases are present and how to replace them with "Day/Night".
library(data.table)
Origin <- c("London", "Paris", "Lisbon", "Madrid", "Berlin", "London")
Destination <- c("Paris", "London", "Berlin", "Lisbon", "Lisbon", "Paris")
Time <- factor(c("Day", "Night", "Day", "Day/Night", "Day", "Day/Night"))
dt <- data.table(Origin = Origin, Destination = Destination, Time = Time)
# duplicates regardless of order
dat.sort <- t(apply(dt[, .(Origin, Destination)], 1, sort))
dt[duplicated(dat.sort) | duplicated(dat.sort, fromLast = TRUE), ]
You can do that using the dplyr package as follows; feel free to change the conditions to whatever fits your needs.
library(data.table)
library(dplyr)
# Creating data
dt <-
data.table(
Origin = c("London", "Paris", "Italy", "Spain", "Portugal", "Poland"),
Destination = c("Paris", "London", "Norway", "Portugal", "Spain", "Spain"),
Time = c("Day", "Night", "Day", NA_character_, NA_character_, NA_character_)
)
dt
# Origin Destination Time
# London Paris Day
# Paris London Night
# Italy Norway Day
# Spain Portugal <NA>
# Portugal Spain <NA>
# Poland Spain <NA>
dt %>%
# pmin and pmax are used to sort the 2 columns
# in order to group by them regardless to their order
group_by(Origin2 = pmin(Origin, Destination),
Destination2 = pmax(Origin, Destination)) %>%
mutate(count = n(), # to check whether the Origin/Destination pair is repeated
       row = row_number(), # placeholder: first or second occurrence
       # if not repeated, then Time = Day
       # if repeated and first occurrence, then Time = Day
       # if repeated and second occurrence, then Time = Night
       Time = case_when(count == 1 ~ "Day",
                        count == 2 & row == 1 ~ "Day",
                        count == 2 & row == 2 ~ "Night")) %>%
ungroup() %>%
select(Origin, Destination, Time)
# Origin Destination Time
# <chr> <chr> <chr>
# 1 London Paris Day
# 2 Paris London Night
# 3 Italy Norway Day
# 4 Spain Portugal Day
# 5 Portugal Spain Night
# 6 Poland Spain Day
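A note on the pmin/pmax trick used above: both functions work element-wise on character vectors, comparing them lexicographically, so either ordering of a pair yields the same grouping key:
pmin(c("London", "Paris"), c("Paris", "London"))
# [1] "London" "London"
pmax(c("London", "Paris"), c("Paris", "London"))
# [1] "Paris" "Paris"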
Thanks for the dplyr solution by @Nareman Darwisch, which gave me the inspiration for my data.table solution.
I am creating a new variable as a unique ID for each Origin-Destination pair:
dat.sort <- t(apply(dt[, .(Origin, Destination)], 1, sort))
dt.temp <- data.table(dat.sort)
dt.temp[, unique.name := paste(V1, V2)]
dt$unique.name <- factor(dt.temp$unique.name)
Then I can either calculate the number of unique Time values by group, or check whether a group matches more than one of the 3 levels. Based on this I can recode the labels to the "Day/Night" level whenever the count is > 1 or the other condition is TRUE.
dt[, No.levels := length(unique(c(Time))), by = unique.name]
dt[, No.levels.logi := sum(c(Time) %in% c(1:3)) > 1, by = unique.name]
What I would like to understand is how I could use a logical condition that looks at the levels by group and compares those with the cases I want:
dt[, No.levels.logi := sum(levels(Time) %in% c("Day", "Night")) > 1, by = unique.name]
But I guess the levels command always gives me all three levels.
If I understand correctly, the OP wants to
identify city pairs regardless of the order of origin and destination, e.g. London-Paris belongs to the same city pair as Paris-London
collapse separate rows if a city pair is operated Day and Night or Day/Night
or update the original dataset
This is what I would do:
library(data.table)
dt <- data.table(Origin, Destination, Time)
# add city pair as unique grouping variable
dt[, Pair := paste(pmin(Origin, Destination), pmax(Origin, Destination), sep = "-")][]
# identify city pairs which are operated day and night
pairs_DN <- dt[, all(c("Day", "Night") %in% Time) | "Day/Night" %in% Time, by = Pair][(V1), .(Pair)]
# update original dataset by an update join
dt[pairs_DN, on = "Pair", Time := "Day/Night"][]
Origin Destination Time Pair
1: London Paris Day/Night London-Paris
2: Paris London Day/Night London-Paris
3: Lisbon Berlin Day Berlin-Lisbon
4: Madrid Lisbon Day/Night Lisbon-Madrid
5: Berlin Lisbon Day Berlin-Lisbon
6: London Paris Day/Night London-Paris
The key point is to identify the city pairs which fulfil the second requirement:
dt[, all(c("Day", "Night") %in% Time) | "Day/Night" %in% Time, by = Pair]
Pair V1
1: London-Paris TRUE
2: Berlin-Lisbon FALSE
3: Lisbon-Madrid TRUE
So, there is no need to deal with factor levels. BTW, factor levels are an attribute of the whole column and do not change when subsetting or grouping. What does change is which of the levels are used in a subset or group.
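A quick illustration of that point:
f <- factor(c("Day", "Night"))
levels(f)         # "Day" "Night"
levels(f[1])      # still "Day" "Night": subsetting keeps the levels attribute
unique(f[1])      # value Day, but the unused level is still attached
droplevels(f[1])  # explicitly drops the unused "Night" level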
pairs_DN contains the unique key of those city pairs
Pair
1: London-Paris
2: Lisbon-Madrid
I am new to R, and I am having some trouble manipulating the data in the way I need for my analysis. I would be grateful if anyone could help, because this is essential for my research.
I already asked a similar question, but the answer I got did not fully address my problem; I will try to be clearer this time to see if anyone can help.
My data looks something like this:
df <- data.frame(
  "Reporter" = c("USA", "USA", "USA", "USA", "USA", "USA"),
  "Partner" = c("EU", "EU", "EU", "EU", "EU", "EU"),
  "Product.cat" = c("1", "11", "111", "112", "12", "2"),
  "Product.Description" = c("Food", "Fruit", "Apple",
                            "Banana", "Meat", "Manifactured"),
  "Year" = c(1970, 1970, 1970, 1970, 1970, 1970),
  "trade.value" = c(100, 50, 30, 20, 50, 220),
  stringsAsFactors = FALSE)
I have country-year observations about trade.
The variable Product.cat indicates what kind of commodity is exported. The more digits a Product.cat has, the more disaggregated the trade information is.
For example, Product.cat 111 (e.g. apples) and 112 (e.g. bananas) are sub-categories of product category 11 (e.g. fruit).
The same holds at higher levels of aggregation: product category 11 (fruit) is a subcategory of Product.cat 1 (food), together with Product.cat 12 (meat).
Note that data in lower categories is nested in the higher levels of aggregation. Hence the value of Product.cat 11 (50) equals the value of Product.cat 111 (30) plus that of Product.cat 112 (20).
To do my analysis I need to identify those values that are not reported at the most disaggregated level possible - i.e. I need to identify the data not reported at the 3-digit level.
My problem is that for some country-year observations I have data reported at all levels of aggregation (e.g. 1, 11, 111, 112), while for others I only have data at a higher level of aggregation (e.g. 12 and 2). For instance, in my example I only have Product.cat 12 (meat), but no data on the kind of meat: Product.cat 121 (pork) or Product.cat 122 (veal).
Similarly, data on Product.cat 2 (manufacturing) is not reported at lower levels:
we do not know whether it is Product.cat 21 (clothing) or Product.cat 22 (wood products).
In other words, I have data reported at the 2-digit level (12) or 1-digit level (2) that could be reported at the 3-digit level. Note that every category should be disaggregated to the 3-digit level.
What I would like to do is find a way to identify all the data exclusively reported at a higher level of aggregation and change their Product.cat name by appending an "m".
After manipulation, Product.cat 12 should become 12m to indicate that the data was reported only at the 2-digit level.
Similarly, I would like to identify exports reported only at the first digit: Product.cat 2 should become 2mm to reflect that the data was reported only at the 1-digit level.
To be sure, only the data for which I have information exclusively at a higher level of aggregation - i.e. 12 and 2 in the example - should receive "m"s.
For instance, I do not want 1mm, since I have data at a lower level of aggregation (11, 12). Similarly, I do not want 11m, because I have data at lower levels of aggregation (111, 112). What I would like to have is 12m and 2mm, because those data are reported only at a higher level of aggregation.
I know this is a very specific question, but I would really appreciate it if anyone could help.
Note: in the real dataset, due to measurement errors, the disaggregated values do not always add up exactly to the higher level of aggregation (for instance, 111 + 112 can be > 11). Hence, ideally, I am looking for a function that decides when to add the "m" based on the number of digits, grouped by country, partner and year, rather than based on the sum of the traded values.
I really thank everyone who can help with this; it would be a huge step forward for my research.
Attempts
I have been working on this function, but it does not seem to do what I am looking for. Maybe someone can spot what is going wrong.
fillLevel <- function(x, width = 3, fill = "m"){
  # split codes by their first digit (top-level category)
  sp <- split(x, substr(x, 1, 1))
  sp <- lapply(seq_along(sp), function(i){
    n <- nchar(sp[[i]])
    # only if no code in the group reaches the full width...
    if(all(n < width)){
      j <- which(n == max(n))
      # ...pad the longest codes on the right and turn the padding into fill
      sp[[i]][j] <- gsub(" ", fill, formatC(sp[[i]][j], width = -width))
    }
    sp[[i]]
  })
  unname(unlist(sp))
}
library(dplyr)
df <- df %>% mutate(prdcat2 = fillLevel(Product.cat))
As you can see, it only identifies 2mm but not 12m. Moreover, when I run it on more complex codes it messes up the order of my data. I think this relates to sp <- lapply(seq_along(sp), ...), but I am not sure how to go about it.
Best
Here's one way to do it:
library(data.table)
setDT(df)
# tag levels
df[, lvl := nchar(Product.cat)]
df[lvl < 3L, has_subcat := FALSE]
# use level-3 observations to flag level-2s as okay
df[
df[lvl == 3, .(Reporter, Partner, Year, Product.cat = substr(Product.cat, 1, 2))],
on=.(Reporter, Partner, Year, Product.cat),
has_subcat := TRUE
]
# use level-2 observations to flag level-1s as okay
df[
df[lvl == 2, .(Reporter, Partner, Year, Product.cat = substr(Product.cat, 1, 1))],
on=.(Reporter, Partner, Year, Product.cat),
has_subcat := TRUE
]
# create new cat, flagging observations with no subcategories
df[, newcat := Product.cat]
df[has_subcat == FALSE, newcat := paste0(Product.cat, strrep("m", 3-lvl))]
Reporter Partner Product.cat Product.Description Year trade.value lvl has_subcat newcat
1: USA EU 1 Food 1970 100 1 TRUE 1
2: USA EU 11 Fruit 1970 50 2 TRUE 11
3: USA EU 111 Apple 1970 30 3 NA 111
4: USA EU 112 Banana 1970 20 3 NA 112
5: USA EU 12 Meat 1970 50 2 FALSE 12m
6: USA EU 2 Manifactured 1970 220 1 FALSE 2mm
I'm assuming that this should be done separately per Reporter-Partner-Year.
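If the real data ever goes deeper than three digits, the two flagging joins above could be generalized with a loop; here is a sketch, where maxd is an assumed maximum code depth (not part of the original question):
maxd <- 3L  # assumed maximum number of digits in Product.cat
df[, lvl := nchar(Product.cat)]
df[lvl < maxd, has_subcat := FALSE]
# for each depth d, flag the (d-1)-digit parents of observed d-digit codes
for (d in seq(maxd, 2L)) {
  df[
    unique(df[lvl == d, .(Reporter, Partner, Year,
                          Product.cat = substr(Product.cat, 1L, d - 1L))]),
    on = .(Reporter, Partner, Year, Product.cat),
    has_subcat := TRUE
  ]
}
df[, newcat := Product.cat]
df[has_subcat == FALSE, newcat := paste0(Product.cat, strrep("m", maxd - lvl))]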
This problem involves R. I have two dataframes, represented by this minimal reproducible example:
a <- data.frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"), county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data.frame(geocode = c("360050002001002", "360850323001019"), jobs = c("4", "204"))
An example to help communicate the very specific operation I am trying to perform: the geocode_selector column in dataframe a contains the FIPS county codes of the five boroughs of NY. The geocode column in dataframe b is the 15-digit ID of a specific Census block. The first five digits of a geocode match a more general geocode_selector, indicating which county the Census block is located in. I want to add a column to b specifying which county each census block falls under, based on which geocode_selector each geocode in b matches with.
Generally, I'm trying to merge dataframes based on a regex condition. Ideally, I'd like to perform a full merge carrying all of the columns of a over to b and not just the county_name.
I tried something along the lines of:
b[, "county_name"] <- NA
for (i in 1:nrow(b)) {
for (j in 1:nrow(a)) {.
if (grepl(data.a$geocode_selector[j], b$geocode[i]) == TRUE) {
b$county_name[i] <- a$county_name[j]
}
}
}
but it took an extremely long time on the large datasets I am actually processing, and the finished product was not what I wanted.
Any insight on how to merge dataframes conditionally based on a regex condition would be much appreciated.
You could do this...
b$geocode_selector <- substr(b$geocode,1,5)
b2 <- merge(b, a, all.x=TRUE) #by default it will merge on common column names
b2
geocode_selector geocode jobs county_name
1 36005 360050002001002 4 Bronx
2 36085 360850323001019 204 Richmond
If you wish, you can delete the geocode_selector column from b2 with b2[,1] <- NULL
We can use sub to create the 'geocode_selector' and then do the join
library(data.table)
setDT(a)[as.data.table(b)[, geocode_selector := sub('^(.{5}).*', '\\1', geocode)],
on = .(geocode_selector)]
# geocode_selector county_name geocode jobs
#1: 36005 Bronx 360050002001002 4
#2: 36085 Richmond 360850323001019 204
This is a great opportunity to use dplyr. I also tend to like the string handling functions in stringr, such as str_sub.
library(dplyr)
library(stringr)
a <- data_frame(geocode_selector = c("36005", "36047", "36061", "36081", "36085"),
county_name = c("Bronx", "Kings", "New York", "Queens", "Richmond"))
b <- data_frame(geocode = c("360050002001002", "360850323001019"),
jobs = c("4", "204"))
b %>%
mutate(geocode_selector = str_sub(geocode, end = 5)) %>%
inner_join(a, by = "geocode_selector")
#> # A tibble: 2 x 4
#> geocode jobs geocode_selector county_name
#> <chr> <chr> <chr> <chr>
#> 1 360050002001002 4 36005 Bronx
#> 2 360850323001019 204 36085 Richmond
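If the matching condition ever needs to be a genuine regex rather than a fixed-width prefix, the fuzzyjoin package supports joins on regular expression matches. A sketch, where the pattern column is a helper constructed purely for illustration:
library(fuzzyjoin)

# each county code, anchored at the start of the geocode (helper column)
a$pattern <- paste0("^", a$geocode_selector)
b$geocode <- as.character(b$geocode)
regex_inner_join(b, a, by = c("geocode" = "pattern"))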
I am trying to do a kind of 'if' statement in R where I want to check whether two values (strings) are the same in two different columns. For example, if my Origin and my Destination country are the same, I want to create a new column with Domestic as the result. If false, I would eventually recode the NA as International.
I have tried several functions in R but still can't get it to work properly!
I think the recode function from the car library could fit. Here is some example data and two lines of code I have tried.
Thanks for the help.
#Data
Origin.Country <- c("Canada", "Vietnam", "Maldives", "Indonesia", "Spain", "Canada", "Vietnam")
Passengers <- c(100, 5000, 200, 10000, 200, 20, 4000)
Destination.Country <- c("France", "Vietnam", "Portugal", "Thailand", "Spain", "Canada", "Thailand")
data2 <- data.frame(Origin.Country, Destination.Country, Passengers)
#Creating new column
data2$Domestic<-NA
#If Origin and Destination is the same = Domestic
data2$Domestic[data2$Origin.Country == data2$Destination.Country] <- "Domestic"
data2$Domestic <- recode(data2$Origin.Country, c(data2$Destination.Country)='Domestic', else='International')
You can use ifelse:
data2$Domestic <- ifelse(as.character(data2$Origin.Country) ==
as.character(data2$Destination.Country),
'Domestic', 'International')
I used as.character to coerce the country name variables to be characters for comparison. ifelse takes a logical as the first argument, and returns the second argument if TRUE, and the third argument if FALSE. In this instance, it performs a comparison of the variables by row.
This might be a bit slow because it's not vectorised, but it returns the labels row by row based on your example:
data2$domestic <- apply(data2, 1, function(x) {
  if (x["Origin.Country"] == x["Destination.Country"]) "Domestic" else "International"
})
You can use recode in this way:
library(dplyr); library(car)
data2 %>% mutate(Domestic = recode(as.character(Origin.Country) == as.character(Destination.Country),
"TRUE='domestic'; else='international'"))
Origin.Country Destination.Country Passengers Domestic
1 Canada France 100 international
2 Vietnam Vietnam 5000 domestic
3 Maldives Portugal 200 international
4 Indonesia Thailand 10000 international
5 Spain Spain 200 domestic
6 Canada Canada 20 domestic
7 Vietnam Thailand 4000 international
I have a list L of named vectors. For example, the 1st element:
> L[[1]]
$event
[1] "EventA"
$time
[1] "1416355303"
$city
[1] "Los Angeles"
$region
[1] "California"
$Locale
[1] "en-GB"
When I unlist each element of the list, the resulting vectors look like this (for the first 3 elements):
> unlist(L[[1]])
event time city region Locale
"EventA" "1416355303" "Los Angeles" "California" "en-GB"
> unlist(L[[2]])
event time Locale
"EventB" "1416417567" "en-GB"
> unlist(L[[3]])
event properties.time
"EventM" "1416417569"
I have over 0.5 million elements in the list, and each one has up to 42 of these features/names. I have to merge them into a dataframe, taking into account their names and the fact that not all of them have the same number of features or names (in the example above, V2 has no information for region and city). At the moment, I loop through the whole list:
df1 <- merge(stack(unlist(L[[1]])), stack(unlist(L[[2]])),
by = "ind", all = TRUE)
suppressWarnings(for (i in 3:length(L)){
df1 <- merge(df1, stack(unlist(L[[i]])), by = "ind", all = TRUE)
})
df1 <- as.data.frame(t(df1))
For the example above this returns:
V1 V2 V3 V4 V5
ind city event Locale region time
values.x Los Angeles EventA en-GB California 1416355303
values.y <NA> EventB en-GB <NA> 1416417567
values <NA> EventM <NA> <NA> 1416417569
which is what I want. However, bearing in mind the length of the list, and the fact that every time the command
df1 <- merge(df1, stack(unlist(L[[i]])), by = "ind", all = TRUE)
runs it has to process the entire accumulated data frame (df1), the loop takes a very long time. Therefore, I was wondering if anyone knows a better/faster way to code this. In other words: given a long list of named vectors with different lengths, is there a fast way to merge them into a data frame like the one described above?
For example, is there a way of doing this using foreach and %dopar%? In any case, any faster approach is welcome.
I've heard the data.table package is pretty fast, and rbindlist is perfect for this list.
library(data.table)
rbindlist(L, fill=TRUE)
# event time city region Locale
# 1: EventA 1416355303 Los Angeles California en-GB
# 2: EventB 1416417567 NA NA en-GB
# 3: EventM 1416417569 NA NA NA
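One caveat, since the question describes the elements as named vectors: rbindlist needs list-like elements. Here each L[[i]] is already a list, so it works directly; if the elements were true atomic named vectors, you could convert them first:
# only needed if the elements are atomic vectors rather than lists
rbindlist(lapply(L, as.list), fill = TRUE)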
I'm not sure why you use merge. It seems to me like you should simply rbind.
L <- list(list(event = "EventA", time = 1416355303,
city = "Los Angeles", region = "California",
Locale = "en-GB"),
list(event = "EventB", time = 1416417567,
Locale = "en-GB"),
list(event = "EventM", time = 1416417569))
library(plyr)
do.call(rbind.fill, lapply(L, as.data.frame))
# event time city region Locale
#1 EventA 1416355303 Los Angeles California en-GB
#2 EventB 1416417567 <NA> <NA> en-GB
#3 EventM 1416417569 <NA> <NA> <NA>
Here's a compact solution to consider:
library(reshape2)
dcast(melt(L), L1 ~ L2, value.var = "value")
# L1 city event Locale region time
# 1 1 Los Angeles EventA en-GB California 1416355303
# 2 2 <NA> EventB en-GB <NA> 1416417567
# 3 3 <NA> EventM <NA> <NA> 1416417569
The original post is about merging named vectors. Define the first two given in the example above as vectors:
C1 <- c(event = "EventA", time = 1416355303,
        city = "Los Angeles", region = "California",
        Locale = "en-GB")
C2 <- c(event = "EventB", time = 1416417567,
        Locale = "en-GB")
If you want to merge them and are OK with giving up the extra data in the longer vector, you can index the longer vector by the names in the shorter vector:
C1 <- C1[names(C2)]
Then just use rbind or cbind. An example with rbind:
C1_C2 <- rbind(C1, C2)
C1_C2
   event    time         Locale
C1 "EventA" "1416355303" "en-GB"
C2 "EventB" "1416417567" "en-GB"
You can combine the final two steps, but you will lose the name of the first vector if you do that. For example:
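rbind(C1[names(C2)], C2)
   event    time         Locale
   "EventA" "1416355303" "en-GB"
C2 "EventB" "1416417567" "en-GB"
The first row keeps its values, but its "C1" label is gone because the expression C1[names(C2)] is not a plain symbol.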