Counting unique subsets of data efficiently in R

I have a relatively large dataset that I wouldn't call 'big data': around 3 to 5 million rows. Because of the size, I'm using the data.table library for the analysis.
The dataset (named df, a data.table) can essentially be broken into:
n identifier fields, hereafter ID_1, ID_2, ..., ID_n, some of which are numeric and some of which are character vectors.
m categorical variables, hereafter C_1, ..., C_m, all of which are character vectors with very few distinct values apiece (2 in one, 3 in another, etc.).
2 measurement variables, M_1 and M_2, both numeric.
A subset of data is identified by ID_1 through ID_n, contains a full set of all values of C_1 through C_m, and has a range of values of M_1 and M_2. Each subset consists of 126 records.
I need to accurately count the unique sets of data and, because of the size of the data, I would like to know if there already exists a much more efficient way to do this. (Why roll my own if other, much smarter, people have done it already?)
I've already done a fair amount of Google work to arrive at the method below.
What I've done is to use the ht package (https://github.com/nfultz/ht) so that I can use a data frame as a hash key (it uses digest in the background).
I paste together the ID fields to create a new, single column, hereafter referred to as ID, which resembles...
ID = "ID1:ID2:...:IDn"
Then I loop through each unique identifier and, using just the subset data frame of C_1 through C_m, M_1, and M_2 (126 rows of data), use that subset as the hash key and record the identifier against it.
Afterwards, I take that information and put it back into the data.table.
# Create the hash structure
datasets <- ht()
# Declare the fields which will denote a subset of data
uniqueFields <- c("C_1", ..., "C_m", "M_1", "M_2")  # placeholder names for the actual columns
# Create the REPETITIONS field in the original data.table structure
df[,REPETITIONS := 0]
# Create the KEY field in the original data.table structure
df[,KEY := ""]
# Use the updateHash function to fill datasets
updateHash <- function(val) {
  key <- df[ID == val, uniqueFields, with = FALSE]
  if (is.null(datasets[key])) {
    # If this unique set of data doesn't already exist in datasets...
    datasets[key] <- list(val = val)
  } else {
    # If this unique set of data does already exist in datasets...
    datasets[key] <- list(val = append(datasets[key]$val, val))
  }
}
# Loop through the ID fields. I've explored using apply;
# this vector is around 10-15K long. This version works.
for (id in unique(df$ID)) {
  updateHash(id)
}
# Now update the original data.table structure so the analysis can
# be done. Again, I could use the R apply family; this version works.
for (dataset in ls(datasets)) {
  IDS <- unlist(datasets[[dataset]]$val)
  # For this set of information, how many times was it repeated?
  df[ID %in% IDS, REPETITIONS := length(datasets[[dataset]]$val)]
  # For this set, what is a unique identifier?
  df[ID %in% IDS, KEY := dataset]
}
This does what I want, though it's not blindingly fast. I can now present some neat analysis about variability in the datasets to the people who care about it. I don't like that it's hacky, and one way or another I'm going to clean it up and make it better. Before I do, I want to do my final due diligence and check whether it's simply my Google-fu failing me.
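For reference, the kind of data.table-native approach I'm wondering about might look roughly like the sketch below (untested; uniqueFields is the same vector as in the code above, and each subset is sorted so that identical subsets hash identically):

library(digest)
# Sketch only: hash each ID's sorted subset of the C/M columns with digest,
# then count how many IDs share the same hash
keys <- df[, .(KEY = digest(setorderv(copy(.SD), uniqueFields))),
           by = ID, .SDcols = uniqueFields]
keys[, REPETITIONS := .N, by = KEY]
df[keys, on = "ID", `:=`(KEY = i.KEY, REPETITIONS = i.REPETITIONS)]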

Related

How to analyse rows with similar IDs in PySpark?

I have a very large Dataset (160k rows).
I want to analyse each subset of rows with the same ID.
I only care about subsets with the same ID that are at least 30 rows long.
What approach should I use?
I did the same task in R as follows (from what I can tell, this can't be translated to PySpark):
Sort in ascending order.
Check whether the next row has the same ID as the current one; if yes, n = n + 1; if no, do my analysis and save the results. Rinse and repeat for the whole length of the data frame.
One easy method is to group by 'ID' and collect the columns that are needed for your analysis.
If just one column (assuming the usual import, from pyspark.sql import functions as F):
grouped_df = original_df.groupby('ID').agg(F.collect_list("column_m").alias("for_analysis"))
If you need multiple columns, you can use struct:
grouped_df = original_df.groupby('ID').agg(F.collect_list(F.struct("column_m", "column_n", "column_o")).alias("for_analysis"))
Then, once you have your data per ID, you can use a UDF to perform your elaborate analysis
grouped_df = grouped_df.withColumn('analysis_result', analyse_udf('for_analysis', ...))

How to include conditional := reference or set() in data.table

New to R and working my way through DataCamp to better understand data.tables. Trying to apply := and set() on a large data set to aggregate.
Wondering if someone can give me a pointer on including conditionals in a := or set() call for data.tables. I have a large 10-million-row, 20+-column data.table where I am trying to group by ID and period (using setkey), testing row i-1 against row i for the column "period" to produce a categorical output of 0 or 1 in a "flag" column.
I've tried:
for(i in 1:200)
set(DT, i, .(period[i]-period[i-1]<=1, period[i]-period[i-1]>1), flag = .(0,1))
# error is unused argument flag=.(0,1)
I'm probably mixing up :=, set(), and base R. I haven't seen examples that compare one row to another and include the index in the call to produce two different outputs.
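Not the asker's code, but a sketch of how this row-versus-previous-row comparison is commonly written in data.table with shift(), using made-up column names and data:

library(data.table)
# Illustrative data; the real table has 10 million rows and 20+ columns
DT <- data.table(ID = rep(1:2, each = 5), period = c(1, 2, 3, 5, 6, 1, 3, 4, 5, 8))
setkey(DT, ID, period)
# flag = 1 when the gap to the previous period within the same ID is greater than 1, else 0
DT[, flag := as.integer(period - shift(period, 1L) > 1), by = ID]
DT[is.na(flag), flag := 0L]  # the first row of each ID has no previous row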

R data frames joined by matching value inequality to a range defined by 2 columns

In R, I know there are many different ways of joining/merging data frames based on an equals-condition between two or several columns.
However, I need to join two data frames based on matching a value to a value-range, defined by 2 columns, using greater-than-or-equal-to in one case and less-than-or-equal-to in the other. If I was using SQL, the query could be:
SELECT * FROM Table1
LEFT JOIN Table2
ON Table1.Value >= Table2.LowLimit AND Table1.Value <= Table2.HighLimit
I know about the sqldf package, but I would like to avoid using that if possible.
The data I am working with is one data frame with ip-addresses, like so:
ipaddresses <- data.frame(IPAddress=c("1.1.1.1","2.2.2.2","3.3.3.3","4.4.4.4"))
The other data frame is the MaxMind GeoLite2 database, containing an ip-address range start, an ip-address range end, and a geographic location ID:
ip_range_start <- c("1.1.1.0","3.3.3.0")
ip_range_end <- c("1.1.1.255","3.3.3.100")
geolocationid <- c("12345","67890")
ipranges <- data.frame(ip_range_start,ip_range_end,geolocationid)
So, what I need to achieve is a join of ipranges$geolocationid onto ipaddresses, in each case where
ipaddresses$IPAddress >= ipranges$ip_range_start
AND
ipaddresses$IPAddress <= ipranges$ip_range_end
With the example data above, that means I need to correctly find that 1.1.1.1 is in the range of 1.1.1.0-1.1.1.255, and 3.3.3.3 is in the range of 3.3.3.0-3.3.3.100.
This approach may not scale well, because it involves initially doing an outer join via broom::inflate(), but it should work if you don't have a ton of ipaddresses:
library(dplyr)
library(broom)
ipranges %>%
  inflate(ipaddresses) %>%
  ungroup %>%
  filter(
    numeric_version(IPAddress) >= numeric_version(ip_range_start),
    numeric_version(IPAddress) <= numeric_version(ip_range_end)
  )
Results:
Source: local data frame [2 x 4]

  IPAddress ip_range_start ip_range_end geolocationid
     (fctr)         (fctr)       (fctr)        (fctr)
1   1.1.1.1        1.1.1.0    1.1.1.255         12345
2   3.3.3.3        3.3.3.0    3.3.3.100         67890
Having done some additional research, I have actually found a solution for my particular use case. Still, it is NOT a solution to the general problem: How to join two data frames where the join condition is that key >= value1 AND key <= value2. However, it does solve the actual problem I had.
What I ended up finding as a great way to solve my need for geographic location of ip-addresses, is the package rgeolocate in combination with the downloadable binary version of the MaxMind GeoLite2 database.
The solution is lightning-fast; the matching of 500+ ip-addresses against 3+ million ip-ranges is done in a second. My previous attempt involved loading the CSV version of the MaxMind database into a data frame and working from there. Don't do that. Thanks to the rgeolocate package and the binary MaxMind database, it is SO much faster.
My code ended up being this (dataunion is the name of the data frame holding my collected ip-addresses):
library(rgeolocate)
ipaddresslist <- as.character(dataunion$IPAddress)
geoloc <- maxmind(ipaddresslist, "GeoLite2-City.mmdb", c("latitude","longitude", "continent_name","country_name","region_name","city_name"))
colnames(geoloc) <- c("Lat","Long","Continent","Country","Region","City")
dataunion <- cbind(dataunion, geoloc)
Finally, I have found the solution for the general problem, in addition to the above solution to the specific problem of geolocating IP-addresses using the MaxMind database.
This is the general solution for joining two data frames of equal or unequal length, where a value must be compared with an inequality condition (less-than or greater-than) to one or more columns.
The solution is using sapply, which is base R.
With the two data frames defined in the question, ipranges and ipaddresses, we have:
ipaddresses$geolocationid <- sapply(ipaddresses$IPAddress,
  function(x)
    ipranges$geolocationid[ipranges$ip_range_start <= x & ipranges$ip_range_end >= x])
sapply takes each element of the vector ipaddresses$IPAddress, one at a time, and applies the function expression provided as an argument to it. The result of each application is collected into a vector, which is the output of sapply, and that is what we insert as the new column ipaddresses$geolocationid.
In this case, the IP-addresses should really be converted to integers first: comparing them as strings is lexicographic rather than numeric, and the integer comparison also makes the sapply operation faster. Here are a few lines that extend the ipaddresses data frame with a column containing the integer version of each ip-address:
#calculating the integer version of each IP-address
octet <- data.frame(read.table(text=as.character(ipaddresses$IPAddress), sep="."))
octet$IPint <- 256^3*octet[,1] + 256^2*octet[,2] + 256*octet[,3] + octet[,4]
ipaddresses$IPint <- octet$IPint
# cleaning "octet" from memory
octet <- NULL
You would obviously have to do the same kind of conversion to the IP-addresses in your ipranges dataframe.
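For completeness, a sketch of that same conversion applied to ipranges (the helper name and new column names below are just illustrative):

# Sketch only: the same octet arithmetic as above, applied to the range columns
ip_to_int <- function(ip) {
  o <- read.table(text = as.character(ip), sep = ".")
  256^3 * o[, 1] + 256^2 * o[, 2] + 256 * o[, 3] + o[, 4]
}
ipranges$ip_start_int <- ip_to_int(ipranges$ip_range_start)
ipranges$ip_end_int <- ip_to_int(ipranges$ip_range_end)
# The sapply comparison can then use the integer columns instead of the strings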

R: Warning when creating a (long) list of dummies

A dummy column for a column c and a given value x equals 1 if c == x and 0 otherwise. Usually, when creating dummies for a column c, one value x is left out by choice, since its dummy column wouldn't add any information beyond the already existing dummy columns.
Here's how I'm trying to create a long list of dummies for a column firm, in a data.table:
values <- unique(myDataTable$firm)
cols <- paste('d', as.character(values[-1]), sep = '_') # gives us nice d_value names for columns
# the [-1]: I arbitrarily do not create a dummy for the first unique value
myDataTable[, (cols) := lapply(values[-1], function(x) firm == x)]
This code reliably worked for previous columns, which had fewer unique values. firm, however, has many more:
str(values)
num [1:3082] 51560090 51570615 51603870 51604677 51606085 ...
I get a warning when trying to add the columns:
Warning message:
truelength (6198) is greater than 1000 items over-allocated (length = 36). See ?truelength. If you didn't set the datatable.alloccol option very large, please report this to datatable-help including the result of sessionInfo().
As far as I can tell, all the columns I need are still there. Can I just ignore this issue? Will it slow down future computations? I'm not sure what to make of this, or of the relevance of truelength.
Taking Arun's comment as an answer.
You should use the alloc.col function to pre-allocate columns in your data.table, to a number larger than the expected ncol.
alloc.col(myDataTable, 3200)
Additionally, depending on how you consume the data, I would recommend considering reshaping your wide table into a long table (see EAV). Then you only need one column per data type.
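As a rough sketch of that reshaping idea (the column names here are only illustrative), the wide d_* dummy columns could be melted back into a long key/value form with data.table::melt:

# Sketch only: melt the wide d_* dummy columns into one long (firm, attribute, value) table
long <- melt(myDataTable,
             id.vars = "firm",
             measure.vars = patterns("^d_"),
             variable.name = "attribute",
             value.name = "value")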

Evaluating dataframe and storing the result

My data frame (m x n) has a few hundred columns. I need to compare each column with all the other columns (contingency table), perform a chisq test, and save the results for each column in a different variable.
It's working for one column at a time, like:
s <- function(x) {
  a <- table(x, data[, 1])
  b <- chisq.test(a)
}
c1 <- apply(data, 2, s)
The results are stored in c1 for column 1, but how do I loop this over all columns and save the result for each column for further analysis?
If you're sure you want to do this (I wouldn't, thinking about the multiple-testing problem), work with lists:
Data <- data.frame(
  x = sample(letters[1:3], 20, TRUE),
  y = sample(letters[1:3], 20, TRUE),
  z = sample(letters[1:3], 20, TRUE)
)
# Make a nice list of indices
ids <- combn(names(Data),2,simplify=FALSE)
# use the appropriate apply
my.results <- lapply(ids,
  function(z) chisq.test(table(Data[, z]))
)
# use some paste voodoo to give the results the names of the column indices
names(my.results) <- sapply(ids,paste,collapse="-")
# select all values for y :
my.results[grep("y",names(my.results))]
Not harder than that. As I show in the last line, you can easily get all tests for a specific column, so there is no need to make a separate list for each column. That just takes longer and more space, but gives the same information. You can write a small convenience function to extract the data you need:
extract <- function(col, l) {
  l[grep(col, names(l))]
}
extract("^y$",my.results)
This means you can even loop over the different column names of your data frame and get a list of lists returned:
lapply(names(Data),extract,my.results)
I strongly suggest you get yourself acquainted with working with lists, they're one of the most powerful and clean ways of doing things in R.
PS : Be aware that you save the whole chisq.test object in your list. If you only need the value for Chi square or the p-value, select them first.
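For example, a short sketch of pulling just those values out of the list built above:

# Keep only the p-values (or test statistics) from the stored chisq.test objects
pvals <- sapply(my.results, function(res) res$p.value)
stats <- sapply(my.results, function(res) res$statistic)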
Fundamentally, you have a few problems here:
You're relying heavily on global arguments rather than local ones. This makes the double usage of "data" confusing.
Similarly, you rely on a hard-coded value (column 1) instead of passing it as an argument to the function.
You're not extracting the one value you need from the chisq.test(). This means your result gets returned as a list.
You didn't provide any example data, so here's some:
m <- 10
n <- 4
mytable <- matrix(runif(m*n),nrow=m,ncol=n)
Once you fix the above problems, simply run a loop over various columns (since you've now avoided hard-coding the column) and store the result.
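A sketch of what that loop might look like, assuming a data frame mydata of categorical columns (the names here are illustrative, not from the question) and keeping only the p-values:

# Sketch only: compare column i against every other column and keep the p-values
chisq_against_all <- function(mydata, i) {
  sapply(mydata[-i], function(other) chisq.test(table(mydata[[i]], other))$p.value)
}
results <- lapply(seq_along(mydata), function(i) chisq_against_all(mydata, i))
names(results) <- names(mydata)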
