I'm trying to generate network graph data from raw occurrence data. In the raw data, I have the occurrence rate of features in a variety of contexts. Let's say it's actors in different movies. Each row is [context, feature, weight], where weight might be amount of screen time. Here's a toy data set:
df <- data.frame(context = sample(LETTERS[1:10], 500, replace=TRUE),
feature = sample(LETTERS, 500, replace=TRUE),
weight = sample(1:100, 500, replace=TRUE)
)
So for Movie A, we might have 20 rows, where each row is an actor's name and their screen time in that movie.
What I'd like to generate is the pairwise combination of all actors for each movie, with the sum of their respective weights. So for example, if we start with:
[A, A, 5]
[A, B, 2]
I'd like output in the format of [context, feature1, feature2, sum.weight]. So:
[A, A, B, 7]
I know how to run through this with a combination of for loops, but I'd like to know if there is a more "classic R" way of approaching this, particularly with something like data.table.
Here's a possible solution using the data.table package:
library(data.table)
# keep a record of feature's levels
feature.levels <- levels(df$feature)
# for each context, create a data table for all pair combinations of features,
# & sum of said pair's weights
df <- df[,
         as.data.table(cbind(t(combn(feature, 2)),
                             rowSums(t(combn(weight, 2))))),
         by = context]
# map features (converted into integers in the previous step) back to factors
df[, c('V1', 'V2') := lapply(.SD,
                             function(x) factor(x, labels = feature.levels)),
   .SDcols = c('V1', 'V2')]
# rename features / sum weights
setnames(df,
old = c("V1", "V2", "V3"),
new = c("feature1", "feature2", "sum.weights"))
> head(df)
context feature1 feature2 sum.weights
1: C j l 373
2: C j z 282
3: C j v 382
4: C j h 488
5: C j c 280
6: C j u 360
Data (I used lower case for "feature" so that it's visually distinct from upper case "context"):
set.seed(123)
df <- data.frame(context = sample(LETTERS[1:10], 500, replace=TRUE),
feature = sample(letters, 500, replace=TRUE),
weight = sample(1:100, 500, replace=TRUE))
# convert to data table & summarize to unique combinations by context + feature
setDT(df)
df <- df[,
list(weight = sum(weight)),
by = list(context, feature)]
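For comparison, the same pairings can also be produced with a self-join on context. Here is a minimal sketch, assuming the aggregated df created just above; feature1/feature2 end up ordered alphabetically within each pair, so row order may differ from the combn() version:
dt <- copy(df)                          # aggregated [context, feature, weight] from above
dt[, feature := as.character(feature)]  # plain strings so that feature < i.feature is well defined
pairs <- dt[dt, on = "context", allow.cartesian = TRUE][
  feature < i.feature,
  .(context, feature1 = feature, feature2 = i.feature, sum.weights = weight + i.weight)]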
I have four large vectors of unequal length. Below I am providing a toy dataset similar to my original dataset:
a <- c(1021.923, 3491.31, 102.3, 12019.11, 879.2, 583.1)
b <- c(21,32,523,123.1,123.4545,12345,95.434, 879.25, 1021.9,11,12,662)
c <- c(52,21,1021.9288,12019.12, 879.1)
d <- c(432.432,23466.3,45435,3456,123,6688,1021.95)
Is there a way to compare all of these vectors one by one with an allowed threshold of ±0.5 for the match? In other words, I want to report the numbers that are common among all four vectors while allowing a drift of 0.5.
In the case of the toy dataset above, the final answer is:
Match1
a 1021.923
b 1021.900
c 1021.929
d 1021.950
I understand that this is possible for two vectors, but how can I do it for 4 vectors?
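For reference, this is roughly how I would approach it for just two vectors (a rough sketch using outer()):
# all pairwise differences between a and b; keep the index pairs within the tolerance
hits <- which(abs(outer(a, b, "-")) <= 0.5, arr.ind = TRUE)
cbind(a = a[hits[, 1]], b = b[hits[, 2]])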
RELATED
All-to-all setdiff on two numeric vectors with a numeric threshold for accepting matches
Compare two vectors of numbers based on threshold of tolerance (±) of 0.5
Here is a data.table solution.
It is scalable to n vectors, so feel free to feed it as many as you like. It also performs well when multiple values have 'hits' in all vectors.
sample data
a <- c(1021.923, 3491.31, 102.3, 12019.11, 879.2, 583.1)
b <- c(21,32,523,123.1,123.4545,12345,95.434, 879.25, 1021.9,11,12,662)
c <- c(52,21,1021.9288,12019.12, 879.1)
d <- c(432.432,23466.3,45435,3456,123,6688,1021.95)
code
library(data.table)
#create list with vectors
l <- list( a,b,c,d )
names(l) <- letters[1:4]
#create data.table to work with
DT <- rbindlist( lapply(l, function(x) {data.table( value = x)} ), idcol = "group")
#add margins to each value
DT[, `:=`( id = 1:.N, start = value - 0.5, end = value + 0.5 ) ]
#set keys for joining
setkey(DT, start, end)
#perform overlap-join
result <- foverlaps(DT,DT)
#cast, to check how the 'hits' each id has in each group (a,b,c,d)
answer <- dcast( result,
group + value ~ i.group,
fun.aggregate = function(x){ x * 1 },
value.var = "i.value",
fill = NA )
#get your final answer
#set columns to look at (i.e. the names from the earlier created list)
cols = names(l)
#keep the rows without NA (use rowSums, because TRUE = 1, FALSE = 0 )
#so if rowSums == 0, then none of the columns named in the vector 'cols' contain an NA for that row
answer[ rowSums( is.na( answer[ , ..cols ] ) ) == 0, ]
output
# group value a b c d
# 1: a 1021.923 1021.923 1021.9 1021.929 1021.95
# 2: b 1021.900 1021.923 1021.9 1021.929 1021.95
# 3: c 1021.929 1021.923 1021.9 1021.929 1021.95
# 4: d 1021.950 1021.923 1021.9 1021.929 1021.95
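If you specifically want the Match1 layout from the question, a small follow-up step on the filtered result could be (a sketch reusing answer and cols from above):
matches <- answer[ rowSums( is.na( answer[ , ..cols ] ) ) == 0 ]
matches[ , .( group, Match1 = value ) ]
# one row per group with its matched value: a 1021.923, b 1021.900, c 1021.929, d 1021.950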
I have these two data frames:
library(data.table)
set.seed(42)
A <- data.table(station = sample(1:10, 1000, replace=TRUE),
                hash    = sample(letters[1:5], 1000, replace=TRUE),
                point   = sample(1:24, 1000, replace=TRUE))
B <- data.table(station = sample(1:10, 100, replace=TRUE),
                card    = sample(letters[6:10], 100, replace=TRUE),
                point   = sample(1:24, 100, replace=TRUE))
My real data frame A contains more than 1M rows.
I am trying to find the hash (from A) for each card (from B), with some conditions: the station and point in A must lie in a range around those in B (for station ±1, and for point just +2).
I group B by card and, for each group, run a function that binds the rows of A satisfying those conditions and then takes the hash with the highest frequency.
detect <- function(x){
am0 <- data.frame(station = 0,
hash = 0,
point = 0)
for (i in 1:nrow(x)) {
am1 <- A %>%
filter(station %in% (B$station[i] - 1) : (B$station[i] + 1) &
point > B$point[i] & point < B$point[i] + 2)
am0 <- rbind(am0, am1)
}
t <- as.data.frame(table(am0$hash))
t <- t %>%
arrange(-Freq) %>%
filter(row_number() == 1)
return(t)
}
And then just:
library(dplyr)
B %>%
group_by(card) %>%
do(detect(.)) %>%
ungroup
But I don't know how to apply the function to each group with the indices [i], so I actually get a wrong result.
# A tibble: 5 x 3
card Var1 Freq
<chr> <fctr> <int>
1 f c 46
2 g c 75
3 h c 41
4 i c 64
5 j c 62
I'm a beginner, but I know the best solution for big datasets is to use the data.table library to join two datasets like these. Can you help me find a solution for it?
I think what you want to do is:
#### Prepare join limits
B[, point_limit := as.integer(point + 2)]
B[, station_lower := as.integer(station - 1)]
B[, station_upper := as.integer(station + 1)]
## Join A on B; this creates all combinations of rows in A and B fulfilling the conditions
joined_table <- B[A,
                  on = .(point_limit >= point, point <= point,
                         station_lower <= station, station_upper >= station),
                  nomatch = 0,
                  allow.cartesian = TRUE]
## Count the occurrences of the combinations
counted_table <- joined_table[,.N, by=.(card,hash)][order(card, -N)]
## Select the top for each group.
counted_table[, head(.SD, 1 ),by = .(card)][order(card)]
This creates a full table with all the information in it and then does the counting on that. It relies purely on data.table in order to take full advantage of the speed gains from that package. The data.table vignettes are good if you are unfamiliar with the syntax. The nomatch = 0 condition ensures that we are doing an inner join.
This will probably be fine if A is only 1M rows and B is kept the same size, depending on your data's distribution. We can, however, also split B, in a way similar to your do statement, using the package purrr. I'm not sure how this interacts with R's garbage collection, however.
frame_list <- purrr::map(unique(B$card),
                         ~ B[card == .x][A,
                                         on = .(point_limit >= point,
                                                point <= point,
                                                station_lower <= station,
                                                station_upper >= station),
                                         nomatch = 0,
                                         allow.cartesian = TRUE][, .N, by = .(card, hash)])
counted_table_mem <- rbindlist(frame_list)
Something to note here is that I use rbindlist instead of multiple rbind calls. Repeatedly calling rbind would be very slow, since you would need to allocate new memory each time.
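A tiny toy illustration of that pattern (not data from the question): collect the pieces in a list and bind them once at the end.
pieces <- lapply(1:100, function(k) data.table(card = k, N = k^2))
out <- rbindlist(pieces)  # one allocation for all pieces instead of 100 incremental copies via rbind()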
I'm interested in building a function, using apply/sapply or Map, that would iterate over the columns of dta and replace the values in each column with matched values from the corresponding data frame in a nameless list of data frames, where the list item index corresponds to the column number of the dta data frame.
Example
Given objects:
set.seed(1)
size <- 20
# Data set
dta <-
data.frame(
unitA = sample(LETTERS[1:4], size = size, replace = TRUE),
unitB = sample(letters[16:20], size = size, replace = TRUE),
unitC = sample(month.abb[1:4], size = size, replace = TRUE),
someValue = sample(1:1e6, size = size, replace = TRUE)
)
# Meta data
lstMeta <- list(
# Unit A definitions
data.frame(
V1 = c("A", "B", "D"),
V2 = c("Letter A", "Letter B", "Letter D")
),
# Unit B definitions
data.frame(
V1 = c("t", "q"),
V2 = c("small t", "small q")
),
# Unit C definitions
data.frame(
V1 = c("Mar", "Jan"),
V2 = c("March", "January")
)
)
Desired results
When applied on dta, the function should return a data.frame corresponding to the extract below:
unitA unitB unitC someValue
Letter B small t Apr 912876
Letter B small q March 293604
C s Apr 459066
Letter D p March 332395
Letter A small q March 650871
Letter D small q Apr 258017
Letter D p January 478546
C small q Feb 766311
C small t March 84247
Letter A small q March 875322
Letter A r Feb 339073
Letter A r Apr 839441
C r Feb 346684
Letter B p January 333775
Letter D small t January 476352
(...)
Existing approach
replaceLbls <- function(dataSet, lstDict) {
sapply(seq_along(dataSet), function(i) {
# Take corresponding metadata data frame
dtaDict <- lstDict[[i]]
# Replace values in selected column
# Where it matches on V1, push corresponding values from V2
dataSet[,i][match(dataSet[,i], dtaDict[,1])] <- dtaDict[,2][match(dtaDict[,1], dataSet[,i])]
})
}
# Testing -----------------------------------------------------------------
replaceLbls(dataSet = dta, lstDict = lstMeta)
Of course the approach proposed above does not work as it will try to use NA in assignments; but it summarises what I want to achieve:
Error in x[...] <- m : NAs are not allowed in subscripted assignments
In addition: Warning message:
In `[<-.factor`(`*tmp*`, match(dataSet[, i], dtaDict[, 1]), value = c(NA, :
  invalid factor level, NA generated
Additional remarks
Source data set
The key characteristics of the data are:
The list is nameless so subsetting has to be done by item numbers not by names
Item numbers correspond to column numbers
There is no full match between metadata data frames available in the list of data frames and unit columns available in the data
The someValue column also should be iterated over as it may contain labels that should be replaced
Solution
I'm not interested in dplyr/data.table/sqldf-based solutions.
I'm not interested in nested for-loops
I have a hacky solution that doesn't use for loops or other packages. I needed to convert the factors to characters for it to work but you might be able to improve my solution.
The solution works by matching only the values that are found in your lstMeta, by creating a vector of indices where matches are found. I also used the <<- operator. If you're better at R than me, you can probably improve this.
set.seed(1)
size <- 20
# Data set
dta <-
data.frame(
unitA = sample(LETTERS[1:4], size = size, replace = TRUE),
unitB = sample(letters[16:20], size = size, replace = TRUE),
unitC = sample(month.abb[1:4], size = size, replace = TRUE),
someValue = sample(1:1e6, size = size, replace = TRUE),
stringsAsFactors = F
)
# Meta data
lstMeta <- list(
# Unit A definitions
data.frame(
V1 = c("A", "B", "D"),
V2 = c("Letter A", "Letter B", "Letter D"),
stringsAsFactors = F
),
# Unit B definitions
data.frame(
V1 = c("t", "q"),
V2 = c("small t", "small q"),
stringsAsFactors = F
),
# Unit C definitions
data.frame(
V1 = c("Mar", "Jan"),
V2 = c("March", "January"),
stringsAsFactors = F
)
)
replaceLbls <- function(dataSet, lstDict) {
  sapply(1:3, function(i) {
    # Take corresponding metadata data frame
    dtaDict <- lstDict[[i]]
    # Replace values in selected column
    # Where it matches on V1, push corresponding values from V2
    myUniques <- which(dataSet[, i] %in% dtaDict[, 1])
    dataSet[myUniques, i] <<- dtaDict[, 2][match(dataSet[myUniques, i], dtaDict[, 1])]
  })
  return(dataSet)
}
# Testing -----------------------------------------------------------------
replaceLbls(dataSet = dta, lstDict = lstMeta)
The following approach works well for the example data:
replaceLbls <- function(dataSet, lstDict) {
dataSet[seq_along(lstDict)] <- Map(function(x, lst) {
x <- as.character(x)
idx <- match(x, as.character(lst$V1))
replace(x, !is.na(idx), as.character(lst$V2)[na.omit(idx)])
}, dataSet[seq_along(lstDict)], lstDict)
dataSet
}
head(replaceLbls(dta, lstMeta))
# unitA unitB unitC someValue
# 1 Letter B small t Apr 912876
# 2 Letter B small q March 293604
# 3 C s Apr 459066
# 4 Letter D p March 332395
# 5 Letter A small q March 650871
# 6 Letter D small q Apr 258017
This assumes that you want to apply the changes to the first X columns of the data, where X is the length of the meta-list. You might want to include an extra step to convert back to factor, since this approach converts the adjusted columns to character class.
Another remark on factors: you could potentially speed up the performance by working only on the levels of any factor variables instead of the whole column. The general process would be similar but requires a few more steps to check classes etc.
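A rough sketch of that levels-only idea, assuming the column is still a factor and each meta data frame maps old labels (V1) to new labels (V2); relabelFactor is just an illustrative name:
relabelFactor <- function(f, lst) {
  lv  <- levels(f)
  idx <- match(lv, as.character(lst$V1))
  # replace only the levels that have a match; leave the rest untouched
  levels(f) <- ifelse(is.na(idx), lv, as.character(lst$V2)[idx])
  f
}
# e.g. dta$unitA <- relabelFactor(dta$unitA, lstMeta[[1]])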
You can also try this:
mapr <- function(t, meta) {
  ind <- match(t, meta$V1)
  if (!is.na(ind)) {
    return(meta$V2[ind])
  } else {
    return(t)
  }
}
Then using sapply:
dta <- as.data.frame(cbind(
  sapply(1:3,
         function(t, df, meta) { sapply(df[, t], mapr, lstMeta[[t]]) },
         dta, lstMeta, simplify = TRUE),
  dta[, 4]))
A couple of mapplys can do the job
f1 <- function(df, lst){
d1 <- setNames(data.frame(mapply(function(x, y) x$V2[match(y, x$V1)], lst, df[1:3]),
df$someValue, stringsAsFactors = FALSE),
names(df))
as.data.frame(mapply(function(x, y) replace(x, is.na(x), y[is.na(x)]), d1, df))
}
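For example, with the stringsAsFactors = FALSE versions of dta and lstMeta from the earlier answer, the call below should reproduce the head() shown above (note that all columns, including someValue, come back as character):
head(f1(dta, lstMeta))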
I want to apply aggregate functions and percentage function to column. I found threads that discuss aggregation (Calculating multiple aggregations with lapply(.SD, ...) in data.table R package) and threads that discuss percentage (How to obtain percentages per value for the keys in R using data.table? and Use data.table to calculate the percentage of occurrence depending on the category in another column), but not both.
Please note that I am looking for data.table-based methods; dplyr wouldn't work on the actual data set.
Here's the code to generate sample data:
set.seed(10)
IData <- data.frame(let = sample(x = LETTERS, size = 10000, replace = TRUE),
                    numbers1 = sample(x = c(1:20000), size = 10000),
                    numbers2 = sample(x = c(1:20000), size = 10000))
IData$let<-as.character(IData$let)
data.table::setDT(IData)
Here's the code to generate output using dplyr
Output <- IData %>%
  dplyr::group_by(let) %>%
  dplyr::summarise(numbers1.mean = as.double(mean(numbers1)),
                   numbers1.median = as.double(median(numbers1)),
                   numbers2.mean = as.double(mean(numbers2)),
                   sum.numbers1.n = sum(numbers1)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(perc.numbers1 = sum.numbers1.n / sum(sum.numbers1.n)) %>%
  dplyr::select(let, numbers1.mean, numbers1.median, numbers2.mean, perc.numbers1)
Sample Output (header)
If I run head(Output), I would get:
let numbers1.mean numbers1.median numbers2.mean perc.numbers1
<chr> <dbl> <dbl> <dbl> <dbl>
N 10320.951 10473.0 9374.435 0.03567927
H 9683.590 9256.5 9328.035 0.03648391
L 10223.322 10226.0 9806.210 0.04005400
S 9922.486 9618.0 10233.849 0.03678742
C 9592.620 9226.0 9791.221 0.03517997
F 10323.867 10382.0 10036.561 0.03962035
Here's what I tried using data.table (unsuccessfully)
IData[, as.list(unlist(lapply(.SD, function(x)
                list(mean = mean(x), median = median(x), sum = sum(x))))),
      by = let, .SDcols = c("numbers1", "numbers2")
     ][, .(Perc = numbers1.sum / sum(numbers1.sum)), by = let]
I have 2 Questions:
a) How can I solve this using data.table?
b) I have seen above threads have used prop.table. Can someone please guide me how to use this function?
I would sincerely appreciate any guidance.
We can use a similar approach with data.table:
res <- IData[, .(numbers1.mean = mean(numbers1),
numbers1.median = median(numbers1),
numbers2.mean=mean(numbers2),
sum.numbers1.n = sum(numbers1)), let
][, perc.numbers1 := sum.numbers1.n/sum(sum.numbers1.n)
][, c("let", "numbers1.mean", "numbers1.median",
"numbers2.mean", "perc.numbers1"), with = FALSE]
head(res)
# let numbers1.mean numbers1.median numbers2.mean perc.numbers1
#1: N 10320.951 10473.0 9374.435 0.03567927
#2: H 9683.590 9256.5 9328.035 0.03648391
#3: L 10223.322 10226.0 9806.210 0.04005400
#4: S 9922.486 9618.0 10233.849 0.03678742
#5: C 9592.620 9226.0 9791.221 0.03517997
#6: F 10323.867 10382.0 10036.561 0.03962035
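Regarding question (b): on a plain vector, prop.table(x) is simply x / sum(x), so the percentage step could equivalently use it, for example (a sketch applied before dropping sum.numbers1.n):
res2 <- IData[, .(numbers1.mean   = mean(numbers1),
                  numbers1.median = median(numbers1),
                  numbers2.mean   = mean(numbers2),
                  sum.numbers1.n  = sum(numbers1)), by = let
              ][, perc.numbers1 := prop.table(sum.numbers1.n)]  # same as sum.numbers1.n / sum(sum.numbers1.n)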
I have a large data table in R:
library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
ID=sample(1:200000, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:1000, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
dim(DT)
I'd like to pivot this data.table, such that Category becomes a column. Unfortunately, since the number of categories isn't constant within groups, I can't use this answer.
Any ideas how I might do this?
/edit: Based on joran's comments and flodel's answer, we're really reshaping the following data.table:
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
This reshape can be accomplished a number of ways (I've gotten some good answers so far), but what I'm really looking for is something that will scale well to a data.table with millions of rows and hundreds to thousands of categories.
data.table implements faster, data.table-specific versions of melt/dcast (in C). It also adds additional features for melting and casting multiple columns. Please see the Efficient reshaping using data.tables vignette.
Note that we don't need to load the reshape2 package.
library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
ID=sample(1:200000, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:800, n, replace=TRUE), ## to get to <= 2 billion limit
Qty=runif(n),
key=c('ID', 'Month')
)
dim(DT)
> system.time(ans <- dcast(DT, ID + Month ~ Category, fun=sum))
# user system elapsed
# 65.924 20.577 86.987
> dim(ans)
# [1] 2399401 802
Like this?
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
timevar = "Category", direction = "wide")
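With current data.table versions, the same wide shape can also be obtained from agg with dcast (a sketch; missing ID/Month/Category combinations become NA, as with reshape):
dcast(agg, ID + Month ~ Category, value.var = "Qty")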
There is no data.table specific wide reshaping method.
Here is an approach that will work, but it is rather convoluted.
There is a feature request, #2619 (Scoping for LHS in :=), to help with making this more straightforward.
Here is a simple example
# a data.table
DD <- data.table(a= letters[4:6], b= rep(letters[1:2],c(4,2)), cc = as.double(1:6))
# with not all categories represented
DDD <- DD[1:5]
# trying to make `a` columns containing `cc`. retaining `b` as a column
# the unique values of `a` (you may want to sort this...)
nn <- unique(DDD[,a])
# create the correct wide data.table
# with NA of the correct class in each created column
rows <- max(DDD[, .N, by = list(a,b)][,N])
DDw <- DDD[, setattr(replicate(length(nn), {
# safe version of correct NA
z <- cc[1]
is.na(z) <- 1
# using rows value calculated previously
# to ensure correct size
rep(z,rows)},
simplify = FALSE), 'names', nn),
keyby = list(b)]
# set key for binary search
setkey(DDD, b, a)
# The possible values of the b column
ub <- unique(DDw[,b])
# nested loop doing things by reference, so it should be
# quick (the feature request would make it possible to
# speed this up using binary search joins)
for(ii in ub){
for(jj in nn){
DDw[list(ii), {jj} := DDD[list(ii,jj)][['cc']]]
}
}
DDw
# b d e f
# 1: a 1 2 3
# 2: a 4 2 3
# 3: b NA 5 NA
# 4: b NA 5 NA
EDIT
I found this SO post, which includes a better way to insert the
missing rows into a data.table. Function fun_DT adjusted
accordingly. Code is cleaner now; I don't see any speed improvements
though.
See my update in the other post. Arun's solution works as well, but you have to manually insert the missing combinations. Since you have more identifier columns here (ID, Month), I only came up with a dirty solution (creating an ID2 first, then creating all ID2-Category combinations, then filling up the data.table, then doing the reshaping).
I'm pretty sure this isn't the best solution, but if this FR is built in, those steps might be done automatically.
The solutions are roughly the same speed-wise, although it would be interesting to see how that scales (my machine is too slow, so I don't want to increase n any further; the computer has crashed too often already ;-)).
library(data.table)
library(rbenchmark)
fun_reshape <- function(n) {
DT <- data.table(
ID=sample(1:100, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:10, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
timevar = "Category", direction = "wide")
}
#UPDATED!
fun_DT <- function(n) {
DT <- data.table(
ID=sample(1:100, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:10, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
agg[, ID2 := paste(ID, Month, sep="_")]
setkey(agg, ID2, Category)
agg <- agg[CJ(unique(ID2), unique(Category))]
agg[, as.list(setattr(Qty, 'names', Category)), by=list(ID2)]
}
library(rbenchmark)
n <- 1e+07
benchmark(replications=10,
fun_reshape(n),
fun_DT(n))
test replications elapsed relative user.self sys.self user.child sys.child
2 fun_DT(n) 10 45.868 1 43.154 2.524 0 0
1 fun_reshape(n) 10 45.874 1 42.783 2.896 0 0