I have ~15 data.frames with 100K-300K rows each. I want to condense the other columns, grouped by the variable v, into a character JSON format for compact storage. Note that each group in v will have one or more rows (likely more). The code below uses the jsonlite package to convert, but because of how I set up the splits it is slow and not memory efficient. How could I do this faster and with less memory? I don't need to use the jsonlite package; I only did because it's the only way I knew how. I'm thinking there's a way to build the character JSON directly with data.table in a fast way, but I can't see how.
PS: if it helps to know the motivation, I'm doing this to have a hash table that I can look v up in and then convert the JSON back to an R data.frame on the fly. Maybe there's a way to use jsonlite more directly than I am, but toJSON(dat) is not what I'm after.
MWE
set.seed(10)
dat <- data.frame(
v = rep(c('red', 'blue'), each = 3),
w = sample(LETTERS, 6),
x = sample(1:3, 6, TRUE),
y = sample(1:3, 6, TRUE),
z = sample(1:3, 6, TRUE),
stringsAsFactors = FALSE
)
dat
Data View
v w x y z
1 red N 1 1 2
2 red H 1 2 3
3 red K 2 2 3
4 blue P 2 2 2
5 blue B 2 1 3
6 blue E 2 1 2
Converting
library(jsonlite)
jsonlist <- lapply(split(dat[-1], dat$v), function(x) as.character(toJSON(x)))
data.frame(
v = names(jsonlist),
json = unlist(jsonlist, use.names=FALSE),
stringsAsFactors = FALSE
)
Desired Result
v json
1 blue [{"w":"P","x":2,"y":2,"z":2},{"w":"B","x":2,"y":1,"z":3},{"w":"E","x":2,"y":1,"z":2}]
2 red [{"w":"N","x":1,"y":1,"z":2},{"w":"H","x":1,"y":2,"z":3},{"w":"K","x":2,"y":2,"z":3}]
Using a data.table, you can group by v and pass .SD to toJSON:
library(data.table)
setDT(dat)
dat[, toJSON(.SD), by = v]
# v V1
#1: red [{"w":"N","x":1,"y":1,"z":2},{"w":"H","x":1,"y":2,"z":3},{"w":"K","x":2,"y":2,"z":3}]
#2: blue [{"w":"P","x":2,"y":2,"z":2},{"w":"B","x":2,"y":1,"z":3},{"w":"E","x":2,"y":1,"z":2}]
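Since the stated motivation is a hash table, here is a minimal sketch of how the result could be used (my addition, assuming dat has been converted with setDT() as above): store the JSON strings in a named character vector keyed by v, and rebuild a data.frame on the fly with fromJSON().
library(data.table)
library(jsonlite)
# one JSON string per group, as plain character
json_dt <- dat[, .(json = as.character(toJSON(.SD))), by = v]
# a named character vector works as a simple hash table keyed by v
lookup <- setNames(json_dt$json, json_dt$v)
# convert one group back to a data.frame on the fly
fromJSON(lookup[["red"]])
#   w x y z
# 1 N 1 1 2
# 2 H 1 2 3
# 3 K 2 2 3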
I'm still not convinced what you're doing makes sense, but:
library(dplyr)
dat %>%
group_by(v) %>%
do(json = select(., -v) %>% toJSON) %>%
mutate(json = unlist(json))
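As an aside (my addition, not part of the original answer): do() is deprecated in current dplyr, so with dplyr >= 1.1 the same result can be written with summarise() and pick():
library(dplyr)
library(jsonlite)
dat %>%
group_by(v) %>%
summarise(json = as.character(toJSON(pick(everything()))))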
Related
I have a data frame with two string variables and would like to convert them to numeric values using a separate "key" data frame. The example below is simplified, but I need to be able to apply it to replace the contents of the V1 and V2 variables based on an arbitrary key that will not always be a=1, b=2, etc.
Example:
set.seed(1)
x <- data.frame(
V1 = sample(letters, 10, replace = TRUE),
V2 = sample(letters, 10, replace = TRUE)
)
key <- data.frame(letters = letters, values = 1:26)
I need to reference the first element of V1 against the key, replace it with the corresponding value (e.g. a = 1, b = 2, etc.), do the same for the second element, and then, when done with V1, move on and do the same for V2.
I've been struggling to work out a solution using lapply() and sub() but keep getting stuck because I can't see a way to pass the sub() function more than a 1:1 comparison. Is there a different function I should be using?
Forgive me- I'm sure the solution must be simple but I'm quite new to R still.
Here are two base R approaches:
using sapply()
x[] <- with(key, sapply(x, function(v) values[match(v,letters)]))
or
x <- data.frame(with(key, sapply(x, function(v) values[match(v,letters)])))
using as.matrix() (similar to the unlist() approach by @Ronak Shah)
x[] <- with(key, values[match(as.matrix(x),letters)])
You can create a lookup table with data.table and then apply the mapping along the columns of your data frame with apply:
library(data.table)
key <- data.table(letters = letters, value = 1:26, key = "letters")
apply(x, 2, function(col) key[col]$value)
      V1 V2
 [1,] 25  1
 [2,]  4 21
 [3,]  7 21
 [4,]  1 10
 [5,]  2 22
 [6,] 23 14
 [7,] 11 10
 [8,] 14  7
 [9,] 18  9
[10,] 19 15
You could unlist and match in base R
x[] <- key$values[match(unlist(x), key$letters)]
x
# V1 V2
#1 25 1
#2 4 21
#3 7 21
#4 1 10
#5 2 22
#6 23 14
#7 11 10
#8 14 7
#9 18 9
#10 19 15
Or using dplyr
library(dplyr)
x %>% mutate_all(~key$values[match(., key$letters)])
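As a side note (my addition): mutate_all() is superseded in recent dplyr releases, so the same mapping can also be written with across():
library(dplyr)
x %>% mutate(across(everything(), ~ key$values[match(.x, key$letters)]))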
data
set.seed(1)
x <- data.frame(
V1 = sample(letters, 10, replace = TRUE),
V2 = sample(letters, 10, replace = TRUE)
)
key <- data.frame(letters = letters, values = 1:26)
You could use apply with both row and column margins, e.g., as.data.frame(apply(x, c(1, 2), function(l) key[key$letters == l, 2])).
I would like to join the two data frames:
a <- data.frame(x=c(1,3,5))
b <- data.frame(start=c(0,4),end=c(2,6),y=c("a","b"))
with a condition like (x>start)&(x<end) in order to get such a result:
# x y
#1 1 a
#2 3 <NA>
#3 5 b
I don't want to build a potentially large cartesian product and then keep only the few rows matching the condition, and I'd like a solution using the tidyverse (I am not interested in a solution using SQL, which would be a confession of failure). I thought of the fuzzyjoin package, but I cannot find examples fitting my need: the function applied for the condition takes only two arguments. I also tried to put start and end into a single argument with data.frame(z=I(purrr::map2(b$start,b$end,list)),y=b$y)
# z y
#1 0, 2 a
#2 4, 6 b
but although the data looks fine fuzzy_left_join doesn't accept it.
I search for solutions working in more general cases (n variables on the LHS, m on the RHS, not necessarily numeric with arbitrary conditions).
UPDATE
I also want to be able to express conditions like (x == start + 1) | (x == end + 1), which here gives:
# x y
#1 1 a
#2 3 a
#3 5 b
For this case you don't need multi_by or multi_match_fun; this works:
library(fuzzyjoin)
fuzzy_left_join(a, b, by = c(x = "start", x = "end"), match_fun = list(`>`, `<`))
# x start end y
# 1 1 0 2 a
# 2 3 NA NA <NA>
# 3 5 4 6 b
I eventually went through the code of fuzzy_join and found a way to do what I want, even without proper documentation. fuzzy_left_join doesn't work here, but the following does (it's not really pretty, and it actually does a cartesian product):
g <- function(x,y) (x>y[,"start"])&(x<y[,"end"])
fuzzy_join(a,b, multi_by = list(x="x",y=c("start","end"))
, multi_match_fun = g, mode = "left") %>% select(x,y)
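The same mechanism appears to handle the OR condition from the update as well; a sketch along the same lines (my extrapolation from the snippet above, not separately verified):
g2 <- function(x, y) (x == y[, "start"] + 1) | (x == y[, "end"] + 1)
fuzzy_join(a, b, multi_by = list(x = "x", y = c("start", "end")),
           multi_match_fun = g2, mode = "left") %>% select(x, y)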
A data.table approach could be
library(data.table)
name1 <- setdiff(names(setDT(b)), names(setDT(a)))
#perform left outer join and then select required columns
a[b, (name1) := mget(name1), on = .(x > start, x < end)][, .(x, y)]
which gives
x y
1: 1 a
2: 3 <NA>
3: 5 b
Sample data:
a <- data.frame(x = c(1, 3, 5))
b <- data.frame(start = c(0, 4), end = c(2, 6), y = c("a", "b"))
Update: in case you want to join both data frames on the (x == start + 1) | (x == end + 1) condition, you can try
library(data.table)
DT1 <- as.data.table(a)
DT2 <- as.data.table(b)
#Perform 1st join on "x = start+1" and then another on "x = end+1". Finally row-bind both results.
DT <- rbindlist(list(DT1[DT2[, start_temp := start+1], on = c(x = "start_temp"), .(x, y), nomatch = 0],
DT1[DT2[, end_temp := end+1], on = c(x = "end_temp"), .(x, y), nomatch = 0]))
DT
# x y
#1: 1 a
#2: 5 b
#3: 3 a
A possible answer, to explain what I am trying to do: extend dplyr in some way. I will be happy to know if there are ways to improve this solution or problems I didn't see.
The solution avoids the cartesian product, but it duplicates one of the input data frames, and the result, into lists of data frames. I didn't include the final selection of the x and y columns, which is easy to code.
library(dplyr)  # for filter, mutate, left_join, select
my_left_join <- function(.DATA1,.DATA2,.WHERE)
{
call = as.list(match.call())
df1 <- .DATA1
df1$._row_ <- 1:nrow(df1)
dfl1 <- replyr::replyr_split(df1,"._row_")
eval(substitute(
dfl2 <- mapply(function(.x)
{filter(.DATA2,with(.x,WHERE)) %>%
mutate(._row_=.x$._row_)}
, dfl1, SIMPLIFY=FALSE)
,list(WHERE=call$.WHERE)))
df2 <- replyr::replyr_bind_rows(dfl2)
left_join(df1,df2,by="._row_") %>% select(-._row_)
}
my_left_join(a,b,(x>start)&(x<end))
# x start end y
#1 1 0 2 a
#2 3 NA NA <NA>
#3 5 4 6 b
my_left_join(a,b,(x==(start+1))|(x==(end+1)))
# x start end y
#1 1 0 2 a
#2 3 0 2 a
#3 5 4 6 b
You can try a GenomicRanges solution
library(GenomicRanges)
# setup GRanges objects
a_gr <- GRanges(1, IRanges(a$x,a$x))
b_gr <- GRanges(1, IRanges(b$start, b$end))
# find overlaps between the two data sets
res <- as.data.frame(findOverlaps(a_gr,b_gr))
# create the expected output
a$y <- NA
a$y[res$queryHits] <- as.character(b$y)[res$subjectHits]
a
x y
1 1 a
2 3 <NA>
3 5 b
I'm looking at a problem where we are trying to create time series differences for a large number of pairs (40,000+), where each time series has 150+ daily points.
Each row represents a pair of two individuals we want to compare:
pairs = data.frame("number" = c(1,2,3,4),
"name1" = c("A","B","C","D"),
"name2" = c("B","D","D","A")
)
pairs$name1 <- as.character(pairs$name1)
pairs$name2 <- as.character(pairs$name2)
Each row represents the time series data for a particular individual:
ts = data.frame("name" = c("A","B","C","D"),
"day1" = c(10,12,54,13),
"day2" = c(2,8,47,29),
"day3" = c(1,5,14,36)
)
ts$name <- as.character(ts$name)
I have the following R code, where the goal is to create, for each pair of individuals (so 4 in my example), a new data frame that has the difference of their daily time series from the ts data frame. This works but is extremely slow when I try to run it on my real data set, where pairs has 40,000 rows and ts about 150 columns. Does anyone have an idea of how I could speed this up? I tried to use lapply but can't figure out how to create the difference and store it in a new time series. Thanks!!
diffs<-data.frame(matrix(ncol=ncol(ts)))
colnames(diffs)<-colnames(ts)
for (row in 1:nrow(pairs)){
row1<-ts[(ts$name==pairs[row,"name1"]),]
row2<-ts[(ts$name==pairs[row,"name2"]),]
difference<-rbind(row1,row2)
difference[3,1]<-pairs[row,"number"]
difference[3,2:ncol(difference)]<-difference[1,2:ncol(difference)]-difference[2,2:ncol(difference)]
diffs<-rbind(diffs,difference[3,])
}
A few remarks up front:
i) data.frame() has an argument stringsAsFactors, which you can set to FALSE, i.e.:
pairs = data.frame(
"number" = c(1,2,3,4),
"name1" = c("A","B","C","D"),
"name2" = c("B","D","D","A"),
stringsAsFactors = FALSE
)
ii) speeding up your code is not really a matter of replacing the for-loop with an apply, but rather a matter of data structures and processing efficiency. Relying on a package / function that internally loops in C++ rather than R or writing C++ code yourself will give you the biggest boost.
iii) I'll also provide a larger dummy example here, so that you and others can test and compare timings a little more easily:
# all combination of LETTERS, including identity pairs like A~A
pairs = cbind.data.frame(
"number" = seq(1, 676),
setNames(expand.grid(LETTERS, LETTERS), nm = c("name1", "name2"))
)
# expand.grid produces factor columns
pairs$name1 <- as.character(pairs$name1)
pairs$name2 <- as.character(pairs$name2)
ts = cbind.data.frame(
"name" = LETTERS,
matrix(sample.int(100, 150*26, replace = TRUE), ncol = 150),
stringsAsFactors = FALSE
)
names(ts)[-1] <- paste0("day", names(ts)[-1])
iv) an improved version of your loop could then look like:
# initialize full matrix (since the ID is a number too), allocating necessary memory
diffs2 <- matrix(0, ncol = ncol(ts), nrow = nrow(pairs))
colnames(diffs2) <- colnames(ts)
# first column is given
diffs2[, 1] <- pairs$number
for (row in 1:nrow(pairs)) {
row1 <- as.vector(as.matrix(ts[ts$name==pairs[row,"name1"], -1]))
row2 <- as.vector(as.matrix(ts[ts$name==pairs[row,"name2"], -1]))
diffs2[row, -1] <- row1 - row2
}
This is already several times faster than what you had, but it illustrates the awkwardness of keeping a time series in a data.frame: it should instead be an object of a class that allows working with the numeric data more directly and efficiently (there are several packages that offer time-series classes).
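To illustrate the point about data structures, here is a minimal sketch (my addition, using the larger dummy data from iii): treating the daily columns as a plain numeric matrix lets one vectorised subtraction compute all 676 differences at once, with row names as the lookup.
# daily values as a numeric matrix, indexed by individual name
m <- as.matrix(ts[, -1])
rownames(m) <- ts$name
# character subscripts pick the rows for each pair; one subtraction does it all
diffs3 <- m[pairs$name1, , drop = FALSE] - m[pairs$name2, , drop = FALSE]
diffs3 <- cbind(number = pairs$number, diffs3)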
Now for an answer that is still fairly simple yet rather fast, using dplyr and tidyr:
# simple way of measuring time
start <- Sys.time()
xx <- tidyr::gather(ts, key = "day", value = "value", 2:151)
yy <- dplyr::left_join(pairs, xx, by = c("name1" = "name"))
zz <- dplyr::left_join(yy, xx, by = c("name2" = "name", "day" = "day"))
res <- dplyr::mutate(zz, diff = value.x - value.y)
end <- Sys.time()
duration <- end - start
duration
Time difference of 0.06700397 secs
You can also try the approaches from the previous two answers; it's clear that the mapply solution will be slow, and the data.table one isn't working fully yet and already looks slower and more complicated.
I have a data.table solution to help.
The idea is to switch to long format to be able to use grouping operations (equivalent to apply) and create permuted columns to make the pairs:
name1idx <- unlist(lapply(pairs$name1, function(x) grep(x, ts$name)))
name2idx <- unlist(lapply(pairs$name2, function(x) grep(x, ts$name)))
plouf <- melt(setDT(ts), measure.vars = patterns("^day"), variable.name = "day")
plouf[, name1 := name[name1idx], by = day]
plouf[, value1 := value[name1idx], by = day]
plouf[, name2 := name[name2idx], by = day]
plouf[, value2 := value[name2idx], by = day]
plouf[, diff := value1 - value2]
plouf[, .(day, diff), by = .(name1, name2)]
name1 name2 day diff
1: A B day1 -2
2: A B day2 -6
3: A B day3 -4
4: B D day1 -1
5: B D day2 -21
6: B D day3 -31
7: C D day1 41
8: C D day2 18
9: C D day3 -22
10: D A day1 3
11: D A day2 27
12: D A day3 35
name1idx and name2idx are the indices of ts$name corresponding to pairs$name1 and pairs$name2. This way you can get all the pairs.
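As an aside (my suggestion, not part of the original answer): match() is safer than grep() for building these indices, because grep() does partial regex matching and could misfire with multi-character names:
name1idx <- match(pairs$name1, ts$name)
name2idx <- match(pairs$name2, ts$name)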
I was looking for a solution in which the use of column names is dynamic and no column name other than name is used. mapply and dplyr have been used for this solution.
library(dplyr)
# A function which will filter value based on pairs
matchPair <- function(x, y){
matchedRow <- ts %>%
filter(name == x | name == y) %>%
select(-name)
data.frame(diff(as.matrix(matchedRow))) %>%
mutate(name = paste0(x, '~',y))
}
df.r <-do.call(rbind,mapply(matchPair, pairs$name1, pairs$name2,
SIMPLIFY = FALSE))
# Row names are not meaningful. Hence remove those.
row.names(df.r) <- NULL
#Result
#> df.r
# day1 day2 day3 name
#1 2 6 4 A~B
#2 1 21 31 B~D
#3 -41 -18 22 C~D
#4 3 27 35 D~A
** edited because I'm a doofus - with replacement, not without **
I have a large-ish (>500k rows) dataset with 421 groups, defined by two grouping variables. Sample data as follows:
df <- data.frame(group_one = rep(0:9, 26), group_two = rep(letters, 10))
head(df)
group_one group_two
1 0 a
2 1 b
3 2 c
4 3 d
5 4 e
6 5 f
...and so on.
What I want is some number (k = 12 at the moment, but that number may vary) of stratified samples, by membership in (group_one x group_two). Sample membership should be indicated by a new column, sample_membership, which takes a value of 1 through k (again, 12 at the moment). I should be able to subset by sample_membership and get up to 12 distinct samples, each of which is representative when considering group_one and group_two.
Final data set would thus look something like this:
group_one group_two sample_membership
1 0 a 1
2 0 a 12
3 0 a 5
4 1 a 5
5 1 a 7
6 1 a 9
Thoughts? Thanks very much in advance!
Maybe something like this?
library(dplyr)
df %>%
group_by(group_one, group_two) %>%
mutate(sample_membership = sample(1:12, n(), replace = FALSE))
Here's a one-line data.table approach, which you should definitely consider if you have a long data.frame.
library(data.table)
setDT(df)
df[, sample_membership := sample.int(12, .N, replace=TRUE), keyby = .(group_one, group_two)]
df
# group_one group_two sample_membership
# 1: 0 a 9
# 2: 0 a 8
# 3: 0 c 10
# 4: 0 c 4
# 5: 0 e 9
# ---
# 256: 9 v 4
# 257: 9 x 7
# 258: 9 x 11
# 259: 9 z 3
# 260: 9 z 8
For sampling without replacement, use replace=FALSE, but as noted elsewhere, make sure each group has at most k members. OR:
If you want to use "sampling without unnecessary replacement" (making this up -- not sure what the right terminology is here) because you have more than k members per group but still want to keep the groups as evenly sized as possible, you could do something like:
# example with bigger groups
k <- 12L
big_df <- data.frame(group_one = rep(0:9, 260), group_two = rep(letters, 100))
setDT(big_df)
big_df[, sample_round := rep(1:.N, each=k, length.out=.N), keyby = .(group_one, group_two)]
big_df[, sample_membership := sample.int(k, .N, replace=FALSE), keyby = .(group_one, group_two, sample_round)]
head(big_df, 15) # you can see first repeat does not occur until row k+1
Within each "sampling round" (first k observations in the group, second k observations in the group, etc.) there is sampling without replacement. Then, if necessary, the next sampling round makes all k assignments available again.
This approach stratifies the sample as evenly as possible (perfectly even is only achievable when each group's size is a multiple of k).
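A quick sanity check of the balance, as a sketch (assuming big_df from above, where each group has 20 members): with k = 12, each assignment should occur once or twice per group.
big_df[group_one == 0 & group_two == "a", table(sample_membership)]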
Here is a base R method, that assumes that your data.frame is sorted by groups:
# get number of observations for each group
groupCnt <- with(df, aggregate(group_one, list(group_one, group_two), FUN=length))$x
# for reproducibility, set the seed
set.seed(1234)
# get sample by group
df$sample <- c(sapply(groupCnt, function(i) sample(12, i, replace=TRUE)))
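If the data frame is not already sorted to match, note that aggregate() returns its groups with the first by-variable varying fastest, so sort with the last grouping variable as the primary key; a one-line sketch (my addition):
df <- df[order(df$group_two, df$group_one), ]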
Untested example using dplyr; if it doesn't work, it might point you in the right direction.
library( dplyr )
set.seed(123)
df <- data.frame(
group_one = as.integer( runif( 1000, 1, 6) ),
group_two = sample( LETTERS[1:6], 1000, TRUE)
) %>%
group_by( group_one, group_two ) %>%
mutate(
sample_membership = sample( seq(1, length(group_one) ), length(group_one), FALSE)
)
Good luck!
I have searched the internet, but I haven't been able to find a solution to my problem.
I have a data frame of numbers and characters:
mydf <- data.frame(col1=c(1, 2, 3, 4),
col2 = c(5, 6, 7, 8),
col3 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
mydf:
col1 col2 col3
1 5 a
2 6 b
3 7 c
4 8 d
I would like to repeat each row three times to get:
col1 col2 col3
1 5 a
1 5 a
1 5 a
2 6 b
2 6 b
2 6 b
3 7 c
3 7 c
3 7 c
4 8 d
4 8 d
4 8 d
Using apply(mydf, 2, function(x) rep(x, each = 3)) gives the right repetition, but it does not conserve the classes of col1, col2, and col3 (numeric, numeric, and character, respectively) because apply coerces the data frame to a character matrix. This is a constructed example; setting the classes of each column in my real data frame is a bit tedious.
Is there a way to make the repetition while conserving the classes?
It's even easier than you think.
index <- rep(seq_len(nrow(mydf)), each = 3)
mydf[index, ]
This also avoids the implicit looping from apply.
This is an unfortunate and unexpected class conversion (to me, anyway). Here's an easy workaround that uses the fact that a data.frame is just a special list.
data.frame(lapply(mydf, function(x) rep(x, each = 3)))
(anyone know why the behaviour the questioner observed shouldn't be reported as a bug?)
Just another solution (note that this stacks whole copies of mydf, so repeats of a given row are not adjacent; reorder afterwards if you need them to be):
mydf3 <- do.call(rbind, rep(list(mydf), 3))
Take a look at aggregate and disaggregate in the raster package. Or, use my modified version zexpand below:
# zexpand: analogous to disaggregate
zexpand <- function(inarray, fact = 2, interp = FALSE, ...) {
# do same analysis of fact to allow one or two values, fact >= 1 required, etc.
fact <- as.integer(round(fact))
switch(as.character(length(fact)),
'1' = xfact <- yfact <- fact,
'2' = {xfact <- fact[1]; yfact <- fact[2]},
{xfact <- fact[1]; yfact <- fact[2]; warning('fact is too long. First two values used.')})
if (xfact < 1) { stop('fact[1] must be > 0') }
if (yfact < 1) { stop('fact[2] must be > 0') }
# column expansion
bigtmp <- matrix(rep(t(inarray), each = xfact), nrow(inarray), ncol(inarray) * xfact, byrow = TRUE)
# row expansion
bigx <- t(matrix(rep(bigtmp, each = yfact), ncol(bigtmp), nrow(bigtmp) * yfact, byrow = TRUE))
# the interpolation would go here. Or use interp.loess on the output (won't
# handle complex data). Also, look at fields::Tps, which probably does
# a much better job anyway. Just do it separately on the Re and Im data.
return(invisible(bigx))
}
I really like Richie Cotton's answer.
But you could also simply use rbind and reorder it.
res <- rbind(mydf, mydf, mydf)
res[order(res[, 1], res[, 2], res[, 3]), ]
The package mefa comes with a nice wrapper for rep applied to data.frame. This will match your example in one line:
mefa:::rep.data.frame(mydf, each=3)