I would like to turn the first table into the second by selecting the last observation of a group for a and b, the first observation for c, sum each observation for the group for d and e, and for f, check if a valid date exists and use that date.
Table 1:
ID a b c d e f
1 10 100 1000 10000 100000 ?
1 10 100 1001 10010 100100 5/07/1977
1 11 111 1002 10020 100200 5/07/1977
2 22 222 2000 20000 200000 6/02/1980
3 33 333 3000 30000 300000 20/12/1978
3 33 333 3001 30010 300100 ?
4 40 400 4000 40000 400000 ?
4 40 400 4001 40010 400100 ?
4 40 400 4002 40020 400200 7/06/1944
4 44 444 4003 40030 400300 ?
4 44 444 4004 40040 400400 ?
4 44 444 4005 40050 400500 ?
5 55 555 5000 50000 500000 31/05/1976
5 55 555 5001 50010 500100 31/05/1976
Table 2:
ID a b c d e f
1 11 111 1000 30030 300300 5/07/1977
2 22 222 2000 20000 200000 6/02/1980
3 33 333 3000 60010 600100 20/12/1978
4 44 444 4000 240150 2401500 7/06/1944
5 55 555 5000 100010 1000100 31/05/1976
I have looked up StackOverflow questions and I have only seen elements of this. I can do a through to e in the following steps.
library(data.table)
setwd('D:/Work/BRB/StackOverflow')
DT = data.table(fread('datatable.csv', header=TRUE))
AB = DT[ , .SD[.N], ID ]
AB = AB[ , c('a', 'b') ]
C = DT[ , .SD[1], ID ]
C = C[ , 'c' ]
DE = DT[ , .(d = sum(d), e = sum(e)) , by = ID ]
Final = cbind(AB, C, DE)
Final
My question is, can I do the operations on variables a, b, c, d, e in one transformation without having to split it into 3?
Also, I have no idea how to do f. Any suggestions?
Finally, I am new to R. Anything else I can improve about my code?
There are several things you can improve:
fread will return a data.table, so no need to wrap it in data.table. You can check with class(DT).
Use the na.strings parameter when reading in the data. See below for an example.
Summarise with:
DT[, .(a = a[.N],
       b = b[.N],
       c = c[1],
       d = sum(d),
       e = sum(e),
       f = unique(na.omit(f))),
   by = ID]
you will then get:
ID a b c d e f
1: 1 11 111 1000 30030 300300 5/07/1977
2: 2 22 222 2000 20000 200000 6/02/1980
3: 3 33 333 3000 60010 600100 20/12/1978
4: 4 44 444 4000 240150 2401500 7/06/1944
5: 5 55 555 5000 100010 1000100 31/05/1976
Some explanations & other notes:
Subsetting with [1] will give you the first value of a group. You could also use the first-function which is optimized in data.table, and thus faster.
Subsetting with [.N] will give you the last value of a group. You could also use the last-function which is optimized in data.table, and thus faster.
Don't use variable names that are also functions in R (in this case, don't use c as a variable name). See also ?c for an explanation of what the c-function does.
For summarising the f-variable, I used unique in combination with na.omit. If there is more than one unique date by ID, you could also use for example na.omit(f)[1].
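As a minimal sketch of the difference mentioned in the last note, the toy table below (made-up values, not the question's data) has one ID with two distinct dates and one ID with no valid date at all. With na.omit(f)[1] each group still yields exactly one row, whereas unique(na.omit(f)) can yield more than one row for a group.

```r
library(data.table)

# Hypothetical toy data: ID 1 has two different dates, ID 2 has none
toy <- data.table(ID = c(1, 1, 1, 2, 2),
                  f  = c(NA, "5/07/1977", "6/07/1977", NA, NA))

# Take the first non-missing date per group; a group without any
# valid date gets NA instead of being dropped or duplicated
res <- toy[, .(f = na.omit(f)[1]), by = ID]
print(res)
```

Whether the first valid date is the right choice when a group has several depends on the data, so treat this as one possible convention.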
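If speed is an issue, you could optimize the above to (thanks to @Frank):
DT[order(f),
   .(a = last(a),
     b = last(b),
     c = first(c),
     d = sum(d),
     e = sum(e),
     f = first(f)),
   by = ID]
Ordering by f puts NA values last, so first(f) returns a valid date whenever the group has one, and the internal GForce optimization is then used for all calculations. Note that ordering by f also reorders the rows within each ID, so last(a), last(b) and first(c) are taken from the reordered rows; compare the result with the version above before relying on it.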
Used data:
DT <- fread("ID a b c d e f
1 10 100 1000 10000 100000 ?
1 10 100 1001 10010 100100 5/07/1977
1 11 111 1002 10020 100200 5/07/1977
2 22 222 2000 20000 200000 6/02/1980
3 33 333 3000 30000 300000 20/12/1978
3 33 333 3001 30010 300100 ?
4 40 400 4000 40000 400000 ?
4 40 400 4001 40010 400100 ?
4 40 400 4002 40020 400200 7/06/1944
4 44 444 4003 40030 400300 ?
4 44 444 4004 40040 400400 ?
4 44 444 4005 40050 400500 ?
5 55 555 5000 50000 500000 31/05/1976
5 55 555 5001 50010 500100 31/05/1976", na.strings='?')
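If "valid date" should mean a value that actually parses as a date, one option is to convert f with as.Date. The sketch below assumes the dates are in day/month/year form (which values such as 20/12/1978 suggest) and uses a small stand-alone vector rather than the full table.

```r
# Assumed format: day/month/year, e.g. "20/12/1978"
f <- c("?", "5/07/1977", "20/12/1978")

# Strings that do not match the format become NA
d <- as.Date(f, format = "%d/%m/%Y")
print(d)

# First valid date, if any
first_valid <- d[!is.na(d)][1]
print(first_valid)
```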
We can use tidyverse. After grouping by 'ID', we summarise the columns based on the first or last observation
library(dplyr)
DT %>%
  group_by(ID) %>%
  summarise(a = last(a),
            b = last(b),
            c = first(c),
            d = sum(d),
            e = sum(e),
            f = f[f!="?"][1])
# A tibble: 5 × 7
# ID a b c d e f
# <int> <int> <int> <int> <int> <int> <chr>
#1 1 11 111 1000 30030 300300 5/07/1977
#2 2 22 222 2000 20000 200000 6/02/1980
#3 3 33 333 3000 60010 600100 20/12/1978
#4 4 44 444 4000 240150 2401500 7/06/1944
#5 5 55 555 5000 100010 1000100 31/05/1976
Related
I've got a very large dataset (millions of rows that I need to loop through thousands of times), and during the loop I have to do a conditional sum that appears to be taking a very long time. Is there a way of making this more efficient?
Datatable format as follows:
DT <- data.table('A' = c(1,1,1,2,2,3,3,3,3,4),
'B' = c(500,510,540,500,540,500,510,519,540,500),
'C' = c(10,20,10,20,10,50,20,50,20,10))
A B C
1 500 10
1 510 20
1 540 10
2 500 20
2 540 10
3 500 50
3 510 20
3 519 50
3 540 20
4 500 10
I need the sum of column C (in a new column, D) subject to A == A, and B >= B & B < B + 20 (by row). So the output table would look like the following:
A B C D
1 500 10 30
1 510 20 30
1 540 10 10
2 500 20 20
2 540 10 10
3 500 50 120
3 510 20 120
3 519 50 120
3 540 20 20
4 500 10 10
The code I'm currently using:
DT[,D:= sum(DT$C[A == DT$A & ((B >= DT$B) & (B < DT$B + 20))]), by=c('A', 'B')]
This takes a very long time to actually run, as well as giving me the wrong answer. The output I get looks like this:
A B C D
1 500 10 10
1 510 20 30
1 540 10 10
2 500 20 20
2 540 10 10
3 500 50 50
3 510 20 70
3 519 50 120
3 540 20 20
4 500 10 10
(i.e. D only appears to increase cumulatively).
I'm less concerned with the cumulative thing, more about speed. Ultimately what I'm trying to get to is the largest sum of C, by A, subject to the B values being within 20 of each other. I would really appreciate any help on this! Thanks in advance.
If I understand correctly, this can be solved by a non-equi self join:
DT[, Bp20 := B + 20][
DT, on = .(A, B >= B, B < Bp20), mult = "last"][
, .(B, C = i.C, D = sum(i.C)), by = .(A, Bp20)][
, Bp20 := NULL][]
A B C D
1: 1 500 10 30
2: 1 510 20 30
3: 1 540 10 10
4: 2 500 20 20
5: 2 540 10 10
6: 3 500 50 120
7: 3 510 20 120
8: 3 519 50 120
9: 3 540 20 20
10: 4 500 10 10
# logic for B
DT[, g := B >= shift(B) & B < shift(B, 1) + 20, by = A]
# creating index column
DT[, gi := !g]
DT[is.na(gi), gi := T]
DT[, gi := cumsum(gi)]
DT[, D := sum(C), by = gi] # summing by new groups
DT
# A B C g gi D
# 1: 1 500 10 NA 1 30
# 2: 1 510 20 TRUE 1 30
# 3: 1 540 10 FALSE 2 10
# 4: 2 500 20 NA 3 20
# 5: 2 540 10 FALSE 4 10
# 6: 3 500 50 NA 5 120
# 7: 3 510 20 TRUE 5 120
# 8: 3 519 50 TRUE 5 120
# 9: 3 540 20 FALSE 6 20
# 10: 4 500 10 NA 7 10
You might need to adjust the logic for B, since not all edge cases are clear from the question. For example, if for one A value the B values are c(30, 40, 50, 60), are all of those rows in one group?
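The run-based grouping can also be sketched in base R, which makes the edge-case behaviour easy to inspect. This version assumes the rows are already sorted by A and then B, and it starts a new run whenever the gap to the previous B within the same A is 20 or more, so a chain such as 30, 40, 50, 60 ends up in a single run. The column name run is introduced here only for illustration.

```r
DT <- data.frame(A = c(1,1,1,2,2,3,3,3,3,4),
                 B = c(500,510,540,500,540,500,510,519,540,500),
                 C = c(10,20,10,20,10,50,20,50,20,10))

# 1 marks the start of a new run: the first row of each A,
# or a row whose B is 20 or more above the previous B
new_run <- ave(DT$B, DT$A, FUN = function(b) c(1, diff(b) >= 20))
DT$run <- cumsum(new_run)

# Sum C within each run and attach it to every row of the run
DT$D <- ave(DT$C, DT$run, FUN = sum)

# Largest run total per A
best <- tapply(DT$D, DT$A, max)
print(DT)
print(best)
```

If a fixed window starting at each row is wanted instead of chained runs, the grouping rule would have to change accordingly.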
I have two data frames one with quantities and one with prices
Quantities <- data.frame(region=c("US","US","EU","China","EU"),a = 1:5, b = 5:9, c=8:12)
prices_frame <- data.frame(region=c("US","EU","China"),a = c(10,20,30), b = c(10,20,30), c=c(1000,2000,100))
Quantities
region a b c
1 US 1 5 8
2 US 2 6 9
3 EU 3 7 10
4 China 4 8 11
5 EU 5 9 12
Prices
region a b c
1 US 10 10 1000
2 EU 20 20 2000
3 China 30 30 100
Is there a way that I can quickly multiply the quantities with the prices of the matching region without having to loop through the entire quantity frame?
Best Alex
We can use match
Assuming your columns a, b, c are in same order in both the dataframes
Quantities[-1] * prices_frame[match(Quantities$region, prices_frame$region), -1]
# a b c
#1 10 50 8000
#2 20 60 9000
#3 60 140 20000
#4 120 240 1100
#5 100 180 24000
To get the dataframe with the same number of columns,
new_df <- cbind(Quantities[1],
Quantities[-1] * prices_frame[match(Quantities$region, prices_frame$region), -1])
# region a b c
#1 US 10 50 8000
#2 US 20 60 9000
#3 EU 60 140 20000
#4 China 120 240 1100
#5 EU 100 180 24000
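If the value columns are not guaranteed to be in the same order in both data frames, a small variation is to select them by name. The sketch below shows the row index that match produces and then multiplies only the shared, named columns; the helper names cols, idx and result are illustrative.

```r
Quantities <- data.frame(region = c("US","US","EU","China","EU"),
                         a = 1:5, b = 5:9, c = 8:12)
prices_frame <- data.frame(region = c("US","EU","China"),
                           a = c(10,20,30), b = c(10,20,30),
                           c = c(1000,2000,100))

# Columns to multiply, selected by name rather than by position
cols <- setdiff(names(Quantities), "region")

# For each quantity row, the row of prices_frame with the same region
idx <- match(Quantities$region, prices_frame$region)
print(idx)

result <- cbind(Quantities["region"],
                Quantities[cols] * prices_frame[idx, cols])
print(result)
```

A region that is missing from prices_frame gives NA in idx, and therefore NA in the products for that row.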
We can use a join in data.table
library(data.table)
nm1 <- names(Quantities)[-1]
setDT(Quantities)[prices_frame, (nm1) := Map(`*`, mget(nm1),
mget(paste0("i.", nm1))) , on = "region"]
Quantities
# region a b c
#1: US 10 50 8000
#2: US 20 60 9000
#3: EU 60 140 20000
#4: China 120 240 1100
#5: EU 100 180 24000
I have a data frame df1:
chr = c( 1,1,1,1,2,2,2,2)
point = c (257,752,135,1650,252,756,1230,1710)
df1 = data.frame(chr, point)
chr point
1 1 257
2 1 752
3 1 135
4 1 1650
5 2 252
6 2 756
7 2 1230
8 2 1710
I would like to add a new column to this called name. The name to be allocated comes from a reference data frame df2:
chrB = c( 1,1,1,1,2,2,2,2)
txstart = c(0,501,1001,1501,0,501,1001,1501)
txstop = c(500,1000,1500,2000,500,1000,1500,2000)
name2 = c("F","W","Q","G","V","S","L","Y")
chrB txstart txstop name2
1 2 0 500 F
2 2 501 1000 W
3 2 1001 1500 Q
4 2 1501 2000 G
5 1 0 500 V
6 1 501 1000 S
7 1 1001 1500 L
8 1 1501 2000 Y
Where chr in df1 is the same as chrB in df2 AND point in df1 lies between values txstart and txstop the name2 in df2 should be added to df1. result I would like is below:
chr point name
1 1 257 V
2 1 752 S
3 1 135 L
4 1 1650 Y
5 2 252 F
6 2 756 W
7 2 1230 Q
8 2 1710 G
Any help much appreciated!!!
With the updated dataset only the foverlaps method works:
dt1 <- data.table(chr, mp1 = point, mp2 = point,
key = c("chr","mp1", "mp2"))
dt2 <- data.table(chrB, txstart, txstop, name2,
key = c("chrB","txstart", "txstop"))
foverlaps(dt1, dt2, type="within")[, .(chr, midpoint=mp1, name=name2)][]
which gives:
chr midpoint name
1: 1 135 F
2: 1 257 F
3: 1 752 W
4: 1 1650 G
5: 2 252 V
6: 2 756 S
7: 2 1230 L
8: 2 1710 Y
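For small inputs, the same lookup can be sketched in base R by merging on the chromosome and then keeping only the rows where point falls inside the interval. This assumes df2 is built from the vectors exactly as defined in the question's code, treats both interval ends as inclusive, and adds a helper column row (introduced here) to restore the original order of df1. Because merge creates every point and interval combination per chromosome first, it is likely to scale worse than foverlaps on large data.

```r
chr   <- c(1,1,1,1,2,2,2,2)
point <- c(257,752,135,1650,252,756,1230,1710)
df1   <- data.frame(chr, point)

chrB    <- c(1,1,1,1,2,2,2,2)
txstart <- c(0,501,1001,1501,0,501,1001,1501)
txstop  <- c(500,1000,1500,2000,500,1000,1500,2000)
name2   <- c("F","W","Q","G","V","S","L","Y")
df2     <- data.frame(chrB, txstart, txstop, name2)

# Remember the original row order of df1
df1$row <- seq_len(nrow(df1))

# All point/interval pairs on the same chromosome
m <- merge(df1, df2, by.x = "chr", by.y = "chrB")

# Keep pairs where the point lies inside the interval (inclusive)
m <- m[m$point >= m$txstart & m$point <= m$txstop, ]

# Restore df1's order and keep the columns of interest
m <- m[order(m$row), c("chr", "point", "name2")]
names(m)[3] <- "name"
print(m)
```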
Old answer:
When you want to look whether the midpoint is between the start and stop point of df2, you could use:
df1$name <- df2$name2[match(df1$chr,df2$chrB) &
df1$midpoint > df2$txstart &
df1$midpoint < df2$txstop]
which gives:
> df1
chr midpoint name
1 1 250 F
2 1 750 W
3 1 1250 Q
4 1 1750 G
5 2 250 V
6 2 750 S
7 2 1250 L
8 2 1750 Y
As an alternative approach, you could use the foverlaps function from the data.table package:
library(data.table)
dt1 <- data.table(chr, mp1 = midpoint, mp2 = midpoint, key = c("chr","mp1", "mp2"))
dt2 <- data.table(chrB, txstart, txstop, name2, key = c("chrB","txstart", "txstop"))
foverlaps(dt1, dt2, type="within", nomatch=0L)[, .(chr, midpoint=mp1, name=name2)][]
which gives the same result:
chr midpoint name
1: 1 250 F
2: 1 750 W
3: 1 1250 Q
4: 1 1750 G
5: 2 250 V
6: 2 750 S
7: 2 1250 L
8: 2 1750 Y
I have a data.table as below:
order products value
1000 A|B 10
2000 B|C 20
3000 A|C 30
4000 B|C|D 5
5000 C|D 15
And I need to break the column products and transform/normalize to be used like this:
order prod.seq prod.name value
1000 1 A 10
1000 2 B 10
2000 1 B 20
2000 2 C 20
3000 1 A 30
3000 2 C 30
4000 1 B 5
4000 2 C 5
4000 3 D 5
5000 1 C 15
5000 2 D 15
I guess I can do it using a custom FOR/LOOP but I'd like to know a more advanced way to do that using apply,ddply methods. Any suggestions?
First, convert to a character/string:
DT[,products:=as.character(products)]
Then you can split the string:
DT[,{
x = strsplit(products,"\\|")[[1]]
list( prod.seq = seq_along(x), prod_name = x )
}, by=.(order,value)]
which gives
order value prod.seq prod_name
1: 1000 10 1 A
2: 1000 10 2 B
3: 2000 20 1 B
4: 2000 20 2 C
5: 3000 30 1 A
6: 3000 30 2 C
7: 4000 5 1 B
8: 4000 5 2 C
9: 4000 5 3 D
10: 5000 15 1 C
11: 5000 15 2 D
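The same split can be sketched without grouping, using only base R: split every string once, then repeat the other columns by the number of pieces. The data frame dat below is rebuilt from the question's table, and the output column names follow the question's desired layout.

```r
dat <- data.frame(order    = c(1000, 2000, 3000, 4000, 5000),
                  products = c("A|B", "B|C", "A|C", "B|C|D", "C|D"),
                  value    = c(10, 20, 30, 5, 15),
                  stringsAsFactors = FALSE)

# fixed = TRUE treats "|" literally, so no regex escaping is needed
parts <- strsplit(dat$products, "|", fixed = TRUE)
n     <- lengths(parts)

out <- data.frame(order     = rep(dat$order, n),
                  prod.seq  = sequence(n),   # 1..n within each order
                  prod.name = unlist(parts),
                  value     = rep(dat$value, n))
print(out)
```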
Here is another option
library(splitstackshape)
out = cSplit(dat, "products", "|", direction = "long")
out[, prod.seq := seq_len(.N), by = value]
#> out
# order products value prod.seq
# 1: 1000 A 10 1
# 2: 1000 B 10 2
# 3: 2000 B 20 1
# 4: 2000 C 20 2
# 5: 3000 A 30 1
# 6: 3000 C 30 2
# 7: 4000 B 5 1
# 8: 4000 C 5 2
# 9: 4000 D 5 3
#10: 5000 C 15 1
#11: 5000 D 15 2
After cSplit step, using ddply
library(plyr)
ddply(out, .(value), mutate, prod.seq = seq_len(length(order)))
using dplyr
library(dplyr)
out %>% group_by(value) %>% mutate(prod.seq = row_number(order))
using lapply
rbindlist(lapply(split(out, out$value),
function(x){x$prod.seq = seq_len(length(x$order));x}))
I have a dataframe which contains information about several categories, and some associated variables. It is of the form:
ID category sales score
227 A 109 21
131 A 410 24
131 A 509 1
123 B 2 61
545 B 19 5
234 C 439 328
654 C 765 41
What I would like to do is be able to introduce two new columns, salesRank and scoreRank, where I find the item index per category, had they been ordered by sales and score, respectively. I can solve the general case like this:
dF <- dF[order(-dF$sales),]
dF$salesRank<-seq.int(nrow(dF))
but this doesn't account for the categories and so far I've only solved this by breaking up the dataframe. What I want would result in the following:
ID category sales score salesRank scoreRank
227 A 109 21 3 2
131 A 410 24 2 1
131 A 509 1 1 3
123 B 2 61 2 1
545 B 19 5 1 2
234 C 439 328 2 1
654 C 765 41 1 2
Many thanks!
Try:
library(dplyr)
df %>%
group_by(category) %>%
mutate(salesRank = row_number(desc(sales)),
scoreRank = row_number(desc(score)))
Which gives:
#Source: local data frame [7 x 6]
#Groups: category
#
# ID category sales score salesRank scoreRank
#1 227 A 109 21 3 2
#2 131 A 410 24 2 1
#3 131 A 509 1 1 3
#4 123 B 2 61 2 1
#5 545 B 19 5 1 2
#6 234 C 439 328 2 1
#7 654 C 765 41 1 2
From the help:
row_number(): equivalent to rank(ties.method = "first")
min_rank(): equivalent to rank(ties.method = "min")
desc(): transform a vector into a format that will be sorted in descending
order.
As @thelatemail pointed out, for this particular dataset you might want to use min_rank() instead of row_number(), which handles ties in sales/score more appropriately:
> row_number(c(1,2,2,4))
#[1] 1 2 3 4
> min_rank(c(1,2,2,4))
#[1] 1 2 2 4
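Since the help text describes both functions in terms of base R's rank, the tie handling can be checked without dplyr. The sketch below also shows rank's default, "average", which is what a plain call to rank uses.

```r
x <- c(1, 2, 2, 4)

r_first <- rank(x, ties.method = "first")  # ties broken by position
r_min   <- rank(x, ties.method = "min")    # ties share the lowest rank
r_avg   <- rank(x)                         # default: ties share the mean rank

print(r_first)
print(r_min)
print(r_avg)
```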
Use ave in base R with rank (the - is to reverse the rankings from low-to-high to high-to-low):
dF$salesRank <- with(dF, ave(-sales, category, FUN=rank) )
#[1] 3 2 1 2 1 2 1
dF$scoreRank <- with(dF, ave(-score, category, FUN=rank) )
#[1] 2 1 3 1 2 1 2
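I have a base R solution with tapply. It uses rank rather than order, because order returns sorting indices, which only happen to coincide with the ranks for this data. It also assumes the rows are already sorted by category, so that unlist returns the values in row order.
salesRank <- tapply(dat$sales, dat$category, function(x) rank(-x))
scoreRank <- tapply(dat$score, dat$category, function(x) rank(-x))
cbind(dat, salesRank = unlist(salesRank), scoreRank = unlist(scoreRank))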
ID category sales score salesRank scoreRank
A1 227 A 109 21 3 2
A2 131 A 410 24 2 1
A3 131 A 509 1 1 3
B1 123 B 2 61 2 1
B2 545 B 19 5 1 2
C1 234 C 439 328 2 1
C2 654 C 765 41 1 2