Demean R data.table: list of columns - r

I want to demean a whole data.table object (or just a list of many columns of it) by groups.
Here's my approach so far:
setkey(myDt, groupid)
for (col in colnames(wagesOfFired)){
myDt[, paste(col, 'demeaned', sep='.') := col - mean(col), with=FALSE]
}
which gives
Error in col - mean(col) : non-numeric argument to binary operator
Here's some sample data. In this simple case, there's only two columns, but I typically have so many columns such that I want to iterate over a list
y groupid x
1: 3.46000 51557094 97
2: 111.60000 51557133 25
3: 29.36000 51557133 23
4: 96.38000 51557133 9
5: 65.22000 51557193 32
6: 66.05891 51557328 10
7: 9.74000 51557328 180
8: 61.59000 51557328 18
9: 9.99000 51557328 18
10: 89.68000 51557420 447
11: 129.24436 51557429 15
12: 3.46000 51557638 3943
13: 117.36000 51557642 11
14: 9.51000 51557653 83
15: 68.16000 51557653 518
16: 96.38000 51557653 14
17: 9.53000 51557678 18
18: 7.96000 51557801 266
19: 51.88000 51557801 49
20: 10.70000 51558040 1034

The problem is that col is a string, so col-mean(col) cannot be computed.
myNames <- names(myDt)
myDt[,paste(myNames,"demeaned",sep="."):=
lapply(.SD,function(x)x-mean(x)),
by=groupid,.SDcols=myNames]
Comments:
You don't need to set a key.
It's in one operation because using [ repeatedly can be slow.
You can change myNames to some subset of the column names.

Related

Why does subsetting the dataframe not work in this case? [duplicate]

This question already has answers here:
Select rows from a data frame based on values in a vector
(3 answers)
Closed 1 year ago.
I want to subset a dataframe with .id values specified but it gives me this error:
Warning in .id == c(3, 5:12, 14, 20:64, 66:72, 75, 78:79, 81:111, 113:136, :
longer object length is not a multiple of shorter object length
when using this code:
newdatarev = subset(newdata, .id == c(3,5:12,14,20:64,66:72,75,78:79,81:111,113:136,138:149,151:160,
162:183,185:225,227:233,235:247,249,251:264,266:328,330:364,366:383,
385:411,413:471,473:490,492:580,582:598,600:603,605:606,608:619,621:646,
648:686,688:718,720:746,748,750:753,755:762,764:861,863:875,877:894,
897:911,913:914,916:926,928:941))
For reference, here is a small bit of newdata:
> newdata
.id V1 V2
1: 1 -2.870109 8273.632
2: 1 4.829891 8273.632
3: 1 21.329891 8279.132
4: 1 25.729891 8281.332
5: 1 32.329891 8285.732
---
17937: 941 1834.113417 1411.605
17938: 941 1818.713417 1392.905
17939: 941 1814.313417 1386.305
17940: 941 1814.313417 1364.305
17941: 941 1828.613417 1224.605
I have a feeling it has to do with how .id is structured and me using the code interferes with how it interprets the rows vs. .id values that overlap. It does get me a result of a very strange recollection of data here:
> newdatarev
.id V1 V2
1: 55 158.8030 2045.753
2: 100 227.7387 8250.454
3: 153 356.8675 1383.835
4: 205 483.6464 3946.844
5: 299 635.8744 8387.862
6: 347 722.9303 5147.715
7: 393 850.1742 2115.559
8: 439 857.9288 8243.071
9: 482 926.5706 1608.928
10: 532 1107.8380 2616.635
11: 632 1234.6482 4957.055
12: 633 1201.8700 3252.570
13: 683 1315.2215 2068.050
14: 684 1325.5905 6253.692
15: 734 1414.3443 2267.337
16: 784 1551.0153 5184.641
17: 831 1634.2056 7159.362
18: 880 1724.5570 5726.908
19: 933 1879.6398 3465.536
Thank you in advance!
The == operator tests one condition against one other condition. What you want is to test several conditions all at once. This can be done with the %in% infix operator:
newdatarev <- subset(newdata, .id %in% c(3,5:12,14,20:64,66:72,75,78:79,81:111,113:136,138:149,151:160,
162:183,185:225,227:233,235:247,249,251:264,266:328,330:364,366:383,
385:411,413:471,473:490,492:580,582:598,600:603,605:606,608:619,621:646,
648:686,688:718,720:746,748,750:753,755:762,764:861,863:875,877:894,
897:911,913:914,916:926,928:941))

How to effectively determine the maximum difference between the variable value in each row and same variable subsequent row values in data.table in R

What is the most efficient way to determine the maximum positive difference between the value (X) for each row and the subsequent values of the same variable (X) within group (Y) in data.table in R.
Example:
set.seed(1)
dt <- data.table(X = sample(100:200, 500455, replace = TRUE),
Y = unlist(sapply(10:1000, function(x) rep(x, x))))
Here's my solution which I consider ineffective and slow:
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
head(dt, 21)
X Y max_diff
1: 126 10 69
2: 137 10 58
3: 157 10 38
4: 191 10 4
5: 120 10 75
6: 190 10 5
7: 195 10 0
8: 166 10 0
9: 163 10 0
10: 106 10 0
11: 120 11 80
12: 117 11 83
13: 169 11 31
14: 138 11 62
15: 177 11 23
16: 150 11 50
17: 172 11 28
18: 200 11 0
19: 138 11 56
20: 178 11 16
21: 194 11 0
If you can advise the efficient (faster) solution?
Here's a dplyr solution that is about 20x faster and gets the same results. I presume the data.table equivalent would be yet faster. (EDIT: see bottom - it is!)
The speedup comes from reducing how many comparisons need to be performed. The largest difference will always be found against the largest remaining number in the group, so it's faster to identify that number first and do only the one subtraction per row.
First, the original solution takes about 4 sec on my machine:
tictoc::tic("OP data.table")
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
tictoc::toc()
# OP data.table: 4.594 sec elapsed
But in only 0.2 sec we can take that data.table, convert to a data frame, add the orig_row row number, group by Y, reverse sort by orig_row, take the difference between X and the cumulative max of X, ungroup, and rearrange in original order:
library(dplyr)
tictoc::tic("dplyr")
dt2 <- dt %>%
as_data_frame() %>%
mutate(orig_row = row_number()) %>%
group_by(Y) %>%
arrange(-orig_row) %>%
mutate(max_diff2 = cummax(X) - X) %>%
ungroup() %>%
arrange(orig_row)
tictoc::toc()
# dplyr: 0.166 sec elapsed
all.equal(dt2$max_diff, dt2$max_diff2)
#[1] TRUE
EDIT: as #david-arenburg suggests in the comments, this can be done lightning-fast in data.table with an elegant line:
dt[.N:1, max_diff2 := cummax(X) - X, by = Y]
On my computer, that's about 2-4x faster than the dplyr solution above.

Adding a second column as a function of first with data.table's (`:=`) [duplicate]

I want to create a new data.table or maybe just add some columns to a data.table. It is easy to specify multiple new columns but what happens if I want a third column to calculate a value based on one of the columns I am creating. I think plyr package can do something such as that. Can we perform such iterative (sequential) column creation in data.table?
I want to do as follows
dt <- data.table(shop = 1:10, income = 10:19*70)
dt[ , list(hope = income * 1.05, hopemore = income * 1.20, hopemorerealistic = hopemore - 100)]
or maybe
dt[ , `:=`(hope = income*1.05, hopemore = income*1.20, hopemorerealistic = hopemore-100)]
You can also use <- within the call to list eg
DT <- data.table(a=1:5)
DT[, c('b','d') := list(b1 <- a*2, b1*3)]
DT
a b d
1: 1 2 6
2: 2 4 12
3: 3 6 18
4: 4 8 24
5: 5 10 30
Or
DT[, `:=`(hope = hope <- a+1, z = hope-1)]
DT
a b d hope z
1: 1 2 6 2 1
2: 2 4 12 3 2
3: 3 6 18 4 3
4: 4 8 24 5 4
5: 5 10 30 6 5
It is possible by using curly braces and semicolons in j
There are multiple ways to go about it, here are two examples:
# If you simply want to output:
dt[ ,
{hope=income*1.05;
hopemore=income*1.20;
list(hope=hope, hopemore=hopemore, hopemorerealistic=hopemore-100)}
]
# if you want to save the values
dt[ , c("hope", "hopemore", "hopemorerealistic") :=
{hope=income*1.05;
hopemore=income*1.20;
list(hope, hopemore, hopemore-100)}
]
dt
# shop income hope hopemore hopemorerealistic
# 1: 1 700 735.0 840 740
# 2: 2 770 808.5 924 824
# 3: 3 840 882.0 1008 908
# 4: 4 910 955.5 1092 992
# 5: 5 980 1029.0 1176 1076
# 6: 6 1050 1102.5 1260 1160
# 7: 7 1120 1176.0 1344 1244
# 8: 8 1190 1249.5 1428 1328
# 9: 9 1260 1323.0 1512 1412
# 10: 10 1330 1396.5 1596 1496

Multiple different conditions and if statments within a loop

I want to assign different letters from A:U to a new column vector according to some conditions that depend on a different column that takes the numbers 1:99.
I came up with the following solution, but I want to write it more efficiently.
for (i in 1:99){
if (i %in% 1:3 == T ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"A"
}
.............
if (i %in% 45:60 == T ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"D"
}
.....................
if (i == 99 ){
id<-which(H07_NACE$NACE2.Code==i)
H07_NACE$NACE2.Sectors[id]<-"U"
}
}
In the previous code I skipped multiple other line which essentially do the same thing. Notice that conditions changing all the time within this loop that I created and are of two types. One is for example of the type i %in% 45:60 == T and the other of the type 'i == 99 '
My original code has multiple such ifs within this loop so any help on how I can write it more efficiently or compactly will be appreciated.
The user has requested to map the numbers given in H07_NACE$NACE2.Code to the letters "A" to "U" according to given rules he has hardcoded in a number of if clauses.
A more flexible approach (and less tedious to code) is to use a lookup table (or constraint vector as Joseph Wood called it in his answer).
With data.table, we can use either a rolling join or a non-equi update join to do the mapping.
Sample data to be mapped
set.seed(1)
H07_NACE <- data.frame(NACE2.Code = sample(99, 10, replace = TRUE))
Rolling join
For the rolling join, we specify the mapping rules by tiling the number range 1:99 contiguously and giving the start number of each tile.
library(data.table)
# set up lookup table
lookup <- data.table(Code = c(1, 4, 21, 45, 61:75, 98, 99),
Sector = LETTERS[1:21])
lookup
Code Sector
1: 1 A
2: 4 B
3: 21 C
4: 45 D
5: 61 E
6: 62 F
7: 63 G
8: 64 H
9: 65 I
10: 66 J
11: 67 K
12: 68 L
13: 69 M
14: 70 N
15: 71 O
16: 72 P
17: 73 Q
18: 74 R
19: 75 S
20: 98 T
21: 99 U
Code Sector
# map Code to Sector
lookup[setDT(H07_NACE), on = .(Code = NACE2.Code), roll = TRUE]
Code Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
If the H07_NACE is to be updated we can append a new column by
setDT(H07_NACE)[, NACE2.Sector := lookup[H07_NACE, on = .(Code = NACE2.Code),
roll = TRUE, Sector]][]
NACE2.Code NACE2.Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
Non-equi update join
For the non-equi update join, we specify the mapping rules by giving the lower and upper bounds. This can be derived from lookup by
lookup2 <- lookup[, .(Sector, lower = Code,
upper = shift(Code - 1L, type = "lead", fill = max(Code)))]
lookup2
Sector lower upper
1: A 1 3
2: B 4 20
3: C 21 44
4: D 45 60
5: E 61 61
6: F 62 62
7: G 63 63
8: H 64 64
9: I 65 65
10: J 66 66
11: K 67 67
12: L 68 68
13: M 69 69
14: N 70 70
15: O 71 71
16: P 72 72
17: Q 73 73
18: R 74 74
19: S 75 97
20: T 98 98
21: U 99 99
Sector lower upper
The new column is created by
setDT(H07_NACE)[lookup2, on = .(NACE2.Code >= lower, NACE2.Code <= upper),
NACE2.Sector := Sector][]
NACE2.Code NACE2.Sector
1: 27 C
2: 37 C
3: 57 D
4: 90 S
5: 20 B
6: 89 S
7: 94 S
8: 66 J
9: 63 G
10: 7 B
Here is a quick and dirty solution that should do the job (I'm sure there is more efficient/elegant way to do this). We can setup a constraint vector and use indexing from there to produce the desired results.
## Here is some random data that resembles the OP's
set.seed(3)
H07_NACE <- data.frame(NACE2.Code = sample(99, replace = TRUE))
## "T" is the 20th element... we need to gurantee
## that the number corresponding to "U"
## corresponds to max(NACE2.Code)
maxCode <- max(H07_NACE$NACE2.Code)
constraintVec <- sort(sample(maxCode - 1, 20))
constraintVec <- c(constraintVec, maxCode)
H07_NACE$NACE2.Sector <- LETTERS[vapply(H07_NACE$NACE2.Code, function(x) {
which(constraintVec >= x)[1]
}, 1L)]
## Add optional check column to ensure we are mapping the
## Code to the correct Sector
H07_NACE$NACE2.Check <- constraintVec[vapply(H07_NACE$NACE2.Code, function(x) {
which(constraintVec >= x)[1]
}, 1L)]
head(H07_NACE)
NACE2.Code NACE2.Sector NACE2.Check
1 17 E 18
2 80 R 85
3 39 K 54
4 33 J 37
5 60 N 66
6 60 N 66
Update courtesy of #Frank
As suspected, there is a much simpler solution assuming the above logic is correct. We use findInterval and set the arguments rightmost.closed and left.open to TRUE (we also have to add 1L to the resulting vector):
H07_NACE$NACE2.Sector2 <- LETTERS[findInterval(H07_NACE$NACE2.Code, constraintVec,
rightmost.closed = TRUE, , left.open = TRUE) + 1L]
head(H07_NACE)
NACE2.Code NACE2.Sector NACE2.Check NACE2.Sector2
1 17 E 18 E
2 80 R 85 R
3 39 K 54 K
4 33 J 37 J
5 60 N 66 N
6 60 N 66 N
identical(H07_NACE$NACE2.Sector, H07_NACE$NACE2.Sector2)
[1] TRUE
Here's two tidyverse examples, though I'm not completely certain what the original poster is really asking for.
library(tidyverse)
data.frame(NACE2.Code = sample(99, replace = TRUE)) %>%
mutate(Sectors = ifelse(NACE2.Code %in% 1:3, "A",
ifelse(NACE2.Code %in% 45:60, "D",
ifelse(NACE2.Code ==99, "U", NA))))
data.frame(NACE2.Code = sample(99, replace = TRUE)) %>%
mutate(Sectors = case_when(NACE2.Code %in% 1:3 ~ "A",
NACE2.Code %in% 45:60 ~ "D",
NACE2.Code ==99 ~ "U")) %>%
drop_na

Issue with structure of data.frame for sunburstR plot

Hello,
I created the dataframe below, based on the example in the sunburstR documentation.
Column Count
1: ACTIVE 68764
2: INACTIVE 73599
3: ACTIVE-RESIDENT 68279
4: ACTIVE-NONRESIDENT 485
5: INACTIVE-RESIDENT 63378
6: INACTIVE-NONRESIDENT 10221
7: ACTIVE-RESIDENT-LATIN 55
8: ACTIVE-RESIDENT-CYRLIC 68224
9: ACTIVE-NONRESIDENT-LATIN 465
10: ACTIVE-NONRESIDENT-CYRLIC 20
11: INACTIVE-RESIDENT-LATIN 114
12: INACTIVE-RESIDENT-CYRLIC 63264
13: INACTIVE-NONRESIDENT-LATIN 7915
14: INACTIVE-NONRESIDENT-CYRLIC 2306
The first column is character, the second is integer.
However when I try to plot it, I get nothing.
sunburst(sunburst_data)
Any hints whats wrong with the structure of my dataframe?
Include only the leaf nodes in your data frame...
df <- read.table(text = '
Column Count
ACTIVE-RESIDENT-LATIN 55
ACTIVE-RESIDENT-CYRLIC 68224
ACTIVE-NONRESIDENT-LATIN 465
ACTIVE-NONRESIDENT-CYRLIC 20
INACTIVE-RESIDENT-LATIN 114
INACTIVE-RESIDENT-CYRLIC 63264
INACTIVE-NONRESIDENT-LATIN 7915
INACTIVE-NONRESIDENT-CYRLIC 2306
')
library(sunburstR)
sunburst(df)

Resources