How to refer to a column of a data.table by its position, in a sum() statement - r

I've googled the issue as many ways as my brain is capable of and I still can't find the answer. I'm new to R so there are some things that confuse me a little bit.
Let's say I have a data table like this:
x y z 100 200 300
1: 1 1 a 1 1 1
2: 1 1 b 2 3 4
3: 1 2 c 3 5 7
4: 1 2 d 4 7 0
5: 2 1 e 5 9 3
6: 2 1 f 6 1 6
7: 2 2 g 7 3 9
8: 2 2 h 8 5 2
This can be created with this piece of code:
DT = setDT(structure(list(c(1, 1, 1, 1, 2, 2, 2, 2),
c(1, 1, 2, 2, 1, 1, 2, 2),
c("a","b","c","d","e","f","g","h"),
c(1,2,3,4,5,6,7,8),
c(1,3,5,7,9,1,3,5),
c(1,4,7,0,3,6,9,2)),
.Names = c("x", "y", "z", 100, 200, 300), row.names = c(NA, -8L), class = "data.frame"))
However, in my actual code, the last three columns were auto-generated using another function (dcast), so the total number of columns of the data.table is not static. Also, you may notice that the names of those three last columns are numeric, which might be a problem at some point.
What I need is to create one additional column for each "extra" column (the ones right after column "z"). I need the code to work like this example: first, it creates column "100s"; then, for each row, it calculates the sum of column "100", considering only the rows with the same combination of x,y as the row in question. And so on for "200s" and "300s". Like this:
x y z 100 200 300 100s 200s 300s
1: 1 1 a 1 1 1 3 4 5
2: 1 1 b 2 3 4 3 4 5
3: 1 2 c 3 5 7 7 12 7
4: 1 2 d 4 7 0 7 12 7
5: 2 1 e 5 9 3 11 10 9
6: 2 1 f 6 1 6 11 10 9
7: 2 2 g 7 3 9 15 8 11
8: 2 2 h 8 5 2 15 8 11
I've tried several variations of this idea:
for (i in 3:(dim(DT)[2])) {
  DT <- DT[, paste(colnames(DT)[i], "s", sep = "") := sum(i),
           by = c("x", "y")]
}
This gives me the following result:
x y z 100 200 300 100s 200s 300s
1: 1 1 a 1 1 1 4 5 6
2: 1 1 b 2 3 4 4 5 6
3: 1 2 c 3 5 7 4 5 6
4: 1 2 d 4 7 0 4 5 6
5: 2 1 e 5 9 3 4 5 6
6: 2 1 f 6 1 6 4 5 6
7: 2 2 g 7 3 9 4 5 6
8: 2 2 h 8 5 2 4 5 6
Of course, R is not recognizing the numeric value of i as the position of the column it should sum; instead it's taking it as a raw number. I can't figure out how to address a specific column by its position, because when it comes to sum(), the "with=FALSE" trick doesn't help.
Any help will be appreciated.

There is no need for using a for loop in this case to get the desired result. You can update DT by reference with:
DT[, paste0(colnames(DT)[3:5],'s') := lapply(.SD, sum), by = .(x,y)]
which will give you the desired result (shown here for the original example, before the z column was added):
> DT
x y 100 200 300 100s 200s 300s
1: 1 1 1 1 1 3 4 5
2: 1 1 2 3 4 3 4 5
3: 1 2 3 5 7 7 12 7
4: 1 2 4 7 0 7 12 7
5: 2 1 5 9 3 11 10 9
6: 2 1 6 1 6 11 10 9
7: 2 2 7 3 9 15 8 11
8: 2 2 8 5 2 15 8 11
When you don't know exactly which columns to sum, you could use one of the following methods:
# method 1:
DT[, paste0(colnames(DT)[3:ncol(DT)],'s') := lapply(.SD, sum), by = .(x,y)]
# method 2:
DT[, paste0(setdiff(colnames(DT), c('x','y')),'s') := lapply(.SD, sum), by = .(x,y)]
With the updated example, probably the best way to do it is:
cols <- setdiff(colnames(DT), c('x','y','z'))
DT[, paste0(cols,'s') := lapply(.SD, sum), by = .(x,y), .SDcols = cols]
which gives:
> DT
x y z 100 200 300 100s 200s 300s
1: 1 1 a 1 1 1 3 4 5
2: 1 1 b 2 3 4 3 4 5
3: 1 2 c 3 5 7 7 12 7
4: 1 2 d 4 7 0 7 12 7
5: 2 1 e 5 9 3 11 10 9
6: 2 1 f 6 1 6 11 10 9
7: 2 2 g 7 3 9 15 8 11
8: 2 2 h 8 5 2 15 8 11
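For completeness, the original loop can also be made to work by looking the column up by position and then referencing it by name with get() inside j. This is a sketch, starting the loop at column 4 so the non-numeric z column is skipped; note the sequence 4:ncol(DT) is evaluated once before the loop body runs, so the columns added inside the loop do not extend it:

```r
library(data.table)
# reconstruct the example table
DT <- data.table(x = c(1,1,1,1,2,2,2,2),
                 y = c(1,1,2,2,1,1,2,2),
                 z = c("a","b","c","d","e","f","g","h"),
                 "100" = c(1,2,3,4,5,6,7,8),
                 "200" = c(1,3,5,7,9,1,3,5),
                 "300" = c(1,4,7,0,3,6,9,2))

for (i in 4:ncol(DT)) {
  col <- colnames(DT)[i]        # address the column by position ...
  DT[, paste0(col, "s") := sum(get(col)), by = .(x, y)]  # ... then by name inside j
}
```

This gives the same result as the .SDcols approach, just one column at a time.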

Related

Creating two columns of cumulative sum based on the categories of one column

I'd like to create two columns with the cumulative frequency of "A" and "B" in the assignment column.
df = data.frame(id = 1:10, assignment= c("B","A","B","B","B","A","B","B","A","B"))
id assignment
1 1 B
2 2 A
3 3 B
4 4 B
5 5 B
6 6 A
7 7 B
8 8 B
9 9 A
10 10 B
The resulting table would have this format
id assignment A B
1 1 B 0 1
2 2 A 1 1
3 3 B 1 2
4 4 B 1 3
5 5 B 1 4
6 6 A 2 4
7 7 B 2 5
8 8 B 2 6
9 9 A 3 6
10 10 B 3 7
How do I generalize the code for more than 2 categories (say "A", "B", "C")?
Thanks
Use lapply over unique values in assignment to create new columns.
vals <- sort(unique(df$assignment))
df[vals] <- lapply(vals, function(x) cumsum(df$assignment == x))
df
# id assignment A B
#1 1 B 0 1
#2 2 A 1 1
#3 3 B 1 2
#4 4 B 1 3
#5 5 B 1 4
#6 6 A 2 4
#7 7 B 2 5
#8 8 B 2 6
#9 9 A 3 6
#10 10 B 3 7
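To show that this generalizes, the same two lines work unchanged with three categories (a small hypothetical data frame with an extra "C" level):

```r
df <- data.frame(id = 1:6, assignment = c("B","A","C","B","C","A"))
vals <- sort(unique(df$assignment))                        # "A" "B" "C"
df[vals] <- lapply(vals, function(x) cumsum(df$assignment == x))
df
#   id assignment A B C
# 1  1          B 0 1 0
# 2  2          A 1 1 0
# 3  3          C 1 1 1
# 4  4          B 1 2 1
# 5  5          C 1 2 2
# 6  6          A 2 2 2
```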
We can use model.matrix with colCumsums
library(matrixStats)
cbind(df, colCumsums(model.matrix(~ assignment - 1, df[-1])))
A base R option
transform(
df,
A = cumsum(assignment == "A"),
B = cumsum(assignment == "B")
)
gives
id assignment A B
1 1 B 0 1
2 2 A 1 1
3 3 B 1 2
4 4 B 1 3
5 5 B 1 4
6 6 A 2 4
7 7 B 2 5
8 8 B 2 6
9 9 A 3 6
10 10 B 3 7

How do I select rows in a data frame before and after a condition is met?

I've been searching the web for a few days now and I can't find a solution to my (probably easy to solve) problem.
I have huge data frames with 4 variables and over a million observations each. Now I want to select 100 rows before, all rows while, and 1000 rows after a specific condition is met, and fill the rest with NAs. I tried it with a for loop and if/ifelse but it doesn't work so far. I think it shouldn't be a big thing, but at the moment I just don't get the hang of it.
I create the data using:
foo<-data.frame(t = 1:15, a = sample(1:15), b = c(1,1,1,1,1,4,4,4,4,1,1,1,1,1,1), c = sample(1:15))
My Data looks like this:
ID t a b c
1 1 4 1 7
2 2 7 1 10
3 3 10 1 6
4 4 2 1 4
5 5 13 1 9
6 6 15 4 3
7 7 8 4 15
8 8 3 4 1
9 9 9 4 2
10 10 14 1 8
11 11 5 1 11
12 12 11 1 13
13 13 12 1 5
14 14 6 1 14
15 15 1 1 12
What I want is to pick the value of a (in this example) 2 rows before, all rows while and 3 rows after the value of b is >1 and fill the rest with NA's. [Because this is just an example I guess you can imagine that after these 15 rows there are more rows with the value for b changing from 1 to 4 several times (I did not post it, so I won't spam the question with unnecessary data).]
So I want to get something like:
ID t a b c d
1 1 4 1 7 NA
2 2 7 1 10 NA
3 3 10 1 6 NA
4 4 2 1 4 2
5 5 13 1 9 13
6 6 15 4 3 15
7 7 8 4 15 8
8 8 3 4 1 3
9 9 9 4 2 9
10 10 14 1 8 14
11 11 5 1 11 5
12 12 11 1 13 11
13 13 12 1 5 NA
14 14 6 1 14 NA
15 15 1 1 12 NA
I'm thankful for any help.
Thank you.
Best regards,
Chris
Here is the same approach as missuse's, but with data.table:
library(data.table)
foo<-data.frame(t = 1:11, a = sample(1:11), b = c(1,1,1,4,4,4,4,1,1,1,1), c = sample(1:11))
DT <- setDT(foo)
DT[ unique(c(DT[,.I[b>1] ],DT[,.I[b>1]+3 ],DT[,.I[b>1]-2 ])), d := a]
t a b c d
1: 1 10 1 2 NA
2: 2 6 1 10 6
3: 3 5 1 7 5
4: 4 11 4 4 11
5: 5 4 4 9 4
6: 6 8 4 5 8
7: 7 2 4 8 2
8: 8 3 1 3 3
9: 9 7 1 6 7
10: 10 9 1 1 9
11: 11 1 1 11 NA
Here
unique(c(DT[, .I[b > 1]], DT[, .I[b > 1] + 3], DT[, .I[b > 1] - 2]))
gives you the desired indices: the unique indices of the rows matching the condition, together with those same indices shifted by +3 and -2. Because the matching rows are contiguous, the union of the three sets covers the whole window (though note the shifts can fall outside 1..nrow(DT) when the condition holds near the first or last rows).
Here is an attempt.
# get the indices that satisfy the condition b > 1
z <- which(foo$b > 1)
# expand each index to the window (z - 2):(z + 3)
ind <- unique(unlist(lapply(z, function(x){
  g <- pmax(x - 2, 1)            # guard against non-positive indices
  g : pmin(x + 3, nrow(foo))     # guard against indices past the last row
})))
# create d column filled with NA
foo$d <- NA
# replace the elements at those indices with foo$a
foo$d[ind] <- foo$a[ind]
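The same idea can also be written without lapply, building all window indices in one vectorized step. A base R sketch: recycling (-2):3 against rep(z, each = 6) lines the six offsets up with each matching index, and pmin/pmax clip the result to valid rows.

```r
foo <- data.frame(t = 1:15, a = sample(1:15),
                  b = c(1,1,1,1,1,4,4,4,4,1,1,1,1,1,1), c = sample(1:15))
z   <- which(foo$b > 1)                                             # rows where the condition holds
ind <- unique(pmin(pmax(rep(z, each = 6) + (-2):3, 1), nrow(foo)))  # clipped windows
foo$d <- NA
foo$d[ind] <- foo$a[ind]
```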
library(dplyr)
library(purrr)
# example dataset
foo<-data.frame(t = 1:15,
a = sample(1:15),
b = c(1,1,1,1,1,4,4,4,4,1,1,1,1,1,1),
c = sample(1:15))
# function to get indices of interest
# for a given index x go 2 positions back and 3 forward
# keep only positive indices
GetIDsBeforeAfter = function(x) {
v = (x-2) : (x+3)
v[v > 0]
}
foo %>% # from your dataset
filter(b > 1) %>% # keep rows where b > 1
pull(t) %>% # get the positions
map(GetIDsBeforeAfter) %>% # for each position apply the function
unlist() %>% # unlist all sets indices
unique() -> ids_to_remain # keep unique ones and save them in a vector
foo$d = foo$c # copy column c as d
foo$d[-ids_to_remain] = NA # put NA to all positions not in our vector
foo
# t a b c d
# 1 1 5 1 8 NA
# 2 2 6 1 14 NA
# 3 3 4 1 10 NA
# 4 4 1 1 7 7
# 5 5 10 1 5 5
# 6 6 8 4 9 9
# 7 7 9 4 15 15
# 8 8 3 4 6 6
# 9 9 7 4 2 2
# 10 10 12 1 3 3
# 11 11 11 1 1 1
# 12 12 15 1 4 4
# 13 13 14 1 11 NA
# 14 14 13 1 13 NA
# 15 15 2 1 12 NA

Label quantile by group with varying group sizes

Within each group (the "name" variable), I want to cut the value into quartiles and create a quartile-label column for the variable "value". Since the group sizes vary, the quartile ranges should differ between groups as well. But the code below only cuts by the overall value, resulting in the same quartile ranges for all groups.
dt<-data.frame(name=c(rep('a',8),rep('b',4),rep('c',5)),value=c(1:8,1:4,1:5))
dt
dt.2 <- dt %>%
  group_by(name) %>%
  mutate(newcol = cut(value,
                      breaks = quantile(value, probs = seq(0, 1, 0.25), na.rm = TRUE),
                      include.lowest = TRUE))
dt.2
str(dt.2)
Data:
name value
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 a 6
7 a 7
8 a 8
9 b 1
10 b 2
11 b 3
12 b 4
13 c 1
14 c 2
15 c 3
16 c 4
17 c 5
output from above code.
Update: the problem is not that newcol is a factor, but that newcol has the same quartile ranges across the different groups. For example, for name b the values are 1-4, yet the bin (3,5] appears, which is derived from the overall min(value) to max(value) regardless of the group.
name value newcol
<fctr> <int> <fctr>
1 a 1 [1,2]
2 a 2 [1,2]
3 a 3 (2,3]
4 a 4 (3,5]
5 a 5 (3,5]
6 a 6 (5,8]
7 a 7 (5,8]
8 a 8 (5,8]
9 b 1 [1,2]
10 b 2 [1,2]
11 b 3 (2,3]
12 b 4 (3,5]
13 c 1 [1,2]
14 c 2 [1,2]
15 c 3 (2,3]
16 c 4 (3,5]
17 c 5 (3,5]
Desired output
name value newcol/quartile label
1 a 1 1
2 a 2 1
3 a 3 2
4 a 4 2
5 a 5 3
6 a 6 3
7 a 7 4
8 a 8 4
9 b 1 1
10 b 2 2
11 b 3 3
12 b 4 4
13 c 1 1
14 c 2 2
15 c 3 3
16 c 4 4
17 c 5 4
Here's a way you can do it, following the split-apply-combine framework.
dt<-data.frame(name=c(rep('a',8),rep('b',4),rep('c',5)),value=c(1:8,1:4,1:5))
split_dt <- lapply(split(dt, dt$name),
transform,
quantlabel = as.numeric(
cut(value, breaks = quantile(value, probs = seq(0,1,.25)), include.lowest = T)))
dt <- unsplit(split_dt, dt$name)
name value quantlabel
1 a 1 1
2 a 2 1
3 a 3 2
4 a 4 2
5 a 5 3
6 a 6 3
7 a 7 4
8 a 8 4
9 b 1 1
10 b 2 2
11 b 3 3
12 b 4 4
13 c 1 1
14 c 2 1
15 c 3 2
16 c 4 3
17 c 5 4
edit: there's a data.table way
following this post, we can use the data.table package, if performance is a concern:
library(data.table)
dt<-data.frame(name=c(rep('a',8),rep('b',4),rep('c',5)),value=c(1:8,1:4,1:5))
dt.t <- as.data.table(dt)
dt.t[,quantlabels := as.numeric(cut(value, breaks = quantile(value, probs = seq(0,1,.25)), include.lowest = T)), name]
name value quantlabels
1: a 1 1
2: a 2 1
3: a 3 2
4: a 4 2
5: a 5 3
6: a 6 3
7: a 7 4
8: a 8 4
9: b 1 1
10: b 2 2
11: b 3 3
12: b 4 4
13: c 1 1
14: c 2 1
15: c 3 2
16: c 4 3
17: c 5 4
edit: and there's a dplyr way
We can follow @akrun's advice and use as.numeric (which is what we've done for the other solutions):
dt %>%
group_by(name) %>%
mutate(quantlabel =
as.numeric(
cut(value,
breaks = quantile(value, probs = seq(0,1,.25)),
include.lowest = T)))
Note that if you instead wanted the labels themselves, use as.character:
dt %>%
group_by(name) %>%
mutate(quantlabel = as.character(cut(value, breaks = quantile(value, probs = seq(0,1,.25)), include.lowest = T)))
Source: local data frame [17 x 3]
Groups: name [3]
name value quantlabel
<fctr> <int> <chr>
1 a 1 [1,2.75]
2 a 2 [1,2.75]
3 a 3 (2.75,4.5]
4 a 4 (2.75,4.5]
5 a 5 (4.5,6.25]
6 a 6 (4.5,6.25]
7 a 7 (6.25,8]
8 a 8 (6.25,8]
9 b 1 [1,1.75]
10 b 2 (1.75,2.5]
11 b 3 (2.5,3.25]
12 b 4 (3.25,4]
13 c 1 [1,2]
14 c 2 [1,2]
15 c 3 (2,3]
16 c 4 (3,4]
17 c 5 (4,5]
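If only the integer labels 1-4 are needed, dplyr::ntile is a shorter route. Note that ntile bins by rank into (nearly) equal-sized groups rather than by quantile() breaks; for this data it reproduces the cut()-based labels above, but the two can differ when values are tied:

```r
library(dplyr)
dt <- data.frame(name = c(rep('a', 8), rep('b', 4), rep('c', 5)),
                 value = c(1:8, 1:4, 1:5))
dt %>%
  group_by(name) %>%
  mutate(quantlabel = ntile(value, 4))   # 1..4 within each name group
```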

R Subset matching contiguous blocks

I have a dataframe.
dat <- data.frame(k=c("A","A","B","B","B","A","A","A"),
a=c(4,2,4,7,5,8,3,2),b=c(2,5,3,5,8,4,5,8),
stringsAsFactors = F)
k a b
1 A 4 2
2 A 2 5
3 B 4 3
4 B 7 5
5 B 5 8
6 A 8 4
7 A 3 5
8 A 2 8
I would like to subset contiguous blocks based on variable k. This would be a standard approach.
#using rle rather than levels
kval <- rle(dat$k)$values
for(i in 1:length(kval))
{
subdf <- subset(dat,dat$k==kval[i])
print(subdf)
#do something with subdf
}
k a b
1 A 4 2
2 A 2 5
6 A 8 4
7 A 3 5
8 A 2 8
k a b
3 B 4 3
4 B 7 5
5 B 5 8
k a b
1 A 4 2
2 A 2 5
6 A 8 4
7 A 3 5
8 A 2 8
So the subsetting above obviously does not work the way I intended. Any elegant way to get these results?
k a b
1 A 4 2
2 A 2 5
k a b
1 B 4 3
2 B 7 5
3 B 5 8
k a b
1 A 8 4
2 A 3 5
3 A 2 8
We can use rleid from data.table to create a grouping variable
library(data.table)
setDT(dat)[, grp := rleid(k)]
dat
# k a b grp
#1: A 4 2 1
#2: A 2 5 1
#3: B 4 3 2
#4: B 7 5 2
#5: B 5 8 2
#6: A 8 4 3
#7: A 3 5 3
#8: A 2 8 3
We can group by 'grp' and do all the operations within the 'grp' using standard data.table methods.
Here is a base R option to create 'grp'
dat$grp <- with(dat, cumsum(c(TRUE, k[-1] != k[-length(k)])))
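Either way of building grp leads directly to the blocks the question asked for; for instance, a base R split() returns them as a list (a sketch using the grouping column from above):

```r
dat <- data.frame(k = c("A","A","B","B","B","A","A","A"),
                  a = c(4,2,4,7,5,8,3,2), b = c(2,5,3,5,8,4,5,8),
                  stringsAsFactors = FALSE)
dat$grp <- with(dat, cumsum(c(TRUE, k[-1] != k[-length(k)])))
blocks <- split(dat, dat$grp)   # list of the three contiguous blocks
blocks[[2]]                     # the middle block: the three B rows
```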

How to drop factors that have fewer than n members

Is there a way to drop factors that have fewer than N rows, like N = 5, from a data table?
Data:
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,3,6), v=1:9,
id=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove rows when an id occurs fewer than 5 times. The variable "id" is the grouping variable, and a group should be deleted when it has fewer than 5 rows. In DT, we need to determine which groups have fewer than 5 members (groups "1" and "4") and then remove those rows.
x y v id
1: a 3 5 2
2: b 6 6 2
3: b 1 7 2
4: b 3 8 2
5: b 6 9 2
6: b 1 1 3
7: c 3 2 3
8: c 6 3 3
9: c 1 4 3
10: c 3 5 3
11: c 6 6 3
Here's an approach....
Get the length of the factors, and the factors to keep
nFactors<-tapply(DT$id,DT$id,length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data table answer
DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
# x y v id
# 1: a 3 5 2
# 2: a 6 6 2
# 3: b 1 7 2
# 4: b 3 8 2
# 5: b 6 9 2
# 6: b 1 1 3
# 7: b 3 2 3
# 8: b 6 3 3
# 9: c 1 4 3
# 10: c 3 5 3
# 11: c 6 6 3
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
# x y v id
# 5 a 3 5 2
# 6 a 6 6 2
# 7 b 1 7 2
# 8 b 3 8 2
# 9 b 6 9 2
# 10 b 1 1 3
# 11 b 3 2 3
# 12 b 6 3 3
# 13 c 1 4 3
# 14 c 3 5 3
# 15 c 6 6 3
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
group_by(id) %>%
filter(n() >= 5)
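For completeness, a common pure data.table idiom expresses the same filter inside j; note that this returns a new table with id moved to the first column, rather than subsetting by row index as the .I answer does:

```r
library(data.table)
DT <- data.table(x = rep(c("a","b","c"), each = 6), y = c(1,3,6), v = 1:9,
                 id = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
DT[, if (.N >= 5L) .SD, by = id]   # keep a group only when it has at least 5 rows
```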
