Subset in i by variable name in data.table [duplicate] - r

This question already has an answer here:
Pass column name in data.table using variable [duplicate]
(1 answer)
Closed 6 years ago.
Suppose I have a data.table with column names that are specified in a variable. For example, I might have used dcast as:
groups <- sample(LETTERS, 2) # i.e. I don't know the values in advance
dt1 <- data.table(ID = rep(1:2, each = 2), group = groups, value = 3:6)
(dt2 <- dcast(dt1, ID~group, value.var = "value"))
# ID D Q
# 1: 1 3 4
# 2: 2 5 6
Now I want to subset based on values in the last two columns, e.g. do something like:
dt2[groups[1] == 3 & groups[2] == 4]
# Empty data.table (0 rows) of 3 cols: ID,D,Q
Is there an easy way?
I found I can do this with keys:
setkeyv(dt2, groups)
dt2[.(3, 4)]
# ID D Q
# 1: 1 3 4
But how do I do something more elaborate, as
dt2[groups[1] > 3 & groups[2] < 7]
?

You can use get to (from ?get) "search by name for an object":
dt2[get(groups[1]) > 2 & get(groups[2]) == 4]
# ID A J
#1: 1 3 4

We can use eval with as.name, which should be faster than get:
dt2[eval(as.name(groups[1])) > 2 & eval(as.name(groups[2])) == 4]
# ID L U
#1: 1 4 3
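
For the "more elaborate" filter asked about in the question, here is a reproducible sketch of the get approach. The group names are fixed to "D" and "Q" (an assumption replacing sample(), so that the result is predictable); everything else follows the question's setup.

```r
library(data.table)

# Assumption: fixed names instead of sample(LETTERS, 2), for reproducibility
groups <- c("D", "Q")
dt1 <- data.table(ID = rep(1:2, each = 2), group = groups, value = 3:6)
dt2 <- dcast(dt1, ID ~ group, value.var = "value")

# get() resolves each column from the name stored in `groups`
res <- dt2[get(groups[1]) > 3 & get(groups[2]) < 7]
res  # keeps only the row with ID 2 (D = 5, Q = 6)
```

Unlike the keyed lookup dt2[.(3, 4)], this form accepts arbitrary comparisons, not just equality.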

Related

How to Efficiently Populate R Dataframe with Lookup Values from Another Dataframe [duplicate]

This question already has answers here:
Expand ranges defined by "from" and "to" columns
(10 answers)
Closed 2 years ago.
I have a question regarding efficiently populating an R dataframe based on data retrieved from another dataframe.
So my input typically looks like:
dfInput <- data.frame(start = c(1,6,17,29), end = c(5,16,28,42), value = c(1,2,3,4))
start end value
1 5 1
6 16 2
17 28 3
29 42 4
I want to find the min and max values in cols 1 and 2 and create a new dataframe with a row for each value in that range:
rangeMin <- min(dfInput$start)
rangeMax <- max(dfInput$end)
dfOutput <- data.frame(index = c(rangeMin:rangeMax), value = 0)
And then populate it with the appropriate "values" from the input dataframe:
for (i in seq(nrow(dfOutput))) {
  lookup <- dfOutput[i, "index"]
  dfOutput[i, "value"] <- dfInput[which(dfInput$start <= lookup &
                                        dfInput$end >= lookup), "value"]
}
This for-loop achieves what I want to do, but it feels like this is a very convoluted way to do it.
Is there a way that I can do something like:
dfOutput$value <- dfInput[which(dfInput$start <= dfOutput$index &
dfInput$end >= dfOutput$index),"value"]
Or something else to populate the values when instantiating dfOutput.
I feel like this is pretty basic but I'm new to R, so many thanks for any help!
You can create a sequence between start and end for each row:
library(dplyr)
dfInput %>%
mutate(index = purrr::map2(start, end, seq)) %>%
tidyr::unnest(index) %>%
select(-start, -end)
# A tibble: 42 x 2
# value index
# <dbl> <int>
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# 6 2 6
# 7 2 7
# 8 2 8
# 9 2 9
#10 2 10
# … with 32 more rows
In base R:
do.call(rbind, Map(function(x, y, z)
data.frame(index = x:y, value = z), dfInput$start, dfInput$end, dfInput$value))
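
Another base R possibility, sketched here under the assumption that start and end are whole numbers and the ranges are inclusive, is to build the index with Map and seq and repeat each value once per element of its range with rep:

```r
dfInput <- data.frame(start = c(1, 6, 17, 29),
                      end   = c(5, 16, 28, 42),
                      value = c(1, 2, 3, 4))

# Number of index positions covered by each (start, end) pair
n <- dfInput$end - dfInput$start + 1

dfOutput <- data.frame(
  index = unlist(Map(seq, dfInput$start, dfInput$end)),  # 1..5, 6..16, ...
  value = rep(dfInput$value, times = n)                  # one value per index
)
```

This avoids building one small data frame per input row, which may matter when dfInput is large.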

Aggregate a categorical data.table column based on frequency of occurrence in one step in R [duplicate]

This question already has answers here:
Mode in R by groups
(5 answers)
Closed 3 years ago.
I have a data.table DT with millions of rows and quite a few columns.
I'd like to aggregate the data.table on various columns at the same time. One column, 'Var', is a categorical variable, and I want to aggregate it so that the entry that occurs most often is chosen.
> require(data.table)
> DT <- data.table(ID = c(1,1,1,1,2,2,2,3,3), Var = c('A', 'B', 'B', 'B', 'C', 'C', 'A', 'A', 'A'))
> DT
ID Var
1: 1 A
2: 1 B
3: 1 B
4: 1 B
5: 2 C
6: 2 C
7: 2 A
8: 3 A
9: 3 A
My desired output is:
> desired_output
ID agg_Var
1: 1 B # B occurred the most for ID = 1
2: 2 C # C occurred the most for ID = 2
3: 3 A # A occurred the most for ID = 3
I know I can do this in two steps: first by counting the number of occurrences for each ID and Var, then by choosing the row with the maximum frequency:
> ## I know this works but it involves more than one step:
> step1 <- DT[,.( freq = .N), by=.(ID, Var)]
> step1
ID Var freq
1: 1 A 1
2: 1 B 3
3: 2 C 2
4: 2 A 1
5: 3 A 2
> step2 <- step1[, .(Var_agg = Var[which.max(freq)]), by = .(ID)]
> step2
ID Var_agg
1: 1 B
2: 2 C
3: 3 A
I'm looking for a way to do this in one step if possible?
The reason is that I have quite a few other aggregations I need to do for this table, and they all take a single step. It would be great if I didn't have to do a separate aggregation for this column, so that I could just include it with the aggregation of the other columns. This is a code optimisation issue. I'm only interested in data.table operations, not additional packages.
Create a function that calculates the mode, then apply it by group:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
DT[, .(agg_Var = Mode(Var)), ID]
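
Because Mode is an ordinary function, it can sit next to other aggregations in the same j expression. The sketch below illustrates that; the Val column and the sum are hypothetical additions, not part of the question's data.

```r
library(data.table)

DT <- data.table(ID  = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
                 Var = c("A", "B", "B", "B", "C", "C", "A", "A", "A"),
                 Val = 1:9)  # hypothetical numeric column, for illustration

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]  # ties resolve to the first value seen
}

# Mode of Var and sum of Val in a single grouped call
res <- DT[, .(agg_Var = Mode(Var), total = sum(Val)), by = ID]
```

Note that which.max returns the first maximum, so when two values are tied the one that appears first within the group is chosen.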

Compare Lists in datatable

I have a data.table (data) that looks like the following:
rn peoplecount
1 0,2,0,1
2 1,1,0,0
3 0,1,0,5
4 5,3,0,2
5 2,2,0,1
6 1,2,0,3
7 0,1,0,0
8 0,2,0,8
9 8,2,0,0
10 0,1,0,0
My goal is to find all records where the 1st element of the current row does not match the 4th element of the previous row. In this example, the 7th row meets the criterion. How can I get a list of all such records?
My attempt so far.
data[, previous_peoplecount:= c(NA, peoplecount[shift(seq_along(peoplecount), fill = 0)])]
This gives a new table as follows:
rn peoplecount previous_peoplecount
1 0,2,0,1 NA
2 1,1,0,0 0,2,0,1
3 0,1,0,5 1,1,0,0
4 5,3,0,2 0,1,0,5
5 2,2,0,1 5,3,0,2
6 1,2,0,3 2,2,0,1
7 0,1,0,0 1,2,0,3
8 0,2,0,8 0,1,0,0
9 8,2,0,0 0,2,0,8
10 0,1,0,0 8,2,0,0
Now I have to fetch all records where the 1st element of peoplecount is not equal to the 4th element of previous_peoplecount. I am stuck at this part. Any suggestions?
Edit: peoplecount is a list of numerics.
You can try something along the lines of removing all but first value and all but last value, and comparing, i.e.
library(data.table)
setDT(dt)[, first_pos := sub(',.*', '', peoplecount)][,
last_pos_shifted := sub('.*,', '', shift(peoplecount))][
first_pos != last_pos_shifted,]
which gives,
rn peoplecount first_pos last_pos_shifted
1: 7 0,1,0,0 0 3
I would convert to long format and then select the elements of interest:
dt <- data.table(rn = 1:3, x = lapply(1:3, function(x) x:(x+3)))
dt$x[[2]] <- c(4, 1, 1, 1)
dt
# rn x
# 1: 1 1,2,3,4
# 2: 2 4,1,1,1
# 3: 3 3,4,5,6
# convert to long format
dt2 <- dt[, .(rn = rep(rn, each = 4), x = unlist(x))]
dt2[, id := rowid(rn)] # position within each list; recent data.table versions do not recycle 1:4 in :=
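# id == 1 picks the first element of each row; shift(x) is then the 4th element of the previous row
dtSelected <- dt2[id == 1 & x == shift(x)]
dtSelected
# rn x id
# 1: 2 4 1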
dt[dtSelected$rn]
# rn x
# 1: 2 4,1,1,1
I was not satisfied with the answers and came up with my own solution as follows:
h<-sapply(data$peoplecount,function(x){x[1]})
t<-sapply(data$peoplecount,function(x){x[4]})
indices<-which(head(t,-1)!=tail(h,-1))
Thanks to @Sotos and @minem for pushing me in the right direction.
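
For reference, a self-contained sketch of the same idea on a list column, using a small made-up table (the four rows are illustrative, not the question's data). It reports the row whose first element differs from the previous row's fourth element, rather than the index of the previous row:

```r
library(data.table)

# Illustrative data: peoplecount is a list column of numeric vectors
data <- data.table(rn = 1:4,
                   peoplecount = list(c(0, 2, 0, 1), c(1, 1, 0, 0),
                                      c(0, 1, 0, 5), c(4, 3, 0, 2)))

first_elem  <- vapply(data$peoplecount, function(v) v[1], numeric(1))
fourth_elem <- vapply(data$peoplecount, function(v) v[4], numeric(1))

# shift() lags the 4th elements by one row; which() drops the NA for row 1
idx <- which(first_elem != shift(fourth_elem))
data[idx]  # row 4: its first element (4) differs from row 3's fourth (5)
```

vapply is used instead of sapply so that a row with a missing element fails loudly instead of silently changing the result type.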

How to append values of columns? [duplicate]

This question already has answers here:
paste two data.table columns
(4 answers)
Closed 6 years ago.
For example there is the following data.table:
dt <- data.table(x = list(1:2, 3:5, 6:9), y = c(1,2,3))
# x y
# 1: 1,2 1
# 2: 3,4,5 2
# 3: 6,7,8,9 3
I need to create a new data.table, where values of the y column will be appended to lists stored in the x column:
# z
# 1: 1,2,1
# 2: 3,4,5,2
# 3: 6,7,8,9,3
I've tried lapply, cbind, list, c functions. But I can't get the table I need.
UPDATE:
The question is different from paste two data.table columns because a trivial solution with paste function or something like this doesn't work.
This will do it:
# Merge two lists
dt[, z := mapply(c, x, y, SIMPLIFY=FALSE)]
print(dt)
x y z
1: 1,2 1 1,2,1
2: 3,4,5 2 3,4,5,2
3: 6,7,8,9 3 6,7,8,9,3
And deleting the original x and y columns
dt[, c("x", "y") := NULL]
print(dt)
z
1: 1,2,1
2: 3,4,5,2
3: 6,7,8,9,3
I would like to suggest a general approach for this kind of task, in case you have multiple columns that you would like to combine into a single column.
Example data with multiple columns:
dt <- data.table(x = list(1:2, 3:5, 6:9), y = 1:3, z = list(4:6, NULL, 5:8))
Solution
res <- melt(dt, measure.vars = names(dt))[, .(.(unlist(value))), by = rowid(variable)]
res$V1
# [[1]]
# [1] 1 2 1 4 5 6
#
# [[2]]
# [1] 3 4 5 2
#
# [[3]]
# [1] 6 7 8 9 3 5 6 7 8
The idea here is to convert to long format and then unlist/list by group.
(You will receive a warning due to the different classes in the resulting value column.)
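
When the columns to combine are known in advance, a row-wise Map over them is another option. This is only a sketch on the same example data; it builds the combined list outside the table and then wraps it in a new data.table:

```r
library(data.table)

dt <- data.table(x = list(1:2, 3:5, 6:9), y = 1:3, z = list(4:6, NULL, 5:8))

# Map() walks the three columns in parallel and concatenates row by row;
# a NULL entry simply contributes nothing to that row
combined <- Map(c, dt$x, dt$y, dt$z)

res <- data.table(combined = combined)
```

Here combined[[2]] is 3 4 5 2, because the NULL in z adds no elements.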

Reshaping count-summarised data into long form in R [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 4 years ago.
Embarrassingly basic question, but if you don't know.. I need to reshape a data.frame of count summarised data into what it would've looked like before being summarised. This is essentially the reverse of {plyr} count() e.g.
> (d = data.frame(value=c(1,1,1,2,3,3), cat=c('A','A','A','A','B','B')))
value cat
1 1 A
2 1 A
3 1 A
4 2 A
5 3 B
6 3 B
> (summry = plyr::count(d))
value cat freq
1 1 A 3
2 2 A 1
3 3 B 2
If you start with summry what is the quickest way back to d? Unless I'm mistaken (very possible), {Reshape2} doesn't do this..
Just use rep:
summry[rep(rownames(summry), summry$freq), c("value", "cat")]
# value cat
# 1 1 A
# 1.1 1 A
# 1.2 1 A
# 2 2 A
# 3 3 B
# 3.1 3 B
A variation of this approach can be found in expandRows from my "SOfun" package. If you had that loaded, you would be able to simply do:
expandRows(summry, "freq")
There is a good table to dataframe function on the R cookbook website that you can modify slightly. The only modifications were changing 'Freq' -> 'freq' (to be consistent with plyr::count) and making sure the rownames were reset as increasing integers.
expand.dft <- function(x, na.strings = "NA", as.is = FALSE, dec = ".") {
  # Take each row in the source data frame table and replicate it
  # using the freq value
  DF <- sapply(1:nrow(x),
               function(i) x[rep(i, each = x$freq[i]), ],
               simplify = FALSE)
  # Take the above list and rbind it to create a single DF
  # Also subset the result to eliminate the freq column
  DF <- subset(do.call("rbind", DF), select = -freq)
  # Now apply type.convert to the character coerced factor columns
  # to facilitate data type selection for each column
  for (i in 1:ncol(DF)) {
    DF[[i]] <- type.convert(as.character(DF[[i]]),
                            na.strings = na.strings,
                            as.is = as.is, dec = dec)
  }
  row.names(DF) <- seq(nrow(DF))
  DF
}
expand.dft(summry)
value cat
1 1 A
2 1 A
3 1 A
4 2 A
5 3 B
6 3 B
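
If the goal is just the original rows with clean row names, a shorter sketch along the lines of the rep answer is to index by row position and then reset the row names. The summry data frame is rebuilt by hand here so the example does not depend on plyr:

```r
# Same content as plyr::count(d) in the question, constructed directly
summry <- data.frame(value = c(1, 2, 3),
                     cat   = c("A", "A", "B"),
                     freq  = c(3, 1, 2))

# Repeat each row index freq times, then drop the freq column
d <- summry[rep(seq_len(nrow(summry)), times = summry$freq), c("value", "cat")]
rownames(d) <- NULL  # replace 1, 1.1, 1.2, ... with 1..6
```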
