Access a single cell / subsetted column of a data.table - r

How can I access just a single cell in a data.table the same way I can for a data.frame?
mdf <- data.frame(a = c("A", "B", "C"), b = rnorm(3), c = 1:3)
mdf[ mdf$a == "B", "c" ]
[1] 2
Doing the analogous thing on a data.table returns a data.table that includes the key column(s):
mdt <- data.table( mdf, key = "a" )
mdt[ "B", c ]
a c
1: B 2
mdt[ "B", c ][ , c]
[1] 2
Did I miss a parameter, or does it have to be done as in the last line?

Either of these avoids repeating the c, but neither is as efficient, since both compute the first [] before extracting the final answer:
> mdt[ "B", ][["c"]]
[1] 2
> mdt[ "B", ][, c]
[1] 2

Recent versions of data.table make this easier:
mdt[ "B", c]
# [1] 2
The original answer returned a data.table, like this:
mdt['B', 'c']
# c
# 1: 2
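For comparison, here is a small sketch of the ways to pull that single cell out as a plain vector. It assumes a recent data.table release and reuses the mdt example from the question; the set.seed call is an addition, only there to make the rnorm column reproducible.

```r
library(data.table)

set.seed(1)  # assumption: added only so the rnorm() column is reproducible
mdf <- data.frame(a = c("A", "B", "C"), b = rnorm(3), c = 1:3)
mdt <- data.table(mdf, key = "a")

# Keyed lookup in i, unquoted column name in j: returns a plain vector
cell_keyed <- mdt["B", c]

# Same cell via a logical filter in i; this does not rely on the key
cell_filter <- mdt[a == "B", c]

# A quoted column name in j keeps the result as a one-column data.table
cell_dt <- mdt["B", "c"]

print(cell_keyed)
print(cell_dt)
```

The unquoted form is the one to reach for when a bare value is needed; the quoted form is handy when the result should stay a table.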

Related

Arrange by multiple conditions in R

Having the following dataset:
test <- data.frame(name= c("A", "B", "C", "D", "E"), v1 = c(2, 4, 1, 1, 2), v2 = c(3, 4, 2, 1, 5))
name v1 v2
A 2 3
B 4 4
C 1 2
D 1 1
E 2 5
I want to test a concept of actor/node dominance, which means that for each row/entry I want to see which other rows it is at least as high as on every variable. For example, B is higher than A, C and D for both v1 and v2, so it "dominates" those rows. E is likewise higher than A, C and D, so it dominates those 3 rows.
Mathematically speaking, row i dominates row j when v1_i >= v1_j and v2_i >= v2_j.
Arranging or sorting by columns doesn't work, because it sorts first by one column and then by another, so it doesn't really show how one row dominates another.
EDIT: Just to add an example, an end output would be:
B dominates A, C, D
E dominates A, D, C
C dominates D
A dominates C, D
It doesn't really matter what the output looks like. It could be a directed network/matrix, or a table with a variable listing all the dominated letters.
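I found a way; I hope this helps. Note that it ranks rows by the sum of v1 and v2, which is only a rough proxy for the pairwise dominance defined in the question.
df <- test
df$v3 <- rowSums(df[, -1])                    # total score per row
df <- df[order(df$v3, decreasing = TRUE), ]   # highest total first
k <- nrow(df)
for (i in seq_len(k - 1)) {
  a <- as.character(df$name[(i + 1):k])       # every row ranked below row i
  b <- as.character(df$name[i])
  cat(b, "greater than", a, "\n")
}
So your output will be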
B greater than E A C D
E greater than A C D
A greater than C D
C greater than D
A for loop is rarely the most idiomatic choice in R and is often slower than a vectorised alternative, so it is best avoided here.
You can simply do it with apply:
# Names column
names = c("A", "B", "C", "D", "E")
# Dataframe
test <- data.frame(name= names, v1 = c(2, 4, 1, 1, 2), v2 = c(3, 4, 2, 1, 5))
# Display function
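findLowerValues <- function(row, test, names) {
  # apply() passes each row as a character vector, so convert before comparing
  rep <- test$v1 <= as.numeric(row["v1"]) & test$v2 <= as.numeric(row["v2"]) & test$name != row["name"]
  cat(row["name"], 'dominates', names[rep], "\n")
}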
# Apply the display function
# axis : row
# Extra args: the full dataset and names
apply(test, 1, findLowerValues, test=test, names=names)
# A dominates C D
# B dominates A C D
# C dominates D
# D dominates
# E dominates A C D
# NULL
Here is a suggestion. It's probably not the most elegant solution.
We can have a function compare that checks if one letter dominates the other (unless it is the identical letter), and then use two nested sapplys.
my_letters <- c("A", "B", "C", "D", "E")
test <- data.frame(name= my_letters, v1 = c(2, 4, 1, 1, 2), v2 = c(3, 4, 2, 1, 5))
get_row <- function(letter) {
  test[test$name == letter, 2:ncol(test)]
}
compare <- function(letter, i) {
  if (letter != i) {
    if (!sum(get_row(letter) < get_row(i))) {
      return(i)
    }
  }
}
result <- sapply(my_letters, function(let) unlist(sapply(my_letters, compare, letter = let)))
results in a list:
$A
C D
"C" "D"
$B
A C D
"A" "C" "D"
$C
D
"D"
$D
NULL
$E
A C D
"A" "C" "D"
We first split the data frame into a list of rows and pass it to mapply. For each row, we repeat it nrow(test) times, compare it with the entire data frame test, and select the names whose values are all less than or equal to that row's. Since a row also matches itself, we use setdiff to remove its own name.
mapply(function(x, y) setdiff(
test$name[rowSums(x[rep(1, nrow(test)),] >= test[-1]) == ncol(test) - 1], y),
split(test[-1], test$name), test$name)
#$A
#[1] "C" "D"
#$B
#[1] "A" "C" "D"
#$C
#[1] "D"
#$D
#character(0)
#$E
#[1] "A" "C" "D"
data
test <- data.frame(name= c("A", "B", "C", "D", "E"), v1 = c(2, 4, 1, 1, 2),
v2 = c(3, 4, 2, 1, 5), stringsAsFactors = FALSE)
If you don't mind a data.table solution, a possibility is using non-equi joins as follows:
library(data.table)
setDT(test)
test[test, on=.(v1<=v1, v2<=v2), .(actor=i.name, node=x.name), by=.EACHI, allow.cartesian=TRUE][
actor!=node, .(actor, node)]
output:
actor node
1: A C
2: A D
3: B A
4: B C
5: B D
6: C D
7: E A
8: E C
9: E D
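Since the question mentions that a directed network/matrix would also be an acceptable output, here is a minimal base R sketch that builds a logical dominance matrix with outer. The object names dom and dominated are illustrative choices, not taken from any of the answers.

```r
test <- data.frame(name = c("A", "B", "C", "D", "E"),
                   v1 = c(2, 4, 1, 1, 2),
                   v2 = c(3, 4, 2, 1, 5),
                   stringsAsFactors = FALSE)

# dom[i, j] is TRUE when row i is >= row j on both v1 and v2
dom <- outer(test$v1, test$v1, ">=") & outer(test$v2, test$v2, ">=")
diag(dom) <- FALSE  # a row does not dominate itself
dimnames(dom) <- list(actor = test$name, node = test$name)

# Turn each matrix row into the vector of names it dominates
dominated <- setNames(
  lapply(seq_len(nrow(dom)), function(i) colnames(dom)[dom[i, ]]),
  rownames(dom)
)

print(dom)
print(dominated)
```

The matrix can be used directly as an adjacency matrix for a directed graph, while the list gives the same information in the "B dominates A, C, D" form.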

How to paste vector elements comma-separated and in quotation marks?

I want to select columns of data frame dfr by their names in a certain order, which I first obtain using the column numbers.
> (x <- names(dfr)[c(3, 4, 2, 1, 5)])
[1] "c" "d" "b" "a" "e"
The final code should include only the names version, because it's safer.
dfr[, c("c", "d", "b", "a", "e")]
I want to paste the elements separated with commas and quotation marks into a string, in order to include it into the final code. I've tried a few options, but they don't give me what I want:
> paste(x, collapse='", "')
[1] "c\", \"d\", \"b\", \"a\", \"e"
> paste(x, collapse="', '")
[1] "c', 'd', 'b', 'a', 'e"
I need something like "'c', 'd', 'b', 'a', 'e'"; of course "c", "d", "b", "a", "e" would be much nicer.
Data
dfr <- setNames(data.frame(matrix(1:15, 3, 5)), letters[1:5])
dput(x) is the correct answer, but in case you were wondering how to achieve this by modifying your existing code, you could do something like the following:
cat(paste0('c("', paste(x, collapse='", "'), '")'))
c("c", "d", "b", "a", "e")
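For reference, a short base R sketch of the dput route mentioned above, next to two ways of building the quoted string by hand. The variable names single and double are illustrative.

```r
x <- c("c", "d", "b", "a", "e")

# dput() prints the vector as R code that can be pasted into a script
dput(x)

# deparse() returns the same text as a string instead of printing it
as_code <- deparse(x)

# Single-quoted, comma-separated elements
single <- toString(shQuote(x))

# Double-quoted elements wrapped in c( ... ), built manually
double <- paste0("c(", toString(sprintf('"%s"', x)), ")")

cat(single, "\n")
cat(double, "\n")
```

Using cat rather than print shows the string without the escaped backslashes, which is what makes it safe to copy into code.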
This can also be done with packages (as Tung has shown); here is an example using glue:
library(glue)
glue('c("{v}")', v = glue_collapse(x, '", "'))
c("c", "d", "b", "a", "e")
Try the vector_paste() function from the datapasta package:
library(datapasta)
vector_paste(input_vector = letters[1:3])
#> c("a", "b", "c")
vector_paste_vertical(input_vector = letters[1:3])
#> c("a",
#> "b",
#> "c")
Or, using base R, this gives you what you want:
(x <- letters[1:3])
q <- "\""
( y <- paste0("c(", paste(paste0(q, x, q), collapse = ", ") , ")" ))
[1] "c(\"a\", \"b\", \"c\")"
Though I'm not really sure why you want this. Surely you can simply subset like this:
df <- data.frame(a=1:3, b = 1:3, c = 1:3)
df[ , x]
a b c
1 1 1 1
2 2 2 2
3 3 3 3
df[ , rev(x)]
c b a
1 1 1 1
2 2 2 2
3 3 3 3
Suppose you want to add a quotation mark in front of and at the end of a text and save it as an R object. You can use the capture.output function from the utils package.
Example: I want ABCDEFG to be saved as an R object as "ABCDEFG".
> cat("ABCDEFG")
ABCDEFG
> cat("\"ABCDEFG\"")
"ABCDEFG"
# To save the output of cat as an R object, including the quotation marks at the start and end of the word, use capture.output
> add_quote <- capture.output(cat("\"ABCDEFG\""))
> add_quote
[1] "\"ABCDEFG\""

Generate random numbers in an R dataframe which are constant across similar-rows

I have a dataframe containing X rows per 'user', where X is not constant between users. What I would like to do is to be able to generate random numbers to fill a new column, but for each 'user' the random number is the same across all of the rows that correspond to that user. For example, the data might look something like this:
user feature1 feature2
1 "A" "B"
1 "L" "L"
1 "Q" "B"
1 "D" "M"
1 "D" "M"
1 "P" "E"
2 "A" "B"
2 "R" "P"
2 "A" "F"
3 "X" "U"
... ... ...
and I would like to generate a new column that might look something like this:
user feature1 feature2 new_rand
1 "A" "B" 0.183
1 "L" "L" 0.183
1 "Q" "B" 0.183
1 "D" "M" 0.183
1 "D" "M" 0.183
1 "P" "E" 0.183
2 "A" "B" 0.971
2 "R" "P" 0.971
2 "A" "F" 0.971
3 "X" "U" 0.302
... ... ...
My first approach was basically to use s <- split(df, df$user), but the data frame contains a huge number of users, and I think this is probably an extremely inefficient way to do it.
Many thanks.
@akrun's method is a great one-off, but it doesn't leverage vectorization (rnorm is called separately for each level of user), so it's probably on the slow side. A more general way to do this is:
library(data.table)
setDT(df)
df[unique(df, by = "user")[ , new_rand := rnorm(.N)],
new_rand := i.new_rand, on = "user"]
What's going on here? unique returns a new data.table where all the duplicate observations (as defined by by, here user) are removed; we then add a column to this new object ([, := ]). Finally, this augmented data.table is joined back to the original table.
Note that here we only call rnorm once, returning a vector of exactly the right size. We then join this back to the original data set, "spreading" the value as needed across all observations of each user.
Or for assigning to a more specific group, say user and feature1 and feature2:
grps <- c("user", "feature1", "feature2")
df[unique(df, by = grps)[ , new_rand := rnorm(.N)],
new_rand := i.new_rand, on = grps]
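The same "draw once, then spread" idea can be sketched in base R with match, without any packages. The sample data below is adapted from the question, and users and draws are illustrative names.

```r
set.seed(42)  # assumption: added only for reproducibility

df <- data.frame(
  user     = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3),
  feature1 = c("A", "L", "Q", "D", "D", "P", "A", "R", "A", "X"),
  feature2 = c("B", "L", "B", "M", "M", "E", "B", "P", "F", "U"),
  stringsAsFactors = FALSE
)

users <- unique(df$user)        # one entry per distinct user
draws <- rnorm(length(users))   # a single vectorised call to rnorm()

# match() gives each row the position of its user in `users`,
# so indexing `draws` by it spreads one value across that user's rows
df$new_rand <- draws[match(df$user, users)]

print(df)
```

Like the join, this calls rnorm only once, so the cost of the random draw does not grow with the number of rows per user.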
We can try data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)); then, grouped by 'user', we draw a single random number (rnorm(1)) and assign it (:=) to create 'new_rand'.
library(data.table)
setDT(df1)[, new_rand := rnorm(1) , by = user]
Or we can use dplyr.
library(dplyr)
df1 %>%
group_by(user) %>%
mutate(new_rand = rnorm(1))
Or another option with left_join
distinct(df1, user) %>%
mutate(new_rand=rnorm(n())) %>%
left_join(df1, ., by='user')
and a base R solution:
df_ <- data.frame(user = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3), feature1 = c("A", "L", "Q", "D", "D", "P", "A", "R", "A", "X"), feature2 = c("B", "L", "B", "M", "M", "E", "B", "P", "F", "U"))
tmp <- by(df_, df_[, 'user'], FUN = function(x) data.frame(x, new_rand = rnorm(1)))
do.call(rbind, tmp)
# user feature1 feature2 new_rand
# 1.1 1 A B -0.6145338
# 1.2 1 L L -0.6145338
# 1.3 1 Q B -0.6145338
# 1.4 1 D M -0.6145338
# 1.5 1 D M -0.6145338
# 1.6 1 P E -0.6145338
# 2.7 2 A B -1.4292151
# 2.8 2 R P -1.4292151
# 2.9 2 A F -1.4292151
# 3 3 X U -0.3309754
or as per akrun's suggestion:
df_[, 'new_rand'] <- ave(seq_along(df_$user), df_$user, FUN = function(x) rnorm(1))

How to convert a factor to numeric in a predefined order in R

I have a factor column, with three values: "b", "c" and "free".
I did
df$new_col = as.numeric(df$factor_col)
But it will convert "b" to 1, "c" to 2 and "free" to 3.
But I want to convert "free" to 0, "b" to 2 and "c" to 5. How can I do it in R?
Thanks a lot
f <- factor(c("b", "c", "c", "free", "b", "free"))
You can try renaming the factor levels,
levels(f)[levels(f)=="b"] <- 2
levels(f)[levels(f)=="c"] <- 5
levels(f)[levels(f)=="free"] <- 0
> f
#[1] 2 5 5 0 2 0
#Levels: 2 5 0
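Note that after renaming the levels, f is still a factor, so calling as.numeric on it directly would return the internal codes again. The sketch below shows that pitfall and, as an alternative not taken from the answers here, a named lookup vector; g, lookup and the other names are illustrative.

```r
f <- factor(c("b", "c", "c", "free", "b", "free"))

# Relabel the levels on a copy: levels are "b", "c", "free" in that order
g <- f
levels(g) <- c(2, 5, 0)

codes  <- as.numeric(g)                 # internal codes, not the labels
values <- as.numeric(as.character(g))   # the labels converted to numbers

# Alternative: map each original label through a named lookup vector
lookup  <- c(free = 0, b = 2, c = 5)
new_col <- unname(lookup[as.character(f)])

print(codes)
print(values)
print(new_col)
```

The lookup vector keeps the mapping in one place and does not depend on the order of the factor levels.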
One option is to call 'factor' again, specifying the levels and labels arguments in the custom order, and then convert to numeric via 'character' or through the levels:
df$new_col <- as.numeric(as.character(factor(df$factor_col,
levels=c('b', 'c', 'free'), labels=c(2, 5, 0))))
Another option is recode from library(car). The output will be of class factor. If we need 'numeric', we can convert as in the earlier solution (as.numeric(as.character(..))).
library(car)
df$new_col <- with(df, recode(factor_col, "'b'=2; 'c'=5; 'free'=0"))
data
df <- data.frame(factor_col= c('b', 'c', 'b', 'free', 'c', 'free'))
You can use the following approach to create the new column:
# an example data frame
f <- data.frame(factor_col = c("b", "c", "free"))
# create new_col
f <- transform(f, new_col = (factor_col == "b") * 2 + (factor_col == "c") * 5)
The result (f):
factor_col new_col
1 b 2
2 c 5
3 free 0

Check frequency of data.table value in other data.table

library(data.table)
DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C"))
I want to add a column popular to DT2 with value TRUE whenever DT2$group is contained in DT1$group at least twice. So, in the example above, DT2 should be
group popular
1: A TRUE
2: B TRUE
3: C FALSE
What would be an efficient way to get to this?
Updated example: DT2 may actually contain more groups than DT1, so here's an updated example:
DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C", "D"))
And the desired output would be
group popular
1: A TRUE
2: B TRUE
3: C FALSE
4: D FALSE
I'd just do it this way:
## 1.9.4+
setkey(DT1, group)
DT1[J(DT2$group), list(popular = .N >= 2L), by = .EACHI]
# group popular
# 1: A TRUE
# 2: B TRUE
# 3: C FALSE
# 4: D FALSE ## on the updated example
data.table's join syntax is quite powerful, in that, while joining, you can also aggregate / select / update columns in j. Here we perform a join. For each row in DT2$group, on the corresponding matching rows in DT1, we compute the j-expression .N >= 2L; by specifying by = .EACHI (please check 1.9.4 NEWS), we compute the j-expression each time.
Since 1.9.4, .() is available as an alias for list() in i, j and by. So you could also do:
DT1[.(DT2$group), .(popular = .N >= 2L), by = .EACHI]
When you're joining by a single character column, you can drop the .() / J() syntax altogether (for convenience). So this can be also written as:
DT1[DT2$group, .(popular = .N >= 2L), by = .EACHI]
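The join above returns a new table. If the goal is to add the column to DT2 itself, as the question describes, one possible sketch is to count the groups in DT1 and then assign by reference with :=. The intermediate name popular_groups is illustrative.

```r
library(data.table)

DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C", "D"))

# Count rows per group in DT1, then keep the groups seen at least twice
popular_groups <- DT1[, .N, by = group][N >= 2L, group]

# Add the flag to DT2 by reference; groups missing from DT1 become FALSE
DT2[, popular := group %in% popular_groups]

print(DT2)
```

This needs no key on either table, and groups such as "D" that never appear in DT1 are handled by %in% without special treatment.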
This is how I would do it: first count the number of times each group appears in DT1, then simply join DT2 and DT1.
require(data.table)
DT1 <- data.table(num = 1:6, group = c("A", "B", "B", "B", "A", "C"))
DT2 <- data.table(group = c("A", "B", "C"))
#solution:
DT1[, num_counts := .N, by = group]  # number of rows in each group
setkey(DT1, group)
setkey(DT2, group)
DT2 = DT1[DT2,mult="last"][,list(group, popular = (num_counts >= 2))]
#> DT2
# group popular
#1: A TRUE
#2: B TRUE
#3: C FALSE

Resources