Subsetting at the row level, but value must be column name - r

Imagine a dataframe:
set.seed(1234)
data <- data.frame(id = sample(letters, 26, replace = FALSE),
                   a = sample(1:10, 26, replace = TRUE),
                   b = sample(1:10, 26, replace = TRUE),
                   c = sample(1:10, 26, replace = TRUE))
I'd like to retain, for each id, the column name in which the largest value lies.
The result I am looking for is a data frame with dimensions 26 x 2, with a column for id and a column for largest_value_var. The largest_value_var would contain either a, b, or c.
So far, I have been able to extract the variable name with which the max value is associated using this:
apply(data[,-1], 1, function(x) c(names(x))[which.max(x)])
But I can't seem to quite get the result I'd like into a dataframe...
Any help is appreciated.

You can do this fairly easily with max.col(). Setting ties.method = "first" (thanks, akrun), we get the first column in the case of a tie. Here's a data.table method:
library(data.table)
setDT(data)[, names(.SD)[max.col(.SD, "first")], by = id]
Update: It seems this method would be more efficient when implemented in base R, probably because of the as.matrix() conversion in max.col(). So here's one way to accomplish it in base.
cbind(data[1], largest = names(data)[-1][max.col(data[-1], "first")])
Thanks to Ananda Mahto for pointing out the efficiency difference.
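If you want to check the timing difference yourself, here is a rough benchmark sketch (assuming the microbenchmark package is installed; exact timings will vary by machine and data size):
library(microbenchmark)
library(data.table)
DF <- as.data.frame(data)   # the base-R version expects a data.frame
DT <- as.data.table(data)   # the data.table version expects a data.table
microbenchmark(
  data_table = DT[, names(.SD)[max.col(.SD, "first")], by = id],
  base_r     = cbind(DF[1], largest = names(DF)[-1][max.col(DF[-1], "first")]),
  times = 100
)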

I like @Richard's use of max.col(), but the first thing that came to my mind was to actually get the data into a "tidy" form first, after which doing the subsetting you want should be easy:
library(reshape2)
library(data.table)
melt(as.data.table(data), id.vars = "id")[, variable[which.max(value)], by = id]
# id V1
# 1: c b
# 2: p a
# 3: o c
# 4: x b
# 5: s a
## SNIP ###
# 21: g a
# 22: f b
# 23: t a
# 24: y a
# 25: w b
# 26: v a
# id V1

In order to put the result from your apply() call into a data frame, you could do
df <- data.frame(id = data$id,
                 largest_value_var = apply(data[, -1], 1, function(x) names(x)[which.max(x)]))
Note that c(names(x)) is the same as names(x), so I omitted c().
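As a quick sanity check (a sketch; under older R versions data.frame() will have turned largest_value_var into a factor, hence the as.character()), you can confirm this picks the same columns as the max.col() approach:
base_res <- names(data)[-1][max.col(data[-1], "first")]
all(as.character(df$largest_value_var) == base_res)
# should be TRUE, since which.max() and max.col(..., "first") both take the first maximum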

Related

How to extract first n rows per group and calculate function using that subset?

My question is very similar to this one:
How to extract the first n rows per group?
dt
date age name val
1: 2000-01-01 3 Andrew 93.73546
2: 2000-01-01 4 Ben 101.83643
3: 2000-01-01 5 Charlie 91.64371
4: 2000-01-02 6 Adam 115.95281
5: 2000-01-02 7 Bob 103.29508
6: 2000-01-02 8 Campbell 91.79532
We have a dt and I've added an extra column named val. First, we want to extract the first n rows within each group.
The solutions from the link provided are:
dt[, .SD[1:2], by=date] # where 1:2 is the index needed
dt[dt[, .I[1:2], by = date]$V1] # for speed
My question is how do I apply a function to the first n rows within each group if that function depends on the subsetted information. I am trying to apply something like this:
# uses other columns for results/ is dependent on subsetted rows
# but keep it simple for replication
do_something <- function(dt) {
  res <- ifelse(cumsum(dt$val) > 200, 1, 0)
  return(res)
}
# first 2 rows of dt by group=date
x <- dt[, .SD[1:2], by=date]
# apply do_something to first 2 rows of dt by group=date
x[, list('age'=age,'name'=name,'val'=val, 'funcVal'= do_something(.SD[1:2])),by=date]
date age name val funcVal
1: 2000-01-01 3 Andrew 93.73546 0
2: 2000-01-01 4 Ben 101.83643 1
3: 2000-01-02 6 Adam 115.95281 0
4: 2000-01-02 7 Bob 103.29508 1
Am I going about this wrong? Is there a more efficient way to do this? I cannot seem to figure out how to apply the "for speed" solution here. Is there a way to do this without saving the subsetted results first, i.e. applying the function to the first 2 rows by date right away?
Any help is appreciated and below is the code to produce the data above:
date <- c("2000-01-01","2000-01-01","2000-01-01",
"2000-01-02","2000-01-02","2000-01-02")
age <- c(3,4,5,6,7,8)
name <- c("Andrew","Ben","Charlie","Adam","Bob","Campbell")
val <- val <- rnorm(6,100,10)
dt <- data.table(date, age, name,val)
In case there's more than one grouping column, it might be more efficient to collapse to one:
m = dt[, .(g = .GRP, r = .I[1:2]), by = date]
dt[m$r, v := ff(.SD), by=m$g, .SDcols="val"]
This is just an extension of @eddi's approach (keeping the row numbers .I, as seen in @akrun's answer) to also keep the group counter .GRP.
Re the OP's comment that they're more concerned about the function: well, borrowing from @akrun, there's ...
ff = function(x) as.integer(cumsum(x[[1]]) > 200)
Assuming all values are nonnegative, you could probably handle this in C more efficiently, since the cumulative sum can stop as soon as the threshold is reached. For the special case of two rows, that will hardly matter, though.
My impression is that this is a dummy function so there's no point going there. Many efficiency improvements that I usually think of are contingent on the function and data.
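To see what that leaves behind, a quick inspection sketch: rows outside m$r keep an NA in v, because := only updated the rows selected in i.
dt[, .(date, name, val, v)]      # rows not in m$r have v = NA
dt[m$r, .(date, name, val, v)]   # only the first two rows per date carry a value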
We can use as.integer on the cumsum to coerce the logical to binary. Extract the row index, specify it as i, and, grouped by 'date', apply the function to the 'val' column:
f1 <- function(x) as.integer(cumsum(x) > 200)
i1 <- dt[, .I[1:2], by = date]$V1
dt[i1, newcol := f1(val), date]

One to Many Join in data.table

I am using data.table to do a one-to-many merge. Instead of matching with all the rows, the output is showing only the last matched row for each unique value of the key.
a <- data.table(x = 1:2L, y = letters[1:4])
b <- data.table(x = c(1L,3L))
setkey(a,x)
setkey(b,x)
I want to do a many to one (b to a) join based on column x.
c <- a[b,on=.(x)]
c
# x y
# 1: 1 a
# 2: 1 c
# 3: 3 NA
However, this approach creates a new data.table called c. Instead of making a new data.table, I use the following code to add the column y to b:
b[a,y:=i.y]
Now b looks like,
b
# x y
# 1: 1 c
# 2: 3 NA
The desired output is the one in the first method (c). Is there a way of using := and output all the rows instead of the last matched row alone?
PS: The reason I want to use method 2 using := is because my data is huge and I do not want to make copies. The example I showed reflects what happens in my data.
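For what it's worth, here is a small sketch of why method 2 behaves that way: := updates b by reference, so it can never add rows to b, and when several rows of a match the same row of b, the assignment runs once per match and the value from the last matching row wins (which is why y ends up as "c" for x == 1 above). To get all matched rows you have to materialize the join, as in method 1.
a[x == 1]        # two rows of a match: y = "a", then y = "c"
b[a, y := i.y]   # the x == 1 row of b is assigned "a", then overwritten with "c"
b                # still 2 rows; := cannot grow b to hold both matches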

How to turn several columns into a column of type list in r?

I am looking to turn a dataframe (or datatable) such as
dt <- data.table(a = c(1,2,4), b = c(NA,3,5), d = c(NA,8,NA))
into something with one column, such as
dt <- data.table(combined = list(list(1,NA,NA), list(2,3,8), list(4,5,NA)))
None of the following work:
dt[,combined := as.list(a,b,d)]
dt[,combined := do.call(list,list(a,b,d))]
dt[,combined := cbind(a,b,d)]
dt[,combined := lapply(list(a,b,d),list)]
Note that this is different from the question here, data.frame rows to a list, which returns a different shaped object (I think it's just a plain list, with each row as an item in the list, rather than a vector of lists)
You can use purrr::transpose(), which transposes a list of vectors to a list of lists:
dt[, combined := purrr::transpose(.(a,b,d))]
dt
# a b d combined
#1: 1 NA NA <list>
#2: 2 3 8 <list>
#3: 4 5 NA <list>
combined = list(list(1,NA_real_,NA_real_),list(2,3,8),list(4,5,NA_real_))
identical(dt$combined, combined)
# [1] TRUE
If you don't want to use an extra package, you can use data.table::transpose with a little extra effort:
dt[, combined := lapply(transpose(.(a,b,d)), as.list)]
identical(dt$combined, combined)
# [1] TRUE
To make @David's comment more explicit and to generalize the data.table approach to a standard-evaluation (SE) version, which lets you pass column names in as a character vector and avoids hard-coding them, you can do the following (to learn more about SE vs. NSE, see vignette("nse")):
dt[, combined := lapply(transpose(.SD), as.list), .SDcols = c("a","b","d")]
This makes all sublists named, but the values correspond to the combined list:
identical(lapply(dt$combined, setNames, NULL), combined)
# [1] TRUE
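Whichever variant you use, here is a quick sketch of how you would pull values back out of such a list column (shown for the unnamed sublists produced by the purrr::transpose() version above):
dt$combined[[2]]        # list(2, 3, 8), the values of row 2
dt$combined[[2]][[3]]   # 8, i.e. column d of row 2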
If you don't want to use any functions:
dt[, combined := .(.(.SD)), by = 1:nrow(dt)]
# because you want to transform each row into a list, you can group the data.table
# by row id, turn each row into a list, and store the references in a new list,
# which becomes a column in the resulting data.table
dt$combined
#[[1]]
# a b d
#1: 1 NA NA
#[[2]]
# a b d
#1: 2 3 8
#[[3]]
# a b d
#1: 4 5 NA
Or: dt[, combined := .(.(.(a,b,d))), by = 1:nrow(dt)], which gives you something closer to the exact desired output.

Avoiding NA in rolling sums of last n observations within by groups using data.table

From this thread I learned that rolling sums for variable b in the following data.table can be achieved as follows:
data creation + computing rolling sums:
x <- data.table(a = sample(letters[1:3], 100, replace = TRUE), b = runif(100))
setorder(x, a)
# alternative 1
x[, .(b, Reduce(`+`, shift(b, 0:2))), by = a]
# alternative 2
x[, .(b, stats::filter(b, rep(1, 3), sides = 1)), by = a]
Current + desired output:
a b V2 V2_desired
1: a 0.457665568 NA 0.457665568
2: a 0.752555834 NA 1.210221
3: a 0.864672124 2.0748935 2.0748935
4: a 0.542168656 2.1593966 2.1593966
5: a 0.197962875 1.6048037 1.6048037
Now NAs are generated for the first two observations in each by group.
I need to adjust one of the alternatives so that, at the start of a group, it sums only the observations available so far (just the current observation at position 1, and the current plus previous observation at position 2).
This should generalize to windows of the last n values, with the start-of-group exceptions handled.
Any idea?
I'm not 100% sure I'm getting what you need, but the shift function leaves behind NA values by default. You can change that behaviour by passing a fill argument. In your case, since you're summing the data, you might want to try it with fill=0:
set.seed( 123 )
x[, .(b, Reduce(`+`, shift(b, 0:2, fill=0))), by = a]
head returns:
a b V2
1: a 0.5999890 0.599989
2: a 0.8903502 1.490339
3: a 0.7205963 2.210935
4: a 0.5492847 2.160231
5: a 0.9540912 2.223972
6: a 0.5854834 2.088859
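To generalize this to a window of the last n observations, the same fill trick works with a longer shift sequence; a sketch, where n is a window size you would set yourself:
n <- 3   # window size
x[, roll_sum := Reduce(`+`, shift(b, 0:(n - 1), fill = 0)), by = a]
On a recent data.table, frollsum() with an adaptive window is another option, but the shift()/Reduce() form above is the direct generalization of this answer.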

Number of Unique Obs by Variable in a Data Table

I have read in a large data file into R using the following command
data <- as.data.set(spss.system.file(paste(path, file, sep = '/')))
The data set contains columns which should not belong, and contain only blanks. This issue has to do with R creating new variables based on the variable labels attached to the SPSS file (Source).
Unfortunately, I have not been able to determine the options necessary to resolve the problem. I have tried all of foreign::read.spss, memisc::spss.system.file, and Hmisc::spss.get, with no luck.
Instead, I would like to read in the entire data set (with ghost columns) and remove unnecessary variables manually. Since the ghost columns contain only blank spaces, I would like to remove any variables from my data.table where the number of unique observations is equal to one.
My data are large, so they are stored in data.table format. I would like to determine an easy way to check the number of unique observations in each column, and drop columns which contain only one unique observation.
require(data.table)
### Create a data.table
dt <- data.table(a = 1:10,
                 b = letters[1:10],
                 c = rep(1, times = 10))
### Create a comparable data.frame
df <- data.frame(dt)
### Expected result
unique(dt$a)
### Expected result
length(unique(dt$a))
However, I wish to calculate the number of obs for a large data file, so referencing each column by name is not desired. I am not a fan of eval(parse()).
### I want to determine the number of unique obs in
# each variable, for a large list of vars
lapply(names(df), function(x) {
  length(unique(df[, x]))
})
### Unexpected result
length(unique(dt[, 'a', with = F])) # Returns 1
It seems to me the problem is that
dt[, 'a', with = F]
returns an object of class "data.table". It makes sense that the length of this object is 1, since it is a data.table containing 1 variable. We know that data.frames are really just lists of variables, and so in this case the length of the list is just 1.
Here's pseudo code for how I would remedy the solution, using the data.frame way:
for (x in names(data)) {
  unique.obs <- length(unique(data[, x]))
  if (unique.obs == 1) {
    data[, x] <- NULL
  }
}
Any insight as to how I may more efficiently ask for the number of unique observations by column in a data.table would be much appreciated. Alternatively, a recommendation on how to drop columns that contain only one unique observation from a data.table would be even better.
Update: uniqueN
As of version 1.9.6, there is a built-in (optimized) version of this solution: the uniqueN function. Now this is as simple as:
dt[ , lapply(.SD, uniqueN)]
If you want to find the number of unique values in each column, something like
dt[, lapply(.SD, function(x) length(unique(x)))]
## a b c
## 1: 10 10 1
To get your function to work you need to use with=FALSE within [.data.table, or simply use [[ instead (read fortune(312) as well...)
lapply(names(df), function(x) nrow(unique(dt[, x, with = FALSE])))
or
lapply(names(df), function(x) length(unique(dt[[x]])))
will work
In one step
dt[, names(dt) := lapply(.SD, function(x) if (length(unique(x)) == 1) { return(NULL) } else { return(x) })]
# or to avoid calling `.SD`
dt[, Filter(names(dt), f = function(x) length(unique(dt[[x]])) == 1) := NULL]
The approaches in the other answers are good. Another way to add to the mix, just for fun:
for (i in names(DT)) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
or if there may be duplicate column names:
for (i in ncol(DT):1) if (length(unique(DT[[i]]))==1) DT[,(i):=NULL]
NB: (i) on the LHS of := is a trick to use the value of i rather than a column named "i".
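A tiny illustration of that point (a sketch with throwaway names):
DT2 <- data.table(x = 1:3, y = rep(1, 3))
i <- "y"
DT2[, (i) := NULL]   # removes the column named "y" (the value of i), not a column named "i"
names(DT2)
# [1] "x"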
Here is a solution to your core problem (I hope I got it right).
require(data.table)
### Create a data.table
dt <- data.table(a  = 1:10,
                 b  = letters[1:10],
                 d1 = "",
                 c  = rep(1, times = 10),
                 d2 = "")
dt
a b d1 c d2
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
First, I introduce two columns d1 and d2 that have no values whatsoever. Those you want to delete, right? If so, I just identify those columns and select all other columns in the dt.
only_space <- function(x) {
  length(unique(x)) == 1 && x[1] == ""
}
bolCols <- apply(dt, 2, only_space)
dt[, (1:ncol(dt))[!bolCols], with=FALSE]
Somehow, I have the feeling that you could further simplify it...
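One possible simplification (a sketch): since a data.table is a list of columns, sapply() can apply only_space() column by column directly, avoiding apply(), which first coerces the whole table to a (character) matrix:
dt[, which(!sapply(dt, only_space)), with = FALSE]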
Output:
a b c
1: 1 a 1
2: 2 b 1
3: 3 c 1
4: 4 d 1
5: 5 e 1
6: 6 f 1
7: 7 g 1
8: 8 h 1
9: 9 i 1
10: 10 j 1
There is an easy way to do that using the "dplyr" library and its select() function, as follows:
library(dplyr)
newdata <- select(old_data, first_variable, second_variable)
Note that you can choose as many variables as you like.
Then you will get the kind of data that you want.
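Since the question is really about dropping the constant (all-blank) columns rather than naming the keepers by hand, here is a dplyr sketch of that (assuming dplyr >= 1.0.0, where tidyselect's where() is available):
library(dplyr)
newdata <- select(old_data, where(function(x) length(unique(x)) > 1))   # keep columns with more than one unique value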
