Create indicator - r

I would like to create a numeric indicator for a matrix such that for each unique element in one variable, it creates a sequence of the length based on the element in another variable. For example:
frame<- data.frame(x = c("a", "a", "a", "b", "b"), y = c(3,3,3,2,2))
frame
x y
1 a 3
2 a 3
3 a 3
4 b 2
5 b 2
The indicator, z, should look like this:
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Any and all help greatly appreciated. Thanks.

No ave?
frame$z <- with(frame, ave(y,x,FUN=seq_along) )
frame
# x y z
#1 a 3 1
#2 a 3 2
#3 a 3 3
#4 b 2 1
#5 b 2 2
A data.table version could be something like below (thanks to @mnel):
library(data.table)
frame <- as.data.table(frame)
frame[, z := seq_len(.N), by = x]
My original thought was to use:
frame[,z := .SD[,.I], by=x]
where .SD refers to each subset of the data.table split by x, and .I returns the row numbers of the data.table it is evaluated in. So .SD[,.I] returns the row numbers within each group. However, as @mnel points out, this is inefficient compared to the seq_len(.N) method, because the entire .SD needs to be loaded into memory for each group to run this calculation.

Another approach:
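# as.character() works whether x is a factor or a character column;
# this assumes the rows of each group are contiguous
frame$z <- unlist(lapply(rle(as.character(frame[, "x"]))$lengths, seq_len))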
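The run-length idea can also be written with base R's sequence(), which expands a vector of lengths into consecutive 1..n counters. The following is only a minimal sketch, assuming the rows of each group are contiguous; run_lengths is a name made up for illustration:

```r
# Example data from the question
frame <- data.frame(x = c("a", "a", "a", "b", "b"), y = c(3, 3, 3, 2, 2))

# Length of each consecutive run of identical x values
run_lengths <- rle(as.character(frame$x))$lengths

# sequence() turns c(3, 2) into 1 2 3 1 2
frame$z <- sequence(run_lengths)
frame
```

If the same value of x can reappear later in the data, each run restarts at 1, so a grouping approach such as ave is probably the safer choice in that case.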

library(dplyr)
# %.% was the original chaining operator in early dplyr; current versions use %>%
frame %>%
group_by(x) %>%
mutate(z = seq_along(y))

You can split the data.frame on x, and generate a new id column based on that:
> frame$z <- unlist(lapply(split(frame, frame$x), function(x) 1:nrow(x)))
> frame
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Or even more simply using data.table:
library(data.table)
frame <- data.table(frame)[,z:=1:nrow(.SD),by=x]

Try this, where x is the column by which grouping is to be done and y is any numeric column. If there are no numeric columns, use seq_along(x), say, in place of y:
transform(frame, z = ave(y, x, FUN = seq_along))
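As a rough illustration of the seq_along(x) variant, here is a small sketch for a data frame with no numeric column. The data is invented for the example, and the groups are deliberately interleaved to show that ave does not need them to be contiguous:

```r
# Hypothetical data: only a character grouping column, groups not contiguous
frame <- data.frame(x = c("a", "b", "a", "b", "a"), stringsAsFactors = FALSE)

# seq_along(frame$x) supplies an integer vector for ave() to work on,
# so the result is numeric even though x itself is character
frame$z <- ave(seq_along(frame$x), frame$x, FUN = seq_along)
frame
```

Each row gets its running position within its own group, so the "a" rows are numbered 1, 2, 3 and the "b" rows 1, 2.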

Related

Create a column with the column name of the max value of the row in R

I got this data:
df = data.frame(x = c(1,2,3), y = c(5,1,4))
> x y
> 1 1 5
> 2 2 1
> 3 3 4
But I want a new column with the column name of the max value in each row, like this:
> x y max.col
> 1 1 5 y
> 2 2 1 x
> 3 3 4 y
I've tried a lot of code, but without success. Extra points if I can use the solution with %>%.
Edit1: I have a lot of NAs and I want to skip them.
Edit2: I have 30 different columns in the real df.
We can use max.col to return the index of the max value in each row and use that to subset the column names. If there are NAs, replace them with a large negative value (this assumes every real value is greater than -999).
If a row is all NA, we can identify it with rowSums on the logical matrix and set its result to NA afterwards:
i1 <- !rowSums(!is.na(df))
df$max.col <- names(df)[max.col(replace(df, is.na(df), -999), 'first')]
df$max.col[i1] <- NA
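To make the NA handling concrete, here is a self-contained sketch of the same idea on invented data with three columns. It swaps the -999 placeholder for -Inf so that no assumption about the range of the real values is needed; the column w and the helper names m and all_na are made up for the example:

```r
# Hypothetical data with NAs, a tie, and one all-NA row
df <- data.frame(x = c(1, 2, NA, NA),
                 y = c(5, 1, 4, NA),
                 w = c(2, 7, 4, NA))

m <- as.matrix(df)

# Rows where every value is missing
all_na <- rowSums(!is.na(m)) == 0

# Make missing values lose every comparison
m[is.na(m)] <- -Inf

# Index of the first maximum per row, mapped to the column name
df$max.col <- colnames(m)[max.col(m, ties.method = "first")]

# Rows with no data at all get NA instead of a column name
df$max.col[all_na] <- NA
df
```

With ties.method = "first", a tie such as row 3 (y and w both 4) resolves to the leftmost of the tied columns.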
Here is a solution for the two-column case shown in the question:
library(dplyr)
df2 <- df %>%
mutate(max.col = ifelse(x > y, "x", "y"))
# x y max.col
# 1 1 5 y
# 2 2 1 x
# 3 3 4 y

calculated columns in new datatable without altering the original

I have a dataset which looks like this:
set.seed(43)
dt <- data.table(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10),
e = sample(c("x","y"),10,replace = T),
f=sample(c("t","s"),10,replace = T)
)
I need (for example) a count of negative values in columns 1:4 for each value of e and f. The result would have to look like this:
e neg_a_count neg_b_count neg_c_count neg_d_count
1: x 6 3 5 3
2: y 2 1 3 NA
1: s 4 2 3 1
2: t 4 2 5 2
Here's my code:
for (k in 5:6) { #these are the *by* columns
for (i in 1:4) {#these are the columns whose negative values i'm counting
n=paste("neg",names(dt[,i,with=F]),"count","by",names(dt[,k,with=F]),sep="_")
dt[dt[[i]]<0, (n):=.N, by=names(dt[,k,with=F])]
}
}
dcast(unique(melt(dt[,5:14], id=1, measure=3:6))[!is.na(value),],e~variable)
dcast(unique(melt(dt[,5:14], id=2, measure=7:10))[!is.na(value),],f~variable)
which obviously produces two tables, not one:
e neg_a_count_by_e neg_b_count_by_e neg_c_count_by_e neg_d_count_by_e
1: x 6 3 5 3
2: y 2 1 3 NA
f neg_a_count_by_f neg_b_count_by_f neg_c_count_by_f neg_d_count_by_f
1: s 4 2 3 1
2: t 4 2 5 2
which then need to be combined with rbind to produce one table.
This approach modifies dt by adding eight additional columns (4 data columns x 2 by columns), and the counts related to the levels of e and f get recycled (as expected). I was wondering if there is a cleaner way to achieve the result, one which does not modify dt. Also, casting after melting seems inefficient; there should be a better way, especially since my dataset has several e- and f-like columns.
If there are only two grouping columns, we could do an rbindlist after grouping by them separately
rbindlist(list(dt[,lapply(.SD, function(x) sum(x < 0)) , .(e), .SDcols = a:d],
dt[,lapply(.SD, function(x) sum(x < 0)) , .(f), .SDcols = a:d]))
# e a b c d
#1: y 2 1 3 0
#2: x 6 3 5 3
#3: s 4 2 3 1
#4: t 4 2 5 2
Or make it more dynamic by looping through the grouping column names
rbindlist(lapply(c('e', 'f'), function(x) dt[, lapply(.SD,
function(.x) sum(.x < 0)), by = x, .SDcols = a:d]))
You can melt before aggregating as follows:
cols <- c("a","b","c", "d")
melt(dt, id.vars=cols)[,
lapply(.SD, function(x) sum(x < 0)), by=value, .SDcols=cols]
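For comparison, the same "loop over the grouping columns, then stack the results" pattern can be sketched in base R with aggregate. This is only an illustrative analogue on a plain data.frame, not the data.table code above; the names df, value_cols, group_cols and count_negatives are assumptions made for the example:

```r
set.seed(43)
df <- data.frame(
  a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10),
  e = sample(c("x", "y"), 10, replace = TRUE),
  f = sample(c("t", "s"), 10, replace = TRUE)
)

value_cols <- c("a", "b", "c", "d")  # columns whose negatives are counted
group_cols <- c("e", "f")            # columns to group by, one at a time

# Count negative values per level of one grouping column
count_negatives <- function(g) {
  counts <- aggregate(df[value_cols] < 0,
                      by = list(level = df[[g]]),
                      FUN = sum)
  # Record which grouping column each row of the result came from
  cbind(grouping = g, counts)
}

# One block of rows per grouping column, stacked into a single table
result <- do.call(rbind, lapply(group_cols, count_negatives))
result
```

Because the counts are computed on the logical matrix df[value_cols] < 0, the original df is left untouched, and adding more grouping columns only means extending group_cols.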

group by count when count is zero in r

I use the aggregate function to get counts by group. It only returns a count for groups where the count is > 0. This is what I have:
dt <- data.frame(
n = c(1,2,3,4,5,6),
id = c('A','A','A','B','B','B'),
group = c("x","x","y","x","x","x"))
applying the aggregate function
my.count <- aggregate(n ~ id+group, dt, length)
now see the results
my.count[order(my.count$id),]
I get the following
id group n
1 A x 2
3 A y 1
2 B x 3
I need the following (the last row has the zero count that I need)
id group n
1 A x 2
3 A y 1
2 B x 3
4 B y 0
Thanks for your help in advance.
We can create another column 'ind' and then use dcast to reshape from 'long' to 'wide', specifying the fun.aggregate as length and drop=FALSE.
library(reshape2)
dcast(transform(dt, ind='n'), id+group~ind,
value.var='n', length, drop=FALSE)
# id group n
#1 A x 2
#2 A y 1
#3 B x 3
#4 B y 0
Or a base R option is
as.data.frame(table(dt[-1]))
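A slightly more explicit version of the table approach, as a sketch: selecting the two columns by name and using the responseName argument of as.data.frame so the count column is called n rather than Freq. The name counts is chosen only for this example:

```r
dt <- data.frame(
  n = c(1, 2, 3, 4, 5, 6),
  id = c("A", "A", "A", "B", "B", "B"),
  group = c("x", "x", "y", "x", "x", "x"))

# table() cross-tabulates every id/group combination, including empty ones
counts <- as.data.frame(table(dt[c("id", "group")]), responseName = "n")
counts
```

The combination B/y appears with a count of 0 because table fills in every cell of the cross-tabulation.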
You can merge your "my.count" object with the complete set of "id" and "group" columns:
merge(my.count, expand.grid(lapply(dt[c("id", "group")], unique)), all = TRUE)
## id group n
## 1 A x 2
## 2 A y 1
## 3 B x 3
## 4 B y NA
There are several questions on SO that show you how to replace NA with 0 if that is required.
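For completeness, one possible way to do the merge and the NA replacement together is sketched below. The names all_combos and merged are made up for the example:

```r
dt <- data.frame(
  n = c(1, 2, 3, 4, 5, 6),
  id = c("A", "A", "A", "B", "B", "B"),
  group = c("x", "x", "y", "x", "x", "x"))

my.count <- aggregate(n ~ id + group, dt, length)

# Every combination of the observed id and group values
all_combos <- expand.grid(lapply(dt[c("id", "group")], unique),
                          stringsAsFactors = FALSE)

# Combinations missing from my.count come back with n = NA ...
merged <- merge(my.count, all_combos, all = TRUE)

# ... which is then replaced by a zero count
merged$n[is.na(merged$n)] <- 0
merged
```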
aggregate with drop=FALSE worked for me.
my.count <- aggregate(n ~ id+group, dt, length, drop=FALSE)
my.count[is.na(my.count)] <- 0
my.count
# id group n
# 1 A x 2
# 2 B x 3
# 3 A y 1
# 4 B y 0
If you are interested in frequencies only, you can create a frequency table with your formula and turn it into a data frame:
as.data.frame(xtabs(formula = ~ id + group, dt))
Obviously this won't work for other aggregate functions. I'm still waiting for dplyr's summarise function to let the user decide whether zero-groups are kept or not. Maybe you can vote for this improvement here: https://github.com/hadley/dplyr/issues/341

R:How to get name of element in lapply function?

Suppose I have a list of data.frames:
list <- list(A=data.frame(x=c(1,2),y=c(3,4)), B=data.frame(x=c(1,2),y=c(7,8)))
I want to combine them into one data.frame like this:
data.frame(x=c(1,2,1,2), y=c(3,4,7,8), group=c("A","A","B","B"))
x y group
1 1 3 A
2 2 4 A
3 1 7 B
4 2 8 B
I can do in this way:
add_group_name <- function(df, group) {
df$group <- group
df
}
Reduce(rbind, mapply(add_group_name, list, names(list), SIMPLIFY=FALSE))
But I want to know if it's possible to get the name inside the lapply loop without the use of names(list), just like this:
add_group_name <- function(df) {
df$group <- ? #How to get the name of df in the list here?
}
Reduce(rbind, lapply(list, add_group_name))
I renamed list to listy to remove the clash with the base function. This is in essence a variation on Señor O's answer:
do.call(rbind, Map("[<-", listy, TRUE, "group", names(listy) ) )
# x y group
#A.1 1 3 A
#A.2 2 4 A
#B.1 1 7 B
#B.2 2 8 B
This is also very similar to a previous question and answer here: r function/loop to add column and value to multiple dataframes
The inner Map part gives this result:
Map("[<-", listy, TRUE, "group", names(listy) )
#$A
# x y group
#1 1 3 A
#2 2 4 A
#
#$B
# x y group
#1 1 7 B
#2 2 8 B
...which in long form, for explanation's sake, could be written like:
Map(function(data, nms) {data[TRUE,"group"] <- nms; data;}, listy, names(listy) )
As @flodel suggests, you could also use R's built-in transform function for updating data frames, which may be simpler again:
do.call(rbind, Map(transform, listy, group = names(listy)) )
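The reason all of these variants pass names(listy) explicitly is that lapply hands only the element's value to the function, not its name. A small sketch of one workaround is to iterate over the positions instead and look the name up inside the function; the helper name add_group_name is borrowed from the question and combined is made up here:

```r
listy <- list(A = data.frame(x = c(1, 2), y = c(3, 4)),
              B = data.frame(x = c(1, 2), y = c(7, 8)))

# Iterate over positions so both the element and its name are reachable
add_group_name <- function(i) {
  df <- listy[[i]]
  df$group <- names(listy)[i]
  df
}

combined <- do.call(rbind, lapply(seq_along(listy), add_group_name))
combined
```

This still uses names(listy), just inside the function rather than as a second argument, so it is a matter of taste rather than a fundamentally different approach.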
I think a much easier approach is:
> do.call(rbind, lapply(names(list), function(x) data.frame(list[[x]], group = x)))
x y group
1 1 3 A
2 2 4 A
3 1 7 B
4 2 8 B
Using plyr (here ll is the list of data frames):
library(plyr)
ldply(ll)
.id x y
1 A 1 3
2 A 2 4
3 B 1 7
4 B 2 8
Or in 2 steps :
xx <- do.call(rbind,ll)
xx$group <- sub('([A-Z]).*','\\1',rownames(xx))
xx
x y group
A.1 1 3 A
A.2 2 4 A
B.1 1 7 B
B.2 2 8 B
