"Weighted" counts at each combination of factor levels - r

I have the following dataframe:
> df=data.frame(from = c("x","y","x","z"), to=c("w","x","w","y"),weight=c(1,1,3,4))
> df
from to weight
1 x w 1
2 y x 1
3 x w 3
4 z y 4
To count how many times each element of column from appears in the dataframe, I can use:
> table(df$from)
x y z
2 1 1
This is an unweighted count. How can I also take the column weight into account, so that each row contributes its weight rather than 1? In my example, the correct answer should be:
x y z
4 1 4

You can use tapply to sum weight within each unique value of from:
tapply(df$weight, df$from, sum)
#x y z
#4 1 4

We can use count from dplyr
library(dplyr)
df %>%
count(from, wt = weight)
# from n
#1 x 4
#2 y 1
#3 z 4
In base R, we can use xtabs
xtabs(weight~ from, df)
#from
#x y z
#4 1 4
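As a quick cross-check, the base R approaches above can be run side by side. This is a minimal sketch on the question's data; the names by_tapply, by_xtabs, by_aggregate and by_pair are only illustrative, and the two-factor call is an assumption about what "each combination of factor levels" in the title is asking for.

```r
df <- data.frame(from = c("x", "y", "x", "z"),
                 to = c("w", "x", "w", "y"),
                 weight = c(1, 1, 3, 4))

# Sum of weight within each level of `from`, three equivalent ways
by_tapply <- tapply(df$weight, df$from, sum)                    # named 1-d array
by_xtabs <- xtabs(weight ~ from, data = df)                     # 1-d table
by_aggregate <- aggregate(weight ~ from, data = df, FUN = sum)  # data frame

# Weighted counts for each (from, to) pair; pairs that never occur get 0
by_pair <- xtabs(weight ~ from + to, data = df)
```

xtabs is probably the most convenient of the three when more than one factor is involved, since adding a term to the right-hand side of the formula is enough.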


Create a column with the column name of the max value of the row in R

I got this data:
df = data.frame(x = c(1,2,3), y = c(5,1,4))
> x y
> 1 1 5
> 2 2 1
> 3 3 4
But I want a new column holding the name of the column that has the max value in each row, like this:
> x y max.col
> 1 1 5 y
> 2 2 1 x
> 3 3 4 y
I've tried a lot of code, but without success. Extra points if the solution can be used with %>%.
Edit1: I have a lot of NAs and I want to skip them.
Edit2: I have 30 different columns in the real df.
We can use max.col to return the column index of the max value in each row and use that to subset the column names. If there are NAs, replace them with a value lower than anything in the data (here -999) so they can never win.
If a row is all NA, we can identify it with rowSums on the logical matrix and set its result back to NA:
i1 <- !rowSums(!is.na(df))
df$max.col <- names(df)[max.col(replace(df, is.na(df), -999), 'first')]
df$max.col[i1] <- NA
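The same idea can be sketched on a small frame that actually contains NAs. The data below is made up for illustration (it is not the asker's real 30-column data), and instead of a hard-coded -999 the sketch uses a sentinel computed from the data, which is an assumption about what "a negative value" should be in general.

```r
# Hypothetical data with NAs, a tie in row 3 and an all-NA row 4
df <- data.frame(x = c(1, NA, 3, NA),
                 y = c(5, 1, NA, NA),
                 w = c(2, 7, 3, NA))

m <- as.matrix(df)
all_na <- rowSums(!is.na(m)) == 0       # rows with no usable value

# Replace NA by something strictly below every observed value
sentinel <- min(m, na.rm = TRUE) - 1
m[is.na(m)] <- sentinel

# Index of the row-wise maximum; ties go to the first column
df$max.col <- colnames(m)[max.col(m, ties.method = "first")]
df$max.col[all_na] <- NA                # an all-NA row has no maximum
```

Because everything works on the whole matrix, the number of columns does not matter, so the same lines should carry over to a wider frame as long as all columns are numeric.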
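If there are only the two columns x and y, ifelse inside mutate is enough (this does not extend directly to the 30 columns of the real df):
library(dplyr)
df2 <- df %>%
  mutate(max.col = ifelse(x > y, "x", "y"))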
# x y max.col
# 1 1 5 y
# 2 2 1 x
# 3 3 4 y

Identify and replace minimum value from a numeric column present in all dataframes in a list of dataframes

I need a way to identify the minimum value of a particular column that is present in every dataframe in a list of dataframes, and replace it with some non-numeric character. For example:
df1 <- data.frame(x=c("a","b","c"), y=c(2,4,6))
df2 <- data.frame(x=c("a","b","c"), y=c(10,20,30))
myList <- list(df1, df2)
[[1]]
x y
1 a 2
2 b 4
3 c 6
[[2]]
x y
1 a 10
2 b 20
3 c 30
should become
[[1]]
x y
1 a *
2 b 4
3 c 6
[[2]]
x y
1 a *
2 b 20
3 c 30
What's the best way? It would be great to see both a base R solution and one using an external package (purrr).
Thanks!
Here is a base R option
lapply(myList, function(df) transform(df, y = replace(y, which.min(y), "*")))
#[[1]]
# x y
#1 a *
#2 b 4
#3 c 6
#
#[[2]]
# x y
#1 a *
#2 b 20
#3 c 30
Or the same in the tidyverse
library(tidyverse)
map(myList, ~.x %>% mutate(y = replace(y, which.min(y), "*")))
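A plain for loop also works. Comparing against min(y) replaces every tied minimum, not just the first one:
for (i in seq_along(myList)) {
  currMin <- min(myList[[i]]$y)
  myList[[i]]$y[myList[[i]]$y == currMin] <- '*'
}
Please note that assigning '*' converts the column to character.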
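The answers above differ in one detail that the question's data does not exercise: which.min picks only the first minimum, while comparing against min(y) marks all of them. A minimal sketch, using a made-up frame with a tied minimum (the names tied, first_only and all_ties are hypothetical):

```r
# Hypothetical frame where the minimum of y occurs twice
tied <- data.frame(x = c("a", "b", "c"), y = c(2, 4, 2))

# which.min() returns the position of the first minimum only
first_only <- transform(tied, y = replace(y, which.min(y), "*"))

# A logical comparison against min() selects every tied minimum
all_ties <- transform(tied, y = replace(y, y == min(y), "*"))

# In both cases y is no longer numeric after the replacement
y_is_numeric <- c(is.numeric(first_only$y), is.numeric(all_ties$y))
```

Either variant can be dropped into lapply(myList, ...) in place of the function body; which one is right depends on whether ties should all be starred.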

What is the most effective way to sort dataframe and add special id? [duplicate]

I would like to create a numeric indicator for a data frame such that, for each unique element in one variable, it creates a sequence whose length is based on the element in another variable. For example:
frame<- data.frame(x = c("a", "a", "a", "b", "b"), y = c(3,3,3,2,2))
frame
x y
1 a 3
2 a 3
3 a 3
4 b 2
5 b 2
The indicator, z, should look like this:
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Any and all help greatly appreciated. Thanks.
No ave?
frame$z <- with(frame, ave(y,x,FUN=seq_along) )
frame
# x y z
#1 a 3 1
#2 a 3 2
#3 a 3 3
#4 b 2 1
#5 b 2 2
A data.table version could be something like below (thanks to @mnel):
library(data.table)
frame <- as.data.table(frame)
frame[, z := seq_len(.N), by = x]
My original thought was to use:
frame[,z := .SD[,.I], by=x]
where .SD refers to each subset of the data.table split by x. .I returns the row numbers for an entire data.table, so .SD[,.I] returns the row numbers within each group. However, as @mnel points out, this is inefficient compared to the other method, because the entire .SD has to be loaded into memory for each group to run this calculation.
Another approach:
frame$z <- unlist(lapply(rle(as.character(frame[, "x"]))$lengths, seq_len))
With dplyr:
library(dplyr)
frame %>%
  group_by(x) %>%
  mutate(z = seq_along(y))
You can split the data.frame on x, and generate a new id column based on that:
> frame$z <- unlist(lapply(split(frame, frame$x), function(x) 1:nrow(x)))
> frame
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Or even more simply using data.table:
library(data.table)
frame <- data.table(frame)[,z:=1:nrow(.SD),by=x]
Try this, where x is the column by which grouping is to be done and y is any numeric column. If there are no numeric columns, use seq_along(x), say, in place of y:
transform(frame, z = ave(y, x, FUN = seq_along))
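One practical difference between the answers shows up only when the rows are not already sorted by x. The sketch below uses a shuffled variant of the example (made-up data, hypothetical names) to compare ave, which writes results back in the original row order, with the split-and-unlist approach, which returns them group by group.

```r
# Same groups as in the question, but with interleaved rows
frame <- data.frame(x = c("a", "b", "a", "b", "a"),
                    y = c(3, 2, 3, 2, 3))

# ave() applies FUN within each group and keeps the original row order
frame$z <- ave(seq_along(frame$x), frame$x, FUN = seq_along)

# split() + unlist() concatenates group by group (all "a", then all "b"),
# so the result lines up with the rows only if the data is sorted by x
by_split <- unlist(lapply(split(frame, frame$x),
                          function(d) seq_len(nrow(d))))
```

Here frame$z counts correctly within each group row by row, whereas by_split would be misaligned if assigned as a column, so sorting by x first is needed for the split-based answer.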

group by count when count is zero in r

I use the aggregate function to get counts by group. aggregate only returns a count for a group if the count is > 0. This is what I have:
dt <- data.frame(
n = c(1,2,3,4,5,6),
id = c('A','A','A','B','B','B'),
group = c("x","x","y","x","x","x"))
applying the aggregate function
my.count <- aggregate(n ~ id+group, dt, length)
now see the results
my.count[order(my.count$id),]
I get following
id group n
1 A x 2
3 A y 1
2 B x 3
I need the following (the last row has the zero count that I need):
id group n
1 A x 2
3 A y 1
2 B x 3
4 B y 0
Thanks for your help in advance.
We can create another column 'ind' and then use dcast to reshape from 'long' to 'wide', specifying the fun.aggregate as length and drop=FALSE.
library(reshape2)
dcast(transform(dt, ind='n'), id+group~ind,
value.var='n', length, drop=FALSE)
# id group n
#1 A x 2
#2 A y 1
#3 B x 3
#4 B y 0
Or a base R option is
as.data.frame(table(dt[-1]))
You can merge your "my.count" object with the complete set of "id" and "group" columns:
merge(my.count, expand.grid(lapply(dt[c("id", "group")], unique)), all = TRUE)
## id group n
## 1 A x 2
## 2 A y 1
## 3 B x 3
## 4 B y NA
There are several questions on SO that show you how to replace NA with 0 if that is required.
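For completeness, the merge approach together with the NA-to-zero step might look like the sketch below. The names all.pairs and full are illustrative, and passing stringsAsFactors = FALSE to expand.grid is a choice made here so that both sides of the merge hold plain character keys.

```r
dt <- data.frame(n = c(1, 2, 3, 4, 5, 6),
                 id = c("A", "A", "A", "B", "B", "B"),
                 group = c("x", "x", "y", "x", "x", "x"))

my.count <- aggregate(n ~ id + group, dt, length)

# Every id/group combination, whether or not it occurs in the data
all.pairs <- expand.grid(id = unique(dt$id),
                         group = unique(dt$group),
                         stringsAsFactors = FALSE)

# Outer join: combinations missing from my.count get n = NA
full <- merge(my.count, all.pairs, all = TRUE)

# A missing count means the combination occurred zero times
full$n[is.na(full$n)] <- 0
```

Replacing NA with 0 is only appropriate because n is a count; for a mean or a maximum an empty group should arguably stay NA.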
aggregate with drop=FALSE worked for me.
my.count <- aggregate(n ~ id+group, dt, length, drop=FALSE)
my.count[is.na(my.count)] <- 0
my.count
# id group n
# 1 A x 2
# 2 B x 3
# 3 A y 1
# 4 B y 0
If you are interested in frequencies only, you can build a frequency table from your formula and turn it into a dataframe:
as.data.frame(xtabs(formula = ~ id + group, dt))
Obviously this won't work for other aggregate functions. I'm still waiting for dplyr's summarise function to let the user decide whether zero-groups are kept or not. Maybe you can vote for this improvement here: https://github.com/hadley/dplyr/issues/341
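A short sketch of both cases in base R: counts via table, where empty combinations come out as 0, and another aggregate (sum, chosen here purely as an example) via tapply, where empty combinations come out as NA. The names counts and sums are hypothetical.

```r
dt <- data.frame(n = c(1, 2, 3, 4, 5, 6),
                 id = c("A", "A", "A", "B", "B", "B"),
                 group = c("x", "x", "y", "x", "x", "x"))

# Frequencies: table() keeps every id/group cell, including empty ones
counts <- as.data.frame(table(id = dt$id, group = dt$group),
                        responseName = "n")

# Other aggregates: tapply() returns an id-by-group matrix and
# leaves NA in cells that have no observations
sums <- tapply(dt$n, list(id = dt$id, group = dt$group), sum)
```

Whether the NA in sums should then become 0 depends on the aggregate, so it is left as is in this sketch.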
