Identify and label the largest number in each group - r

Hi, I want to identify and label the largest number in each group. Can someone tell me how to do this in R (or would Excel be easier)?
The following is example data. The original data contains only the two left columns, and I want to generate the third one. In the third column, I want to label the largest value in each group as 1 and everything else as 0, e.g., in group 1 the largest value is .02874, so it is marked 1. Thank you!
x <- read.table(header=T, text="group value largest
1 0.02827 0
1 0.02703 0
1 0.02874 1
2 0.03255 0
2 0.10394 1
2 0.03417 0
3 0.13858 0
3 0.16084 0
3 0.99830 1
3 0.24563 0")
UPDATE: Thank you all for your help! They all are great solutions!

A base R approach (no packages required):
is.largest <- function(x) as.integer(seq_along(x) == which.max(x))
x <- transform(x, largest = ave(value, group, FUN = is.largest))
Note that if I were you, I would remove the as.integer and just store a logical (TRUE/FALSE) vector.
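A minimal sketch of the logical variant suggested above, assuming the same example data. The column name is_largest is just an illustration, and unlike which.max this comparison flags every tied maximum in a group, not only the first:

```r
# Example data from the question (only the two input columns)
x <- data.frame(group = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                value = c(0.02827, 0.02703, 0.02874, 0.03255, 0.10394,
                          0.03417, 0.13858, 0.16084, 0.99830, 0.24563))

# ave() returns the group maximum repeated for every row of that group,
# so the comparison yields TRUE exactly where a row holds its group's maximum
x$is_largest <- x$value == ave(x$value, x$group, FUN = max)
x
```

If a 0/1 column is needed later, as.integer(x$is_largest) converts it.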

library(data.table)
x <- data.table(x)
y <- x[,list(value = max(value), maxindicator = TRUE), by = c('group')]
z <- merge(x,y, by = c('group','value'), all = TRUE)
Output
> z
group value largest maxindicator
1: 1 0.02703 0 NA
2: 1 0.02827 0 NA
3: 1 0.02874 1 TRUE
4: 2 0.03255 0 NA
5: 2 0.03417 0 NA
6: 2 0.10394 1 TRUE
7: 3 0.13858 0 NA
8: 3 0.16084 0 NA
9: 3 0.24563 0 NA
10: 3 0.99830 1 TRUE

Here is a solution with plyr :
x$largest <- 0
x <- ddply(x, .(group), function(df) {
df$largest[which.max(df$value)] <- 1
df
})
And one with base R :
x$largest <- 0
l <- split(x, x$group)
l <- lapply(l, function(df) {
df$largest[which.max(df$value)] <- 1
df
})
x <- do.call(rbind, l)

Here's a less cool base approach:
FUN <- function(x) {y <- rep(0, length(x)); y[which.max(x)] <- 1; y}
x$largest <- unlist(tapply(x$value, x$group, FUN))
## group value largest
## 1 1 0.02827 0
## 2 1 0.02703 0
## 3 1 0.02874 1
## 4 2 0.03255 0
## 5 2 0.10394 1
## 6 2 0.03417 0
## 7 3 0.13858 0
## 8 3 0.16084 0
## 9 3 0.99830 1
## 10 3 0.24563 0
It was more difficult to do in base than I had anticipated.
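One caveat, as far as I can tell: unlist(tapply(...)) returns the results group by group, so it only lines up with the rows when the data is already sorted by group. A small sketch with made-up, unsorted data (the data frame d and the name flag_max are hypothetical) showing that ave() puts each flag back in its original row:

```r
# Same helper as above: 1 at the position of the maximum, 0 elsewhere
flag_max <- function(v) { y <- rep(0, length(v)); y[which.max(v)] <- 1; y }

# Hypothetical data whose groups are interleaved rather than sorted
d <- data.frame(group = c(2, 1, 2, 1), value = c(5, 3, 9, 7))

# tapply + unlist concatenates group 1's flags, then group 2's,
# which no longer matches the row order of d
misaligned <- unlist(tapply(d$value, d$group, flag_max), use.names = FALSE)

# ave() applies the function per group and restores the original row order
d$largest <- ave(d$value, d$group, FUN = flag_max)
d
```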

Related

How to replace consecutive NAs with zero given a max gap parameter (in R)

I would like to replace all consecutive NA values per row with zero, but only if the number of consecutive NAs is at most a parameter maxgap.
This is very similar to the function zoo::na.locf
x = c(NA,1,2,3,NA,NA,5,6,7,NA,NA,NA)
zoo::na.locf(x, maxgap = 2, na.rm = FALSE)
gives
[1] NA 1 2 3 3 3 5 6 7 NA NA NA
There are two things different from my aim:
I would like to replace the leading NA too and I would like to replace the 2 consecutive NAs with 0 and not the last non-NA value.
I would like to get
0 1 2 3 0 0 5 6 7 NA NA NA
How can I do this in R? Can I use functions from the tidyverse?
If y is the result of the na.locf line then if y[i] is not NA but x[i] is NA then it was replaced so assign 0 to it. Also if it is a leading NA which occurs when the cumsum(...) term below is 0 then replace it as well.
replace(y, (!is.na(y) & is.na(x)) | cumsum(!is.na(y)) == 0, 0)
## [1] 0 1 2 3 0 0 5 6 7 NA NA NA
We can use rle to do this
f1 <- function(vec){
rl <- rle(is.na(vec))
lst <- within.list(rl, {
i1 <- seq_along(values)==1
i2 <- seq_along(values) != length(values)
values[!((lengths==2 & values & i2)|
(values & i1))] <- FALSE
})
vec[inverse.rle(lst)] <- 0
vec
}
f1(x)
#[1] 0 1 2 3 0 0 5 6 7 NA NA NA
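The function above hard-codes a run length of 2. A possible generalisation with maxgap as an argument is sketched below. The name zero_short_na_runs is made up, and it rests on a simplifying assumption: every NA run of length <= maxgap is replaced, wherever it occurs, so leading and trailing runs get no special treatment:

```r
zero_short_na_runs <- function(vec, maxgap) {
  runs <- rle(is.na(vec))
  # keep TRUE only for NA runs that are short enough to be filled
  runs$values <- runs$values & runs$lengths <= maxgap
  # expand the run flags back to one logical per element and fill those with 0
  vec[inverse.rle(runs)] <- 0
  vec
}

x <- c(NA, 1, 2, 3, NA, NA, 5, 6, 7, NA, NA, NA)
res <- zero_short_na_runs(x, maxgap = 2)
res
```

For the example vector this reproduces the requested result; with maxgap = 3 the trailing run would be filled as well.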
You could e.g. do this:
require(data.table)
require(dplyr)
x = c(NA,1,2,3,NA,NA,5,6,7,NA,NA,NA)
my_replace <- function(x, n, maxgap){
if(is.na(x[1]) && n <= maxgap){
x <- 0
}
x
}
data.frame(x, y=x) %>%
group_by(data.table::rleid(x)) %>%
mutate(x = my_replace(x, n(), 2), y = my_replace(y, n(), 1)) %>%
ungroup() %>%
select(x,y)
This lets you set maxgap per column: 2 for x and 1 for y.
This results in:
# A tibble: 12 × 2
x y
<dbl> <dbl>
1 0 0
2 1 1
3 2 2
4 3 3
5 0 NA
6 0 NA
7 5 5
8 6 6
9 7 7
10 NA NA
11 NA NA
12 NA NA

Cumulative sum for positive numbers only [duplicate]

I have this vector :
x = c(1,1,1,1,1,0,1,0,0,0,1,1)
And I want to do a cumulative sum for the positive numbers only. I should have the following vector in return:
xc = c(1,2,3,4,5,0,1,0,0,0,1,2)
How could I do it?
I've tried cumsum(x), but that computes the cumulative sum over all values and gives:
cumsum(x)
[1] 1 2 3 4 5 5 6 6 6 6 7 8
One option is
x1 <- inverse.rle(within.list(rle(x), values[!!values] <-
(cumsum(values))[!!values]))
x[x1!=0] <- ave(x[x1!=0], x1[x1!=0], FUN=seq_along)
x
#[1] 1 2 3 4 5 0 1 0 0 0 1 2
Or a one-line code would be
x[x>0] <- with(rle(x), sequence(lengths[!!values]))
x
#[1] 1 2 3 4 5 0 1 0 0 0 1 2
Here's a possible solution using data.table v >= 1.9.5 and its new rleid function
library(data.table)
as.data.table(x)[, cumsum(x), rleid(x)]$V1
## [1] 1 2 3 4 5 0 1 0 0 0 1 2
A base R one-line solution with Map and Reduce:
> Reduce('c', Map(function(u,v) if(v==0) rep(0,u) else 1:u, rle(x)$lengths, rle(x)$values))
[1] 1 2 3 4 5 0 1 0 0 0 1 2
Or:
unlist(Map(function(u,v) if(v==0) rep(0,u) else 1:u, rle(x)$lengths, rle(x)$values))
x=c(1,1,1,1,1,0,1,0,0,0,1,1)
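cumsum_ <- function(x) {
  r <- rle(x)
  # label each run of equal values, then split x by run
  s <- split(x, rep(seq_along(r$values), r$lengths))
  # cumulative sum within each run, rejoined in the original order
  unlist(lapply(s, cumsum), use.names = FALSE)
}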
(xc <- cumsum_(x))
# [1] 1 2 3 4 5 0 1 0 0 0 1 2
I don't know much R, but I have written a small piece of code in Python. The logic is the same in any language. I hope this helps:
x = [1,1,1,1,1,0,1,0,0,0,1,1]
tot = 0
for i in range(len(x)):
    if x[i] != 0:
        # keep a running total while the values are non-zero
        tot = tot + x[i]
        x[i] = tot
    else:
        # a zero resets the running total
        tot = 0
print(x)
x<-c(1,1,1,1,1,0,1,0,0,0,1,1)
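skumulowana <- function(x) {
  dl <- length(x)
  # xx[1] is a leading 0 so that the first element has a previous total
  xx <- numeric(dl + 1)
  for (i in seq_len(dl)) {
    # reset the running total at a zero, otherwise add the current value
    xx[i + 1] <- if (x[i] == 0) 0 else xx[i] + x[i]
  }
  # drop the leading 0 and return the result (no global assignment)
  xx[-1]
}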
skumulowana(x)
## [1] 1 2 3 4 5 0 1 0 0 0 1 2
Try this one-liner...
Reduce(function(x,y) (x+y)*(y!=0), x, accumulate=T)
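To see why this works, here is a small trace, assuming a hypothetical input y that contains values other than 0 and 1. The running total is carried forward and multiplied by FALSE (i.e. 0) whenever the current value is 0, which resets it:

```r
# Same function as in the one-liner, with descriptive argument names
step <- function(total, value) (total + value) * (value != 0)

y <- c(2, 3, 0, 4)                 # hypothetical input
res <- Reduce(step, y, accumulate = TRUE)
# total after each element: 2, (2 + 3) * 1 = 5, (5 + 0) * 0 = 0, (0 + 4) * 1 = 4
res
```

Because it adds the actual values, this is a true cumulative sum per run, not just a counter, which matters once the positive values are not all 1.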
split and lapply version:
x <- c(1,1,1,1,1,0,1,0,0,0,1,1)
unlist(lapply(split(x, cumsum(x==0)), cumsum))
step by step:
a <- split(x, cumsum(x==0)) # divides x into pieces where each 0 starts a new piece
b <- lapply(a, cumsum) # calculates cumsum in each piece
unlist(b) # rejoins the pieces
Result has useless names but is otherwise what you wanted:
# 01 02 03 04 05 11 12 2 3 41 42 43
# 1 2 3 4 5 0 1 0 0 0 1 2
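If the names are a nuisance, the same split / cumsum / rejoin idea can be sketched with ave(), which does the grouping and reassembly in one call and returns an unnamed vector. This is only an alternative spelling of the approach above, not a different algorithm:

```r
x <- c(1,1,1,1,1,0,1,0,0,0,1,1)

# cumsum(x == 0) increases by one at every zero, so each zero starts a new group;
# ave() then applies cumsum within each group and keeps the original order
xc <- ave(x, cumsum(x == 0), FUN = cumsum)
xc
```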
Here is another base R solution using aggregate. The idea is to make a data frame with x and a new column named x.1 by which we can apply aggregate functions (cumsum in this case):
x <- c(1,1,1,1,1,0,1,0,0,0,1,1)
r <- rle(x)
df <- data.frame(x,
x.1=unlist(sapply(1:length(r$lengths), function(i) rep(i, r$lengths[i]))))
# df
# x x.1
# 1 1 1
# 2 1 1
# 3 1 1
# 4 1 1
# 5 1 1
# 6 0 2
# 7 1 3
# 8 0 4
# 9 0 4
# 10 0 4
# 11 1 5
# 12 1 5
agg <- aggregate(df$x~df$x.1, df, cumsum)
as.vector(unlist(agg$`df$x`))
# [1] 1 2 3 4 5 0 1 0 0 0 1 2

combine tables into a data frame

How do I turn a list of tables into a data frame?
I have:
> (tabs <- list(table(c('a','a','b')),table(c('c','c','b')),table(c()),table(c('b','b'))))
[[1]]
a b
2 1
[[2]]
b c
1 2
[[3]]
< table of extent 0 >
[[4]]
b
2
I want:
> data.frame(a=c(2,0,0,0),b=c(1,1,0,2),c=c(0,2,0,0))
a b c
1 2 1 0
2 0 1 2
3 0 0 0
4 0 2 0
PS. Please do not assume that the tables were created by table calls! They were not!
c_names <- unique(unlist(sapply(tabs, names)))
df <- do.call(rbind, lapply(tabs, `[`, c_names))
colnames(df) <- c_names
df[is.na(df)] <- 0
This assumes the tables are one dimensional.
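A variation on the same name-lookup idea, sketched under the same one-dimensional assumption: start each row as a named vector of zeros and copy in the counts that exist, so no NA clean-up is needed afterwards. The names all_names, mat and counts are just for illustration:

```r
tabs <- list(table(c('a','a','b')), table(c('c','c','b')), table(c()), table(c('b','b')))

# union of all category names across the tables
all_names <- sort(unique(unlist(lapply(tabs, names))))

# one column per table: zeros everywhere, overwritten where the table has a count
mat <- vapply(tabs, function(tb) {
  counts <- setNames(numeric(length(all_names)), all_names)
  counts[names(tb)] <- as.vector(tb)
  counts
}, numeric(length(all_names)))

# transpose so that each table becomes a row, then convert to a data frame
df <- as.data.frame(t(mat))
df
```

The empty table contributes nothing to overwrite, so its row stays all zeros.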
all.names <- unique(unlist(lapply(tabs, names)))
df <- as.data.frame(do.call(rbind,
lapply(
tabs, function(x) as.list(replace(c(x)[all.names], is.na(c(x)[all.names]), 0))
) ) )
names(df) <- all.names
df
There is probably a cleaner way to do this.
# a b c
# 1 2 1 0
# 2 0 1 2
# 3 0 0 0
# 4 0 2 0
tabs <- list(table(c('a','a','b')),table(c('c','c','b')),table(c()),table(c('b','b')))
dat.names <- unique(unlist(sapply(tabs, names)))
dat <- matrix(0, nrow = length(tabs), ncol = length(dat.names))
colnames(dat) <- dat.names
for (ii in 1:length(tabs)) {
dat[ii, ] <- tabs[[ii]][match(colnames(dat), names(tabs[[ii]]) )]
}
dat[is.na(dat)] <- 0
> dat
a b c
[1,] 2 1 0
[2,] 0 1 2
[3,] 0 0 0
[4,] 0 2 0
Here is a pretty clean approach:
library(reshape2)
newTabs <- melt(tabs)
newTabs
# Var1 value L1
# 1 a 2 1
# 2 b 1 1
# 3 b 1 2
# 4 c 2 2
# 5 b 2 4
newTabs$L1 <- factor(newTabs$L1, seq_along(tabs))
dcast(newTabs, L1 ~ Var1, fill = 0, drop = FALSE)
# L1 a b c
# 1 1 2 1 0
# 2 2 0 1 2
# 3 3 0 0 0
# 4 4 0 2 0
This makes use of the fact that there is a melt method for lists (see reshape2:::melt.list) which automatically adds in a variable (L1 for an unnested list) that identifies the index of the list element. Since your list has some items which are empty, they won't show up in your melted list, so you need to factor the "L1" column, specifying the levels you want. dcast takes care of restructuring your output and allows you to specify the desired fill value.

Aggregating every 10 columns in binary matrix

I am new to R.
I would like to transform a binary matrix like this:
example:
" 1874 1875 1876 1877 1878 .... 2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
Since the column names are years, I would like to aggregate them into decades and obtain something like:
"1840-1849 1850-1859 1860-1869 .... 2000-2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
I am used to python and do not know how to do this transformation without making loops!
Thanks, isabel
It is unclear what aggregation you want, but using the following dummy data
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
The following counts events in each 10-year period.
Get the years as a numeric variable
years <- as.numeric(names(df))
Next we need an indicator for the start of each decade
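# floor the first year to its decade and go one decade past the last year;
# signif() rounds to nearest, so it would turn a first year such as 1876 into 1880
ind <- seq(from = years[1] %/% 10 * 10, to = (tail(years, 1) %/% 10 + 1) * 10, by = 10)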
We then apply over the indices of ind (1:(length(ind)-1)), select columns from df that are the current decade and count the 1s using rowSums.
tmp <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)])
}, inds = ind, data = df)
Next we cbind the resulting vectors into a data frame and fix-up the column names:
out <- do.call(cbind.data.frame, tmp)
names(out) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out
This gives:
> out
1870-1879 1880-1889 1890-1899
1 4 5 6
2 4 6 6
3 2 5 5
4 5 5 7
5 3 3 7
6 5 5 4
If you want simply a binary matrix with a 1 indicating at least 1 event happened in that decade, then you can use:
tmp2 <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
as.numeric(rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)]) > 0)
}, inds = ind, data = df)
out2 <- do.call(cbind.data.frame, tmp2)
names(out2) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out2
which gives:
> out2
1870-1879 1880-1889 1890-1899
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
If you want a different aggregation, then modify the function applied in the lapply call to use something other than rowSums.
This is another option, using modular arithmetic to aggregate the columns.
# setup, borrowed from @GavinSimpson
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
result <- do.call(cbind,
by(t(df), as.numeric(names(df)) %/% 10 * 10, colSums))
# add -xxx9 to column names, for each decade
dimnames(result)[[2]] <- paste(colnames(result), as.numeric(colnames(result)) + 9, sep='-')
# 1870-1879 1880-1889 1890-1899
# V1 4 5 6
# V2 4 6 6
# V3 2 5 5
# V4 5 5 7
# V5 3 3 7
# V6 5 5 4
If you wanted to aggregate with something other than sum, replace the call to
colSums with something like function(cols) lapply(cols, f), where f is the aggregating
function, e.g., max.
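A minimal base R sketch of aggregating by decade with a function other than sum. It assumes a small made-up matrix m with years as column names and uses f = max, which for 0/1 data marks whether anything happened in the decade. The names m, f and by_decade are hypothetical:

```r
# Hypothetical 3 x 4 binary matrix; columns are years
m <- matrix(c(1, 0, 1,  0, 0, 1,  1, 1, 0,  0, 0, 0), nrow = 3,
            dimnames = list(c("A", "B", "C"), c("1878", "1879", "1880", "1881")))

# decade each column belongs to, e.g. 1878 -> 1870
decade <- as.numeric(colnames(m)) %/% 10 * 10

f <- max  # the aggregating function; swap in sum, mean, ...

# split the column indices by decade and apply f across each row of that block;
# drop = FALSE keeps a matrix even when a decade has a single column
by_decade <- sapply(split(seq_len(ncol(m)), decade),
                    function(cols) apply(m[, cols, drop = FALSE], 1, f))

colnames(by_decade) <- paste(colnames(by_decade),
                             as.numeric(colnames(by_decade)) + 9, sep = "-")
by_decade
```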

Generate vectors using R

I would like to ask whether anyone knows a simple way to solve this kind of problem:
I need to generate all combinations of A numbers taken from the set {0, 1, 2, ..., B} whose sum is C.
E.g., if A=2, B=3, C=2:
Solution in this case:
(1,1);(0,2);(2,0)
So the vectors are length 2 (A), sum of all its items is 2 (C), possible values for each of vectors elements come from the set {0,1,2,3} (maximum is B).
A functional version since I already started before SO updated:
A=2
B=3
C=2
myfun <- function(a=A, b=B, c=C) {
out <- do.call(expand.grid, lapply(1:a, function(x) 0:b))
return(out[rowSums(out)==c,])
}
> myfun()
Var1 Var2
3 2 0
6 1 1
9 0 2
z <- expand.grid(0:3,0:3)
z[rowSums(z)==2, ]
Var1 Var2
3 2 0
6 1 1
9 0 2
If you wanted to do the expand grid programmatically this would work:
z <- expand.grid( rep( list(0:B), A) )
You need to expand as a list so that the items remain separate. rep(0:3, 3) would not return 3 separate sequences. So for A=3:
> z <- expand.grid(rep(list(0:3), 3))
> z[rowSums(z)==2, ]
Var1 Var2 Var3
3 2 0 0
6 1 1 0
9 0 2 0
18 1 0 1
21 0 1 1
33 0 0 2
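To make the point about expanding as a list concrete, here is a small sketch comparing the two forms for A = 3 and B = 3. The variable names as_list and as_vec are only for illustration:

```r
A <- 3
B <- 3

as_list <- rep(list(0:B), A)  # a list of 3 separate vectors 0:3
as_vec  <- rep(0:B, A)        # one flat vector of length 12

# the list yields one column per element and all (B + 1)^A combinations
grid_list <- expand.grid(as_list)
dim(grid_list)

# the flat vector is treated as a single factor, so there is nothing to combine
grid_vec <- expand.grid(as_vec)
dim(grid_vec)
```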
Using the nifty partitions package, and more interesting values of A, B, and C:
library(partitions)
A <- 2
B <- 5
C <- 7
comps <- t(compositions(C, A))
ii <- apply(comps, 1, FUN=function(X) all(X %in% 0:B))
comps[ii, ]
# [,1] [,2]
# [1,] 5 2
# [2,] 4 3
# [3,] 3 4
# [4,] 2 5

Resources