extracting variables in R using frequencies

Suppose I have a dataframe:
x y
a 1
b 2
a 3
a 4
b 5
c 6
a 7
d 8
a 9
b 10
e 12
b 13
c 15
I want to create another dataframe that includes only the x values that occur at least 3 times (a and b, in this case), and their highest corresponding y values.
So I want the output as:
x y
a 9
b 13
Here 9 and 13 are the highest y values for a and b, respectively.
I tried using:
sort-(table(x,y))
but it did not work.

The data.table package is great for this. If df is the original data, you can do
library(data.table)
setDT(df)[, .(y = max(y)[.N >= 3]), by=x]
# x y
# 1: a 9
# 2: b 13
.N is an integer giving the number of rows in each group (the groups are defined by x here). Subsetting max(y) with .N >= 3 returns an empty result for any group with fewer than three rows, so those groups are dropped from the output.
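The subsetting trick can be seen in isolation with base R. This is only a sketch of the idea; the names m, kept and dropped are made up for illustration, and the group sizes 5 and 2 stand in for .N.

```r
# Subsetting a length-one vector with a single logical keeps it or empties it.
m <- max(c(1, 3, 4, 7, 9))  # the maximum y for group "a"

kept    <- m[5 >= 3]  # condition is TRUE: the value 9 is kept
dropped <- m[2 >= 3]  # condition is FALSE: the result is numeric(0)

length(kept)     # 1
length(dropped)  # 0
```

A group whose result has length zero contributes no rows, which is why only a and b appear in the data.table output.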

Here's one way, using subset to omit any x that occurs fewer than 3 times, and then aggregate to find the maximum value by group:
d <- read.table(text='x y
a 1
b 2
a 3
a 4
b 5
c 6
a 7
d 8
a 9
b 10
e 12
b 13
c 15', header=TRUE)
with(subset(d, x %in% names(which(table(d$x) >= 3))),
aggregate(list(y=y), list(x=x), max))
# x y
# 1 a 9
# 2 b 13
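It may help to see the same base R idea split into separate steps. The sketch below rebuilds the data with data.frame() instead of read.table(), and the names counts, keep and res are chosen only for illustration.

```r
d <- data.frame(
  x = c("a", "b", "a", "a", "b", "c", "a", "d", "a", "b", "e", "b", "c"),
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15)
)

# Step 1: count how often each x value occurs
counts <- table(d$x)

# Step 2: keep the names of the values that occur at least 3 times
keep <- names(counts[counts >= 3])

# Step 3: restrict the data to those values and take the maximum y per group
res <- aggregate(y ~ x, data = d[d$x %in% keep, ], FUN = max)
res
```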
And for good measure, a dplyr approach:
library(dplyr)
d %>%
group_by(x) %>%
filter(n() >= 3) %>%
summarise(max(y))
# Source: local data frame [2 x 2]
#
# x max(y)
# 1 a 9
# 2 b 13

Related

How to update row by group in sequence

I have a dt:
library(data.table)
DT <- data.table(a = c(1,2,3,4,5), b = c(4,5,6,7,8), c = c("X","X","X","Y","Y") )
I want to add a column d, computed within each group of column c:
the first row of each group should equal b[i],
and every later row in the group should be d[i-1] + 2*b[i].
Intended results:
a b c d
1: 1 4 X 4
2: 2 5 X 14
3: 3 6 X 26
4: 4 7 Y 7
5: 5 8 Y 23
I tried functions such as shift, but I struggle to update rows dynamically (so to speak) here.
Is there an elegant data.table-style solution?

We can use cumsum and subtract the first row using [1]:
DT[, d := cumsum(2 * b) - b[1], .(c)][]
#> a b c d
#> 1: 1 4 X 4
#> 2: 2 5 X 14
#> 3: 3 6 X 26
#> 4: 4 7 Y 7
#> 5: 5 8 Y 23
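Why this works: unrolling the recurrence gives d[i] = b[1] + 2*(b[2] + ... + b[i]), which is the same as 2*(b[1] + ... + b[i]) - b[1]. A small base R check for the first group, where d_loop and d_closed are names made up for this sketch:

```r
b <- c(4, 5, 6)  # column b for group "X"

# The recurrence from the question, written as an explicit loop
d_loop <- numeric(length(b))
d_loop[1] <- b[1]
for (i in seq_along(b)[-1]) {
  d_loop[i] <- d_loop[i - 1] + 2 * b[i]
}

# The closed form used in the answer
d_closed <- cumsum(2 * b) - b[1]

d_loop    # 4 14 26
d_closed  # 4 14 26
```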

Here we can use accumulate from purrr:
library(purrr)
library(data.table)
DT[, d := accumulate(b, ~ .x + 2 *.y), by = c]
Or with Reduce and accumulate = TRUE from base R:
DT[, d := Reduce(function(x, y) x + 2 * y, b, accumulate = TRUE), by = c]
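If data.table is not available, the same Reduce call can be applied per group with ave from base R. This is a sketch on a plain data.frame; DF and step are hypothetical names, not part of the original answer.

```r
DF <- data.frame(a = 1:5,
                 b = c(4, 5, 6, 7, 8),
                 c = c("X", "X", "X", "Y", "Y"))

# One step of the recurrence: previous d plus twice the current b
step <- function(prev, cur) prev + 2 * cur

# ave() applies FUN to the values of b within each level of c
DF$d <- ave(DF$b, DF$c,
            FUN = function(v) Reduce(step, v, accumulate = TRUE))

DF$d  # 4 14 26  7 23
```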

Sum of individual elements in a vector

I would like to determine the sum for each individual element in a vector.
For example, suppose I have the vector
x <- c(2,3,2,2,5,5,3,3)
and I want to find the sum for each element.
The answer would be something like
2: 6
3: 9
5: 10
This is because there are three 2's (2+2+2 or 2*3), etc.
In other words, I essentially want to multiply each distinct value by the number of times it occurs in the vector.

Using base R tapply
tapply(x, x, sum)
# 2 3 5
# 6 9 10
If you need it as dataframe wrap it in stack
stack(tapply(x, x, sum))
# values ind
#1 6 2
#2 9 3
#3 10 5
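One thing to watch with stack: the ind column comes back as a factor, not a number. A minimal sketch of converting it back, where key is a hypothetical column name:

```r
x <- c(2, 3, 2, 2, 5, 5, 3, 3)
res <- stack(tapply(x, x, sum))

is.factor(res$ind)  # TRUE

# Go through character first; as.numeric() on a factor returns level codes
res$key <- as.numeric(as.character(res$ind))
res
```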
If you put the vector in a data frame, this becomes the standard sum-by-group problem (see How to sum a variable by group):
library(dplyr)
tibble::tibble(x) %>%
group_by(x) %>%
summarise(n = sum(x))
# A tibble: 3 x 2
# x n
# <dbl> <dbl>
#1 2 6
#2 3 9
#3 5 10

A method with dplyr:
x <- c(2,3,2,2,5,5,3,3)
a = tibble(x)
a %>% count(x) %>% mutate(xn = x*n)
# A tibble: 3 x 3
x n xn
<dbl> <int> <dbl>
1 2 3 6
2 3 3 9
3 5 2 10

Lots of ways to do this. A couple of base approaches:
with(rle(sort(x)), data.frame(val = values, freq = lengths, prod = lengths*values))
val freq prod
1 2 3 6
2 3 3 9
3 5 2 10
Or:
transform(as.data.frame(table(x), stringsAsFactors = FALSE), sum = as.numeric(x) * Freq)
x Freq sum
1 2 3 6
2 3 3 9
3 5 2 10

library(tidyverse)
x <- c(2,3,2,2,5,5,3,3)
tibble(x) %>%
count(x) %>%
mutate(xn = x*n ) %>%
pull(xn)

We can use rowsum from base R
rowsum(x, group = x)
# [,1]
#2 6
#3 9
#5 10
Or with by
by(x, x, FUN = sum)
Or with split
sapply(split(x, x), sum)
# 2 3 5
# 6 9 10
Or another option with xtabs
xtabs(x1 ~ x, cbind(x1 = x, x))
# 2 3 5
# 6 9 10
Or with ave
unique(data.frame(x, Sum = ave(x, x, FUN = sum)))
# x Sum
#1 2 6
#2 3 9
#5 5 10
Or using data.table
library(data.table)
data.table(grp = x, x=x)[, .(Sum = sum(x)), grp]
# grp Sum
#1: 2 6
#2: 3 9
#3: 5 10

How to merge and sum two data frames

Here is my issue:
df1 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df1) <- LETTERS[1:5]
df1
x y z
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
E 5 6 7
df2 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df2) <- LETTERS[3:7]
df2
x y z
C 1 2 3
D 2 3 4
E 3 4 5
F 4 5 6
G 5 6 7
what I wanted is:
x y z
A 1 2 3
B 2 3 4
C 4 6 8
D 6 8 10
E 8 10 12
F 4 5 6
G 5 6 7
where duplicated rows were added up by same variable.

A solution with base R:
# create a new variable from the rownames
df1$rn <- rownames(df1)
df2$rn <- rownames(df2)
# bind the two dataframes together by row and aggregate
res <- aggregate(cbind(x,y,z) ~ rn, rbind(df1,df2), sum)
# or (thx to @alistaire for reminding me):
res <- aggregate(. ~ rn, rbind(df1,df2), sum)
# assign the rownames again
rownames(res) <- res$rn
# get rid of the 'rn' column
res <- res[, -1]
which gives:
> res
x y z
A 1 2 3
B 2 3 4
C 4 6 8
D 6 8 10
E 8 10 12
F 4 5 6
G 5 6 7
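A shorter base R variant, offered as a sketch rather than as part of the answer above, is rowsum, which sums the rows of a data frame by a grouping vector. Here the grouping vector is built from the original row names, since rbind makes duplicated row names unique.

```r
df1 <- data.frame(x = 1:5, y = 2:6, z = 3:7, row.names = LETTERS[1:5])
df2 <- data.frame(x = 1:5, y = 2:6, z = 3:7, row.names = LETTERS[3:7])

# Stack the rows, then sum all rows that share an original row name
res <- rowsum(rbind(df1, df2),
              group = c(rownames(df1), rownames(df2)))
res
```

The result keeps the group labels as row names, so no extra rn column has to be created and removed.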

With dplyr,
library(dplyr)
# add rownames as a column in each data.frame and bind rows
bind_rows(df1 %>% add_rownames(),
df2 %>% add_rownames()) %>%
# evaluate following calls for each value in the rowname column
group_by(rowname) %>%
# add all non-grouping variables
summarise_all(sum)
## # A tibble: 7 x 4
## rowname x y z
## <chr> <int> <int> <int>
## 1 A 1 2 3
## 2 B 2 3 4
## 3 C 4 6 8
## 4 D 6 8 10
## 5 E 8 10 12
## 6 F 4 5 6
## 7 G 5 6 7

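You could also vectorize the operation by turning the data frames into matrices. Note that + adds matrices by position and ignores row names, so this only gives the intended result when both data frames have the same rows in the same order: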
result_df <- as.data.frame(as.matrix(df1) + as.matrix(df2))
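A quick check on the example data shows what matrix addition does here. This is a sketch; pos is a name chosen for illustration.

```r
df1 <- data.frame(x = 1:5, y = 2:6, z = 3:7, row.names = LETTERS[1:5])
df2 <- data.frame(x = 1:5, y = 2:6, z = 3:7, row.names = LETTERS[3:7])

# Row i of df1 is added to row i of df2, whatever the row names are
pos <- as.data.frame(as.matrix(df1) + as.matrix(df2))

nrow(pos)       # 5, not the 7 rows in the desired output
rownames(pos)   # "A" "B" "C" "D" "E", taken from the first matrix
pos["C", "x"]   # 6, whereas the desired value for C is 4
```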

This might need some tweaking to get the rownames logic working on a longer example:
dfr <-rbind(df1,df2)
do.call(rbind, lapply( split(dfr, sapply(rownames(dfr),substr,1,1)), colSums))
x y z
A 1 2 3
B 2 3 4
C 4 6 8
D 6 8 10
E 8 10 12
F 4 5 6
G 5 6 7
If the rownames could all be assumed to be alpha characters a gsub solution should be easy.

An alternative is to melt the data and cast it. First we copy the row names into a new last column of both data frames (thanks to @Jaap):
df1$rn <- rownames(df1)
df2$rn <- rownames(df2)
Then we melt the data, keeping rn as the id variable:
library(reshape2) # assumed source of melt() and dcast() used here
melt(list(df1, df2), id.vars = "rn")
Then we use dcast, with mget to retrieve all the df objects at once by name:
mydf<- dcast(melt(mget(ls(pattern = "df\\d+")), id.vars = "rn"),
rn ~ variable, value.var = "value", fun.aggregate = sum)
rownames(mydf) <- mydf$rn
# get rid of the 'rn' column
mydf <- mydf[, -1]
> mydf
# x y z
#A 1 2 3
#B 2 3 4
#C 4 6 8
#D 6 8 10
#E 8 10 12
#F 4 5 6
#G 5 6 7

In R, split a dataframe so subset dataframes contain last row of previous dataframe and first row of subsequent dataframe

There are many answers for how to split a dataframe, for example How to split a data frame?
However, I'd like to split a dataframe so that the smaller dataframes contain the last row of the previous dataframe and the first row of the following dataframe.
Here's an example
n <- 1:9
group <- rep(c("a","b","c"), each = 3)
data.frame(n = n, group)
n group
1 1 a
2 2 a
3 3 a
4 4 b
5 5 b
6 6 b
7 7 c
8 8 c
9 9 c
I'd like the output to look like:
d1 <- data.frame(n = 1:4, group = c(rep("a",3),"b"))
d2 <- data.frame(n = 3:7, group = c("a",rep("b",3),"c"))
d3 <- data.frame(n = 6:9, group = c("b",rep("c",3)))
d <- list(d1, d2, d3)
d
[[1]]
n group
1 1 a
2 2 a
3 3 a
4 4 b
[[2]]
n group
1 3 a
2 4 b
3 5 b
4 6 b
5 7 c
[[3]]
n group
1 6 b
2 7 c
3 8 c
4 9 c
What is an efficient way to accomplish this task?

Suppose DF is the original data.frame, the one with columns n and group. Let n be the number of rows in DF. Now define a function extract which given a sequence of indexes ix enlarges it to include the one prior to the first and after the last and then returns those rows of DF. Now that we have defined extract, split the vector 1, ..., n by group and apply extract to each component of the split.
n <- nrow(DF)
extract <- function(ix) DF[seq(max(1, min(ix) - 1), min(n, max(ix) + 1)), ]
lapply(split(seq_len(n), DF$group), extract)
$a
n group
1 1 a
2 2 a
3 3 a
4 4 b
$b
n group
3 3 a
4 4 b
5 5 b
6 6 b
7 7 c
$c
n group
6 6 b
7 7 c
8 8 c
9 9 c
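The index arithmetic inside extract can be looked at on its own. The sketch below uses a hypothetical helper widen that returns only the enlarged index range, and it assumes, as the answer does, that each group occupies a contiguous block of rows.

```r
DF <- data.frame(n = 1:9, group = rep(c("a", "b", "c"), each = 3))
n_rows <- nrow(DF)

# Extend an index range by one on each side, clamped to 1..n_rows
widen <- function(ix) seq(max(1, min(ix) - 1), min(n_rows, max(ix) + 1))

idx <- split(seq_len(n_rows), DF$group)
ranges <- lapply(idx, widen)
ranges
# $a is 1:4, $b is 3:7, $c is 6:9
```

The max(1, ...) and min(n_rows, ...) calls are what stop the first and last groups from reaching outside the data frame.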

Or why not try good ol' by, which "[a]ppl[ies] a Function to a Data Frame Split by Factors [INDICES]".
by(data = df, INDICES = df$group, function(x){
id <- c(min(x$n) - 1, x$n, max(x$n) + 1)
na.omit(df[id, ])
})
# df$group: a
# n group
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 b
# --------------------------------------------------------------------------------
# df$group: b
# n group
# 3 3 a
# 4 4 b
# 5 5 b
# 6 6 b
# 7 7 c
# --------------------------------------------------------------------------------
# df$group: c
# n group
# 6 6 b
# 7 7 c
# 8 8 c
# 9 9 c
Although the print method of by creates a 'fancy' output, the (default) result is a list, with elements named by the levels of the grouping variable (just try str and names on the resulting object).
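For instance, the result can be inspected and indexed like any named list. This sketch assumes df is the data frame from the question and relies, like the answer, on n being equal to the row index.

```r
df <- data.frame(n = 1:9, group = rep(c("a", "b", "c"), each = 3))

res <- by(data = df, INDICES = df$group, function(x) {
  id <- c(min(x$n) - 1, x$n, max(x$n) + 1)
  na.omit(df[id, ])  # out-of-range indices give NA rows, which are dropped
})

is.list(res)  # TRUE
names(res)    # "a" "b" "c"
res[["b"]]    # rows 3 to 7
```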

I was going to comment under @cdeterman's answer but it's too late now.
You can generalize his approach using data.table::shift (or dplyr::lag) to find the group indices and then run a simple lapply on the ranges, something like
library(data.table) # v1.9.6+
indx <- setDT(df)[, which(group != shift(group, fill = TRUE))]
lapply(Map(`:`, c(1L, indx - 1L), c(indx, nrow(df))), function(x) df[x,])
# [[1]]
# n group
# 1: 1 a
# 2: 2 a
# 3: 3 a
# 4: 4 b
#
# [[2]]
# n group
# 1: 3 a
# 2: 4 b
# 3: 5 b
# 4: 6 b
# 5: 7 c
#
# [[3]]
# n group
# 1: 6 b
# 2: 7 c
# 3: 8 c
# 4: 9 c

This could be done with a data.frame as well, but is there ever a reason not to use data.table? The loop can also be run in parallel by registering a backend and replacing %do% with %dopar%.
library(data.table)
n <- 1:9
group <- rep(c("a","b","c"), each = 3)
df <- data.table(n = n, group)
df[, `:=` (group = factor(df$group))]
df[, `:=` (group_i = seq_len(.N), group_N = .N), by = "group"]
library(doParallel)
groups <- unique(df$group)
foreach(i = seq(groups)) %do% {
df[group == groups[i] | (as.integer(group) == i + 1 & group_i == 1) | (as.integer(group) == i - 1 & group_i == group_N), c("n", "group"), with = FALSE]
}
[[1]]
n group
1: 1 a
2: 2 a
3: 3 a
4: 4 b
[[2]]
n group
1: 3 a
2: 4 b
3: 5 b
4: 6 b
5: 7 c
[[3]]
n group
1: 6 b
2: 7 c
3: 8 c
4: 9 c

Here is another dplyr way:
library(dplyr)
data =
data_frame(n = n, group) %>%
group_by(group)
firsts =
data %>%
slice(1) %>%
ungroup %>%
mutate(new_group = lag(group)) %>%
slice(-1)
lasts =
data %>%
slice(n()) %>%
ungroup %>%
mutate(new_group = lead(group)) %>%
slice(-n())
bind_rows(firsts, data, lasts) %>%
mutate(final_group =
ifelse(is.na(new_group),
group,
new_group) ) %>%
arrange(final_group, n) %>%
group_by(final_group)

Why can't I apply a function to create a new column with mutate() using dplyr?

I have a data.frame, let's call it "df".
I'm trying to create a column, let's call it "result", that sums four other columns.
Using dplyr, I can do it with the following code:
mutate(df, result=col1+col2+col3+col4)
However, when I try the following:
mutate(df, result=sum(col1, col2, col3, col4))
it does not work. Why does this happen?

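As pointed out, + and sum() behave differently: + adds its two arguments element by element and returns a vector, while sum() adds up every value in all of its arguments and returns a single number. Consider: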
> sum(1:10,1:10)
[1] 110
> `+`(1:10,1:10)
[1] 2 4 6 8 10 12 14 16 18 20
If you really want to sum() the variables along each row you want rowwise():
library(dplyr)
df <- data_frame(w = letters[1:3], x=1:3, y = x^2, z = y - x)
# Source: local data frame [3 x 4]
#
# w x y z
# 1 a 1 1 0
# 2 b 2 4 2
# 3 c 3 9 6
df %>% rowwise() %>% mutate(result = sum(x, y, z))
# Source: local data frame [3 x 5]
# Groups: <by row>
#
# w x y z result
# 1 a 1 1 0 2
# 2 b 2 4 2 8
# 3 c 3 9 6 18
Compare this to:
df %>% mutate(result = x + y + z)
# Source: local data frame [3 x 5]
#
# w x y z result
# 1 a 1 1 0 2
# 2 b 2 4 2 8
# 3 c 3 9 6 18
df %>% mutate(result = sum(x, y, z)) # sums over all of x, y and z and recycles the result!
# Source: local data frame [3 x 5]
#
# w x y z result
# 1 a 1 1 0 28
# 2 b 2 4 2 28
# 3 c 3 9 6 28
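If the goal is just a row-wise total, a vectorized base R alternative is rowSums, which avoids rowwise() entirely. This is a sketch on a plain data.frame built to match the example above.

```r
df <- data.frame(w = letters[1:3], x = 1:3)
df$y <- df$x^2
df$z <- df$y - df$x

# rowSums() adds across the selected columns, one total per row
df$result <- rowSums(df[, c("x", "y", "z")])
df$result  # 2 8 18

# sum() collapses everything into one number instead
total <- sum(df$x, df$y, df$z)
total  # 28
```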
