assign frequency as new column in r

Let's say I have the following character vector:
a <- c("a", "b", "c", "b", "a", "a", "e")
b <- table(a)
b gives me the frequency of every element in a. How do I create a data frame with two columns, where the first column is a and the second holds the frequency of each element?
The output should look like this:
f <- c(3, 2, 1, 2, 3, 3, 1)
output <- data.frame(a,f)
Thank you very much in advance!

We can use add_count to create a new column
library(tibble)
library(dplyr)
tibble(a) %>%
  add_count(a)
Or in base R with ave
data.frame(a, freq = ave(seq_along(a), a, FUN = length))
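Either way, the new column should hold the count for each element; for example, the ave() version should return:
# a freq
#1 a 3
#2 b 2
#3 c 1
#4 b 2
#5 a 3
#6 a 3
#7 e 1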
Or, if it needs to be built from 'b', match the vector 'a' against the names of 'b' to expand the table output, and then convert the table object to a data.frame with as.data.frame:
as.data.frame(b[a])
# a Freq
#1 a 3
#2 b 2
#3 c 1
#4 b 2
#5 a 3
#6 a 3
#7 e 1
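The match can also be written out explicitly; a sketch equivalent to indexing b by name:
data.frame(a, Freq = as.integer(b[match(a, names(b))]))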

Using merge:
merge(as.data.frame(a), as.data.frame(table(a)))
# a Freq
#1 a 3
#2 a 3
#3 a 3
#4 b 2
#5 b 2
#6 c 1
#7 e 1
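Note that merge() sorts the result by the key, so the original order of a is lost. If that order matters, one sketch (not part of the original answer) carries a row index along and re-sorts afterwards:
d1 <- data.frame(id = seq_along(a), a)
out <- merge(d1, as.data.frame(table(a)))
out[order(out$id), c("a", "Freq")]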

R - subset rows by rows in another data frame

Let's say I have a data frame df containing only factors/categorical variables. I have another data frame conditions where each row contains a different combination of the different factor levels of some subset of variables in df (made using expand.grid and levels etc.). I'm trying to figure out a way of subsetting df based on each row of conditions. So for example, if the column names of conditions are c("A", "B", "C") and the first row is c('a1', 'b1', 'c1'), then I want df[df$A == 'a1' & df$B == 'b1' & df$C == 'c1',], and so on.
I'd think this is a great time to use merge (or dplyr::*_join or ...):
df1 <- expand.grid(A = letters[1:4], B = LETTERS[1:4], stringsAsFactors = FALSE)
df1$rn <- seq_len(nrow(df1))
# 'df2' contains the conditions we want to filter (retain)
df2 <- data.frame(
  a1 = c('a', 'a', 'c'),
  b1 = c('B', 'C', 'C'),
  stringsAsFactors = FALSE
)
df1
# A B rn
# 1 a A 1
# 2 b A 2
# 3 c A 3
# 4 d A 4
# 5 a B 5
# 6 b B 6
# 7 c B 7
# 8 d B 8
# 9 a C 9
# 10 b C 10
# 11 c C 11
# 12 d C 12
# 13 a D 13
# 14 b D 14
# 15 c D 15
# 16 d D 16
df2
# a1 b1
# 1 a B
# 2 a C
# 3 c C
Using df2 to define which combinations we need to keep,
merge(df1, df2, by.x=c('A','B'), by.y=c('a1','b1'))
# A B rn
# 1 a B 5
# 2 a C 9
# 3 c C 11
# or
dplyr::inner_join(df1, df2, by=c(A='a1', B='b1'))
(I defined df2 with different column names just to show how it works; in reality, since its purpose is solely to declare which combinations to keep, it would make sense to give it the same column names, in which case the by= argument gets simpler.)
One option is to create the logical condition with Reduce, shown here for a single row of values:
df[Reduce(`&`, Map(`==`, df[c("A", "B", "C")], df[1, c("A", "B", "C")])),]
Or another option is rowSums:
df[rowSums(df[c("A", "B", "C")] ==
           df[1, c("A", "B", "C")][col(df[c("A", "B", "C")])]) == 3, ]
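To run this over every row of conditions from the question (rather than a single row of values), one sketch wraps the same Reduce() condition in lapply(), assuming the column names of conditions are a subset of those of df:
lapply(seq_len(nrow(conditions)), function(i) {
  df[Reduce(`&`, Map(`==`, df[names(conditions)], conditions[i, ])), , drop = FALSE]
})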

tidyr spread subset of key-value pairs

Given the example data, I'd like to spread a subset of the key-value pairs. In this case it is just one pair. However there are other cases where the subset to be spread is more than one pair.
library(tidyr)
# dummy data
> df1 <- data.frame(e = c(1, 1, 1, 1),
                    n = c("a", "b", "c", "d"),
                    s = c(1, 2, 5, 7))
> df1
e n s
1 1 a 1
2 1 b 2
3 1 c 5
4 1 d 7
Classical spread of all key-value pairs:
> df1 %>% spread(n,s)
e a b c d
1 1 1 2 5 7
Desired output, spreading only n = "c":
e c n s
1 1 5 a 1
2 1 5 b 2
3 1 5 d 7
We can do a gather after the spread
df1 %>%
  spread(n, s) %>%
  gather(n, s, -c, -e)
# e c n s
#1 1 5 a 1
#2 1 5 b 2
#3 1 5 d 7
Or, instead of spread/gather, we can filter out the 'c' rows and then use mutate to create the 'c' column from the 's' value that corresponds to 'c':
library(dplyr)
df1 %>%
  filter(n != "c") %>%
  mutate(c = df1$s[df1$n == "c"])
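With tidyr 1.0 or later, spread()/gather() are superseded by pivot_wider()/pivot_longer(), and the spread-then-gather idea can be written as follows (a sketch, not part of the original answers):
df1 %>%
  pivot_wider(names_from = n, values_from = s) %>%
  pivot_longer(-c(e, c), names_to = "n", values_to = "s")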

order a row by column name of other data frame and match in length

For example you have this data frame :
dd <- data.frame(b = c("cpg1", "cpg2", "cpg3", "cpg4"),
                 x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2))
dd
b x y z
1 cpg1 A 8 1
2 cpg2 D 3 1
3 cpg3 A 9 1
4 cpg4 C 9 2
I want to order the column names (b,x,y,z) by a row in another data frame which is:
d <- data.frame(pos = c("x", "z", "b"),
                g = c("A", "D", "A"), h = c(8, 3, 9))
d
pos g h
1 x A 8
2 z D 3
3 b A 9
So I want to order the columns of dd by the row d$pos, and dd should also end up with the same number of columns as there are entries in d$pos.
I tried with order and match but they did not give me the needed result. My dataset is quite large, so something automatic would be ideal.
Thanks a lot for your help!
We can do a match and then order the columns
i1 <- match(d$pos, names(dd), nomatch = 0)
dd[i1]
# x z b
#1 A 1 cpg1
#2 D 1 cpg2
#3 A 1 cpg3
#4 C 2 cpg4
Or if we want only the columns based on the 'd$pos'
dd[as.character(d$pos)]
# x z b
#1 A 1 cpg1
#2 D 1 cpg2
#3 A 1 cpg3
#4 C 2 cpg4
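The as.character() matters if d was created with stringsAsFactors = TRUE: subsetting a data frame by a factor uses the underlying integer codes rather than the labels, so indexing by name requires the conversion (a small illustrative sketch):
dd[as.character(d$pos)] # selects the columns named "x", "z", "b"
dd[d$pos]               # with a factor d$pos, this would pick columns by its integer codes instead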

Group data frame by elements from a variable containing lists of elements

I would like to perform a non-trivial group_by, grouping and summarizing a data frame by the individual elements of the lists found in one of its variables.
df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")
df
x y
1 1 A
2 2 A, B
3 3 C
4 4 B, D, C
5 5 E
Now, grouping by y (a variable holding lists of elements) and, say, counting the number of rows per group, the required end result should be:
data.frame(group = c("A", "B", "C", "D", "E"), n = c(2,2,2,1,1))
group n
1 A 2
2 B 2
3 C 2
4 D 1
5 E 1
Because "A" appears in 2 rows, "B" in 2 rows, etc.
Note: the sum of n is not necessarily equal to number of rows in the data frame.
We can use a simple base R solution with table to calculate the frequencies after unlisting the list column, and then build a data.frame from that table object:
tbl <- table(unlist(df$y))
data.frame(group = names(tbl), n = as.vector(tbl))
# group n
#1 A 2
#2 B 2
#3 C 2
#4 D 1
#5 E 1
Or another option with the tidyverse:
library(dplyr)
library(tidyr)
unnest(df, y) %>%
  group_by(group = y) %>%
  summarise(n = n())
# group n
# <chr> <int>
#1 A 2
#2 B 2
#3 C 2
#4 D 1
#5 E 1
Or, as @alexis_laz mentioned in the comments, an alternative is as.data.frame.table:
as.data.frame(table(group = unlist(df$y)), responseName = "n")
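which should give the same counts (group comes back as a factor here):
# group n
#1 A 2
#2 B 2
#3 C 2
#4 D 1
#5 E 1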
A simple base R solution (this is actually a duplicate question, though I was unable to locate the original):
sapply(unique(unlist(df$y)), function(x) sum(grepl(x, df$y)))
# A B C D E
# 2 2 2 1 1
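Note that grepl() matches against the collapsed text of each list element, so it could over-count if one label were a substring of another; a slightly safer sketch checks membership with %in%:
sapply(unique(unlist(df$y)), function(x) sum(sapply(df$y, function(v) x %in% v)))
# A B C D E
# 2 2 2 1 1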

Reshaping dataframe - two columns from correlation variables

I have the below df
var1 var2 Freq
1 a b 10
2 b a 5
3 b d 10
created from
help <- data.frame(var1 = c("a", "b", "b"), var2 = c("b", "a", "d"), Freq = c(10, 5, 10))
The a-b correlation is the same as the b-a correlation, so I am hoping to combine them into one row, to look like:
var1 var2 Freq
1 a b 15
2 b d 10
any thoughts?
Here's one way:
setNames(aggregate(help$Freq, as.data.frame(t(apply(help[-3], 1, sort))), sum),
         names(help))
# var1 var2 Freq
# 1 a b 15
# 2 b d 10
In base R:
do.call(rbind,
        by(help, rowSums(sapply(help[, c("var1", "var2")], as.integer)),
           function(x) data.frame(x[1, c("var1", "var2")],
                                  Freq = sum(x[, "Freq"]))))
var1 var2 Freq
3 a b 15
5 b d 10
This first creates a grouping variable by summing the integer representation of the two columns (which assumes they are factors), then sums the frequencies within each group, and finally binds the results together into a new data.frame.
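A dplyr sketch of the same idea (assuming dplyr 1.0+ for the .groups argument, and treating var1/var2 as character, hence the as.character() calls): sort each pair with pmin()/pmax(), then sum Freq per unordered pair.
library(dplyr)
help %>%
  mutate(v1 = pmin(as.character(var1), as.character(var2)),
         v2 = pmax(as.character(var1), as.character(var2))) %>%
  group_by(var1 = v1, var2 = v2) %>%
  summarise(Freq = sum(Freq), .groups = "drop")
# var1 var2 Freq
# a b 15
# b d 10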
