Frequency of values per column in table - r

What is a good way to get the independent frequency counts of multiple columns using dplyr? I want to go from a table of values:
# A tibble: 7 x 4
a b c d
<int> <int> <int> <int>
1 1 2 1 3
2 1 2 1 3
3 2 2 5 3
4 3 2 4 3
5 3 3 2 3
6 5 3 4 3
7 5 4 2 1
to a frequency table like so:
# A tibble: 5 x 5
x a_n b_n c_n d_n
<int> <int> <int> <int> <int>
1 1 2 0 2 1
2 2 1 4 2 0
3 3 2 2 0 6
4 4 0 1 2 0
5 5 2 0 1 0
I'm still trying to get my head around dplyr, but it seems like this is something it could do. If it is easier to do with an add-on library, that is fine too.

For the same data set that you provided in the question this would be another solution (base-R):
myfreq <- sapply(df, function(x) table(factor(x, levels=unique(unlist(df)), ordered=TRUE)))
Output would be:
> myfreq
# a b c d
# 1 2 0 2 1
# 2 1 4 2 0
# 3 2 2 0 6
# 5 2 0 1 0
# 4 0 1 2 0
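In the call above, the factor levels decide which values get a row and in what order, which is why 5 is listed before 4. A minimal sketch, assuming sorted levels are wanted and borrowing the column naming from the question (x, a_n, ...); lev, counts and freq are names made up for this illustration:

```r
# Data from the question
df <- data.frame(a = c(1L, 1L, 2L, 3L, 3L, 5L, 5L),
                 b = c(2L, 2L, 2L, 2L, 3L, 3L, 4L),
                 c = c(1L, 1L, 5L, 4L, 2L, 4L, 2L),
                 d = c(3L, 3L, 3L, 3L, 3L, 3L, 1L))

# Sorted union of all values: one row per level, in ascending order
lev <- sort(unique(unlist(df)))

# Count every column against the same levels (values absent from a column count as 0)
counts <- sapply(df, function(col) table(factor(col, levels = lev)))

# Assemble a data frame shaped like the requested output
freq <- data.frame(x = lev, counts)
names(freq)[-1] <- paste0(names(df), "_n")
freq
```

Because every column is tabulated against the same levels, the columns all have the same length and sapply can bind them into a matrix.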

Using tabulate in base R:
# nbins pads every column to the same length, so values that do not occur count as 0 instead of NA
apply(df, 2, function(x) tabulate(x, nbins = max(df)))
# a b c d
#[1,] 2 0 2 1
#[2,] 1 4 2 0
#[3,] 2 2 0 6
#[4,] 0 1 2 0
#[5,] 2 0 1 0

library(dplyr)
library(reshape2)
df %>%
melt() %>%
dcast(value ~ variable, fun.aggregate=length)
# value a b c d
# 1 1 2 0 2 1
# 2 2 1 4 2 0
# 3 3 2 2 0 6
# 4 4 0 1 2 0
# 5 5 2 0 1 0
Data
df <- structure(list(a = c(1L, 1L, 2L, 3L, 3L, 5L, 5L), b = c(2L, 2L,
2L, 2L, 3L, 3L, 4L), c = c(1L, 1L, 5L, 4L, 2L, 4L, 2L), d = c(3L,
3L, 3L, 3L, 3L, 3L, 1L)), .Names = c("a", "b", "c", "d"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7"))

library(tidyverse)
dt <- data.frame(a = c(1L, 1L, 2L, 3L, 3L, 5L, 5L), b = c(2L, 2L, 2L, 2L, 3L, 3L, 4L),
c = c(1L, 1L, 5L, 4L, 2L, 4L, 2L), d = c(3L, 3L, 3L, 3L, 3L, 3L, 1L))
dt2 <- dt %>%
mutate(ID = 1:n()) %>%
gather(Group, x, -ID) %>%
select(-ID) %>%
mutate(Group = paste(Group, "n", sep = "_")) %>%
count(Group, x) %>%
spread(Group, n, fill = 0L)

Related

Sum column values over a window and report the values of the previous window

I have a data.frame of the following form:
ID Var1
1 1
1 1
1 3
1 4
1 1
1 0
2 2
2 2
2 6
2 7
2 8
2 0
3 0
3 2
3 1
3 3
3 2
3 4
and I would like to get there:
ID Var1 X
1 1 0
1 1 0
1 3 0
1 4 5
1 1 5
1 0 5
2 2 0
2 2 0
2 6 0
2 7 10
2 8 10
2 0 10
3 0 0
3 2 0
3 1 0
3 3 3
3 2 3
3 4 3
So, in words: I'd like to calculate the sum of the variable in a window of 3 rows and then report the result obtained in the previous window. This should happen per ID, so the first three observations of every ID should be returned as 0, as there is no previous time period to report.
For understanding: In the actual dataset each row corresponds to one week and the window = 7. So X is supposed to give information on the sum of Var1 in the previous week.
I have tried rollapply, but it always ended in an error; also, if I understand correctly, its window is a rolling one, which is specifically not what I need.
Thanks for your answers!
In rollapply, the width argument can be a list which provides the offsets to use. In this case we want to use the points 3, 2 and 1 back for the first point, 4, 3 and 2 back for the second, 5, 4 and 3 back for the third and then recycle. That is, for a window width of k = 3 we would want the following list of offset vectors:
w <- list(-(3:1), -(4:2), -(5:3))
In general we can write w below in terms of the window width k. ave then invokes rollapply with that width list for each ID.
library(zoo)
k <- 3
w <- lapply(1:k, function(x) seq(to = -x, length = k))
transform(DF, X = ave(Var1, ID, FUN = function(x) rollapply(x, w, sum, fill = 0)))
giving:
ID Var1 X
1 1 1 0
2 1 1 0
3 1 3 0
4 1 4 5
5 1 1 5
6 1 0 5
7 2 2 0
8 2 2 0
9 2 6 0
10 2 7 10
11 2 8 10
12 2 0 10
13 3 0 0
14 3 2 0
15 3 1 0
16 3 3 3
17 3 2 3
18 3 4 3
Note
The input DF in reproducible form is:
DF <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Var1 = c(1L, 1L, 3L, 4L, 1L,
0L, 2L, 2L, 6L, 7L, 8L, 0L, 0L, 2L, 1L, 3L, 2L, 4L)),
class = "data.frame", row.names = c(NA, -18L))
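For comparison, the same "sum of the previous block" logic can be sketched without zoo. This is only an illustration, assuming rows are already ordered within each ID; prev_block_sum is a hypothetical helper, not part of any package:

```r
DF <- data.frame(
  ID = rep(1:3, each = 6),
  Var1 = c(1L, 1L, 3L, 4L, 1L, 0L, 2L, 2L, 6L, 7L, 8L, 0L, 0L, 2L, 1L, 3L, 2L, 4L)
)
k <- 3

# Hypothetical helper: for each element, the sum of the previous block of k values
prev_block_sum <- function(x, k) {
  block <- (seq_along(x) - 1) %/% k   # 0-based block index: 0, 0, 0, 1, 1, 1, ...
  sums <- tapply(x, block, sum)       # one sum per block
  # prepend 0 so that block b picks up the sum of block b - 1 (the first block gets 0)
  unname(c(0, sums)[block + 1])
}

DF$X <- ave(DF$Var1, DF$ID, FUN = function(x) prev_block_sum(x, k))
DF
```

The windows here are fixed, non-overlapping blocks, which is what the recycled offset list achieves in the rollapply version.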
We could group by 'ID' and then add a second grouping column with blocks of size 3 using gl. Summarising gives the sum of 'Var1' per block while keeping the original 'Var1' values in a list column; we then take the lag of 'X' within each ID and unnest the list column.
library(dplyr) #1.0.0
library(tidyr)
df1 %>%
# // grouping by ID
group_by(ID) %>%
# // create another group added with gl
group_by(grp = as.integer(gl(n(), 3, n())), .add = TRUE) %>%
# // get the sum of Var1, while changing the Var1 in a list
summarise(X = sum(Var1), Var1 = list(Var1)) %>%
# // get the lag of X
mutate(X = lag(X, default = 0)) %>%
# // unnest the list column
unnest(c(Var1)) %>%
select(names(df1), X)
# A tibble: 18 x 3
# Groups: ID [3]
# ID Var1 X
# <int> <int> <dbl>
# 1 1 1 0
# 2 1 1 0
# 3 1 3 0
# 4 1 4 5
# 5 1 1 5
# 6 1 0 5
# 7 2 2 0
# 8 2 2 0
# 9 2 6 0
#10 2 7 10
#11 2 8 10
#12 2 0 10
#13 3 0 0
#14 3 2 0
#15 3 1 0
#16 3 3 3
#17 3 2 3
#18 3 4 3
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Var1 = c(1L, 1L, 3L, 4L, 1L,
0L, 2L, 2L, 6L, 7L, 8L, 0L, 0L, 2L, 1L, 3L, 2L, 4L)), class = "data.frame",
row.names = c(NA,
-18L))
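To see what the gl call contributes, here is a small sketch of the block ids it generates for a single ID. The group size of 7 is a made-up value, chosen to show an incomplete last block:

```r
n <- 7    # hypothetical number of rows in one ID
k <- 3L   # window size

# gl(n, k, n): levels 1..n, each repeated k times, truncated to length n
grp <- as.integer(gl(n, k, n))
grp

# An equivalent id built with integer division
grp2 <- (seq_len(n) - 1L) %/% k + 1L
identical(grp, grp2)
```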

Sorting data with some similar words in R

I have a database with 100 columns, but a minimal reproduction of my data is as follows:
df1 <- read.table(text="PG1S1AW KOM1S1zo PG2S2AW KOM2S2zo PG3S3AW KOM3S3zo PG4S4AW KOM4S4zo PG5S5AW KOM5S5zo
4 1 2 4 4 3 0 4 0 5
4 4 3 1 3 1 0 3 0 1
2 3 5 3 3 2 1 4 0 2
1 1 1 1 1 3 0 5 0 1
2 5 3 4 4 5 0 1 3 4", header=TRUE)
I want to select the columns starting with KOM or PG whose number is greater than 3, i.e. PG4, KOM4 and above. Put simply, I want the PG and KOM columns that share the same number, where that number is 4 or greater.
The intended output is:
PG4S4AW KOM4S4zo PG5S5AW KOM5S5zo
0 4 0 5
0 3 0 1
1 4 0 2
0 5 0 1
0 1 3 4
I have used the following code, but it does not work for me:
df2<- df1%>% select(contains("KO"))
Thanks for your help.
The pattern is not entirely clear from the question. We create a function (f1) that uses str_extract (from stringr) to extract the one or more digits (\\d+) that follow 'KOM' or (|) 'PG' and converts them to numeric ('v1'); similarly, it extracts the digits after the 'S' ('v2'). It then checks whether the two values are equal and greater than 3. The check is wrapped in which, which returns the indices of the TRUE elements and so drops any NAs produced when str_extract finds no match. The function is then used in select to pick the columns that follow the pattern.
library(dplyr)
library(stringr)
f1 <- function(nm) {
v1 <- as.numeric(str_extract(nm, "(?<=(KOM|PG))\\d+"))
v2 <- as.numeric(str_extract(nm, "(?<=S)\\d+"))
nm[which((v1 == v2) & (v1 > 3))]
}
df1 %>%
select(f1(names(.)))
# PG4S4AW KOM4S4zo PG5S5AW KOM5S5zo
#1 0 4 0 5
#2 0 3 0 1
#3 1 4 0 2
#4 0 5 0 1
#5 0 1 3 4
data
df1 <- structure(list(PG1S1AW = c(4L, 4L, 2L, 1L, 2L), KOM1S1zo = c(1L,
4L, 3L, 1L, 5L), PG2S2AW = c(2L, 3L, 5L, 1L, 3L), KOM2S2zo = c(4L,
1L, 3L, 1L, 4L), PG3S3AW = c(4L, 3L, 3L, 1L, 4L), KOM3S3zo = c(3L,
1L, 2L, 3L, 5L), PG4S4AW = c(0L, 0L, 1L, 0L, 0L), KOM4S4zo = c(4L,
3L, 4L, 5L, 1L), PG5S5AW = c(0L, 0L, 0L, 0L, 3L), KOM5S5zo = c(5L,
1L, 2L, 1L, 4L)), class = "data.frame", row.names = c(NA, -5L
))
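The same name check can be sketched in base R with regexec and regmatches instead of stringr. This is a rough equivalent rather than the answer's own code; the name "other" is a hypothetical non-matching column added to show that it is dropped:

```r
nm <- c("PG1S1AW", "KOM1S1zo", "PG2S2AW", "KOM2S2zo", "PG3S3AW", "KOM3S3zo",
        "PG4S4AW", "KOM4S4zo", "PG5S5AW", "KOM5S5zo",
        "other")  # hypothetical non-matching name, for illustration only

# Capture the digits after the prefix and the digits after "S"
pat <- "^(KOM|PG)([0-9]+)S([0-9]+)"
parts <- regmatches(nm, regexec(pat, nm))  # character(0) where nothing matches

# Each match holds: full match, prefix, first number, second number
keep <- vapply(parts, function(p) {
  length(p) == 4 && p[3] == p[4] && as.numeric(p[3]) > 3
}, logical(1))

nm[keep]
```

The resulting names could then be used to subset the data frame, for example df1[nm[keep]].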
Given your example data, you can instead just look for the digits 4 or 5 in the column names.
df1 %>%
select(matches("4|5"))
#> PG4S4AW KOM4S4zo PG5S5AW KOM5S5zo
#> 1 0 4 0 5
#> 2 0 3 0 1
#> 3 1 4 0 2
#> 4 0 5 0 1
#> 5 0 1 3 4

Count values of the whole dataframe

I have this dataframe:
> df
X1 X2 X3 X4 X5 X6 X7
1 2 7 2 3 5 6 7
2 4 2 3 6 1 NA 3
3 3 6 4 4 4 7 7
4 6 5 6 NA 3 1 7
5 1 1 2 3 3 3 7
6 4 7 2 4 5 4 2
7 5 NA 4 5 2 2 3
8 3 7 2 4 4 1 5
9 4 5 6 2 5 6 3
10 2 4 6 4 5 6 3
I want to count the occurrences of the numbers 1, 2, 3 and 4 and assign the count to x, of 6 and 7 and assign it to y, and of all the numbers (1 to 7) and assign it to z. After this, I will compute y/z - x/z.
I've done it with table(unlist(df)) and then assigning the values individually. However, I'm looking for a solution without a loop or apply(), as I can't see a way to scale them: I have nearly 100 columns and 10,000 rows (I know that all values are integers from 1 to 7 or NA).
I'm looking for a solution like this:
x <- count(df, c(1,2,3,4), na.rm = TRUE)
y <- count(df, c(6,7), na.rm = TRUE)
z <- count(df, c(1,2,3,4,5,6,7), na.rm = TRUE)
However, it seems that count() doesn't work like that, and I can't find a function that does.
Any suggestions?
A base R solution.
vec <- unlist(df)
vec_c <- table(vec)
x <- sum(vec_c[names(vec_c) %in% as.character(1:4)])
y <- sum(vec_c[names(vec_c) %in% as.character(6:7)])
z <- sum(vec_c)
y/z - x/z
# [1] -0.358209
Another idea.
vec <- unlist(df)
x <- sum(vec %in% 1:4)
y <- sum(vec %in% 6:7)
z <- length(vec[!is.na(vec)])
y/z - x/z
# [1] -0.358209
Another idea.
m <- as.matrix(df)
x <- sum(m %in% 1:4)
y <- sum(m %in% 6:7)
z <- sum(!is.na(df))
y/z - x/z
# [1] -0.358209
DATA
df <- read.table(text = " X1 X2 X3 X4 X5 X6 X7
1 2 7 2 3 5 6 7
2 4 2 3 6 1 NA 3
3 3 6 4 4 4 7 7
4 6 5 6 NA 3 1 7
5 1 1 2 3 3 3 7
6 4 7 2 4 5 4 2
7 5 NA 4 5 2 2 3
8 3 7 2 4 4 1 5
9 4 5 6 2 5 6 3
10 2 4 6 4 5 6 3",
header = TRUE)
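Since the values are known to be integers from 1 to 7, one more sketch is to tabulate once and index the counts by position. It relies on tabulate silently ignoring NA and on nbins = 7 being the right upper bound, both of which are assumptions tied to this data set:

```r
df <- read.table(text = "X1 X2 X3 X4 X5 X6 X7
2 7 2 3 5 6 7
4 2 3 6 1 NA 3
3 6 4 4 4 7 7
6 5 6 NA 3 1 7
1 1 2 3 3 3 7
4 7 2 4 5 4 2
5 NA 4 5 2 2 3
3 7 2 4 4 1 5
4 5 6 2 5 6 3
2 4 6 4 5 6 3", header = TRUE)

# One pass over all cells; position i of counts holds the number of i's
counts <- tabulate(unlist(df, use.names = FALSE), nbins = 7)

x <- sum(counts[1:4])
y <- sum(counts[6:7])
z <- sum(counts)      # NAs are not counted
(y - x) / z
```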
Here is an option using tidyverse
library(tidyverse)
gather(df, na.rm = TRUE) %>%
count(value) %>%
mutate(n1 = sum(n)) %>%
filter(value %in% c(1:4, 6:7)) %>%
group_by(grp = value %in% 6:7) %>%
summarise(perc = sum(n)/first(n1)) %>%
summarise(z = diff(perc))
# A tibble: 1 x 1
# z
# <dbl>
# 1 -0.358
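Another approach that sticks with table(), putting your counting structure into a list.
tab <- table(unlist(d))
count <- setNames(lapply(list(1:4, 6:7, 1:7), function(x){
# select counts by name rather than position, so a value absent from the data cannot shift the result
sum(tab[names(tab) %in% x])
}), tail(letters, 3))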
> with(count, y/z - x/z)
[1] -0.358209
Data
d <- structure(list(X1 = c(2L, 4L, 3L, 6L, 1L, 4L, 5L, 3L, 4L, 2L),
X2 = c(7L, 2L, 6L, 5L, 1L, 7L, NA, 7L, 5L, 4L), X3 = c(2L,
3L, 4L, 6L, 2L, 2L, 4L, 2L, 6L, 6L), X4 = c(3L, 6L, 4L, NA,
3L, 4L, 5L, 4L, 2L, 4L), X5 = c(5L, 1L, 4L, 3L, 3L, 5L, 2L,
4L, 5L, 5L), X6 = c(6L, NA, 7L, 1L, 3L, 4L, 2L, 1L, 6L, 6L
), X7 = c(7L, 3L, 7L, 7L, 7L, 2L, 3L, 5L, 3L, 3L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))

Finding difference between specific rows by group

Within a group, I want to find the difference between each row and the row where that user first appeared in the data. For example, I need to create the diff variable below. Each user has a different number of rows, as in the following data:
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L),
money = c(9L, 12L, 13L, 15L, 5L, 7L, 8L, 5L, 2L, 10L), occurence = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 1L, 1L, 2L), diff = c(NA, 3L, 4L,
6L, NA, 2L, 3L, NA, NA, 8L)), .Names = c("ID", "money", "occurence",
"diff"), class = "data.frame", row.names = c(NA, -10L))
ID money occurence diff
1 1 9 1 NA
2 1 12 2 3
3 1 13 3 4
4 1 15 4 6
5 2 5 1 NA
6 2 7 2 2
7 2 8 3 3
8 3 5 1 NA
9 4 2 1 NA
10 4 10 2 8
You can use ave(). We just remove the first value per group and replace it with NA, and subtract the first value from the rest of the values.
with(df, ave(money, ID, FUN = function(x) c(NA, x[-1] - x[1])))
# [1] NA 3 4 6 NA 2 3 NA NA 8
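As a small check of the ave() approach, here is a sketch on made-up data in which a later value equals the group's first value. The toy data frame is hypothetical; the point is that the NA is placed by position, so a genuine difference of 0 is kept:

```r
# Hypothetical data: in ID 1 the third value equals the first one
toy <- data.frame(ID = c(1L, 1L, 1L, 2L),
                  money = c(9L, 12L, 9L, 5L))

# First row per group becomes NA, every other row is compared with the first
toy$diff <- with(toy, ave(money, ID, FUN = function(x) c(NA, x[-1] - x[1])))
toy
```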
A dplyr solution, which uses the first function to get the first value and calculate the difference.
library(dplyr)
df2 <- df %>%
group_by(ID) %>%
mutate(diff = money - first(money)) %>%
# set the first row of each group to NA by position, so genuine zero differences are kept
mutate(diff = replace(diff, row_number() == 1, NA)) %>%
ungroup()
df2
# # A tibble: 10 x 4
# ID money occurence diff
# <int> <int> <int> <int>
# 1 1 9 1 NA
# 2 1 12 2 3
# 3 1 13 3 4
# 4 1 15 4 6
# 5 2 5 1 NA
# 6 2 7 2 2
# 7 2 8 3 3
# 8 3 5 1 NA
# 9 4 2 1 NA
# 10 4 10 2 8
Update
Here is a data.table solution, adapted from one provided by Sotos. The result is assigned to diff by reference and the first row of each group is set to NA by position, so there is no need to replace 0 with NA.
library(data.table)
setDT(df)[, diff := replace(money - first(money), 1, NA), by = ID][]
# ID money occurence diff
# 1: 1 9 1 NA
# 2: 1 12 2 3
# 3: 1 13 3 4
# 4: 1 15 4 6
# 5: 2 5 1 NA
# 6: 2 7 2 2
# 7: 2 8 3 3
# 8: 3 5 1 NA
# 9: 4 2 1 NA
# 10: 4 10 2 8
DATA
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L),
money = c(9L, 12L, 13L, 15L, 5L, 7L, 8L, 5L, 2L, 10L), occurence = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 1L, 1L, 2L)), .Names = c("ID", "money",
"occurence"), row.names = c(NA, -10L), class = "data.frame")

R Aggregate and count of not null

I have the following data table
PIECE SAMPLE QC_CODE
1 1 1
2 1 NA
3 2 2
4 2 4
5 2 NA
6 3 6
7 3 3
8 3 NA
9 4 6
10 4 NA
and I would like to count the number of qc_code in each sample and return an output like this
SAMPLE SAMPLE_SIZE QC_CODE_COUNT
1 2 1
2 3 2
3 3 2
4 2 1
where SAMPLE_SIZE is the count of pieces in each sample, and QC_CODE_COUNT is the count of all QC_CODE values that are not NA.
How would I go about this in R?
You can try
library(dplyr)
df1 %>%
group_by(SAMPLE) %>%
summarise(SAMPLE_SIZE=n(), QC_CODE_UNIT= sum(!is.na(QC_CODE)))
# SAMPLE SAMPLE_SIZE QC_CODE_UNIT
#1 1 2 1
#2 2 3 2
#3 3 3 2
#4 4 2 1
Or
library(data.table)
setDT(df1)[,list(SAMPLE_SIZE=.N, QC_CODE_UNIT=sum(!is.na(QC_CODE))), by=SAMPLE]
Or using aggregate from base R
do.call(data.frame,aggregate(QC_CODE~SAMPLE, df1, na.action=NULL,
FUN=function(x) c(SAMPLE_SIZE=length(x), QC_CODE_UNIT= sum(!is.na(x)))))
data
df1 <- structure(list(PIECE = 1:10, SAMPLE = c(1L, 1L, 2L, 2L, 2L, 3L,
3L, 3L, 4L, 4L), QC_CODE = c(1L, NA, 2L, 4L, NA, 6L, 3L, NA,
6L, NA)), .Names = c("PIECE", "SAMPLE", "QC_CODE"), class = "data.frame",
row.names = c(NA, -10L))
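The two counts described in the question (rows per sample, and non-NA codes per sample) can also be sketched with tapply in base R. The output column names follow the question; size, n_ok and res are names invented for this illustration:

```r
df1 <- data.frame(PIECE = 1:10,
                  SAMPLE = c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L),
                  QC_CODE = c(1L, NA, 2L, 4L, NA, 6L, 3L, NA, 6L, NA))

# Rows per sample
size <- tapply(df1$PIECE, df1$SAMPLE, length)

# Non-NA codes per sample: summing a logical vector counts its TRUE values
n_ok <- tapply(!is.na(df1$QC_CODE), df1$SAMPLE, sum)

res <- data.frame(SAMPLE = as.integer(names(size)),
                  SAMPLE_SIZE = as.vector(size),
                  QC_CODE_COUNT = as.vector(n_ok))
res
```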
