Loop to create bivariate/cross table - r

I am trying to write a loop that gets the frequency table between column 1 and column 2, column 1 and column 3, and so on up to column 1 and column 30.
Col1 col2 col3
0 A 25
1 A 30
0 A 30
1 B 20
0 B 20
Output.
0 1 0 1
A 2 1 25 0 0
B 1 1 30 1 1
20 1 1

Use lapply to loop over columns and then table to calculate frequency
lapply(df[-1], function(x) table(x, df[, 1]))
#$col2
#x 0 1
# A 2 1
# B 1 1
#$col3
#x 0 1
# 20 1 1
# 25 1 0
# 30 1 1
Or a shorter version using Map
Map(table, df[1], df[-1])
data
df <- structure(list(Col1 = c(0L, 1L, 0L, 1L, 0L), col2 = structure(c(1L,
1L, 1L, 2L, 2L), .Label = c("A", "B"), class = "factor"), col3 = c(25L,
30L, 30L, 20L, 20L)), class = "data.frame", row.names = c(NA, -5L))
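Since the question asks for a loop, here is a minimal sketch of the same idea written as an explicit for loop in base R. It rebuilds the sample data with data.frame() for brevity; the names tabs and nm are only illustrative and do not come from the answer above.

```r
# Sample data from the question, rebuilt with data.frame()
df <- data.frame(Col1 = c(0L, 1L, 0L, 1L, 0L),
                 col2 = c("A", "A", "A", "B", "B"),
                 col3 = c(25L, 30L, 30L, 20L, 20L))

# One cross table per remaining column, stored in a named list
tabs <- vector("list", ncol(df) - 1)
names(tabs) <- names(df)[-1]

for (nm in names(tabs)) {
  # rows: values of the current column, columns: values of Col1
  tabs[[nm]] <- table(df[[nm]], df$Col1, dnn = c(nm, "Col1"))
}

tabs$col2
tabs$col3
```

With 30 columns the loop body stays the same, because it iterates over whatever column names follow the first one.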

We can use tidyverse
library(tidyverse)
map(names(df)[-1], ~ cbind(df[1], df[.x]) %>%
count(Col1, !! rlang::sym(.x)) %>%
spread(Col1, n, fill = 0))

Related

Aggregate sum of column within groups [duplicate]

Hello everyone, I'm looking for help aggregating the sum of each column within the groups defined by df$Names.
Here is the df
Names COL1 COL2 COL3 COL4
A 2 2 0 1
A 3 1 1 1
A 3 2 0 1
A 4 0 4 0
B 1 1 0 0
B 3 1 1 1
The expected output is :
Names COL1 COL2 COL3 COL4
A 12 5 5 3
B 4 2 1 1
Here are the data :
structure(list(Names = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), COL1 = c(2L, 3L, 3L, 4L, 1L, 3L), COL2 = c(2L,
1L, 2L, 0L, 1L, 1L), COL3 = c(0L, 1L, 0L, 4L, 0L, 1L), COL4 = c(1L,
1L, 1L, 0L, 0L, 1L)), class = "data.frame", row.names = c(NA,
-6L))
I tried:
aggregate(cbind(COL1,COL2,COL3,COL4) ~ Names, data = df, sum, na.rm = TRUE)
Does this work:
library(dplyr)
df %>% group_by(Names) %>% summarise(across(starts_with('COL'), ~ sum(.)))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 5
Names COL1 COL2 COL3 COL4
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 12 5 5 3
2 B 4 2 1 1
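For comparison, a base R sketch of the same grouped sum, assuming the data frame from the question (rebuilt here with data.frame(), with Names as a character column). The names res_formula and res_rowsum are only illustrative.

```r
df <- data.frame(Names = c("A", "A", "A", "A", "B", "B"),
                 COL1 = c(2L, 3L, 3L, 4L, 1L, 3L),
                 COL2 = c(2L, 1L, 2L, 0L, 1L, 1L),
                 COL3 = c(0L, 1L, 0L, 4L, 0L, 1L),
                 COL4 = c(1L, 1L, 1L, 0L, 0L, 1L))

# Formula interface: "." stands for every column other than Names
res_formula <- aggregate(. ~ Names, data = df, FUN = sum)

# rowsum() returns the same totals, with the group labels as row names
res_rowsum <- rowsum(df[-1], group = df$Names)

res_formula
res_rowsum
```

The formula version keeps Names as a regular column, which is usually more convenient if the result is going to be merged or plotted later.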

Matching the previous row in a specific column and performing a calculation in R

I currently have a data file that resembles this:
R ID A B
1 A1 0 0
2 A1 2 4
3 A1 4 8
4 A2 0 0
5 A2 3 3
6 A2 6 6
I would like to write a script that divides the change in "B" by the change in "A" relative to the previous row, but only if the "ID" matches the previous row. For example, for column "C" in row 3, the ID in row 2 is also A1, so C = (8-4)/(4-2) = 2. If the ID differs from the previous row, the output is 0.
I would like the output to be like this:
R ID A B C
1 A1 0 0 0
2 A1 2 4 2
3 A1 4 8 2
4 A2 0 0 0
5 A2 3 3 1
6 A2 6 6 1
Hopefully I explained this correctly in a non-confusing manner.
We could group_by ID, use diff to calculate difference between rows and divide.
library(dplyr)
df %>% group_by(ID) %>% mutate(C = c(0, diff(B)/diff(A)))
# R ID A B C
# <int> <fct> <int> <int> <dbl>
#1 1 A1 0 0 0
#2 2 A1 2 4 2
#3 3 A1 4 8 2
#4 4 A2 0 0 0
#5 5 A2 3 3 1
#6 6 A2 6 6 1
and similarly using data.table
library(data.table)
setDT(df)[, C := c(0, diff(B)/diff(A)), ID]
data
df <- structure(list(R = 1:6, ID = structure(c(1L, 1L, 1L, 2L, 2L,
2L), .Label = c("A1", "A2"), class = "factor"), A = c(0L, 2L,
4L, 0L, 3L, 6L), B = c(0L, 4L, 8L, 0L, 3L, 6L)), class = "data.frame",
row.names = c(NA, -6L))
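The same grouped calculation can be sketched in base R with ave(). This is only an illustration under the assumption that the data look like the sample above; the helper slope_from_previous is a hypothetical name. It receives the row indices of one ID group, takes consecutive differences of B and A within that group, and prepends 0 for the group's first row.

```r
df <- data.frame(R = 1:6,
                 ID = c("A1", "A1", "A1", "A2", "A2", "A2"),
                 A = c(0L, 2L, 4L, 0L, 3L, 6L),
                 B = c(0L, 4L, 8L, 0L, 3L, 6L))

# i holds the row indices of a single ID group
slope_from_previous <- function(i) c(0, diff(df$B[i]) / diff(df$A[i]))

# ave() applies the helper per group and puts results back in row order
df$C <- ave(seq_len(nrow(df)), df$ID, FUN = slope_from_previous)
df
```

Because ave() writes each group's result back to that group's original positions, the rows do not need to be sorted by ID beforehand.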
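We can also use lag. Using the group's first value as the default would give 0/0 = NaN in the first row of each group, so lag is left at its default NA and the missing result is replaced with 0.
library(dplyr)
df %>%
group_by(ID) %>%
# first row of each group has no previous row, so its ratio is NA and becomes 0
mutate(C = coalesce((B - lag(B)) / (A - lag(A)), 0))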

How to subtract one record from another data frame in R

I have two data frames. One has only 1 row and 3 columns; the other has 6 rows and 3 columns.
Now I want to subtract the values of data frame 1 from every row of data frame 2.
Sample data:
df1 = structure(list(col1 = 2L, col2 = 3L, col3 = 4L), .Names = c("col1",
"col2", "col3"), class = "data.frame", row.names = c(NA, -1L))
df2 = structure(list(col1 = c(1L, 2L, 4L, 5L, 6L, 3L), col2 = c(1L,
2L, 4L, 3L, 5L, 7L), col3 = c(6L, 4L, 3L, 6L, 4L, 6L)), .Names = c("col1", "col2", "col3"), class = "data.frame", row.names = c(NA, -6L))
The final output should look like this:
output = structure(list(col1 = c(-1L, 0L, 2L, 3L, 4L, 1L), col2 = c(-2L,
-1L, 1L, 0L, 2L, 4L), col3 = c(2L, 0L, -1L, 2L, 0L, 2L)), .Names = c("col1","col2", "col3"), class = "data.frame", row.names = c(NA, -6L))
Try this..
# Creating Datasets
df1 = structure(list(col1 = 2L, col2 = 3L, col3 = 4L), .Names = c("col1", "col2", "col3"), class = "data.frame", row.names = c(NA, -1L))
df2 = structure(list(col1 = c(1L, 2L, 4L, 5L, 6L, 3L), col2 = c(1L,2L, 4L, 3L, 5L, 7L), col3 = c(6L, 4L, 3L, 6L, 4L, 6L)), .Names = c("col1", "col2", "col3"), class = "data.frame", row.names = c(NA, -6L))
# Output
data.frame(sapply(names(df1), function(i){df2[[i]] - df1[[i]]}))
# col1 col2 col3
# 1 -1 -2 2
# 2 0 -1 0
# 3 2 1 -1
# 4 3 0 2
# 5 4 2 0
# 6 1 4 2
If you do df2 - df1 directly you get
df2 - df1
Error in Ops.data.frame(df2, df1) :
‘-’ only defined for equally-sized data frames
So let us make df1 the same size as df2 by repeating rows and then subtract
df2 - df1[rep(seq_len(nrow(df1)), nrow(df2)), ]
# col1 col2 col3
#1 -1 -2 2
#2 0 -1 0
#3 2 1 -1
#4 3 0 2
#5 4 2 0
#6 1 4 2
Or another option is using mapply without replicating rows
mapply("-", df2, df1)
This would return a matrix, if you want a dataframe back
data.frame(mapply("-", df2, df1))
# col1 col2 col3
#1 -1 -2 2
#2 0 -1 0
#3 2 1 -1
#4 3 0 2
#5 4 2 0
#6 1 4 2
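As a quick check of the column-wise idea, here is a small sketch using Map(), which pairs each column of df2 with the matching single value in df1. It rebuilds the sample data with data.frame(); the names res and expected are only illustrative.

```r
df1 <- data.frame(col1 = 2L, col2 = 3L, col3 = 4L)
df2 <- data.frame(col1 = c(1L, 2L, 4L, 5L, 6L, 3L),
                  col2 = c(1L, 2L, 4L, 3L, 5L, 7L),
                  col3 = c(6L, 4L, 3L, 6L, 4L, 6L))

# Each column of df2 minus the single value in the same column of df1
res <- as.data.frame(Map(`-`, df2, df1))

# Expected output from the question, for comparison
expected <- data.frame(col1 = c(-1L, 0L, 2L, 3L, 4L, 1L),
                       col2 = c(-2L, -1L, 1L, 0L, 2L, 4L),
                       col3 = c(2L, 0L, -1L, 2L, 0L, 2L))

all(res == expected)
```

Map() matches the columns by position, so this assumes both data frames have their columns in the same order.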
We can use sweep:
x <- sweep(df2, 2, unlist(df1), "-")
#test if same as output
identical(output, x)
# [1] TRUE
Note that sweep is about twice as slow as mapply in this benchmark:
df2big <- data.frame(col1 = runif(100000),
col2 = runif(100000),
col3 = runif(100000))
microbenchmark::microbenchmark(
mapply = data.frame(mapply("-", df2big, df1)),
sapply = data.frame(sapply(names(df1), function(i){df2big[[i]] - df1[[i]]})),
sweep = sweep(df2big, 2, unlist(df1), "-"))
# Unit: milliseconds
# expr min lq mean median uq max neval
# mapply 5.239638 7.645213 11.49182 8.514876 9.345765 60.60949 100
# sapply 5.250756 5.518455 10.94827 8.706027 10.091841 59.09909 100
# sweep 10.572785 13.912167 21.18537 14.985525 16.737820 64.90064 100

Subsetting a data frame according to recursive rows and creating a column for ordering

Consider the sample data
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L),
A = c(20L, 12L, 13L, 8L, 11L, 21L, 17L),
B = c(1L, 1L, 0L, 0L, 1L, 0L, 0L)
),
.Names = c("id", "A", "B"),
class = "data.frame",
row.names = c(NA,-7L)
)
Each id (stored in column 1) has a varying number of entries for columns A and B. In the example data, there are four observations with id = 1. I am looking for a way to subset this data in R so that there are at most 3 entries for each id, and then create another column (labelled C) that holds the order of each entry within its id. The expected output would look like:
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L),
A = c(20L, 12L, 13L, 11L, 21L, 17L),
B = c(1L, 1L, 0L, 1L, 0L, 0L),
C = c(1L, 2L, 3L, 1L, 2L, 1L)
),
.Names = c("id", "A", "B","C"),
class = "data.frame",
row.names = c(NA,-6L)
)
Your help is much appreciated.
Like this?
library(data.table)
dt <- as.data.table(df)
dt[, C := seq(.N), by = id]
dt <- dt[C <= 3,]
dt
# id A B C
# 1: 1 20 1 1
# 2: 1 12 1 2
# 3: 1 13 0 3
# 4: 2 11 1 1
# 5: 2 21 0 2
# 6: 3 17 0 1
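The same two steps (number the rows within each id, then keep the first three) can be sketched in base R. This assumes the sample data from the question; res is only an illustrative name.

```r
df <- data.frame(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L),
                 A = c(20L, 12L, 13L, 8L, 11L, 21L, 17L),
                 B = c(1L, 1L, 0L, 0L, 1L, 0L, 0L))

# Running position of each row within its id group
df$C <- ave(df$id, df$id, FUN = seq_along)

# Keep at most the first three rows per id and reset the row names
res <- df[df$C <= 3, ]
rownames(res) <- NULL
res
```

This keeps the first three rows in their original order, which matches the expected output in the question.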
Here is one option with dplyr that keeps the top 3 values based on A (based on the comments of @Ronak Shah).
library(dplyr)
df %>%
group_by(id) %>%
top_n(n = 3, wt = A) %>% # top 3 values based on A
mutate(C = rank(id, ties.method = "first")) # C consists of the order of each id
# # A tibble: 6 x 4
# # Groups: id [3]
# id A B C
# <int> <int> <int> <int>
# 1 1 20 1 1
# 2 1 12 1 2
# 3 1 13 0 3
# 4 2 11 1 1
# 5 2 21 0 2
# 6 3 17 0 1

Subsetting a dataframe based on summation of rows of a given column

I am dealing with data with three variables (i.e. id, time, gender). It looks like
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
time = c(21L, 3L, 4L, 9L, 5L, 9L, 10L, 6L, 27L, 3L, 4L, 10L),
gender = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)
),
.Names = c("id", "time", "gender"),
class = "data.frame",
row.names = c(NA,-12L)
)
That is, each id has four observations for time and gender. I want to subset this data in R so that, for each id, rows are kept up to and including the first row at which the cumulative sum of time reaches 25 or more. Notice that for id 2 all observations are included, and for id 3 only the first observation is kept. The expected result would look like:
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L ),
time = c(21L, 3L, 4L, 5L, 9L, 10L, 6L, 27L ),
gender = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L)
),
.Names = c("id", "time", "gender"),
class = "data.frame",
row.names = c(NA,-8L)
)
Any help on this is highly appreciated.
One option is using lag of cumsum as:
library(dplyr)
df %>% group_by(id,gender) %>%
filter(lag(cumsum(time), default = 0) < 25 )
# # A tibble: 8 x 3
# # Groups: id, gender [3]
# id time gender
# <int> <int> <int>
# 1 1 21 1
# 2 1 3 1
# 3 1 4 1
# 4 2 5 0
# 5 2 9 0
# 6 2 10 0
# 7 2 6 0
# 8 3 27 1
Using data.table (updated based on feedback from @Renu):
library(data.table)
setDT(df)
df[,.SD[shift(cumsum(time), fill = 0) < 25], by=.(id,gender)]
Another option would be to create a logical vector for each 'id', cumsum(time) >= 25, that is TRUE once the cumulative sum of 'time' is equal to or greater than 25.
Then you can filter for rows where the cumsum of this vector is less than or equal to 1, i.e. keep entries up to and including the first TRUE for each 'id'.
df %>%
group_by(id) %>%
filter(cumsum( cumsum(time) >= 25 ) <= 1)
# A tibble: 8 x 3
# Groups: id [3]
# id time gender
# <int> <int> <int>
# 1 1 21 1
# 2 1 3 1
# 3 1 4 1
# 4 2 5 0
# 5 2 9 0
# 6 2 10 0
# 7 2 6 0
# 8 3 27 1
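To see why the nested cumsum works, here is a small base R sketch that traces the intermediate vectors for id 1 and then applies the same rule per id with ave(). The names time_id1, running, reached, keep and res are only illustrative, and the sample data is rebuilt with data.frame().

```r
# Step by step for id 1, whose times are 21, 3, 4, 9
time_id1 <- c(21, 3, 4, 9)
running  <- cumsum(time_id1)    # 21 24 28 37
reached  <- running >= 25       # FALSE FALSE TRUE TRUE
cumsum(reached) <= 1            # TRUE TRUE TRUE FALSE

# The same rule applied within each id of the full sample data
df <- data.frame(id = rep(1:3, each = 4),
                 time = c(21L, 3L, 4L, 9L, 5L, 9L, 10L, 6L, 27L, 3L, 4L, 10L),
                 gender = rep(c(1L, 0L, 1L), each = 4))

# ave() returns 0/1 here because it fills an integer vector, hence the == 1
keep <- ave(df$time, df$id,
            FUN = function(t) cumsum(cumsum(t) >= 25) <= 1) == 1
res <- df[keep, ]
res
```

The second cumsum counts how many rows have already reached the threshold, so the first row that reaches it (count 1) is kept and every later row (count 2 or more) is dropped.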
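You can also try this dplyr construction. The filter looks at the running total before the current row, so the row that first reaches 25 is kept:
library(dplyr)
dt <- group_by(df, id) %>%
# running total of time within each id
mutate(sum_time = cumsum(time)) %>%
# keep rows while the running total of the preceding rows is still below 25
filter(lag(sum_time, default = 0) < 25) %>%
# exclude the helper column sum_time from the result
select(-sum_time)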
