Combining columns with a grouping in R - r

I need to categorize information from column names and restructure a dataset. Here is what my sample dataset looks like:
df <- data.frame(id = c(111,112,113),
Demo_1_Color_Naming = c("Text1","Text1","Text1"),
Demo_1.Errors =c(0,1,2),
Item_1_Color_Naming = c("Text1","Text1","Text1"),
Item_1.Time_in_Seconds =c(10,NA, 44),
Item_1.Errors = c(0,1,NA),
Demo_2_Shape_Naming = c("Text1","Text1","Text1"),
Demo_2.Errors =c(4,1,5),
Item_2_Shape_Naming = c("Text1","Text1","Text1"),
Item_2.Time_in_Seconds =c(55,35, 22),
Item_2.Errors = c(5,2,NA))
> df
id Demo_1_Color_Naming Demo_1.Errors Item_1_Color_Naming Item_1.Time_in_Seconds Item_1.Errors Demo_2_Shape_Naming Demo_2.Errors
1 111 Text1 0 Text1 10 0 Text1 4
2 112 Text1 1 Text1 NA 1 Text1 1
3 113 Text1 2 Text1 44 NA Text1 5
Item_2_Shape_Naming Item_2.Time_in_Seconds Item_2.Errors
1 Text1 55 5
2 Text1 35 2
3 Text1 22 NA
The columns are grouped by the numbers 1, 2, 3, ... Each number represents a grouping name. For example, number 1 in this dataset represents the Color grouping, whereas number 2 represents the Shape grouping. I would like to keep the Time_in_Seconds and Errors information, and then sum both time and errors.
Additionally, this sample dataset is limited to two groupings. The bigger dataset has more than two, so I need a solution that handles multiple groups/columns.
How can I achieve this below:
> df1
id ColorTime ShapeTime ColorError ShapeError TotalTime TotalError
1 111 10 55 0 5 65 5
2 112 NA 35 1 2 35 3
3 113 44 22 NA NA 66 NA

We may do
cbind(df['id'], do.call(cbind, lapply(setNames(c("Time_in_Seconds",
"Item.*Errors"), c("Time_in_Seconds", "Errors")), \(x) {
tmp <- df[grep(x, names(df), value = TRUE)]
out <- setNames(as.data.frame(sapply(split.default(tmp,
gsub("\\D+", "", names(tmp))), rowSums, na.rm = TRUE)), c("Color", "Shape"))
transform(out, Total = rowSums(out))
})))
-output
id Time_in_Seconds.Color Time_in_Seconds.Shape Time_in_Seconds.Total Errors.Color Errors.Shape Errors.Total
1 111 10 55 65 0 5 5
2 112 0 35 35 1 2 3
3 113 44 22 66 0 0 0
If we need the NAs
cbind(df['id'], do.call(cbind, lapply(setNames(c("Time_in_Seconds",
"Item.*Errors"), c("Time_in_Seconds", "Errors")), \(x) {
tmp <- df[grep(x, names(df), value = TRUE)]
out <- setNames(as.data.frame(sapply(split.default(tmp,
gsub("\\D+", "", names(tmp))), \(u) Reduce(`+`, u))), c("Color", "Shape"))
transform(out, Total = rowSums(out, na.rm = TRUE))
})))
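Both snippets above hardcode the labels c("Color", "Shape"), while the question asks for something that scales to more groups. A minimal sketch of one way to derive the labels from the column names instead, assuming every group has a column named Item_<n>_<Label>_Naming and exactly one Time_in_Seconds and one Errors column per group. The helper sum_group and the rule "a total is NA only when every group is NA" are assumptions read off the expected output, not part of the original answer:

```r
# Sample data from the question
df <- data.frame(id = c(111, 112, 113),
                 Demo_1_Color_Naming = c("Text1", "Text1", "Text1"),
                 Demo_1.Errors = c(0, 1, 2),
                 Item_1_Color_Naming = c("Text1", "Text1", "Text1"),
                 Item_1.Time_in_Seconds = c(10, NA, 44),
                 Item_1.Errors = c(0, 1, NA),
                 Demo_2_Shape_Naming = c("Text1", "Text1", "Text1"),
                 Demo_2.Errors = c(4, 1, 5),
                 Item_2_Shape_Naming = c("Text1", "Text1", "Text1"),
                 Item_2.Time_in_Seconds = c(55, 35, 22),
                 Item_2.Errors = c(5, 2, NA))

# Map each group number to its label, e.g. "1" -> "Color", "2" -> "Shape"
naming <- grep("^Item_\\d+_.*_Naming$", names(df), value = TRUE)
labels <- setNames(sub("^Item_\\d+_(.*)_Naming$", "\\1", naming),
                   sub("^Item_(\\d+)_.*$", "\\1", naming))

# Hypothetical helper: pick the Item columns matching `pattern`, rename them
# by group label, and append a row total
sum_group <- function(data, labels, pattern, suffix) {
  cols <- grep(pattern, names(data), value = TRUE)
  num  <- sub("^Item_(\\d+)\\..*$", "\\1", cols)
  out  <- data[cols]
  names(out) <- paste0(labels[num], suffix)
  # Ignore NAs when summing, but keep NA when every group is NA in a row
  tot <- unname(rowSums(out, na.rm = TRUE))
  tot[rowSums(!is.na(out)) == 0] <- NA
  out[[paste0("Total", suffix)]] <- tot
  out
}

df1 <- cbind(df["id"],
             sum_group(df, labels, "^Item_\\d+\\.Time_in_Seconds$", "Time"),
             sum_group(df, labels, "^Item_\\d+\\.Errors$", "Error"))
df1
```

The columns come out grouped by measure (all Time columns, then all Error columns), so they may need reordering to match the layout in the question exactly.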

Related

Run table for all columns in sequence

If this is my dataset
Id Col_A_1 Col_A_2 Col_A_3 ..... Col_A_100
1 87 88 82 88
2 88 82 82 87
3 82 87 NA 82
4 88 87 82 88
5 87 87 87 88
What is an efficient way to run the table function on the columns from Col_A_1 to Col_A_100? I am trying to avoid running table(df$Col_A_1, useNA = "ifany"), table(df$Col_A_2, useNA = "ifany"), ..., table(df$Col_A_100, useNA = "ifany") 100 times.
Also, if possible, I would like the output saved in a data frame.
Expected output
Column 82 85 87 88 Missing
Col_A_1 1 0 2 2 0
Col_A_2 1 0 3 1 0
Col_A_3 3 0 1 0 1
.
.
.
Col_A_100 1 0 1 3 0
Thanks in advance.
# example data
d <- read.table(text = "
Id Col_A_1 Col_A_2 Col_A_3 Col_A_100
1 87 88 82 88
2 88 82 82 87
3 82 87 NA 82
4 88 87 82 88
5 87 87 87 88", header = TRUE)
Excluding Id column reshape from wide-to-long using stack, then table to get counts including NAs, transpose to have column names as rows, then convert table object to dataframe:
data.frame(rbind(t(table(stack(d[, -1]), useNA = "always"))))
# X82 X87 X88 NA.
# Col_A_1 1 2 2 0
# Col_A_2 1 3 1 0
# Col_A_3 3 1 0 1
# Col_A_100 1 1 3 0
# NA. 0 0 0 0
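If the extra all-zero NA row and the X-prefixed names are unwanted, a variant is to tabulate each column against a shared set of factor levels. This is only a sketch under the assumption that all value columns hold the same kind of data. The names lev, counts and res are illustrative:

```r
# Example data from above
d <- read.table(text = "
Id Col_A_1 Col_A_2 Col_A_3 Col_A_100
1 87 88 82 88
2 88 82 82 87
3 82 87 NA 82
4 88 87 82 88
5 87 87 87 88", header = TRUE)

vals <- d[, -1]                       # drop the Id column
lev  <- sort(unique(unlist(vals)))    # shared levels; sort() drops NA

# One table per column, all with the same levels, so the results line up
counts <- t(sapply(vals, function(col)
  table(factor(col, levels = lev), useNA = "always")))
colnames(counts)[ncol(counts)] <- "Missing"   # the last column counts NAs

res <- data.frame(Column = rownames(counts), counts, check.names = FALSE)
rownames(res) <- NULL
res
```

Setting lev by hand (for example c(82, 85, 87, 88)) would also produce zero-count columns for values that never occur, such as the 85 column in the expected output.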
I just created a small tibble to work with and to illustrate it with.
Tibbles can essentially be considered lists, so lapply works just fine. Since the result can be cumbersome to work with, I put it in a tibble as a list entry:
library(dplyr)
x = tibble(col1 = sample(100,replace = T),
col2 = sample(100,replace = T),
col3 = sample(100,replace = T),
col4 = sample(100,replace = T))
res = tibble(cols = colnames(x),
tables = lapply(x, function(col) table(col, useNA = "ifany")))
# A tibble: 4 x 2
# cols tables
# <chr> <named list>
# col1 <table [61]>
# col2 <table [69]>
# col3 <table [60]>
# col4 <table [62]>
EDIT: I did not notice the output format requirement. It can be done (perhaps a bit inelegantly) like this:
#I assume it is all numeric values
unique_names = sapply(res$tables, names) %>% purrr::reduce(union) #get all names present
unique_names_sorted = c(sort(as.numeric(unique_names)), if(any(is.na(unique_names))) "NA") # sort them by value and add in NA, if present
#create dummy matrix
mat = matrix(0, nrow = nrow(res), ncol = length(unique_names_sorted))
#assign corresponding names
colnames(mat) = unique_names_sorted
#populate dummy matrix
for (i in 1:nrow(mat)) {
tmp = res$tables[[i]]
if(any(is.na(names(tmp)))) names(tmp)[is.na(names(tmp))] = "NA"
mat[i, names(tmp)] = tmp # fill only row i; columns are matched by name
}

R: How to fill out values in a DF which are dependent on previous rows

I have a dataframe, and I want to do some calculations that depend on the previous rows (like dragging formulas down in Excel). My DF looks like this:
set.seed(1234)
df <- data.frame(DA = sample(1:3, 6, rep = TRUE) ,HB = sample(0:600, 6, rep = TRUE), D = sample(1:5, 6, rep = TRUE), AD = sample(1:14, 6, rep = TRUE), GM = sample(30:31, 6, rep = TRUE), GL = NA, R =NA, RM =0 )
df$GL[1] = 646
df$R[1] = 60
df$DA[5] = 2
df
# DA HB D AD GM GL R RM
# 1 2 399 4 13 30 646 60 0
# 2 2 97 4 10 31 NA NA 0
# 3 1 102 5 5 31 NA NA 0
# 4 3 325 4 2 31 NA NA 0
# 5 2 78 3 14 30 NA NA 0
# 6 1 269 4 8 30 NA NA 0
I want to fill out the missing values in my GL, R and RM columns, and the values are dependent on each other. So eg.
attach(df)
#calc GL and R for the 2nd row
df$GL[2] <- GL[1]+HB[2]+RM[1]
df$R[2] <- df$GL[2]*D[2]/GM[2]*AD[2]
#calc GL and R for the 3rd row
df$GL[3] <- df$GL[2]+HB[3]+df$RM[2]
df$R[3] <-df$GL[3]*D[3]/GM[3]*AD[3]
#and so on..
Is there a way to do all the calculations at once, instead of row by row?
In addition, each time the column 'DA' = 1, the previous values of 'R' should be summed up into 'RM' for the same row, but only since the last occurrence. So that
attach(df)
df$RM[3] <-R[1]+R[2]+R[3]
#and RM for the 6th row is calculated by
#df$RM[6] <-R[4]+R[5]+R[6]
Thanks a lot in advance!
You can use a for loop to calculate GL values and once you have them you can do the calculation for R columns directly.
for(i in 2:nrow(df)) {
df$GL[i] <- with(df, GL[i-1]+HB[i]+RM[i-1])
}
df$R <- with(df, (GL* D)/(GM *AD))
You can use indexing to solve the first two problems:
> # Original code from question~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> set.seed(1234)
> df <- data.frame(DA = sample(1:3, 6, rep = TRUE), HB = sample(0:600, 6, rep = TRUE),
+ D = sample(1:5, 6, rep = TRUE), AD = sample(1:14, 6, rep = TRUE),
+ GM = sample(30:31, 6, rep = TRUE), GL = NA, R =NA, RM =0 )
> df$GL[1] = 646
> df$R[1] = 60
> df$DA[5] = 2
> #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> # View df
> df
DA HB D AD GM GL R RM
1 2 399 4 13 30 646 60 0
2 2 97 4 10 31 NA NA 0
3 1 102 5 5 31 NA NA 0
4 3 325 4 2 31 NA NA 0
5 2 78 3 14 30 NA NA 0
6 1 269 4 8 30 NA NA 0
> # Solution below, based on indexing
> # 1. GL column
> df$GL <- cumsum(c(df$GL[1], df$HB[-1] + df$RM[-nrow(df)]))
> # 2. R column
> df$R[-1] <- (df$GL * df$D / df$GM * df$AD)[-1]
> # May be more clear like this (same result)
> df$R[-1] <- df$GL[-1] * df$D[-1] / df$GM[-1] * df$AD[-1]
> # Or did you mean this for last *?
> df$R[-1] <- (df$GL * df$D / (df$GM * df$AD))[-1]
The third problem can be solved with a loop.
> df$RM[1] <- df$R[1]
> for (i in 2:nrow(df)) {
+ df$RM[i] <- df$R[i] + df$RM[i-1] * (df$DA[i] != 2)
+ }
> df
DA HB D AD GM GL R RM
1 2 399 4 13 30 646 60.000000 60.000000
2 2 97 4 10 31 743 9.587097 9.587097
3 1 102 5 5 31 845 27.258065 36.845161
4 3 325 4 2 31 1170 75.483871 112.329032
5 2 78 3 14 30 1248 8.914286 8.914286
6 1 269 4 8 30 1517 25.283333 34.197619
Do these results look correct?
Update: Assuming RM should = R unless DA = 1, and in that case RM = sum of current row and previous R up to (not including) the above row with DA = 1, try the following loop.
df$RM[1] <- cs <- df$R[1]
for (i in 2:nrow(df)) {
df$RM[i] <- df$R[i] + cs * (df$DA[i] == 1)
# running sum of R; reset to 0 after a DA == 1 row so that row is not counted again
cs <- (cs + df$R[i]) * (df$DA[i] != 1)
}
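Because GL depends on the previous RM, and RM in turn depends on R, the three columns can also be filled in a single pass. A minimal sketch, assuming the literal formula from the question (GL * D / GM * AD, which R evaluates left to right), that RM stays 0 on rows where DA is not 1, and that the first row does not have DA = 1. The data frame is typed in from the printed table rather than generated with sample(), and run_sum is an illustrative name:

```r
# Data as printed in the question
df <- data.frame(DA = c(2, 2, 1, 3, 2, 1),
                 HB = c(399, 97, 102, 325, 78, 269),
                 D  = c(4, 4, 5, 4, 3, 4),
                 AD = c(13, 10, 5, 2, 14, 8),
                 GM = c(30, 31, 31, 31, 30, 30),
                 GL = NA_real_, R = NA_real_, RM = 0)
df$GL[1] <- 646
df$R[1]  <- 60

run_sum <- df$R[1]   # sum of R since the last row with DA == 1
for (i in 2:nrow(df)) {
  # GL uses the previous row's GL and RM
  df$GL[i] <- df$GL[i - 1] + df$HB[i] + df$RM[i - 1]
  # R uses the current row's GL (formula copied from the question)
  df$R[i]  <- df$GL[i] * df$D[i] / df$GM[i] * df$AD[i]
  run_sum  <- run_sum + df$R[i]
  if (df$DA[i] == 1) {
    df$RM[i] <- run_sum   # R summed since the previous DA == 1 row
    run_sum  <- 0         # start a new block
  }
}
df
```

If the intended formula is GL * D / (GM * AD) instead, only the line computing R needs to change.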

Get variable combination matrix

Data
We have numerous text strings that look like this (way longer in our real dataset):
df <- data.frame(
id = c('text1','text2','text3'),text = c('ABA','ABA','AAA')
)
>df
id text
1 text1 ABA
2 text2 ABA
3 text3 AAA
We want to create a matrix that tells how often a letter at position x is found together with the other letters at other positions, so in this case:
3A 3 1 2 3
2B 2 0 2 2
2A 1 1 0 1
1A 3 1 2 3
1A 2A 2B 3A
What I tried
I previously converted the matrix to a binary matrix, looking like this:
structure(list(pos1_A = c(1, 1, 1), pos2_A = c(0, 0, 1), pos2_B = c(1,
1, 0), pos3_A = c(1, 1, 1)), class = "data.frame", row.names = c("text1",
"text2", "text3"))
pos1_A pos2_A pos2_B pos3_A
text1 1 0 1 1
text2 1 0 1 1
text3 1 1 0 1
Then I can run commands like cor to get correlations, however, instead of correlations I want the frequencies.
Note this is different from questions about co-occurrences wherein the variable name itself (here position) is neglected, for example like "How to use R to create a word co-occurrence matrix"
Huge credit to @Ronak Shah for the answer here
It's much simpler if we convert the categorical data to a numerical (binary) matrix, for example using this hacky but easy way with the homals package, and then apply the method by @Ronak Shah linked above:
# The dataset
df <- data.frame(
id = c('text1','text2','text3'),text = c('ABA','ABA','AAA')
)
# Split the strings into characters and add column names
# (%>% comes from magrittr, column_to_rownames from tibble)
library(magrittr)
df2 <- df %>% splitstackshape::cSplit('text', sep = '', stripWhite = FALSE, type.convert = FALSE, direction = 'wide') %>%
tibble::column_to_rownames('id')
colnames(df2) <- paste0('pos', 1:ncol(df2))
# Convert to binary matrix (hacky way)
bin.mat <- homals:::expandFrame(df2, clean = F)
# Method by @Ronak Shah to get the frequency matrix
fun <- function(x, y) sum(bin.mat[, x] & bin.mat[, y])
n <- seq_along(bin.mat)
mat <- outer(n, n, Vectorize(fun))
dimnames(mat) <- list(names(bin.mat)[n], names(bin.mat[n]))
This produces the matrix:
>mat
pos1_A pos2_A pos2_B pos3_A
pos1_A 3 1 2 3
pos2_A 1 1 0 1
pos2_B 2 0 2 2
pos3_A 3 1 2 3
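Since the binary matrix only contains 0s and 1s, the same counts can also be obtained with a matrix cross-product: entry (i, j) of t(X) %*% X is the number of rows in which both column i and column j are 1. A short sketch using the binary data frame from the question (the names bin and freq are illustrative):

```r
# Binary indicator data from the question
bin <- data.frame(pos1_A = c(1, 1, 1),
                  pos2_A = c(0, 0, 1),
                  pos2_B = c(1, 1, 0),
                  pos3_A = c(1, 1, 1),
                  row.names = c("text1", "text2", "text3"))

# crossprod(X) computes t(X) %*% X, i.e. pairwise co-occurrence counts
freq <- crossprod(as.matrix(bin))
freq
```

This avoids the outer/Vectorize call and keeps the column names as dimnames on both sides.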
Here's an alternative approach that produces a matrix as originally requested:
# Make all strings the same length:
df$text <- stringr::str_pad(df$text, side = "right", max(nchar(df$text)))
# Create a matrix with all letters labelled by their position:
all_vals <- apply(do.call(rbind, strsplit(df$text, "")), 1,
function(x) paste0(seq_along(x), x))
# Create a vector of all possible letter / position combos
all_labs <- do.call(paste0, expand.grid(seq(max(nchar(df$text))),
unique(unlist(strsplit(df$text, "")))))
# Create a function that will count all co-occurences per data frame row
f <- function(y, x) as.vector(outer(x, x, function(a, b) 1 * (a %in% y & b %in% y)))
# Create the results matrix and label it
m <- matrix(rowSums(apply(as.data.frame(all_vals), 2, f, all_labs)), nrow = length(all_labs))
rownames(m) <- all_labs
colnames(m) <- all_labs
m
#> 1A 2A 3A 1B 2B 3B
#> 1A 3 1 3 0 2 0
#> 2A 1 1 1 0 0 0
#> 3A 3 1 3 0 2 0
#> 1B 0 0 0 0 0 0
#> 2B 2 0 2 0 2 0
#> 3B 0 0 0 0 0 0
Created on 2020-05-24 by the reprex package (v0.3.0)

Identify sequences between identical values

I have a large matrix:
id v1 v2 v3 v4 v5 v6 v7 v8
1001 37 15 30 37 4 11 35 37
2111 44 31 44 30 24 39 44 18
3121 43 49 39 34 44 43 26 24
4532 45 31 26 33 12 47 37 15
5234 23 27 34 23 30 34 23 4
6345 9 46 39 34 8 43 26 24
For each row (id), I would like to identify intervals of numbers in column v1 to v8. An interval is here defined as a sequence of numbers which starts and ends with the same number.
For example, in the first row, there are two sequences which both start and end with 37: from column 1 to 4 (37, 15, 30, 37) and from column 4 to column 8 (37, 4, 11, 35, 37).
The focal value should only occur in start and end positions. For example, in the first row, the sequence from 37 at V1, to 37 at V8 is not included, because 37 also occurs in V4.
For each interval, I want the index of the start and end columns, the focal start and end value, and the sequence of numbers in between.
Desired output:
1001 [v1] to [v4] 37 to 37: 15,30
1001 [v4] to [v8] 37 to 37: 4, 11, 35
2111 [v1] to [v3] 44 to 44: 31
2111 [v3] to [v7] 44 to 44: 30, 24, 39
Any suggestions? Algorithm?
I managed to code for the indices for a vector not a matrix,
a <- which(x == 37)
from <- a[!(a-1) %in% a]
to <- a[!(a+1) %in% a]
rbind(from, to)
Very brute-force method. Get unique elements for the given row, check if they are present more than once but not side-by-side, then lapply through each, getting the elements of the row x between them.
apply(m, 1, function(x) {
u <- unique(x)
u <- u[sapply(u, function(u) any(diff(which(x == u)) > 1))]
lapply(setNames(u, u), function(u){
ind <- which(x == u)
lapply(seq(length(ind) - 1),
function(i) x[seq(ind[i] + 1, ind[i + 1] - 1)])
})
})
Output:
# [[1]]
# [[1]]$`37`
# [[1]]$`37`[[1]]
# [1] 15 30
#
# [[1]]$`37`[[2]]
# [1] 4 11 35
#
#
#
# [[2]]
# [[2]]$`44`
# [[2]]$`44`[[1]]
# [1] 31
#
# [[2]]$`44`[[2]]
# [1] 30 24 39
#
#
#
# [[3]]
# [[3]]$`43`
# [[3]]$`43`[[1]]
# [1] 49 39 34 44
#
#
#
# [[4]]
# named list()
#
# [[5]]
# [[5]]$`23`
# [[5]]$`23`[[1]]
# [1] 27 34
#
# [[5]]$`23`[[2]]
# [1] 30 34
#
#
# [[5]]$`34`
# [[5]]$`34`[[1]]
# [1] 23 30
#
#
#
# [[6]]
# named list()
Edit: Henrik's answer inspired me to do a join-based version
library(data.table)
library(magrittr)
d <- melt(as.data.table(m), "id", variable.name = 'ci')[, ci := rowid(id)]
setorder(d, id)
options(datatable.nomatch = 0)
d[d, on = .(id, value, ci > ci)
, .(id, value, i.ci, x.ci)
, mult = 'first'] %>%
.[d, on = .(id, i.ci < ci, x.ci > ci)
, .(id, value, from_ci = x.i.ci, to_ci = x.x.ci, i.value)] %>%
.[, .(val = .(i.value))
, by = setdiff(names(.), 'i.value')]
# id value from_ci to_ci val
# 1: 1001 37 1 4 15,30
# 2: 1001 37 4 8 4,11,35
# 3: 2111 44 1 3 31
# 4: 2111 44 3 7 30,24,39
# 5: 3121 43 1 6 49,39,34,44
# 6: 5234 23 1 4 27,34
# 7: 5234 34 3 6 23,30
# 8: 5234 23 4 7 30,34
Here's a data.table alternative.
Convert matrix to data.table and melt to long format. Create a column index 'ci' to keep track of the original columns (rowid(id)). Order by 'id'.
For each 'id' and 'value' (by = .(id, value)), check if number of rows is larger than one (if(.N > 1)), i.e. if there is at least one sequence. If so, grab the row index (.I) of the sequences and their column indexes (in the original data). For each sequence, grab the corresponding values between start and end index. Wrap in list twice (.(.())) to create a list column.
library(data.table)
d <- melt(as.data.table(m), id.vars = "id")
d[ , `:=`(
ci = rowid(id),
variable = NULL)]
setorder(d, id)
d2 <- d[ , if(.N > 1){
.(from = .I[-.N], to = .I[-1],
from_ci = ci[-.N], to_ci = ci[ -1])
}, by = .(id, value)]
d2[ , val := .(.(d$value[seq(from + 1, to - 1)])), by = 1:nrow(d2)]
d2[ , `:=`(from = NULL, to = NULL)]
# id value from_ci to_ci val
# 1: 1001 37 1 4 15,30
# 2: 1001 37 4 8 4,11,35
# 3: 2111 44 1 3 31
# 4: 2111 44 3 7 30,24,39
# 5: 3121 43 1 6 49,39,34,44
# 6: 5234 23 1 4 27,34
# 7: 5234 23 4 7 30,34
# 8: 5234 34 3 6 23,30
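For completeness, a plain base R sketch that prints the intervals in the text layout shown under "Desired output". It assumes a numeric matrix with an "id" column followed by the value columns, and it skips adjacent repeats (nothing lies between them), which is an assumption about the intended behaviour. The function name find_intervals is illustrative:

```r
# Three rows of the example matrix from the question
m <- matrix(c(1001, 37, 15, 30, 37,  4, 11, 35, 37,
              2111, 44, 31, 44, 30, 24, 39, 44, 18,
              4532, 45, 31, 26, 33, 12, 47, 37, 15),
            ncol = 9, byrow = TRUE,
            dimnames = list(NULL, c("id", paste0("v", 1:8))))

find_intervals <- function(m) {
  out   <- character(0)
  vcols <- setdiff(colnames(m), "id")
  for (r in seq_len(nrow(m))) {
    x <- m[r, vcols]
    for (u in unique(x)) {
      ind <- which(x == u)                 # positions of the focal value
      for (k in seq_len(length(ind) - 1)) {
        from <- ind[k]
        to   <- ind[k + 1]                 # next occurrence, so none in between
        if (to - from < 2) next            # adjacent repeats: nothing between
        between <- x[(from + 1):(to - 1)]
        out <- c(out, sprintf("%s [%s] to [%s] %s to %s: %s",
                              m[r, "id"], vcols[from], vcols[to], u, u,
                              paste(between, collapse = ", ")))
      }
    }
  }
  out
}

writeLines(find_intervals(m))
# 1001 [v1] to [v4] 37 to 37: 15, 30
# 1001 [v4] to [v8] 37 to 37: 4, 11, 35
# 2111 [v1] to [v3] 44 to 44: 31
# 2111 [v3] to [v7] 44 to 44: 30, 24, 39
```

Growing a character vector inside nested loops is fine for small inputs; for a large matrix the data.table versions above are likely the better choice.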

vector to dataframe in r given length of vector

I have vectors of different lengths. For instance:
df1
[1] 1 95 5 2 135 4 3 135 4 4 135 4 5 135 4 6 135 4
df2
[1] 1 70 3 2 110 4 3 112 4
I'm trying to write a script in R in order to have any vector enter the function or for loop and it returns a dataframe of three columns. So a separate dataframe for each input vector. Each vector is a multiple of three (hence, the three columns). I'm fairly new to R in terms of writing functions and can't seem to figure this out. Here was my attempt:
newdf = c()
ld <- length(df1)
ld_mult <- length(df1)/3
ld_seq <- seq(from=1,to=ld,by=3)
ld_seq2 <- ld_seq + 2
for (i in 1:ld_mult) {
newdf[i,] <- df1[ld_seq[i]:ld_seq2[i]]
}
the output I want for df1 would be:
1 95 5
2 135 4
3 135 4
4 135 4
5 135 4
6 135 4
Here's an example of how you could use matrix for that purpose:
x <- c(1, 95, 5,2, 135, 4, 3, 135, 4)
as.data.frame(matrix(x, ncol = 3, byrow = TRUE))
# V1 V2 V3
#1 1 95 5
#2 2 135 4
#3 3 135 4
And
y <- c(1, 70, 3, 2, 110, 4, 3, 112, 4)
as.data.frame(matrix(y, ncol = 3, byrow = TRUE))
# V1 V2 V3
#1 1 70 3
#2 2 110 4
#3 3 112 4
Or if you want to make it a custom function:
newdf <- function(vec) {
as.data.frame(matrix(vec, ncol = 3, byrow = TRUE))
}
newdf(y)
#V1 V2 V3
#1 1 70 3
#2 2 110 4
#3 3 112 4
You could also let the user specify the number of columns to create by adding another argument to newdf:
newdf <- function(vec, cols = 3) {
as.data.frame(matrix(vec, ncol = cols, byrow = T))
}
Now the default number of columns is 3 if the user doesn't specify one. To choose a different number, call it like this:
newdf(z, 5) # to create 5 columns
Another nice little addon for the function would be a check if the input vector length is a multiple of the number of columns specified in the function call:
newdf <- function(vec, cols = 3) {
if(length(vec) %% cols != 0) {
stop("Input vector length is not a multiple of the number of columns. Please double check.")
}
as.data.frame(matrix(vec, ncol = cols, byrow = TRUE))
}
newdf(x, 4)
#Error in newdf(x, 4) :
# Input vector length is not a multiple of the number of columns. Please double check.
If you had multiple vectors sitting in a list, here's how you could convert each of them to be a data.frame:
> l <- list(x,y)
> l
#[[1]]
#[1] 1 95 5 2 135 4 3 135 4
#
#[[2]]
#[1] 1 70 3 2 110 4 3 112 4
> lapply(l, newdf)
#[[1]]
# V1 V2 V3
#1 1 95 5
#2 2 135 4
#3 3 135 4
#
#[[2]]
# V1 V2 V3
#1 1 70 3
#2 2 110 4
#3 3 112 4
