If this is my dataset
Id Col_A_1 Col_A_2 Col_A_3 ..... Col_A_100
1 87 88 82 88
2 88 82 82 87
3 82 87 NA 82
4 88 87 82 88
5 87 87 87 88
What is the efficient way to execute table function on these columns from Col_A_1 to Col_A_100 ? I am trying to avoid running the table(df$Col_A_1 , useNA ="ifany"), table(df$Col_A_2 , useNA ="ifany"), .... table(df$Col_A_100 , useNA ="ifany")
100 times.
Also if possible, I like the output saved in a dataframe .
Expected output
Column 82 85 87 88 Missing
Col_A_1 1 0 2 2 0
Col_A_2 1 0 3 1 0
Col_A_3 3 0 1 0 1
.
.
.
Col_A_100 1 0 1 3 0
Thanks in advance.
# example data
d <- read.table(text = "
Id Col_A_1 Col_A_2 Col_A_3 Col_A_100
1 87 88 82 88
2 88 82 82 87
3 82 87 NA 82
4 88 87 82 88
5 87 87 87 88", header = TRUE)
Excluding Id column reshape from wide-to-long using stack, then table to get counts including NAs, transpose to have column names as rows, then convert table object to dataframe:
data.frame(rbind(t(table(stack(d[, -1]), useNA = "always"))))
# X82 X87 X88 NA.
# Col_A_1 1 2 2 0
# Col_A_2 1 3 1 0
# Col_A_3 3 1 0 1
# Col_A_100 1 1 3 0
# NA. 0 0 0 0
I just created a small tibble to work with and to illustrate it with.
Tibbles can essentially be considered lists, so lapply works just fine. Since the result can be cumbersome to work with, I put it in a tibble as a list entry:
library(dplyr)
x = tibble(col1 = sample(100,replace = T),
col2 = sample(100,replace = T),
col3 = sample(100,replace = T),
col4 = sample(100,replace = T))
res = tibble(cols = colnames(x),
tables = lapply(x, function(col) table(col, useNA = "ifany")))
# A tibble: 4 x 2
# cols tables
# <chr> <named list>
# col1 <table [61]>
# col2 <table [69]>
# col3 <table [60]>
# col4 <table [62]>
EDIT: I did not notice the output format requirement. It can be done (perhaps a bit inelegantly) like this:
#I assume it is all numeric values
unique_names = sapply(res$tables, names) %>% purrr::reduce(union) #get all names present
unique_names_sorted = c(sort(as.numeric(unique_names)), if(any(is.na(unique_names))) "NA") # sort them by value and add in NA, if present
#create dummy matrix
mat = matrix(0, nrow = nrow(res), ncol = length(unique_names_sorted))
#assign corresponding names
colnames(mat) = unique_names_sorted
#populate dummy matrix
for (i in 1:nrow(mat)) {
tmp = res$tables[[i]]
if(any(is.na(names(tmp)))) names(tmp)[is.na(names(tmp))] = "NA"
mat[,names(tmp)] = tmp
}
Related
I need to categorize information from column names and restructure a dataset. Here is how my sample dataset looks like:
df <- data.frame(id = c(111,112,113),
Demo_1_Color_Naming = c("Text1","Text1","Text1"),
Demo_1.Errors =c(0,1,2),
Item_1_Color_Naming = c("Text1","Text1","Text1"),
Item_1.Time_in_Seconds =c(10,NA, 44),
Item_1.Errors = c(0,1,NA),
Demo_2_Shape_Naming = c("Text1","Text1","Text1"),
Demo_2.Errors =c(4,1,5),
Item_2_Shape_Naming = c("Text1","Text1","Text1"),
Item_2.Time_in_Seconds =c(55,35, 22),
Item_2.Errors = c(5,2,NA))
> df
id Demo_1_Color_Naming Demo_1.Errors Item_1_Color_Naming Item_1.Time_in_Seconds Item_1.Errors Demo_2_Shape_Naming Demo_2.Errors
1 111 Text1 0 Text1 10 0 Text1 4
2 112 Text1 1 Text1 NA 1 Text1 1
3 113 Text1 2 Text1 44 NA Text1 5
Item_2_Shape_Naming Item_2.Time_in_Seconds Item_2.Errors
1 Text1 55 5
2 Text1 35 2
3 Text1 22 NA
The columns are grouped by the numbers 1,2,3,... Each number represesents a grouping name. For example number 1 in this dataset represents Color grouping where number 2 represents Shape grouping. I would like to keep Time_in_seconds info and Errors info. Then I need to sum both time and errors.
Additionally, this dataset is only limited to two grouping. The bigger dataset has more than 2 grouping. I need to handle this for a multi group/column.
How can I achieve this below:
> df1
id ColorTime ShapeTime ColorError ShapeError TotalTime TotalError
1 111 10 55 0 5 65 5
2 112 NA 35 1 2 35 3
3 113 44 22 NA NA 66 NA
We may do
cbind(df['id'], do.call(cbind, lapply(setNames(c("Time_in_Seconds",
"Item.*Errors"), c("Time_in_Seconds", "Errors")), \(x) {
tmp <- df[grep(x, names(df), value = TRUE)]
out <- setNames(as.data.frame(sapply(split.default(tmp,
gsub("\\D+", "", names(tmp))), rowSums, na.rm = TRUE)), c("Color", "Shape"))
transform(out, Total = rowSums(out))
})))
-output
id Time_in_Seconds.Color Time_in_Seconds.Shape Time_in_Seconds.Total Errors.Color Errors.Shape Errors.Total
1 111 10 55 65 0 5 5
2 112 0 35 35 1 2 3
3 113 44 22 66 0 0 0
If we need the NAs
cbind(df['id'], do.call(cbind, lapply(setNames(c("Time_in_Seconds",
"Item.*Errors"), c("Time_in_Seconds", "Errors")), \(x) {
tmp <- df[grep(x, names(df), value = TRUE)]
out <- setNames(as.data.frame(sapply(split.default(tmp,
gsub("\\D+", "", names(tmp))), \(u) Reduce(`+`, u))), c("Color", "Shape")); transform(out, Total = rowSums(out, na.rm = TRUE)) })))
I have a large matrix:
id v1 v2 v3 v4 v5 v6 v7 v8
1001 37 15 30 37 4 11 35 37
2111 44 31 44 30 24 39 44 18
3121 43 49 39 34 44 43 26 24
4532 45 31 26 33 12 47 37 15
5234 23 27 34 23 30 34 23 4
6345 9 46 39 34 8 43 26 24
For each row (id), I would like to identify intervals of numbers in column v1 to v8. An interval is here defined as a sequence of numbers which starts and ends with the same number.
For example, in the first row, there are two sequences which both start and ends with 37: From column 1 to 4 (37, 15, 30, 37) and from column 4 to column 8 (37, 4, 11, 35, 37).
The focal value should only occur in start and end positions. For example, in the first row, the sequence from 37 at V1, to 37 at V8 is not included, because 37 also occurs in V4.
For each interval, I want the index of the start and end columns, the focal start and end value, and the sequence of numbers in between.
Desired output:
1001 [v1] to [v4] 37 to 37: 15,30
1001 [v4] to [v8] 37 to 37: 4, 11, 35
2111 [v1] to [v3] 44 to 44: 31
2111 [v3] to [v7] 44 to 44: 30, 24, 39
Any suggestions? Algorithm?
I managed to code for the indices for a vector not a matrix,
a <- which(x == 37)
from <- a[!(a-1) %in% a]
to <- a[!(a+1) %in% a]
rbind(from, to)
Very brute-force method. Get unique elements for the given row, check if they are present more than once but not side-by-side, then lapply through each, getting the elements of the row x between them.
apply(m, 1, function(x) {
u <- unique(x)
u <- u[sapply(u, function(u) any(diff(which(x == u)) > 1))]
lapply(setNames(u, u), function(u){
ind <- which(x == u)
lapply(seq(length(ind) - 1),
function(i) x[seq(ind[i] + 1, ind[i + 1] - 1)])
})
})
Output:
# [[1]]
# [[1]]$`37`
# [[1]]$`37`[[1]]
# [1] 15 30
#
# [[1]]$`37`[[2]]
# [1] 4 11 35
#
#
#
# [[2]]
# [[2]]$`44`
# [[2]]$`44`[[1]]
# [1] 31
#
# [[2]]$`44`[[2]]
# [1] 30 24 39
#
#
#
# [[3]]
# [[3]]$`43`
# [[3]]$`43`[[1]]
# [1] 49 39 34 44
#
#
#
# [[4]]
# named list()
#
# [[5]]
# [[5]]$`23`
# [[5]]$`23`[[1]]
# [1] 27 34
#
# [[5]]$`23`[[2]]
# [1] 30 34
#
#
# [[5]]$`34`
# [[5]]$`34`[[1]]
# [1] 23 30
#
#
#
# [[6]]
# named list()
Edit: Henrik's answer inspired me to do a join-based version
library(data.table)
library(magrittr)
d <- melt(as.data.table(m), "id", variable.name = 'ci')[, ci := rowid(id)]
setorder(d, id)
options(datatable.nomatch = 0)
d[d, on = .(id, value, ci > ci)
, .(id, value, i.ci, x.ci)
, mult = 'first'] %>%
.[d, on = .(id, i.ci < ci, x.ci > ci)
, .(id, value, from_ci = x.i.ci, to_ci = x.x.ci, i.value)] %>%
.[, .(val = .(i.value))
, by = setdiff(names(.), 'i.value')]
# id value from_ci to_ci val
# 1: 1001 37 1 4 15,30
# 2: 1001 37 4 8 4,11,35
# 3: 2111 44 1 3 31
# 4: 2111 44 3 7 30,24,39
# 5: 3121 43 1 6 49,39,34,44
# 6: 5234 23 1 4 27,34
# 7: 5234 34 3 6 23,30
# 8: 5234 23 4 7 30,34
Here's a data.table alternative.
Convert matrix to data.table and melt to long format. Create a column index 'ci' to keep track of the original columns (rowid(id)). Order by 'id'.
For each 'id' and 'value' (by = .(id, value)), check if number of rows is larger than one (if(.N > 1)), i.e. if there is at least one sequence. If so, grab the row index (.I) of the sequences and their column indexes (in the original data). For each sequence, grab the corresponding values between start and end index. Wrap in list twice (.(.() to create a list column.
library(data.table)
d <- melt(as.data.table(m), id.vars = "id")
d[ , `:=`(
ci = rowid(id),
variable = NULL)]
setorder(d, id)
d2 <- d[ , if(.N > 1){
.(from = .I[-.N], to = .I[-1],
from_ci = ci[-.N], to_ci = ci[ -1])
}, by = .(id, value)]
d2[ , val := .(.(d$value[seq(from + 1, to - 1)])), by = 1:nrow(d2)]
d2[ , `:=`(from = NULL, to = NULL)]
# id value from_ci to_ci val
# 1: 1001 37 1 4 15,30
# 2: 1001 37 4 8 4,11,35
# 3: 2111 44 1 3 31
# 4: 2111 44 3 7 30,24,39
# 5: 3121 43 1 6 49,39,34,44
# 6: 5234 23 1 4 27,34
# 7: 5234 23 4 7 30,34
# 8: 5234 34 3 6 23,30
I have a column containing 1200 character strings. In each one, every four character group is hexadecimal for a number. i.e. 300 numbers in hexadecimal crammed into a 1200 character string, in every row. I need to get each number out into decimal, and into its own column (300 new columns) named 1-300.
Here's what I've figured out so far:
Data.frame:
BigString
[1] 0043003E803C0041004A...(etc...)
Here's what I've done so far:
decimal.fours <- function(x) {
strtoi(substring(BigString[x], seq(1,1197,4), seq(4,1197,4)), 16L)
}
decimal.fours(1)
[1] 283 291 239 177 ...
But now I'm stuck. How can I output these individual number, (and the remaining 296, into new columns? I have fifty total rows/strings. It would be great to do them all at once, i.e. 300 new columns, containing split up substrings from 50 strings.
You can use read.fwf which read in files with fixed width for each column:
# an example vector of big strings
BigString = c("0043003E803C0041004A", "0043003E803C0041004A", "0043003E803C0041004A")
n = 5 # n is the number of columns for your result(300 for your real case)
as.data.frame(
lapply(read.fwf(file = textConnection(BigString),
widths = rep(4, n),
colClasses = "character"),
strtoi, base = 16))
# V1 V2 V3 V4 V5
#1 67 62 32828 65 74
#2 67 62 32828 65 74
#3 67 62 32828 65 74
If you'd like to keep the decimal.hours function, you can modify it as follows and call lapply to convert your bigStrings to list of integers which can be further converted to data.frame with do.call(rbind, ...) pattern:
decimal.fours <- function(x) {
strtoi(substring(x, seq(1,1197,4), seq(4,1197,4)), 16L)
}
do.call(rbind, lapply(BigString, decimal.fours))
Obligatory tidyverse example:
library(tidyverse)
Setup some data
set.seed(1492)
bet <- c(0:9, LETTERS[1:6]) # alphabet for hex digit sequences
i <- 8 # number of rows
n <- 10 # number of 4-hex-digit sequences
df <- data_frame(
some_other_col=LETTERS[1:i],
big_str=map_chr(1:i, ~sample(bet, 4*n, replace=TRUE) %>% paste0(collapse=""))
)
df
## # A tibble: 8 × 2
## some_other_col big_str
## <chr> <chr>
## 1 A 432100D86CAA388C15AEA6291E985F2FD3FB6104
## 2 B BC2673D112925EBBB3FD175837AF7176C39B4888
## 3 C B4E99FDAABA47515EADA786715E811EE0502ABE8
## 4 D 64E622D7037D35DE6ADC40D0380E1DC12D753CBC
## 5 E CF7CDD7BBC610443A8D8FCFD896CA9730673B181
## 6 F ED86AEE8A7B65F843200B823CFBD17E9F3CA4EEF
## 7 G 2B9BCB73941228C501F937DA8E6EF033B5DD31F6
## 8 H 40823BBBFDF9B14839B7A95B6E317EBA9B016ED5
Do the manipulation
read_fwf(paste0(df$big_str, collapse="\n"),
fwf_widths(rep(4, n)),
col_types=paste0(rep("c", n), collapse="")) %>%
mutate_all(strtoi, base=16) %>%
bind_cols(df) %>%
select(some_other_col, everything(), -big_str)
## # A tibble: 8 × 11
## some_other_col X1 X2 X3 X4 X5 X6 X7 X8 X9
## <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 A 17185 216 27818 14476 5550 42537 7832 24367 54267
## 2 B 48166 29649 4754 24251 46077 5976 14255 29046 50075
## 3 C 46313 40922 43940 29973 60122 30823 5608 4590 1282
## 4 D 25830 8919 893 13790 27356 16592 14350 7617 11637
## 5 E 53116 56699 48225 1091 43224 64765 35180 43379 1651
## 6 F 60806 44776 42934 24452 12800 47139 53181 6121 62410
## 7 G 11163 52083 37906 10437 505 14298 36462 61491 46557
## 8 H 16514 15291 65017 45384 14775 43355 28209 32442 39681
## # ... with 1 more variables: X10 <int>
just a try using base-R
BigString = c("0043003E803C0041004A", "0043003E803C0041004A", "0043003E803C0041004A")
df = data.frame(BigString)
t(sapply(df$BigString, function(x) strtoi(substring(x, seq(1, 297, 4)[1:5],
seq(4, 300, 4)[1:5]), base = 16)))
# [,1] [,2] [,3] [,4] [,5]
#[1,] 67 62 32828 65 74
#[2,] 67 62 32828 65 74
#[3,] 67 62 32828 65 74
# you can set the columns together at the end using `paste0("new_col", 1:300)`
# [1:5] was just used for this example, because i had strings of length 20cahr
I would like to create an update function using lazy evaluation and the mutate_if function from dplyrExtras by skranz.
It would work something like this:
data %>%
update(variable1_original = variable1_update,
variable2_original = variable2_update)
would be evaluated as
data %>%
mutate_if(!is.na(variable1_update),
variable1_original = variable1_update) %>%
mutate_if(!is.na(variable2_update),
variable2_original = variable2_update) %>%
select(-variable1_update, variable2_update)
Yikes, that package isn't very fun to use. mutate_if doesn't seem to work with data.frames and the package doesn't have standard-evaluation alternatives for functions like standard dplyr does. Here's an attempt to re-create the function
myupdate <- function(.data, ...) {
dots <- as.list(substitute(...()))
dx <- Reduce(function(a,b) {
upd <- b[[1]]
ifc <- bquote(!is.na(.(upd)))
do.call("mutate_if", c(list(a, ifc), b))
}, split(dots, seq_along(dots)), .data)
select_(dx, .dots=sapply(dots, function(x) bquote(-.(x))))
}
To test it, i used
library(data.table)
dd<-data.table(
a = c(1:3, NA, 5:8)+0,
b = c(1:2, NA, 4:5, NA, 7:8)+100,
x= 1:8+20,
y=1:8+30
)
dd
# a b x y
# 1: 1 101 21 31
# 2: 2 102 22 32
# 3: 3 NA 23 33
# 4: NA 104 24 34
# 5: 5 105 25 35
# 6: 6 NA 26 36
# 7: 7 107 27 37
# 8: 8 108 28 38
and then I ran
myupdate(dd, x=b, y=a)
# x y
# 1: 101 1
# 2: 102 2
# 3: 23 3
# 4: 104 34
# 5: 105 5
# 6: 26 6
# 7: 107 7
# 8: 108 8
Notice how columns "a" and "b" disappear. Also see how values in rows 3 and 6 in column "x" and the value in row 4 in column "y" was preserved because the corresponding values in columns "b" and "a" were NA.
I have matrix like :
[,1][,2][,3][,4]
[1,] 12 32 43 55
[2,] 54 54 7 8
[3,] 2 56 76 88
[4,] 58 99 93 34
I do not know in advance how many rows and columns I will have in matrix. Thus, I need to create row and column names dynamically.
I can name columns (row) directly like:
colnames(rmatrix) <- c("a", "b", "c", "d")
However, how can I create my names vector dynamically to fit the dimensions of the matrix?
nm <- ("a", "b", "c", "d")
colnames(rmatrix) <- nm
You can use rownames and colnames and setting do.NULL=FALSE in order to create names dynamically, as in:
set.seed(1)
rmatrix <- matrix(sample(0:100, 16), ncol=4)
dimnames(rmatrix) <- list(rownames(rmatrix, do.NULL = FALSE, prefix = "row"),
colnames(rmatrix, do.NULL = FALSE, prefix = "col"))
rmatrix
col1 col2 col3 col4
row1 26 19 58 61
row2 37 86 5 33
row3 56 97 18 66
row4 89 62 15 42
you can change prefix to name the rows/cols as you want to.
To dynamically names columns (or rows) you can try
colnames(rmatrix) <- letters[1:ncol(rmatrix)]
where letters can be replaced by a vector of column names of your choice. You can do similar thing for rows.
You may use provideDimnames. Some examples with various degree of customisation:
m <- matrix(1:12, ncol = 3)
provideDimnames(m)
# A B C
# A 1 5 9
# B 2 6 10
# C 3 7 11
# D 4 8 12
provideDimnames(m, base = list(letters, LETTERS))
# A B C
# a 1 5 9
# b 2 6 10
# c 3 7 11
# d 4 8 12
provideDimnames(m, base = list(paste0("row_", letters), paste0("col_", letters)))
# col_a col_b col_c
# row_a 1 5 9
# row_b 2 6 10
# row_c 3 7 11
# row_d 4 8 12