How can I count the frequency of AMT by A, B, C, D? - r

C1:
AMT A B C D
1 13 0 1 0 0
2 17 0 0 1 0
3 19 0 0 0 1
4 1 0 0 1 0
5 9 0 1 0 0
How can I count the frequency of AMT by A, B, C, D? I tried:
C2= t(as.matrix(C1[1])) %*% as.matrix(C1[2:5])
This gives me the total sum of AMT for each of the columns A to D, not a frequency count.
My desired output combines A, B, C and D into one column (since they are binary indicators) and then counts the frequency by type, i.e.:
AMT GROUP N
1 1 A 1
2 9 B 1
3 13 B 1
4 17 C 1
5 19 D 1
...
AMT is not limited to 1, 9, 13, 17, ...; it ranges from 0 to 30.
res <- C1 %>% group_by( ) %>% summarise(Freq=n())

library(tidyverse)
C1 %>%
pivot_longer(
cols = A:D,
names_to = "Names",
values_to = "Values"
) %>%
# keep only the indicator column that is "on" for each row
filter(Values == 1) %>%
group_by(Names) %>%
# Values no longer exists after summarise(), so no select() is needed
summarise(AMT = sum(AMT))
Output:
Names AMT
<chr> <dbl>
1 B 22
2 C 18
3 D 19

You can use max.col to get the column name which has value 1 in it.
library(dplyr)
C1 %>%
transmute(AMT,
GROUP = names(.)[-1][max.col(select(., -1))],
N = 1) %>%
arrange(AMT) -> res
res
# AMT GROUP N
#4 1 C 1
#5 9 B 1
#1 13 B 1
#2 17 C 1
#3 19 D 1
data
C1 <- structure(list(AMT = c(13L, 17L, 19L, 1L, 9L), A = c(0L, 0L,
0L, 0L, 0L), B = c(1L, 0L, 0L, 0L, 1L), C = c(0L, 1L, 0L, 1L,
0L), D = c(0L, 0L, 1L, 0L, 0L)), class = "data.frame", row.names = c(NA, -5L))
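For an actual frequency count, one possible base R sketch is to recover the group label with max.col and then count rows per AMT/GROUP pair with aggregate. It assumes every row has exactly one 1 among A:D; the names onehot, group and freq are only illustrative and do not come from the answers above.

```r
# Same data as in the question
C1 <- data.frame(
  AMT = c(13L, 17L, 19L, 1L, 9L),
  A = c(0L, 0L, 0L, 0L, 0L),
  B = c(1L, 0L, 0L, 0L, 1L),
  C = c(0L, 1L, 0L, 1L, 0L),
  D = c(0L, 0L, 1L, 0L, 0L)
)

# Assumption: exactly one indicator column is 1 in every row
onehot <- as.matrix(C1[c("A", "B", "C", "D")])
group <- colnames(onehot)[max.col(onehot, ties.method = "first")]

# Count how many rows share each (AMT, GROUP) combination
freq <- aggregate(
  data.frame(N = rep(1L, nrow(C1))),
  by = list(AMT = C1$AMT, GROUP = group),
  FUN = sum
)
freq <- freq[order(freq$AMT), ]
rownames(freq) <- NULL
freq
```

With repeated AMT values in the same group, N would be larger than 1 for those rows.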

Related

Transforming a one-hot encoded variable to one column

I have age columns like so that are dummy encoded.
How can I transform these columns to one column using dplyr?
Input:
age_0-10 age_11-20 age_21-30 age_31-40 age_41-50 age_51-60 gender
1 0 1 0 0 0 0 0
2 0 0 1 0 0 0 1
3 0 0 0 1 0 0 0
4 0 1 0 0 0 0 1
5 0 0 0 0 0 1 1
Expected output:
age gender
1 11-20 0
2 21-30 1
3 31-40 0
4 11-20 1
5 51-60 1
A possible solution, now using names_prefix thanks to @Adam's comment:
library(tidyverse)
df <- data.frame(
check.names = FALSE,
`age_0-10` = c(0L, 0L, 0L, 0L, 0L),
`age_11-20` = c(1L, 0L, 0L, 1L, 0L),
`age_21-30` = c(0L, 1L, 0L, 0L, 0L),
`age_31-40` = c(0L, 0L, 1L, 0L, 0L),
`age_41-50` = c(0L, 0L, 0L, 0L, 0L),
`age_51-60` = c(0L, 0L, 0L, 0L, 1L),
gender = c(0L, 1L, 0L, 1L, 1L)
)
df %>%
pivot_longer(cols = starts_with("age"), names_to = "age", names_prefix = "age_") %>%
filter(value == 1) %>%
select(age, gender)
#> # A tibble: 5 × 2
#> age gender
#> <chr> <int>
#> 1 11-20 0
#> 2 21-30 1
#> 3 31-40 0
#> 4 11-20 1
#> 5 51-60 1
Here is a way in dplyr using c_across().
library(dplyr)
library(stringr)
df %>%
rowwise() %>%
mutate(age = str_remove(names(.)[which(c_across(starts_with("age")) == 1)], "^age_")) %>%
ungroup() %>%
select(age, gender)
# # A tibble: 5 x 2
# age gender
# <chr> <int>
# 1 11-20 0
# 2 21-30 1
# 3 31-40 0
# 4 11-20 1
# 5 51-60 1
Try the base R code below using max.col
cbind(
age = gsub("^age_", "", head(names(df), -1)[max.col(df[-ncol(df)])]),
df[ncol(df)]
)
which gives
age gender
1 11-20 0
2 21-30 1
3 31-40 0
4 11-20 1
5 51-60 1
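One caveat with max.col: it always returns a column index, even for a row that is all zeros, and its default ties.method is "random". A hedged base R sketch that marks such rows as NA instead is shown below; the small data frame and the names age_cols, hot and idx are made up for illustration.

```r
# Illustrative data: the third row has no age column set to 1
df <- data.frame(
  check.names = FALSE,
  `age_0-10`  = c(0L, 0L, 0L),
  `age_11-20` = c(1L, 0L, 0L),
  `age_21-30` = c(0L, 1L, 0L),
  gender      = c(0L, 1L, 0L)
)

age_cols <- grep("^age_", names(df), value = TRUE)
hot <- as.matrix(df[age_cols])

# Index of the first maximum per row (deterministic, unlike the default)
idx <- max.col(hot, ties.method = "first")
# Rows that are not strictly one-hot get NA rather than an arbitrary label
idx[rowSums(hot) != 1] <- NA

result <- data.frame(
  age = sub("^age_", "", age_cols[idx]),
  gender = df$gender
)
result
```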
Here is another tidyverse solution (note that the age values keep the age_ prefix in the output):
library(dplyr)
library(purrr)
df %>%
mutate(age = pmap_chr(select(cur_data(), !gender),
~ names(df)[-ncol(df)][as.logical(c(...))])) %>%
select(age, gender)
age gender
1 age_11-20 0
2 age_21-30 1
3 age_31-40 0
4 age_11-20 1
5 age_51-60 1

Trying to expand all values in specific rows in a dataframe that correspond to a value in another column into rows

I have a data frame that looks like:
d s X3 X4 X5 X6
1 0 1 1 0 1 1
2 1 1 1 0 1 1
3 2 2 0 0 0 1
4 3 2 1 0 0 1
5 4 3 0 0 0 0
6 5 3 0 1 0 0
I want to combine the values in the X3-X6 columns into rows that correspond to the value in column s such that it looks something like:
s G1 G2 G3 G4 G5 G6 G7 G8
1 1 1 1 0 0 1 1 1 1
2 2 0 1 0 0 0 0 1 1
3 3 0 0 0 1 0 0 0 0
I did:
combined_data <- fake_data[,c(2:6)] %>%
melt(id = 's') %>% group_by(s) %>%
summarise(paste(value, collapse = ',')) %>%
separate("paste(value, collapse = \",\")", into = c("G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8"))
It does what I want but I'm not convinced it's the best way to do it.
Any help would be appreciated.
We can pivot to 'long' format, create a group by sequence column and reshape it back to 'wide'
library(dplyr)
library(tidyr)
library(stringr)
fake_data %>%
# // remove the d column
select(-d) %>%
# // pivot to long format
pivot_longer(cols = starts_with('X')) %>%
# // order the columns to get the same order as melt
arrange(s, name) %>%
group_by(s) %>%
# // update the name column by pasting 'G' with sequence after grouping
mutate(name = str_c('G', row_number())) %>%
# // reshape to wide format
pivot_wider(names_from = name, values_from = value)
# A tibble: 3 x 9
# Groups: s [3]
# s G1 G2 G3 G4 G5 G6 G7 G8
# <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 1 1 1 0 0 1 1 1 1
#2 2 0 1 0 0 0 0 1 1
#3 3 0 0 0 1 0 0 0 0
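A base R variant of the same idea, as a rough sketch: split the X columns by s and flatten each piece column by column. It assumes every value of s has the same number of rows (two here), otherwise sapply would not return a matrix. The names m and wide are illustrative.

```r
fake_data <- data.frame(
  d  = 0:5,
  s  = c(1L, 1L, 2L, 2L, 3L, 3L),
  X3 = c(1L, 1L, 0L, 1L, 0L, 0L),
  X4 = c(0L, 0L, 0L, 0L, 0L, 1L),
  X5 = c(1L, 1L, 0L, 0L, 0L, 0L),
  X6 = c(1L, 1L, 1L, 1L, 0L, 0L)
)

# unlist() on a data frame concatenates its columns: X3 values, then X4, ...
m <- t(sapply(
  split(fake_data[c("X3", "X4", "X5", "X6")], fake_data$s),
  function(g) unlist(g, use.names = FALSE)
))
colnames(m) <- paste0("G", seq_len(ncol(m)))

# The row names of m are the values of s
wide <- data.frame(s = as.integer(rownames(m)), m, row.names = NULL)
wide
```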
data
fake_data <- structure(list(d = 0:5, s = c(1L, 1L, 2L, 2L, 3L, 3L), X3 = c(1L,
1L, 0L, 1L, 0L, 0L), X4 = c(0L, 0L, 0L, 0L, 0L, 1L), X5 = c(1L,
1L, 0L, 0L, 0L, 0L), X6 = c(1L, 1L, 1L, 1L, 0L, 0L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Summarize Gaps in Binary Data using R

I am playing around with binary data.
I have data in columns in the following manner:
A B C D E F G H I J K L M N
-----------------------------------------------------
1 1 1 1 1 1 1 1 1 0 0 0 0 0
0 0 0 0 1 1 1 0 1 1 0 0 1 0
0 0 0 0 0 0 0 1 1 1 1 1 0 0
1 indicates that the system was on and 0 indicates that it was off.
I am trying to find a way to summarize the gaps between the on/off transitions of these systems.
For example,
for the first row, it stops working after 'I'
for the second row, it works from 'E' to 'G', then works again in 'I'-'J' and in 'M', but is off the rest of the time.
Is there a way to summarize this?
I would like to see my result in the following form
row-number Number of 1's Range
------------ ------------------ ------
1 9 A-I
2 3 E-G
2 2 I-J
2 1 M
3 5 H-L
Here's a tidyverse solution:
library(tidyverse)
df %>%
rowid_to_column() %>%
gather(col, val, -rowid) %>%
group_by(rowid) %>%
# This counts the number of times a new streak starts
mutate(grp_num = cumsum(val != lag(val, default = -99))) %>%
filter(val == 1) %>%
group_by(rowid, grp_num) %>%
summarise(num_1s = n(),
range = paste0(first(col), "-", last(col)))
## A tibble: 5 x 4
## Groups: rowid [3]
# rowid grp_num num_1s range
# <int> <int> <int> <chr>
#1 1 1 9 A-I
#2 2 2 3 E-G
#3 2 4 2 I-J
#4 2 6 1 M-M
#5 3 2 5 H-L
An option with data.table. Convert the data.frame to a data.table while creating a row-number column 'rn' (setDT with keep.rownames = TRUE), then melt from 'wide' to 'long' format with 'rn' as the id.var. Create a run-length id column 'grp' (rleid) on 'value', grouped by 'rn', and subset the rows where 'value' is 1. Summarise by 'grp' and 'rn' with the number of rows (.N) and the pasted range of the 'variable' values, then assign the unneeded 'grp' column to NULL and order by 'rn' if necessary.
library(data.table)
melt(setDT(df1, keep.rownames = TRUE), id.var = 'rn')[,
grp := rleid(value), rn][value == 1, .(NumberOfOnes = .N,
Range = paste(range(as.character(variable)), collapse="-")),
.(grp, rn)][, grp := NULL][order(rn)]
# rn NumberOfOnes Range
#1: 1 9 A-I
#2: 2 3 E-G
#3: 2 2 I-J
#4: 2 1 M-M
#5: 3 5 H-L
Or using base R with rle
do.call(rbind, apply(df1, 1, function(x) {
rl <- rle(x)
i1 <- rl$values == 1
l1 <- rl$lengths[i1]
nm1 <- tapply(names(x), rep(seq_along(rl$values), rl$lengths),
FUN = function(y) paste(range(y), collapse="-"))[i1]
data.frame(NumberOfOnes = l1, Range = nm1)}))
data
df1 <- structure(list(A = c(1L, 0L, 0L), B = c(1L, 0L, 0L), C = c(1L,
0L, 0L), D = c(1L, 0L, 0L), E = c(1L, 1L, 0L), F = c(1L, 1L,
0L), G = c(1L, 1L, 0L), H = c(1L, 0L, 1L), I = c(1L, 1L, 1L),
J = c(0L, 1L, 1L), K = c(0L, 0L, 1L), L = c(0L, 0L, 1L),
M = c(0L, 1L, 0L), N = c(0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
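To see what the run-length encoding does, here is a small sketch on the second row only. It uses rle plus cumsum to get the start and end position of each run of 1s. The variable names are illustrative, and a run of length one is printed as a single letter (as in the desired output) rather than "M-M".

```r
# Second row of the example data, with column names kept as vector names
x <- c(A = 0, B = 0, C = 0, D = 0, E = 1, F = 1, G = 1,
       H = 0, I = 1, J = 1, K = 0, L = 0, M = 1, N = 0)

r <- rle(unname(x))
ends <- cumsum(r$lengths)        # last position of every run
starts <- ends - r$lengths + 1   # first position of every run
on <- r$values == 1              # keep only the runs of 1s

first <- names(x)[starts[on]]
last <- names(x)[ends[on]]

runs <- data.frame(
  NumberOfOnes = r$lengths[on],
  Range = ifelse(first == last, first, paste(first, last, sep = "-"))
)
runs
```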

merge columns in dataframe with duplicated values in a second dataframe in R

I have two data frames, and I would like to sum the values of those columns whose names share the same Group value in the second data frame.
Let's say the first df is:
file1 file2 file3 file4 file5 file6 file7
num1 1 0 3 1 4 1 11
num2 0 1 1 3 4 2 2
num3 2 0 0 0 1 1 2
num4 11 0 2 1 1 1 1
num5 3 1 0 1 0 0 0
num6 0 0 0 1 2 1 1
And the second df, data is:
Group Link
1 file1
2 file2
3 file3
1 file4
4 file5
3 file6
2 file7
And at the end I would like to have something like:
file1_4 file2_7 file3_6 file5
num1 2 11 4 4
num2 3 3 3 4
num3 2 2 1 1
num4 12 1 3 1
num5 4 1 0 0
num6 1 1 1 2
Hope it is clear enough
Any help will be welcome! thanks!
Here is a tidyverse option:
library(dplyr)
# modify df2 to define a more suitable Group variable
df2 <- df2 %>%
mutate(Link2 = gsub("file", "", Link)) %>%
group_by(Group) %>%
mutate(Link2 = paste("file", paste(Link2, collapse = "_"),
sep = "")) %>%
ungroup() %>% select(Link, Group = Link2)
# dplyr pipe chain
df1 %>%
t() %>% as.data.frame() %>%
tibble::rownames_to_column("Link") %>%
left_join(df2, by = "Link") %>%
group_by(Group) %>%
transmute_at(vars(num1:num6), sum) %>%
ungroup() %>% distinct() %>%
t() %>% as.data.frame() %>%
setNames(., as.character(unlist(.[1, ]))) %>%
tail(., -1)
# output
file1_4 file2_7 file3_6 file5
num1 2 11 4 4
num2 3 3 3 4
num3 2 2 1 1
num4 12 1 3 1
num5 4 1 0 0
num6 1 1 1 2
Data
df1 <- structure(list(file1 = c(1L, 0L, 2L, 11L, 3L, 0L), file2 = c(0L,
1L, 0L, 0L, 1L, 0L), file3 = c(3L, 1L, 0L, 2L, 0L, 0L), file4 = c(1L,
3L, 0L, 1L, 1L, 1L), file5 = c(4L, 4L, 1L, 1L, 0L, 2L), file6 = c(1L,
2L, 1L, 1L, 0L, 1L), file7 = c(11L, 2L, 2L, 1L, 0L, 1L)), .Names = c("file1",
"file2", "file3", "file4", "file5", "file6", "file7"), class = "data.frame", row.names = c("num1",
"num2", "num3", "num4", "num5", "num6"))
df2 <- structure(list(Group = c(1L, 2L, 3L, 1L, 4L, 3L, 2L), Link = c("file1",
"file2", "file3", "file4", "file5", "file6", "file7")), .Names = c("Group",
"Link"), class = "data.frame", row.names = c(NA, -7L))
Since my comment was clearly not what you wanted, here is another try :)
data_temp = read.table(text = "
index file1 file2 file3 file4 file5 file6 file7
num1 1 0 3 1 4 1 11
num2 0 1 1 3 4 2 2
num3 2 0 0 0 1 1 2
num4 11 0 2 1 1 1 1
num5 3 1 0 1 0 0 0
num6 0 0 0 1 2 1 1"
, header = T, stringsAsFactors = F)
data_group = read.table(text = "
Group Link
1 file1
2 file2
3 file3
1 file4
4 file5
3 file6
2 file7"
, header = T, stringsAsFactors = F)
data_desired = data.frame("index" = data_temp$index)
for(i in unique(data_group$Group)) # for loop over all groups
{
data_to_group = data_group$Link[which(data_group$Group == i)]
data_sum = rowSums(data_temp[data_to_group], na.rm = T)
data_desired$data_sum = data_sum
names(data_desired)[NCOL(data_desired)] = paste(data_to_group, collapse = "_")
}
data_desired
# index file1_file4 file2_file7 file3_file6 file5
# 1 num1 2 11 4 4
# 2 num2 3 3 3 4
# 3 num3 2 2 1 1
# 4 num4 12 1 3 1
# 5 num5 4 1 0 0
# 6 num6 1 1 1 2
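A shorter base R possibility, sketched under the assumption that every column of df1 appears exactly once in df2$Link: transpose so the files become rows, sum the rows by group with rowsum, and transpose back. The names grp, summed and labels are illustrative, and the column labels follow the "file1_file4" style of the loop above.

```r
df1 <- data.frame(
  file1 = c(1L, 0L, 2L, 11L, 3L, 0L),
  file2 = c(0L, 1L, 0L, 0L, 1L, 0L),
  file3 = c(3L, 1L, 0L, 2L, 0L, 0L),
  file4 = c(1L, 3L, 0L, 1L, 1L, 1L),
  file5 = c(4L, 4L, 1L, 1L, 0L, 2L),
  file6 = c(1L, 2L, 1L, 1L, 0L, 1L),
  file7 = c(11L, 2L, 2L, 1L, 0L, 1L),
  row.names = paste0("num", 1:6)
)
df2 <- data.frame(
  Group = c(1L, 2L, 3L, 1L, 4L, 3L, 2L),
  Link = paste0("file", 1:7)
)

# Group of each df1 column, in df1's column order
grp <- df2$Group[match(names(df1), df2$Link)]

# rowsum() adds up matrix rows that share a group value
summed <- t(rowsum(t(as.matrix(df1)), group = grp))

# Build a label per group, e.g. "file1_file4", and apply it to the columns
labels <- tapply(df2$Link, df2$Group, paste, collapse = "_")
colnames(summed) <- as.vector(labels[colnames(summed)])
summed
```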

R: codify a variable based on previous observations and by another variable

Consider the following data:
Var1 Var2 Target
A 0 no
A 250 no
A 0 si
A 0 si
B 0 no
B 0 no
B 0 no
B 250 no
C 0 no
C 250 no
C 0 si
C 250 no
Look at the variable called Target. I need to reproduce it with the same values.
The condition for obtaining "si" or "no" is the following:
for the same level of Var1 (e.g. A), if Var2 = 250 and the following values are 0, then Target = "si" for those following rows.
I wrote this code:
df$Target <- NA
for(i in unique(df$Var1)){
subset.data.frame(df, Var1==i)
for(n in 1: length(df$Var1))
df$Target <-
ifelse(df$Var2[n]==250 && df$Var2[n+1]==0 && df$Var1[n+1]==df$Var1[n], "si", "no"))
But I get Target=si only if the next Var2=0.
Instead, as described in the dataset above, all observations with Var2=0 after a 250 have to be Target=si.
Could you help me to solve the problem, please?
Thank you,
Andrea
Solution
library(dplyr)
df %>%
group_by(Var1) %>%
mutate(Target = ifelse(cumsum(lag(Var2, default=0) == 250) > 0
& Var2 == 0, 'si', 'no'))
Result
# A tibble: 12 x 3
# Groups: Var1 [3]
Var1 Var2 Target
<fctr> <int> <chr>
1 A 0 no
2 A 250 no
3 A 0 si
4 A 0 si
5 B 0 no
6 B 0 no
7 B 0 no
8 B 250 no
9 C 0 no
10 C 250 no
11 C 0 si
12 C 250 no
Explanation
We use dplyr to group df by the levels of Var1. Within each group, cumsum(lag(Var2, default=0) == 250) > 0 tells us, for every row, whether any previous observation of Var2 in that group was 250, and Var2 == 0 tells us whether the current observation of Var2 is 0. If both conditions are TRUE, we code Target as "si"; otherwise we code it as "no".
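The same logic can be sketched in base R, which may make the two conditions easier to inspect. Here ave does the per-group work and c(0, head(v, -1)) plays the role of lag(Var2, default = 0); the helper column name seen is made up for this illustration.

```r
df <- data.frame(
  Var1 = rep(c("A", "B", "C"), each = 4),
  Var2 = c(0L, 250L, 0L, 0L, 0L, 0L, 0L, 250L, 0L, 250L, 0L, 250L)
)

# Per group: how many earlier rows (not counting the current one) had Var2 == 250
seen <- ave(df$Var2, df$Var1, FUN = function(v) {
  cumsum(c(0, head(v, -1)) == 250)
})

# "si" only when a 250 occurred earlier in the group and the current value is 0
df$Target <- ifelse(seen > 0 & df$Var2 == 0, "si", "no")
df
```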
Data
The data I started with for df are
structure(list(Var1 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
Var2 = c(0L, 250L, 0L, 0L, 0L, 0L, 0L, 250L, 0L, 250L, 0L,
250L)), .Names = c("Var1", "Var2"), row.names = c(NA, -12L
), class = "data.frame")
Comparison to akrun's Solution
The output of akrun's solution is below so you can determine which approach is more appropriate for your problem.
# A tibble: 12 x 3
# Groups: Var1 [3]
Var1 Var2 Target
<fctr> <int> <chr>
1 A 0 si
2 A 250 no
3 A 0 no
4 A 0 no
5 B 0 no
6 B 0 no
7 B 0 si
8 B 250 no
9 C 0 si
10 C 250 no
11 C 0 si
12 C 250 no
We can use dplyr
library(dplyr)
df1 %>%
group_by(Var1) %>%
mutate(Target = replace(Target, Var2==0 & lead(Var2, default = Var2[n()])==250, 'si'))
