Frequency count considering four columns in R

I am currently trying to count how often each combination of values from one data frame occurs in another data frame.
A B
1 a
1 b
1 c
2 a
2 b
2 c
I have this data frame and I would like to count the frequency of "B" within each group, based on another data frame that looks like this:
C D
1 a
1 a
1 b
1 b
2 b
2 c
2 c
As you can see, the number of rows is different, so datatable(counts) does not work. I would like the result to look like this after the frequency count is done:
a b freq
1 a 2
1 b 2
1 c 0
2 a 0
2 b 1
2 c 2
As you can see, it counts every combination, including those with a frequency of 0, because some groups have no data.
Thanks to anyone who helps!

By using merge and aggregate
df2$freq = 1
df = merge(df1,aggregate(freq~.,df2,length),by.x = c('A','B'),by.y = c('C','D'),all.x = T)
df[is.na(df)] = 0
df
A B freq
1 1 a 2
2 1 b 2
3 1 c 0
4 2 a 0
5 2 b 1
6 2 c 2
More Info
aggregate(freq~.,df2,length)
C D freq
1 1 a 2
2 1 b 2
3 2 b 1
4 2 c 2
Data Input
df1
A B
1 1 a
2 1 b
3 1 c
4 2 a
5 2 b
6 2 c
df2
C D
1 1 a
2 1 a
3 1 b
4 1 b
5 2 b
6 2 c
7 2 c
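To see why the zero fill is needed, the merge step can be run on its own. This is a minimal sketch that reuses the df1 and df2 shown above; the intermediate names counts and merged are only illustrative and are not part of the original answer.

```r
df1 <- data.frame(A = c(1L, 1L, 1L, 2L, 2L, 2L),
                  B = c("a", "b", "c", "a", "b", "c"))
df2 <- data.frame(C = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
                  D = c("a", "a", "b", "b", "b", "c", "c"))

# One row per (C, D) pair that actually occurs in df2
df2$freq <- 1
counts <- aggregate(freq ~ ., df2, length)

# all.x = TRUE keeps every row of df1; pairs missing from df2 get NA
merged <- merge(df1, counts, by.x = c("A", "B"), by.y = c("C", "D"),
                all.x = TRUE)
missing_pairs <- is.na(merged$freq)

# Replace the NA counts with 0, touching only the freq column
merged$freq[missing_pairs] <- 0
merged
```

Filling only the freq column, rather than every NA in the data frame, avoids overwriting missing values in other columns if df1 has any.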

This looks to be a question of how to tabulate frequencies across two factors without dropping missing levels.
Here's the dplyr solution. This assumes that dfAB, as in your example data, contains no duplicates (dfAB is interchangeable with the output of expand.grid if you don't already have the level combinations in a data frame)
library(dplyr)
dfAB %>%
# need at least one non-joining variable to tell matches from non-matches
left_join(mutate(dfCD, dummy = 1), by = c("A" = "C", "B" = "D")) %>%
group_by(A, B) %>%
summarize(freq = sum(dummy, na.rm = TRUE))
Output:
# A tibble: 6 x 3
# Groups: A [?]
A B freq
<dbl> <chr> <dbl>
1 1 a 2
2 1 b 2
3 1 c 0
4 2 a 0
5 2 b 1
6 2 c 2
(if there are duplicates in dfAB, add a distinct call to the chain before the join)
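If the level combinations are not already in a data frame, something like the following sketch could build them with expand.grid. The level values (1:2 and "a" to "c") are taken from the example data and are an assumption about what the full set of levels should be.

```r
# Build every combination of the two grouping variables
dfAB <- expand.grid(A = 1:2, B = c("a", "b", "c"),
                    stringsAsFactors = FALSE)

# expand.grid varies the first variable fastest, so reorder to match
# the layout used in the question
dfAB <- dfAB[order(dfAB$A, dfAB$B), ]
rownames(dfAB) <- NULL
dfAB
```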

df1_rows = Reduce(paste, df1)
df2_rows = Reduce(paste, df2)
data.frame(df1, freq = sapply(df1_rows, function(x) sum(df2_rows %in% x)),
row.names = NULL)
# A B freq
#1 1 a 2
#2 1 b 2
#3 1 c 0
#4 2 a 0
#5 2 b 1
#6 2 c 2
DATA
df1 = data.frame(A = c(1L, 1L, 1L, 2L, 2L, 2L),
B = c("a", "b", "c", "a", "b", "c"))
df2 = data.frame(C = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
D = c("a", "a", "b", "b", "b", "c", "c"))
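Another way to keep the empty combinations, sketched here in base R, is to tabulate factors whose levels are set explicitly. This is not taken from any of the answers above; it assumes the full set of levels is known up front, and the names tab and res are illustrative.

```r
df2 <- data.frame(C = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
                  D = c("a", "a", "b", "b", "b", "c", "c"))

# Explicit levels make table() report combinations that never occur
tab <- table(A = factor(df2$C, levels = 1:2),
             B = factor(df2$D, levels = c("a", "b", "c")))

# Convert the contingency table to long format, one row per combination
res <- as.data.frame(tab, responseName = "freq", stringsAsFactors = FALSE)
res <- res[order(res$A, res$B), ]
rownames(res) <- NULL
res
```

Note that A comes back as character here because it passed through a factor; convert it with as.integer() if a numeric column is needed.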

Related

Error assigned data must be compatible with existing data

I want to create a new variable, "F", by adding columns (B+C+D+E) if the column "A" is 1.
ID  A B C D  E
001 1 1 2 NA 1
002 0 2 1 1  NA
df$F <- rowSums(df[df$A == '1', c(3:6)],na.rm=TRUE)
I get this error:
Error:
! Assigned data `rowSums(df[df$A == "1", c(3:6)], na.rm = TRUE)` must be compatible with existing data.
✖ Existing data has 12358 rows.
✖ Assigned data has 474 rows.
ℹ Only vectors of size 1 are recycled.
Backtrace:
1. base::`$<-`(`*tmp*`, F, value = `<dbl>`)
12. tibble (local) `<fn>`(`<vctrs___>`)
Error:
How can I fix this? Is there another way to get a final result that looks like the one below?
ID  A B C D  E  F
001 1 1 2 NA 1  4
002 0 2 1 1  NA NA
Try this.
df$F <- ifelse(df$A == 1, rowSums(df[, c("B", "C", "D", "E")], na.rm=TRUE), NA)
df
# ID A B C D E F
# 1 1 1 1 2 NA 1 4
# 2 2 0 2 1 1 NA NA
We just need the same logical index on the left-hand side as well, so that both sides have the same length:
df$F[df$A == '1'] <- rowSums(df[df$A == '1', c(3:6)],na.rm=TRUE)
-output
> df
ID A B C D E F
1 1 1 1 2 NA 1 4
2 2 0 2 1 1 NA NA
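The length mismatch behind the error can be made visible with a small sketch. It rebuilds the two-row example as a plain data.frame (an assumption, since the original data is a tibble with 12358 rows) and uses the illustrative names keep and partial.

```r
df <- data.frame(ID = c("001", "002"),
                 A = c(1, 0), B = c(1, 2), C = c(2, 1),
                 D = c(NA, 1), E = c(1, NA))

# Subsetting rows first yields one sum per matching row, not per row of df
keep <- df$A == 1
partial <- rowSums(df[keep, c("B", "C", "D", "E")], na.rm = TRUE)
c(length(partial), nrow(df))  # the two lengths differ

# Start with NA everywhere, then fill only the matching rows
df$F <- NA_real_
df$F[keep] <- partial
df
```

This assumes A has no missing values; an NA in a logical index on the left-hand side of an assignment raises an error.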
A tidyverse approach:
Libraries
library(dplyr)
Data
data <-
tibble::tribble(
~ID, ~A, ~B, ~C, ~D, ~E,
"001", 1L, 1L, 2L, NA, 1L,
"002", 0L, 2L, 1L, 1L, NA
)
Code
data %>%
rowwise() %>%
mutate(`F` = if_else(A == 1, sum(c_across(cols = B:E),na.rm = TRUE), NA_integer_) )
Output
# A tibble: 2 x 7
# Rowwise:
ID A B C D E F
<chr> <int> <int> <int> <int> <int> <int>
1 001 1 1 2 NA 1 4
2 002 0 2 1 1 NA NA

R - Count unique/distinct values in two columns together per group

R - Count unique/distinct values in two columns together
Hi everyone. I have a panel of electoral behaviour, but I am having trouble computing a new variable that captures the unique values (parties) across my two columns Party and Party2013 per group. The column Party2013 measures the vote in the 2013 election and Party measures voters' intentions after 2013. Every time I try n_distinct or length, I get the count of unique values in each column separately but not across both columns together.
ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
Based on the example above, I normally get a count of 3 instead of the desired 2.
I've tried the following commands but only got the number of distinct values for each column separately:
data %>% group_by(ID) %>% distinct(Party, Party2013, .keep_all = TRUE) %>% dplyr::summarise(Party_Party2013 = n())
or
ddply(data, .(ID), mutate, count = length(unique(Party, Party2013)))
The expected outcome would be as follows:
ID Wave Party Party2013 Count
1 1 A A 2
1 2 A NA 2
1 3 B NA 2
1 4 B NA 2
2 1 A C 3
2 2 B NA 3
2 3 B NA 3
2 4 B NA 3
I would very much appreciate any advice on how to count the overall number of unique parties across the two columns per group and not the number of distinct values per each one. Thanks.
You can subset the data from cur_data() and unlist the data to get a vector. Use n_distinct to count number of unique values.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Count = n_distinct(unlist(select(cur_data(),
Party, Party2013)), na.rm = TRUE)) %>%
ungroup
# ID Wave Party Party2013 Count
# <int> <int> <chr> <chr> <int>
#1 1 1 A A 2
#2 1 2 A NA 2
#3 1 3 B NA 2
#4 1 4 B NA 2
#5 2 1 A C 3
#6 2 2 B NA 3
#7 2 3 B NA 3
#8 2 4 B NA 3
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Wave = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), Party = c("A", "A", "B", "B", "A",
"B", "B", "B"), Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA
)), class = "data.frame", row.names = c(NA, -8L))
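In dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick(). A sketch of the same idea with pick(), assuming that dplyr version is installed and using the df defined above:

```r
library(dplyr)

df <- data.frame(
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Wave = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
  Party = c("A", "A", "B", "B", "A", "B", "B", "B"),
  Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA)
)

# pick() returns the selected columns for the current group;
# unlist() flattens them into one vector before counting
res <- df %>%
  group_by(ID) %>%
  mutate(Count = n_distinct(unlist(pick(Party, Party2013)), na.rm = TRUE)) %>%
  ungroup()
res
```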
In situations like this I always like to simplify the problem and change the data into the long format since it is easier to solve problems like this if all of your values are in one column. With pivot_longer() you can also use the argument values_drop_na = TRUE to drop NAs which were counted in your example:
library(tidyr)
library(dplyr)
data <- read.table(text =
"ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
2 1 A C
2 2 B NA
2 3 B NA
2 4 B NA", header = TRUE)
data %>% pivot_longer(cols = starts_with("Party"), values_drop_na = TRUE) %>% group_by(ID) %>%
summarise(Count = n_distinct(value)) %>% merge(data, .)
#> ID Wave Party Party2013 Count
#> 1 1 1 A A 2
#> 2 1 2 A <NA> 2
#> 3 1 3 B <NA> 2
#> 4 1 4 B <NA> 2
#> 5 2 1 A C 3
#> 6 2 2 B <NA> 3
#> 7 2 3 B <NA> 3
#> 8 2 4 B <NA> 3
Created on 2021-08-30 by the reprex package (v2.0.1)
You can also do it this way:
library(dplyr)
data <- read.table(text =
"ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
2 1 A C
2 2 B NA
2 3 B NA
2 4 B NA", header = TRUE)
data %>%
group_by(ID) %>%
mutate(Count = paste(Party, Party2013) %>%
unique %>% length() %>%
rep(length(Party)))
output
# A tibble: 8 x 5
# Groups: ID [2]
ID Wave Party Party2013 Count
<int> <int> <chr> <chr> <int>
1 1 1 A A 3
2 1 2 A NA 3
3 1 3 B NA 3
4 1 4 B NA 3
5 2 1 A C 2
6 2 2 B NA 2
7 2 3 B NA 2
8 2 4 B NA 2
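Note that pasting the two columns together counts distinct (Party, Party2013) pairs, which is why the output above shows 3 and 2 rather than the 2 and 3 the question asks for. A minimal base R sketch that instead counts distinct non-missing parties across both columns per ID, assuming the same data as above (the helper name counts is illustrative):

```r
data <- read.table(text =
"ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
2 1 A C
2 2 B NA
2 3 B NA
2 4 B NA", header = TRUE, stringsAsFactors = FALSE)

# For each ID, pool both party columns into one vector, drop NAs,
# and count the distinct values
counts <- sapply(split(data[c("Party", "Party2013")], data$ID),
                 function(g) length(unique(na.omit(unlist(g)))))

# Map the per-ID counts back onto every row
data$Count <- unname(counts[as.character(data$ID)])
data
```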

How do I merge and add up columns in R?

I have an issue in R I cannot fix, so I'm asking for help here. I want to merge three columns into one, but haven't found a way to do so. Let's say it looks like this table:
Time H C W K
0 1 2 0 5
1 5 2 1 1
2 0 1 2 2
How do I turn it into this table:
Time G K
0 3 5
1 8 1
2 3 2
Maybe you can try the code below
subset(within(df, G <- rowSums(cbind(H, C, W))), select = -c(H, C, W))
giving
Time K G
1 0 5 3
2 1 1 8
3 2 2 3
or a data.table option
> setDT(df)[, .(Time, G = rowSums(cbind(H, C, W)), K)][]
Time G K
1: 0 3 5
2: 1 8 1
3: 2 3 2
We can use transmute
library(dplyr)
df %>%
transmute(Time, G = rowSums(select(., H:W)), K)
# Time G K
#1 0 3 5
#2 1 8 1
#3 2 3 2
Maybe try this:
#Code
newdf <- data.frame(df[,1,drop=F],G=rowSums(df[,-c(1,5)]),df[,5,drop=F])
Output:
Time G K
1 0 3 5
2 1 8 1
3 2 3 2
Some data used:
#Data
df <- structure(list(Time = 0:2, H = c(1L, 5L, 0L), C = c(2L, 2L, 1L
), W = 0:2, K = c(5L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-3L))
As a shortcut that avoids listing each variable, and building on the answer from @KarthikS, you can also use c_across():
library(dplyr)
#Code2
newdf <- df %>% rowwise() %>% mutate(G = sum(c_across(H:W))) %>% select(Time, G, K)
Output:
# A tibble: 3 x 3
# Rowwise:
Time G K
<int> <int> <int>
1 0 3 5
2 1 8 1
3 2 3 2
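Positional indexes such as df[,-c(1,5)] stop working if the column order changes. A small base R sketch that selects the columns by name instead, assuming the same df as above (sum_cols is an illustrative name):

```r
df <- data.frame(Time = 0:2, H = c(1L, 5L, 0L), C = c(2L, 2L, 1L),
                 W = 0:2, K = c(5L, 1L, 2L))

# Name the columns to add up, so the result does not depend on position
sum_cols <- c("H", "C", "W")
newdf <- data.frame(Time = df$Time,
                    G = rowSums(df[sum_cols]),
                    K = df$K)
newdf
```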

Order, sort or rank rows by group in dataframe

I have the following df
> df
A 1
B 2
B 2
C 1
D 2
D 2
E 1
F 2
F 2
df = data.frame(Letters = LETTERS[1:6], Times = rep(c(1,2)), stringsAsFactors = FALSE)
df = df[rep(seq_len(nrow(df)), df$Times),]
But I would like to reorder/sort/rank (not sure what to use) my rows as follows:
> df
B 2
B 2
A 1
D 2
D 2
C 1
F 2
F 2
E 1
I have found answers to similar but yet different questions on SO. Still, none of them seems to solve mine.
Is there a way to do so in base R?
Here is an option with base R
lvls <- c(do.call(rbind, with(unique(df), split(Letters,
factor(Times, levels = sort(unique(Times), decreasing = TRUE))))))
df[order(factor(df$Letters, levels = lvls)),]
# Letters Times
#2 B 2
#3 B 2
#1 A 1
#5 D 2
#6 D 2
#4 C 1
#8 F 2
#9 F 2
#7 E 1
data
df <- structure(list(Letters = c("A", "B", "B", "C", "D", "D", "E",
"F", "F"), Times = c(1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L)),
class = "data.frame", row.names = c(NA,
-9L))
Not sure if this follows your logic but it does agree with expected output,
library(dplyr)
df %>%
arrange(desc(Letters)) %>%
arrange(desc(cumsum(c(0, diff(Times) == 1))))
# Letters Times
#1 B 2
#2 B 2
#3 A 1
#4 D 2
#5 D 2
#6 C 1
#7 F 2
#8 F 2
#9 E 1
Here is one way in base R using split, with the assumptions that the Times column is always 1 or 2 and that there are the same number of unique Letters for each of the two values.
lst <- split(seq_len(nrow(df)), df$Letters)
df[unlist(c(rbind(lst[lengths(lst) == 2], lst[lengths(lst) == 1]))), ]
# Letters Times
#2 B 2
#3 B 2
#1 A 1
#5 D 2
#6 D 2
#4 C 1
#8 F 2
#9 F 2
#7 E 1
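The split answer relies on rbind plus c to interleave two lists. Broken into steps, under the same assumptions as that answer, it might look like this (twos, ones and idx are illustrative names):

```r
df <- data.frame(Letters = c("A", "B", "B", "C", "D", "D", "E", "F", "F"),
                 Times = c(1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L))

# Row indices grouped by letter
lst <- split(seq_len(nrow(df)), df$Letters)
twos <- lst[lengths(lst) == 2]  # letters that appear twice: B, D, F
ones <- lst[lengths(lst) == 1]  # letters that appear once: A, C, E

# rbind() stacks the two lists as rows of a matrix; c() reads the matrix
# column by column, which alternates twos and ones: B, A, D, C, F, E
idx <- unlist(c(rbind(twos, ones)))
df[idx, ]
```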

R: Count number of values from one column according to values in another column

I have a bit of an unclear question, so I hope I can explain this properly.
I am using R. I know for loops can be slow in R, but for me it would be ok to use a for loop in this case.
I have a dataframe like this:
id_A id_B id_C calc_A calc_B calc_C
1 x,z d g,f 1 1 5
2 x,y,z d,e f 1 2 8
3 y,z d,e g 6 7 1
I also have a vector with the names c('A', 'B', 'C', etc.)
What I want to do is to count for every row, how many id’s have a calc <= 2.
id_A is linked to calc_A, etc.
For example, for the first row A and B have calc values <= 2, together A and B have 3 id's.
So the output will be something like this:
count
1 3
2 5
3 1
It's a bit messy, but this should do the trick (for data.frame d):
# store indices of calc columns and id columns
calc.cols <- grep('^calc', names(d))
id.cols <- grep('^id', names(d))
sapply(split(d, seq_len(nrow(d))), function(x) {
length(unique(unlist(strsplit(paste(x[, id.cols][which(x[, calc.cols] <= 2)],
collapse=','), ','))))
})
# 1 2 3
# 3 5 1
Assuming that the ID columns and the calc columns are in the same order
library(stringr)
indx <- sapply(df[,1:3], str_count, ",")+1
indx[df[,4:6] >2] <- NA
df$count <- rowSums(indx,na.rm=TRUE)
df
# id_A id_B id_C calc_A calc_B calc_C count
#1 x,z d g,f 1 1 5 3
#2 x,y,z d,e f 1 2 8 5
#3 y,z d,e g 6 7 1 1
Update
Suppose the columns of your dataset are not in the same order:
set.seed(42)
df1 <- df[,sample(6)]
library(gtools)
df2 <-df1[,mixedorder(names(df1))]
# calc_A calc_B calc_C id_A id_B id_C
#1 1 1 5 x,z d g,f
#2 1 2 8 x,y,z d,e f
#3 6 7 1 y,z d,e g
id1 <- grep("^id", colnames(df2))
calc1 <- grep("^calc", colnames(df2))
indx1 <-sapply(df2[, id1], str_count, ",")+1
indx1[df2[, calc1] >2] <- NA
df1$count <- rowSums(indx1, na.rm=TRUE)
df1
# calc_C calc_B id_B id_C calc_A id_A count
#1 5 1 d g,f 1 x,z 3
#2 8 2 d,e f 1 x,y,z 5
#3 1 7 d,e g 6 y,z 1
data
df <- structure(list(id_A = c("x,z", "x,y,z", "y,z"), id_B = c("d",
"d,e", "d,e"), id_C = c("g,f", "f", "g"), calc_A = c(1L, 1L,
6L), calc_B = c(1L, 2L, 7L), calc_C = c(5L, 8L, 1L)), .Names = c("id_A",
"id_B", "id_C", "calc_A", "calc_B", "calc_C"), class = "data.frame", row.names = c("1",
"2", "3"))
I don't know if this is less messy than jbaums's solution, but here is another option:
mydf<-data.frame(id_A=c("x,y","x,y,z","y,z"),id_B=c("d","d,e","d,e"),id_C=c("g,f","f","g"),
calc_A=c(1,1,6),calc_B=c(1,2,7),calc_C=c(5,8,1),stringsAsFactors=F)
mydf$count<-apply(mydf,1,function(rg,namesrg){
rg_calc<-rg[grep("calc",namesrg)]
rg_ids<-rg[grep("id",namesrg)]
idsinf2<-which(as.numeric( rg_calc)<=2)
ttids<-unlist(sapply(rg_ids[gsub("calc","id",names(rg_calc[idsinf2]))],function(id){strsplit(id,",")[[1]]}))
return(length(ttids))
},colnames(mydf))
> mydf
id_A id_B id_C calc_A calc_B calc_C count
1 x,y d g,f 1 1 5 3
2 x,y,z d,e f 1 2 8 5
3 y,z d,e g 6 7 1 1
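Since the question mentions a vector of names such as c('A', 'B', 'C'), the id and calc columns can also be paired by name, so column order does not matter. This is a base R sketch rather than any answer's method. It counts every id in the qualifying columns and assumes ids are not repeated across columns within a row; suffixes, n_ids and ok are illustrative names.

```r
df <- data.frame(id_A = c("x,z", "x,y,z", "y,z"),
                 id_B = c("d", "d,e", "d,e"),
                 id_C = c("g,f", "f", "g"),
                 calc_A = c(1L, 1L, 6L),
                 calc_B = c(1L, 2L, 7L),
                 calc_C = c(5L, 8L, 1L),
                 stringsAsFactors = FALSE)
suffixes <- c("A", "B", "C")

# Number of comma-separated ids in each id_* column (rows x suffixes)
n_ids <- sapply(suffixes, function(s)
  lengths(strsplit(df[[paste0("id_", s)]], ",")))

# TRUE where the matching calc_* value is <= 2
ok <- sapply(suffixes, function(s) df[[paste0("calc_", s)]] <= 2)

# Multiplying by the logical matrix zeroes out non-qualifying columns
df$count <- rowSums(n_ids * ok)
df
```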
