R Table with variables x levels - r

I have a dataframe with multiple variables, each of which has values of TRUE, FALSE, or NA. I'm trying to summarize the data, but can't get anything to work quite the way I want.
names <- c("n1","n2","n3","n4","n5","n6")
groupname <- c("g1","g2","g3","g4","g4","g4")
var1 <- c(TRUE,TRUE,NA,FALSE,TRUE,NA)
var2 <- c(FALSE,TRUE,NA,FALSE,TRUE,NA)
var3 <- c(FALSE,TRUE,NA,FALSE,TRUE,NA)
df <- data.frame(names,groupname,var1,var2,var3)
I'm trying to summarize the data for individual groups:
G4 TRUE FALSE NA
var1 3 1 2
var2 2 2 2
var3 2 2 2
I can do table(groupname,var1) to do them individually, but I'm trying to get it all in a single table. Any suggestions?

Using dplyr and tidyr (gather comes from tidyr):
library(dplyr)
library(tidyr)  # provides gather()
df %>% gather("key", "value", var1:var3) %>%
  group_by(key) %>%
  summarise(true = sum(value, na.rm = TRUE),
            false = sum(!value, na.rm = TRUE),
            missing = sum(is.na(value)))
# key true false missing
#1 var1 3 1 2
#2 var2 2 2 2
#3 var3 2 2 2
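gather() still works, but tidyr documents it as superseded by pivot_longer(). Below is a minimal sketch of the same idea with pivot_longer(), additionally grouped by groupname since the question asks for per-group counts. The column names key and value and the name res are just choices made here, and value %in% TRUE is used so that NA never counts as a match.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  names     = c("n1", "n2", "n3", "n4", "n5", "n6"),
  groupname = c("g1", "g2", "g3", "g4", "g4", "g4"),
  var1      = c(TRUE, TRUE, NA, FALSE, TRUE, NA),
  var2      = c(FALSE, TRUE, NA, FALSE, TRUE, NA),
  var3      = c(FALSE, TRUE, NA, FALSE, TRUE, NA)
)

res <- df %>%
  # one row per (original row, variable) pair
  pivot_longer(var1:var3, names_to = "key", values_to = "value") %>%
  group_by(groupname, key) %>%
  summarise(true    = sum(value %in% TRUE),
            false   = sum(value %in% FALSE),
            missing = sum(is.na(value)),
            .groups = "drop")

# counts for a single group
res[res$groupname == "g4", ]
```

Filtering res afterwards keeps the summary step identical for every group, which is usually simpler than subsetting the data first.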

In base R, you could use table to get the counts, lapply to run through the variables, and do.call to put the results together. A minor subsetting with [ orders the columns as desired.
do.call(rbind, lapply(df[3:5], table, useNA="ifany"))[, c(2,1,3)]
TRUE FALSE <NA>
var1 3 1 2
var2 2 2 2
var3 2 2 2
This will work if each variable has all levels (TRUE, FALSE, NA). If one of the levels is missing, you can tell table to fill it with a 0 count by feeding it a factor variable.
Here is an example.
# expand data set
df$var4 <- c(TRUE, NA)
do.call(rbind, lapply(df[3:6],
function(i) table(factor(i, levels=c(TRUE, FALSE, NA)),
useNA="ifany")))[, c(2,1,3)]
FALSE TRUE <NA>
var1 1 3 2
var2 2 2 2
var3 2 2 2
var4 0 3 3
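If the goal is a table for one group, and you would rather not depend on which levels table happens to find, a plain counting helper is an alternative. This is only a sketch, not the method from the answer above: count_levels is a hypothetical helper name, and the result columns true, false and missing are labels chosen here.

```r
df <- data.frame(
  names     = c("n1", "n2", "n3", "n4", "n5", "n6"),
  groupname = c("g1", "g2", "g3", "g4", "g4", "g4"),
  var1      = c(TRUE, TRUE, NA, FALSE, TRUE, NA),
  var2      = c(FALSE, TRUE, NA, FALSE, TRUE, NA),
  var3      = c(FALSE, TRUE, NA, FALSE, TRUE, NA)
)

# hypothetical helper: always returns all three counts, even when one is zero
count_levels <- function(x) {
  c(true    = sum(x %in% TRUE),
    false   = sum(x %in% FALSE),
    missing = sum(is.na(x)))
}

# restrict to one group, then count column by column
g4  <- df[df$groupname == "g4", c("var1", "var2", "var3")]
res <- t(sapply(g4, count_levels))
res
```

Because the helper returns a fixed-length named vector, sapply always simplifies to a matrix with the same three columns, regardless of which values occur in a given variable.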

Related

Subset and group dataframe by matching columns and values R

I have 2 dataframes, df1 contains a groupID and continuous variables like so:
GroupID Var1 Var2 Var3 Var4
1 20.33115 19.59319 0.6384765 0.6772862
1 31.05899 23.14446 0.5796645 0.7273182
2 24.28984 20.99047 0.6425050 0.6865804
2 22.47856 21.36709 0.6690020 0.6368560
3 21.65817 20.99444 0.6829786 0.6461840
3 23.45899 21.57718 0.6655482 0.6473043
And df2 contains cutoff values (ct) for each variable:
Var1ct Var2ct Var3ct Var4ct
22.7811 20.3349 0.7793 0.4294
What I want to do is, for each variable in df1, find the number of rows where the value is greater than the cutoff value in its associated column in df2, and return that number for each GroupID, so the output would look like this:
GroupID N-Var1 N-Var2 N-Var3 N-Var4
1 62 78 33 99
2 69 25 77 12
3 55 45 27 62
df1 has ~2 million rows unevenly distributed by GroupID, and 30 variable columns I need the count for. I am just looking for a more efficient way than typing out the same function for all 30 variables.
Here's a way in dplyr:
library(dplyr)
df1 %>%
group_by(GroupID) %>%
  # look up the cutoff by exact name ("Var1" -> "Var1ct") so that Var1 cannot also match Var10ct
  summarise(across(everything(), ~ sum(.x > df2[[paste0(cur_column(), "ct")]])))
GroupID Var1 Var2 Var3 Var4
<int> <int> <int> <int> <int>
1 1 1 1 0 2
2 2 1 2 0 2
3 3 1 2 0 2
data
df1 <- read.table(header = T, text = "GroupID Var1 Var2 Var3 Var4
1 20.33115 19.59319 0.6384765 0.6772862
1 31.05899 23.14446 0.5796645 0.7273182
2 24.28984 20.99047 0.6425050 0.6865804
2 22.47856 21.36709 0.6690020 0.6368560
3 21.65817 20.99444 0.6829786 0.6461840
3 23.45899 21.57718 0.6655482 0.6473043 ")
df2 <- read.table(header = T, text = "Var1ct Var2ct Var3ct Var4ct
22.7811 20.3349 0.7793 0.4294")
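For comparison, the same counts can be sketched in base R with sweep and rowsum. This assumes the columns of df2 are in the same order as the variable columns of df1; the names vals, cutoffs, above and counts are introduced here for illustration only.

```r
df1 <- read.table(header = TRUE, text = "GroupID Var1 Var2 Var3 Var4
1 20.33115 19.59319 0.6384765 0.6772862
1 31.05899 23.14446 0.5796645 0.7273182
2 24.28984 20.99047 0.6425050 0.6865804
2 22.47856 21.36709 0.6690020 0.6368560
3 21.65817 20.99444 0.6829786 0.6461840
3 23.45899 21.57718 0.6655482 0.6473043")
df2 <- read.table(header = TRUE, text = "Var1ct Var2ct Var3ct Var4ct
22.7811 20.3349 0.7793 0.4294")

vals    <- as.matrix(df1[-1])   # variable columns only
cutoffs <- unlist(df2)          # one cutoff per column, same order assumed

# compare every column against its own cutoff
above <- sweep(vals, 2, cutoffs, FUN = ">")

# sum the TRUE flags within each GroupID
counts <- rowsum(above + 0L, group = df1$GroupID)
counts
```

The result is a matrix with one row per GroupID; wrap it in data.frame(GroupID = rownames(counts), counts) if a data frame is needed.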
A data.table approach that should scale well:
library(data.table)
# if df1 and df2 are not data.tables, use
# setDT(df1); setDT(df2)
# we need matching column names in df1 and df2 to join easily
setnames(df2, names(df1)[2:5])
# melt df1 and df2 to long format
df1.long <- melt(df1, id.vars = "GroupID")
df2.long <- melt(df2, measure.vars = names(df2))
# join ct-values
df1.long[df2.long, ct := i.value, on = .(variable)]
# summarise
ans <- df1.long[, sum(value > ct), by = .(GroupID, variable)]
# cast to wide
dcast(ans, GroupID ~ variable, value.var = "V1")
# GroupID Var1 Var2 Var3 Var4
# 1: 1 1 1 0 2
# 2: 2 1 2 0 2
# 3: 3 1 2 0 2
sample data
df1 <- fread("GroupID Var1 Var2 Var3 Var4
1 20.33115 19.59319 0.6384765 0.6772862
1 31.05899 23.14446 0.5796645 0.7273182
2 24.28984 20.99047 0.6425050 0.6865804
2 22.47856 21.36709 0.6690020 0.6368560
3 21.65817 20.99444 0.6829786 0.6461840
3 23.45899 21.57718 0.6655482 0.6473043 ")
df2 <- fread("Var1ct Var2ct Var3ct Var4ct
22.7811 20.3349 0.7793 0.4294")

R: Repeating row of dataframe with respect to multiple count columns

I have a R DataFrame that has a structure similar to the following:
df <- data.frame(var1 = c(1, 1), var2 = c(0, 2), var3 = c(3, 0), f1 = c('a', 'b'), f2=c('c', 'd') )
So visually the DataFrame would look like
> df
var1 var2 var3 f1 f2
1 1 0 3 a c
2 1 2 0 b d
What I want to do is the following:
(1) Treat the first C=3 columns as counts for three different classes. (C is the number of classes, given as an input variable.) Add a new column called "class".
(2) For each row, duplicate the last two entries of the row according to the count of each class (separately); and append the class number to the new "class" column.
For example, the output for the above dataset would be
> df_updated
f1 f2 class
1 a c 1
2 a c 3
3 a c 3
4 a c 3
5 b d 1
6 b d 2
7 b d 2
where row (a c) is duplicated 4 times, 1 time with respect to class 1, and 3 times with respect to class 3; row (b d) is duplicated 3 times, 1 time with respect to class 1 and 2 times with respect to class 2.
I tried looking at previous posts on duplicating rows based on counts (e.g. this link), and I could not figure out how to adapt the solutions there to multiple count columns (and also appending another class column).
Also, my actual dataset has many more rows and classes (say 1000 rows and 20 classes), so ideally I want a solution that is as efficient as possible.
I wonder if anyone can help me on this. Thanks in advance.
Here is a tidyverse option. We can use uncount from tidyr to duplicate the rows according to the count in value (i.e., from the var columns) after pivoting to long format.
library(tidyverse)
df %>%
pivot_longer(starts_with("var"), names_to = "class") %>%
filter(value != 0) %>%
uncount(value) %>%
mutate(class = str_extract(class, "\\d+"))
Output
f1 f2 class
<chr> <chr> <chr>
1 a c 1
2 a c 3
3 a c 3
4 a c 3
5 b d 1
6 b d 2
7 b d 2
Another slight variation is to use expandRows from splitstackshape in conjunction with tidyverse.
library(splitstackshape)
df %>%
pivot_longer(starts_with("var"), names_to = "class") %>%
filter(value != 0) %>%
expandRows("value") %>%
mutate(class = str_extract(class, "\\d+"))
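base R (apart from reshape2::melt for the reshaping step)
Row order (and row names) notwithstanding:
tmp <- subset(reshape2::melt(df, id.vars = c("f1","f2"), variable.name = "class", value.name = "reps"), reps > 0)
# repeat each row reps times; the class comes from the column name, not from the count
tmp <- tmp[rep(seq_len(nrow(tmp)), times = tmp$reps), c("f1", "f2", "class")]
transform(tmp, class = as.integer(sub("var", "", class)))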
# f1 f2 class
# 1 a c 1
# 2 b d 1
# 4 b d 2
# 4.1 b d 2
# 5 a c 3
# 5.1 a c 3
# 5.2 a c 3
dplyr
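library(dplyr)
library(tidyr)  # pivot_longer
df %>%
  # class comes from the column name; reps is the count used for repetition
  pivot_longer(-c(f1, f2), names_to = "class", values_to = "reps") %>%
  filter(reps > 0) %>%
  slice(rep(row_number(), times = reps)) %>%
  transmute(f1, f2, class = as.numeric(sub("var", "", class)))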
# # A tibble: 7 x 3
# f1 f2 class
# <chr> <chr> <dbl>
# 1 a c 1
# 2 a c 3
# 3 a c 3
# 4 a c 3
# 5 b d 1
# 6 b d 2
# 7 b d 2
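The same expansion can also be sketched with nothing but rep, which avoids reshaping altogether. This assumes, as in the question, that the first C columns are the counts for classes 1..C; the names counts, reps, row_idx and class_idx are introduced here for illustration.

```r
df <- data.frame(var1 = c(1, 1), var2 = c(0, 2), var3 = c(3, 0),
                 f1 = c("a", "b"), f2 = c("c", "d"))
C <- 3  # number of class (count) columns, taken from the question

counts <- as.matrix(df[seq_len(C)])
# counts read row by row: row 1 classes 1..C, then row 2 classes 1..C, ...
reps <- as.vector(t(counts))

# which original row and which class each (row, class) cell belongs to
row_idx   <- rep(rep(seq_len(nrow(df)), each = C), times = reps)
class_idx <- rep(rep(seq_len(C), times = nrow(df)), times = reps)

df_updated <- data.frame(df[row_idx, -seq_len(C)], class = class_idx)
rownames(df_updated) <- NULL
df_updated
```

A count of zero simply contributes no repetitions, so no separate filtering step is needed.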

Paste two columns in R but NAs are included in new column [duplicate]

I am trying to concatenate two columns in R using:
df_new$conc_variable <- paste(df$var1, df$var2)
My dataset looks as follows:
id var1 var2
1 10 NA
2 NA 8
3 11 NA
4 NA 1
I am trying to get it such that there is a third column:
id var1 var2 conc_var
1 10 NA 10
2 NA 8 8
3 11 NA 11
4 NA 1 1
but instead I get:
id var1 var2 conc_var
1 10 NA 10NA
2 NA 8 8NA
3 11 NA 11NA
4 NA 1 1NA
Is there a way to exclude NAs in the paste process? I tried including na.rm = FALSE, but that just added FALSE at the end of the NA in the conc_var column. Here is the dataset:
id <- c(1,2,3,4)
var1 <- c(10, NA, 11, NA)
var2 <- c(NA, 8, NA, 1)
df <- data.frame(id, var1, var2)
One out of many options is to use ifelse as in:
df <- data.frame(var1 = c(10, NA, 11, NA),
var2 = c(NA, 8, NA, 1))
df$new <- ifelse(is.na(df$var1), yes = df$var2, no = df$var1)
print(df)
Depending on the circumstances rowSums might be suitable as well as in
df$new2 <- rowSums(df[, c("var1", "var2")], na.rm = TRUE)
print(df)
You can use tidyr::unite -
df <- tidyr::unite(df, conc_var, var1, var2, na.rm = TRUE, remove = FALSE)
df
# id conc_var var1 var2
#1 1 10 10 NA
#2 2 8 NA 8
#3 3 11 11 NA
#4 4 1 NA 1
If, as in the example, each row has at most one non-NA value, you can also use pmax or coalesce.
pmax(df$var1, df$var2, na.rm = TRUE)
dplyr::coalesce(df$var1, df$var2)
You could use glue from the glue package instead.
glue::glue(10, NA, .na = '')
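If a row can hold more than one non-NA value and you still want a pasted string, a row-wise helper in base R is one option. This is a rough sketch: paste_non_na is a hypothetical helper name, and it assumes the pasted columns are all of the same type (here numeric), since apply converts the selected columns to a matrix first.

```r
df <- data.frame(id   = c(1, 2, 3, 4),
                 var1 = c(10, NA, 11, NA),
                 var2 = c(NA, 8, NA, 1))

# hypothetical helper: drop the NAs in one row, then paste what is left
paste_non_na <- function(row, sep = "") {
  paste(row[!is.na(row)], collapse = sep)
}

df$conc_var <- apply(df[c("var1", "var2")], 1, paste_non_na)
df
```

A row in which every value is NA ends up as an empty string rather than NA; replace it afterwards if NA is preferred.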

Identify which row of data.frame exactly matches a vector

Given this data.frame:
var1 <- c(1, 2)
var2 <- c(3, 4)
var3 <- c(5, 6)
df <- expand.grid(var1 = var1, var2 = var2, var3 = var3)
var1 var2 var3
1 1 3 5
2 2 3 5
3 1 4 5
4 2 4 5
5 1 3 6
6 2 3 6
7 1 4 6
8 2 4 6
I would like to identify the data.frame row number matching this vector (4 is the answer in this case):
vec <- c(var1 = 2, var2 = 4, var3 = 5)
var1 var2 var3
2 4 5
I can't seem to sort out a simple subsetting method. The best I have been able to come up with is the following:
working <- apply(df, 2, match, vec)
which(apply(working, 1, anyNA) == FALSE)
This seems less straightforward than expected; I was wondering if there was a more straightforward solution?
We can transpose the dataframe, compare it with vec, and select the row where all of the values match.
which(colSums(t(df) == vec) == ncol(df))
#[1] 4
For the sake of completeness, subsetting can be implemented using data.table's join:
library(data.table)
setDT(df)[as.list(vec), on = names(vec), which = TRUE]
[1] 4
This can be solved using the prodlim package:
> library(prodlim)
> row.match(vec, df)
[1] 4
Here is a dplyr option:
library(dplyr)
library(magrittr)
df %>% mutate(new=paste0(var1,var2,var3), num=row_number()) %>%
filter(new=="245") %>% select(num) %>% as.integer()
[1] 4
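The transpose approach relies on vec being in the same order as the columns of df, because the comparison recycles vec down each column of t(df). A sketch that matches by name instead, using Map and Reduce from base R, is shown here; per_column and idx are names introduced for illustration, and vec is deliberately given in a different order.

```r
df  <- expand.grid(var1 = c(1, 2), var2 = c(3, 4), var3 = c(5, 6))
vec <- c(var3 = 5, var1 = 2, var2 = 4)  # order differs from the columns of df

# one logical vector per column: does this column equal its target value?
per_column <- Map(function(col, target) col == target, df[names(vec)], vec)

# a row matches only if every column matches
idx <- which(Reduce(`&`, per_column))
idx
```

Exact comparison with == is fine for the integer-valued data here; for computed floating-point values a tolerance-based comparison would be safer.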

Removing rows when flipped in two columns

Considering the following data frame:
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
> df
var1 var2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 1
I'd like to remove all rows whose values are flipped across the two columns. In this case, it would be row 1 and row 5 as the values 1 and 5 in row 1 are flipped to 5 and 1 in row 5. These two rows should be removed.
I hope it is clear what I am asking for :-)
Kind regards!
Perhaps something like this could work too:
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
df[!do.call(paste, df) %in% do.call(paste, rev(df)), ]
var1 var2
2 2 6
3 3 7
4 4 8
I'd have to test it on a few more test cases though, but the general idea is to use rev to reverse the order of the columns in "df" and paste them together and compare that with the pasted columns from "df".
Here's a simple but not especially elegant way: make a reversed data frame with a flag, and then merge it on to df:
# Make a reversed dataset
fd <- data.frame(var1 = df$var2, var2 = df$var1, flag = TRUE)
# Merge it onto your original df, then drop the matched rows and the flag var
df.sub <- subset(merge(x = df, y = fd, by = c("var1", "var2"), all.x = TRUE),
subset = is.na(flag),
select = c("var1", "var2"))
Using a bit of maths - the two rows are the same up to a permutation if the sum and absolute value of difference are the same:
df[with(df, !duplicated(data.frame(var1 + var2, abs(var1 - var2)), fromLast = TRUE)),]
# var1 var2
#1 1 5
#2 2 6
#3 3 7
#4 4 8
edit: should've read the question more carefully, to remove both duplicates, follow Ananda's suggestion:
df.ind = with(df, data.frame(var1 + var2, abs(var1 - var2)))
df[!duplicated(df.ind) & !duplicated(df.ind, fromLast = TRUE),]
# var1 var2
#2 2 6
#3 3 7
#4 4 8
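The same "equal up to a permutation" idea can be sketched without arithmetic by building an order-independent key from pmin and pmax. This is an illustration under the assumption that the data contains no exact duplicate rows in the same orientation (those would be dropped as well); key and flipped are names introduced here.

```r
df <- data.frame(var1 = 1:5, var2 = c(5, 6, 7, 8, 1))

# same key for (1, 5) and (5, 1): smaller value first, larger value second
key <- paste(pmin(df$var1, df$var2), pmax(df$var1, df$var2))

# flag every row whose key occurs more than once, in either direction
flipped <- duplicated(key) | duplicated(key, fromLast = TRUE)

res <- df[!flipped, ]
res
```

Unlike the paste/rev approach, a row with var1 equal to var2 is kept here, because its key still occurs only once.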
If creating a copy doesn't cause memory issues then this works as well -
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
df2 <- data.frame(var12 = 1:5, var22 = c(5,6,7,8,1))
df3 <- merge(df,df2, by.x = 'var2', by.y = 'var12', all.x = TRUE)
df3 <- subset(
df3,
is.na(var22),
select = c('var1','var2')
)
Output:
> df3
var1 var2
3 2 6
4 3 7
5 4 8
I tried merging df with df but that gives a warning about the column var2 being duplicated. Anybody know what to do?
If you can assume there are no duplicates in the data frame, here's a one-line answer using data.table's rbindlist, though it is still not too concise:
library(data.table)
df[!duplicated(rbindlist(list(df, df[,2:1]), use.names = FALSE))[nrow(df) + 1:nrow(df)],]
## var1 var2
## 2 2 6
## 3 3 7
## 4 4 8
rbindlist is necessary here because rbind(df,df[,2:1]) will match by column name rather than index, so the other option is something like rbind(df,setnames(df[,2:1],names(df))). If you want to keep duplicates from the original, this gets even more unpleasant:
> df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
> df<-rbind(df,c(2,6))
> df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df)+1:nrow(df)],]
var1 var2
2 2 6
3 3 7
4 4 8
> df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df)+1:nrow(df)] | duplicated(df),]
var1 var2
2 2 6
3 3 7
4 4 8
6 2 6
