Transform dataframe (crosstab by counts) - r

I have a df (coming from a csv) that has this structure
id att1_beer att1_wine att2_beer att2_wine
1 1 1 0 0
2 0 1 0 1
3 1 1 0 1
4 0 1 0 1
5 1 1 0 0
I would like to get a table (preferably with Tidyverse) to this format:
Beer Wine
Att1 3 5
Att2 0 3
Is this possible? I'm trying to avoid exporting to Excel to do it.

Can reshape to 'long' with pivot_longer and then get the sum by group
library(dplyr)
df %>%
select(-id) %>%
pivot_longer(cols =everything(), names_sep="_",
names_to = c("grp", ".value")) %>%
group_by(grp) %>%
summarise(across(everything(), sum), .groups = 'drop')
Or using base R
sapply(split.default(df[-1], sub(".*_", "", names(df)[-1])), colSums)
data
df <- structure(list(id = 1:5, att1_beer = c(1L, 0L, 1L, 0L, 1L),
att1_wine = c(1L,
1L, 1L, 1L, 1L), att2_beer = c(0L, 0L, 0L, 0L, 0L), att2_wine = c(0L,
1L, 1L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L
))

data.table solution for completeness' sake
library( data.table ); setDT(df) #prepare
ans <- melt( df, id.vars = "id" ) #melt to long format
ans[, c("att", "drink") := tstrsplit( variable, "_" )] #split column to variables
dcast(ans, att ~ drink, fun.aggregate = sum ) #cast to wide and sum

A base R option using xtabs + colSums
u <- colSums(df[-1])
xtabs(
u ~ .,
data.frame(
u,
do.call(rbind, strsplit(names(u), "_"))
)
)
gives
X2
X1 beer wine
att1 3 5
att2 0 3

Related

How can I create a single column from multiple columns combined?

A dataset that I am using has recorded respondents' ethnicity. Responses are recorded across multiple variables, and respondents were allowed to pick more than one. Example:
ethnicity1 ethnicity2 ethnicity3 ethnicity4 ethnicity5 ethnicity6
1 0 0 0 0 0
0 2 0 0 0 0
0 0 3 4 0 0
Each ethnicity has its own column. I have recoded each column using the recode command so that different numbers represent different ethnicities (i.e., black would be 1, white would be 2, etc.) to try and make a single ethnicity variable
A) create a single column from the multiple columns combined
B) have it so that anyone who reported multiple columns gets designated "multiple".
My expected output would look like this:
Ethnicity
1
2
999
(I'm not sure whether it's best to have a numerical value signify multiple races for coding purposes, or have it be a character value as in "multiple')
Initially, I thought to do this but it's not going as I'd hope.
Ethnicity <- df %>% dplyr::na_if(0)
## create column for ethnicity
Ethnicity %>% unite("RaceEthnicity", ethnicity1:ethnicity5, na.rm = TRUE, remove = FALSE)
Here's a tidyverse solution. I assume your data has a column for the respondent. I've added that and named it ID.
To understand what's happening you can run the code by successively adding each line, up to but not including the pipe (%>%) and look at the output.
The columns used for pivot_longer will depend on what your real data looks like: here the ethnicities are in 1-6, ID in 7.
library(dplyr)
library(tidyr)
mydata %>%
# add IDs for respondent
mutate(ID = LETTERS[1:3]) %>%
# convert to 'long' format
pivot_longer(1:6) %>%
# remove zero value rows
filter(value != 0) %>%
# group by person
group_by(ID) %>%
# use value where there is one row per person, otherwise use 999
# we need doubles for both values (existing are int)
summarise(ethnicity = case_when(n() == 1 ~ as.double(value),
TRUE ~ 999)) %>%
ungroup() %>%
# discard duplicate rows
distinct()
Result:
ID ethnicity
<chr> <dbl>
1 A 1
2 B 2
3 C 999
Your example data with corrected column names:
mydata <- structure(list(ethnicity1 = c(1L, 0L, 0L),
ethnicity2 = c(0L, 2L, 0L),
ethnicity3 = c(0L, 0L, 3L),
ethnicity4 = c(0L, 0L, 4L),
ethnicity5 = c(0L, 0L, 0L),
ethnicity6 = c(0L, 0L, 0L)),
class = "data.frame",
row.names = c(NA, -3L))
Here's an approach with dplyr and purrr:
library(dplyr);library(purrr)
df %>%
mutate(RaceEthnicity = select(cur_data(), enthnicity1:ethnicity6) %>%
{case_when(pmap_lgl(., ~ all(is.na(.x))) ~ NA_real_,
rowSums(.,na.rm = TRUE) == 0 ~ 0,
rowSums(.,na.rm = TRUE) != pmap_int(.,pmax,na.rm = TRUE) ~ 999,
TRUE ~ rowSums(.,na.rm = TRUE))})
enthnicity1 enthnicity2 ethnicity3 enthnicity4 enthnicity5 ethnicity6 RaceEthnicity
1 1 0 0 0 0 0 1
2 0 2 0 0 0 0 2
3 0 0 3 4 0 0 999
This is probably not the most novice friendly approach, but you can define the columns within the select call. After select, we pass the data into a set of {} so that the data is represented by the . symbol. From there, we use dplyr::case_when to test multiple conditions.
If all columns are NA, return NA
If the rowSums = 0, return 0
If the rowSums don't equal the row max, return 999
Otherwise, return the rowSum (because it will only be length one and be the ethnicity of interest).
Note you misspelled column names.
Data:
df <- structure(list(enthnicity1 = c(1L, 0L, 0L), enthnicity2 = c(0L,
2L, 0L), ethnicity3 = c(0L, 0L, 3L), enthnicity4 = c(0L, 0L,
4L), enthnicity5 = c(0L, 0L, 0L), ethnicity6 = c(0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -3L))
Here is another tidyverse solution. Here, I create a new column (with mutate) and then use pmap to select all of the columns that start with ethnicity. Then, I put everything from that row into a list. Then, I remove all of the 0s from that list and replace any row that has more than one value with 999 and keep only 1 unique value.
library(tidyverse)
df %>%
mutate(Ethnicity = pmap(
select(., starts_with("ethnicity")),
~ c(...) %>%
keep(~ all(. != 0)) %>%
replace(length(.) > 1, 999) %>%
unique
))
If you need to simply select the columns (since your real data may not actually have the word "ethnicity" for each column), then you can just put in the column index (e.g., c(1:6)) or use the column names (shown below).
df %>%
mutate(Ethnicity = pmap(
select(., c("ethnicity1", "ethnicity2", "ethnicity3", "ethnicity4", "ethnicity5", "ethnicity6")),
~ c(...) %>%
keep(~ all(. != 0)) %>%
replace(length(.) > 1, 999) %>%
unique
))
Another option is to use mutate with ifelse and change any row with multiple values to 999.
library(tidyverse)
df %>%
mutate(Ethnicity = pmap(select(., starts_with("ethnicity")), ~ c(...) %>%
keep( ~ all(. != 0)))) %>%
rowwise %>%
mutate(Ethnicity = ifelse(length(Ethnicity) > 1, 999, Ethnicity)) %>%
select(Ethnicity)
Output
# A tibble: 3 × 1
# Rowwise:
Ethnicity
<dbl>
1 1
2 2
3 999
Data
df <-
structure(
list(
ethnicity1 = c(1L, 0L, 0L),
ethnicity2 = c(0L, 2L, 0L),
ethnicity3 = c(0L, 0L, 3L),
ethnicity4 = c(0L, 0L, 4L),
ethnicity5 = c(0L, 0L, 0L),
ethnicity6 = c(0L, 0L, 0L)
),
class = "data.frame",
row.names = c(NA,-3L)
)
in Base R you could do:
aggregate(.~row, data.frame(which(df>0, TRUE)), \(x) if(length(x)>1)999 else x)
row col
1 1 1
2 2 2
3 3 999
I would propose another strategy to consider. It seems that if the new number of ethnicityn columns is limited (less than 32 in the simple case) the better approach could be using a bitmask. This way is used in many languages for the alike purposes for instance in MySQL list columns, in Pascal/Delphi sets, etc. In this case, the resulting column will hold the following values: c(1L, 2L, 12L)
Maybe that simple? Or do I overlook something?
library(dplyr)
df %>%
mutate(Ethnicity = rowSums(select(., contains("ethnicity"))),
Ethnicity = ifelse(Ethnicity > 2, 999, Ethnicity))
ethnicity1 ethnicity2 ethnicity3 ethnicity4 ethnicity5 ethnicity6 Ethnicity
1 1 0 0 0 0 0 1
2 0 2 0 0 0 0 2
3 0 0 3 4 0 0 999

R: find number of columns > 0 per row for a group of column names with a partial string match

I have a dataframe that resembles the following:
ID
X
Y
A_1_l
A_2_m
B_1_n
B_2_l
C_1_m
C_2_n
C_3_l
w
X
Y
0
0
0
0
0
0
0
x
X
Y
0
0
3
0
0
0
0
y
X
Y
0
1
0
4
0
1
0
z
X
Y
3
4
5
6
2
1
5
The first letter denotes a sample, the number a repetition and the second letter a batch. I am trying to find a count of the number of samples with at least one value > 0 for each ID and store these numbers in a list.
This is the desired result as a list that I can append to a an existing dataframe:
0,1,3,3
For a previous analysis I used strsplit to count the total number of samples per batch.
colsList <- colnames(df)
cols <- grep("_", colsList, value=TRUE)
splitList <- strsplit(cols, "_\\d_")
stats <-data.frame(t(as.data.frame.list(splitList)))
rownames(stats)<-NULL
names(stats)<-c("Sample", "Batch")
perSample <- aggregate(Sample ~ Batch, stats,
function(x) length(unique(x))) # number of strains
And I was able to find the total number of columns with a value > 0 using rowSums(df[sapply(df, is.numeric)] > 0) but I cant seem to figure out how to combine the two to find the total number of samples > 0
First filter the data to keep only the numeric columns.
Use split.default to divide the data into groups so that you have all the 'A' columns in one group, 'B' in another and so on. Within each group return TRUE if a row has a single value which is greater than 0, sum all the values together from all the groups to get final count.
tmp <- Filter(is.numeric, df)
rowSums(sapply(split.default(tmp, sub('_.*', '', names(tmp))),
function(x) rowSums(x) > 0))
#[1] 0 1 3 3
We can do this in tidyverse
library(dplyr)
library(stringr)
library(tidyr)
df1 %>%
select(ID, where(is.numeric)) %>%
pivot_longer(cols = -ID) %>%
mutate(name = str_remove(name, "_.*")) %>%
group_by(ID, name) %>%
summarise(value = sum(value > 0), .groups = 'drop_last') %>%
summarise(value = sum(value > 0))
# A tibble: 4 x 2
ID value
<chr> <int>
1 w 0
2 x 1
3 y 3
4 z 3
data
df1 <- structure(list(ID = c("w", "x", "y", "z"), X = c("X", "X", "X",
"X"), Y = c("Y", "Y", "Y", "Y"), A_1_l = c(0L, 0L, 0L, 3L), A_2_m = c(0L,
0L, 1L, 4L), B_1_n = c(0L, 3L, 0L, 5L), B_2_l = c(0L, 0L, 4L,
6L), C_1_m = c(0L, 0L, 0L, 2L), C_2_n = c(0L, 0L, 1L, 1L), C_3_l = c(0L,
0L, 0L, 5L)), class = "data.frame", row.names = c(NA, -4L))

Count values above 0 and count how many match a pattern in a row (in R)

I would like to count how many rows in each column are >0 and how many of those rows (that are >0) start with "mt-".
The result should also be in a data frame.
Here is an example.
df1
mt-abc 1 0 2
mt-dca 1 1 2
cla 0 2 0
dla 0 3 0
result
above0 2 3 2
mt 2 1 2
In base R you can do :
mat <- df[-1] > 0
rbind(above0 = colSums(mat),
mt = colSums(startsWith(df$V1, 'mt') & mat))
# V2 V3 V4
#above0 2 3 2
#mt 2 1 2
Actual data has numbers in the column and names in rownames for which we can do :
mat <- df > 0
rbind(above0 = colSums(mat),
mt = colSums(startsWith(rownames(df), 'mt') & mat))
data
df <- structure(list(V1 = c("mt-abc", "mt-dca", "cla", "dla"), V2 = c(1L,
1L, 0L, 0L), V3 = 0:3, V4 = c(2L, 2L, 0L, 0L)), class = "data.frame",
row.names = c(NA, -4L))
I don't think this is the most elegant approach in the tidyverse, but just out of curiosity:
library(tidyverse)
my_df <- data.frame(
stringsAsFactors = FALSE,
var = c("mt-abc", "mt-dca", "cla", "dla"),
x = c(1L, 1L, 0L, 0L),
y = c(0L, 1L, 2L, 3L),
z = c(2L, 2L, 0L, 0L)
)
df_1 <- my_df %>%
summarize(across(.cols=x:z, .fn=~sum(.x > 0))) %>%
mutate(var="above0")
df_2 <- my_df %>%
filter(str_detect(var, "^mt")) %>%
summarise(across(.cols=x:z, .fn=~sum(.x > 0))) %>%
mutate(var="mt")
bind_rows(df_1, df_2)
#> x y z var
#> 1 2 3 2 above0
#> 2 2 1 2 mt
Created on 2020-12-04 by the reprex package (v0.3.0)

R: How to identify first date of positive observation by ID for multiple columns?

I would like to identify first date of positive observation by ID for multiple columns.
Example dataframe:
ID date Observ1 Observ2 Observ3
1 1 1 0 0
1 2 0 1 0
1 3 1 0 1
2 1 1 1 0
Desired result:
ID FirstObserv1 FirstObserv2 FirstObserv3
1 1 2 3
2 1 1 NA
For single column of observation, I can solve it with dplyr:
df %>% group_by(ID) %>% filter( Observ1 > 0) %>% summarize( FirstObserv1 = min(date) ) %>% as.data.frame()
Having no idea how to do it for multiple column at once, though.
Try reshaping your data like this using tidyverse functions. The key of the code id filtering those values with value of 1 and then set a filter to extract the min date value using filter(). After that you reshape to wide and you get the expected output. Here the code:
library(tidyverse)
#Code
dfnew <- df %>% pivot_longer(-c(ID,date)) %>%
group_by(ID) %>%
filter(value==1) %>% select(-value) %>% ungroup() %>%
group_by(ID,name) %>%
filter(date==min(date)) %>%
pivot_wider(names_from = name,values_from=date)
Output:
# A tibble: 2 x 4
# Groups: ID [2]
ID Observ1 Observ2 Observ3
<int> <int> <int> <int>
1 1 1 2 3
2 2 1 1 NA
Some data used:
#Data
df <- structure(list(ID = c(1L, 1L, 1L, 2L), date = c(1L, 2L, 3L, 1L
), Observ1 = c(1L, 0L, 1L, 1L), Observ2 = c(0L, 1L, 0L, 1L),
Observ3 = c(0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-4L))
Here's a method which just replaces the observation with date if the observation is positive and NA otherwise. Getting the min of each observation yields the desired results.
df %>%
mutate_at(vars(starts_with("Observ")), ~ifelse(. > 0, date, NA)) %>%
group_by(ID) %>%
summarise_at(vars(starts_with("Observ")), min, na.rm = TRUE)
#> # A tibble: 2 x 4
#> ID Observ1 Observ2 Observ3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 3
#> 2 2 1 1 Inf
Another alternative:
df %>%
group_by(ID) %>%
summarise(across(
-date,
list(First = ~{x <- which(. > 0); if (length(x) > 0L) date[[x[[1L]]]] else NA_real_}),
.names = "{.fn}{.col}"
))
Output
ID FirstObserv1 FirstObserv2 FirstObserv3
<dbl> <dbl> <dbl> <dbl>
1 1 1 2 3
2 2 1 1 NA
We can use data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x) which(x > 0)[1]),
ID, .SDcols = patterns('^Observ')]
# ID Observ1 Observ2 Observ3
#1: 1 1 2 3
#2: 2 1 1 NA
Or using tidyverse
library(dplyr)
df %>%
group_by(ID) %>%
summarise(across(starts_with('Obser'), ~ which(. > 0)[1],
.names = 'First{col}'), .groups = 'drop')
# A tibble: 2 x 4
# ID FirstObserv1 FirstObserv2 FirstObserv3
# <int> <int> <int> <int>
#1 1 1 2 3
#2 2 1 1 NA
data
df <- structure(list(ID = c(1L, 1L, 1L, 2L), date = c(1L, 2L, 3L, 1L
), Observ1 = c(1L, 0L, 1L, 1L), Observ2 = c(0L, 1L, 0L, 1L),
Observ3 = c(0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-4L))

Merge (make mean) columns with partially matched header name

I have a data which look like:
AAA_1 AAA_2 AAA_3 BBB_1 BBB_2 BBB_3 CCC
1 1 1 1 2 2 2 1
2 3 1 4 0 0 0 0
3 5 3 0 1 1 1 1
For each row, I want to make a mean for those columns which have a common feature as follow
feature <- c("AAA","BBB","CCC")
the desired output should look like:
AAA BBB CCC
1 1 2 1
2 2.6 0 0
3 2.6 1 1
for each pattern separately I was able to do that:
data <- read.table("data.txt",header=T,row.name=1)
AAA <- as.matrix(rowMeans(data[ , grepl("AAA" , names( data ) ) ])
But I did not know how to do partially match for different patterns in one row
Also tried some other things like :
for (i in 1:length(features)){
feature[i] <- as.matrix(rowMeans(data[ , grepl(feature[i] , names( data ) ) ]))
}
Assuming your colnames are always structured as shown in your example, then you can split the names and aggregate.
new_names <- unlist(strsplit(names(df),"\\_.*"))
colnames(df) <- new_names
#Testing with your data, we need to prevent the loss of dimension by using drop = FALSE
sapply(unique(new_names), function(i) rowMeans(df[, new_names==i, drop = FALSE]))
# AAA BBB CCC
#[1,] 1.000000 2 1
#[2,] 2.666667 0 0
#[3,] 2.666667 1 1
Data:
df <- structure(list(AAA_1 = c(1L, 3L, 5L), AAA_2 = c(1L, 1L, 3L),
AAA_3 = c(1L, 4L, 0L), BBB_1 = c(2L, 0L, 1L), BBB_2 = c(2L,
0L, 1L), BBB_3 = c(2L, 0L, 1L), CCC = c(1L, 0L, 1L)), .Names = c("AAA_1",
"AAA_2", "AAA_3", "BBB_1", "BBB_2", "BBB_3", "CCC"), class = "data.frame", row.names = c(NA,
-3L))
Here is another option for you. Seeing your column pattern, I chose to use gsub() and get the first three letters. Using ind which includes AAA, BBB, and CCC, I used lapply(), subsetted the data for each element of ind, calculated row means, and extracted a column for row mean only. Then, I used bind_cols() and created foo. The last thing was to assign column names to foo.
library(dplyr)
ind <- unique(gsub("_\\d+$", "", names(mydf)))
lapply(ind, function(x){
select(mydf, contains(x)) %>%
transmute(out = rowMeans(.))
}) %>%
bind_cols() %>%
add_rownames -> foo
names(foo) <- ind
# AAA BBB CCC
# (dbl) (dbl) (dbl)
#1 1.000000 2 1
#2 2.666667 0 0
#3 2.666667 1 1
DATA
mydf <- structure(list(AAA_1 = c(1L, 3L, 5L), AAA_2 = c(1L, 1L, 3L),
AAA_3 = c(1L, 4L, 0L), BBB_1 = c(2L, 0L, 1L), BBB_2 = c(2L,
0L, 1L), BBB_3 = c(2L, 0L, 1L), CCC = c(1L, 0L, 1L)), .Names = c("AAA_1",
"AAA_2", "AAA_3", "BBB_1", "BBB_2", "BBB_3", "CCC"), class = "data.frame", row.names = c(NA,
-3L))
library(dplyr)
library(tidyr)
data %>%
add_rownames() %>%
gather("variable", "value", -rowname) %>%
mutate(variable = gsub("_.*$", "", variable)) %>%
group_by(rowname, variable) %>%
summarise(mean = mean(value)) %>%
spread(variable, mean)

Resources