Transforming a data frame in R to apply a pivot table

I have a data frame like below:
Red Green Black
John A B C
Sean A D C
Tim B C C
How can I transform it to the form below to apply a pivot table (or can this be done directly in R without transforming the data)?
Names Code Type
John Red A
John Green B
John Black C
Sean Red A
Sean Green D
Sean Black C
Tim Red B
Tim Green C
Tim Black C
My ultimate goal is then to count the types, as below, using a pivot table on the transformed data frame:
Count of Code for each type:
Row Labels   A  B  C  D  Grand Total
John         1  1  1     3
Sean         1     1  1  3
Tim             1  2     3
Grand Total  2  2  4  1  9
Reading similar topics did not help much. Thanks in advance!

Using a literal dump from your first matrix-like frame above:
dat <- structure(list(Red = c("A", "A", "B"), Green = c("B", "D", "C"
), Black = c("C", "C", "C")), class = "data.frame", row.names = c("John",
"Sean", "Tim"))
I can do this:
library(dplyr)
library(tidyr)
tibble::rownames_to_column(dat, var = "Names") %>%
gather(Code, Type, -Names)
# Names Code Type
# 1 John Red A
# 2 Sean Red A
# 3 Tim Red B
# 4 John Green B
# 5 Sean Green D
# 6 Tim Green C
# 7 John Black C
# 8 Sean Black C
# 9 Tim Black C
We can extend that to get your next goal:
tibble::rownames_to_column(dat, var = "Names") %>%
gather(Code, Type, -Names) %>%
xtabs(~ Names + Type, data = .)
# Type
# Names A B C D
# John 1 1 1 0
# Sean 1 0 1 1
# Tim 0 1 2 0
which then just needs marginals:
tibble::rownames_to_column(dat, var = "Names") %>%
gather(Code, Type, -Names) %>%
xtabs(~ Names + Type, data = .) %>%
addmargins()
# Type
# Names A B C D Sum
# John 1 1 1 0 3
# Sean 1 0 1 1 3
# Tim 0 1 2 0 3
# Sum 2 2 4 1 9
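If you are on tidyr 1.0 or later, note that gather() is superseded; a pivot_longer() sketch of the same reshaping step (same column names as above, feeding the same xtabs()/addmargins() calls):
tibble::rownames_to_column(dat, var = "Names") %>%
  pivot_longer(-Names, names_to = "Code", values_to = "Type")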

You can use reshape(). I'm not sure about your data structure, whether the names are stored in a column or as row names, so I've added both versions.
reshape(dat1, idvar="Names",
varying=2:4,
v.names="Type", direction="long",
timevar="Code", times=c("red", "green", "black"),
new.row.names=1:9)
reshape(transform(dat2, Names=rownames(dat2)), idvar="Names",
varying=1:3,
v.names="Type", direction="long",
timevar="Code", times=c("red", "green", "black"),
new.row.names=1:9)
# Names Code Type
# 1 John red A
# 2 Sean red A
# 3 Tim red B
# 4 John green B
# 5 Sean green D
# 6 Tim green C
# 7 John black C
# 8 Sean black C
# 9 Tim black C
To get a more raw version, without the time labels, you could do:
res <- reshape(transform(dat2, Names=rownames(dat2)), idvar="Names",
varying=1:3,
v.names="Type", direction="long",
timevar="Code")
res
# Names Code Type
# John.1 John 1 A
# Sean.1 Sean 1 A
# Tim.1 Tim 1 B
# John.2 John 2 B
# Sean.2 Sean 2 D
# Tim.2 Tim 2 C
# John.3 John 3 C
# Sean.3 Sean 3 C
# Tim.3 Tim 3 C
After that you can assign labels to the "Code" column by converting it to a factor:
res$Code <- factor(res$Code, labels=c("red", "green", "black"))
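From there the counting goal from the question is one base R step away; a sketch using the long result res (this reproduces the addmargins() table shown in the first answer):
with(res, addmargins(table(Names, Type)))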
Data
dat1 <- structure(list(Names = c("John", "Sean", "Tim"), Red = c("A",
"A", "B"), Green = c("B", "D", "C"), Black = c("C", "C", "C")), row.names = c(NA,
-3L), class = "data.frame")
dat2 <- structure(list(Red = c("A", "A", "B"), Green = c("B", "D", "C"
), Black = c("C", "C", "C")), row.names = c("John", "Sean", "Tim"
), class = "data.frame")

What you aim to do is (1) create a contingency table and then (2) compute the sums of the table entries across rows and columns.
Step 1: Create a contingency table
I first pivoted the data using pivot_longer() rather than gather() because it's more intuitive. Then apply table() to the two variables of interest.
# Toy example
df <- structure(list(Red = c("A", "A", "B"), Green = c("B", "D", "C"
), Black = c("C", "C", "C")), class = "data.frame", row.names = c("John",
"Sean", "Tim"))
# Pivot the data
library(dplyr) # for %>%
long_df <- tibble::rownames_to_column(df, var = "Names") %>%
  tidyr::pivot_longer(cols = -Names,
                      names_to = "Code",
                      values_to = "Type")
# Create a contingency table
df_table <- table(long_df$Names, long_df$Type)
Step 2: Compute the sum of entries for both rows and columns.
Here I used the base R function margin.table(). This approach also lets you save the row and column totals for further analysis.
# Row totals (margin = 1 indicates rows)
df_table %>%
  margin.table(margin = 1)
# Column totals (margin = 2 indicates columns)
df_table %>%
  margin.table(margin = 2)
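If you want the question's full pivot display with grand totals in a single object, addmargins() adds the Sum row and column to the contingency table, matching the first answer's result:
df_table %>%
  addmargins()
#        A B C D Sum
# John   1 1 1 0   3
# Sean   1 0 1 1   3
# Tim    0 1 2 0   3
# Sum    2 2 4 1   9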

Related

R: outer-merge two dataframes with unequal columns

I'm new to coding and am struggling a bit with this merge. I have two dataframes:
> a1
a b c
1 1 apple x
2 2 bees a
3 3 candy a
4 4 dice s
5 5 donut d
> a2
a b c d
1 1 apple x a
2 2 bees a d
3 6 coffee r s
I would like to join these two dataframes by a, b, and c. I want to get rid of the duplicate rows where a, b, c are the same, but keep the unique rows in both datasets. In the case where a unique row in a2 is kept, I would also want d to be shown. So the result would be something like the following:
> a3
a b c d
1 1 apple x a
2 2 bees a d
3 3 candy a NA
4 4 dice s NA
5 5 donut d NA
6 6 coffee r s
You can use a full_join from dplyr (part of the tidyverse):
library(tidyverse)
full_join(a1, a2, by = c("a", "b", "c")) %>%
distinct()
Output
a b c d
1 1 apple x a
2 2 bees a d
3 3 candy a <NA>
4 4 dice s <NA>
5 5 donut d <NA>
6 6 coffee r s
Data
a1 <- structure(list(a = 1:5, b = c("apple", "bees", "candy", "dice",
"donut"), c = c("x", "a", "a", "s", "d")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
a2 <- structure(list(a = c(1L, 2L, 6L), b = c("apple", "bees", "coffee"
), c = c("x", "a", "r"), d = c("a", "d", "s")), class = "data.frame", row.names = c("1",
"2", "3"))
Here is a base R approach:
a3 <- merge(a1, a2, all=TRUE) # this merges your data preserving all vals from all cols/rows (if missing - adds NA)
a3 <- a3[order(a3[,"d"], decreasing=TRUE),] # this sorts your merged df
a3 <- a3[!duplicated(a3[,"b"]),] # this removes duplicate values
Looking more carefully at your desired output: you want to preserve the "d" column value even when there is a duplicate, so first order the data frame by the "d" column in decreasing fashion (so the NAs end up at the bottom). Then drop the duplicates in column "b"; since duplicated() keeps the first encountered value, the "apple" row carrying the "d" value from a2 is preserved in the output. You can also make a one-liner with the pipe operator, as sketched below.
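A sketch of that one-liner, using the magrittr pipe and its dot placeholder:
library(magrittr)
a3 <- merge(a1, a2, all = TRUE) %>%      # full outer merge; missing d becomes NA
  .[order(.$d, decreasing = TRUE), ] %>% # rows with a d value sort ahead of NAs
  .[!duplicated(.$b), ]                  # keep the first row per value of b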

Look-up table with different values for each column

I've got one table that is a set of all of my columns, their possible corresponding values, and the description for each one of those values. For example, the table looks like this:
ID Column Value Description
1 Age A Age_20-30
2 Age B Age_30-50
3 Age C Age_50-75
4 Geo A Big_City
5 Geo B Suburbs
6 Geo C Rural_Town
And so on.. Next, I have my main data frame that is populated with the column values. What I'd like to do is switch all values in each column with their corresponding description.
Old:
ID Age Geo
1 A B
2 A A
3 C A
4 B C
5 C C
New:
ID Age Geo
1 Age_20-30 Suburbs
2 Age_20-30 Big_City
3 Age_50-75 Big_City
4 Age_30-50 Rural_Town
5 Age_50-75 Rural_Town
Now I know how I can do this for one column using the following (where lookup_df is a table for only one of my columns):
old <- lookup_df$Value
new <- lookup_df$Description
df$Age <- new[match(df$Age, old, nomatch = 0)]
But I am struggling to do this for all columns. My full set of data has >100 columns so doing this manually for each column isn't really an option (at least in terms of efficiency). Any help or pointers in the right direction would be a huge help.
We can split the first dataset into a list of named vectors, then use that list to match and replace values in the second dataset:
lst1 <- lapply(split(df1[c('Value', 'Description')], df1$Column),
function(x) setNames(x$Description, x$Value))
df2[-1] <- Map(function(x, y) y[x], df2[-1], lst1)
Output
df2
# ID Age Geo
#1 1 Age_20-30 Suburbs
#2 2 Age_20-30 Big_City
#3 3 Age_50-75 Big_City
#4 4 Age_30-50 Rural_Town
#5 5 Age_50-75 Rural_Town
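One caveat worth noting: Map() pairs arguments by position, not by name. split() returns lst1 ordered by factor level ("Age", "Geo"), which happens to match the column order of df2[-1] here; with many columns it is safer to index the lookup list by column name so the alignment is explicit:
# reorder lst1 to match df2's columns before mapping
df2[-1] <- Map(function(x, y) y[x], df2[-1], lst1[names(df2)[-1]])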
Data
df1 <- structure(list(ID = 1:6, Column = c("Age", "Age", "Age", "Geo",
"Geo", "Geo"), Value = c("A", "B", "C", "A", "B", "C"),
Description = c("Age_20-30",
"Age_30-50", "Age_50-75", "Big_City", "Suburbs", "Rural_Town"
)), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(ID = 1:5, Age = c("A", "A", "C", "B", "C"), Geo = c("B",
"A", "A", "C", "C")), class = "data.frame", row.names = c(NA,
-5L))
To do this on data with a lot of columns, you can reshape the data to long format, join it with the first dataframe, and (if needed) reshape it back to wide format.
library(dplyr)
library(tidyr)
df2 %>%
pivot_longer(cols = -ID) %>%
left_join(df1 %>% select(-ID),
by = c('name' = 'Column', 'value' = 'Value')) %>%
select(-value) %>%
pivot_wider(names_from = name, values_from = Description)
# ID Age Geo
# <int> <chr> <chr>
#1 1 Age_20-30 Suburbs
#2 2 Age_20-30 Big_City
#3 3 Age_50-75 Big_City
#4 4 Age_30-50 Rural_Town
#5 5 Age_50-75 Rural_Town

R sum of aggregate columns found in another column

Given the first 4 columns of this data (rowid, order, line, special), I need to create a fifth column, numSpecial, as shown:
rowid order line special numSpecial
    1     A   01       X          1
    2     B   01                  0
    3     B   02       X          2
    4     B   03       X          2
    5     C   01       X          1
    6     C   02                  0
numSpecial is the count of special rows (special = "X") within each order, but only for rows that are themselves special; otherwise it's 0.
I first tried adding a column, orderX, that simply concatenates order with "X", which would look like:
orderX
AX
BX
BX
BX
CX
CX
Then sum how often the concatenation of order and special appears in orderX:
df$numSpecial <- sum(paste(order, special, sep = "") %in% orderx)
But that doesn't work; it returns the sum over all rows for every order:
numSpecial
4
4
4
4
4
4
I then tried as.data.table, but I'm not getting the expected results using:
as.data.table(mydf)[, numSpecial := sum(paste(order, special, sep = "") %in% orderx), by = rowid]
However, that returns just 1 for each special row instead of group sums:
numSpecial
1
0
1
1
1
0
Where am I going wrong with these? I don't think I should need to create that orderX column either, but I can't figure out how to get this count right. It's similar to a COUNTIF in Excel, which is easy to do.
There are probably several ways, but you could just multiply the group count by a TRUE/FALSE flag for "X" being present:
dat[, numSpecial := sum(special == "X") * (special == "X"), by=order]
dat
# rowid order line special numSpecial
#1: 1 A 1 X 1
#2: 2 B 1 0
#3: 3 B 2 X 2
#4: 4 B 3 X 2
#5: 5 C 1 X 1
#6: 6 C 2 0
You could also do it a bit differently:
dat[, numSpecial := 0L][special == "X", numSpecial := .N, by=order]
Where dat was:
library(data.table)
dat <- structure(list(rowid = 1:6, order = c("A", "B", "B", "B", "C",
"C"), line = c(1L, 1L, 2L, 3L, 1L, 2L), special = c("X", "",
"X", "X", "X", "")), .Names = c("rowid", "order", "line", "special"
), row.names = c(NA, -6L), class = "data.frame")
setDT(dat)
You could use ave with a dummy variable (just filled with 1s):
df$numSpecial <- ifelse(df$special == "X", ave(rep(1,nrow(df)), df$order, df$special, FUN = length), 0)
df
# rowid order line special numSpecial
#1 1 A 1 X 1
#2 2 B 1 0
#3 3 B 2 X 2
#4 4 B 3 X 2
#5 5 C 1 X 1
#6 6 C 2 0
Note I read in your data without the numSpecial column.
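A slightly shorter ave() variant of the same idea (a sketch, not the answer's original code): sum the logical flag per group, then zero out the non-special rows by multiplying by the flag:
# per-order count of "X", kept only on rows that are themselves "X"
df$numSpecial <- ave(df$special == "X", df$order, FUN = sum) * (df$special == "X")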
Using the dplyr package:
library(dplyr)
df %>% group_by(order) %>%
mutate(numSpecial = ifelse(special=="X", sum(special=="X"), 0))
rowid order special numSpecial
1 1 A X 1
2 2 B 0
3 3 B X 2
4 4 B X 2
5 5 C X 1
6 6 C 0
One other option using base R only would be to use aggregate:
# Your data
df <- data.frame(rowid = 1:6, order = c("A", "B", "B", "B", "C", "C"), special = c("X", "", "X", "X", "X", ""))
# Make the counts
dat <- with(df, aggregate(x = list(answer = special), by = list(order = order, special = special), FUN = function(x) sum(x == "X")))
# Merge back to original dataset:
dat.fin <- merge(df,dat,by=c('order','special'))
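merge() returns the rows ordered by the join keys, so the result no longer follows rowid; a small follow-up step, assuming you want the original row order back:
# restore the original row order after the merge
dat.fin <- dat.fin[order(dat.fin$rowid), ]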

Combine two data frames across multiple columns

Say I have two dataframes, each with four columns. One column is a numeric value. The other three are identifying variables. For example:
set1 <- data.frame(label1 = c("a","b", "c"), label2 = c("red", "white", "blue"), name = c("sam", "bob", "drew"), val = c(1, 10, 100))
set2 <- data.frame(label1 = c("b","c", "d"), label2 = c("white", "green", "orange"), name = c("bob", "drew", "collin"), val = c(7, 100, 15))
Which are:
> set1
label1 label2 name val
1 a red sam 1
2 b white bob 10
3 c blue drew 100
> set2
label1 label2 name val
1 b white bob 7
2 c green drew 100
3 d orange collin 15
The first three columns can be combined to form a primary key. What is the most efficient way to combine these two data frames such that all unique values (from columns label1, label2, name) are displayed along with the two val columns:
set3 <- data.frame(label1 = c("a", "b", "c", "c", "d"), label2 = c("red", "white", "blue", "green", "orange"), name = c("sam", "bob", "drew", "drew", "collin"), val.set1 = c(1, 10, 100, NA, NA), val.set2 = c(NA, 7, NA, 100, 15))
> set3
label1 label2 name val.set1 val.set2
1 a red sam 1 NA
2 b white bob 10 7
3 c blue drew 100 NA
4 c green drew NA 100
5 d orange collin NA 15
When thinking of efficiency, you should evaluate the data.table package:
library(data.table)
(merge(
setDT(set1, key=names(set1)[1:3]),
setDT(set2, key=names(set2)[1:3]),
all=T,
suffixes=paste0(".set",1:2)
) -> set3)
# label1 label2 name val.set1 val.set2
# 1: a red sam 1 NA
# 2: b white bob 10 7
# 3: c blue drew 100 NA
# 4: c green drew NA 100
# 5: d orange collin NA 15
Since they're in the same format, you could just row-bind them together and then take only the unique values. Using dplyr:
bind_rows(set1, set2) %>% distinct(label1, label2, name)
Just make sure you don't have factors in there; everything should be character or numeric.
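If you want the val.set1/val.set2 layout from the question in dplyr rather than data.table, a full join over the three key columns should get there (a sketch; the suffix argument names the two value columns):
library(dplyr)
set3 <- full_join(set1, set2,
                  by = c("label1", "label2", "name"),
                  suffix = c(".set1", ".set2"))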

subset a dataframe based on sum of a column

I have a df that looks like this:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
5 e 0.11258865
6 f 0.07228970
7 g 0.05673759
8 h 0.05319149
9 i 0.03989362
I would like to subset it using the sum of the value column, i.e., starting from the first row, I want to keep rows until the running sum of value reaches 0.6. My desired output would be:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
I have tried df2[, colSums[,5] >= 0.6], but obviously colSums expects an array.
Thanks in advance
Here's an approach:
df2[seq(which(cumsum(df2$value) >= 0.6)[1]), ]
The result:
name value
1 a 0.2001942
2 b 0.1799645
3 c 0.1425701
4 d 0.1425701
I'm not sure I understand exactly what you are trying to do, but I think cumsum should be able to help.
First to make this reproducible, let's use dput so others can help:
df <- structure(list(name = structure(1:9, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h", "i"), class = "factor"), value = c(0.20019421,
0.17996454, 0.1425701, 0.1425701, 0.11258865, 0.0722897, 0.05673759,
0.05319149, 0.03989362)), .Names = c("name", "value"), class = "data.frame", row.names = c(NA,
-9L))
Then look at what cumsum(df$value) provides:
cumsum(df$value)
# [1] 0.2001942 0.3801587 0.5227289 0.6652990 0.7778876 0.8501773 0.9069149 0.9601064 1.0000000
Finally, subset accordingly:
subset(df, cumsum(df$value) <= 0.6)
# name value
# 1 a 0.2001942
# 2 b 0.1799645
# 3 c 0.1425701
subset(df, cumsum(df$value) >= 0.6)
# name value
# 4 d 0.14257010
# 5 e 0.11258865
# 6 f 0.07228970
# 7 g 0.05673759
# 8 h 0.05319149
# 9 i 0.03989362
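Note that cumsum(df$value) <= 0.6 stops one row short of the desired output, which keeps the row that first crosses 0.6. A dplyr sketch that matches the first answer's result, keeping a row while the running total before it is still below the threshold:
library(dplyr)
df %>%
  filter(lag(cumsum(value), default = 0) < 0.6)
#   name     value
# 1    a 0.2001942
# 2    b 0.1799645
# 3    c 0.1425701
# 4    d 0.1425701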
