Turn row values from multiple columns into column names in R?

I have a data frame that looks as follows:
state1 state1_pp state2 state2_pp state3 state3_pp
<chr> <chr> <chr> <chr> <chr> <chr>
1 0 0.995614 F 0.004386 NA 0
2 0 1 NA 0 NA 0
3 0 1 NA 0 NA 0
I want the values from each of the rows to be the column names and the numeric values to be the row values:
0 F NA
<chr> <chr> <chr>
1 0.995614 0.004386 0
2 1 0 0
3 1 0 0
How do I do this in R?
Or a more complex scenario:
state1 state1_pp state2 state2_pp state3 state3_pp
1 0 0.995614 F 0.004386 NA 0
2 A 1 B 0 C 0
3 D 0.7 B 0.3 NA 0
This is what I want:
0 A D F B C NA
1 0.995614 0 0 0.004386 0 0 0
2 0 1 0 0 0 0 0
3 0 0 0.7 0 0.3 0 0

First, a warning: column names that are numeric (like 1) or reserved R keywords (like NA) can cause all sorts of errors. If you must do it, I suggest the following:
library(dplyr)
# extract title row
headers <- df %>%
head(1) %>%
select(state1, state2, state3) %>%
unlist(use.names = FALSE) %>%
as.character()
# replace NA with "NA"
headers[is.na(headers)] = "NA"
# drop columns that are not wanted
new_df <- df %>%
select(-state1, -state2, -state3)
# replace column names
colnames(new_df) <- headers
To refer to your new columns you will probably need to use backticks: `
So with your new column names 0, F and NA you can call new_df$F but you cannot call new_df$NA or new_df$0. Instead you will have to call new_df$`0` and new_df$`NA`.
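As a small illustration of the backtick rule, the sketch below rebuilds the three-column result by hand and shows two ways to reach the non-syntactic names. The values are copied from the question, and check.names = FALSE is assumed so that data.frame() does not sanitize the column names.

```r
# Toy reconstruction of the desired result; values are taken from the question.
new_df <- data.frame(
  `0`  = c(0.995614, 1, 1),
  F    = c(0.004386, 0, 0),
  `NA` = c(0, 0, 0),
  check.names = FALSE  # keep "0" and "NA" instead of sanitized names
)

zero_col <- new_df$`0`      # backticks make a non-syntactic name usable with $
na_col   <- new_df[["NA"]]  # [[ ]] with a string works without backticks
f_col    <- new_df$F        # syntactic name, no quoting needed
```

Indexing with [[ ]] and a string is often the less error-prone choice when column names come from data.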

Here's an attempt using dplyr and tidyr :
library(dplyr)
library(tidyr)
df %>%
mutate(row = row_number()) %>%
mutate_all(as.character) %>%
pivot_longer(cols = -row) %>%
mutate(name = sub('\\d+', '', name)) %>%
group_by(name, row) %>%
mutate(row1 = row_number()) %>%
pivot_wider() %>%
group_by(state, row) %>%
mutate(row1 = row_number()) %>%
pivot_wider(names_from = state, values_from = state_pp,
values_fill = list(state_pp = 0)) %>%
ungroup() %>%
select(-row, -row1)
# A tibble: 3 x 7
# `0` F `NA` A B C D
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 0.995614 0.004386 0 0 0 0 0
#2 0 0 0 1 0 0 0
#3 0 0 0 0 0.3 0 0.7
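For the more complex scenario in the question, a possible alternative is to pair each state column with its _pp column in a single pivot_longer() call using the ".value" sentinel. This is only a sketch: the input tibble is typed in by hand from the question, the _pp columns are assumed to be numeric rather than character, and the helper column names row and slot are made up for the example.

```r
library(dplyr)
library(tidyr)

# Data from the "more complex scenario", with numeric *_pp columns (an assumption).
df <- tibble(
  state1 = c("0", "A", "D"), state1_pp = c(0.995614, 1, 0.7),
  state2 = c("F", "B", "B"), state2_pp = c(0.004386, 0, 0.3),
  state3 = c(NA, "C", NA),   state3_pp = c(0, 0, 0)
)

wide <- df %>%
  mutate(row = row_number()) %>%
  # state1 -> state1_state, so every name has the form <slot>_<kind>
  rename_with(~ sub("^(state\\d+)$", "\\1_state", .x)) %>%
  # one row per (row, slot), with columns `state` and `pp`
  pivot_longer(-row, names_to = c("slot", ".value"), names_sep = "_") %>%
  mutate(state = replace_na(state, "NA")) %>%  # make missing states a literal "NA" name
  select(-slot) %>%
  pivot_wider(names_from = state, values_from = pp, values_fill = 0)
```

Unlike the output shown above, the result keeps the probabilities numeric and retains the row column, which can be dropped with select(-row). It relies on each state appearing at most once per row.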

Related

r count values in rows after dcast

I want to sum the values in each row of a data frame after performing a dcast operation from the reshape2 package. The problem is that every row gets the same total (10), which is the sum of all rows combined. The totals should be 4, 2 and 4.
Example data with code:
library(dplyr)
library(reshape2)
df <- data.frame(x = as.factor(c("A","A","A","A","B","B","C","C","C","C")),
y = as.factor(c("AA","AB","AA","AC","BB","BA","CC","CC","CC","CD")),
z = c("var1","var1","var2","var1","var2","var1","var1","var2","var2","var1"))
df2 <- df %>%
group_by(x,y) %>%
summarise(num = n()) %>%
ungroup()
df3 <- dcast(df2,x~y, fill = 0 )
df3$total <- sum(df3$AA,df3$AB,df3$AC,df3$BA,df3$BB,df3$CC,df3$CD)
sum() returns one combined value, and that single value is then recycled into every row.
sum(df3$AA,df3$AB,df3$AC,df3$BA,df3$BB,df3$CC,df3$CD)
#[1] 10
You need rowSums() to get the sum of each row separately.
df3$total <- rowSums(df3[-1])
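The difference can be checked on a hand-built copy of the cast table. The sketch below assumes df3 has the counts shown in the outputs further down and uses only base R.

```r
# Hand-typed stand-in for the dcast() result (counts taken from the thread).
df3 <- data.frame(
  x  = c("A", "B", "C"),
  AA = c(2, 0, 0), AB = c(1, 0, 0), AC = c(1, 0, 0),
  BA = c(0, 1, 0), BB = c(0, 1, 0),
  CC = c(0, 0, 3), CD = c(0, 0, 1)
)

grand_total <- sum(df3[-1])      # one number for the whole table
row_totals  <- rowSums(df3[-1])  # one number per row

df3$total <- row_totals
```

Here df3[-1] drops the first column (x) so that only the numeric count columns are summed.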
Here is a simplified tidyverse approach starting from df :
library(dplyr)
library(tidyr)
df %>%
count(x, y, name = 'num') %>%
pivot_wider(names_from = y, values_from = num, values_fill = 0) %>%
mutate(total = rowSums(select(., AA:CD)))
# x AA AB AC BA BB CC CD total
# <fct> <int> <int> <int> <int> <int> <int> <int> <dbl>
#1 A 2 1 1 0 0 0 0 4
#2 B 0 0 0 1 1 0 0 2
#3 C 0 0 0 0 0 3 1 4
We can specify the values_fn in pivot_wider and also use adorn_totals from janitor
library(dplyr)
library(tidyr)
library(janitor)
df %>%
pivot_wider(names_from = y, values_from = z, values_fill = 0,
values_fn = length) %>%
adorn_totals("col")
Output:
# x AA AB AC BB BA CC CD Total
# A 2 1 1 0 0 0 0 4
# B 0 0 0 1 1 0 0 2
# C 0 0 0 0 0 3 1 4
Or using base R with xtabs and addmargins
addmargins(xtabs(z ~ x + y, transform(df, z = 1)), 2)
# y
#x AA AB AC BA BB CC CD Sum
# A 2 1 1 0 0 0 0 4
# B 0 0 0 1 1 0 0 2
# C 0 0 0 0 0 3 1 4

Use mutate_at() to create multiple binary variables from the values of a single variable

I have some variables whose values come from the set {a, b, c, ... k}, and I want to create a binary variable for each value. For example, var_a would be equivalent to as.numeric(`variable name very long` == "a"), var_b would be equivalent to as.numeric(`variable name very long` == "b"), and so on. However, some of the variables don't run neatly from a to k; some skip a letter or two.
I know how to use mutate_at when I have multiple variables that I want to change, but what if I only have one variable from which I want to create multiple variables all at once?
What I have been doing so far is this:
df <- df %>% mutate(var_a = as.numeric(`variable name very long` == "a"),
var_b = as.numeric(`variable name very long` == "b"),
...)
Except of course there are more than two variables that I want to create. Is there an easier way to do this? And I also use mutate as a way to shorten the variable name. I've also tried creating a function that might be able to do this for whatever variable and value I want it to be since I have to do this often, but I wasn't able to get it to work:
varname <- function(newvar, var, value){
df <- df %>% mutate(newvar = as.numeric(var == "value"))
}
varname("var_a", "`variable name very long`", "a")
Any suggestions are deeply appreciated. Thank you!
We could use map2 to loop over the unique elements in the column, along with the vector of new column names, transmute to create the column, and bind the output with the original data
library(dplyr)
library(purrr)
library(stringr)
un1 <- sort(as.character(unique(df[["variable name very long"]])))
un2 <- str_c('var_', un1)
map2_dfc(un1, un2, ~ df %>%
transmute(!! .y := +(`variable name very long` == .x))) %>%
bind_cols(df, .)
# A tibble: 20 x 7
# `variable name very long` val var_a var_b var_c var_d var_e
# * <chr> <dbl> <int> <int> <int> <int> <int>
# 1 c -0.710 0 0 1 0 0
# 2 b -1.04 0 1 0 0 0
# 3 c -0.798 0 0 1 0 0
# 4 e 0.319 0 0 0 0 1
# 5 b 1.87 0 1 0 0 0
# 6 b -0.317 0 1 0 0 0
# 7 a -0.773 1 0 0 0 0
# 8 d -1.44 0 0 0 1 0
# 9 a -0.348 1 0 0 0 0
#10 a -0.421 1 0 0 0 0
#11 e 1.06 0 0 0 0 1
#12 e 0.528 0 0 0 0 1
#13 a 3.13 1 0 0 0 0
#14 e -0.546 0 0 0 0 1
#15 e -1.05 0 0 0 0 1
#16 d -0.687 0 0 0 1 0
#17 e -1.13 0 0 0 0 1
#18 b -0.489 0 1 0 0 0
#19 a 1.85 1 0 0 0 0
#20 d -0.0376 0 0 0 1 0
Or another option is pivot_wider
library(tidyr)
df %>%
mutate(rn = row_number(), n = 1,
newcol = str_c('var_', `variable name very long`)) %>%
pivot_wider(names_from = newcol, values_from = n, values_fill = list(n = 0))
Or in base R with model.matrix
cbind(df, model.matrix(~ `variable name very long` -1, df))
data
set.seed(24)
df <- tibble(`variable name very long` = sample(letters[1:5],
20, replace = TRUE), val = rnorm(20))
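The helper function attempted in the question fails for three reasons: newvar is taken literally as the column name, var arrives as a string rather than a column, and "value" is quoted so it never refers to the argument. A minimal sketch of a working version is below. The function name add_indicator and the toy data are assumptions made for illustration, and it takes the data frame as an argument and returns the result instead of modifying a global df.

```r
library(dplyr)

# Hypothetical helper; the name add_indicator is not from the thread.
# newvar and var are column names given as strings, value is the level to flag.
add_indicator <- function(data, newvar, var, value) {
  data %>%
    mutate(!!newvar := as.numeric(.data[[var]] == value))
}

toy <- tibble(`variable name very long` = c("a", "b", "a", "c"))
out <- add_indicator(toy, "var_a", "variable name very long", "a")
```

Because var is passed as a plain string and looked up with .data[[var]], no backticks are needed around the long column name at the call site.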

One hot encode list of vectors

Is there a quick way to one-hot encode lists of vectors (with different lengths) in R, preferably using tidyverse?
For example:
vals <- list(a=c(1), b=c(2,3), c=c(1,2))
The wanted result is a wide dataframe:
1 2 3
a 1 0 0
b 0 1 1
c 1 1 0
Thanks!
We can enframe the list and convert them into separate rows, create a dummy column and convert the data into wide-format using pivot_wider.
library(tidyverse)
enframe(vals) %>%
unnest(value) %>%
mutate(temp = 1) %>%
pivot_wider(names_from = value, values_from = temp, values_fill = list(temp = 0))
# name `1` `2` `3`
# <chr> <dbl> <dbl> <dbl>
#1 a 1 0 0
#2 b 0 1 1
#3 c 1 1 0
One base R option could be:
t(table(stack(vals)))
values
ind 1 2 3
a 1 0 0
b 0 1 1
c 1 1 0
A base R approach,
do.call(rbind, lapply(vals, function(i) as.integer(!is.na(match(unique(unlist(vals)), i)))))
# [,1] [,2] [,3]
#a 1 0 0
#b 0 1 1
#c 1 1 0
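The matrix above has no column names. A variant of the same base R idea that keeps the values as column names might look like the sketch below; the intermediate name levels_all is an assumption, and the columns are sorted rather than kept in order of first appearance.

```r
vals <- list(a = c(1), b = c(2, 3), c = c(1, 2))

# All distinct values across the list, used as the column set.
levels_all <- sort(unique(unlist(vals)))

# One indicator vector per list element; sapply() puts them in columns, so transpose.
onehot <- t(sapply(vals, function(v) as.integer(levels_all %in% v)))
colnames(onehot) <- as.character(levels_all)
```

The result is an integer matrix with row names a, b, c; as.data.frame(onehot) converts it to a data frame if one is needed.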

I am trying to identify patterns of missing values in rows of a dataset

I am trying to find patterns in missing values in rows.
For example if I have this data set:
a b c d
1 0.1 NA NA
2 NA 3 4
5 NA 6 NA
I expect the output to be:
n a b c d m
1 0 0 1 1 2
1 0 1 0 0 1
1 0 1 0 1 2
where the 1's in columns a to d mark missing values, column m gives the number of missing values in that pattern, and column n gives the number of rows that share the pattern. That is, the first row of the output reads: 1 row is missing 2 values, in variables c and d; the second row: 1 row is missing 1 value, in variable b; and so on.
I have tried the subtable() function from the extracat package (archived version), but I can't find the positions of the missing values in each variable. I can only find frequencies.
rowmiss<-rowSums(is.na(dat1[1:ncol(dat1)]))
r1<-matrix(rowmiss, nrow=nrow(dat1))
subtable(rowmiss,1)
I expect the output to be as shown above. What I am finding so far is the frequency of missing values in rows but I expect patterns and positions of missing values.
Here's a tidyverse approach. The n column seems redundant, should it be doing something else?
library(tidyverse)
df %>%
rowid_to_column() %>%
gather(col, val, -rowid) %>%
mutate(val = is.na(val) * 1) %>%
group_by(rowid) %>% mutate(m = sum(val)) %>% ungroup() %>%
spread(col, val) %>%
mutate(n = 1) %>%
select(n, a:d, m)
# A tibble: 3 x 6
n a b c d m
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 1 1 2
2 1 0 1 0 0 1
3 1 0 1 0 1 2
An alternative way of doing this with tidyverse:
library(tidyverse)
df %>%
mutate_all(~ is.na(.) %>% as.numeric()) %>%
mutate(m = rowSums(.)) %>%
group_by_all() %>%
count()
Output (you may also want to ungroup() if doing anything further with the df):
# A tibble: 3 x 6
# Groups: a, b, c, d, m [3]
a b c d m n
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0 0 1 1 2 1
2 0 1 0 0 1 1
3 0 1 0 1 2 1
mice::md.pattern() also does basically what you want, but returns a matrix with some of the useful info in the rownames, so would require a bit of processing to turn into a data frame.
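If no extra packages are wanted, the same pattern table can also be sketched in base R with aggregate(). The data frame below is typed in from the question and named dat1 as in the asker's code; row order in the result may differ from the outputs above.

```r
# Example data from the question.
dat1 <- data.frame(
  a = c(1, 2, 5),
  b = c(0.1, NA, NA),
  c = c(NA, 3, 6),
  d = c(NA, 4, NA)
)

miss <- as.data.frame(+is.na(dat1))  # 1 = missing, 0 = present
miss$m <- rowSums(miss)              # number of missing values per row

# Count how many rows share each missingness pattern.
patterns <- aggregate(n ~ ., data = cbind(miss, n = 1), FUN = sum)
```

With more rows, patterns that occur several times collapse into one line with n greater than 1, which is where the n column stops being redundant.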

Count combinations between columns in data frame

I have a dataframe like this
V1 V2 V3
1 A A A
2 B A A
3 A B C
4 C A A
With this code I get another data frame with all the possible combinations of "A", "B" and "C".
library("gtools")
vars <- c("A", "B", "C")
combMatrix <- (combinations(n = 3, r = 2, repeats.allowed = T, v = vars))
combArray <- paste(combMatrix [,1], combMatrix [,2], sep="")
combDf <- expand.grid(combArray ,vars)
Then I want to count the combinations between a pair of columns in the first data frame (say V1 and V2) and the remaining column. It is important that concatenated pairs (V1+V2 in this case) such as "AB" and "BA" are treated as the same combination.
The final data frame should look like this.
V1+V2 V3 Freq
AA A 1
AB A 1
AC A 1
BB A 0
BC A 0
CC A 0
AA B 0
AB B 0
AC B 0
BB B 0
BC B 0
CC B 0
AA C 0
AB C 1
AC C 0
BB C 0
BC C 0
CC C 0
Then I have to iterate the process for every combinations of columns (V1+V2/V3, V1+V3/V2, V2+V3/V1).
You can try this tidyverse solution.
First calculate all possible 18 combinations considering AB==BA for two variables using some magic and not-so-elegant code including map, unite, sort and paste together with rowwise.
library(tidyverse)
all_combs <- expand.grid(unique(unlist(d)),unique(unlist(d)),unique(unlist(d))) %>%
rowwise() %>%
mutate_all(as.character) %>%
mutate(two=paste(sort(c(Var1,Var2)), collapse="")) %>%
ungroup() %>%
unite(all, two, Var3) %>%
select(all) %>%
distinct()
Then the rest
combn(1:ncol(d),2, simplify = F) %>%
set_names(map(.,~paste(., collapse = "&"))) %>%
map(~select(d,a =.[1], b=.[2], everything()) %>%
rowwise() %>%
mutate_all(as.character) %>%
mutate(two=paste(sort(c(a, b)), collapse="")) %>%
select(two, contains("V"), -a,-b) %>%
ungroup() %>%
unite(all, two, contains("V")) %>%
count(all)) %>%
map(~right_join(.,all_combs, by="all")) %>%
bind_rows(.id = "id") %>%
mutate(n=ifelse(is.na(n), 0, n)) %>%
spread(id, n)
# A tibble: 18 x 4
all `1&2` `1&3` `2&3`
<chr> <dbl> <dbl> <dbl>
1 AA_A 1 1 1
2 AA_B 0 0 1
3 AA_C 0 0 1
4 AB_A 1 1 0
5 AB_B 0 0 0
6 AB_C 1 0 0
7 AC_A 1 1 0
8 AC_B 0 1 0
9 AC_C 0 0 0
10 BB_A 0 0 0
11 BB_B 0 0 0
12 BB_C 0 0 0
13 BC_A 0 0 1
14 BC_B 0 0 0
15 BC_C 0 0 0
16 CC_A 0 0 0
17 CC_B 0 0 0
18 CC_C 0 0 0
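For a single pair of columns, the count can also be sketched in base R. The example below handles only V1+V2 against V3, rebuilds the data frame d from the question by hand, and treats "AB" and "BA" as equal by always putting the alphabetically smaller letter first; the names pair, pair_levels and freq are assumptions.

```r
# Data from the question.
d <- data.frame(
  V1 = c("A", "B", "A", "C"),
  V2 = c("A", "A", "B", "A"),
  V3 = c("A", "A", "C", "A"),
  stringsAsFactors = FALSE
)
vars <- c("A", "B", "C")

# Order-insensitive pair: smaller letter first, so "BA" becomes "AB".
pair <- paste0(pmin(d$V1, d$V2), pmax(d$V1, d$V2))

# All possible unordered pairs, so that zero counts are kept.
grid <- expand.grid(a = vars, b = vars, stringsAsFactors = FALSE)
pair_levels <- sort(unique(paste0(pmin(grid$a, grid$b), pmax(grid$a, grid$b))))

freq <- as.data.frame(table(
  pair = factor(pair, levels = pair_levels),
  V3   = factor(d$V3, levels = vars)
))
```

The other two column pairings could be handled by wrapping these steps in a function that takes the column names as arguments.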
