Taking two columns from your dataframe and creating a squared dataframe - r

Hello, I have a dataframe that holds the correlations between pairs of factors. It looks somewhat like this:
Factor_1 Factor_2 value
a b 0.8
a a 1
a d 0.6
b c 0.4
b b 1
a c 0.2
b d 0.75
b a 0.8
c a 0.2
c c 1
c d 0.1
c b 0.4
As you can see, when Factor_1 and Factor_2 have the same value, their correlation is 1. Also, the two columns do not contain the same set of factors (Factor_1 has a, b, c while Factor_2 has a, b, c, d).
From this dataframe, I want to create a square dataframe that uses the values of Factor_1 as both the row and column names, with each cell holding the matching correlation value.
It should look something like this.
a b c
a 1 0.8 0.2
b 0.8 1 0.4
c 0.2 0.4 1
Is there a way to create this dataframe using the tidyverse?
Thanks in advance!

In base R:
xtabs(value~Factor_1 + Factor_2, df)[,-4]
Factor_2
Factor_1 a b c
a 1.00 0.80 0.20
b 0.80 1.00 0.40
c 0.20 0.40 1.00
If you need it as a dataframe:
as.data.frame.matrix(xtabs(value~., df)[,1:3])
a b c
a 1.0 0.8 0.2
b 0.8 1.0 0.4
c 0.2 0.4 1.0
You could also use
xtabs(value~.,subset(df, Factor_1 %in%Factor_2 & Factor_2 %in%Factor_1))
Factor_2
Factor_1 a b c
a 1.0 0.8 0.2
b 0.8 1.0 0.4
c 0.2 0.4 1.0
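For anyone who wants to run the answers here, a minimal sketch follows. The question does not include code for building `df`, so the `data.frame()` call below is typed in from the table and its column types are an assumption. It also picks the shared factors with `intersect()` rather than a hard-coded column index, which may be more robust if the data changes.

```r
# Data typed in from the table in the question; column types are assumed
df <- data.frame(
  Factor_1 = c("a", "a", "a", "b", "b", "a", "b", "b", "c", "c", "c", "c"),
  Factor_2 = c("b", "a", "d", "c", "b", "c", "d", "a", "a", "c", "d", "b"),
  value    = c(0.8, 1, 0.6, 0.4, 1, 0.2, 0.75, 0.8, 0.2, 1, 0.1, 0.4)
)

# Factors that appear in both columns (here: a, b, c)
common <- intersect(df$Factor_1, df$Factor_2)

# Cross-tabulate only the rows whose two factors are both in `common`
sq <- xtabs(value ~ Factor_1 + Factor_2,
            data = subset(df, Factor_1 %in% common & Factor_2 %in% common))

# Plain data frame with the factor names as row and column names
sq_df <- as.data.frame.matrix(sq)
sq_df
```

Note that `xtabs()` sums `value` within each cell, so this only reproduces the correlations if each (Factor_1, Factor_2) pair occurs at most once.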

You may try
library(dplyr)
library(tidyr)
library(tibble) # provides column_to_rownames()
df %>%
filter(Factor_2 %in% unique(Factor_1), Factor_1 %in% unique(Factor_2)) %>%
arrange(Factor_1, Factor_2) %>%
pivot_wider(id_cols = Factor_1, names_from = Factor_2, values_from = value) %>%
column_to_rownames(var = "Factor_1")
a b c
a 1.0 0.8 0.2
b 0.8 1.0 0.4
c 0.2 0.4 1.0

df %>% arrange(Factor_1,Factor_2) %>%
pivot_wider(names_from = Factor_2, values_from = value)%>%
select(1:(nrow(.)+1))
# A tibble: 3 × 4
Factor_1 a b c
<chr> <dbl> <dbl> <dbl>
1 a 1 0.8 0.2
2 b 0.8 1 0.4
3 c 0.2 0.4 1

Related

Retrieve Top AND Bottom Values R Dataframe

Looking for a way to select the top 3 AND bottom 3 rows by value. I have tried using slice_max() in conjunction with slice_min() with no success.
id value
a 0.9
b 0.2
c -0.4
d -0.9
e 0.6
f 0.8
g -0.3
h 0.1
i 0.2
j 0.5
k -0.2
# Desired output:
a 0.9
f 0.8
e 0.6
d -0.9
c -0.4
g -0.3
dplyr
dat %>%
# row_number() gives unique ranks even when values tie, so this drops exactly ranks 4 .. n() - 3
filter(!between(row_number(value), 4, n() - 3))
# id value
# 1 a 0.9
# 2 c -0.4
# 3 d -0.9
# 4 e 0.6
# 5 f 0.8
# 6 g -0.3
or
dat %>%
arrange(value) %>%
slice( unique(c(1:3, n() - 0:2)) )
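If you would rather avoid rank arithmetic, here is a base R sketch of the same idea: sort once in each direction, take three rows from each, and stack them. The `dat` construction is typed in from the question's table, and `top3`, `bottom3` and `res` are just illustrative names.

```r
# Data typed in from the question
dat <- data.frame(
  id    = letters[1:11],
  value = c(0.9, 0.2, -0.4, -0.9, 0.6, 0.8, -0.3, 0.1, 0.2, 0.5, -0.2)
)

# Three largest values, largest first
top3 <- head(dat[order(dat$value, decreasing = TRUE), ], 3)

# Three smallest values, smallest first
bottom3 <- head(dat[order(dat$value), ], 3)

# Stack them in the order shown in the desired output
res <- rbind(top3, bottom3)
res
```

Ties at the cut-off are broken by original row order here. If fewer than six rows exist, the two pieces would overlap, so real data may need a `unique()` step.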

Use conditions from multiple variables to replace a variable in R

I did some searches but could not find the best keywords to phrase my question so I think I will attempt to ask it here.
I am dealing with a data frame in R that has two variables representing the identity of the data points. In the following example, A and 1 represent the same individual, as do B and 2, and C and 3, but the two kinds of identifier are mixed together in the original data.
ID1 ID2 Value
A 1 0.5
B 2 0.8
C C 0.7
A A 0.6
B 2 0.3
3 C 0.4
2 2 0.3
1 A 0.4
3 3 0.6
What I want to achieve is to unify the identity by using only one of the identifiers so it can be either:
ID1 ID2 Value ID
A 1 0.5 A
B 2 0.8 B
C C 0.7 C
A A 0.6 A
B 2 0.3 B
3 C 0.4 C
2 2 0.3 B
1 A 0.4 A
3 3 0.6 C
or:
ID1 ID2 Value ID
A 1 0.5 1
B 2 0.8 2
C C 0.7 3
A A 0.6 1
B 2 0.3 2
3 C 0.4 3
2 2 0.3 2
1 A 0.4 1
3 3 0.6 3
I can probably achieve this with ifelse, but that would mean writing two ifelse statements for each condition, which does not seem efficient, so I was wondering if there is a better way to do it. Here is the example data set.
df=data.frame(ID1=c("A","B","C","A","B","3","2","1","3"),
ID2=c("1","2","C","A","2","C","2","A","3"),
Value=c(0.5,0.8,0.7,0.6,0.3,0.4,0.3,0.4,0.6))
Thank you so much for the help!
Edit:
To clarify, the two identifiers in my real data are longer text strings, not just ABC and 123. Sorry I did not make that clear.
An option is to detect the elements that contain only digits, convert them to integer, and then get the corresponding LETTERS in case_when
library(dplyr)
library(stringr)
df %>%
mutate(ID = case_when(str_detect(ID1, '\\d+')~
LETTERS[as.integer(ID1)], TRUE ~ ID1))
# ID1 ID2 Value ID
#1 A 1 0.5 A
#2 B 2 0.8 B
#3 C C 0.7 C
#4 A A 0.6 A
#5 B 2 0.3 B
#6 3 C 0.4 C
#7 2 2 0.3 B
#8 1 A 0.4 A
#9 3 3 0.6 C
Or more compactly
df %>%
mutate(ID = coalesce(LETTERS[as.integer(ID1)], ID1))
If we have different sets of values, then create a key/value dataset and do a join
keyval <- data.frame(ID1 = c('1', '2', '3'), ID = c('A', 'B', 'C'))
left_join(df, keyval, by = "ID1") %>% mutate(ID = coalesce(ID, ID1))
A base R option using replace
within(
df,
ID <- replace(
ID1,
!ID1 %in% LETTERS,
LETTERS[as.numeric(ID1[!ID1 %in% LETTERS])]
)
)
or ifelse
within(
df,
ID <- suppressWarnings(ifelse(ID1 %in% LETTERS,
ID1,
LETTERS[as.integer(ID1)]
))
)
which gives
ID1 ID2 Value ID
1 A 1 0.5 A
2 B 2 0.8 B
3 C C 0.7 C
4 A A 0.6 A
5 B 2 0.3 B
6 3 C 0.4 C
7 2 2 0.3 B
8 1 A 0.4 A
9 3 3 0.6 C
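Since the question's edit says the real identifiers are longer strings, the `LETTERS` trick may not carry over. A base R sketch of the key/value idea using a named character vector as the lookup table is below. The `lookup` vector is hypothetical; in real data you would have to supply the actual pairs yourself.

```r
# Example data from the question
df <- data.frame(
  ID1   = c("A", "B", "C", "A", "B", "3", "2", "1", "3"),
  ID2   = c("1", "2", "C", "A", "2", "C", "2", "A", "3"),
  Value = c(0.5, 0.8, 0.7, 0.6, 0.3, 0.4, 0.3, 0.4, 0.6),
  stringsAsFactors = FALSE
)

# Hypothetical mapping: names are the identifiers to replace,
# values are the preferred identifiers
lookup <- c("1" = "A", "2" = "B", "3" = "C")

# Indexing by name returns NA for values that are not keys in `lookup`
mapped <- unname(lookup[df$ID1])

# Keep ID1 where no mapping exists, otherwise use the mapped value
df$ID <- ifelse(is.na(mapped), df$ID1, mapped)
df
```

Swapping the names and values of `lookup` would give the numeric form of the ID instead.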

Lag variable by group/time indicator in dplyr

I have data that looks like this:
set.seed(13)
dt <- data.frame(group = c(rep("a", 3), rep("b", 4), rep("c", 3)), var = c(rep(0.1,3), rep(0.3, 4), rep(1.1,3)))
dt
group var
1 a 0.1
2 a 0.1
3 a 0.1
4 b 0.3
5 b 0.3
6 b 0.3
7 b 0.3
8 c 1.1
9 c 1.1
10 c 1.1
I'd like to lag the variable var across the groups in the variable group, so that every row in a group gets the previous group's value. One difficulty is that the groups are of different sizes; otherwise this would be no problem, since I could specify n as the common group size. My data should look like the output below. How do I do this using dplyr, for example?
group var lag1.var lag2.var
1 a 0.1 NA NA
2 a 0.1 NA NA
3 a 0.1 NA NA
4 b 0.3 0.1 NA
5 b 0.3 0.1 NA
6 b 0.3 0.1 NA
7 b 0.3 0.1 NA
8 c 1.1 0.3 0.1
9 c 1.1 0.3 0.1
10 c 1.1 0.3 0.1
You can create a tibble with the lag variables for each group and then merge it with dt. Try this:
left_join(dt, dt %>%
group_by(group) %>%
mutate(var = first(var)) %>%
distinct() %>%
ungroup() %>%
mutate(lag1.var = lag(var, order_by = group),
lag2.var = lag(lag1.var, order_by = group)) %>%
select(-var),
by = "group")
# output
group var lag1.var lag2.var
1 a 0.1 NA NA
2 a 0.1 NA NA
3 a 0.1 NA NA
4 b 0.3 0.1 NA
5 b 0.3 0.1 NA
6 b 0.3 0.1 NA
7 b 0.3 0.1 NA
8 c 1.1 0.3 0.1
9 c 1.1 0.3 0.1
10 c 1.1 0.3 0.1
This assumes that var is constant within each group.
Here is another option. First we nest by group, then we map out the lagged values and then unnest.
library(tidyverse)
dt %>%
nest(-group) %>%
mutate(lag1.var = map_dbl(data, ~.x$var[[1]]) %>% lag(.), lag2.var = lag(lag1.var)) %>%
unnest
#> group lag1.var lag2.var var
#> 1 a NA NA 0.1
#> 2 a NA NA 0.1
#> 3 a NA NA 0.1
#> 4 b 0.1 NA 0.3
#> 5 b 0.1 NA 0.3
#> 6 b 0.1 NA 0.3
#> 7 b 0.1 NA 0.3
#> 8 c 0.3 0.1 1.1
#> 9 c 0.3 0.1 1.1
#> 10 c 0.3 0.1 1.1
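The same "one row per group, lag, then map back" idea can also be sketched in base R. This is only an illustration under the same assumption that var is constant within each group; `grp` and `idx` are names made up for this example.

```r
dt <- data.frame(
  group = c(rep("a", 3), rep("b", 4), rep("c", 3)),
  var   = c(rep(0.1, 3), rep(0.3, 4), rep(1.1, 3))
)

# One row per group (relies on var being constant within a group)
grp <- unique(dt[, c("group", "var")])

# Shift the group-level values down by one and by two positions
grp$lag1.var <- c(NA, head(grp$var, -1))
grp$lag2.var <- c(NA, head(grp$lag1.var, -1))

# Map the group-level lags back onto every row, keeping the row order
idx <- match(dt$group, grp$group)
dt$lag1.var <- grp$lag1.var[idx]
dt$lag2.var <- grp$lag2.var[idx]
dt
```

Using `match()` rather than `merge()` keeps the original row order without any extra sorting step.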

How to reset row names?

Here is a sample data set:
sample1 <- data.frame(Names=letters[1:10], Values=sample(seq(0.1,1,0.1)))
When I'm reordering the data set, I'm losing the row names order
sample1[order(sample1$Values), ]
Names Values
7 g 0.1
4 d 0.2
3 c 0.3
9 i 0.4
10 j 0.5
5 e 0.6
8 h 0.7
6 f 0.8
1 a 0.9
2 b 1.0
Desired output:
Names Values
1 g 0.1
2 d 0.2
3 c 0.3
4 i 0.4
5 j 0.5
6 e 0.6
7 h 0.7
8 f 0.8
9 a 0.9
10 b 1.0
Try
rownames(Ordersample2) <- 1:10
or more generally
rownames(Ordersample2) <- NULL
I had a dplyr use case:
df %>% as.data.frame(row.names = 1:nrow(.))
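Putting the pieces together on the question's data, a short sketch. `set.seed(1)` and the name `ordered` are additions for reproducibility and illustration, not part of the question.

```r
set.seed(1)  # only so the shuffled values are reproducible
sample1 <- data.frame(Names = letters[1:10], Values = sample(seq(0.1, 1, 0.1)))

# Reordering keeps the original row names attached to each row
ordered <- sample1[order(sample1$Values), ]

# Setting row names to NULL resets them to the default 1, 2, ..., n
rownames(ordered) <- NULL
ordered
```

Assigning `NULL` works for any number of rows, which is why it is the more general form than `1:10`.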

Fill nth columns in a dataframe

I have this data frame:
df <- data.frame(A=c("a","b","c","d","e","f","g","h","i"),
B=c("1","1","1","2","2","2","3","3","3"),
C=c(0.1,0.2,0.4,0.1,0.5,0.7,0.1,0.2,0.5))
> df
A B C
1 a 1 0.1
2 b 1 0.2
3 c 1 0.4
4 d 2 0.1
5 e 2 0.5
6 f 2 0.7
7 g 3 0.1
8 h 3 0.2
9 i 3 0.5
I would like to add 1000 further columns and fill these columns with the values generated by:
transform(df, D=ave(C, B, FUN=function(b) sample(b, replace=TRUE)))
I've tried with a for loop but it does not work:
for (i in 4:1000){
df[, 4:1000] <- NA
df[,i] = transform(df, D=ave(C, B, FUN=function(b) sample(b, replace=TRUE)))
}
For efficiency reasons, I suggest running sample only once for each group. This can be achieved with this:
sample2 <- function(x, size)
{
if(length(x)==1) rep(x, size) else sample(x, size, replace=TRUE)
}
new_df <- do.call(rbind, by(df, df$B,
function(d) cbind(d, matrix(sample2(d$C, length(d$C)*1000),
ncol=1000))))
Notes:
I've created sample2 in case there is a group with only one C value. Check ?sample to see what I mean.
The names of the columns will be numbers, from 1 to 1000. This can be changed as in the answer by @agstudy.
The row names are also changed. "Fixing" them is similar, just use row.names instead of col.names.
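To illustrate the `?sample` point from the notes: as documented, when `x` is a single number that is at least 1, `sample(x, ...)` draws from `1:x` rather than returning `x` itself. A small sketch using the `sample2` helper from above follows. The value 7 is just an example. The C values in this question are below 1, so they would not trigger this, but other data could.

```r
sample2 <- function(x, size) {
  # A length-one x is repeated instead of being handed to sample()
  if (length(x) == 1) rep(x, size) else sample(x, size, replace = TRUE)
}

set.seed(1)

# Plain sample() treats the single number 7 as "sample from 1:7"
plain <- sample(7, 5, replace = TRUE)

# sample2() returns the single value itself, repeated
safe <- sample2(7, 5)

plain
safe
```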
Using replicate for example:
cbind(df,replicate(1000,ave(df$C, df$B,
FUN=function(b) sample(b, replace=TRUE))))
To add 4 columns for example:
cbind(df,replicate(4,ave(df$C, df$B,
FUN=function(b) sample(b, replace=TRUE))))
A B C 1 2 3 4
1 a 1 0.1 0.2 0.2 0.1 0.2
2 b 1 0.2 0.4 0.2 0.4 0.4
3 c 1 0.4 0.1 0.1 0.1 0.1
4 d 2 0.1 0.1 0.5 0.5 0.1
5 e 2 0.5 0.7 0.1 0.5 0.1
6 f 2 0.7 0.1 0.7 0.7 0.7
7 g 3 0.1 0.2 0.5 0.2 0.2
8 h 3 0.2 0.2 0.1 0.2 0.1
9 i 3 0.5 0.5 0.5 0.1 0.5
You may need to rename the columns, with something like:
gsub('([0-9]+)','D\\1',colnames(res))
[1] "A" "B" "C" "D1" "D2" "D3" "D4"
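Alternatively, the new columns can be named before binding, which avoids the `gsub` step. A sketch with 4 columns follows. `n_new`, `draws` and `res` are illustrative names and `set.seed(42)` is only there for reproducibility. Change `n_new` to 1000 for the full case.

```r
df <- data.frame(
  A = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
  B = c("1", "1", "1", "2", "2", "2", "3", "3", "3"),
  C = c(0.1, 0.2, 0.4, 0.1, 0.5, 0.7, 0.1, 0.2, 0.5)
)

set.seed(42)  # reproducible resampling
n_new <- 4    # number of resampled columns to add

# Each replicate resamples C with replacement within each level of B;
# the result is a matrix with one column per replicate
draws <- replicate(n_new,
                   ave(df$C, df$B, FUN = function(b) sample(b, replace = TRUE)))

# Name the columns D1, D2, ... before attaching them to df
colnames(draws) <- paste0("D", seq_len(n_new))
res <- cbind(df, draws)
names(res)
```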
