In R: How to coerce a list of vectors with unequal length to a dataframe using tidyverse?

Suppose you have the following list in R:
list_test <- list(c(2,4,5, 6), c(1,2,3), c(7,8))
What I am looking for is a dataframe of the following form:
value list_index
2 1
4 1
5 1
6 1
1 2
2 2
3 2
7 3
8 3
I tried to find a solution with the tidyverse, but I either lost the list_index/name or had problems with the unequal lengths of the vectors.

You can name the list elements and then use stack in base R.
names(list_test) <- seq_along(list_test)
stack(list_test)
# values ind
#1 2 1
#2 4 1
#3 5 1
#4 6 1
#5 1 2
#6 2 2
#7 3 2
#8 7 3
#9 8 3
If interested in a tidyverse solution we can use enframe with unnest.
library(tidyr) # also makes the %>% pipe available
tibble::enframe(list_test) %>% tidyr::unnest(value)
Or imap_dfr from purrr.
purrr::imap_dfr(list_test, ~tibble::tibble(value = .x, list_index = .y))

Another option could be:
map_dfr(list_test, ~ enframe(.) %>%
select(-name), .id = "name")
name value
<chr> <dbl>
1 1 2
2 1 4
3 1 5
4 1 6
5 2 1
6 2 2
7 2 3
8 3 7
9 3 8
Or, if you don't mind also having a column with the within-vector indexes:
map_dfr(list_test, enframe, .id = "name_list")
name_list name value
<chr> <int> <dbl>
1 1 1 2
2 1 2 4
3 1 3 5
4 1 4 6
5 2 1 1
6 2 2 2
7 2 3 3
8 3 1 7
9 3 2 8

In base R, we can use lengths to replicate the sequence and unlist the list elements into a two column 'data.frame'
data.frame(value = unlist(list_test),
list_index = rep(seq_along(list_test), lengths(list_test)))
# value list_index
#1 2 1
#2 4 1
#3 5 1
#4 6 1
#5 1 2
#6 2 2
#7 3 2
#8 7 3
#9 8 3
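The unlist/rep idea above can be wrapped into a small helper. This is only a sketch: the function name list_to_long is made up for illustration, and it assumes every list element is an atomic vector. If the list has names, they are used as the index; otherwise the position is used.

```r
# Hypothetical helper (name chosen for illustration only)
list_to_long <- function(x) {
  # Use the list names when present, otherwise the positions 1, 2, 3, ...
  idx <- if (is.null(names(x))) seq_along(x) else names(x)
  data.frame(
    value = unlist(x, use.names = FALSE),
    # Repeat each index once per element of the corresponding vector
    list_index = rep(idx, lengths(x))
  )
}

list_test <- list(c(2, 4, 5, 6), c(1, 2, 3), c(7, 8))
res <- list_to_long(list_test)
res
```

Because lengths() reports the size of each vector, vectors of unequal length need no padding.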

Related

Using "contain" function with two arguments in R

I have a dataset, for example like this:
dat1 <- read.table(header=TRUE, text="
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
")
I'd like to use the select function to gather the variables that contain Trust AND T1.
dat1 <- dat1 %>%
mutate(Trust_T1 = select(., contains("Trust")))
Does anybody know how to use two arguments there, to get Trust AND T1? If I use:
dat1 <- dat1 %>%
mutate(Trust_T1 = select(., contains("Trust"), contains("T1")))
it gives me the Variables that contain EITHER Trust or T1.
best!
If we need both, use matches with a regex to specify column names that start (^) with 'Trust' and end ($) with 'T1' (assuming these are the only patterns):
library(dplyr)
dat1 %>%
select(matches("^Trust_.*T1$"))
It is not clear what the mutate is meant to create, since multiple columns match 'Trust' followed by 'T1'. If the intention is to do some operation on the selected columns, that can be done with either across, or c_across with rowwise (the post does not say which operation is wanted).
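As an illustration of "some operation on the selected columns", here is a base R sketch that builds the AND condition explicitly and then computes a row mean. The row mean is only an assumed example of the intended operation; the original post does not say what Trust_T1 should contain.

```r
dat1 <- read.table(header = TRUE, text = "
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
")

nm <- names(dat1)
# TRUE only for names that contain "Trust" AND contain "T1"
both <- grepl("Trust", nm, fixed = TRUE) & grepl("T1", nm, fixed = TRUE)
nm[both]

# Assumed operation: mean of the selected columns per row
dat1$Trust_T1 <- rowMeans(dat1[both])
dat1$Trust_T1
```

The logical & on the two grepl results is what makes this an AND; listing two helpers separated by a comma inside select gives the union instead.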
One solution could be to combine the two selection helpers with &, so that only columns matching both conditions are kept:
library(dplyr)
df %>% select(starts_with('Trust') & contains('_T1'))
#> Trust_01_T1 Trust_02_T1 Trust_03_T1
#> 1 5 1 2
#> 2 3 1 3
#> 3 2 1 3
#> 4 4 2 5
#> 5 5 1 4
DATA
df <- read.table(text =
"
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
", header =T)

How to recode multiple variables for a subset of a dataframe?

I'm lost, so any directions would be helpful. Let's say I have a dataframe:
df <- data.frame(
id = 1:12,
v1 = rep(c(1:4), 3),
v2 = rep(c(1:3), 4),
v3 = rep(c(1:6), 2),
v4 = rep(c(1:2), 6))
My goal would be to recode 2=4 and 4=2 for variables v3 and v4 but only for the first 4 cases (id < 5). I'm looking for a solution that works for up to twenty variables. I know how to do basic recoding but I don't see a simple way to implement the subset condition while manipulating multiple variables.
Here is a base R solution,
df[1:4, c('v3', 'v4')] <- lapply(df[1:4, c('v3', 'v4')], function(i)
ifelse(i == 2, 4, ifelse(i == 4, 2, i)))
which gives,
id v1 v2 v3 v4
1 1 1 1 1 1
2 2 2 2 4 4
3 3 3 3 3 1
4 4 4 1 2 4
5 5 1 2 5 1
6 6 2 3 6 2
7 7 3 1 1 1
8 8 4 2 2 2
9 9 1 3 3 1
10 10 2 1 4 2
11 11 3 2 5 1
12 12 4 3 6 2
You can try mutate_at with case_when in dplyr
library(dplyr)
df %>%
mutate_at(vars(v3:v4), ~case_when(id < 5 & . == 4 ~ 2L,
id < 5 & . == 2 ~ 4L,
TRUE ~.))
# id v1 v2 v3 v4
#1 1 1 1 1 1
#2 2 2 2 4 4
#3 3 3 3 3 1
#4 4 4 1 2 4
#5 5 1 2 5 1
#6 6 2 3 6 2
#7 7 3 1 1 1
#8 8 4 2 2 2
#9 9 1 3 3 1
#10 10 2 1 4 2
#11 11 3 2 5 1
#12 12 4 3 6 2
With mutate_at you can specify a range of columns to apply the function to.
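In current dplyr releases mutate_at is superseded in favour of across. The following is a sketch of the same recode written with across, assuming dplyr 1.0.0 or later and the df from the question:

```r
library(dplyr)

df <- data.frame(
  id = 1:12,
  v1 = rep(1:4, 3),
  v2 = rep(1:3, 4),
  v3 = rep(1:6, 2),
  v4 = rep(1:2, 6))

res <- df %>%
  mutate(across(v3:v4, ~ case_when(
    id < 5 & .x == 4L ~ 2L,  # swap 4 -> 2, first four cases only
    id < 5 & .x == 2L ~ 4L,  # swap 2 -> 4, first four cases only
    TRUE ~ .x                # leave everything else untouched
  )))
res
```

The column range v3:v4 can be widened (or replaced by a helper such as starts_with) to cover more variables without changing the recoding logic.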
Another, more direct, option is to get the indices of the numbers to replace, and to replace them by 6 minus the number (6-4=2, 6-2=4):
whToChange <- which(df[1:4, c("v3", "v4")] == 2 | df[1:4, c("v3", "v4")] == 4, arr.ind = TRUE)
df[, c("v3", "v4")][whToChange] <- 6-df[, c("v3", "v4")][whToChange]
head(df, 5)
# id v1 v2 v3 v4
#1 1 1 1 1 1
#2 2 2 2 4 4
#3 3 3 3 3 1
#4 4 4 1 2 4
#5 5 1 2 5 1
You can use match and a lookup table, just in case you have to recode more than two values.
rosetta <- matrix(c(2,4,4,2), 2)
df[1:4, c("v3", "v4")] <- lapply(df[1:4, c("v3", "v4")], function(x) {
i <- match(x, rosetta[1,]); j <- !is.na(i); "[<-"(x, j, rosetta[2, i[j]])})
df
# id v1 v2 v3 v4
#1 1 1 1 1 1
#2 2 2 2 4 4
#3 3 3 3 3 1
#4 4 4 1 2 4
#5 5 1 2 5 1
#6 6 2 3 6 2
#7 7 3 1 1 1
#8 8 4 2 2 2
#9 9 1 3 3 1
#10 10 2 1 4 2
#11 11 3 2 5 1
#12 12 4 3 6 2
Also have a look at "R: How to recode multiple variables at once" or "Recoding multiple variables in R".

Data merge with data.table for repeating unique values

I am trying to merge two columns in data table 'A' with a column in another data table 'B' that holds the unique values of a column. I want to merge in such a way that, for every unique combination of the two variables in data table 'A', all unique values of the column in data table 'B' are repeated.
I tried merge, but it doesn't give me all the values. I also tried the automatic recycling in data.table, but this doesn't give me the result either.
Input:
data.table A
X Y
1 1
1 2
1 3
2 1
3 1
4 4
4 5
5 6
data.table B
Z
1
2
Expected output
X Y Z
1 1 1
1 1 2
1 2 1
1 2 2
1 3 1
1 3 2
2 1 1
2 1 2
3 1 1
3 1 2
4 4 1
4 4 2
4 5 1
4 5 2
5 6 1
5 6 2
We can make use of crossing from tidyr
library(tidyr)
crossing(A, B)
# X Y Z
#1 1 1 1
#2 1 1 2
#3 1 2 1
#4 1 2 2
#5 1 3 1
#6 1 3 2
#7 2 1 1
#8 2 1 2
#9 3 1 1
#10 3 1 2
#11 4 4 1
#12 4 4 2
#13 4 5 1
#14 4 5 2
#15 5 6 1
#16 5 6 2
Or with merge from base R, but the order will be slightly different
merge(A, B)
To get the correct order, swap the two arguments and then reorder the columns
merge(B, A)[c(names(A), names(B))]
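For reference, the same cross join can be spelled out by hand with row indices. This is a sketch on plain data.frames (not data.tables), with A and B rebuilt from the tables in the question; the names idx_a, idx_b and out are made up for illustration.

```r
A <- data.frame(X = c(1, 1, 1, 2, 3, 4, 4, 5),
                Y = c(1, 2, 3, 1, 1, 4, 5, 6))
B <- data.frame(Z = c(1, 2))

# Repeat every row of A once per row of B ...
idx_a <- rep(seq_len(nrow(A)), each = nrow(B))
# ... and cycle through the rows of B for each row of A
idx_b <- rep(seq_len(nrow(B)), times = nrow(A))

out <- cbind(A[idx_a, , drop = FALSE], B[idx_b, , drop = FALSE])
rownames(out) <- NULL
out
```

The result has nrow(A) * nrow(B) rows, with Z varying fastest, which matches the expected output in the question.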

merge/join two long df in R

I have two dataframes a and b which I would like to combine
a <- data.frame(g=c("1","2","2","3","3","3","4","4","4","4"),h=c("1","1","2","1","2","3","1","2","3","4"))
b <- data.frame(g=c("1","2","3","3","3","4","4","4","4","4"),i=c("1","2","3","2","1","2","3","4","5","6"))
g represents a grouping variable and h and i the columns I want to merge/join
> a
g h
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
7 4 1
8 4 2
9 4 3
10 4 4
> b
g i
1 1 1
2 2 2
3 3 3
4 3 2
5 3 1
6 4 2
7 4 3
8 4 4
9 4 5
10 4 6
a and b should be merged on the level of the grouping variable g, whereas identical values of h and i should be put together (independent of the order in which they appear in h/i), and non-identical values should appear only once (not in all possible combinations).
a final df would look like:
g h i
1 1 1 1
2 2 1 <NA>
3 2 2 2
4 3 1 1
5 3 2 2
6 3 3 3
7 4 1 <NA>
8 4 2 2
9 4 3 3
10 4 4 4
11 4 <NA> 5
12 4 <NA> 6
I need that df to perform a correlation analysis.
Sounds like a merge on h == i while retaining i, so create a new variable x to join on, and keep the join results from both sides (all=TRUE). With a large hat-tip to @Moody_Mudskipper:
merge(transform(a,x=h), transform(b,x=i), all=TRUE)
# g x h i
#1 1 1 1 1
#2 2 1 1 <NA>
#3 2 2 2 2
#4 3 1 1 1
#5 3 2 2 2
#6 3 3 3 3
#7 4 1 1 <NA>
#8 4 2 2 2
#9 4 3 3 3
#10 4 4 4 4
#11 4 5 <NA> 5
#12 4 6 <NA> 6
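The merged result still carries the helper column x, which the desired output does not have. A minimal sketch of the full base R round trip, dropping x afterwards (res is just an illustrative name):

```r
a <- data.frame(g = c("1","2","2","3","3","3","4","4","4","4"),
                h = c("1","1","2","1","2","3","1","2","3","4"))
b <- data.frame(g = c("1","2","3","3","3","4","4","4","4","4"),
                i = c("1","2","3","2","1","2","3","4","5","6"))

# Join on g and the helper key x, keeping unmatched rows from both sides
res <- merge(transform(a, x = h), transform(b, x = i), all = TRUE)

# The helper key is no longer needed once the join is done
res$x <- NULL
res
```

merge joins on all shared column names by default, which here are g and x, so no by argument is needed.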
We can also do this with dplyr
library(dplyr)
a %>%
mutate(x = h) %>%
full_join(mutate(b, x = i)) %>%
select(-x)

calculate each chunk by group using dplyr?

How can I get the expected calculation using dplyr package?
row value group expected
1 2 1 =NA
2 4 1 =4-2
3 5 1 =5-4
4 6 2 =NA
5 11 2 =11-6
6 12 1 =NA
7 15 1 =15-12
I tried
df=read.table(header=1, text=' row value group
1 2 1
2 4 1
3 5 1
4 6 2
5 11 2
6 12 1
7 15 1')
df %>% group_by(group) %>% mutate(expected=value-lag(value))
How can I calculate for each chunk (row 1-3, 4-5, 6-7) although row 1-3 and 6-7 are labelled as the same group number?
Here is a similar approach. I created a new group variable using cumsum. Whenever the difference between two numbers in group is not 0, R assigns a new group number. If you have more data, this approach may be helpful.
library(dplyr)
mutate(df, foo = cumsum(c(TRUE, diff(group) != 0))) %>%
group_by(foo) %>%
mutate(out = value - lag(value))
# row value group foo out
#1 1 2 1 1 NA
#2 2 4 1 1 2
#3 3 5 1 1 1
#4 4 6 2 2 NA
#5 5 11 2 2 5
#6 6 12 1 3 NA
#7 7 15 1 3 3
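The same run-based grouping also works without any packages. The sketch below builds the run identifier with cumsum, as described above, and uses ave to compute the within-run differences; run_id is an illustrative name, not something from the original answers.

```r
df <- read.table(header = TRUE, text = "row value group
1 2 1
2 4 1
3 5 1
4 6 2
5 11 2
6 12 1
7 15 1")

# A new run starts at row 1 and wherever group changes from the previous row
run_id <- cumsum(c(TRUE, diff(df$group) != 0))

# Within each run: NA for the first row, then value minus the previous value
df$expected <- ave(df$value, run_id, FUN = function(v) c(NA, diff(v)))
df
```

Rows 1-3 and 6-7 share the same group label but get different run_id values (1 and 3), so they are treated as separate chunks.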
As your group variable is not useful for this, create a new variable aux and use it as the grouping variable:
library(dplyr)
df$aux <- rep(seq_along(rle(df$group)$values), times = rle(df$group)$lengths)
df %>% group_by(aux) %>% mutate(expected = value - lag(value))
Source: local data frame [7 x 5]
Groups: aux
row value group aux expected
1 1 2 1 1 NA
2 2 4 1 1 2
3 3 5 1 1 1
4 4 6 2 2 NA
5 5 11 2 2 5
6 6 12 1 3 NA
7 7 15 1 3 3
Here is an option using data.table (version 1.9.5 or later). That development version introduced the functions rleid and shift (for shift, the default type is "lag" and the default fill is NA), which are useful here.
library(data.table)
setDT(df)[, expected:=value-shift(value) ,by = rleid(group)][]
# row value group expected
#1: 1 2 1 NA
#2: 2 4 1 2
#3: 3 5 1 1
#4: 4 6 2 NA
#5: 5 11 2 5
#6: 6 12 1 NA
#7: 7 15 1 3
