I have two dataframes DF1 and DF2, created like this:
A<-c("hello", "dave", "welcome", "to", "eden")
B<-1:5
C<-6:10
DF1<-data.frame(A,B,C)
D<-c("do", "you", "want", "this", "book")
E<-11:15
F<- 16:20
DF2<-data.frame(D,E,F)
Essentially, columns 2 and 3 are dimensions of the word in column 1. I want to compute cosine similarity of each word in DF1 to each word in DF2 and store it in a tabular form. Thanks for your help.
You can try the fuzzyjoin package.
library(dplyr)
library(fuzzyjoin)
A<-c("hello", "dave", "welcome", "to", "eden")
B<-1:5
C<-6:10
DF1<-data.frame(A,B,C)
D<-c("do", "you", "want", "this", "book")
E<-11:15
F<- 16:20
DF2<-data.frame(D,E,F)
DF1 %>%
stringdist_full_join(DF2, by = c('A' = 'D'),
method = "cosine",
distance_col = "distance")
It gives
A B C D E F distance
1 hello 1 6 do 11 16 0.7327388
2 hello 1 6 you 12 17 0.7817821
3 hello 1 6 want 13 18 1.0000000
4 hello 1 6 this 14 19 0.8110178
5 hello 1 6 book 15 20 0.6913933
6 dave 2 7 do 11 16 0.6464466
7 dave 2 7 you 12 17 1.0000000
8 dave 2 7 want 13 18 0.7500000
9 dave 2 7 this 14 19 1.0000000
10 dave 2 7 book 15 20 1.0000000
11 welcome 3 8 do 11 16 0.7642977
12 welcome 3 8 you 12 17 0.8075499
13 welcome 3 8 want 13 18 0.8333333
14 welcome 3 8 this 14 19 1.0000000
15 welcome 3 8 book 15 20 0.7278345
16 to 4 9 do 11 16 0.5000000
17 to 4 9 you 12 17 0.5917517
18 to 4 9 want 13 18 0.6464466
19 to 4 9 this 14 19 0.6464466
20 to 4 9 book 15 20 0.4226497
21 eden 5 10 do 11 16 0.7113249
22 eden 5 10 you 12 17 1.0000000
23 eden 5 10 want 13 18 0.7958759
24 eden 5 10 this 14 19 1.0000000
25 eden 5 10 book 15 20 1.0000000
But as you noted in the comments, you want the cosine similarity of the numeric vectors, not the string-similarity cosine, so you can do something like:
DF_full <- DF2 %>%
mutate(id = 1) %>%
inner_join(DF1 %>% mutate(id = 1), by = 'id') %>%
mutate(vector_word_1 = purrr::map2(B,C, c),
vector_word_2 = purrr::map2(E,F, c)) %>%
mutate(cosine_sim = purrr::map2(vector_word_1, vector_word_2, lsa::cosine))
DF_full
D E F id A B C vector_word_1 vector_word_2 cosine_sim
1 do 11 16 1 hello 1 6 1, 6 11, 16 0.9059667
2 do 11 16 1 dave 2 7 2, 7 11, 16 0.9479735
3 do 11 16 1 welcome 3 8 3, 8 11, 16 0.970496
4 do 11 16 1 to 4 9 4, 9 11, 16 0.9831082
5 do 11 16 1 eden 5 10 5, 10 11, 16 0.9904049
6 you 12 17 1 hello 1 6 1, 6 12, 17 0.9006583
7 you 12 17 1 dave 2 7 2, 7 12, 17 0.9439612
8 you 12 17 1 welcome 3 8 3, 8 12, 17 0.9674378
9 you 12 17 1 to 4 9 4, 9 12, 17 0.9807679
In your second comment you ask what to do when the vectors have length 100, for instance. Maybe:
df_1 <- tibble(A = rep(A, each = 100),
B = rnorm(500)) %>%
group_by(A) %>%
summarise(B = list(B))
df_2 <- tibble(D = rep(D, each = 100),
E = rnorm(500)) %>%
group_by(D) %>%
summarise(E = list(E))
DF_full <- df_2 %>%
mutate(id = 1) %>%
inner_join(df_1 %>% mutate(id = 1), by = 'id') %>%
mutate(cosine_sim = purrr::map2_dbl(B, E, lsa::cosine))
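As a side note, if you are on dplyr >= 1.1.0 (an assumption about your setup), the helper id column is not needed, since cross_join() performs the same Cartesian join directly:
DF_full <- df_2 %>%
  cross_join(df_1) %>%
  mutate(cosine_sim = purrr::map2_dbl(B, E, lsa::cosine))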
I found another way:
DF1 <- data.frame(DF1, row.names = 1) # convert the first-column words to row names
DF2 <- data.frame(DF2, row.names = 1) # convert the first-column words to row names
library(dplyr)
DF1 <- DF1 %>%
  as.matrix() # convert from data frame to matrix
DF2 <- DF2 %>%
  as.matrix() # convert from data frame to matrix
library(word2vec)
word2vec_similarity(DF1, DF2, type = "cosine")
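For reference, this should give the same 5 x 5 matrix as plain base R, since cosine similarity is just the dot product divided by the product of the vector norms (a minimal sketch using the DF1/DF2 matrices built above):
cos_sim <- (DF1 %*% t(DF2)) /
  (sqrt(rowSums(DF1^2)) %o% sqrt(rowSums(DF2^2)))
cos_sim # rows are the DF1 words, columns are the DF2 words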
I want to take differences for each pair of consecutive columns but for an arbitrary number of columns. For example...
df <- as.tibble(data.frame(group = rep(c("a", "b", "c"), each = 4),
subgroup = rep(c("adam", "boy", "charles", "david"), times = 3),
iter1 = 1:12,
iter2 = c(13:22, NA, 24),
iter3 = c(25:35, NA)))
I want to calculate the differences by column. I would normally use...
df %>%
mutate(diff_iter2 = iter2 - iter1,
diff_iter3 = iter3 - iter2)
But... I'd like to:
accommodate an arbitrary number of columns and
treat NAs such that:
if the number we're subtracting from is NA, then the result should be NA. E.g. NA - 11 = NA
if the number we're subtracting is NA, then that NA is effectively treated as a 0. E.g. 35 - NA = 35
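In other words, the two rules boil down to replacing NA with 0 only on the right-hand side of each subtraction. An illustrative helper (the name sub_na is made up here; it assumes tidyr for replace_na()):
sub_na <- function(current, previous) {
  # NA - x stays NA; x - NA is treated as x - 0 = x
  current - tidyr::replace_na(previous, 0)
}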
The result should look like this...
group subgroup iter1 iter2 iter3 diff_iter2 diff_iter3
<chr> <chr> <int> <dbl> <int> <dbl> <dbl>
1 a adam 1 13 25 12 12
2 a boy 2 14 26 12 12
3 a charles 3 15 27 12 12
4 a david 4 16 28 12 12
5 b adam 5 17 29 12 12
6 b boy 6 18 30 12 12
7 b charles 7 19 31 12 12
8 b david 8 20 32 12 12
9 c adam 9 21 33 12 12
10 c boy 10 22 34 12 12
11 c charles 11 NA 35 NA 35
12 c david 12 24 NA 12 NA
Originally, this df was in long format, but the problem is that (I believe) the lag() function operates on position within groups, and the groups aren't all the same length because some have missing records (hence the NA values in the wider table shown above).
Starting with long format would do but then please assume the records shown above with NA values would not exist in that longer dataframe.
Any help is appreciated.
An option in tidyverse would be to loop across() the 'iter' columns other than iter1. For each column, look up the previous column by taking the current column name (cur_column()) and decrementing its numeric suffix with str_replace (as.numeric(x) - 1), replace that column's NA elements with 0 (replace_na) per the OP's logic, subtract it from the looped column, and create new columns by adding a prefix in .names ("diff_{.col}", where {.col} is the original column name).
library(dplyr)
library(stringr)
library(tidyr)
df <- df %>%
mutate(across(iter2:iter3, ~
. - replace_na(get(str_replace(cur_column(), '\\d+',
function(x) as.numeric(x) - 1)), 0), .names = 'diff_{.col}'))
Output:
df
# A tibble: 12 × 7
group subgroup iter1 iter2 iter3 diff_iter2 diff_iter3
<chr> <chr> <int> <dbl> <int> <dbl> <dbl>
1 a adam 1 13 25 12 12
2 a boy 2 14 26 12 12
3 a charles 3 15 27 12 12
4 a david 4 16 28 12 12
5 b adam 5 17 29 12 12
6 b boy 6 18 30 12 12
7 b charles 7 19 31 12 12
8 b david 8 20 32 12 12
9 c adam 9 21 33 12 12
10 c boy 10 22 34 12 12
11 c charles 11 NA 35 NA 35
12 c david 12 24 NA 12 NA
Find the indices of the columns whose names start with iter (ix); then take all but the first of those columns as df1 and all but the last as df2, and replace the NAs in df2 with 0. Then subtract them and cbind df to the result. No packages are used.
ix <- grep("^iter", names(df))
df1 <- df[tail(ix, -1)]
df2 <- df[head(ix, -1)]
df2[is.na(df2)] <- 0
cbind(df, diff = df1 - df2)
giving:
group subgroup iter1 iter2 iter3 diff.iter2 diff.iter3
1 a adam 1 13 25 12 12
2 a boy 2 14 26 12 12
3 a charles 3 15 27 12 12
4 a david 4 16 28 12 12
5 b adam 5 17 29 12 12
6 b boy 6 18 30 12 12
7 b charles 7 19 31 12 12
8 b david 8 20 32 12 12
9 c adam 9 21 33 12 12
10 c boy 10 22 34 12 12
11 c charles 11 NA 35 NA 35
12 c david 12 24 NA 12 NA
Purpose
Suppose I have four variables: two are original variables and the other two are predictions of the originals. (In the actual data there are many more original variables.)
I want to use a for loop and mutate to create columns that contain the difference between each original variable and its prediction. The sample data and my current approach follow:
Sample data
set.seed(10000)
id <- sample(1:20, 100, replace=T)
set.seed(10001)
dv.1 <- sample(1:20, 100, replace=T)
set.seed(10002)
dv.2 <- sample(1:20, 100, replace=T)
set.seed(10003)
pred_dv.1 <- sample(1:20, 100, replace=T)
set.seed(10004)
pred_dv.2 <- sample(1:20, 100, replace=T)
d <-
data.frame(id, dv.1, dv.2, pred_dv.1, pred_dv.2)
Current approach (with Error)
original <- d %>% select(starts_with('dv.')) %>% names(.)
pred <- d %>% select(starts_with('pred_dv.')) %>% names(.)
for (i in 1:length(original)){
d <-
d %>%
mutate(diff = original[i] - pred[i])
l <- length(d)
colnames(d[l]) <- paste0(original[i], '.diff')
}
Error: Problem with mutate() input diff.
x non-numeric argument to binary operator
ℹ Input diff is original[i] - pred[i].
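The error occurs because original[i] and pred[i] are character strings (column names), so mutate() ends up trying to subtract one string from another. For reference, a minimal sketch that keeps the loop but indexes the data frame by those names would be:
for (i in seq_along(original)) {
  d[[paste0(original[i], '.diff')]] <- d[[original[i]]] - d[[pred[i]]]
}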
d %>%
mutate(
across(
.cols = starts_with("dv"),
.fns = ~ . - (get(paste0("pred_",cur_column()))),
.names = "diff_{.col}"
)
)
# A tibble: 100 x 7
id dv.1 dv.2 pred_dv.1 pred_dv.2 diff_dv.1 diff_dv.2
<int> <int> <int> <int> <int> <int> <int>
1 15 5 1 5 15 0 -14
2 13 4 4 5 11 -1 -7
3 12 20 13 6 13 14 0
4 20 11 8 13 3 -2 5
5 9 11 10 7 13 4 -3
6 13 3 3 6 17 -3 -14
7 3 12 19 6 17 6 2
8 19 6 7 11 4 -5 3
9 6 7 12 19 6 -12 6
10 13 10 15 6 7 4 8
# ... with 90 more rows
Subtraction can be applied to dataframes directly.
So you can create a vector of the original column names and another vector of the prediction column names, and subtract one set of columns from the other to create the new columns.
orig_var <- grep('^dv', names(d), value = TRUE)
pred_var <- grep('pred', names(d), value = TRUE)
d[paste0(orig_var, '.diff')] <- d[orig_var] - d[pred_var]
d
# id dv.1 dv.2 pred_dv.1 pred_dv.2 dv.1.diff dv.2.diff
#1 15 5 1 5 15 0 -14
#2 13 4 4 5 11 -1 -7
#3 12 20 13 6 13 14 0
#4 20 11 8 13 3 -2 5
#5 9 11 10 7 13 4 -3
#...
#...
I have a dataframe with 900 columns. I want to use tidyverse to append/bind columns in multiples of three (or another number). For example, append columns 2:3 to column 1; columns 5:6 to column 4; columns 8:9 to column 7, and so on for the entire dataframe. Thus at the end I will have 300 columns, while keeping the name of the main column (the one the other columns have been appended to).
How do I do this? Thank you very much :)
A tidyverse approach:
library(tidyverse)
# data
df = data.frame(matrix(1:27, ncol=9))
names(df) <- paste('Int', rep(1:3, each=3), 'A', rep(1:3, 3), sep='_')
n = 3
df %>%
# split the data frame into three data frames
split.default(rep(1:n, ncol(df) / n)) %>%
# rename and row bind the three data frames together
map_df(
~ set_names(.x, names(df)[c(T, rep(F, n - 1))]) %>%
tibble::rownames_to_column('gene')
)
# gene Int_1_A_1 Int_2_A_1 Int_3_A_1
#1 1 1 10 19
#2 2 2 11 20
#3 3 3 12 21
#4 1 4 13 22
#5 2 5 14 23
#6 3 6 15 24
#7 1 7 16 25
#8 2 8 17 26
#9 3 9 18 27
More notes on set_names: c(T, rep(F, n - 1)) first creates a vector of the form c(T, F, F, ...), so names(df)[c(T, rep(F, n - 1))] picks up every n-th name due to R's recycling rule.
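For illustration, with the df built above and n = 3:
c(TRUE, rep(FALSE, n - 1))
# [1]  TRUE FALSE FALSE
names(df)[c(TRUE, rep(FALSE, n - 1))]
# [1] "Int_1_A_1" "Int_2_A_1" "Int_3_A_1"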
Or if you start from a matrix, you can reshape it with array function and desired shape:
m = matrix(1:27, ncol=9)
m
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#[1,] 1 4 7 10 13 16 19 22 25
#[2,] 2 5 8 11 14 17 20 23 26
#[3,] 3 6 9 12 15 18 21 24 27
array(m, c(nrow(m) * 3, ncol(m) / 3))
# [,1] [,2] [,3]
# [1,] 1 10 19
# [2,] 2 11 20
# [3,] 3 12 21
# [4,] 4 13 22
# [5,] 5 14 23
# [6,] 6 15 24
# [7,] 7 16 25
# [8,] 8 17 26
# [9,] 9 18 27
To keep the names, you can use data.table::melt:
library(data.table)
Sample Data:
df = data.frame(matrix(1:27, ncol=9))
names(df) <- paste('Int', rep(1:3, each=3), 'A', rep(1:3, 3), sep='_')
df
# Int_1_A_1 Int_1_A_2 Int_1_A_3 Int_2_A_1 Int_2_A_2 Int_2_A_3 Int_3_A_1 Int_3_A_2 Int_3_A_3
#1 1 4 7 10 13 16 19 22 25
#2 2 5 8 11 14 17 20 23 26
#3 3 6 9 12 15 18 21 24 27
# create the patterns that group data frames
cols <- paste0('Int_', seq_len(ncol(df) / 3), '_A')
# melt the data.table based on the column patterns; this also gives an id column telling
# you which column of each group (the 1st, 2nd or 3rd) the values come from
setNames(melt(setDT(df), measure=patterns(cols)), c('id', cols))
# id Int_1_A Int_2_A Int_3_A
#1: 1 1 10 19
#2: 1 2 11 20
#3: 1 3 12 21
#4: 2 4 13 22
#5: 2 5 14 23
#6: 2 6 15 24
#7: 3 7 16 25
#8: 3 8 17 26
#9: 3 9 18 27
A solution can be achieved using tidyr::unite and tidyr::separate_rows. The approach is to first unite the columns in groups of 3 and then use tidyr::separate_rows to expand them into rows.
I have taken the data created by @Psidom in his answer. Also, I should mention that the data.table::melt-based solution is the most appropriate for this problem, but one can explore different ideas using different approaches.
library(tidyverse)
# data
df = data.frame(matrix(1:27, ncol=9))
names(df) <- paste('Int', rep(1:3, each=3), 'A', rep(1:3, 3), sep='_')
lapply(split(names(df),cut(1:ncol(df),3, labels = seq_len(ncol(df) / 3))),
function(x){unite_(df[,x], paste(x[1],x[3], sep = ":"), x, sep = ",",
remove = TRUE)}) %>%
bind_cols() %>%
separate_rows(., seq_len(ncol(.)), sep = ",")
# Int_1_A_1:Int_1_A_3 Int_2_A_1:Int_2_A_3 Int_3_A_1:Int_3_A_3
# 1 1 10 19
# 2 4 13 22
# 3 7 16 25
# 4 2 11 20
# 5 5 14 23
# 6 8 17 26
# 7 3 12 21
# 8 6 15 24
# 9 9 18 27
A base R solution:
df <- head(mtcars)[-1:-2] # 9 cols
df[(seq(df)-1) %% 3 == 0] <-
lapply(split(seq(df), (seq(df)-1) %/% 3),
function(x) apply(df[x], 1, paste, collapse="_"))
df <- df[(seq(df)-1) %% 3 == 0]
df
# disp wt am
# Mazda RX4 160_110_3.9 2.62_16.46_0 1_4_4
# Mazda RX4 Wag 160_110_3.9 2.875_17.02_0 1_4_4
# Datsun 710 108_93_3.85 2.32_18.61_1 1_4_1
# Hornet 4 Drive 258_110_3.08 3.215_19.44_1 0_3_1
# Hornet Sportabout 360_175_3.15 3.44_17.02_0 0_3_2
# Valiant 225_105_2.76 3.46_20.22_1 0_3_1
I have the following issue using R. In short, I want to create multiple new columns in a data frame based on calculations on different pairs of columns in the data frame.
The data looks as follows:
df <- data.frame(a1 = c(1:5),
b1 = c(4:8),
c1 = c(10:14),
a2 = c(9:13),
b2 = c(3:7),
c2 = c(15:19))
df
a1 b1 c1 a2 b2 c2
1 4 10 9 3 15
2 5 11 10 4 16
3 6 12 11 5 17
4 7 13 12 6 18
5 8 14 13 7 19
The output is supposed to look like the following:
a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1 4 10 9 3 15 10 7 25
2 5 11 10 4 16 12 9 27
4 7 13 12 6 18 16 13 31
5 8 14 13 7 19 18 15 33
I can achieve this using dplyr doing some manual work in the following way:
df %>% rowwise %>% mutate(sum_a = sum(a1, a2),
sum_b = sum(b1, b2),
sum_c = sum(c1, c2)) %>%
as.data.frame()
So what is being done is: take the columns with the letter "a" in them, calculate the sum rowwise, and create a new column with the sum, named sum_[letter]. Repeat for columns with other letters.
This works; however, if I have a large data set with, say, 300 different column pairs, the manual input would be significant, since I would have to write 300 mutate calls.
I recently stumbled upon the R package "purrr", and my guess is that it would let me do this in a more automated way.
In particular, I think I should be able to use purrr::map2, to which I pass two lists of column names.
list1 = all columns with the number 1 in it
list2 = all columns with the number 2 in it
Then I could calculate the sum of each matching list entry, in the form of:
map2(list1, list2, ~mutate(sum))
However, I am not able to figure out how to best approach this problem using purrr. I am rather new to using purrr, so I would really appreciate any help on this issue.
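For what it's worth, the map2 idea from the question can work almost as described, because a data frame is itself a list of columns. A minimal sketch (assuming every "1" column has a matching "2" column in the same order):
library(dplyr)
library(purrr)
list1 <- df %>% select(ends_with("1")) # a1, b1, c1
list2 <- df %>% select(ends_with("2")) # a2, b2, c2
df %>%
  bind_cols(map2_dfc(list1, list2, `+`) %>% # pairwise column sums
              rename_with(~ paste0("sum_", substr(.x, 1, 1))))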
Here is one option with purrr. We get the unique prefixes of the names of the dataset ('nm1'), use map (from purrr) to loop through those unique prefixes, select the columns that match the current prefix, add them element-wise using reduce, and then bind the resulting columns (bind_cols) to the original dataset.
library(tidyverse)
nm1 <- names(df) %>%
substr(1, 1) %>%
unique
nm1 %>%
map(~ df %>%
select(matches(.x)) %>%
reduce(`+`)) %>%
set_names(paste0("sum_", nm1)) %>%
bind_cols(df, .)
# a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#1 1 4 10 9 3 15 10 7 25
#2 2 5 11 10 4 16 12 9 27
#3 3 6 12 11 5 17 14 11 29
#4 4 7 13 12 6 18 16 13 31
#5 5 8 14 13 7 19 18 15 33
df %>%
mutate(sum_a = pmap_dbl(select(., starts_with("a")), sum),
sum_b = pmap_dbl(select(., starts_with("b")), sum),
sum_c = pmap_dbl(select(., starts_with("c")), sum))
a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1 1 4 10 9 3 15 10 7 25
2 2 5 11 10 4 16 12 9 27
3 3 6 12 11 5 17 14 11 29
4 4 7 13 12 6 18 16 13 31
5 5 8 14 13 7 19 18 15 33
EDIT:
In the case there are many columns, and you wish to apply it programmatically:
row_sums <- function(x) {
transmute(df, !! paste0("sum_", quo_name(x)) := pmap_dbl(select(df, starts_with(x)), sum))
}
newdf <- map_dfc(letters[1:3], row_sums)
newdf
sum_a sum_b sum_c
1 10 7 25
2 12 9 27
3 14 11 29
4 16 13 31
5 18 15 33
And if needed you can tack on the original variables with:
bind_cols(df, newdf)
a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1 1 4 10 9 3 15 10 7 25
2 2 5 11 10 4 16 12 9 27
3 3 6 12 11 5 17 14 11 29
4 4 7 13 12 6 18 16 13 31
5 5 8 14 13 7 19 18 15 33
In case you like to consider a base R approach, here's how you could do it:
cbind(df, lapply(split.default(df, substr(names(df), 0,1)), rowSums))
# a1 b1 c1 a2 b2 c2 a b c
#1 1 4 10 9 3 15 10 7 25
#2 2 5 11 10 4 16 12 9 27
#3 3 6 12 11 5 17 14 11 29
#4 4 7 13 12 6 18 16 13 31
#5 5 8 14 13 7 19 18 15 33
It splits the data column-wise into a list, based on the first letter of each column name (either a, b, or c).
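For illustration, the grouping vector passed to split.default() here is just:
substr(names(df), 0, 1)
# [1] "a" "b" "c" "a" "b" "c"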
If you have a large number of columns and need to group by everything except the numbers at the end of each column name, you could modify the approach to:
cbind(df, lapply(split.default(df, sub("\\d+$", "", names(df))), rowSums))
in base R, all vectorized:
nms <- names(df)
df[paste0("sum_",unique(gsub("[1-9]","",nms)))] <-
df[endsWith(nms,"1")] + df[endsWith(nms,"2")]
# a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
# 1 1 4 10 9 3 15 10 7 25
# 2 2 5 11 10 4 16 12 9 27
# 3 3 6 12 11 5 17 14 11 29
# 4 4 7 13 12 6 18 16 13 31
# 5 5 8 14 13 7 19 18 15 33
Here is another tidyverse approach that uses only the pipe and doesn't require creating new objects.
library(tidyverse)
df %>%
bind_cols(
map_dfc(.x = list("a", "b", "c"),
.f = ~ .y %>%
rowwise() %>%
transmute(!!str_c("sum_", .x) := sum(c_across(starts_with(.x)))),
.y = .)
)
#> a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#> 1 1 4 10 9 3 15 10 7 25
#> 2 2 5 11 10 4 16 12 9 27
#> 3 3 6 12 11 5 17 14 11 29
#> 4 4 7 13 12 6 18 16 13 31
#> 5 5 8 14 13 7 19 18 15 33
Explanation
The dataframe is piped into bind_cols() which binds the original columns with the newly created columns. The new columns are created with purrr::map_dfc() which takes a list of variable prefixes (.x) and the transforming function (.f). Additionally, the piped data (.) is assigned as another argument (.y). Since rowwise operations are required, rowwise() and c_across() are used in each iteration over the prefixes. transmute is used so that the original variables are not duplicated. In order to dynamically create variable names, the bang-bang operator (!!) along with := are used inside transmute.
Note
It would be shorter to use rowSums() instead of rowwise() and c_across(), but other functions can be implemented more easily with this approach.
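A sketch of that shorter rowSums() variant (same output, still assuming the tidyverse is loaded as above):
df %>%
  bind_cols(
    map_dfc(c("a", "b", "c"),
            ~ tibble(!!str_c("sum_", .x) := rowSums(select(df, starts_with(.x)))))
  )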
For a hackish tidy solution, check this out:
library(tidyr)
library(dplyr)
library(tibble) # for rownames_to_column()
df %>%
rownames_to_column(var = 'row') %>%
gather(a1:c2, key = 'key', value = 'value') %>%
extract(key, into = c('col.base', 'col.index'), regex = '([a-zA-Z]+)([0-9]+)') %>%
group_by(row, col.base) %>%
summarize(.sum = sum(value)) %>%
spread(col.base, .sum) %>%
bind_cols(df, .) %>%
select(-row)
Basically, I collect all pairs of columns with their values across all rows, separate the column names into two parts, calculate the row sums for columns with the same letter, and cast the result back to wide form.
Another solution that splits df by the numbers and then uses Reduce to calculate the sums:
library(tidyverse)
df %>%
split.default(., substr(names(.), 2, 3)) %>%
Reduce('+', .) %>%
set_names(paste0("sum_", substr(names(.), 1, 1))) %>%
cbind(df, .)
#> a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#> 1 1 4 10 9 3 15 10 7 25
#> 2 2 5 11 10 4 16 12 9 27
#> 3 3 6 12 11 5 17 14 11 29
#> 4 4 7 13 12 6 18 16 13 31
#> 5 5 8 14 13 7 19 18 15 33
Created on 2018-04-13 by the reprex package (v0.2.0).
1) dplyr/tidyr: Convert to long form, summarize, and convert back to wide form (DF here is the input data frame):
library(dplyr)
library(tidyr)
DF %>%
mutate(Row = 1:n()) %>%
gather(colname, value, -Row) %>%
group_by(g = gsub("\\d", "", colname), Row) %>%
summarize(sum = sum(value)) %>%
ungroup %>%
mutate(g = paste("sum", g, sep = "_")) %>%
spread(g, sum) %>%
arrange(Row) %>%
cbind(DF, .) %>%
select(-Row)
giving:
a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1 1 4 10 9 3 15 10 7 25
2 2 5 11 10 4 16 12 9 27
3 4 7 13 12 6 18 16 13 31
4 5 8 14 13 7 19 18 15 33
2) base using matrix multiplication
nms is a vector of the column names with the digits removed and prefaced with sum_, and u is a vector of its unique elements. Form a logical matrix from them using outer; when DF is multiplied by it, the logicals are converted to 0/1 and the matrix product gives the sums. Finally bind the result to the input.
nms <- gsub("(\\D+)\\d", "sum_\\1", names(DF))
u <- unique(nms)
sums <- as.matrix(DF) %*% outer(nms, setNames(u, u), "==")
cbind(DF, sums)
giving:
a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1 1 4 10 9 3 15 10 7 25
2 2 5 11 10 4 16 12 9 27
3 4 7 13 12 6 18 16 13 31
4 5 8 14 13 7 19 18 15 33
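For reference, the logical indicator matrix built by outer() for this example looks like:
outer(nms, setNames(u, u), "==")
#      sum_a sum_b sum_c
# [1,]  TRUE FALSE FALSE
# [2,] FALSE  TRUE FALSE
# [3,] FALSE FALSE  TRUE
# [4,]  TRUE FALSE FALSE
# [5,] FALSE  TRUE FALSE
# [6,] FALSE FALSE  TRUE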
3) base with tapply
Using nms from (2) apply tapply to each row:
cbind(DF, t(apply(DF, 1, tapply, nms, sum)))
giving:
a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1 1 4 10 9 3 15 10 7 25
2 2 5 11 10 4 16 12 9 27
3 4 7 13 12 6 18 16 13 31
4 5 8 14 13 7 19 18 15 33
You may wish to replace nms with factor(nms, levels = unique(nms)) in the above expression if the names are not in ascending order.
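That is, spelling out the suggestion above, the expression would become:
cbind(DF, t(apply(DF, 1, tapply, factor(nms, levels = unique(nms)), sum)))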
A slightly different approach using base R:
cbind(df, lapply(unique(gsub("\\d+", "", colnames(df))), function(li) {
  setNames(data.frame(V = apply(df[grep(li, colnames(df), value = TRUE)], FUN = sum, MARGIN = 1)), paste0("sum_", li))
}))
# a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#1 1 4 10 9 3 15 10 7 25
#2 2 5 11 10 4 16 12 9 27
#3 3 6 12 11 5 17 14 11 29
#4 4 7 13 12 6 18 16 13 31
#5 5 8 14 13 7 19 18 15 33
I have a big dataset with 240 cases representing 240 patients. They have all undergone neuropsychological tests and filled in questionnaires. Additionally, their significant others (hereafter: proxies) have also filled in questionnaires. Since 'patient' and 'proxy' are nested in 'couples', I want to conduct a multilevel analysis in R. For this, I need to reshape my dataset to run that kind of analysis.
Simply said, I want to 'duplicate' my rows. For the duplicated subject IDs, add a new variable with 1s and 2s, where 1 stands for patient data and 2 stands for proxy data. Then I want the rows to contain (1) all the patient data, with the columns that hold the proxy data set to NA or empty or whatever, and (2) all the proxy data, with the patient data set to NA or empty.
Let's say this is my data:
id <- c(1:5)
names <- c('id', 'p1', 'p2', 'p3', 'pr1', 'pr2', 'pr3')
p1 <- c(sample(1:10, 5))
p2 <- c(sample(10:20, 5))
p3 <- c(sample(20:30, 5))
pr1 <- c(sample(1:10, 5))
pr2 <- c(sample(10:20, 5))
pr3 <- c(sample(20:30, 5))
mydf <- as.data.frame(matrix(c(id, p1, p2, p3, pr1, pr2, pr3), nrow = 5))
colnames(mydf) <- names
>mydf
id p1 p2 p3 pr1 pr2 pr3
1 1 6 20 22 1 10 24
2 2 8 11 24 2 18 29
3 3 7 10 25 6 20 26
4 4 3 14 20 10 15 20
5 5 5 19 29 7 14 22
I want my data finally to look like this:
id2 <- rep(c(1:5), each = 2)
names2 <- c('id', 'couple', 'q1', 'q2', 'q3')
couple <- rep(1:2, 5)
p1 <- c(sample(1:10, 5))
p2 <- c(sample(10:20, 5))
p3 <- c(sample(20:30, 5))
pr1 <- c(sample(1:10, 5))
pr2 <- c(sample(10:20, 5))
pr3 <- c(sample(20:30, 5))
mydf <- as.data.frame(matrix(c(id2, couple, p1, p2, p3, pr1, pr2, pr3), nrow = 10, ncol = 5))
colnames(mydf) <- names2
>mydf
id couple q1 q2 q3
1 1 1 6 23 16
2 1 2 10 28 10
3 2 1 1 27 14
4 2 2 7 21 20
5 3 1 5 30 18
6 3 2 12 2 27
7 4 1 10 1 25
8 4 2 13 7 21
9 5 1 11 6 20
10 5 2 18 3 23
Or, if this is not possible, like this:
id couple bb1 bb2 bb3 pbb1 pbb2 pbb3
1 1 1 6 23 16
2 1 2 10 28 10
3 2 1 1 27 14
4 2 2 7 21 20
5 3 1 5 30 18
6 3 2 12 2 27
7 4 1 10 1 25
8 4 2 13 7 21
9 5 1 11 6 20
10 5 2 18 3 23
Now, to get there, I've tried the melt() function and the gather() function, and it feels like I'm close, but it's still not working the way I want it to.
Note: in my dataset the variable names are bb1:bb54 for the patient questionnaire and pbb1:pbb54 for the proxy questionnaire.
Example of what I've tried:
df_long <- df_reshape %>%
gather(testname, value, -(bb1:bb11), -(pbb1:pbb11), -id, -pgebdat, -p_age, na.rm=T) %>%
arrange(id)
If I understand what you want correctly, you can gather everything to a very long form and then reshape back to a slightly wider form:
library(tidyverse)
set.seed(47) # for reproducibility
mydf <- data.frame(id = c(1:5),
p1 = c(sample(1:10, 5)),
p2 = c(sample(10:20, 5)),
p3 = c(sample(20:30, 5)),
pr1 = c(sample(1:10, 5)),
pr2 = c(sample(10:20, 5)),
pr3 = c(sample(20:30, 5)))
mydf_long <- mydf %>%
gather(var, val, -id) %>%
separate(var, c('couple', 'q'), -2) %>%
mutate(q = paste0('q', q)) %>%
spread(q, val)
mydf_long
#> id couple q1 q2 q3
#> 1 1 p 10 17 21
#> 2 1 pr 10 11 24
#> 3 2 p 4 13 27
#> 4 2 pr 4 15 20
#> 5 3 p 7 14 30
#> 6 3 pr 1 14 29
#> 7 4 p 6 18 24
#> 8 4 pr 8 20 30
#> 9 5 p 9 16 23
#> 10 5 pr 3 18 25
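For what it's worth, with tidyr >= 1.0.0 the same reshape can be sketched with pivot_longer()/pivot_wider(), which have since superseded gather()/separate()/spread() (same structure; the numbers depend on the random draw):
mydf %>%
  pivot_longer(-id, names_to = c("couple", "q"),
               names_pattern = "(pr?)(\\d)") %>%
  pivot_wider(names_from = q, values_from = value, names_prefix = "q")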
One approach would be to use unite and separate from tidyr, along with the gather function.
I'm using your mydf data frame since it was provided, but it should be pretty straightforward to make any changes:
mydf %>%
unite(p1:p3, col = `1`, sep = ";") %>% # Combine responses of 'p1' through 'p3'
unite(pr1:pr3, col = `2`, sep = ";") %>% # Combine responses of 'pr1' through 'pr3'
gather(couple, value, `1`:`2`) %>% # Form into long data
separate(value, sep = ";", into = c("q1", "q2", "q3"), convert = TRUE) %>% # Separate and retrieve original answers
arrange(id)
Which gives you:
id couple q1 q2 q3
1 1 1 9 18 25
2 1 2 10 18 30
3 2 1 1 11 29
4 2 2 2 15 29
5 3 1 10 19 26
6 3 2 3 19 25
7 4 1 7 10 23
8 4 2 1 20 28
9 5 1 6 16 21
10 5 2 5 12 26
Our numbers are different since they were all randomly generated with sample.
Edited per @alistaire's comment: add convert = TRUE to the separate call to make sure the responses are still of class integer.