I have a big dataset with 240 cases representing 240 patients. They have all undergone neuropsychological tests and filled in questionnaires. Additionally, their significant others (hereafter: proxies) have also filled in questionnaires. Since 'patient' and 'proxy' are nested in 'couples', I want to conduct a multilevel analysis in R, and for this I need to reshape my dataset.
Simply said, I want to 'duplicate' my rows. For the doubled subject IDs I want to add a new variable with 1s and 2s, where 1 stands for patient data and 2 stands for proxy data. Then I want the rows to contain 1. all the patient data, with the columns that hold the proxy data set to NA (or empty, or whatever), and 2. all the proxy data, with all the patient columns NA or empty.
Let's say this is my data:
id <- c(1:5)
names <- c('id', 'p1', 'p2', 'p3', 'pr1', 'pr2', 'pr3')
p1 <- c(sample(1:10, 5))
p2 <- c(sample(10:20, 5))
p3 <- c(sample(20:30, 5))
pr1 <- c(sample(1:10, 5))
pr2 <- c(sample(10:20, 5))
pr3 <- c(sample(20:30, 5))
mydf <- as.data.frame(matrix(c(id, p1, p2, p3, pr1, pr2, pr3), nrow = 5))
colnames(mydf) <- names
> mydf
id p1 p2 p3 pr1 pr2 pr3
1 1 6 20 22 1 10 24
2 2 8 11 24 2 18 29
3 3 7 10 25 6 20 26
4 4 3 14 20 10 15 20
5 5 5 19 29 7 14 22
I want my data finally to look like this:
id2 <- rep(c(1:5), each = 2)
couple <- rep(1:2, 5)
p1 <- c(sample(1:10, 5))
p2 <- c(sample(10:20, 5))
p3 <- c(sample(20:30, 5))
pr1 <- c(sample(1:10, 5))
pr2 <- c(sample(10:20, 5))
pr3 <- c(sample(20:30, 5))
# interleave patient and proxy values so each couple gets two adjacent rows
mydf2 <- data.frame(id = id2, couple = couple,
                    q1 = c(rbind(p1, pr1)),
                    q2 = c(rbind(p2, pr2)),
                    q3 = c(rbind(p3, pr3)))
> mydf2
id couple q1 q2 q3
1 1 1 6 23 16
2 1 2 10 28 10
3 2 1 1 27 14
4 2 2 7 21 20
5 3 1 5 30 18
6 3 2 12 2 27
7 4 1 10 1 25
8 4 2 13 7 21
9 5 1 11 6 20
10 5 2 18 3 23
Or, if this is not possible, like this:
   id couple bb1 bb2 bb3 pbb1 pbb2 pbb3
1   1      1   6  23  16   NA   NA   NA
2   1      2  NA  NA  NA   10   28   10
3   2      1   1  27  14   NA   NA   NA
4   2      2  NA  NA  NA    7   21   20
5   3      1   5  30  18   NA   NA   NA
6   3      2  NA  NA  NA   12    2   27
7   4      1  10   1  25   NA   NA   NA
8   4      2  NA  NA  NA   13    7   21
9   5      1  11   6  20   NA   NA   NA
10  5      2  NA  NA  NA   18    3   23
Now, to get there, I've tried the melt() function and the gather() function, and it feels like I'm close, but it's still not working the way I want it to.
Note: in my real dataset the variable names are bb1:bb54 for the patient questionnaire and pbb1:pbb54 for the proxy questionnaire.
Example of what I've tried:
df_long <- df_reshape %>%
gather(testname, value, -(bb1:bb11), -(pbb1:pbb11), -id, -pgebdat, -p_age, na.rm=T) %>%
arrange(id)
If I understand what you want correctly, you can gather everything to a very long form and then reshape back to a slightly wider form:
library(tidyverse)
set.seed(47) # for reproducibility
mydf <- data.frame(id = c(1:5),
p1 = c(sample(1:10, 5)),
p2 = c(sample(10:20, 5)),
p3 = c(sample(20:30, 5)),
pr1 = c(sample(1:10, 5)),
pr2 = c(sample(10:20, 5)),
pr3 = c(sample(20:30, 5)))
mydf_long <- mydf %>%
gather(var, val, -id) %>%
separate(var, c('couple', 'q'), sep = -1) %>% # split off the trailing question number
mutate(q = paste0('q', q)) %>%
spread(q, val)
mydf_long
#> id couple q1 q2 q3
#> 1 1 p 10 17 21
#> 2 1 pr 10 11 24
#> 3 2 p 4 13 27
#> 4 2 pr 4 15 20
#> 5 3 p 7 14 30
#> 6 3 pr 1 14 29
#> 7 4 p 6 18 24
#> 8 4 pr 8 20 30
#> 9 5 p 9 16 23
#> 10 5 pr 3 18 25
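As an aside, gather() and spread() have since been superseded in tidyr; on tidyr 1.0 or later the same reshape can be written with pivot_longer() and pivot_wider(). A sketch of the equivalent (my translation, not part of the original answer):
mydf %>%
  pivot_longer(-id,
               names_to = c('couple', 'q'),
               names_pattern = '(p|pr)(\\d+)') %>%
  mutate(q = paste0('q', q)) %>%
  pivot_wider(names_from = q, values_from = value)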
One approach would be to use unite and separate from tidyr, along with gather.
I'm using your mydf data frame since it was provided, but it should be pretty straightforward to make any changes:
library(tidyverse)
mydf %>%
unite(p1:p3, col = `1`, sep = ";") %>% # Combine responses of 'p1' through 'p3'
unite(pr1:pr3, col = `2`, sep = ";") %>% # Combine responses of 'pr1' through 'pr3'
gather(couple, value, `1`:`2`) %>% # Form into long data
separate(value, sep = ";", into = c("q1", "q2", "q3"), convert = TRUE) %>% # Separate and retrieve original answers
arrange(id)
Which gives you:
id couple q1 q2 q3
1 1 1 9 18 25
2 1 2 10 18 30
3 2 1 1 11 29
4 2 2 2 15 29
5 3 1 10 19 26
6 3 2 3 19 25
7 4 1 7 10 23
8 4 2 1 20 28
9 5 1 6 16 21
10 5 2 5 12 26
Our numbers are different since they were all randomly generated with sample.
Edited per @alistaire's comment: add convert = TRUE to the separate call to make sure the responses keep their integer class.
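If you instead want the alternative wide layout from the question (bb*/pbb* columns with the off-role cells left NA), one option is to stack a patient copy and a proxy copy of the rows and let bind_rows() fill the missing columns with NA. A sketch on the toy mydf, where bb1:bb3/pbb1:pbb3 stand in for the real bb1:bb54/pbb1:pbb54:
library(dplyr)
patient <- mydf %>%
  select(id, bb1 = p1, bb2 = p2, bb3 = p3) %>%
  mutate(couple = 1)
proxy <- mydf %>%
  select(id, pbb1 = pr1, pbb2 = pr2, pbb3 = pr3) %>%
  mutate(couple = 2)
bind_rows(patient, proxy) %>% arrange(id, couple) # absent columns become NA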
I have two dataframes DF1 and DF2, created like this:
A<-c("hello", "dave", "welcome", "to", "eden")
B<-1:5
C<-6:10
DF1<-data.frame(A,B,C)
D<-c("do", "you", "want", "this", "book")
E<-11:15
F<- 16:20
DF2<-data.frame(D,E,F)
Essentially, columns 2 and 3 are dimensions of the word in column 1. I want to compute the cosine similarity of each word in DF1 with each word in DF2 and store the result in tabular form. Thanks for your help.
You can try the fuzzyjoin package.
library(dplyr)
library(fuzzyjoin)
A<-c("hello", "dave", "welcome", "to", "eden")
B<-1:5
C<-6:10
DF1<-data.frame(A,B,C)
D<-c("do", "you", "want", "this", "book")
E<-11:15
F<- 16:20
DF2<-data.frame(D,E,F)
DF1 %>%
stringdist_full_join(DF2, by = c('A' = 'D'),
method = "cosine",
distance_col = "distance")
It gives
A B C D E F distance
1 hello 1 6 do 11 16 0.7327388
2 hello 1 6 you 12 17 0.7817821
3 hello 1 6 want 13 18 1.0000000
4 hello 1 6 this 14 19 0.8110178
5 hello 1 6 book 15 20 0.6913933
6 dave 2 7 do 11 16 0.6464466
7 dave 2 7 you 12 17 1.0000000
8 dave 2 7 want 13 18 0.7500000
9 dave 2 7 this 14 19 1.0000000
10 dave 2 7 book 15 20 1.0000000
11 welcome 3 8 do 11 16 0.7642977
12 welcome 3 8 you 12 17 0.8075499
13 welcome 3 8 want 13 18 0.8333333
14 welcome 3 8 this 14 19 1.0000000
15 welcome 3 8 book 15 20 0.7278345
16 to 4 9 do 11 16 0.5000000
17 to 4 9 you 12 17 0.5917517
18 to 4 9 want 13 18 0.6464466
19 to 4 9 this 14 19 0.6464466
20 to 4 9 book 15 20 0.4226497
21 eden 5 10 do 11 16 0.7113249
22 eden 5 10 you 12 17 1.0000000
23 eden 5 10 want 13 18 0.7958759
24 eden 5 10 this 14 19 1.0000000
25 eden 5 10 book 15 20 1.0000000
But as you clarified, you want the cosine similarity of the numeric vectors, not the string-similarity cosine, so you can do something like:
DF_full <- DF2 %>%
mutate(id = 1) %>%
inner_join(DF1 %>% mutate(id = 1), by = 'id') %>%
mutate(vector_word_1 = purrr::map2(B,C, c),
vector_word_2 = purrr::map2(E,F, c)) %>%
mutate(cosine_sim = purrr::map2(vector_word_1, vector_word_2, lsa::cosine))
DF_full
D E F id A B C vector_word_1 vector_word_2 cosine_sim
1 do 11 16 1 hello 1 6 1, 6 11, 16 0.9059667
2 do 11 16 1 dave 2 7 2, 7 11, 16 0.9479735
3 do 11 16 1 welcome 3 8 3, 8 11, 16 0.970496
4 do 11 16 1 to 4 9 4, 9 11, 16 0.9831082
5 do 11 16 1 eden 5 10 5, 10 11, 16 0.9904049
6 you 12 17 1 hello 1 6 1, 6 12, 17 0.9006583
7 you 12 17 1 dave 2 7 2, 7 12, 17 0.9439612
8 you 12 17 1 welcome 3 8 3, 8 12, 17 0.9674378
9 you 12 17 1 to 4 9 4, 9 12, 17 0.9807679
In your second comment you ask what to do when the vector length is, for instance, 100. Maybe:
df_1 <- tibble(A = rep(A, each = 100),
B = rnorm(500)) %>%
group_by(A) %>%
summarise(B = list(B))
df_2 <- tibble(D = rep(D, each = 100),
E = rnorm(500)) %>%
group_by(D) %>%
summarise(E = list(E))
DF_full <- df_2 %>%
mutate(id = 1) %>%
inner_join(df_1 %>% mutate(id = 1), by = 'id') %>%
mutate(cosine_sim = purrr::map2_dbl(B, E, lsa::cosine))
I found another way:
DF1 <- data.frame(DF1, row.names = 1) # use the first column of words as row names
DF2 <- data.frame(DF2, row.names = 1) # use the first column of words as row names
library(dplyr)
DF1 <- DF1 %>%
  as.matrix() ## convert from data frame to matrix
DF2 <- DF2 %>%
  as.matrix() ## convert from data frame to matrix
library(word2vec)
word2vec_similarity(DF1, DF2, type = "cosine")
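If you prefer to avoid extra packages, the full similarity matrix also follows directly from matrix algebra (cosine = dot product divided by the product of the vector norms); a small base R sketch using the DF1/DF2 matrices built above (cosine_matrix is my own helper name):
cosine_matrix <- function(m1, m2) {
  # result[i, j] = cosine similarity between row i of m1 and row j of m2
  dots <- m1 %*% t(m2) # pairwise dot products
  norms <- outer(sqrt(rowSums(m1^2)), sqrt(rowSums(m2^2))) # products of norms
  dots / norms
}
cosine_matrix(DF1, DF2)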
I am trying to find the angle from a line to the closest point, for multiple cases within and across time points. I have a data set that looks something like this: four points in group 1, four in group 2, and one in group 3.
x <- sample(1:50, 27)
y <- sample(1:50, 27)
group <- c(1,1,1,1,2,2,2,2,3,1,1,1,1,2,2,2,2,3,1,1,1,1,2,2,2,2,3)
id <- rep(seq(1,9,1), 3)
time <- rep(1:3, each = 9)
df <- as.data.frame(cbind(x, y, group, id, time))
x y group id time
1 25 36 1 1 1
2 49 35 1 2 1
3 41 27 1 3 1
4 28 47 1 4 1
5 7 3 2 5 1
6 46 25 2 6 1
7 15 7 2 7 1
8 32 15 2 8 1
9 38 29 3 9 1
10 19 4 1 1 2
11 18 14 1 2 2
12 8 37 1 3 2
13 29 8 1 4 2
14 6 1 2 5 2
15 30 6 2 6 2
16 10 19 2 7 2
17 45 49 2 8 2
18 40 43 3 9 2
19 17 48 1 1 3
20 27 21 1 2 3
21 26 20 1 3 3
22 33 50 1 4 3
23 16 16 2 5 3
24 23 46 2 6 3
25 21 26 2 7 3
26 13 31 2 8 3
27 11 41 3 9 3
The item in group 3 is used to identify which point is the base of all of the lines. In this example, for time 1, id 3 in group 1 is closest. This signals that a line should be made from it to all other points in group 1 (3-1, 3-2 and 3-4). I then need to identify which id in group 2 is closest to each of the 3 lines. For example, point 6 might be closest to the line 3-2, and from that I would calculate the angle in points 6-3-2. I need to calculate this for all other lines at this time, and then perform this again across all other times.
The following code identifies the base point for the lines (it is not optimal, but I need the other data it calculates for other uses):
#### calculate distance between all points
library(dplyr)
library(purrr)
distance = function(x1,x2,y1,y2) sqrt(((x2-x1)^2)+((y2-y1)^2)) # Euclidean distance function
distance2 = function(x,y,.pred) distance(x, x[.pred], y, y[.pred]) # distance from every point to one reference point
distance3 = function(x, y, id){
  dists = map(1:9, ~distance2(x, y, which(id == .x))) # distances to each of the 9 ids
  dists
}
#use distance formula
df2 <- df %>%
group_by(time) %>%
mutate(distances=distance3(x, y, id))
distances <- df2$distances # extract distance list
distances <- do.call(rbind.data.frame, distances) # change list to dataframe
colnames(distances) <- c(paste0("dist", 1:9)) # change column names
df <- cbind(df,distances) # merge dataframes
group3 <- df %>% filter(group == 3)
df <- df %>% filter(group == 1 | group == 2) #remove group 3 (id 9 from data as no longer needed)
#new columns with id and group as closest to position of id 9
df <- df %>% group_by(time) %>% mutate(closest = id[which.min(dist9)]) %>%
mutate(closest.group = group[which.min(dist9)]) %>% ungroup
This is about as far as I can get on my own. I have found the following function on here, which I can use to calculate the distance of a point to a line in an individual case, but I have no idea how to integrate it across the multiple time periods and conditions.
dist2d <- function(a,b,c) {
  v1 <- b - c # direction of the line through b and c
  v2 <- a - b # vector from the line to the point a
  m <- cbind(v1,v2)
  d <- abs(det(m))/sqrt(sum(v1*v1)) # parallelogram area / base length = height
  d
}
For clarification, the line only goes between the two points and does not extend to infinity.
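Two building blocks may help here. Since the line is a segment rather than an infinite line, dist2d can be adapted by clamping the projection onto the segment; and once the closest group 2 point is identified, the angle at the base point follows from the dot product. A sketch under those assumptions (the helper names are mine, not from the question):
# distance from point a to the segment b-c (all length-2 numeric vectors c(x, y))
dist_to_segment <- function(a, b, c) {
  v <- c - b
  t <- sum((a - b) * v) / sum(v * v) # projection parameter along b -> c
  t <- max(0, min(1, t)) # clamp so the foot of the perpendicular stays on the segment
  foot <- b + t * v
  sqrt(sum((a - foot)^2))
}
# angle (in degrees) at vertex v formed by rays v -> p and v -> q
angle_at <- function(v, p, q) {
  u1 <- p - v
  u2 <- q - v
  acos(sum(u1 * u2) / sqrt(sum(u1^2) * sum(u2^2))) * 180 / pi
}
# e.g. the angle in points 6-3-2 at time 1, with point 3 as the vertex
pt <- function(i, t) unlist(df[df$id == i & df$time == t, c('x', 'y')])
angle_at(pt(3, 1), pt(6, 1), pt(2, 1))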
Purpose
Suppose I have four variables: Two variables are original variables and the other two variables are the predictions of the original variables. (In actual data, there are a greater number of original variables)
I want to use a for loop and mutate to create columns that compute the difference between each original variable and its prediction. The sample data and my current approach follow:
Sample data
set.seed(10000)
id <- sample(1:20, 100, replace=T)
set.seed(10001)
dv.1 <- sample(1:20, 100, replace=T)
set.seed(10002)
dv.2 <- sample(1:20, 100, replace=T)
set.seed(10003)
pred_dv.1 <- sample(1:20, 100, replace=T)
set.seed(10004)
pred_dv.2 <- sample(1:20, 100, replace=T)
d <-
data.frame(id, dv.1, dv.2, pred_dv.1, pred_dv.2)
Current approach (with Error)
original <- d %>% select(starts_with('dv.')) %>% names(.)
pred <- d %>% select(starts_with('pred_dv.')) %>% names(.)
for (i in 1:length(original)){
d <-
d %>%
mutate(diff = original[i] - pred[i])
l <- length(d)
colnames(d[l]) <- paste0(original[i], '.diff')
}
Error: Problem with `mutate()` input `diff`.
x non-numeric argument to binary operator
ℹ Input `diff` is `original[i] - pred[i]`.
You can avoid the loop with across(), using cur_column() to look up the matching prediction column:
d %>%
mutate(
across(
.cols = starts_with("dv"),
.fns = ~ . - (get(paste0("pred_",cur_column()))),
.names = "diff_{.col}"
)
)
# A tibble: 100 x 7
id dv.1 dv.2 pred_dv.1 pred_dv.2 diff_dv.1 diff_dv.2
<int> <int> <int> <int> <int> <int> <int>
1 15 5 1 5 15 0 -14
2 13 4 4 5 11 -1 -7
3 12 20 13 6 13 14 0
4 20 11 8 13 3 -2 5
5 9 11 10 7 13 4 -3
6 13 3 3 6 17 -3 -14
7 3 12 19 6 17 6 2
8 19 6 7 11 4 -5 3
9 6 7 12 19 6 -12 6
10 13 10 15 6 7 4 8
# ... with 90 more rows
Subtraction can be applied on dataframes directly.
So you can create a vector of original column names and another vector of prediction column names, and subtract them to create the new columns.
orig_var <- grep('^dv', names(d), value = TRUE)
pred_var <- grep('pred', names(d), value = TRUE)
d[paste0(orig_var, '.diff')] <- d[orig_var] - d[pred_var]
d
# id dv.1 dv.2 pred_dv.1 pred_dv.2 dv.1.diff dv.2.diff
#1 15 5 1 5 15 0 -14
#2 13 4 4 5 11 -1 -7
#3 12 20 13 6 13 14 0
#4 20 11 8 13 3 -2 5
#5 9 11 10 7 13 4 -3
#...
#...
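Note (my addition): this relies on grep() returning the original and prediction columns in matching order, so that they are subtracted pairwise. A quick guard makes that assumption explicit:
stopifnot(identical(paste0('pred_', orig_var), pred_var)) # fail early on a mismatch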
I have a dataframe:
dataframe <- data.frame(Condition = rep(c(1,2,3), each = 5, times = 2),
Time = sort(sample(1:60, 30)))
Condition Time
1 1 1
2 1 3
3 1 4
4 1 7
5 1 9
6 2 11
7 2 12
8 2 14
9 2 16
10 2 18
11 3 19
12 3 24
13 3 25
14 3 28
15 3 30
16 1 31
17 1 34
18 1 35
19 1 38
20 1 39
21 2 40
22 2 42
23 2 44
24 2 47
25 2 48
26 3 49
27 3 54
28 3 55
29 3 57
30 3 59
I want to divide the total length of Time (i.e., max(Time) - min(Time)) per Condition by a constant 'x' (e.g., 3). Then I want to use that quotient to add a new variable Trial such that my dataframe looks like this:
Condition Time Trial
1 1 1 A
2 1 3 A
3 1 4 B
4 1 7 C
5 1 9 C
6 2 11 A
7 2 12 A
8 2 14 B
9 2 16 C
10 2 18 C
... and so on
As you can see, for Condition 1, Trial is populated with unique identifying values (e.g., A, B, C) every 2.67 seconds = 8 (total time) / 3. For Condition 2, Trial is populated every 2.33 seconds = 7 (total time) /3.
I am not getting what I want with my current code:
dataframe %>%
group_by(Condition) %>%
mutate(Trial = LETTERS[cut(Time, 3, labels = F)])
# Groups: Condition [3]
Condition Time Trial
<dbl> <int> <chr>
1 1 1 A
2 1 3 A
3 1 4 A
4 1 7 A
5 1 9 A
6 2 11 A
7 2 12 A
8 2 14 A
9 2 16 A
10 2 18 A
# ... with 20 more rows
Thanks!
We can split each block's Time range into 3 equal-width intervals, each spanning diff(range(Time))/3 seconds, by passing 3 as the breaks argument of cut. Then use the integer index (labels = FALSE) to pick the corresponding letter from the LETTERS built-in R constant:
library(dplyr)
dataframe %>%
  group_by(Condition) %>%
  mutate(Trial = LETTERS[cut(Time, breaks = 3, labels = FALSE)])
Since each value of 'Condition' occurs in two separate blocks, the grouping should be based on adjacent runs of values in 'Condition': use rleid from data.table on the 'Condition' column to create the grouping, and apply the same code as above:
library(data.table)
dataframe %>%
  group_by(grp = rleid(Condition)) %>%
  mutate(Trial = LETTERS[cut(Time, breaks = 3, labels = FALSE)])
Here's a one-liner using my santoku package. The rleid grouping is the same as in @akrun's solution:
library(magrittr) # for %<>%
library(santoku)
dataframe %<>%
  group_by(grp = data.table::rleid(Condition)) %>%
  mutate(
    Trial = chop_evenly(Time, intervals = 3, labels = lbl_seq("A"))
  )
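For completeness, a base R version of the same idea (my sketch), applied to the original dataframe and reusing data.table::rleid for the run ids, does the per-block cut with ave():
grp <- data.table::rleid(dataframe$Condition)
dataframe$Trial <- LETTERS[ave(dataframe$Time, grp,
                               FUN = function(t) cut(t, breaks = 3, labels = FALSE))]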
I need help with a merge (VLOOKUP-style) problem that I cannot solve.
I have 2 data frames I would like to merge; in addition, they have different column names. My real datasets have many columns, and that is why it is hard for me to come up with a solution.
I have tried the merge function, but I cannot figure out how to do it on multiple columns with different names. I would like to explicitly specify
the column names using something like:
output <- merge(df1, df.vlookup, by.x=????, by.y=???, ) #just where I am today
Here is a very simplified example
id<-c(2,4,6,8,10,12,14,16,18,20,22,24,26,28)
bike <- c(1,3,2,1,1,1,2,3,2,3,1,1,1,1)
size <- c(1,2,1,2,1,2,1,2,1,2,1,2,1,2)
color <- c(10,11,13,15,12,12,12,11,11,14,12,11,10,10)
price <- c(1,2,2,2,1,3,1,1,2,1,2,1,2,1)
df1 <- data.frame(id,bike,size,color,price)
id bike size color price
1 2 1 1 10 1
2 4 3 2 11 2
3 6 2 1 13 2
4 8 1 2 15 2
5 10 1 1 12 1
6 12 1 2 12 3
7 14 2 1 12 1
8 16 3 2 11 1
9 18 2 1 11 2
10 20 3 2 14 1
11 22 1 1 12 2
12 24 1 2 11 1
13 26 1 1 10 2
14 28 1 2 10 1
b1<-c(1,2,3)
b2<-c("Alan", "CCM", "Basso")
s1 <- c(1,2)
s2 <- c("L","S")
c1<-c(10,11,12,13,14,15)
c2 <-c("black","blue","green","red","pink")
p1<- c(1,2,3)
p2<- c(1000,2000,3000)
#trick for making a dataframe with unequal vector length
na.pad <- function(x,len){
x[1:len]
}
makePaddedDataFrame <- function(l,...){
maxlen <- max(sapply(l,length))
data.frame(lapply(l,na.pad,len=maxlen),...)
}
df.vlookup <- makePaddedDataFrame(list(b1=b1,b2=b2,s1=s1,s2=s2,c1=c1,c2=c2,p1=p1,p2=p2))
> df.vlookup
b1 b2 s1 s2 c1 c2 p1 p2
1 1 Alan 1 L 10 black 1 1000
2 2 CCM 2 S 11 blue 2 2000
3 3 Basso NA <NA> 12 green 3 3000
4 NA <NA> NA <NA> 13 red NA NA
5 NA <NA> NA <NA> 14 pink NA NA
6 NA <NA> NA <NA> 15 <NA> NA NA
Here is a dataframe that I would like to end up with:
> df.final
id bike b2 size s2 color c2 price
1 2 1 Alan 1 L 10 black 1
2 4 3 Basso 2 S 11 blue 2
3 6 2 CCM 1 L 13 red 2
4 8 1 Alan 2 S 15 #N/A 2
5 10 1 Alan 1 L 12 green 1
6 12 1 Alan 2 S 12 green 3
7 14 2 CCM 1 L 12 green 1
8 16 3 Basso 2 S 11 blue 1
9 18 2 CCM 1 L 11 blue 2
10 20 3 Basso 2 S 14 pink 1
11 22 1 Alan 1 L 12 green 2
12 24 1 Alan 2 S 11 blue 1
13 26 1 Alan 1 L 10 black 2
14 28 1 Alan 2 S 10 black 1
Really appreciate some help on this...
I don't think a single data frame for lookup values is the right approach. What about using named vectors?
For example:
bike_names <- c("Alan" = 1, "CCM" = 2, "Basso" = 3)
df1$b2 <- names(bike_names[ df1$bike ])
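The same pattern extends to the other columns. One caveat (my note): bike_names[df1$bike] indexes by position, which only works here because the bike codes happen to be 1:3; for codes like the colors (10:15), match() is the safer general form:
size_names <- c('L' = 1, 'S' = 2)
color_names <- c('black' = 10, 'blue' = 11, 'green' = 12,
                 'red' = 13, 'pink' = 14)
# look up by value rather than by position
df1$s2 <- names(size_names)[match(df1$size, size_names)]
df1$c2 <- names(color_names)[match(df1$color, color_names)]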
If using data frames, put each lookup table in a separate data frame.
lookup <- list(
bike = data.frame( bike = c(1, 2, 3), bike_name = c("Alan", "CCM", "Basso")),
size = data.frame(size = c(1, 2), size_name = c("L", "S")),
color = data.frame(color = c(10, 11, 12, 13, 14, 15), color_name = c("black", "blue", "green", "red", "pink", NA)),
price = data.frame(price = c(1, 2, 3), price_name = c(1000, 2000, 3000))
)
And use it with merge:
Reduce(merge, c(data = list(df1), lookup))
Or use dplyr and joins:
library(dplyr)
df1 %>%
left_join(lookup$bike, by = c("bike")) %>%
left_join(lookup$size, by = c("size")) %>%
left_join(lookup$color, by = c("color")) %>%
left_join(lookup$price, by = c("price"))
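With many lookup tables, the repeated left_join() calls can also be folded up with purrr::reduce(); a small sketch (my addition):
library(purrr)
reduce(lookup, left_join, .init = df1)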
Update
But if you really want to start from the df.vlookup data frame, you can convert it to a list of data frames like this:
lookup <- lapply(seq(1, to = ncol(df.vlookup), by = 2), function(i) {
  # pair column i (codes) with column i+1 (labels), naming the code column
  # after the matching column of df1
  setNames(df.vlookup[, c(i, i + 1)],
           c(names(df1)[(i + 1)/2 + 1], names(df.vlookup)[i + 1]))
})
And use it in a multiple merge:
Reduce(merge, c(data = list(df1), lookup))
NOTE: when creating the lookup list this way, there are some assumptions about the column order: df.vlookup must hold code/label column pairs in the same order as the corresponding columns of df1.