I am trying to find the angle from a line to its closest point, in multiple cases within and across time. I have a data set that looks something like this: at each time there are four points in group 1, four in group 2 and one in group 3.
x <- sample(1:50, 27)
y <- sample(1:50, 27)
group <- c(1,1,1,1,2,2,2,2,3,1,1,1,1,2,2,2,2,3,1,1,1,1,2,2,2,2,3)
id <- rep(seq(1,9,1), 3)
time <- rep(1:3, each = 9)
df <- as.data.frame(cbind(x, y, group, id, time))
x y group id time
1 25 36 1 1 1
2 49 35 1 2 1
3 41 27 1 3 1
4 28 47 1 4 1
5 7 3 2 5 1
6 46 25 2 6 1
7 15 7 2 7 1
8 32 15 2 8 1
9 38 29 3 9 1
10 19 4 1 1 2
11 18 14 1 2 2
12 8 37 1 3 2
13 29 8 1 4 2
14 6 1 2 5 2
15 30 6 2 6 2
16 10 19 2 7 2
17 45 49 2 8 2
18 40 43 3 9 2
19 17 48 1 1 3
20 27 21 1 2 3
21 26 20 1 3 3
22 33 50 1 4 3
23 16 16 2 5 3
24 23 46 2 6 3
25 21 26 2 7 3
26 13 31 2 8 3
27 11 41 3 9 3
The item in group 3 is used to identify which point is the base of all of the lines. In this example, for time 1, id 3 in group 1 is closest. This signals that a line should be made from it to every other point in group 1 (3-1, 3-2 and 3-4). I then need to identify which id in group 2 is closest to each of these 3 lines. For example, point 6 might be closest to the line 3-2, and from that I would calculate the angle formed by the points 6-3-2. I need to calculate this for all other lines at this time, and then repeat the whole process across all other times.
The following code identifies the base point for the lines (it is not optimal, but I need the other data it calculates for other uses):
#### calculate distance between all points
library(dplyr)
library(purrr)  # for map()

distance  <- function(x1, x2, y1, y2) sqrt((x2 - x1)^2 + (y2 - y1)^2)  # Euclidean distance
distance2 <- function(x, y, .pred) distance(x, x[.pred], y, y[.pred])  # distance of every point to one reference id
distance3 <- function(x, y, id){
  # list of distance vectors, one per id 1-9
  map(1:9, ~ distance2(x, y, which(id == .x)))
}
#use distance formula
df2 <- df %>%
group_by(time) %>%
mutate(distances=distance3(x, y, id))
distances <- df2$distances # extract distance list
distances <- do.call(rbind.data.frame, distances) # change list to dataframe
colnames(distances) <- c(paste0("dist", 1:9)) # change column names
df <- cbind(df,distances) # merge dataframes
group3 <- df %>% filter(group == 3)
df <- df %>% filter(group == 1 | group == 2) #remove group 3 (id 9 from data as no longer needed)
#new columns with id and group as closest to position of id 9
df <- df %>% group_by(time) %>% mutate(closest = id[which.min(dist9)]) %>%
mutate(closest.group = group[which.min(dist9)]) %>% ungroup
This is about as far as I can get on my own. I have found the following function on here, which I can use to calculate the distance of a point to a line in an individual case, but I have no idea how to integrate it across the multiple time periods and with the conditions above.
dist2d <- function(a, b, c) {
  # perpendicular distance from point a to the (infinite) line through points b and c
  v1 <- b - c
  v2 <- a - b
  m  <- cbind(v1, v2)
  abs(det(m)) / sqrt(sum(v1 * v1))
}
For clarification, the line only goes between the two points and does not extend to infinity.
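For reference, here is a minimal sketch of the two pieces this seems to need, written under the segment interpretation above; the helper names dist_to_segment and angle_at are made up for illustration and are not from any answer.
# sketch only: distance from point p to the segment a-b (not the infinite line)
dist_to_segment <- function(p, a, b) {
  ab <- b - a
  t  <- sum((p - a) * ab) / sum(ab * ab)  # where the projection of p lands on the line a-b
  t  <- max(0, min(1, t))                 # clamp so the foot stays on the segment
  sqrt(sum((p - (a + t * ab))^2))
}
# sketch only: angle (in degrees) at vertex b formed by points a-b-c, e.g. the angle 6-3-2 at point 3
angle_at <- function(a, b, c) {
  v1 <- a - b
  v2 <- c - b
  acos(sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))) * 180 / pi
}
Within each time, one could then loop over the lines from the base point to the other group-1 points, pick the group-2 id with the smallest dist_to_segment for each line, and pass the three coordinate pairs to angle_at.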
I have this data frame coming out of a certain participant's behaviour in an episodic task. Let's say the episode starts at 90 and finishes when we get a certain trigger that can be in the 40s range. I am making a sample dataframe with one column for the row numbers and the other for the actual triggers.
ex1 <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
ex2 <- c(41,1,1,90,1,1,1,44,1,90,1,2,42,1,1,1,1,90,1,41)
df <- data.frame(ex1,ex2)
> df
ex1 ex2
1 1 41
2 2 1
3 3 1
4 4 90
5 5 1
6 6 1
7 7 1
8 8 44
9 9 1
10 10 90
11 11 1
12 12 2
13 13 42
14 14 1
15 15 1
16 16 1
17 17 1
18 18 90
19 19 1
20 20 41
Now, what I am trying to do is remove all the rows that are outside the beginning and the end of the episode, as they are recordings of typed behaviour that is not interesting as it falls outside of the episode. Therefore, I want to end up with a dataframe like this:
ex1 <- c(1,4,5,6,7,8,10,11,12,13,18,19,20)
ex2 <- c(41,90,1,1,1,44,90,1,2,42,90,1,41)
df <- data.frame(ex1,ex2)
> df
ex1 ex2
1 1 41
2 4 90
3 5 1
4 6 1
5 7 1
6 8 44
7 10 90
8 11 1
9 12 2
10 13 42
11 18 90
12 19 1
13 20 41
I have been trying to use subset but I cannot make it work between a range and a number.
Thanks in advance!
Setting the values:
ex1 <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
ex2 <- c(41,1,1,90,1,1,1,44,1,90,1,2,42,1,1,1,1,90,1,41)
before <- data.frame(ex1,ex2)
before
ex1 ex2
1 1 41
2 2 1
3 3 1
4 4 90
5 5 1
6 6 1
7 7 1
8 8 44
9 9 1
10 10 90
11 11 1
12 12 2
13 13 42
14 14 1
15 15 1
16 16 1
17 17 1
18 18 90
19 19 1
20 20 41
I have built a function that should do the work.
The function is constructed based on my understanding of your problem, so there is a chance it will not work perfectly in your setting.
However, I believe you can do your task by adjusting the function a little bit to satisfy your needs.
library(dplyr)
episode <- function(start = 90, end = 40, data){
  # the default value of start is 90 and the default value of end is 40
  # row indices whose values indicate an end (here, anything from end to end + 10)
  end_idx <- which(data$ex2 >= end & data$ex2 <= end + 10)
  # row indices whose values indicate a start
  start_idx <- which(data$ex2 == start)
  # a list that will contain the extracted sub-samples
  sub_sample_list <- vector("list", length(start_idx))
  # loop through the start indices
  for(i in 1:length(start_idx)){
    # the smallest end index that comes after the i-th start index
    temp_end <- min(end_idx[end_idx > start_idx[i]])
    # rows between the i-th start index and that end index
    temp_sub_sample <- data[start_idx[i]:temp_end, ]
    # save the sub-sample in the list
    sub_sample_list[[i]] <- temp_sub_sample
  }
  # now row-bind all the extracted sub-samples
  clean.df <- do.call(rbind.data.frame, sub_sample_list)
  # if there is an end index that comes before the first start index
  if(min(end_idx) < min(start_idx)){
    # retrieve those rows as well and add them to clean.df
    clean.df <- rbind(data[end_idx[end_idx < min(start_idx)], ], clean.df)
  }
  # clean up the row numbers a bit
  rownames(clean.df) <- 1:nrow(clean.df)
  # sort clean.df by ex1
  clean.df <- clean.df %>% arrange(ex1)
  # return clean.df
  return(clean.df)
}
Generating the after data set by using the episode function.
after <- episode(start = 90, end = 40, before)
after
ex1 ex2
1 1 41
2 4 90
3 5 1
4 6 1
5 7 1
6 8 44
7 10 90
8 11 1
9 12 2
10 13 42
11 18 90
12 19 1
13 20 41
And in base R:
ex1 <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
ex2 <- c(41,1,1,90,1,1,1,44,1,90,1,2,42,1,1,1,1,90,1,41)
df <- data.frame(ex1,ex2)
Index the start of the series (90) and, if the series does not start at row 1, subset out the rows prior to the first start as incomplete:
start_idx <- which(df$ex2 == 90)
df <- df[start_idx[1]:nrow(df), ]
Re-index the starts, and index the ends (>= 40 & < 90):
start_idx <- which(df$ex2 == 90)
end_idx <- which(df$ex2 >= 40 & df$ex2 < 90)
Make an empty list and loop through with a for loop, subsetting out the start:end sections:
df_lst <- list()
for (k in 1:length(start_idx)) {
  df_lst[[k]] <- df[start_idx[k]:end_idx[k], ]
}
Bring them all together:
df2 <- do.call('rbind', df_lst)
df2
ex1 ex2
4 4 90
5 5 1
6 6 1
7 7 1
8 8 44
10 10 90
11 11 1
12 12 2
13 13 42
18 18 90
19 19 1
20 20 41
Fairly compact.
I have a dataframe:
dataframe <- data.frame(Condition = rep(c(1,2,3), each = 5, times = 2),
Time = sort(sample(1:60, 30)))
Condition Time
1 1 1
2 1 3
3 1 4
4 1 7
5 1 9
6 2 11
7 2 12
8 2 14
9 2 16
10 2 18
11 3 19
12 3 24
13 3 25
14 3 28
15 3 30
16 1 31
17 1 34
18 1 35
19 1 38
20 1 39
21 2 40
22 2 42
23 2 44
24 2 47
25 2 48
26 3 49
27 3 54
28 3 55
29 3 57
30 3 59
I want to divide the total length of Time (i.e., max(Time) - min(Time)) per Condition by a constant 'x' (e.g., 3). Then I want to use that quotient to add a new variable Trial such that my dataframe looks like this:
Condition Time Trial
1 1 1 A
2 1 3 A
3 1 4 B
4 1 7 C
5 1 9 C
6 2 11 A
7 2 12 A
8 2 14 B
9 2 16 C
10 2 18 C
... and so on
As you can see, for Condition 1, Trial is populated with unique identifying values (e.g., A, B, C) every 2.67 seconds = 8 (total time) / 3. For Condition 2, Trial is populated every 2.33 seconds = 7 (total time) /3.
I am not getting what I want with my current code:
dataframe %>%
group_by(Condition) %>%
mutate(Trial = LETTERS[cut(Time, 3, labels = F)])
# Groups: Condition [3]
Condition Time Trial
<dbl> <int> <chr>
1 1 1 A
2 1 3 A
3 1 4 A
4 1 7 A
5 1 9 A
6 2 11 A
7 2 12 A
8 2 14 A
9 2 16 A
10 2 18 A
# ... with 20 more rows
Thanks!
We can get the difference of range (range returns the min/max as a vector) and divide it by the constant passed in (i.e. 3) to use as the breaks in cut. Then, use the integer index (labels = FALSE) to get the corresponding letter from the LETTERS built-in R constant.
library(dplyr)
dataframe %>%
group_by(Condition) %>%
mutate(Trial = LETTERS[cut(Time, diff(range(Time))/3,
labels = FALSE)])
If the grouping should be based on adjacent values in 'Condition', use rleid from data.table on the 'Condition' column to create the grouping, and apply the same code as above
library(data.table)
dataframe %>%
group_by(grp = rleid(Condition)) %>%
mutate(Trial = LETTERS[cut(Time, diff(range(Time))/3,
labels = FALSE)])
Here's a one-liner using my santoku package. The rleid line is the same as mentioned in @akrun's solution.
library(santoku)
library(magrittr)  # for %<>%

dataframe %<>%
  group_by(grp = data.table::rleid(Condition)) %>%
  mutate(
    Trial = chop_evenly(Time, intervals = 3, labels = lbl_seq("A"))
  )
I have the following dataframe containing a variable "group" and a variable "number of elements per group"
group elements
1 3
2 1
3 14
4 10
.. ..
.. ..
30 5
Then I have a bunch of numbers going from 1 to (let's say) 30.
Summing "elements" I would get 900. What I want is to randomly select numbers from 1 to 30 and assign them to each group until I fill the number of elements for that group. Each of the 30 numbers should appear 30 times in total.
Thus, for group 1, I want to randomly select 3 numbers from 1 to 30,
for group 2, 1 number from 1 to 30, etc., until I have filled all of the groups.
the final table should look like this:
group number(randomly selected)
1 7
1 20
1 7
2 4
3 21
3 20
...
any suggestions on how I can achieve this?
In base R, if you have df like this...
df
group elements
1 3
2 1
3 14
Then you can do this...
data.frame(group = rep(df$group, #repeat group no...
df$elements), #elements times
number = unlist(sapply(df$elements, #for each elements...
sample.int, #...sample <elements> numbers
n=30, #from 1 to 30
replace = FALSE))) #without duplicates
group number
1 1 19
2 1 15
3 1 28
4 2 15
5 3 20
6 3 18
7 3 27
8 3 10
9 3 23
10 3 12
11 3 25
12 3 11
13 3 14
14 3 13
15 3 16
16 3 26
17 3 22
18 3 7
Give this a try:
df <- read.table(text = "group elements
1 3
2 1
3 14
4 10
30 5", header = TRUE)
# reproducibility
set.seed(1)
df_split2 <- do.call("rbind",
(lapply(split(df, df$group),
function(m) cbind(m,
`number(randomly selected)` =
sample(1:30, replace = TRUE,
size = m$elements),
row.names = NULL
))))
# remove element column name
df_split2$elements <- NULL
head(df_split2)
#> group number(randomly selected)
#> 1.1 1 25
#> 1.2 1 4
#> 1.3 1 7
#> 2 2 1
#> 3.1 3 2
#> 3.2 3 29
The split function splits the df into chunks based on the group column. We then take those smaller data frames and add a column to them by sampling 1:30 a total of elements times. We then use do.call on this list to rbind the pieces back together.
You have to generate a new dataframe repeating $group $elements times, and then using sample you can generate the exact number of random numbers:
data<-data.frame(group=c(1,2,3,4,5),
elements=c(2,5,2,1,3))
data.elements<-data.frame(group=rep(data$group,data$elements),
number=sample(1:30,sum(data$elements)))
The result:
group number
1 1 9
2 1 4
3 2 29
4 2 28
5 2 18
6 2 7
7 2 25
8 3 17
9 3 22
10 4 5
11 5 3
12 5 8
13 5 26
I solved it as follows:
random_sample <- rep(1:30, each=30)
random_sample <- sample(random_sample)
Then I create a df with this variable and a variable containing one group per row, repeated by the number of elements in that group; a sketch of this step follows.
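A minimal sketch of the data-frame construction described above, assuming df is the group/elements table from the question and that sum(df$elements) matches the 900 values in random_sample (the rep/sample step guarantees each number from 1 to 30 appears exactly 30 times overall):
# sketch: one row per element, paired with the pre-shuffled numbers
result <- data.frame(group  = rep(df$group, df$elements),
                     number = random_sample[seq_len(sum(df$elements))])
head(result)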
I have two dataframes. For every row in the first dataframe, I need to add a value from a column of the second dataframe, based on which condition (row) of the second dataframe that row's value satisfies.
df1:
a b
x 23
s 34
v 15
g 05
k 69
df2:
x y z
1 0 10
2 10 20
3 20 30
4 30 40
5 40 50
6 50 60
7 60 70
Desired output:
a b n
x 23 3
s 34 4
v 15 2
g 05 1
k 69 7
In my dataset the intervals are large, and it's unlikely that a value from df1 is exactly on the boundary of a df2 interval.
Essentially for every row in df1 I need to assign the number which corresponds to which range it fits into in df2. So if df1$b is between df2$y and df2$z, then assign the value of output$n as the corresponding value of df2$x. This is quite a wordy question, so please ask if I need to clarify.
df1 = read.table(text = "
a b
x 23
s 34
v 15
g 05
k 69
", header=T, stringsAsFactors=F)
df2 = read.table(text = "
x y z
1 0 10
2 10 20
3 20 30
4 30 40
5 40 50
6 50 60
7 60 70
", header=T, stringsAsFactors=F)
# function
f = function(x) min(which(x >= df2$y & x <= df2$z))
f = Vectorize(f)
# apply function
df1$n = f(df1$b)
# check updated dataset
df1
# a b n
# 1 x 23 3
# 2 s 34 4
# 3 v 15 2
# 4 g 5 1
# 5 k 69 7
You can try:
library(tidyverse)
df1 %>%
rowwise() %>%
mutate(n=df2[ b > df2$y & b <= df2$z,1]) %>%
ungroup()
# A tibble: 5 x 3
a b n
<chr> <int> <int>
1 x 23 3
2 s 34 4
3 v 15 2
4 g 5 1
5 k 69 7
As already commented, you have to change < or > to <= or >= according to your needs.
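As a further alternative not mentioned above: since the intervals in df2 are contiguous and sorted, base R's findInterval could do the lookup using only the lower bounds. A small sketch:
# sketch: findInterval returns which left-closed interval each value of b falls into
df1$n <- findInterval(df1$b, df2$y)
This should reproduce the same n column as the answers above; findInterval treats each interval as closed on the left, which fits the assumption that values rarely sit exactly on a boundary.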
I have a big dataset, with 240 cases representing 240 patients. They all have undergone neuropsychological tests and filled in questionnaires. Additionally, their significant others (hereafter: proxies) have also filled in questionnaires. Since 'patient' and 'proxy' are nested in 'couples', I want to conduct a multilevel analysis in R. For this, I need to reshape my dataset to run those kind of analysis.
Simply said, I want to 'duplicate' my rows. For the doubled subject IDs I add a new variable with 1s and 2s, where 1 stands for patient data and 2 stands for proxy data. Then I want the rows to be filled with 1. all the patient data, with the columns that contain the proxy data set to NA or empty or whatever, and 2. all the proxy data, with all the patient data set to NA or empty.
Let's say this is my data:
id <- c(1:5)
names <- c('id', 'p1', 'p2', 'p3', 'pr1', 'pr2', 'pr3')
p1 <- c(sample(1:10, 5))
p2 <- c(sample(10:20, 5))
p3 <- c(sample(20:30, 5))
pr1 <- c(sample(1:10, 5))
pr2 <- c(sample(10:20, 5))
pr3 <- c(sample(20:30, 5))
mydf <- as.data.frame(matrix(c(id, p1, p2, p3, pr1, pr2, pr3), nrow = 5))
colnames(mydf) <- names
>mydf
id p1 p2 p3 pr1 pr2 pr3
1 1 6 20 22 1 10 24
2 2 8 11 24 2 18 29
3 3 7 10 25 6 20 26
4 4 3 14 20 10 15 20
5 5 5 19 29 7 14 22
I want my data finally to look like this:
id2 <- rep(c(1:5), each = 2)
names2 <- c('id', 'couple', 'q1', 'q2', 'q3')
couple <- rep(1:2, 5)
p1 <- c(sample(1:10, 5))
p2 <- c(sample(10:20, 5))
p3 <- c(sample(20:30, 5))
pr1 <- c(sample(1:10, 5))
pr2 <- c(sample(10:20, 5))
pr3 <- c(sample(20:30, 5))
mydf <- as.data.frame(matrix(c(id2, couple, p1, p2, p3, pr1, pr2, pr3), nrow = 10, ncol = 5))
colnames(mydf) <- names2
>mydf
id couple q1 q2 q3
1 1 1 6 23 16
2 1 2 10 28 10
3 2 1 1 27 14
4 2 2 7 21 20
5 3 1 5 30 18
6 3 2 12 2 27
7 4 1 10 1 25
8 4 2 13 7 21
9 5 1 11 6 20
10 5 2 18 3 23
Or, if this is not possible, like this:
   id couple bb1 bb2 bb3 pbb1 pbb2 pbb3
1   1      1   6  23  16   NA   NA   NA
2   1      2  NA  NA  NA   10   28   10
3   2      1   1  27  14   NA   NA   NA
4   2      2  NA  NA  NA    7   21   20
5   3      1   5  30  18   NA   NA   NA
6   3      2  NA  NA  NA   12    2   27
7   4      1  10   1  25   NA   NA   NA
8   4      2  NA  NA  NA   13    7   21
9   5      1  11   6  20   NA   NA   NA
10  5      2  NA  NA  NA   18    3   23
Now, to get me there, I've tried the melt() function and the gather() function, and it feels like I'm close, but still it's not working the way I want it to.
Note: in my dataset the variable names are bb1:bb54 for the patient questionnaire and pbb1:pbb54 for the proxy questionnaire.
An example of what I've tried:
df_long <- df_reshape %>%
gather(testname, value, -(bb1:bb11), -(pbb1:pbb11), -id, -pgebdat, -p_age, na.rm=T) %>%
arrange(id)
If I understand what you want correctly, you can gather everything to a very long form and then reshape back to a slightly wider form:
library(tidyverse)
set.seed(47) # for reproducibility
mydf <- data.frame(id = c(1:5),
p1 = c(sample(1:10, 5)),
p2 = c(sample(10:20, 5)),
p3 = c(sample(20:30, 5)),
pr1 = c(sample(1:10, 5)),
pr2 = c(sample(10:20, 5)),
pr3 = c(sample(20:30, 5)))
mydf_long <- mydf %>%
gather(var, val, -id) %>%
separate(var, c('couple', 'q'), -2) %>%
mutate(q = paste0('q', q)) %>%
spread(q, val)
mydf_long
#> id couple q1 q2 q3
#> 1 1 p 10 17 21
#> 2 1 pr 10 11 24
#> 3 2 p 4 13 27
#> 4 2 pr 4 15 20
#> 5 3 p 7 14 30
#> 6 3 pr 1 14 29
#> 7 4 p 6 18 24
#> 8 4 pr 8 20 30
#> 9 5 p 9 16 23
#> 10 5 pr 3 18 25
One approach would be to use unite and separate in tidyr, along with the gather function as well.
I'm using your mydf data frame since it was provided, but it should be pretty straightforward to make any changes:
mydf %>%
unite(p1:p3, col = `1`, sep = ";") %>% # Combine responses of 'p1' through 'p3'
unite(pr1:pr3, col = `2`, sep = ";") %>% # Combine responses of 'pr1' through 'pr3'
gather(couple, value, `1`:`2`) %>% # Form into long data
separate(value, sep = ";", into = c("q1", "q2", "q3"), convert = TRUE) %>% # Separate and retrieve original answers
arrange(id)
Which gives you:
id couple q1 q2 q3
1 1 1 9 18 25
2 1 2 10 18 30
3 2 1 1 11 29
4 2 2 2 15 29
5 3 1 10 19 26
6 3 2 3 19 25
7 4 1 7 10 23
8 4 2 1 20 28
9 5 1 6 16 21
10 5 2 5 12 26
Our numbers are different since they were all randomly generated with sample.
Edited per @alistaire's comment: add convert = TRUE to the separate call to make sure the responses are still of class integer.
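Since the real column names are bb1:bb54 and pbb1:pbb54 rather than p1:p3 and pr1:pr3, the same idea could be adapted with a regex split via extract(); the following is only a sketch under that assumption, using the df_reshape name from the question and selecting just the questionnaire columns so that columns like pgebdat and p_age are carried along unchanged:
library(tidyverse)
df_reshape_long <- df_reshape %>%
  gather(var, val, matches('^p?bb[0-9]+$')) %>%            # long form: one row per id x question
  extract(var, c('couple', 'q'), '^(p?bb)([0-9]+)$') %>%   # split e.g. "pbb12" into "pbb" and "12"
  mutate(couple = if_else(couple == 'bb', 1L, 2L),         # 1 = patient (bb), 2 = proxy (pbb)
         q = paste0('q', q)) %>%
  spread(q, val) %>%                                       # back to one column per question
  arrange(id)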