Indexing duplicates with respect to certain variables - r

I want to index duplicates with respect to certain variables in R in a separate, new variable.
Let's assume that I have the following dataset:
a <- seq(from=0, to=1, by=.4)
b <- seq(from=0, to=1, by=.4)
c <- seq(from=0, to=1, by=.4)
d <- seq(from=0, to=1, by=.4)
df <- expand.grid(a=a, b=b, c=c, d=d)
> df[1:20,]
a b c d
1 0.0 0.0 0.0 0
2 0.4 0.0 0.0 0
3 0.8 0.0 0.0 0
4 0.0 0.4 0.0 0
5 0.4 0.4 0.0 0
6 0.8 0.4 0.0 0
7 0.0 0.8 0.0 0
8 0.4 0.8 0.0 0
9 0.8 0.8 0.0 0
10 0.0 0.0 0.4 0
11 0.4 0.0 0.4 0
12 0.8 0.0 0.4 0
13 0.0 0.4 0.4 0
14 0.4 0.4 0.4 0
15 0.8 0.4 0.4 0
16 0.0 0.8 0.4 0
17 0.4 0.8 0.4 0
18 0.8 0.8 0.4 0
19 0.0 0.0 0.8 0
20 0.4 0.0 0.8 0
In this case, the first and tenth rows are identical with respect to a and b. How can I assign a value, e.g. "0.00-0.00", to a new variable for all rows that have this combination (row 19 as well), and do the same for all other combinations (e.g. rows 2, 11 and 20)?
Thanks a lot in advance!

Flag the duplicated rows (10th, 11th, ...):
duplicated(df[,c(1,2)])
Flag the earlier occurrences as well (1st, 2nd, ...) by scanning from the end:
duplicated(df[,c(1,2)], fromLast = TRUE)
Assign the combined value to both originals and duplicates in a new column e:
dup <- duplicated(df[,c(1,2)]) | duplicated(df[,c(1,2)], fromLast = TRUE)
df[dup,"e"] <- paste0(df[dup,1],"-",df[dup,2])
> head(df)
a b c d e
1 0.0 0.0 0 0 0-0
2 0.4 0.0 0 0 0.4-0
3 0.8 0.0 0 0 0.8-0
4 0.0 0.4 0 0 0-0.4
5 0.4 0.4 0 0 0.4-0.4
6 0.8 0.4 0 0 0.8-0.4
Note: in this example every row is either an original or a duplicate, so a value is assigned to all rows.

Try this
df$e <- paste(df$a,df$b)
Let me know if you were looking for something else
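The question asks for labels of the form "0.00-0.00", whereas paste and paste0 drop trailing zeros ("0-0", "0.4-0"). A minimal sketch, assuming two decimal places are wanted and the new column is called e as in the answers above, formats both key columns with sprintf:

```r
# rebuild the example grid from the question
a <- seq(from = 0, to = 1, by = 0.4)
df <- expand.grid(a = a, b = a, c = a, d = a)

# one label per (a, b) combination, always with two decimals
df$e <- sprintf("%.2f-%.2f", df$a, df$b)

df[c(1, 10, 19), ]  # rows sharing a = 0, b = 0 all get "0.00-0.00"
```

Rows that share the same a and b receive the same label, so e can be used directly as a grouping variable.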

Related

Create lower triangle genetic distance matrix

I have distance matrix like this
1 2 3 4 5
A 0.1 0.2 0.3 0.5 0.6
B 0.7 0.8 0.9 1 1.1
C 1.2 1.3 1.4 1.5 1.6
D 1.7 1.8 1.9 2 2.1
E 2.2 2.3 2.4 2.5 2.6
and now I want to create lower triangle matrix like this
1 2 3 4 5 A B C D E
1 0
2 0.1 0
3 0.2 0.1 0
4 0.4 0.3 0.2 0
5 0.5 0.4 0.3 0.1 0
A 0.1 0.2 0.3 0.5 0.6 0
B 0.7 0.8 0.9 1 1.1 0.6 0
C 1.2 1.3 1.4 1.5 1.6 1.1 0.5 0
D 1.7 1.8 1.9 2 2.1 1.6 1 0.5 0
E 2.2 2.3 2.4 2.5 2.6 2.1 1.5 1 0.5 0
To get the genetic distance between 1 and 2, I subtracted the entry for 1 from the entry for 2 in the first table (0.2 - 0.1 = 0.1), and did the same for the rest of the entries; I do not know whether this is correct. I then arranged the results in a lower triangle matrix. I tried this in R:
x <- read.csv("AD2.csv", head = FALSE, sep = ",")
b<-lower.tri(b, diag = FALSE)
but I get only TRUE and FALSE as output, not a distance matrix.
Can anyone help solve this problem? Here is a link to my example data.
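You can use dist to calculate the two sub-matrices, cbind each with the table to form the top and bottom halves, and rbind the two halves. Then set the upper triangle to NA to get the desired output.
mat <- rbind(
# top half: distances among columns 1-5 (computed from row A), then the transposed table so the columns are labelled A-E
cbind(as.matrix(dist(tbl[1,])), t(tbl)),
# bottom half: the table itself, then distances among rows A-E (computed from column 1)
cbind(tbl, as.matrix(dist(tbl[,1])))
)
mat[upper.tri(mat, diag=FALSE)] <- NA
mat
Hope it helps.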
data:
tbl <- as.matrix(read.table(text="1 2 3 4 5
A 0.1 0.2 0.3 0.5 0.6
B 0.7 0.8 0.9 1 1.1
C 1.2 1.3 1.4 1.5 1.6
D 1.7 1.8 1.9 2 2.1
E 2.2 2.3 2.4 2.5 2.6", header=TRUE, check.names=FALSE, row.names=1))

Multiply values depending on values of certains columns

I have two tables, df and cf. I want to multiply each value of A in df by the coefficient in cf selected by the values of B and C in the same row of df.
For example,
the first row of df has A = 20, B = 4 and C = 2, so the correct coefficient is 0.3, and
the result is 20*0.3 = 6
Is there a simple way to do that in R?
Thanks in advance!
df
A B C
20 4 2
30 4 5
35 2 2
24 3 3
43 2 1
cf
C
B/C 1 2 3 4 5
1 0.2 0.3 0.5 0.6 0.7
2 0.1 0.5 0.3 0.3 0.4
3 0.9 0.1 0.6 0.6 0.8
4 0.7 0.3 0.7 0.4 0.6
One solution with apply:
#iterate over df's rows
apply(df, 1, function(x) {
x[1] * cf[x[2], x[3]]
})
#[1] 6.0 18.0 17.5 14.4 4.3
Try this vectorized:
df[,1] * cf[as.matrix(df[,2:3])]
#[1] 6.0 18.0 17.5 14.4 4.3
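The vectorized line works because indexing with a two-column matrix treats each row of that matrix as a (row, column) pair. A minimal self-contained sketch, assuming cf is stored as a plain numeric matrix (the names idx and coef are only illustrative):

```r
df <- data.frame(A = c(20, 30, 35, 24, 43),
                 B = c(4, 4, 2, 3, 2),
                 C = c(2, 5, 2, 3, 1))
cf <- matrix(c(0.2, 0.3, 0.5, 0.6, 0.7,
               0.1, 0.5, 0.3, 0.3, 0.4,
               0.9, 0.1, 0.6, 0.6, 0.8,
               0.7, 0.3, 0.7, 0.4, 0.6),
             nrow = 4, byrow = TRUE)

# each row of idx is one (B, C) pair, i.e. one (row, column) lookup in cf
idx <- cbind(df$B, df$C)
coef <- cf[idx]

result <- df$A * coef
result
# 6.0 18.0 17.5 14.4 4.3
```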
A solution using dplyr and a vectorised function:
df = read.table(text = "
A B C
20 4 2
30 4 5
35 2 2
24 3 3
43 2 1
", header=T, stringsAsFactors=F)
cf = read.table(text = "
0.2 0.3 0.5 0.6 0.7
0.1 0.5 0.3 0.3 0.4
0.9 0.1 0.6 0.6 0.8
0.7 0.3 0.7 0.4 0.6
")
library(dplyr)
# function to get the correct element of cf
# vectorised version
f = function(x,y) cf[x,y]
f = Vectorize(f)
df %>%
mutate(val = f(B,C),
result = val * A)
# A B C val result
# 1 20 4 2 0.3 6.0
# 2 30 4 5 0.6 18.0
# 3 35 2 2 0.5 17.5
# 4 24 3 3 0.6 14.4
# 5 43 2 1 0.1 4.3
The final dataset has both result and val in order to check which value from cf was used each time.

instantaneous velocity - reference previous value

Using a very simple equation, how can I calculate the instantaneous velocity?
Vi = V0 + acceleration * time
This task is very easy in MS Excel, as one can click on the previous cell, but how do we reference the previous value in R?
acceleration <- c(1,2,3,4,5,4,3,2,1)
time <- rep(0.1,9)
df1 <- data.frame(acceleration, time)
df1$instant.vel <- df1$acceleration * df1$time + ....
Try using dplyr::lag
library(dplyr)
df1 %>%
mutate(V=(lag(acceleration,default=0)*lag(time,default=0))+(acceleration*time))
acceleration time V
1 1 0.1 0.1
2 2 0.1 0.3
3 3 0.1 0.5
4 4 0.1 0.7
5 5 0.1 0.9
6 4 0.1 0.9
7 3 0.1 0.7
8 2 0.1 0.5
9 1 0.1 0.3
Or step by step:
df1 %>%
mutate(V0=(acceleration*time)) %>%
mutate(V1=V0+(lag(acceleration,default=0)*lag(time,default=0)))
acceleration time V0 V1
1 1 0.1 0.1 0.1
2 2 0.1 0.2 0.3
3 3 0.1 0.3 0.5
4 4 0.1 0.4 0.7
5 5 0.1 0.5 0.9
6 4 0.1 0.4 0.9
7 3 0.1 0.3 0.7
8 2 0.1 0.2 0.5
9 1 0.1 0.1 0.3
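Note that the lag-based result adds only the immediately preceding increment to the current one. If the intent is instead that each velocity builds on the previous velocity, V_i = V_(i-1) + a_i * t_i, a running sum gives that directly. A minimal sketch, assuming an initial velocity V0 of 0 (the question does not state one); the column name instant.vel is taken from the question:

```r
acceleration <- c(1, 2, 3, 4, 5, 4, 3, 2, 1)
time <- rep(0.1, 9)
df1 <- data.frame(acceleration, time)

V0 <- 0  # assumed initial velocity; not given in the question
# cumsum accumulates the per-step increments a_i * t_i
df1$instant.vel <- V0 + cumsum(df1$acceleration * df1$time)
df1$instant.vel
# 0.1 0.3 0.6 1.0 1.5 1.9 2.2 2.4 2.5
```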

Simulation and Scenarios in R + help for function

Suppose that I have the following.
A table with input data
table <- data.frame(id=c(1,2,3,4,5,6),
cost=c(100,200,300,400,500,600))
A list of possible outcomes with an associated probability
values<-list(c(1),
c(0.5),
c(0))
A simulation of different scenarios
esc<-sample(1:3,100,replace=T)
How can I add a new column that contains the following formula?
id cost final
1 100 100*ifelse(esc[1]==1,values[[1]],ifelse(esc[1]==2,values[[2]],values[[3]]))
2 200 200*ifelse(esc[2]==1,values[[1]],ifelse(esc[2]==2,values[[2]],values[[3]]))
Convert the esc variable into a factor, using values as labels, then convert it to numeric. This maps values onto esc correctly.
esc <- as.numeric ( as.character( factor( esc, levels = sort( unique( esc )), labels = values) ) )
# [1] 1.0 0.5 0.5 0.0 1.0 0.0 0.0 0.5 0.5 1.0 1.0 1.0 0.0 0.5 0.0 0.5 0.0 0.0 0.5 0.0 0.0 1.0 0.5 1.0 1.0 0.5 1.0 0.5 0.0 0.5 0.5 0.5 0.5 1.0 0.0 0.0 0.0
# [38] 1.0 0.0 0.5 0.0 0.5 0.0 0.5 0.5 0.0 1.0 0.5 0.0 0.0 0.5 0.0 0.5 1.0 1.0 1.0 1.0 0.5 0.5 0.5 0.0 1.0 0.5 1.0 0.5 1.0 0.5 0.0 1.0 0.0 0.5 0.0 0.5 0.5
# [75] 0.5 0.0 0.0 0.5 0.0 0.0 0.5 0.0 0.5 1.0 0.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 0.5 0.0 0.0 0.0 0.5 0.5 0.0 0.5
table$esc <- esc[ 1: nrow(table) ] # add esc to table
Now multiply cost by esc to get final
within( table, final <- cost * esc)
# id cost esc final
# 1 1 100 1.0 100
# 2 2 200 0.5 100
# 3 3 300 0.5 150
# 4 4 400 0.0 0
# 5 5 500 1.0 500
# 6 6 600 0.0 0
Data:
table <- data.frame(id=c(1,2,3,4,5,6), cost=c(100,200,300,400,500,600))
values <- c(1, 0.5, 0)
set.seed(1L)
esc <- sample(1:3,100,replace=T)
esc
# [1] 1 2 2 3 1 3 3 2 2 1 1 1 3 2 3 2 3 3 2 3 3 1 2 1 1 2 1 2 3 2 2 2 2 1 3 3 3 1 3 2 3 2 3 2 2 3 1 2 3 3 2 3 2 1 1 1 1 2 2 2 3 1 2 1 2 1 2 3 1 3 2 3 2 2 2
# [76] 3 3 2 3 3 2 3 2 1 3 1 3 1 1 1 1 1 2 3 3 3 2 2 3 2
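Because esc holds the integers 1 to 3, the same mapping can also be written as a direct vector lookup, without the factor round-trip. A minimal sketch, assuming values is the plain numeric vector from the Data block; apply_scenario is a hypothetical helper name, not part of the original answer:

```r
# apply_scenario is a hypothetical helper, introduced only for illustration
apply_scenario <- function(cost, esc, values) {
  # take the first length(cost) scenario draws and look up their multipliers
  cost * values[esc[seq_along(cost)]]
}

table <- data.frame(id = 1:6, cost = c(100, 200, 300, 400, 500, 600))
values <- c(1, 0.5, 0)
set.seed(1L)
esc <- sample(1:3, 100, replace = TRUE)

table$final <- apply_scenario(table$cost, esc, values)
table
```

Indexing values by esc picks values[1], values[2] or values[3] for each draw, which is what the nested ifelse in the question spells out by hand.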

How to plot average of multiple columns by factor variables

I am trying to plot what is essentially calculated average time-series data for a dependent variable with 2 independent variables. DV = pupil dilation (at multiple time points "T") in response doing a motor task (IV_A) in combination with 3 different speech-in-noise signals (IV_B).
I would like to plot the average dilation across subjects at each time point (mean for each T column) , with separate lines for each condition.
So, the x axis would be T1 to T5 with a separate line for IV_A(=1):IV_B(=1),IV_A(=1):IV_B(=2),and IV_A(=1):IV_B(=3)
Depending how it looks, I might want the IV_A(=2) lines on a separate plot. But all in one graph would make for an easy visual comparison.
I'm wondering if I need to melt the data, to make it extremely long (there are about 110 T columns), or if there is a way to accomplish what I want without restructuring the data frame.
The data look something like this:
Subject IV_A IV_B T1 T2 T3 T4 T5
1 1 1 0.2 0.3 0.5 0.6 0.3
1 1 2 0.3 0.2 0.3 0.4 0.4
1 1 3 0.2 0.4 0.5 0.2 0.3
1 2 1 0.3 0.2 0.3 0.4 0.4
1 2 2 0.2 0.3 0.5 0.6 0.3
1 2 3 0.2 0.4 0.5 0.2 0.3
2 1 1 0.2 0.3 0.5 0.6 0.3
2 1 2 0.3 0.2 0.3 0.4 0.4
2 1 3 0.2 0.4 0.5 0.2 0.3
2 2 1 0.3 0.2 0.3 0.4 0.4
2 2 2 0.2 0.3 0.5 0.6 0.3
2 2 3 0.2 0.4 0.5 0.2 0.3
3 1 1 0.2 0.3 0.5 0.6 0.3
3 1 2 0.3 0.2 0.3 0.4 0.4
3 1 3 0.2 0.4 0.5 0.2 0.3
3 2 1 0.3 0.2 0.3 0.4 0.4
3 2 2 0.2 0.3 0.5 0.6 0.3
3 2 3 0.2 0.4 0.5 0.2 0.3
Edit:
Unfortunately, I can't adapt @eipi10's code to my actual data frame, which looks as follows:
Subject Trk_Y.N NsCond X.3 X.2 X.1 X0 X1 X2 X3
1 N Pink 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 N Babble 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 N Loss 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 Y Pink 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 Y Babble 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 Y Loss 0.3 0.4 0.6 0.4 0.8 0.6 0.6
Trk_Y.N indicates whether the block included a secondary motor tracking task ("Yes" or "No"). NsCond is the type of noise in which the speech stimuli are presented.
It would probably be better to replace "Y" with "Tracking" and "N" with "No_Tracking".
I tried:
test_data[test_data$Trk_Y.N == "Y",]$Trk_Y.N = "Tracking"
But got a warning:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = c("Tracking", "Tracking", :
invalid factor level, NA generated
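This warning appears when a value that is not among a factor's existing levels is assigned to it, and the affected entries become NA. One way around it is to rename the levels themselves. A minimal sketch, assuming Trk_Y.N is a factor with levels "N" and "Y"; the small test_data below is rebuilt from the rows shown and is only illustrative:

```r
test_data <- data.frame(
  Subject = 1,
  Trk_Y.N = factor(c("N", "N", "N", "Y", "Y", "Y")),
  NsCond  = c("Pink", "Babble", "Loss", "Pink", "Babble", "Loss")
)

# rename the levels rather than assigning new strings to individual cells
lv <- levels(test_data$Trk_Y.N)
lv[lv == "Y"] <- "Tracking"
lv[lv == "N"] <- "No_Tracking"
levels(test_data$Trk_Y.N) <- lv

test_data
```

Converting the column with as.character before the assignment would also avoid the warning, at the cost of losing the factor type.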
I may not have understood your data structure, so please let me know if this isn't what you had in mind:
library(reshape2)
library(ggplot2)
library(dplyr)
"Melt" data to long format. This will give us one observation for each Subject, IV and Time:
# Convert the two `IV` columns into a single column
df.m = df %>% mutate(IV = paste0("A",IV_A,":","B",IV_B)) %>% select(-IV_A,-IV_B)
# Melt to long format
df.m = melt(df.m, id.var=c("Subject","IV"), variable.name="Time", value.name="Pupil_Dilation")
head(df.m)
Subject IV Time Pupil_Dilation
1 1 A1:B1 T1 0.2
2 1 A1:B2 T1 0.3
3 1 A1:B3 T1 0.2
4 1 A2:B1 T1 0.3
5 1 A2:B2 T1 0.2
6 1 A2:B3 T1 0.2
Now we can plot a line giving the average value of Pupil_Dilation at each Time point for each level of IV, plus 95% confidence intervals (mean_cl_boot requires the Hmisc package to be installed). In your sample data, every subject has the same value at each Time for each level of IV, so the 95% confidence interval has zero width in the example graph. With real data that varies across subjects, the code below will show the interval:
pd=position_dodge(0.5)
ggplot(df.m, aes(Time, Pupil_Dilation, colour=IV, group=IV)) +
stat_summary(fun.data=mean_cl_boot, geom="errorbar", width=0.1, position=pd) +
stat_summary(fun=mean, geom="line", position=pd) +
stat_summary(fun=mean, geom="point", position=pd) +
scale_y_continuous(limits=c(0, max(df.m$Pupil_Dilation))) +
theme_bw()
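On whether melting is strictly necessary: the per-condition means themselves can be computed on the wide data. A minimal sketch in base R, assuming the column names from the question's sample data (vals and means are illustrative names):

```r
# rebuild the sample data from the question (3 subjects x 6 conditions)
vals <- matrix(c(0.2, 0.3, 0.5, 0.6, 0.3,
                 0.3, 0.2, 0.3, 0.4, 0.4,
                 0.2, 0.4, 0.5, 0.2, 0.3,
                 0.3, 0.2, 0.3, 0.4, 0.4,
                 0.2, 0.3, 0.5, 0.6, 0.3,
                 0.2, 0.4, 0.5, 0.2, 0.3),
               ncol = 5, byrow = TRUE,
               dimnames = list(NULL, paste0("T", 1:5)))
df <- data.frame(Subject = rep(1:3, each = 6),
                 IV_A = rep(rep(1:2, each = 3), times = 3),
                 IV_B = rep(1:3, times = 6),
                 vals[rep(1:6, times = 3), ])

# mean of every T column within each IV_A x IV_B cell, no reshaping needed
means <- aggregate(. ~ IV_A + IV_B, data = df[, -1], FUN = mean)
means
```

Each row of means is one condition, so transposing the T columns and passing them to matplot would draw one line per condition. For ggplot2, long data is still the usual route, so the melt step above remains the more convenient choice there.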
