Group similar strings with numbers and keep order of first appearance

Group similar strings with numbers and keep order of first appearance - r

I have a dataframe which looks like this example (just much larger):
var <- c('Peter','Ben','Mary','Peter.1','Ben.1','Mary.1','Peter.2','Ben.2','Mary.2')
v1 <- c(0.4, 0.6, 0.7, 0.3, 0.9, 0.2, 0.4, 0.6, 0.7)
v2 <- c(0.5, 0.4, 0.2, 0.5, 0.4, 0.2, 0.1, 0.4, 0.2)
df <- data.frame(var, v1, v2)
var v1 v2
1 Peter 0.4 0.5
2 Ben 0.6 0.4
3 Mary 0.7 0.2
4 Peter.1 0.3 0.5
5 Ben.1 0.9 0.4
6 Mary.1 0.2 0.2
7 Peter.2 0.4 0.1
8 Ben.2 0.6 0.4
9 Mary.2 0.7 0.2
I want to group the strings in 'var' according to the names without the suffixes, and keep the original order of first appearance. Desired output:
var v1 v2
1 Peter 0.4 0.5 # Peter appears first in the original data
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4 Ben 0.6 0.4 # Ben appears second in the original data
5 Ben.1 0.9 0.4
6 Ben.2 0.6 0.4
7 Mary 0.7 0.2 # Mary appears third in the original data
8 Mary.1 0.2 0.2
9 Mary.2 0.7 0.2
How can I achieve that?
Thank you!

An option is to create a temporary column without the . and the digits (\\d+) at the end with str_remove, then use factor with levels specified as the unique values or use match to arrange the data
library(dplyr)
library(stringr)
df <- df %>%
mutate(var1 = str_remove(var, "\\.\\d+$")) %>%
arrange(factor(var1, levels = unique(var1))) %>%
select(-var1)
Or use fct_inorder from forcats which will convert to factor with levels in the order of first appearance
library(forcats)
df %>%
arrange(fct_inorder(str_remove(var, "\\.\\d+$")))
-output
var v1 v2
1 Peter 0.4 0.5
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4 Ben 0.6 0.4
5 Ben.1 0.9 0.4
6 Ben.2 0.6 0.4
7 Mary 0.7 0.2
8 Mary.1 0.2 0.2
9 Mary.2 0.7 0.2

Compact option with sub and data.table::chgroup
df[chgroup(sub("\\..", "", df$var)),]
var v1 v2
1 Peter 0.4 0.5
4 Peter.1 0.3 0.5
7 Peter.2 0.4 0.1
2 Ben 0.6 0.4
5 Ben.1 0.9 0.4
8 Ben.2 0.6 0.4
3 Mary 0.7 0.2
6 Mary.1 0.2 0.2
9 Mary.2 0.7 0.2
chgroup groups together duplicated values but retains the group order (according the first appearance order of each group), efficiently

If you don't mind that the values in var are ordered alphabetically, then the simplest solution is this:
df %>%
arrange(var)
var v1 v2
1 Ben 0.6 0.4
2 Ben.1 0.9 0.4
3 Ben.2 0.6 0.4
4 Mary 0.7 0.2
5 Mary.1 0.2 0.2
6 Mary.2 0.7 0.2
7 Peter 0.4 0.5
8 Peter.1 0.3 0.5
9 Peter.2 0.4 0.1

separate the var column into two columns, replace the NAs that get generated with 0, sort and remove the extra columns.
This works on the numeric value of the numbers rather than the character representation so that for example, 10 won't come before 2. Also, the match in arrange ensures that the order is based on the first occurrence order.
df %>%
separate(var, c("alpha", "no"), convert=TRUE, remove=FALSE, fill="right") %>%
mutate(no = replace_na(no, 0)) %>%
arrange(match(alpha, alpha), no) %>%
select(-alpha, -no)
giving
var v1 v2
1 Peter 0.4 0.5
2 Peter.1 0.3 0.5
3 Peter.2 0.4 0.1
4 Ben 0.6 0.4
5 Ben.1 0.9 0.4
6 Ben.2 0.6 0.4
7 Mary 0.7 0.2
8 Mary.1 0.2 0.2
9 Mary.2 0.7 0.2
Update
Have removed what was previously the first solution after reading the update to the question.

Related

How I can sort these data in R [duplicate]

This question already has answers here:
Multiply rows of matrix by vector?
(6 answers)
Closed 10 months ago.
A small sample of my data is as follows:
A=c(0.1, 0.3, 0.6, 0.1)
dat<-read.table (text=" D1 D2 D3 D4
10 11 13 14
9 8 8 0
70 100 2 3
4 3 3 200
1 2 3 4
", header=TRUE)
The logic is that 0.1 x D1, 0.3xD2, 0.6xD3 and 0.1xD4.
Here is the outcome
1 3.3 7.8 1.4
0.9 2.4 4.8 0
7 30 1.2 0.3
0.4 0.9 1.8 20
0.1 0.6 1.8 0.4
Please assume I have more than 4 Ds

A possible solution, using dplyr:
library(dplyr)
dat %>%
mutate(across(everything(), ~ .x * A[which(names(dat) == cur_column())]))
#> D1 D2 D3 D4
#> 1 1.0 3.3 7.8 1.4
#> 2 0.9 2.4 4.8 0.0
#> 3 7.0 30.0 1.2 0.3
#> 4 0.4 0.9 1.8 20.0
#> 5 0.1 0.6 1.8 0.4
Another possible solution, in base R:
as.data.frame(t(apply(dat, 1, \(x) x * A)))
Yet another possible solution, using purrr::map2_df:
purrr::map2_df(dat, A, `*`)
Or even:
mapply(`*`, dat, A)

Creating a index for unique combination of columns in R

I got a set of data just like that:
df = data.frame(A = c(0.1, 0.3, 0.7, 0.9, 0.5, 0.4, 0.3, 0.3, 0.9, 0.9),
B = c(0.5, 0.4, 0.8, 0.6, 0.8, 0.5, 0.4, 0.5, 0.6, 0.5),
D = c(0.2, 0.1, 0.5, 0.8, 0.6, 0.7, 0.1, 0.3, 0.8, 0.3))
but i need to create a index for all unique combination of A, B and D. Just like that:
index A B D
1 1 0.1 0.5 0.2
2 2 0.3 0.4 0.1
3 3 0.7 0.8 0.5
4 4 0.9 0.6 0.8
5 5 0.5 0.8 0.6
6 6 0.4 0.5 0.7
7 2 0.3 0.4 0.1
8 7 0.3 0.5 0.3
9 4 0.9 0.6 0.8
10 8 0.9 0.5 0.3
Note that the combination between A, B and D is the same for rows 4 and 9 and for rows 2 and 7. Therefore, they receive the same index value

You can use the following code. Maybe the naming of indices have a slight difference than your output but the logic is the same:
library(dplyr)
df %>%
group_by(A, B, D) %>%
mutate(index = cur_group_id()) %>%
ungroup() %>%
arrange(index)
# A tibble: 10 x 4
A B D index
<dbl> <dbl> <dbl> <int>
1 0.1 0.5 0.2 1
2 0.3 0.4 0.1 2
3 0.3 0.4 0.1 2
4 0.3 0.5 0.3 3
5 0.4 0.5 0.7 4
6 0.5 0.8 0.6 5
7 0.7 0.8 0.5 6
8 0.9 0.5 0.3 7
9 0.9 0.6 0.8 8
10 0.9 0.6 0.8 8

We can use match
library(dplyr)
library(stringr)
df %>%
mutate(index = match(str_c(A, B, D), unique(str_c(A, B, D)))) %>%
arrange(index)

Another dplyr option
df %>%
distinct() %>%
mutate(index = 1:n()) %>%
left_join(x = df)
gives
A B D index
1 0.1 0.5 0.2 1
2 0.3 0.4 0.1 2
3 0.7 0.8 0.5 3
4 0.9 0.6 0.8 4
5 0.5 0.8 0.6 5
6 0.4 0.5 0.7 6
7 0.3 0.4 0.1 2
8 0.3 0.5 0.3 7
9 0.9 0.6 0.8 4
10 0.9 0.5 0.3 8

Multiply values depending on values of certains columns

I have two data base, df and cf. I want to multiply each value of A in df by each coefficient in cf depending on the value of B and C in table df.
For example
row 2 in df A= 20 B= 4 and C= 2 so the correct coefficient is 0.3,
the result is 20*0.3 = 6
There is a simple way to do that in R!?
Thanks in advance!!
df
A B C
20 4 2
30 4 5
35 2 2
24 3 3
43 2 1
cf
C
B/C 1 2 3 4 5
1 0.2 0.3 0.5 0.6 0.7
2 0.1 0.5 0.3 0.3 0.4
3 0.9 0.1 0.6 0.6 0.8
4 0.7 0.3 0.7 0.4 0.6

One solution with apply:
#iterate over df's rows
apply(df, 1, function(x) {
x[1] * cf[x[2], x[3]]
})
#[1] 6.0 18.0 17.5 14.4 4.3

Try this vectorized:
df[,1] * cf[as.matrix(df[,2:3])]
#[1] 6.0 18.0 17.5 14.4 4.3

A solution using dplyr and a vectorised function:
df = read.table(text = "
A B C
20 4 2
30 4 5
35 2 2
24 3 3
43 2 1
", header=T, stringsAsFactors=F)
cf = read.table(text = "
0.2 0.3 0.5 0.6 0.7
0.1 0.5 0.3 0.3 0.4
0.9 0.1 0.6 0.6 0.8
0.7 0.3 0.7 0.4 0.6
")
library(dplyr)
# function to get the correct element of cf
# vectorised version
f = function(x,y) cf[x,y]
f = Vectorize(f)
df %>%
mutate(val = f(B,C),
result = val * A)
# A B C val result
# 1 20 4 2 0.3 6.0
# 2 30 4 5 0.6 18.0
# 3 35 2 2 0.5 17.5
# 4 24 3 3 0.6 14.4
# 5 43 2 1 0.1 4.3
The final dataset has both result and val in order to check which value from cf was used each time.

Lagging variable by group does not work in dplyr

I'm desperately trying to lag a variable by group. I found this post that deals with essentially the same problem I'm facing, but the solution does not work for me, no idea why.
This is my problem:
library(dplyr)
df <- data.frame(monthvec = c(rep(1:2, 2), rep(3:5, 3)))
df <- df %>%
arrange(monthvec) %>%
mutate(growth=ifelse(monthvec==1, 0.3,
ifelse(monthvec==2, 0.5,
ifelse(monthvec==3, 0.7,
ifelse(monthvec==4, 0.1,
ifelse(monthvec==5, 0.6,NA))))))
df%>%
group_by(monthvec) %>%
mutate(lag.growth = lag(growth, order_by=monthvec))
Source: local data frame [13 x 3]
Groups: monthvec [5]
monthvec growth lag.growth
<int> <dbl> <dbl>
1 1 0.3 NA
2 1 0.3 0.3
3 2 0.5 NA
4 2 0.5 0.5
5 3 0.7 NA
6 3 0.7 0.7
7 3 0.7 0.7
8 4 0.1 NA
9 4 0.1 0.1
10 4 0.1 0.1
11 5 0.6 NA
12 5 0.6 0.6
13 5 0.6 0.6
This is what I'd like it to be in the end:
df$lag.growth <- c(NA, NA, 0.3, 0.3, 0.5, 0.5, 0.5, 0.7,0.7,0.7, 0.1,0.1,0.1)
monthvec growth lag.growth
1 1 0.3 NA
2 1 0.3 NA
3 2 0.5 0.3
4 2 0.5 0.3
5 3 0.7 0.5
6 3 0.7 0.5
7 3 0.7 0.5
8 4 0.1 0.7
9 4 0.1 0.7
10 4 0.1 0.7
11 5 0.6 0.1
12 5 0.6 0.1
13 5 0.6 0.1
I believe that one problem is that my groups are not of equal length...
Thanks for helping out.

Here is an idea. We group by monthvec in order to get the number of rows (cnt) of each group. We ungroup and use the first value of cnt as the size of the lag. We regroup on monthvec and replace the values in each group with the first value of each group.
library(dplyr)
df %>%
group_by(monthvec) %>%
mutate(cnt = n()) %>%
ungroup() %>%
mutate(lag.growth = lag(growth, first(cnt))) %>%
group_by(monthvec) %>%
mutate(lag.growth = first(lag.growth)) %>%
select(-cnt)
which gives,
# A tibble: 13 x 3
# Groups: monthvec [5]
monthvec growth lag.growth
<int> <dbl> <dbl>
1 1 0.3 NA
2 1 0.3 NA
3 2 0.5 0.3
4 2 0.5 0.3
5 3 0.7 0.5
6 3 0.7 0.5
7 3 0.7 0.5
8 4 0.1 0.7
9 4 0.1 0.7
10 4 0.1 0.7
11 5 0.6 0.1
12 5 0.6 0.1
13 5 0.6 0.1

You may join your original data with a dataframe with a shifted "monthvec".
left_join(df, df %>% mutate(monthvec = monthvec + 1) %>% unique(), by = "monthvec")
# monthvec growth.x growth.y
# 1 1 0.3 NA
# 2 1 0.3 NA
# 3 2 0.5 0.3
# 4 2 0.5 0.3
# 5 3 0.7 0.5
# 6 3 0.7 0.5
# 7 3 0.7 0.5
# 8 4 0.1 0.7
# 9 4 0.1 0.7
# 10 4 0.1 0.7
# 11 5 0.6 0.1
# 12 5 0.6 0.1
# 13 5 0.6 0.1

How to plot average of multiple columns by factor variables

I am trying to plot what is essentially calculated average time-series data for a dependent variable with 2 independent variables. DV = pupil dilation (at multiple time points "T") in response doing a motor task (IV_A) in combination with 3 different speech-in-noise signals (IV_B).
I would like to plot the average dilation across subjects at each time point (mean for each T column) , with separate lines for each condition.
So, the x axis would be T1 to T5 with a separate line for IV_A(=1):IV_B(=1),IV_A(=1):IV_B(=2),and IV_A(=1):IV_B(=3)
Depending how it looks, I might want the IV_A(=2) lines on a separate plot. But all in one graph would make for an easy visual comparison.
I'm wondering if I need to melt the data, to make it extremely long (there are about 110 T columns), or if there is away to accomplish what I want without restructuring the data frame.
The data look something like this:
Subject IV_A IV_B T1 T2 T3 T4 T5
1 1 1 0.2 0.3 0.5 0.6 0.3
1 1 2 0.3 0.2 0.3 0.4 0.4
1 1 3 0.2 0.4 0.5 0.2 0.3
1 2 1 0.3 0.2 0.3 0.4 0.4
1 2 2 0.2 0.3 0.5 0.6 0.3
1 2 3 0.2 0.4 0.5 0.2 0.3
2 1 1 0.2 0.3 0.5 0.6 0.3
2 1 2 0.3 0.2 0.3 0.4 0.4
2 1 3 0.2 0.4 0.5 0.2 0.3
2 2 1 0.3 0.2 0.3 0.4 0.4
2 2 2 0.2 0.3 0.5 0.6 0.3
2 2 3 0.2 0.4 0.5 0.2 0.3
3 1 1 0.2 0.3 0.5 0.6 0.3
3 1 2 0.3 0.2 0.3 0.4 0.4
3 1 3 0.2 0.4 0.5 0.2 0.3
3 2 1 0.3 0.2 0.3 0.4 0.4
3 2 2 0.2 0.3 0.5 0.6 0.3
3 2 3 0.2 0.4 0.5 0.2 0.3
Edit:
Unfortunately, I can't adapt #eipi10 's code to my actual data frame, which looks as follows:
Subject Trk_Y.N NsCond X.3 X.2 X.1 X0 X1 X2 X3
1 N Pink 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 N Babble 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 N Loss 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 Y Pink 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 Y Babble 0.3 0.4 0.6 0.4 0.8 0.6 0.6
1 Y Loss 0.3 0.4 0.6 0.4 0.8 0.6 0.6
Trk_Y.N means was the block with or without a secondary motor tracking task ("Yes" or "No"). NsCond is the type of noise the speech stimuli are presented in.
It's likely better to replace "Y" with "Tracking" and "N" with "No_Tracking".
I tried:
test_data[test_data$Trk_Y.N == "Y",]$Trk_Y.N = "Tracking"
But got an error:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = c("Tracking", "Tracking", :
invalid factor level, NA generated

I may not have understood your data structure, so please let me know if this isn't what you had in mind:
library(reshape2)
library(ggplot2)
library(dplyr)
"Melt" data to long format. This will give us one observation for each Subject, IV and Time:
# Convert the two `IV` columns into a single column
df.m = df %>% mutate(IV = paste0("A",IV_A,":","B",IV_B)) %>% select(-IV_A,-IV_B)
# Melt to long format
df.m = melt(df.m, id.var=c("Subject","IV"), variable.name="Time", value.name="Pupil_Dilation")
head(df.m)
Subject IV Time Pupil_Dilation
1 1 A1:B1 T1 0.2
2 1 A1:B2 T1 0.3
3 1 A1:B3 T1 0.2
4 1 A2:B1 T1 0.3
5 1 A2:B2 T1 0.2
6 1 A2:B3 T1 0.2
Now we can plot a line giving the average value of Pupil_Dilation for each Time point for each level of IV, plus 95% confidence intervals. In your sample data, there's only a single measurement at each Time for each level of IV so no 95% confidence interval is included in the example graph below. However, if you have multiple measurements in your actual data, then you can use the code below to include the confidence interval:
pd=position_dodge(0.5)
ggplot(df.m, aes(Time, Pupil_Dilation, colour=IV, group=IV)) +
stat_summary(fun.data=mean_cl_boot, geom="errorbar", width=0.1, position=pd) +
stat_summary(fun.y=mean, geom="line", position=pd) +
stat_summary(fun.y=mean, geom="point", position=pd) +
scale_y_continuous(limits=c(0, max(df.m$Pupil_Dilation))) +
theme_bw()

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Group similar strings with numbers and keep order of first appearance - r

If you don't mind that the values in var are ordered alphabetically, then the simplest solution is this: df %>% arrange(var) var v1 v2 1 Ben 0.6 0.4 2 Ben.1 0.9 0.4 3 Ben.2 0.6 0.4 4 Mary 0.7 0.2 5 Mary.1 0.2 0.2 6 Mary.2 0.7 0.2 7 Peter 0.4 0.5 8 Peter.1 0.3 0.5 9 Peter.2 0.4 0.1

Related

How I can sort these data in R [duplicate]

Creating a index for unique combination of columns in R

Multiply values depending on values of certains columns

Lagging variable by group does not work in dplyr

How to plot average of multiple columns by factor variables

Categories

Resources