How to remove duplicates from data? [closed]

I have data in which I measure length each day (repeated-measures analysis). However, the data contain duplicates, and I want to get rid of them. Here is the data (df3):
Replicate # Day length_avg
<chr> <chr> <dbl>
1 1 D1 0.663
2 1 D1 0.663
3 1 D1 0.663
4 1 D1 0.663
5 1 D2 0.688
6 1 D2 0.688
7 1 D2 0.688
8 1 D2 0.688
9 1 D3 0.692
10 1 D3 0.692
11 1 D3 0.692
12 1 D3 0.692
13 1 D4 0.691
14 1 D4 0.691
I want to have only one value for D1, D2, D3 etc. I also have 19 Replicates.
I tried to group it by
df4<-df3 %>%
group_by(Day) %>%
but I get only 1 value for each day per replicate. I need all D1,D2,....D19 values for each Replicate but I am getting
Replicate # Day length_avg
<chr> <chr> <dbl>
1 1 D1 0.663
2 1 D10 0.668
3 1 D11 0.688
4 1 D12 0.682
5 1 D13 0.636
6 4 D14 0.658
7 4 D15 0.667
8 4 D16 0.664
9 4 D17 0.662
10 1 D2 0.688
11 1 D3 0.692
12 1 D4 0.691
13 1 D5 0.687
14 1 D6 0.683
15 1 D7 0.686
16 1 D8 0.678
17 1 D9 0.697

You need slice_head:
df4 <- df3 %>%
  group_by(Day) %>%  # also list the replicate column here to keep one row per replicate and day
  slice_head(n = 1)
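A minimal sketch of the slice_head approach on made-up data. The column name Replicate and all values below are assumptions for illustration (the real df3 may name its replicate column differently). Grouping by both replicate and day keeps one row per replicate-day pair rather than one row per day.

```r
library(dplyr)

# Toy data with duplicated rows; column names and values are illustrative only.
df3_demo <- data.frame(
  Replicate  = rep(c("1", "2"), each = 4),
  Day        = rep(c("D1", "D1", "D2", "D2"), times = 2),
  length_avg = c(0.663, 0.663, 0.688, 0.688, 0.650, 0.650, 0.671, 0.671)
)

# Keep the first row of every Replicate/Day combination.
df4_demo <- df3_demo %>%
  group_by(Replicate, Day) %>%
  slice_head(n = 1) %>%
  ungroup()

df4_demo
```

Because the duplicated rows in this toy data are identical in every column, distinct(df3_demo) would return the same four rows.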


Is it possible to fit a Partial Credit Model when one of the possible responses is never selected for one of the items?

I'm fitting a Partial Credit Model (PCM) with the ltm package.
Suppose my data contain 3 items, each scored 1, 2 or 3, like this:
my_data <- data.frame(
  X1 = c(1,1,3,1,1,3,1,3,1,1,3,3,3,3,3,3,3,3,1,3,3,3,3,1,1,3,3,3,3,3,3,3,3,1,3,3,3,1,1,3),
  X2 = c(1,1,2,3,2,3,2,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,3,3,3,3,3,3,2,2,2,2,2,2,2,2,3,2,1,1),
  X3 = c(2,1,2,2,3,3,2,3,1,2,1,1,1,3,2,2,1,1,1,2,3,1,3,3,2,3,1,2,1,1,1,3,2,2,1,1,1,2,2,1)
)
However, it turns out that nobody chose option 2 for the first item:
lapply(my_data, table)
$X1
 1  3
13 27
$X2
 1  2  3
 4 20 16
$X3
 1  2  3
17 14  9
Now, when I run ltm::gpcm() to fit the model and factor.scores() to examine person abilities, I get the following output:
fit <- gpcm(my_data, constraint = 'rasch')
factor.scores(fit)
Call:
gpcm(data = my_data, constraint = "rasch")
Scoring Method: Empirical Bayes
Factor-Scores for observed response patterns:
X1 X2 X3 Obs Exp z1 se.z1
1 1 1 1 1 1.578 -1.414 0.744
2 1 1 2 2 0.486 -0.880 0.718
3 1 2 1 1 4.228 -0.880 0.718
4 1 2 2 3 2.209 -0.379 0.700
5 1 2 3 1 0.787 0.104 0.694
6 1 3 1 1 1.546 -0.379 0.700
7 1 3 2 3 1.343 0.104 0.694
8 1 3 3 1 0.793 0.591 0.705
9 2 1 1 1 1.159 -0.880 0.718
10 2 2 1 8 5.267 -0.379 0.700
11 2 2 2 5 4.573 0.104 0.694
12 2 2 3 2 2.701 0.591 0.705
13 2 3 1 5 3.201 0.104 0.694
14 2 3 2 1 4.607 0.591 0.705
15 2 3 3 5 4.597 1.107 0.737
It looks like X1 is treated as if it had two possible responses, "1" and "2", not "1" and "3"!
Is there any way to include the unobserved response "2" for X1?
Why is this important?
It's all about scoring. Look at lines 2 and 9 above:
Line 2 is a respondent who scored 1, 1 and 2 (on X1, X2 and X3, respectively).
Line 9 is a respondent who scored 3, 1 and 1 (since X1=3 in the original dataset is recoded to X1=2 by the ltm package).
Those two people have:
exactly the same person-ability score assigned (column z1),
different raw scores (4 and 5, respectively),
which should not happen.
To be precise: I understand why this happens. My question is how to overcome such behaviour?
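The kind of recoding described above can be reproduced in base R. This is only an illustration of how an unobserved category gets squeezed out when values are renumbered consecutively; it is not a claim about what ltm does internally, and it is not a fix for gpcm(). The vector x1 is made up.

```r
# Item with possible categories 1-3 where category 2 is never observed (illustrative values).
x1 <- c(1, 3, 3, 1, 3)

# Converting to a factor without declaring the levels drops the unobserved category,
# so the integer codes run 1, 2 instead of 1, 3.
codes_collapsed <- as.integer(factor(x1))

# Declaring all possible levels keeps the gap: category 3 stays coded as 3.
codes_full <- as.integer(factor(x1, levels = 1:3))

print(codes_collapsed)
print(codes_full)
print(table(factor(x1, levels = 1:3)))  # category 2 is listed with a count of 0
```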

Flag run-length of grouped intervals

I have a dataframe grouped by grp:
df <- data.frame(
v = rnorm(25),
grp = c(rep("A",10), rep("B",15)),
size = 2)
I want to flag the run-length of intervals determined by size. For example, for grp == "A", size is 2, and the number of rows is 10. So the interval should have length 10/2 = 5. This code, however, creates intervals with length 2:
df %>%
  group_by(grp) %>%
  mutate(interval = (row_number() - 1) %/% size)
# A tibble: 25 × 4
# Groups: grp [2]
v grp size interval
<dbl> <chr> <dbl> <dbl>
1 -0.166 A 2 0
2 -1.12 A 2 0
3 0.941 A 2 1
4 -0.913 A 2 1
5 0.486 A 2 2
6 -1.80 A 2 2
7 -0.370 A 2 3
8 -0.209 A 2 3
9 -0.661 A 2 4
10 -0.177 A 2 4
# … with 15 more rows
How can I flag the correct run-length of the size-determined intervals? The desired output is this:
# A tibble: 25 × 4
# Groups: grp [2]
v grp size interval
<dbl> <chr> <dbl> <dbl>
1 -0.166 A 2 0
2 -1.12 A 2 0
3 0.941 A 2 0
4 -0.913 A 2 0
5 0.486 A 2 0
6 -1.80 A 2 1
7 -0.370 A 2 1
8 -0.209 A 2 1
9 -0.661 A 2 1
10 -0.177 A 2 1
# … with 15 more rows
If I interpreted your question correctly, this small change should do the trick?
df %>%
  group_by(grp) %>%
  mutate(interval = (row_number() - 1) %/% (n() / size))
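A self-contained sketch of the integer-division approach, rebuilt on the question's data frame layout. The seed and the name df_demo are arbitrary choices made here for reproducibility.

```r
library(dplyr)

set.seed(1)  # arbitrary seed so the toy data are reproducible
df_demo <- data.frame(
  v    = rnorm(25),
  grp  = c(rep("A", 10), rep("B", 15)),
  size = 2
)

# Dividing the zero-based row index by the interval length n() / size
# yields `size` intervals per group.
res <- df_demo %>%
  group_by(grp) %>%
  mutate(interval = (row_number() - 1) %/% (n() / size)) %>%
  ungroup()

# Number of rows that fall into each interval, per group
count(res, grp, interval)
```

When the group size is not a multiple of size, the split is uneven: group B has 15 rows, so the interval length is 7.5 and the first interval receives 8 rows while the second receives 7.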
You can use gl:
df %>%
group_by(grp) %>%
mutate(interval = gl(first(size), ceiling(n() / first(size)))[1:n()])
# A tibble: 26 × 4
# Groups: grp [2]
v grp size interval
<dbl> <chr> <dbl> <fct>
1 -1.12 A 2 1
2 3.04 A 2 1
3 0.235 A 2 1
4 -0.0333 A 2 1
5 -2.73 A 2 1
6 -0.0998 A 2 1
7 0.976 A 2 2
8 0.414 A 2 2
9 0.912 A 2 2
10 1.98 A 2 2
11 1.17 A 2 2
12 -0.509 B 2 1
13 0.704 B 2 1
14 -0.198 B 2 1
15 -0.538 B 2 1
16 -2.86 B 2 1
17 -0.790 B 2 1
18 0.488 B 2 1
19 2.17 B 2 1
20 0.501 B 2 2
21 0.620 B 2 2
22 -0.966 B 2 2
23 0.163 B 2 2
24 -2.08 B 2 2
25 0.485 B 2 2
26 0.697 B 2 2

Tidyverse mutate using value from column to reference another column value in perhaps a different row

I have created a tibble thus:
a <- c(1, 2, 3, 4, 5)
b <- runif(5)
c <- c(1, 3, 3, 3, 1)
tib <- tibble(a, b, c)
which produces this
# A tibble: 5 x 3
a b c
<dbl> <dbl> <dbl>
1 1 0.924 1
2 2 0.661 3
3 3 0.402 3
4 4 0.637 3
5 5 0.353 1
I would like to add another column, d, holding the value of b from the row whose a equals the value in column c. The resulting data frame should look like this:
a b c d
<dbl> <dbl> <dbl> <dbl>
1 1 0.924 1 0.924
2 2 0.661 3 0.402
3 3 0.402 3 0.402
4 4 0.637 3 0.402
5 5 0.353 1 0.924
Thanks for looking!
Use c to index the desired row of b:
tib %>% mutate(d = b[c])
a b c d
<dbl> <dbl> <dbl> <dbl>
1 1 0.924 1 0.924
2 2 0.661 3 0.402
3 3 0.402 3 0.402
4 4 0.637 3 0.402
5 5 0.353 1 0.924
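Indexing with b[c] works here because a coincides with the row number. A minimal sketch for the more general case, assuming a holds arbitrary unique keys, looks the key up with match() first. The data frame tib_demo and its values are made up for illustration.

```r
library(dplyr)

# Toy data in which `a` is an arbitrary key, not the row number (illustrative values).
tib_demo <- data.frame(
  a = c(10, 20, 30),
  b = c(0.5, 0.7, 0.9),
  c = c(30, 10, 30)
)

# match(c, a) gives, for each row, the position in `a` of the key stored in `c`;
# that position is then used to pick the corresponding value of `b`.
res <- tib_demo %>% mutate(d = b[match(c, a)])
res
```

If a value in c has no counterpart in a, match() returns NA and so does d for that row.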

R function to spread rows into columns [duplicate]

Spread row values into columns
Data looks like
> head(df2)
ID fungi Conc Abs date_no
1 1 R3 2.500000 0.209 0
22 1 R3 1.250000 0.153 0
43 1 R3 0.625000 0.159 0
64 1 R3 0.312500 0.164 0
85 1 R3 0.156250 0.157 0
106 1 R3 0.078125 0.170 0
And I used this function, which spread the date column into three columns but didn't populate them correctly.
separate_DF <- spread(df2, "date_no", "Abs")
What I get is this...
> head(df3)
ID fungi Conc date_no_0 date_no_1 date_no_3
1 1 R3 0.01953125 0.162 NA NA
2 1 R3 0.03906253 0.169 NA NA
3 1 R3 0.07812500 0.170 NA NA
4 1 R3 0.15625000 0.157 NA NA
5 1 R3 0.31250000 0.164 NA NA
6 1 R3 0.62500000 0.159 NA NA
What I want is for the three date columns to be populated by the Abs values, with each fungi at each concentration in its own row.
Try this one,
library(tidyr)
txt <- "fungi date Abs Conc
1 1 x 2.5
1 2 x 2.5
1 3 x 2.5
2 1 x 2.5
2 2 x 2.5
2 3 x 2.5"
date_df <- read.table(textConnection(txt), header = TRUE)
print(spread(date_df, date, Abs, sep=""))
fungi Conc date1 date2 date3
1 1 2.5 x x x
2 2 2.5 x x x
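In current tidyr, spread() is superseded by pivot_wider(). The sketch below shows the same reshaping with pivot_wider() on made-up data; the data frame long_demo and its values are assumptions for illustration, not the asker's df2.

```r
library(tidyr)

# Toy long-format data; values are made up for illustration.
long_demo <- data.frame(
  fungi   = rep(c("R3", "R4"), each = 3),
  Conc    = 2.5,
  date_no = rep(c(0, 1, 3), times = 2),
  Abs     = c(0.209, 0.215, 0.230, 0.153, 0.160, 0.171)
)

# One output row per fungi/Conc combination, one column per date_no value.
wide_demo <- pivot_wider(
  long_demo,
  names_from   = date_no,
  values_from  = Abs,
  names_prefix = "date_no_"
)
wide_demo
```

Every column not named in names_from or values_from identifies a row in the output, so an extra column that differs between the rows to be merged keeps them apart and leaves NA in the other date columns.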

R: add a dplyr group label as a number [duplicate]

I cannot get my head around what must be a simple task: how do I get a group label as a consecutive number?
df <- data.frame(id = sample(c('a','b'), 20, T),
name = sample(c('N1', 'N2', 'N3'), 20, T),
val = runif(20)) %>%
group_by(id) %>%
arrange(id, name)
What I want is a label group_no that numbers the categories of the variable name within each id dplyr group. I cannot find a solution in the dplyr package itself. Something like this:
# A tibble: 20 x 4
# Groups: id [2]
id name val group_no
<fct> <fct> <dbl> <int>
1 a N1 0.647 1
2 a N1 0.530 1
3 a N1 0.245 1
4 a N2 0.693 2
5 a N2 0.478 2
6 a N2 0.861 2
7 a N3 0.821 3
8 a N3 0.0995 3
9 a N3 0.662 3
10 b N1 0.553 1
11 b N1 0.0233 1
12 b N1 0.519 1
13 b N2 0.783 2
14 b N2 0.789 2
15 b N2 0.477 2
16 b N2 0.438 2
17 b N2 0.407 2
18 b N3 0.732 3
19 b N3 0.0707 3
20 b N3 0.316 3
Note that the values of name could be anything and are certainly not normally suffixed by a number as in the example (otherwise I could do sub("^N", "", df$name)).
I am looking for something a little different than the 1:n() solution in SO posts such as here.
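I think in this case something as simple as:
df %>%
  mutate(group_no = as.integer(factor(name)))
will work (the factor() call only matters if name is stored as character, which is the data.frame default from R 4.0.0 onwards)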
# A tibble: 20 x 4
# Groups: id [2]
id name val group_no
<fct> <fct> <dbl> <int>
1 a N1 0.647 1
2 a N1 0.530 1
3 a N1 0.245 1
4 a N2 0.693 2
5 a N2 0.478 2
6 a N2 0.861 2
7 a N3 0.821 3
8 a N3 0.0995 3
9 a N3 0.662 3
10 b N1 0.553 1
11 b N1 0.0233 1
12 b N1 0.519 1
13 b N2 0.783 2
14 b N2 0.789 2
15 b N2 0.477 2
16 b N2 0.438 2
17 b N2 0.407 2
18 b N3 0.732 3
19 b N3 0.0707 3
20 b N3 0.316 3
We can do
df %>%
group_by(id) %>%
mutate(group_no = cumsum(c(TRUE, name[-1] != name[-n()])))
Or with match
df %>%
group_by(id) %>%
mutate(group_no = match(name, unique(name)))
# A tibble: 20 x 4
# Groups: id [2]
# id name val group_no
# <fct> <fct> <dbl> <int>
# 1 a N1 0.647 1
# 2 a N1 0.530 1
# 3 a N1 0.245 1
# 4 a N2 0.693 2
# 5 a N2 0.478 2
# 6 a N2 0.861 2
# 7 a N3 0.821 3
# 8 a N3 0.0995 3
# 9 a N3 0.662 3
#10 b N1 0.553 1
#11 b N1 0.0233 1
#12 b N1 0.519 1
#13 b N2 0.783 2
#14 b N2 0.789 2
#15 b N2 0.477 2
#16 b N2 0.438 2
#17 b N2 0.407 2
#18 b N3 0.732 3
#19 b N3 0.0707 3
#20 b N3 0.316 3
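A self-contained sketch comparing the match and cumsum approaches on made-up data whose names carry no numeric suffix. The data frame df_demo and its values are assumptions for illustration, and the data are already sorted by id and name, which the cumsum version relies on.

```r
library(dplyr)

# Toy data, already sorted by id and name; the names are deliberately not numbered.
df_demo <- data.frame(
  id   = c("a", "a", "a", "b", "b", "b", "b"),
  name = c("x", "x", "y", "y", "y", "z", "z"),
  val  = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)
)

res <- df_demo %>%
  group_by(id) %>%
  mutate(
    # position of each name among the distinct names seen within the id group
    by_match  = match(name, unique(name)),
    # counter that increases whenever name differs from the previous row
    by_cumsum = cumsum(c(TRUE, name[-1] != name[-n()]))
  ) %>%
  ungroup()
res
```

Group b starts at 1 even though it contains no "x" rows, because the numbering is computed within each id group.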
Here is a solution that uses left_join.
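df %>%
  left_join(df %>%
              distinct(id, name) %>%                 # one row per id/name combination
              group_by(id) %>%
              mutate(group_no = row_number()),       # number the names within each id
            by = c("id", "name"))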
