How to remove duplicates from data? [closed]

I have data where I take a measurement of length each day (a repeated-measures analysis). However, there are duplicates in my data and I want to get rid of them. Here is the data (df3):
Replicate # Day length_avg
<chr> <chr> <dbl>
1 1 D1 0.663
2 1 D1 0.663
3 1 D1 0.663
4 1 D1 0.663
5 1 D2 0.688
6 1 D2 0.688
7 1 D2 0.688
8 1 D2 0.688
9 1 D3 0.692
10 1 D3 0.692
11 1 D3 0.692
12 1 D3 0.692
13 1 D4 0.691
14 1 D4 0.691
I want to have only one value for D1, D2, D3 etc. I also have 19 Replicates.
I tried grouping by Day:
df4 <- df3 %>%
  group_by(Day) %>%
  slice(1L)
but that keeps only one value for each Day across the whole data set, not per Replicate. I need all the D1, D2, ..., D19 values for each Replicate, but instead I am getting:
Replicate # Day length_avg
<chr> <chr> <dbl>
1 1 D1 0.663
2 1 D10 0.668
3 1 D11 0.688
4 1 D12 0.682
5 1 D13 0.636
6 4 D14 0.658
7 4 D15 0.667
8 4 D16 0.664
9 4 D17 0.662
10 1 D2 0.688
11 1 D3 0.692
12 1 D4 0.691
13 1 D5 0.687
14 1 D6 0.683
15 1 D7 0.686
16 1 D8 0.678
17 1 D9 0.697

The problem is the grouping, not the slicing: grouping by Day alone keeps one row per Day across all Replicates. Group by both Replicate and Day, then take the first row of each group:
library(dplyr)
df4 <- df3 %>%
  group_by(Replicate, Day) %>%  # use backticks if the column is literally named `Replicate #`
  slice_head(n = 1) %>%
  ungroup()
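Equivalently, dplyr::distinct() does the de-duplication in one step. A minimal sketch with toy data shaped like df3 (assuming the replicate column is named Replicate; use backticks if it is literally `Replicate #`):

```r
library(dplyr)

# toy data shaped like df3: each Replicate/Day measurement appears four times
df3 <- data.frame(
  Replicate  = "1",
  Day        = rep(c("D1", "D2"), each = 4),
  length_avg = rep(c(0.663, 0.688), each = 4)
)

# keep the first row of every Replicate/Day combination;
# .keep_all = TRUE retains the remaining columns (length_avg)
df4 <- distinct(df3, Replicate, Day, .keep_all = TRUE)
df4
#   Replicate Day length_avg
# 1         1  D1      0.663
# 2         1  D2      0.688
```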

Related

Is it possible to fit a Partial Credit Model when one of the possible responses is never selected for one of the items?

I'm fitting a Partial Credit Model (PCM) with the ltm package.
Suppose my data contain 3 items, each scored 1, 2 or 3, like this one:
my_data<-data.frame(
X1 = c(1,1,3,1,1,3,1,3,1,1,3,3,3,3,3,3,3,3,1,3,3,3,3,1,1,3,3,3,3,3,3,3,3,1,3,3,3,1,1,3),
X2 = c(1,1,2,3,2,3,2,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,3,3,3,3,3,3,2,2,2,2,2,2,2,2,3,2,1,1),
X3 = c(2,1,2,2,3,3,2,3,1,2,1,1,1,3,2,2,1,1,1,2,3,1,3,3,2,3,1,2,1,1,1,3,2,2,1,1,1,2,2,1)
)
But it happens that no one has chosen option 2 on the first item:
lapply(my_data, table)
$X1
1 3
13 27
$X2
1 2 3
4 20 16
$X3
1 2 3
17 14 9
Now, when I run ltm::gpcm() to fit the model and factor.scores() to examine person abilities, I get the following output:
library('ltm')
fit<-gpcm(my_data, constraint='rasch')
factor.scores(fit)
Call:
gpcm(data = my_data, constraint = "rasch")
Scoring Method: Empirical Bayes
Factor-Scores for observed response patterns:
X1 X2 X3 Obs Exp z1 se.z1
1 1 1 1 1 1.578 -1.414 0.744
2 1 1 2 2 0.486 -0.880 0.718
3 1 2 1 1 4.228 -0.880 0.718
4 1 2 2 3 2.209 -0.379 0.700
5 1 2 3 1 0.787 0.104 0.694
6 1 3 1 1 1.546 -0.379 0.700
7 1 3 2 3 1.343 0.104 0.694
8 1 3 3 1 0.793 0.591 0.705
9 2 1 1 1 1.159 -0.880 0.718
10 2 2 1 8 5.267 -0.379 0.700
11 2 2 2 5 4.573 0.104 0.694
12 2 2 3 2 2.701 0.591 0.705
13 2 3 1 5 3.201 0.104 0.694
14 2 3 2 1 4.607 0.591 0.705
15 2 3 3 5 4.597 1.107 0.737
It looks like X1 is treated as if it had two possible responses, "1" and "2", not "1" and "3"!
Is there any way to include the unobserved response "2" for X1?
Why is this important?
It's all about scoring. Look at lines 2 and 9 above:
Line 2 is a respondent who scored 1, 1 and 2 (respectively on X1, X2 and X3).
Line 9 is a respondent who scored 3, 1 and 1 (since X1=3 in the original dataset is recoded to X1=2 by the ltm package).
Those two people have:
exactly the same person-ability score assigned (column z1),
different raw scores (4 and 5, respectively),
which should not happen.
To be precise: I understand why this happens. My question is how to overcome this behaviour.

Flag run-length of grouped intervals

I have a dataframe grouped by grp:
df <- data.frame(
v = rnorm(25),
grp = c(rep("A",10), rep("B",15)),
size = 2)
I want to flag the run-length of intervals determined by size. For example, for grp == "A", size is 2, and the number of rows is 10. So the interval should have length 10/2 = 5. This code, however, creates intervals with length 2:
df %>%
group_by(grp) %>%
mutate(
interval = (row_number() -1) %/% size)
# A tibble: 25 × 4
# Groups: grp [2]
v grp size interval
<dbl> <chr> <dbl> <dbl>
1 -0.166 A 2 0
2 -1.12 A 2 0
3 0.941 A 2 1
4 -0.913 A 2 1
5 0.486 A 2 2
6 -1.80 A 2 2
7 -0.370 A 2 3
8 -0.209 A 2 3
9 -0.661 A 2 4
10 -0.177 A 2 4
# … with 15 more rows
How can I flag the correct run-length of the size-determined intervals? The desired output is this:
# A tibble: 25 × 4
# Groups: grp [2]
v grp size interval
<dbl> <chr> <dbl> <dbl>
1 -0.166 A 2 0
2 -1.12 A 2 0
3 0.941 A 2 0
4 -0.913 A 2 0
5 0.486 A 2 0
6 -1.80 A 2 1
7 -0.370 A 2 1
8 -0.209 A 2 1
9 -0.661 A 2 1
10 -0.177 A 2 1
# … with 15 more rows
If I interpreted your question correctly, this small change should do the trick?
df %>%
group_by(grp) %>%
mutate(
interval = (row_number() -1) %/% (n()/size))
You can use gl:
df %>%
group_by(grp) %>%
mutate(interval = gl(first(size), ceiling(n() / first(size)))[1:n()])
Output:
# A tibble: 26 × 4
# Groups: grp [2]
v grp size interval
<dbl> <chr> <dbl> <fct>
1 -1.12 A 2 1
2 3.04 A 2 1
3 0.235 A 2 1
4 -0.0333 A 2 1
5 -2.73 A 2 1
6 -0.0998 A 2 1
7 0.976 A 2 2
8 0.414 A 2 2
9 0.912 A 2 2
10 1.98 A 2 2
11 1.17 A 2 2
12 -0.509 B 2 1
13 0.704 B 2 1
14 -0.198 B 2 1
15 -0.538 B 2 1
16 -2.86 B 2 1
17 -0.790 B 2 1
18 0.488 B 2 1
19 2.17 B 2 1
20 0.501 B 2 2
21 0.620 B 2 2
22 -0.966 B 2 2
23 0.163 B 2 2
24 -2.08 B 2 2
25 0.485 B 2 2
26 0.697 B 2 2
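For reference on the trick above: base R's gl(n, k) generates a factor with n levels, each repeated k times in consecutive blocks, and indexing it with [1:n()] simply truncates the last block when the group size is not an exact multiple. A quick illustration:

```r
# gl(n, k): factor with n levels, each repeated k times in order
gl(2, 5)
# [1] 1 1 1 1 1 2 2 2 2 2
# Levels: 1 2

# truncating keeps the consecutive-block structure
gl(2, 5)[1:7]
# [1] 1 1 1 1 1 2 2
# Levels: 1 2
```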

Tidyverse mutate: use a value from one column to reference another column's value, possibly in a different row

I have created a tibble thus:
library(tidyverse)
set.seed(68)
a <- c(1, 2, 3, 4, 5)
b <- runif(5)
c <- c(1, 3, 3, 3, 1)
tib <- tibble(a, b, c)
which produces this
tib
# A tibble: 5 x 3
a b c
<dbl> <dbl> <dbl>
1 1 0.924 1
2 2 0.661 3
3 3 0.402 3
4 4 0.637 3
5 5 0.353 1
I would like to add another column, d, which is the value of b according to the a value given in column c. The resulting data frame should look thus:
a b c d
<dbl> <dbl> <dbl> <dbl>
1 1 0.924 1 0.924
2 2 0.661 3 0.402
3 3 0.402 3 0.402
4 4 0.637 3 0.402
5 5 0.353 1 0.924
Thanks for looking!
Use c to index the desired row of b:
tib %>% mutate(d = b[c])
a b c d
<dbl> <dbl> <dbl> <dbl>
1 1 0.924 1 0.924
2 2 0.661 3 0.402
3 3 0.402 3 0.402
4 4 0.637 3 0.402
5 5 0.353 1 0.924
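This works because R subsetting is vectorized: b[c] looks up the elements of b at the positions stored in c, element by element. Note that it treats c as row positions, which is valid here only because a is exactly 1..n; if a held arbitrary IDs, b[match(c, a)] would be the safer lookup. A base-R illustration (using idx to avoid masking the c() function):

```r
b   <- c(0.924, 0.661, 0.402, 0.637, 0.353)
idx <- c(1, 3, 3, 3, 1)

# returns b[1], b[3], b[3], b[3], b[1]
b[idx]
# [1] 0.924 0.402 0.402 0.402 0.924

# equivalent general lookup when 'a' is an arbitrary key column
a <- 1:5
b[match(idx, a)]
# [1] 0.924 0.402 0.402 0.402 0.924
```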

R function to spread rows into columns [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 3 years ago.
Spread row values into columns
Data looks like
> head(df2)
ID fungi Conc Abs date_no
1 1 R3 2.500000 0.209 0
22 1 R3 1.250000 0.153 0
43 1 R3 0.625000 0.159 0
64 1 R3 0.312500 0.164 0
85 1 R3 0.156250 0.157 0
106 1 R3 0.078125 0.170 0
I used this function, which spread the date column into three columns but didn't populate them correctly:
separate_DF <- spread(df2, "date_no", "Abs")
What I get is this...
> head(df3)
ID fungi Conc date_no_0 date_no_1 date_no_3
1 1 R3 0.01953125 0.162 NA NA
2 1 R3 0.03906253 0.169 NA NA
3 1 R3 0.07812500 0.170 NA NA
4 1 R3 0.15625000 0.157 NA NA
5 1 R3 0.31250000 0.164 NA NA
6 1 R3 0.62500000 0.159 NA NA
What I want is for the three date columns to be populated with the Abs values, and each fungus at each concentration to be its own row.
Try this one,
library(tidyr)
txt <- "fungi date Abs Conc
1 1 x 2.5
1 2 x 2.5
1 3 x 2.5
2 1 x 2.5
2 2 x 2.5
2 3 x 2.5
"
date_df <- read.table(textConnection(txt), header = TRUE)
print(spread(date_df, date, Abs, sep=""))
Result:
fungi Conc date1 date2 date3
1 1 2.5 x x x
2 2 2.5 x x x
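As a side note, spread() has been superseded by pivot_wider() since tidyr 1.0; a sketch of the same reshape with the toy data from the answer above:

```r
library(tidyr)

date_df <- data.frame(
  fungi = rep(1:2, each = 3),
  date  = rep(1:3, times = 2),
  Abs   = "x",
  Conc  = 2.5
)

# names_prefix rebuilds the date1/date2/date3 column names;
# all remaining columns (fungi, Conc) identify the rows
wide <- pivot_wider(date_df,
                    names_from   = date,
                    values_from  = Abs,
                    names_prefix = "date")
wide
# # A tibble: 2 × 5
#   fungi  Conc date1 date2 date3
```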

R: add a dplyr group label as a number [duplicate]

This question already has answers here:
R - Group by variable and then assign a unique ID [duplicate]
(3 answers)
How to create a consecutive group number
(13 answers)
Closed 4 years ago.
I cannot get my head around this seemingly simple task: how do I get a group label as a consecutive number?
library(dplyr)
set.seed(1)
df <- data.frame(id = sample(c('a','b'), 20, T),
name = sample(c('N1', 'N2', 'N3'), 20, T),
val = runif(20)) %>%
group_by(id) %>%
arrange(id, name)
What I want is a label group_no that numbers the categories of the variable name consecutively within each id group. I cannot find a solution in the dplyr package itself. Something like this:
# A tibble: 20 x 4
# Groups: id [2]
id name val group_no
<fct> <fct> <dbl> <int>
1 a N1 0.647 1
2 a N1 0.530 1
3 a N1 0.245 1
4 a N2 0.693 2
5 a N2 0.478 2
6 a N2 0.861 2
7 a N3 0.821 3
8 a N3 0.0995 3
9 a N3 0.662 3
10 b N1 0.553 1
11 b N1 0.0233 1
12 b N1 0.519 1
13 b N2 0.783 2
14 b N2 0.789 2
15 b N2 0.477 2
16 b N2 0.438 2
17 b N2 0.407 2
18 b N3 0.732 3
19 b N3 0.0707 3
20 b N3 0.316 3
Note that the values of name could be anything and certainly are not always suffixed by a number as in the example (otherwise I could just do sub("^N", "", df$name)).
I am looking for something a little different than the 1:n() solution in SO posts such as here.
I think in this case something as simple as:
df %>%
  mutate(group_no = as.integer(name))
will work:
# A tibble: 20 x 4
# Groups: id [2]
id name val group_no
<fct> <fct> <dbl> <int>
1 a N1 0.647 1
2 a N1 0.530 1
3 a N1 0.245 1
4 a N2 0.693 2
5 a N2 0.478 2
6 a N2 0.861 2
7 a N3 0.821 3
8 a N3 0.0995 3
9 a N3 0.662 3
10 b N1 0.553 1
11 b N1 0.0233 1
12 b N1 0.519 1
13 b N2 0.783 2
14 b N2 0.789 2
15 b N2 0.477 2
16 b N2 0.438 2
17 b N2 0.407 2
18 b N3 0.732 3
19 b N3 0.0707 3
20 b N3 0.316 3
We can do
df %>%
group_by(id) %>%
mutate(group_no = cumsum(c(TRUE, name[-1] != name[-n()])))
Or with match
df %>%
group_by(id) %>%
mutate(group_no = match(name, unique(name)))
# A tibble: 20 x 4
# Groups: id [2]
# id name val group_no
# <fct> <fct> <dbl> <int>
# 1 a N1 0.647 1
# 2 a N1 0.530 1
# 3 a N1 0.245 1
# 4 a N2 0.693 2
# 5 a N2 0.478 2
# 6 a N2 0.861 2
# 7 a N3 0.821 3
# 8 a N3 0.0995 3
# 9 a N3 0.662 3
#10 b N1 0.553 1
#11 b N1 0.0233 1
#12 b N1 0.519 1
#13 b N2 0.783 2
#14 b N2 0.789 2
#15 b N2 0.477 2
#16 b N2 0.438 2
#17 b N2 0.407 2
#18 b N3 0.732 3
#19 b N3 0.0707 3
#20 b N3 0.316 3
Here is a solution that uses left_join, joining back a lookup table with one row per id/name combination:
df %>%
  left_join(df %>%
              distinct(id, name) %>%
              group_by(id) %>%
              mutate(group_no = row_number()))
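Since df is already arranged by id and name, another option is dplyr::dense_rank(). A sketch with small toy data: note that dense_rank numbers by sort order rather than order of appearance, which coincide here because the data are sorted.

```r
library(dplyr)

df <- data.frame(id   = rep(c("a", "b"), each = 4),
                 name = c("N1", "N1", "N2", "N3",
                          "N1", "N2", "N2", "N3"))

# dense_rank gives consecutive integers with no gaps, restarting per id group
out <- df %>%
  group_by(id) %>%
  mutate(group_no = dense_rank(name)) %>%
  ungroup()
out$group_no
# [1] 1 1 2 3 1 2 2 3
```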
