data.table efficiently finding common pairs between 2 columns - r

Say I have a data frame:
    subject stim1 stim2 feedback
1      1003    50    51        1
2      1003    48    50        1
3      1003    49    51        1
4      1003    47    49        1
5      1003    47    46        1
6      1003    46    48        1
10     1003    50    48        1
428    1003    48    51        0
433    1003    46    50        0
434    1003    50    49        0
435    1003    54    59        0
I want to create a new column "transitive_pair", computed within each subject (column 1).
It should be filled only for rows in which feedback==0 (starting at row 428); for all other rows transitive_pair=NaN.
For each of those rows I want a boolean that tells me whether there is any chain of pairings (using only rows in which feedback==1) that transitively links the stim1 and stim2 values.
Working out a few examples:
Row 428: stim1=48 and stim2=51.
48 and 51 are not paired directly, but 51 was paired with 50 (row 1) and 50 was paired with 48 (row 10), so transitive_pair[428]=True.
Row 433: stim1=46 and stim2=50.
46 and 48 were paired (row 6) and 48 was paired with 50 (row 2), so transitive_pair[433]=True.
Row 435: stim1=54 and stim2=59.
There is no chain of pairs that could link them (59 is not paired with anything while feedback==1), so transitive_pair[435]=False.
Desired output:
    subject stim1 stim2 feedback transitive_pair
1      1003    50    51        1             NaN
2      1003    48    50        1             NaN
3      1003    49    51        1             NaN
4      1003    47    49        1             NaN
5      1003    47    46        1             NaN
6      1003    46    48        1             NaN
10     1003    50    48        1             NaN
428    1003    48    51        0               1
433    1003    46    50        0               1
434    1003    50    49        0               1
435    1003    54    59        0               0
Any help would be greatly appreciated!
Here is a reproducible data frame:
df1 <- structure(list(subject = c(1003L, 1003L, 1003L, 1003L, 1003L,
1003L, 1003L, 1003L, 1003L, 1003L, 1003L), stim1 = c(50L, 48L,
49L, 47L, 47L, 46L, 50L, 48L, 46L, 50L, 54L), stim2 = c(51L,
50L, 51L, 49L, 46L, 48L, 48L, 51L, 50L, 49L, 59L), feedback = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), transitive_pair = c(NaN,
NaN, NaN, NaN, NaN, NaN, NaN, 1, 1, 1, 0)), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 10L, 428L, 433L, 434L, 435L), class = "data.frame")

The columns "stim1" and "stim2" define an undirected graph. Build the graph from the rows with feedback == 1, get its connected components and, for each row of the data.frame, check whether the values of "stim1" and "stim2" belong to the same component. A value that never appears in a feedback == 1 row is not in the graph, so such a row gets 0. Finally, assign NaN to the rows where feedback is 1.
suppressPackageStartupMessages(library(igraph))
inx <- df1$feedback == 1
g <- graph_from_data_frame(df1[inx, c("stim1", "stim2")], directed = FALSE)
plot(g)
g_comp <- components(g)$membership
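# For each row, look up the component of stim1 and of stim2
df1$transitive_pair_2 <- apply(df1[c("stim1", "stim2")], 1, \(x) {
  i <- names(g_comp) == x[1]
  j <- names(g_comp) == x[2]
  if (any(i) && any(j)) {
    as.integer(g_comp[i] == g_comp[j])  # 1 if same component, else 0
  } else {
    0L  # at least one stimulus is not in the graph
  }
})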
df1$transitive_pair_2[inx] <- NaN
df1
#>     subject stim1 stim2 feedback transitive_pair transitive_pair_2
#> 1      1003    50    51        1             NaN               NaN
#> 2      1003    48    50        1             NaN               NaN
#> 3      1003    49    51        1             NaN               NaN
#> 4      1003    47    49        1             NaN               NaN
#> 5      1003    47    46        1             NaN               NaN
#> 6      1003    46    48        1             NaN               NaN
#> 10     1003    50    48        1             NaN               NaN
#> 428    1003    48    51        0               1                 1
#> 433    1003    46    50        0               1                 1
#> 434    1003    50    49        0               1                 1
#> 435    1003    54    59        0               0                 0
Created on 2022-07-31 by the reprex package (v2.0.1)
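If igraph is not available, the same connected-components idea can be sketched in base R with a small union-find. This is only an illustration under assumptions: the helper name component_labels is made up here, the data are the question's example for a single subject, and with several subjects the same steps would have to be run per subject (for example via split()).

```r
# Example data from the question (one subject only).
df1 <- data.frame(
  subject  = 1003L,
  stim1    = c(50L, 48L, 49L, 47L, 47L, 46L, 50L, 48L, 46L, 50L, 54L),
  stim2    = c(51L, 50L, 51L, 49L, 46L, 48L, 48L, 51L, 50L, 49L, 59L),
  feedback = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L)
)

# Hypothetical helper: label each node with the root of its component.
# 'from' and 'to' are the two endpoints of every undirected edge.
component_labels <- function(from, to) {
  nodes  <- as.character(unique(c(from, to)))
  parent <- setNames(nodes, nodes)          # every node starts as its own root
  find_root <- function(v) {
    while (parent[[v]] != v) v <- parent[[v]]
    v
  }
  for (k in seq_along(from)) {
    a <- find_root(as.character(from[k]))
    b <- find_root(as.character(to[k]))
    if (a != b) parent[[a]] <- b            # merge the two components
  }
  vapply(nodes, find_root, character(1))    # named vector: node -> root
}

# Only feedback == 1 rows contribute edges.
edges <- df1[df1$feedback == 1, ]
comp  <- component_labels(edges$stim1, edges$stim2)

# Stimuli that never appear in an edge have no label (NA) and cannot be linked.
c1 <- comp[as.character(df1$stim1)]
c2 <- comp[as.character(df1$stim2)]
linked <- !is.na(c1) & !is.na(c2) & c1 == c2

df1$transitive_pair <- ifelse(df1$feedback == 1, NaN, as.numeric(linked))
df1
```

The loop-based find_root is fine for a handful of stimuli; for large graphs igraph's components() is likely the better choice.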

Related

Merge 2 data frames using common date, plus 2 rows before and n-1 rows after

I need to merge 2 data frames.
The first data frame contains dates in YYYY-mm-dd format and event lengths:
datetime length
2003-06-03 1
2003-06-07 1
2003-06-13 1
2003-06-17 3
2003-06-28 5
2003-07-10 1
2003-07-23 1
...
The second data frame contains dates in the same format and discharge data:
datetime q
2003-05-29 36.2
2003-05-30 34.6
2003-05-31 33.1
2003-06-01 30.7
2003-06-02 30.0
2003-06-03 153.0
2003-06-04 69.0
...
The second data frame is much larger.
I want to merge/join only the following rows of the second data frame to the first:
all rows that have the same date as a row in the first data frame (I know this can be done with left_join(df1, df2, by = c("datetime"))),
the two rows before that row,
the n-1 rows after that row, where n is the "length" value of the row in the first data frame.
I would also like to identify which rows belong to the same event.
Ideally I would have the following output (notice the event from 2003-06-17):
EventDatesNancy length q event#
2003-06-03 1 153.0 1
2003-06-07 1 120.0 2
2003-06-13 1 45.3 3
2003-06-15 na 110.0 4
2003-06-16 na 53.1 4
2003-06-17 3 78.0 4
2003-06-18 na 167.0 4
2003-06-19 na 145.0 4
...
I hope this makes clear what I am trying to do.
This might be one approach using tidyverse and fuzzyjoin.
First, number the events in your first data.frame and add two columns for the start and end dates of each event window (the start date is 2 days before the event date; the end date is length - 1 days after it).
Then use fuzzy_inner_join to pull the matching rows from the second data.frame: keep the rows whose datetime falls on or after the start date and on or before the end date.
library(tidyverse)
library(fuzzyjoin)

# Number the events and compute the window of dates to pull for each one
df1$event <- seq_len(nrow(df1))
df1$start_date <- df1$datetime - 2
df1$end_date <- df1$datetime + df1$length - 1

# Keep the df2 rows whose date lies inside an event window (inclusive)
fuzzy_inner_join(
  df1,
  df2,
  by = c("start_date" = "datetime", "end_date" = "datetime"),
  match_fun = c(`<=`, `>=`)
) %>%
  select(datetime.y, length, q, event)
I tried this out with some made up data:
R> df1
datetime length
1 2003-06-03 1
2 2003-06-12 1
3 2003-06-21 1
4 2003-06-30 3
5 2003-07-09 5
6 2003-07-18 1
7 2003-07-27 1
8 2003-08-05 2
9 2003-08-14 1
10 2003-08-23 1
11 2003-09-01 3
R> df2
datetime q
1 2003-06-03 44
2 2003-06-04 52
3 2003-06-05 34
4 2003-06-06 20
5 2003-06-07 57
6 2003-06-08 67
7 2003-06-09 63
8 2003-06-10 51
9 2003-06-11 56
10 2003-06-12 37
11 2003-06-13 16
12 2003-06-14 54
13 2003-06-15 46
14 2003-06-16 6
15 2003-06-17 32
16 2003-06-18 91
17 2003-06-19 61
18 2003-06-20 42
19 2003-06-21 28
20 2003-06-22 98
21 2003-06-23 77
22 2003-06-24 81
23 2003-06-25 13
24 2003-06-26 15
25 2003-06-27 73
26 2003-06-28 38
27 2003-06-29 27
28 2003-06-30 49
29 2003-07-01 10
30 2003-07-02 89
31 2003-07-03 9
32 2003-07-04 80
33 2003-07-05 68
34 2003-07-06 26
35 2003-07-07 31
36 2003-07-08 29
37 2003-07-09 84
38 2003-07-10 60
39 2003-07-11 19
40 2003-07-12 97
41 2003-07-13 35
42 2003-07-14 47
43 2003-07-15 70
This will give the following output:
datetime.y length q event
1 2003-06-03 1 44 1
2 2003-06-10 1 51 2
3 2003-06-11 1 56 2
4 2003-06-12 1 37 2
5 2003-06-19 1 61 3
6 2003-06-20 1 42 3
7 2003-06-21 1 28 3
8 2003-06-28 3 38 4
9 2003-06-29 3 27 4
10 2003-06-30 3 49 4
11 2003-07-01 3 10 4
12 2003-07-02 3 89 4
13 2003-07-07 5 31 5
14 2003-07-08 5 29 5
15 2003-07-09 5 84 5
16 2003-07-10 5 60 5
17 2003-07-11 5 19 5
18 2003-07-12 5 97 5
19 2003-07-13 5 35 5
If the output desired is different than above, please let me know what should be different so that I can correct it.
Data
df1 <- structure(list(datetime = structure(c(12206, 12215, 12224, 12233,
12242, 12251, 12260, 12269, 12278, 12287, 12296), class = "Date"),
length = c(1, 1, 1, 3, 5, 1, 1, 2, 1, 1, 3), event = 1:11,
start_date = structure(c(12204, 12213, 12222, 12231, 12240,
12249, 12258, 12267, 12276, 12285, 12294), class = "Date"),
end_date = structure(c(12206, 12215, 12224, 12235, 12246,
12251, 12260, 12270, 12278, 12287, 12298), class = "Date")), row.names = c(NA,
-11L), class = "data.frame")
df2 <- structure(list(datetime = structure(c(12206, 12207, 12208, 12209,
12210, 12211, 12212, 12213, 12214, 12215, 12216, 12217, 12218,
12219, 12220, 12221, 12222, 12223, 12224, 12225, 12226, 12227,
12228, 12229, 12230, 12231, 12232, 12233, 12234, 12235, 12236,
12237, 12238, 12239, 12240, 12241, 12242, 12243, 12244, 12245,
12246, 12247, 12248), class = "Date"), q = c(44L, 52L, 34L, 20L,
57L, 67L, 63L, 51L, 56L, 37L, 16L, 54L, 46L, 6L, 32L, 91L, 61L,
42L, 28L, 98L, 77L, 81L, 13L, 15L, 73L, 38L, 27L, 49L, 10L, 89L,
9L, 80L, 68L, 26L, 31L, 29L, 84L, 60L, 19L, 97L, 35L, 47L, 70L
)), class = "data.frame", row.names = c(NA, -43L))
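The same window logic can also be sketched without fuzzyjoin, by expanding each event into its dates and using two ordinary merges. This is a rough base R illustration, not the method used above: the tiny df1 and df2 below are made up (they are not the question's data), and it assumes one row per day in df2 and event windows that do not overlap. Unlike the fuzzy join, it leaves length as NA on the rows that are not the event date itself, as in the desired output of the question.

```r
# Small made-up inputs: two events and 20 days of discharge values.
df1 <- data.frame(
  datetime = as.Date(c("2003-06-03", "2003-06-10")),
  length   = c(1, 3)
)
df2 <- data.frame(
  datetime = seq(as.Date("2003-06-01"), by = "day", length.out = 20),
  q        = 1:20
)

# One row per date in each event window:
# from 2 days before the event date through length - 1 days after it.
windows <- do.call(rbind, lapply(seq_len(nrow(df1)), function(i) {
  data.frame(
    datetime = seq(df1$datetime[i] - 2,
                   df1$datetime[i] + df1$length[i] - 1,
                   by = "day"),
    event    = i
  )
}))

# Attach q to every window date, then attach length only on the event dates.
out <- merge(windows, df2, by = "datetime")
out <- merge(out, df1, by = "datetime", all.x = TRUE)
out
```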

filling in missing data using fitted value in R

I have a dataframe like this:
ID year age wage
1 2 1981 22 10000
2 2 1982 23 11000
3 2 1983 24 11500
4 2 1984 25 11000
5 2 1985 26 14000
6 2 1986 27 16000
7 2 1987 28 20000
8 2 1988 29 19000
9 2 1989 30 20000
10 2 1990 31 20000
11 2 1991 32 22000
12 2 1992 33 25000
13 2 1993 34 0
14 2 1994 35 NA
15 2 1995 36 0
16 2 1996 37 NA
17 2 1997 38 0
18 2 1998 39 NA
19 2 1999 40 0
20 2 2000 41 NA
21 2 2001 42 0
22 2 2002 43 NA
23 2 2003 44 0
24 2 2004 45 NA
25 2 2005 46 5500
26 2 2006 47 NA
27 2 2007 48 5000
28 2 2008 49 NA
29 2 2009 50 6000
30 2 2010 51 NA
31 2 2011 52 19000
32 2 2012 53 NA
33 2 2013 54 21000
34 2 2014 55 NA
35 2 2015 56 23000
36 3 1984 22 1300
37 3 1985 23 0
38 3 1986 24 1500
39 3 1987 25 1000
40 3 1988 26 0
I want to use an individual-specific regression of wage on age and age-squared to impute missing wage observations. I want to only impute when at least 5 non-missing observations are available.
As suggested by jay.sf, I tried the following but with fitted values:
df_imp <- do.call(rbind,
                  by(df, df$ID, function(x) {
                    IDs <- which(is.na(x$wage))
                    if (length(x$wage[- IDs]) >= 5) {
                      b <- lm(wage ~ poly(age, 2, raw=TRUE), x)$fitted.values
                      x$wage[IDs] <- with(x, b)[IDs]
                    }
                    return(x)
                  }))
I got the following results:
ID year age wage
36 2 1981 22 10000.000
37 2 1982 23 11000.000
38 2 1983 24 11500.000
39 2 1984 25 11000.000
40 2 1985 26 14000.000
41 2 1986 27 16000.000
42 2 1987 28 20000.000
43 2 1988 29 19000.000
44 2 1989 30 20000.000
45 2 1990 31 20000.000
46 2 1991 32 22000.000
47 2 1992 33 25000.000
48 2 1993 34 0.000
49 2 1994 35 7291.777
50 2 1995 36 0.000
51 2 1996 37 6779.133
52 2 1997 38 0.000
53 2 1998 39 7591.597
54 2 1999 40 0.000
55 2 2000 41 9729.168
56 2 2001 42 0.000
57 2 2002 43 13191.847
58 2 2003 44 0.000
59 2 2004 45 17979.633
60 2 2005 46 5500.000
61 2 2006 47 NA
62 2 2007 48 5000.000
63 2 2008 49 NA
64 2 2009 50 6000.000
65 2 2010 51 NA
66 2 2011 52 19000.000
67 2 2012 53 NA
68 2 2013 54 21000.000
69 2 2014 55 NA
70 2 2015 56 23000.000
You could use a simple if statement, without an else. Define an index vector IDs that identifies the missing values; use it to count the non-missing observations and to subset your Y column wage.
For this you can use by(), which splits your data similarly to split() but also applies a function to each piece; just rbind the result.
It's probably wiser to use the coefficients rather than the fitted values: with the default na.action, lm() drops the rows where Y is NA, so the fitted values cover only the complete rows and no longer line up with the positions in IDs. You also need raw=TRUE in poly(), so that the coefficients apply directly to age and age^2.
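DF.imp <- do.call(rbind,
                  by(DF, DF$ID, function(x) {
                    IDs <- which(is.na(x$wage))  # positions of missing wages
                    # impute only with at least 5 non-missing observations
                    if (length(x$wage[- IDs]) >= 5) {
                      b <- lm(wage ~ poly(age, 2, raw=TRUE), x)$coefficients
                      # evaluate the fitted quadratic at the missing rows
                      x$wage[IDs] <- with(x, (b[1] + b[2]*age + b[3]*age^2))[IDs]
                    }
                    return(x)
                  }))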
Note that I've slightly changed your example data, so that ID 3 also has missings, but less than 5 non-missings.
Result
DF.imp
# ID year age wage
# 2.1 2 1981 22 10000.000
# 2.2 2 1982 23 11000.000
# 2.3 2 1983 24 11500.000
# 2.4 2 1984 25 11000.000
# 2.5 2 1985 26 14000.000
# 2.6 2 1986 27 16000.000
# 2.7 2 1987 28 20000.000
# 2.8 2 1988 29 19000.000
# 2.9 2 1989 30 20000.000
# 2.10 2 1990 31 20000.000
# 2.11 2 1991 32 22000.000
# 2.12 2 1992 33 25000.000
# 2.13 2 1993 34 0.000
# 2.14 2 1994 35 7626.986
# 2.15 2 1995 36 0.000
# 2.16 2 1996 37 7039.387
# 2.17 2 1997 38 0.000
# 2.18 2 1998 39 6783.065
# 2.19 2 1999 40 0.000
# 2.20 2 2000 41 6858.020
# 2.21 2 2001 42 0.000
# 2.22 2 2002 43 7264.252
# 2.23 2 2003 44 0.000
# 2.24 2 2004 45 8001.761
# 2.25 2 2005 46 5500.000
# 2.26 2 2006 47 9070.546
# 2.27 2 2007 48 5000.000
# 2.28 2 2008 49 10470.609
# 2.29 2 2009 50 6000.000
# 2.30 2 2010 51 12201.948
# 2.31 2 2011 52 19000.000
# 2.32 2 2012 53 14264.565
# 2.33 2 2013 54 21000.000
# 2.34 2 2014 55 16658.458
# 2.35 2 2015 56 23000.000
# 3.36 3 1984 22 1300.000
# 3.37 3 1985 23 NA
# 3.38 3 1986 24 1500.000
# 3.39 3 1987 25 1000.000
# 3.40 3 1988 26 NA
Data
DF <- structure(list(ID = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(1981L,
1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L,
1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 1999L,
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L,
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 1984L, 1985L,
1986L, 1987L, 1988L), age = c(22L, 23L, 24L, 25L, 26L, 27L, 28L,
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L,
42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L,
55L, 56L, 22L, 23L, 24L, 25L, 26L), wage = c(10000L, 11000L,
11500L, 11000L, 14000L, 16000L, 20000L, 19000L, 20000L, 20000L,
22000L, 25000L, 0L, NA, 0L, NA, 0L, NA, 0L, NA, 0L, NA, 0L, NA,
5500L, NA, 5000L, NA, 6000L, NA, 19000L, NA, 21000L, NA, 23000L,
1300L, NA, 1500L, 1000L, NA)), row.names = c(NA, -40L), class = "data.frame")
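As a variation on the coefficient arithmetic, predict() with newdata evaluates the fitted quadratic at the missing rows directly. The sketch below is only an illustration: impute_wage, min_obs and the toy data are made up here, and the toy wages follow an exact quadratic (wage = age^2) purely so that the imputed value is easy to check by eye.

```r
# Hypothetical helper: impute missing wages within one individual's rows.
impute_wage <- function(x, min_obs = 5) {
  miss <- is.na(x$wage)
  # Only impute when something is missing and enough observations remain.
  if (any(miss) && sum(!miss) >= min_obs) {
    fit <- lm(wage ~ poly(age, 2, raw = TRUE), data = x)
    # predict() evaluates the fit at the rows passed in newdata.
    x$wage[miss] <- predict(fit, newdata = x[miss, , drop = FALSE])
  }
  x
}

# Toy data: ID 1 has 6 observed wages, ID 2 has only 3.
toy <- data.frame(
  ID   = c(rep(1L, 7), rep(2L, 4)),
  age  = c(1:7, 1:4),
  wage = c(1, 4, 9, NA, 25, 36, 49, 1, NA, 9, 16)
)

toy_imp <- do.call(rbind, lapply(split(toy, toy$ID), impute_wage))
toy_imp
```

In this toy example the missing wage of ID 1 (age 4) is filled with a value of about 16, while the missing wage of ID 2 stays NA because only 3 observations are available.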

Cross join two dataframes by key column using condition in R

I have two dataframes.
mydata1=structure(list(ID_WORKES = c(58005854L, 58005854L, 58002666L,
58002666L), ID_SP_NAR = c(463L, 1951L, 21L, 465L), KOD_DEPO = c(3786L,
3786L, 1439L, 1439L), KOD_DOR = c(58L, 58L, 92L, 92L), COLUMN_MASH = c(6L,
6L, 5L, 5L), prop_violations = structure(c(1L, 2L, 2L, 2L), .Label = c("0.2",
"1"), class = "factor"), mash_score = c(0L, 2L, 2L, 2L)), .Names = c("ID_WORKES",
"ID_SP_NAR", "KOD_DEPO", "KOD_DOR", "COLUMN_MASH", "prop_violations",
"mash_score"), class = "data.frame", row.names = c(NA, -4L))
mydata2=structure(list(ID_SP_NAR = c(463L, 1951L, 21L, 465L, 500L, 600L
)), .Names = "ID_SP_NAR", class = "data.frame", row.names = c(NA,
-6L))
I need to cross join these data frames by ID_SP_NAR. mydata2 contains only the key variable ID_SP_NAR.
I need to join them in such a way that if an ID_WORKES value does not have some of the ID_SP_NAR codes from mydata2, those codes are inserted into the dataset, with zero values in prop_violations and mash_score.
I.e. ID_SP_NAR in mydata2 has these values:
ID_SP_NAR
463
1951
21
465
500
600
ID_WORKES = 58005854 has
463,
1951
but not the others,
and
ID_WORKES = 58002666 has 21 and 465 and no others.
So the desired output after the cross join is:
ID_WORKES ID_SP_NAR KOD_DEPO KOD_DOR COLUMN_MASH prop_violations mash_score
1 58005854 463 3786 58 6 0.2 0
2 58005854 1951 3786 58 6 1 2
3 58005854 21 3786 58 6 0 0
4 58005854 465 3786 58 6 0 0
5 58005854 500 3786 58 6 0 0
6 58005854 600 3786 58 6 0 0
7 58002666 21 1439 92 5 1 2
8 58002666 465 1439 92 5 1 2
9 58002666 500 1439 92 5 0 0
10 58002666 600 1439 92 5 0 0
11 58002666 463 1439 92 5 0 0
12 58002666 1951 1439 92 5 0 0
KOD_DEPO, KOD_DOR and COLUMN_MASH have a fixed value for each ID_WORKES; it must be kept too.
How can I do that?
merge(mydata1, mydata2, by = "ID_SP_NAR") does not work (I also tried a left join, which doesn't work either); it doesn't insert zeros as I want.
We could use complete from tidyr to expand the dataset based on 'ID_WORKES' and the values of 'ID_SP_NAR' in the second dataset, then fill the fixed columns within each 'ID_WORKES'
library(tidyverse)
mydata1 %>%
  mutate_if(is.factor, as.character) %>%
  complete(ID_WORKES, ID_SP_NAR = mydata2$ID_SP_NAR,
           fill = list(prop_violations = '0', mash_score = 0)) %>%
  group_by(ID_WORKES) %>%   # fill within each worker, not across workers
  fill(KOD_DEPO, KOD_DOR, COLUMN_MASH, .direction = "downup") %>%
  ungroup()
# A tibble: 12 x 7
#   ID_WORKES ID_SP_NAR KOD_DEPO KOD_DOR COLUMN_MASH prop_violations mash_score
#       <int>     <int>    <int>   <int>       <int> <chr>                <dbl>
# 1  58002666        21     1439      92           5 1                        2
# 2  58002666       463     1439      92           5 0                        0
# 3  58002666       465     1439      92           5 1                        2
# 4  58002666       500     1439      92           5 0                        0
# 5  58002666       600     1439      92           5 0                        0
# 6  58002666      1951     1439      92           5 0                        0
# 7  58005854        21     3786      58           6 0                        0
# 8  58005854       463     3786      58           6 0.2                      0
# 9  58005854       465     3786      58           6 0                        0
#10  58005854       500     3786      58           6 0                        0
#11  58005854       600     3786      58           6 0                        0
#12  58005854      1951     3786      58           6 1                        2
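For comparison, the expansion can also be sketched in base R: build every worker/code combination with a Cartesian merge(), then left-join the known scores and replace the gaps with zeros. This is an illustrative sketch only; the intermediate names workers, grid and scores are made up here, and prop_violations is kept as character, as in the tidyr version.

```r
mydata1 <- data.frame(
  ID_WORKES = c(58005854L, 58005854L, 58002666L, 58002666L),
  ID_SP_NAR = c(463L, 1951L, 21L, 465L),
  KOD_DEPO = c(3786L, 3786L, 1439L, 1439L),
  KOD_DOR = c(58L, 58L, 92L, 92L),
  COLUMN_MASH = c(6L, 6L, 5L, 5L),
  prop_violations = c("0.2", "1", "1", "1"),
  mash_score = c(0L, 2L, 2L, 2L),
  stringsAsFactors = FALSE
)
mydata2 <- data.frame(ID_SP_NAR = c(463L, 1951L, 21L, 465L, 500L, 600L))

# One row per worker, with the columns that are constant within a worker.
workers <- unique(mydata1[c("ID_WORKES", "KOD_DEPO", "KOD_DOR", "COLUMN_MASH")])

# merge() with by = NULL returns the Cartesian product: every worker x every code.
grid <- merge(workers, mydata2, by = NULL)

# Left-join the known scores; combinations without a score come back as NA.
scores <- mydata1[c("ID_WORKES", "ID_SP_NAR", "prop_violations", "mash_score")]
out <- merge(grid, scores, by = c("ID_WORKES", "ID_SP_NAR"), all.x = TRUE)

# Replace the gaps with zeros.
out$prop_violations[is.na(out$prop_violations)] <- "0"
out$mash_score[is.na(out$mash_score)] <- 0L
out
```

Because the fixed columns travel with each worker in the grid, no fill step is needed afterwards.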

R delete first and last x % of rows

I have a data frame with 3 ID variables, then several values for each ID.
user Log Pass Value
2 2 123 342
2 2 123 543
2 2 123 231
2 2 124 257
2 2 124 342
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 342
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
The start and end of each set of values is sometimes noisy, and I want to be able to delete the first and last few values. Unfortunately the number of values varies significantly, but it is always the first and last 20% of values that are noisy.
I want to delete the first and last 20% of rows in each set, with a minimum of 1 row deleted at each end.
So for instance if there are 20 values for user 2 log 2 pass 123 I want to delete the first and last 4 rows. If there are only 3 values for the ID variables I want to delete the first and last row.
The resulting dataset would be:
user Log Pass Value
2 2 123 543
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
I've tried fiddling around with nrow but I struggle to figure out how to reference the % of rows by id variable.
Thanks.
Jonathan.
I believe the following can do it.
DATA.
dat <-
structure(list(user = c(2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), Log = c(2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L), Pass = c(123L, 123L, 123L, 124L, 124L, 125L, 125L,
125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L,
125L, 125L, 125L), Value = c(342L, 543L, 231L, 257L, 342L, 543L,
231L, 257L, 342L, 543L, 231L, 257L, 543L, 231L, 257L, 543L, 231L,
257L, 543L, 231L, 257L)), .Names = c("user", "Log", "Pass", "Value"
), class = "data.frame", row.names = c(NA, -21L))
CODE.
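fun <- function(x, p = 0.20){
  n <- nrow(x)
  m <- max(1, round(n*p))                   # rows to drop at each end, at least 1
  inx <- c(seq_len(m), n - seq_len(m) + 1)  # first m and last m row numbers
  x[-inx, ]
}
# note: the data are split by user only
result <- do.call(rbind, lapply(split(dat, dat$user), fun))
row.names(result) <- NULL
result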
# user Log Pass Value
#1 2 2 123 543
#2 2 2 123 231
#3 2 2 124 257
#4 4 3 125 342
#5 4 3 125 543
#6 4 3 125 231
#7 4 3 125 257
#8 4 3 125 543
#9 4 3 125 231
#10 4 3 125 257
#11 4 3 125 543
#12 4 3 125 231
#13 4 3 125 257
Would something like this help?
For a dataframe df:
df[-c(1:floor(nrow(df)*0.2), (1+ceiling(nrow(df)*0.8)):nrow(df)),]
This just removes the first and last 20%, rounding down at the start and up at the end so that for smaller data frames you keep some of the information:
> df<-data.frame(a=1:100)
> df[-c(1:floor(nrow(df)*0.2),(1+ceiling(nrow(df)*0.8)):nrow(df)),]
[1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[31] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
> df<-data.frame(1:3)
> df[-c(1:floor(nrow(df)*0.2),(1+ceiling(nrow(df)*0.8)):nrow(df)),]
[1] 2
You can do this with dplyr...
library(dplyr)
df2 <- df %>% group_by(user, Log, Pass) %>%
filter(n()>2) %>% #remove those with just two elements or fewer
slice(max(2, 1+ceiling(n()*0.2)):min(n()-1, floor(0.8*n())))
df2
user Log Pass Value
1 2 2 123 543
2 4 3 125 543
3 4 3 125 231
4 4 3 125 257
5 4 3 125 543
6 4 3 125 231
7 4 3 125 257
8 4 3 125 543
9 4 3 125 231
Calculate the index of the first row you want to retain (this assumes the data frame has at least 3 rows):
rem <- ceiling( nrow( dat ) * .2 ) + 1
Then keep only the rows between the two trimmed ends:
dat <- dat[ rem : ( nrow( dat ) - rem + 1 ), ]
Here is an idea using base R that returns the row indices of each user to keep and then subsets on these indices.
idx <- unlist(lapply(split(seq_along(dat[["user"]]), dat[["user"]]), function(x) {
                       tmp <- max(1, ceiling(.2 * length(x)))
                       tail(head(x, -tmp), -tmp)}),
              use.names=FALSE)
split(seq_along(dat[["user"]]), dat[["user"]]) returns a list of the row numbers for each user. lapply loops through these, calculating the number of rows to drop from each end with max(1, ceiling(.2 * length(x))), and then dropping them with tail(head(x, -tmp), -tmp). Since lapply returns a named list, this is unlisted and the names are dropped.
This returns
idx
2 3 4 10 11 12 13 14 15 16 17
Now subset
dat[idx,]
user Log Pass Value
2 2 2 123 543
3 2 2 123 231
4 2 2 124 257
10 4 3 125 543
11 4 3 125 231
12 4 3 125 257
13 4 3 125 543
14 4 3 125 231
15 4 3 125 257
16 4 3 125 543
17 4 3 125 231
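The question groups by all three ID columns (user, Log, Pass) and its desired output corresponds to rounding the 20% up. A base R sketch of that variant follows; trim_ends is a made-up helper name, and returning no rows for a group too small to trim on both sides is an assumption that happens to match the desired output (the two-row Pass 124 group disappears).

```r
dat <- data.frame(
  user  = c(rep(2L, 5), rep(4L, 16)),
  Log   = c(rep(2L, 5), rep(3L, 16)),
  Pass  = c(rep(123L, 3), rep(124L, 2), rep(125L, 16)),
  Value = c(342L, 543L, 231L, 257L, 342L, 543L, 231L, 257L, 342L, 543L, 231L,
            257L, 543L, 231L, 257L, 543L, 231L, 257L, 543L, 231L, 257L)
)

# Hypothetical helper: drop the first and last p share of rows (at least 1 each).
trim_ends <- function(x, p = 0.20) {
  n <- nrow(x)
  m <- max(1, ceiling(n * p))                    # rows to drop at each end
  if (2 * m >= n) return(x[0, , drop = FALSE])   # nothing left to keep
  x[(m + 1):(n - m), , drop = FALSE]
}

# Split by every ID column; drop = TRUE skips combinations that do not occur.
groups <- split(dat, list(dat$user, dat$Log, dat$Pass), drop = TRUE)
result <- do.call(rbind, lapply(groups, trim_ends))
row.names(result) <- NULL
result
```

On the example data this keeps 9 rows: the middle value of Pass 123 and rows 5 to 12 of the 16 rows of Pass 125.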

Change data set from wide to long while retaining group id, and also gathering columns [duplicate]

I'd really appreciate some help getting this messy set of new survey data into a usable form. It was collected in a strange way and now I've got strange data to work with. I've looked through tidyr and tried those approaches to no avail. I suspect my problem is that I'm thinking about this dataset all wrong and I'm blind to some real answer. But given all the things I need to do to this df, I can't figure out where to start and thus where to start googling.
What I need:
For each person to be their own row
Each person retains their GroupID and Treated value
For the variables currently attached to each person individually to become columns (age, weight, height)
Fake (and much smaller) data:
x <- structure(list(GroupID = 1:5, Treated = c("Y", "Y", "N", "Y",
"N"), person1_age = c(45L, 33L, 71L, 19L, 52L), person1_weight = c(187L,
145L, 136L, 201L, 168L), person1_height = c(69L, 64L, 51L, 70L,
66L), person2_age = c(54L, 20L, 48L, 63L, 26L), person2_weight = c(140L,
122L, 186L, 160L, 232L), person2_height = c(62L, 70L, 65L, 72L,
74L), person3_age = c(21L, 56L, 40L, 59L, 67L), person3_weight = c(112L,
143L, 187L, 194L, 159L), person3_height = c(61L, 69L, 73L, 63L,
72L)), .Names = c("GroupID", "Treated", "person1_age", "person1_weight",
"person1_height", "person2_age", "person2_weight", "person2_height",
"person3_age", "person3_weight", "person3_height"), row.names = c(NA,
5L), class = "data.frame")
Any help or further readings you could point me to would be very much appreciated.
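reshape can do this, with the appropriate arguments. Pass varying as a list with one vector of column names per long-format variable, in the same order as v.names:
> reshape(x, direction="long",
+         varying=list(grep("_age$", names(x), value=TRUE),
+                      grep("_weight$", names(x), value=TRUE),
+                      grep("_height$", names(x), value=TRUE)),
+         v.names=c('age', 'weight', 'height'), timevar='person')
    GroupID Treated person age weight height id
1.1       1       Y      1  45    187     69  1
2.1       2       Y      1  33    145     64  2
3.1       3       N      1  71    136     51  3
4.1       4       Y      1  19    201     70  4
5.1       5       N      1  52    168     66  5
1.2       1       Y      2  54    140     62  1
2.2       2       Y      2  20    122     70  2
3.2       3       N      2  48    186     65  3
4.2       4       Y      2  63    160     72  4
5.2       5       N      2  26    232     74  5
1.3       1       Y      3  21    112     61  1
2.3       2       Y      3  56    143     69  2
3.3       3       N      3  40    187     73  3
4.3       4       Y      3  59    194     63  4
5.3       5       N      3  67    159     72  5
Passing varying as a list makes the mapping between wide columns and long variables explicit. With a plain character vector, reshape has to work out the grouping itself and can attach the wrong names to the value columns, so check a few rows against the original data.
Within each list element the columns must be in person order (person1, person2, person3); grep returns them in the order in which they appear in the data. These are the columns being reshaped:
> names(x)[3:11]
[1] "person1_age"    "person1_weight" "person1_height" "person2_age"    "person2_weight" "person2_height"
[7] "person3_age"    "person3_weight" "person3_height"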
We can also use melt from data.table which can take multiple patterns in the measure argument
library(data.table)
melt(setDT(x), measure = patterns("age$", "weight$", "height$"),
variable.name = "person", value.name = c("age", "weight", "height"))
# GroupID Treated person age weight height
# 1: 1 Y 1 45 187 69
# 2: 2 Y 1 33 145 64
# 3: 3 N 1 71 136 51
# 4: 4 Y 1 19 201 70
# 5: 5 N 1 52 168 66
# 6: 1 Y 2 54 140 62
# 7: 2 Y 2 20 122 70
# 8: 3 N 2 48 186 65
# 9: 4 Y 2 63 160 72
#10: 5 N 2 26 232 74
#11: 1 Y 3 21 112 61
#12: 2 Y 3 56 143 69
#13: 3 N 3 40 187 73
#14: 4 Y 3 59 194 63
#15: 5 N 3 67 159 72
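To see what both answers are doing, the reshaping can also be written out by hand: for each person, pick that person's three columns, rename them to the common names, and stack the pieces. This is only an illustrative base R sketch; measures and long are names made up here, and it assumes the personN_measure naming of the example with exactly three people.

```r
x <- data.frame(
  GroupID = 1:5,
  Treated = c("Y", "Y", "N", "Y", "N"),
  person1_age = c(45L, 33L, 71L, 19L, 52L),
  person1_weight = c(187L, 145L, 136L, 201L, 168L),
  person1_height = c(69L, 64L, 51L, 70L, 66L),
  person2_age = c(54L, 20L, 48L, 63L, 26L),
  person2_weight = c(140L, 122L, 186L, 160L, 232L),
  person2_height = c(62L, 70L, 65L, 72L, 74L),
  person3_age = c(21L, 56L, 40L, 59L, 67L),
  person3_weight = c(112L, 143L, 187L, 194L, 159L),
  person3_height = c(61L, 69L, 73L, 63L, 72L)
)

measures <- c("age", "weight", "height")

# For each person: keep the group columns, add the person number,
# and rename that person's columns to the shared measure names.
long <- do.call(rbind, lapply(1:3, function(p) {
  cols <- paste0("person", p, "_", measures)
  data.frame(x[c("GroupID", "Treated")], person = p,
             setNames(x[cols], measures))
}))
long
```

Because each wide column is matched to its long name by pasting the names together, the mapping does not depend on the order of the columns in x.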
