Creating differences in a new column for certain dates in R

I have a data frame that looks like this:
Date    Value1  Value2  Value3
1997Q1     100     130     120
1997Q1     100     130     124
1997Q1     120     136     154
1997Q2     180     145     154
1997Q2     186     134     126
1997Q2     186     124     176
1997Q3     190     143     176
1997Q3     192     143     123
I would like to calculate differences for each value column within the same date, for example the differences in the Value1 column for 1997Q1, then 1997Q2 and so on.
I would like these differences to be shown in new columns, so that the result would look something like this:
Date    Value1  Value2  Value3  Diff Val1  Diff Val2  Diff Val3
1997Q1     100     130     120          0          0          4
1997Q1     100     130     124         20          6         30
1997Q1     120     136     154        N/A        N/A        N/A
1997Q2     180     145     154          6        -11        -28
1997Q2     186     134     126          0         10         50
1997Q2     186     124     176        N/A        N/A        N/A
1997Q3     190     143     176          2          0        -53
1997Q3     192     143     123        N/A        N/A        N/A

You can use dplyr functions for this. The ~ .x - lead(.x) is the function applied to every value column, selected with starts_with: we take the current value minus the next value. If you need the lagged difference instead, switch it around: ~ lag(.x) - .x.
library(dplyr)
df1 %>%
  group_by(Date) %>%
  mutate(across(starts_with("Value"), ~ .x - lead(.x), .names = "diff_{.col}"))
If the values are numeric and the column names are not easily selected by name, you can use mutate(across(where(is.numeric), ~ .x - lead(.x), .names = "diff_{.col}")) instead.
# A tibble: 8 × 7
# Groups:   Date [3]
  Date   Value1 Value2 Value3 diff_Value1 diff_Value2 diff_Value3
  <chr>   <int>  <int>  <int>       <int>       <int>       <int>
1 1997Q1    100    130    120           0           0          -4
2 1997Q1    100    130    124         -20          -6         -30
3 1997Q1    120    136    154          NA          NA          NA
4 1997Q2    180    145    154          -6          11          28
5 1997Q2    186    134    126           0          10         -50
6 1997Q2    186    124    176          NA          NA          NA
7 1997Q3    190    143    176          -2           0          53
8 1997Q3    192    143    123          NA          NA          NA
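If instead you want the differences with the sign used in the question's expected output (next value minus the current one), flip the expression; a minimal variant of the same call, using the df1 defined below:
df1 %>%
  group_by(Date) %>%
  mutate(across(starts_with("Value"), ~ lead(.x) - .x, .names = "diff_{.col}"))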
data:
df1 <- structure(list(Date = c("1997Q1", "1997Q1", "1997Q1", "1997Q2",
"1997Q2", "1997Q2", "1997Q3", "1997Q3"), Value1 = c(100L, 100L,
120L, 180L, 186L, 186L, 190L, 192L), Value2 = c(130L, 130L, 136L,
145L, 134L, 124L, 143L, 143L), Value3 = c(120L, 124L, 154L, 154L,
126L, 176L, 176L, 123L)), class = "data.frame", row.names = c(NA,
-8L))

Splitting data.frame into matrices and multiplying the diagonal elements to produce a new column

Here is my data structure:
structure(list(a = c(57L, 39L, 31L, 70L, 8L, 93L, 68L, 85L),
b = c(161L, 122L, 101L, 104L, 173L, 192L, 110L, 152L)), class = "data.frame", row.names = c(NA,
-8L))
Each two rows represent a separate matrix, for example:
a b
<int> <int>
1 57 161
2 39 122
I want to multiply the first row's a by the second row's b and save it into a variable called c, then repeat the operation with the first row's b and the second row's a and save that into c as well.
For one matrix, the desired output is like this:
a b c
<int> <int> <dbl>
1 57 161 6954
2 39 122 6279
For the whole data, the desired output is like this:
a b c
<int> <int> <dbl>
1 57 161 6954
2 39 122 6279
3 31 101 3224
4 70 104 7070
5 8 173 1536
6 93 192 16089
7 68 110 10336
8 85 152 9350
Base R functions would be much better.
Thanks in advance.
We can create a grouping variable with gl:
library(dplyr)
df1 %>%
  group_by(grp = as.integer(gl(n(), 2, n()))) %>%
  mutate(c = a * rev(b)) %>%
  ungroup() %>%
  select(-grp)
Output:
# A tibble: 8 × 3
a b c
<int> <int> <int>
1 57 161 6954
2 39 122 6279
3 31 101 3224
4 70 104 7070
5 8 173 1536
6 93 192 16089
7 68 110 10336
8 85 152 9350
Or with ave from base R
df1$c <- with(df1, a * ave(b, as.integer(gl(length(b), 2, length(b))), FUN = rev))
df1$c
[1] 6954 6279 3224 7070 1536 16089 10336 9350
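For reference, this is the grouping vector that gl(n(), 2, n()) builds for the 8 rows here; each consecutive pair of rows shares a group, so rev(b) within a group swaps the pair's b values:
as.integer(gl(8, 2, 8))
#[1] 1 1 2 2 3 3 4 4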
Here's another way:
inds <- seq(nrow(df1))
df1$c <- df1$a * df1$b[inds + rep(c(1, -1), length.out = nrow(df1))]
df1
# a b c
#1 57 161 6954
#2 39 122 6279
#3 31 101 3224
#4 70 104 7070
#5 8 173 1536
#6 93 192 16089
#7 68 110 10336
#8 85 152 9350
Explanation:
We create an alternating 1 and -1 vector and add it to the row numbers to get the index of the corresponding b value to multiply with a.
inds
#[1] 1 2 3 4 5 6 7 8
rep(c(1, -1), length.out = nrow(df1))
#[1] 1 -1 1 -1 1 -1 1 -1
inds + rep(c(1, -1), length.out = nrow(df1))
#[1] 2 1 4 3 6 5 8 7
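A hedged alternative for building the same swapped index in one step (this assumes an even number of rows, as in the data above):
swap <- as.vector(rbind(seq(2, nrow(df1), 2), seq(1, nrow(df1), 2)))
swap
#[1] 2 1 4 3 6 5 8 7
df1$c <- df1$a * df1$b[swap]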

Subsetting a list of data frames by condition

Sorry, I can't embed pictures yet.
I have 21 data frames in a list (listb), all with the same headings of Timestamp and Rainfall.
I would like to sort them by Rainfall (descending) and then subset the top 30 rows (including the corresponding Timestamp) of each of the 21 data frames, then put them back into a single data frame with the name of the initial data frame as an identifier.
Please find the list of data frames below, and a small cut from the b1 data frame.
Would I need to create a new data frame for each of the new subsets and then combine them into a list later?
Descending_b1 <- listb$b1[order(-listb$b1$Rainfall), ]
b1_30 <- Descending_b1[1:30, 1:2]
From that, I produce the following
b1_30 <- structure(list(Timestamp = c("25/1/2013", "24/1/2013", "2/2/2004",
"21/3/2010", "16/7/2016", "1/2/2010", "26/1/2007", "29/12/1998",
"24/2/2008", "5/2/2003", "6/2/2003", "11/11/2001", "3/12/2010",
"8/3/2020", "27/12/2010", "29/1/1998", "18/10/2017", "13/3/2007",
"5/4/2006", "10/6/2006", "19/11/2008", "20/2/2015", "26/3/2014",
"15/3/2017", "27/8/2011", "1/3/2013", "27/8/1998", "11/2/2012",
"11/2/2008", "26/1/2013"),
Rainfall = c(238L, 158L, 131L, 131L,129L, 122L, 112L, 109L, 101L, 94L,
92L, 88L, 82L, 81L, 78L, 74L, 71L, 69L, 65L, 64L, 64L,
64L, 63L, 63L, 62L, 61L, 60L, 60L, 58L,57L)),
row.names = c(5915L, 5914L, 2640L, 4874L, 7183L, 4826L, 3725L, 939L, 4118L, 2278L, 2279L, 1827L, 5131L, 8514L, 5155L,
605L, 7642L, 3771L, 3429L, 3495L, 4387L, 6671L, 6340L, 7425L,
5398L, 5950L, 815L, 5566L, 4105L, 5916L), class = "data.frame")
b1_30
#> Timestamp Rainfall
#> 5915 25/1/2013 238
#> 5914 24/1/2013 158
#> 2640 2/2/2004 131
#> 4874 21/3/2010 131
#> 7183 16/7/2016 129
#> 4826 1/2/2010 122
#> 3725 26/1/2007 112
#> 939 29/12/1998 109
#> 4118 24/2/2008 101
#> 2278 5/2/2003 94
#> 2279 6/2/2003 92
#> 1827 11/11/2001 88
#> 5131 3/12/2010 82
#> 8514 8/3/2020 81
#> 5155 27/12/2010 78
#> 605 29/1/1998 74
#> 7642 18/10/2017 71
#> 3771 13/3/2007 69
#> 3429 5/4/2006 65
#> 3495 10/6/2006 64
#> 4387 19/11/2008 64
#> 6671 20/2/2015 64
#> 6340 26/3/2014 63
#> 7425 15/3/2017 63
#> 5398 27/8/2011 62
#> 5950 1/3/2013 61
#> 815 27/8/1998 60
#> 5566 11/2/2012 60
#> 4105 11/2/2008 58
#> 5916 26/1/2013 57
So I hope to do that with the rest of the data frames within the list, creating a new data frame for each whilst keeping the initial data frame name, and then combine them into a new list.
Suppose you have a list like this (dplyr and lubridate are needed for the code below; days() comes from lubridate):
library(dplyr)
library(lubridate)
set.seed(2021)
listb <- list(b1 = data.frame(Timestamp = as.Date("2010-01-01") + days(sample(1:100, 10)),
                              Rainfall = sample(200:300, 10)),
              b2 = data.frame(Timestamp = as.Date("2010-01-01") + days(sample(1:100, 10)),
                              Rainfall = sample(200:300, 10)),
              b3 = data.frame(Timestamp = as.Date("2010-01-01") + days(sample(1:100, 10)),
                              Rainfall = sample(200:300, 10)))
> listb
$b1
Timestamp Rainfall
1 2010-01-08 275
2 2010-02-08 250
3 2010-02-16 259
4 2010-02-28 217
5 2010-01-13 298
6 2010-03-12 202
7 2010-03-06 245
8 2010-04-10 225
9 2010-03-11 235
10 2010-01-24 285
$b2
Timestamp Rainfall
1 2010-02-01 242
2 2010-04-09 258
3 2010-01-20 269
4 2010-03-10 285
5 2010-03-28 298
6 2010-01-06 262
7 2010-03-15 278
8 2010-03-05 233
9 2010-02-08 221
10 2010-01-19 215
$b3
Timestamp Rainfall
1 2010-03-21 216
2 2010-03-30 240
3 2010-01-18 230
4 2010-01-21 272
5 2010-03-10 292
6 2010-04-05 226
7 2010-03-14 210
8 2010-03-25 235
9 2010-03-09 237
10 2010-01-03 278
Now you only need to do this (needless to say, replace the n argument in slice_max with your desired n = 30):
purrr::map2_dfr(listb, names(listb), ~ .x %>%
                  mutate(list_name = .y) %>%
                  slice_max(Rainfall, n = 5))
Timestamp Rainfall list_name
1 2010-01-13 298 b1
2 2010-01-24 285 b1
3 2010-01-08 275 b1
4 2010-02-16 259 b1
5 2010-02-08 250 b1
6 2010-03-28 298 b2
7 2010-03-10 285 b2
8 2010-03-15 278 b2
9 2010-01-20 269 b2
10 2010-01-06 262 b2
11 2010-03-10 292 b3
12 2010-01-03 278 b3
13 2010-01-21 272 b3
14 2010-03-30 240 b3
15 2010-03-09 237 b3
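A hedged alternative sketch that builds the same combined output with a plain lapply plus dplyr::bind_rows; the list_name column again comes from the names of the list, and top_rain is just an illustrative name (replace 30 with whatever n you need):
library(dplyr)
top_rain <- bind_rows(
  lapply(listb, function(x) head(x[order(-x$Rainfall), ], 30)),
  .id = "list_name"
)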
If you want to return the output back into a similar list:
purrr::map(listb, ~ .x %>%
             slice_max(Rainfall, n = 5))
$b1
Timestamp Rainfall
1 2010-01-13 298
2 2010-01-24 285
3 2010-01-08 275
4 2010-02-16 259
5 2010-02-08 250
$b2
Timestamp Rainfall
1 2010-03-28 298
2 2010-03-10 285
3 2010-03-15 278
4 2010-01-20 269
5 2010-01-06 262
$b3
Timestamp Rainfall
1 2010-03-10 292
2 2010-01-03 278
3 2010-01-21 272
4 2010-03-30 240
5 2010-03-09 237

Insert rows based on difference between value from col A row N and col B row N+1

I have data with an example as follows (I use R):
A B C
1 2 Background
3 19 Background
26 41 person
43 69 person
83 97 Background
107 129 Background
132 179 Background
189 235 Background
243 258 Background
261 279 person
I would like to add a row wherever the difference between col A of row N+1 and col B of row N is greater than 1, with col C of the new row getting a label (e.g. 'other'). So the data would look like this:
A B C
1 2 Background
3 19 Background
20 25 other
26 41 person
43 69 person
70 82 other
83 97 Background
98 106 other
107 129 Background
130 131 other
132 179 Background
180 188 other
189 235 Background
236 242 other
243 258 Background
259 260 other
261 279 person
Thanks!
Here is one way using base R, assuming the 4th row A value is 42 (and not 43).
#Find out row indices where difference of A value for N + 1 row and
#B value in N row is not equal to 1.
inds <- which(tail(df$A, -1) - head(df$B, -1) != 1)
#Create a dataframe which we want to insert in the current dataframe
#using values from A and B column and inds indices
include_df <- data.frame(A = df$B[inds] + 1, B = df$A[inds + 1] - 1, C = 'other',
                         stringsAsFactors = FALSE)
#Repeat rows at inds to make space to insert new rows
df <- df[sort(c(seq_len(nrow(df)), inds)), ]
#Insert the new rows in their respective position
df[inds + seq_along(inds), ] <- include_df
#Remove row names
row.names(df) <- NULL
df
# A B C
#1 1 2 Background
#2 3 19 Background
#3 20 25 other
#4 26 41 person
#5 42 69 person
#6 70 82 other
#7 83 97 Background
#8 98 106 other
#9 107 129 Background
#10 130 131 other
#11 132 179 Background
#12 180 188 other
#13 189 235 Background
#14 236 242 other
#15 243 258 Background
#16 259 260 other
#17 261 279 person
data
df <- structure(list(A = c(1, 3, 26, 42, 83, 107, 132, 189, 243, 261
), B = c(2L, 19L, 41L, 69L, 97L, 129L, 179L, 235L, 258L, 279L
), C = c("Background", "Background", "person", "person", "Background",
"Background", "Background", "Background", "Background", "person"
)), row.names = c(NA, -10L), class = "data.frame")
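For comparison, a hedged dplyr sketch of the same idea (compute the gaps from lead(A) - B, build the 'other' rows, then bind and sort); this uses the same df with 42 in the 4th row:
library(dplyr)
gaps <- df %>%
  mutate(next_A = lead(A)) %>%           # A value of the following row
  filter(next_A - B > 1) %>%             # keep only rows followed by a gap
  transmute(A = B + 1, B = next_A - 1, C = "other")
bind_rows(df, gaps) %>% arrange(A)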
An option using data.table, with the same data edit as in the answer above (an A value of 42 in the 4th row):
ix <- DT[shift(A, -1L) - B > 1L, which = TRUE]
rbindlist(list(DT,
               data.table(A = DT$B[ix] + 1L, B = DT$A[ix + 1L] - 1L, C = "other")))[order(A)]
output:
A B C
1: 1 2 Background
2: 3 19 Background
3: 20 25 other
4: 26 41 person
5: 42 69 person
6: 70 82 other
7: 83 97 Background
8: 98 106 other
9: 107 129 Background
10: 130 131 other
11: 132 179 Background
12: 180 188 other
13: 189 235 Background
14: 236 242 other
15: 243 258 Background
16: 259 260 other
17: 261 279 person
data:
library(data.table)
DT <- fread("A B C
1 2 Background
3 19 Background
26 41 person
42 69 person
83 97 Background
107 129 Background
132 179 Background
189 235 Background
243 258 Background
261 279 person")

How can I delete "a lot" of rows from a dataframe in R

I tried all the similar posts but none of the answers seemed to work for me. I want to delete 8500+ rows (by row name only) from a data frame with 27,000+. The other columns are completely different, but the smaller dataset was derived from the larger one, and just looking at the names shows me that whatever I look for in the smaller df is present in the larger df. I could of course do this manually (busy work for sure!), but it seems like there should be a simple computational answer.
I have tried:
fordel<-df2[1,]
df3<-df1[!rownames(df1) %in% fordel
l1<- as.vector(df2[1,])
df3<- df1[1-c(l1),]
and lots of other crazy ideas!
Here is a smallish example: df1:
Ent_gene_id clone57_RNA clone43_RNA_2 clone67_RNA clone55_RNA
ENSMUSG00000000001.4 10634 6954 6835 6510
ENSMUSG00000000003.15 0 0 0 0
ENSMUSG00000000028.14 559 1570 807 1171
ENSMUSG00000000031.15 5748 174 4103 146
ENSMUSG00000000037.16 37 194 49 96
ENSMUSG00000000049.11 0 3 1 0
ENSMUSG00000000056.7 1157 1125 806 947
ENSMUSG00000000058.6 75 304 123 169
ENSMUSG00000000078.6 4012 4391 5637 3854
ENSMUSG00000000085.16 381 560 482 368
ENSMUSG00000000088.6 2667 4777 3483 3450
ENSMUSG00000000093.6 3 48 41 22
ENSMUSG00000000094.12 23 201 102 192
df2
structure(list(base_mean = c(7962.408875, 947.1240794, 43.76698418),
               log2foldchange = c(-0.363434063, -0.137403759, -0.236463207),
               lfcSE = c(0.096816743, 0.059823215, 0.404929452),
               stat = c(-3.753834854, -2.296830066, -0.583961493)),
          row.names = c("ENSMUSG00000000001.4", "ENSMUSG00000000056.7", "ENSMUSG00000000093.6"),
          class = "data.frame")
I want to delete from df1 the rows corresponding to the rownames in df2.
Tried to format it, but seems no longer formatted... oh well....
Suggestions really appreciated!
You mentioned row names but your data does not include that, so I'll assume that they really don't matter (or exist). Also, your df2 has more column headers than columns, not sure what's going on there ... so I'll ignore it.
Data
df1 <- structure(list(Ent_gene_id = c("ENSMUSG00000000001.4", "ENSMUSG00000000003.15",
"ENSMUSG00000000028.14", "ENSMUSG00000000031.15", "ENSMUSG00000000037.16",
"ENSMUSG00000000049.11", "ENSMUSG00000000056.7", "ENSMUSG00000000058.6",
"ENSMUSG00000000078.6", "ENSMUSG00000000085.16", "ENSMUSG00000000088.6",
"ENSMUSG00000000093.6", "ENSMUSG00000000094.12"), clone57_RNA = c(10634L,
0L, 559L, 5748L, 37L, 0L, 1157L, 75L, 4012L, 381L, 2667L, 3L,
23L), clone43_RNA_2 = c(6954L, 0L, 1570L, 174L, 194L, 3L, 1125L,
304L, 4391L, 560L, 4777L, 48L, 201L), clone67_RNA = c(6835L,
0L, 807L, 4103L, 49L, 1L, 806L, 123L, 5637L, 482L, 3483L, 41L,
102L), clone55_RNA = c(6510L, 0L, 1171L, 146L, 96L, 0L, 947L,
169L, 3854L, 368L, 3450L, 22L, 192L)), class = "data.frame", row.names = c(NA,
-13L))
df2 <- structure(list(Ent_gene_id = c("ENSMUSG00000000001.4", "ENSMUSG00000000056.7",
"ENSMUSG00000000093.6"), base_mean = c(7962.408875, 947.1240794,
43.76698418), log2foldchange = c(-0.36343406, -0.137403759, -0.236463207
), pvalue = c(0.00017415, 0.021628466, 0.55924622)), class = "data.frame", row.names = c(NA,
-3L))
Base
df1[!df1$Ent_gene_id %in% df2$Ent_gene_id,]
# Ent_gene_id clone57_RNA clone43_RNA_2 clone67_RNA clone55_RNA
# 2 ENSMUSG00000000003.15 0 0 0 0
# 3 ENSMUSG00000000028.14 559 1570 807 1171
# 4 ENSMUSG00000000031.15 5748 174 4103 146
# 5 ENSMUSG00000000037.16 37 194 49 96
# 6 ENSMUSG00000000049.11 0 3 1 0
# 8 ENSMUSG00000000058.6 75 304 123 169
# 9 ENSMUSG00000000078.6 4012 4391 5637 3854
# 10 ENSMUSG00000000085.16 381 560 482 368
# 11 ENSMUSG00000000088.6 2667 4777 3483 3450
# 13 ENSMUSG00000000094.12 23 201 102 192
dplyr
dplyr::anti_join(df1, df2, by = "Ent_gene_id")
# Ent_gene_id clone57_RNA clone43_RNA_2 clone67_RNA clone55_RNA
# 1 ENSMUSG00000000003.15 0 0 0 0
# 2 ENSMUSG00000000028.14 559 1570 807 1171
# 3 ENSMUSG00000000031.15 5748 174 4103 146
# 4 ENSMUSG00000000037.16 37 194 49 96
# 5 ENSMUSG00000000049.11 0 3 1 0
# 6 ENSMUSG00000000058.6 75 304 123 169
# 7 ENSMUSG00000000078.6 4012 4391 5637 3854
# 8 ENSMUSG00000000085.16 381 560 482 368
# 9 ENSMUSG00000000088.6 2667 4777 3483 3450
# 10 ENSMUSG00000000094.12 23 201 102 192
Edit: same thing but with row names:
# update my df1 to change Ent_gene_id from a column to rownames
rownames(df1) <- df1$Ent_gene_id
df1$Ent_gene_id <- NULL
# use your updated df2 (from dput)
# df2 <- structure(...)
df1[ !rownames(df1) %in% rownames(df2), ]
# clone57_RNA clone43_RNA_2 clone67_RNA clone55_RNA
# ENSMUSG00000000003.15 0 0 0 0
# ENSMUSG00000000028.14 559 1570 807 1171
# ENSMUSG00000000031.15 5748 174 4103 146
# ENSMUSG00000000037.16 37 194 49 96
# ENSMUSG00000000049.11 0 3 1 0
# ENSMUSG00000000058.6 75 304 123 169
# ENSMUSG00000000078.6 4012 4391 5637 3854
# ENSMUSG00000000085.16 381 560 482 368
# ENSMUSG00000000088.6 2667 4777 3483 3450
# ENSMUSG00000000094.12 23 201 102 192
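If you prefer set operations, the same row-name subset can be written with setdiff (a sketch, assuming the row names are unique):
keep <- setdiff(rownames(df1), rownames(df2))
df1[keep, ]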

R delete first and last x % of rows

I have a data frame with 3 ID variables, then several values for each ID.
user Log Pass Value
2 2 123 342
2 2 123 543
2 2 123 231
2 2 124 257
2 2 124 342
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 342
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
The start and end of each set of values is sometimes noisy, and I want to be able to delete the first few values. Unfortunately the number of values varies significantly, but it is always the first and last 20% of values that are noisy.
I want to delete the first and last 20% of rows, with a minimum of 1 row deleted at each end.
So for instance if there are 20 values for user 2, log 2, pass 123, I want to delete the first and last 4 rows. If there are only 3 values for the ID variables I want to delete the first and last row.
The resulting dataset would be:
user Log Pass Value
2 2 123 543
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
4 3 125 257
4 3 125 543
4 3 125 231
I've tried fiddling around with nrow but I struggle to figure out how to reference the % of rows by id variable.
Thanks.
Jonathan.
I believe the following can do it.
DATA.
dat <-
structure(list(user = c(2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), Log = c(2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L), Pass = c(123L, 123L, 123L, 124L, 124L, 125L, 125L,
125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L, 125L,
125L, 125L, 125L), Value = c(342L, 543L, 231L, 257L, 342L, 543L,
231L, 257L, 342L, 543L, 231L, 257L, 543L, 231L, 257L, 543L, 231L,
257L, 543L, 231L, 257L)), .Names = c("user", "Log", "Pass", "Value"
), class = "data.frame", row.names = c(NA, -21L))
CODE.
fun <- function(x, p = 0.20){
  n <- nrow(x)
  m <- max(1, round(n * p))
  inx <- c(seq_len(m), n - seq_len(m) + 1)
  x[-inx, ]
}
result <- do.call(rbind, lapply(split(dat, dat$user), fun))
row.names(result) <- NULL
result
# user Log Pass Value
#1 2 2 123 543
#2 2 2 123 231
#3 2 2 124 257
#4 4 3 125 342
#5 4 3 125 543
#6 4 3 125 231
#7 4 3 125 257
#8 4 3 125 543
#9 4 3 125 231
#10 4 3 125 257
#11 4 3 125 543
#12 4 3 125 231
#13 4 3 125 257
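If the trimming should happen per (user, Log, Pass) combination, as the question describes, the same fun can be applied to a finer split; a minimal sketch, assuming those three columns define the groups:
groups <- interaction(dat$user, dat$Log, dat$Pass, drop = TRUE)
result2 <- do.call(rbind, lapply(split(dat, groups), fun))
row.names(result2) <- NULL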
Would something like this help?
For a dataframe df:
df[-c(1:floor(nrow(df)*0.2), (1+ceiling(nrow(df)*0.8)):nrow(df)),]
This just removes the first and last 20%, taking the floor and ceiling of the cutoffs so that for smaller data frames you keep some of the information:
> df<-data.frame(a=1:100)
> df[-c(1:floor(nrow(df)*0.2),(1+ceiling(nrow(df)*0.8)):nrow(df)),]
[1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[31] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
> df<-data.frame(1:3)
> df[-c(1:floor(nrow(df)*0.2),(1+ceiling(nrow(df)*0.8)):nrow(df)),]
[1] 2
You can do this with dplyr...
library(dplyr)
df2 <- dat %>% group_by(user, Log, Pass) %>%
  filter(n() > 2) %>% # remove groups with just two elements or fewer
  slice(max(2, 1 + ceiling(n() * 0.2)):min(n() - 1, floor(0.8 * n())))
df2
user Log Pass Value
1 2 2 123 543
2 4 3 125 543
3 4 3 125 231
4 4 3 125 257
5 4 3 125 543
6 4 3 125 231
7 4 3 125 257
8 4 3 125 543
9 4 3 125 231
Calculate the offset for what you want to retain:
rem <- ceiling(nrow(dat) * 0.2) + 1
Then take out the records you don't want:
dat <- dat[rem:(nrow(dat) - rem), ]
Here is an idea using base R that returns the row indices of each user to keep and then subsets on these indices.
idx <- unlist(lapply(split(seq_along(dat[["user"]]), dat[["user"]]), function(x) {
  tmp <- max(1, ceiling(.2 * length(x)))
  tail(head(x, -tmp), -tmp)
}), use.names = FALSE)
split(seq_along(dat[["user"]]), dat[["user"]]) returns a list of the row indices for each user. lapply loops through these, calculating the number of rows to drop from each end with max(1, ceiling(.2 * length(x))) and then dropping them with tail(head(x, -tmp), -tmp). Since lapply returns a named list, the result is unlisted and the names are dropped.
This returns
idx
2 3 4 10 11 12 13 14 15 16 17
Now subset
dat[idx,]
user Log Pass Value
2 2 2 123 543
3 2 2 123 231
4 2 2 124 257
10 4 3 125 543
11 4 3 125 231
12 4 3 125 257
13 4 3 125 543
14 4 3 125 231
15 4 3 125 257
16 4 3 125 543
17 4 3 125 231
