New dataframe with last 6 rows per group in R - r
I have a dataframe with several groups and a different number of observations per group. I would like to create a new dataframe with no more than n observations per group. Specifically, for the groups that have a larger number I would like to select the n last observations. An example data set:
timea <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,21,22,23,24,25,26,27,28,29,30,5,6,7,8,9,10,25,26,27)
groupa <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4)
vara <- c(7,7,8,10,9,2.5,7,8,9,1,3,4,8,9,10,2.5,3,9,8,3,5,8,1,7,9,10,2,6,4,3.5,9,8,6)
test1 <- data.frame(timea,groupa,vara)
I would like a new dataframe with no more than 6 observations per group (groupa), selecting the last 6 per group. I was trying to find a dplyr solution, maybe using the lag function, but I am not sure how to account for the groups that have fewer than 6 observations.
The expected output would be:
timea <- c(9,10,11,12,13,14,25,26,27,28,29,30,5,6,7,8,9,10,25,26,27)
groupa <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4)
vara <- c(9,1,3,4,8,9,8,3,5,8,1,7,9,10,2,6,4,3.5,9,8,6)
output <- data.frame(timea,groupa,vara)
Any ideas would be really appreciated.
You can use the slice_tail() function in dplyr to get the last n rows from each group. If a group has fewer than 6 rows, it returns all the rows for that group.
library(dplyr)
test1 %>% group_by(groupa) %>% slice_tail(n = 6) %>% ungroup()
# A tibble: 21 x 3
# timea groupa vara
# <dbl> <dbl> <dbl>
# 1 9 1 9
# 2 10 1 1
# 3 11 1 3
# 4 12 1 4
# 5 13 1 8
# 6 14 1 9
# 7 25 2 8
# 8 26 2 3
# 9 27 2 5
#10 28 2 8
# … with 11 more rows
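For completeness, here is a base-R sketch of the same "last 6 per group" idea, assuming the test1 frame above: by() splits the data frame by group, tail() keeps at most the last 6 rows of each piece, and rbind stitches the pieces back together.

# last 6 rows per group without dplyr (row names get concatenated group labels)
do.call(rbind, by(test1, test1$groupa, tail, n = 6))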
We could use data.table methods:
- Convert the 'data.frame' to 'data.table' (setDT)
- Grouped by 'groupa', get the row index (.I) of the last 6 rows
- Extract the index and subset the data
library(data.table)
setDT(test1)[test1[, .I[tail(seq_len(.N), 6)], groupa]$V1]
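If the row index is not needed elsewhere, a shorter data.table form takes the tail of .SD (the subset of data for each group) directly; note this puts groupa as the first column of the result:

# last 6 rows of each group's sub-table
setDT(test1)[, tail(.SD, 6), by = groupa]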
Related
Use replicate to create new variable
I have the following code:

Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)

This results in the following output:

ID observations
 1            1
 1            3
 1            4
 1            5
 1            6
 1            8

However, I also want a variable 'times' to indicate the running count of measurement occasions for each individual. But since every ID has a different length, I am not sure how to implement this. Does anybody know how to include that? I want it to look like this:

ID observations times
 1            1     1
 1            3     2
 1            4     3
 1            5     4
 1            6     5
 1            8     6
Using dplyr you could group by ID and use the row number for times:

library(dplyr)
dat |>
  group_by(ID) |>
  mutate(times = row_number()) |>
  ungroup()

With base R we could create the sequence based on the run lengths of the ID variable:

dat$times <- sequence(rle(dat$ID)$lengths)

Output:

# A tibble: 734 × 3
      ID observations times
   <int>        <dbl> <int>
 1     1            1     1
 2     1            3     2
 3     1            9     3
 4     2            1     1
 5     2            5     2
 6     2            6     3
 7     2            8     4
 8     3            1     1
 9     3            2     2
10     3            5     3
# … with 724 more rows

Data (using a seed):

set.seed(1)
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
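Another base-R sketch, assuming the same dat frame: ave() applies seq_along within each ID, which yields the identical per-group counter without computing run lengths.

# per-ID running counter via ave()
dat$times <- ave(dat$ID, dat$ID, FUN = seq_along)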
Get the sum of all duplicate rows for each column without hard coding the column name? [duplicate]
This question already has answers here:
Group by multiple columns and sum other multiple columns (7 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean) (10 answers)
Closed 1 year ago.

Hi, suppose I have a table with many columns (in the thousands) and some rows that are duplicates. What I would like to do is sum the duplicated rows, for every column. I'm stuck because I don't want to hard-code or loop through each column and re-merge. Is there a better way to do this? Here is an example with only 3 columns for simplicity.

dat <- read.table(text='name etc4 etc1 etc2
A 9 0 3
A 10 10 4
A 11 9 4
B 2 7 5
C 40 6 0
C 50 6 1', header=TRUE)

# I could aggregate one column at a time,
# but is there a way to do it for every column without hard coding the names?
aggregate(etc4 ~ name, data = dat, sum)
We can specify ., which signifies all the columns other than 'name', in aggregate:

aggregate(. ~ name, data = dat, sum)
  name etc4 etc1 etc2
1    A   30   19   11
2    B    2    7    5
3    C   90   12    1

Or, if we need finer control, i.e. there are other columns we want to leave out, either subset the data with select or use cbind:

aggregate(cbind(etc1, etc2, etc4) ~ name, data = dat, sum)
  name etc1 etc2 etc4
1    A   19   11   30
2    B    7    5    2
3    C   12    1   90

If we need to store the names and reuse them, subset the data and then convert to a matrix:

cname <- c("etc4", "etc1")
aggregate(as.matrix(dat[cname]) ~ name, data = dat, sum)
  name etc4 etc1
1    A   30   19
2    B    2    7
3    C   90   12

Or this may also be done in a faster way with fsum:

library(collapse)
fsum(get_vars(dat, is.numeric), g = dat$name)
  etc4 etc1 etc2
A   30   19   11
B    2    7    5
C   90   12    1
A tidyverse approach:

dat %>%
  group_by(name) %>%
  summarise(across(.cols = starts_with("etc"), .fns = sum))

# A tibble: 3 x 4
  name   etc4  etc1  etc2
  <chr> <int> <int> <int>
1 A        30    19    11
2 B         2     7     5
3 C        90    12     1
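As a further sketch, data.table can do the same column-wise sum without naming any columns, since .SD holds every non-grouping column (assuming dat as defined in the question):

library(data.table)
# sum every non-grouping column within each name
as.data.table(dat)[, lapply(.SD, sum), by = name]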
Subsetting datatable based on consecutive difference of dates
Preamble: The main problem is how to subset a data.table based on IDs, forming subsets within an ID based on consecutive time differences. A hint regarding this would be most welcome.

The complete question/setup: I have a dataset dt in data.table format that looks like

        date id val1 val2
    %d.%m.%Y
1 01.01.2000  1    5   10
2 09.01.2000  1    4    9
3 01.08.2000  1    3    8
4 01.01.2000  2    2    7
5 01.01.2000  3    1    6
6 14.01.2000  3    7    5
7 28.01.2000  3    8    4
8 01.06.2000  3    9    3

I want to combine observations (grouped by id) which are not more than two weeks apart, consecutively from observation to observation. By combining I mean that for each subset I
- keep the value of the last observation of val1,
- replace val2 of the last observation with the sum of all values of val2 of the group,
- add a counter for how many observations came together in this group.

I.e., I want to end up with a dataset like this

        date id val1 val2 counter
    %d.%m.%Y
2 09.01.2000  1    4   19       2
3 01.08.2000  1    3    8       1
4 01.01.2000  2    2    7       1
7 28.01.2000  3    8   15       3
8 01.06.2000  3    9    3       1

Still, I am trying to wrap my head around data.table functions, particularly .SD, and want to solve the issue with these tools. So far I know
- that I can indicate what I mean by first and last using setkey(dt, date),
- that I can replace the last val2 of a subset with the sum: dt[, val2 := replace(val2, .N, sum(val2[-.N], na.rm = TRUE)), by = id],
- that I get the length of a subset with .N,
- how to delete rows,
- that I can calculate the difference between two dates with difftime(strptime(dt$date[1], format = "%d.%m.%Y"), strptime(dt$date[2], format = "%d.%m.%Y"), units = "weeks").

However, I can't get my head around how to subset the observations such that each subset contains only observations of the same id with consecutive dates at most two weeks apart. Any help is appreciated. Many thanks in advance.
The trick is to use cumsum() on a condition. In this case, the condition is a gap of more than 14 days: whenever it is true, the cumulative sum increments and a new group starts.

df %>%
  mutate(rownumber = row_number()) %>%
  group_by(id) %>%
  mutate(interval = as.numeric(as.Date(date, format = "%d.%m.%Y") - as.Date(lag(date), format = "%d.%m.%Y"))) %>%
  mutate(interval = ifelse(is.na(interval), 0, interval)) %>%
  mutate(group = cumsum(interval > 14) + 1) %>%
  ungroup() %>%
  group_by(id, group) %>%
  summarise(
    rownumber = last(rownumber),
    date = last(date),
    val1 = last(val1),
    val2 = sum(val2),
    counter = n()
  ) %>%
  select(rownumber, date, id, val1, val2, counter)

Output:

  rownumber date          id  val1  val2 counter
      <int> <chr>      <int> <int> <int>   <int>
1         2 09.01.2000     1     4    19       2
2         3 01.08.2000     1     3     8       1
3         4 01.01.2000     2     2     7       1
4         7 28.01.2000     3     8    15       3
5         8 01.06.2000     3     9     3       1
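Since the asker wanted data.table tools, here is a sketch of the same cumulative-sum trick in data.table, assuming df holds the question's columns (date as %d.%m.%Y text, id, val1, val2); the names dt, d, and grp are scratch variables of this sketch:

library(data.table)
dt <- as.data.table(df)
dt[, d := as.Date(date, format = "%d.%m.%Y")]
setorder(dt, id, d)
# start a new group whenever the gap to the previous observation exceeds 14 days
dt[, grp := cumsum(c(0, as.numeric(diff(d))) > 14) + 1, by = id]
dt[, .(date = last(date), val1 = last(val1), val2 = sum(val2), counter = .N),
   by = .(id, grp)]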
How do I subset patient data based on number of readings for a particular variable for each patient?
I keep trying to find an answer, but haven't had much luck. I'll add a sample of some similar data. What I'm trying to do here is exclude patient 1 and patient 4 from my subset, as they only have one reading for "Mobility Score". So far, I've been unable to work out a way of counting the number of readings under each variable for each patient. If a patient has only one or zero readings, I'd like to exclude them from the subset. This is an imgur link to the sample data. I can't upload the real data, but it's similar to this.
This can be done with dplyr and group_by. For more information see ?group_by and ?summarize.

# Create random data
dta <- data.frame(patient = rep(c(1, 2), 4), MobiScor = runif(8, 0, 20))
dta$MobiScor[sample(1:8, 3)] <- NA

# Count all available Mobility scores per patient and keep the original format
library(dplyr)
dta %>% group_by(patient) %>% mutate(count = sum(!is.na(MobiScor)))

# Merge and create a pivot table
dta %>% group_by(patient) %>% summarize(count = sum(!is.na(MobiScor)))

Example data:

  patient  MobiScor
1       1 19.203898
2       2 13.684209
3       1 17.581468
4       2        NA
5       1        NA
6       2        NA
7       1  7.794959
8       2        NA

Result 1 (mutate):

  patient MobiScor count
    <dbl>    <dbl> <int>
1       1    19.2      3
2       2    13.7      1
3       1    17.6      3
4       2    NA        1
5       1    NA        3
6       2    NA        1
7       1     7.79     3
8       2    NA        1

Result 2 (summarize):

  patient count
    <dbl> <int>
1       1     3
2       2     1
You can count the number of non-NA values in each group and then filter based on that.

In base R:

subset(df, ave(!is.na(Mobility_score), patient, FUN = sum) > 1)

Using dplyr:

library(dplyr)
df %>% group_by(patient) %>% filter(sum(!is.na(Mobility_score)) > 1)

And with data.table:

library(data.table)
setDT(df)[, .SD[sum(!is.na(Mobility_score)) > 1], patient]
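Since the question's sample data exists only as an image, here is a hypothetical frame (column names assumed) to sanity-check the snippets above:

# hypothetical data: patients 1 and 4 have at most one non-NA reading
df <- data.frame(
  patient = c(1, 1, 2, 2, 2, 3, 3, 4),
  Mobility_score = c(5, NA, 6, 7, 8, 4, 9, NA)
)
subset(df, ave(!is.na(Mobility_score), patient, FUN = sum) > 1)
# keeps only patients 2 and 3, who each have more than one reading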
R use diff() for groups of rows in dataframe
Here is a simple example:

> df <- data.frame(sn=rep(c("a","b"), 3), t=c(10,10,20,20,25,25), r=c(7,8,10,15,11,17))
> df
  sn  t  r
1  a 10  7
2  b 10  8
3  a 20 10
4  b 20 15
5  a 25 11
6  b 25 17

The expected result is

  sn  t r
1  a 20 3
2  a 25 1
3  b 20 7
4  b 25 2

I want to group by a specific column ("sn"), leave some columns unchanged ("t" for this example), and apply diff() to the remaining columns ("r" for this example). I explored the "dplyr" package to try something like

df %>% group_by(sn) %>% do( ... diff(r) ... )

but couldn't figure out the correct code. Can anyone recommend a clean way to get the expected result?
You can do it like this (I don't use diff() directly because it returns n-1 values):

library(dplyr)
df %>%
  arrange(sn) %>%
  group_by(sn) %>%
  mutate(r = r - lag(r)) %>%
  slice(2:n())
####      sn     t     r
####  <fctr> <dbl> <dbl>
#### 1     a    20     3
#### 2     a    25     1
#### 3     b    20     7
#### 4     b    25     2

The slice() call is there to remove the NA rows created by the differencing at the beginning of each group. One could also use na.omit instead, but it could remove other rows unintentionally.
We can also use data.table. Convert the 'data.frame' to 'data.table' (setDT(df)) and set the key to 'sn' (this orders the table by 'sn'); grouped by 'sn', take the difference between 'r' and its lag (shift in data.table does that) and remove the NA rows with na.omit.

library(data.table)
na.omit(setDT(df, key = "sn")[, r := r - shift(r), sn])
#   sn  t r
#1:  a 20 3
#2:  a 25 1
#3:  b 20 7
#4:  b 25 2

Or, if we use diff, we must keep the lengths equal, as the diff output is one element shorter than the original vector. So we pad with NA and later remove those rows with filter:

library(dplyr)
df %>%
  arrange(sn) %>%
  group_by(sn) %>%
  mutate(r = c(NA, diff(r))) %>%
  filter(!is.na(r))
#      sn     t     r
#  <fctr> <dbl> <dbl>
#1      a    20     3
#2      a    25     1
#3      b    20     7
#4      b    25     2
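In the same spirit, a base-R sketch with ave() pads each group's diff with a leading NA (keeping lengths equal) and then drops those rows; note the result keeps the original interleaved row order rather than sorting by sn:

# per-group diff in place, then drop the leading-NA rows
df$r <- ave(df$r, df$sn, FUN = function(x) c(NA, diff(x)))
na.omit(df)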