Here is a simple example:
> df <- data.frame(sn=rep(c("a","b"), 3), t=c(10,10,20,20,25,25), r=c(7,8,10,15,11,17))
> df
  sn  t  r
1  a 10  7
2  b 10  8
3  a 20 10
4  b 20 15
5  a 25 11
6  b 25 17
The expected result is:
  sn  t r
1  a 20 3
2  a 25 1
3  b 20 7
4  b 25 2
I want to group by a specific column ("sn"), leave some columns unchanged ("t" in this example), and apply diff() to the remaining columns ("r" in this example).
I explored the "dplyr" package and tried something like:
df1 %>% group_by(sn) %>% do( ... diff(r)...)
but couldn't figure out the correct code.
Can anyone recommend a clean way to get the expected result?
You can do it like this (I don't use diff directly because it returns n-1 values):
library(dplyr)
df %>% arrange(sn) %>% group_by(sn) %>% mutate(r = r-lag(r)) %>% slice(2:n())
####       sn     t     r
####   <fctr> <dbl> <dbl>
#### 1      a    20     3
#### 2      a    25     1
#### 3      b    20     7
#### 4      b    25     2
The slice function is here to remove the NA rows created at the beginning of each group by the differencing. One could also use na.omit instead, but that could also remove other rows unintentionally.
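For completeness, here is a sketch of that na.omit variant (assuming no other column contains NA values that should be kept):
library(dplyr)
# Same idea, but dropping the NA rows produced by lag() with na.omit();
# note this would also drop rows where any other column happens to be NA
df %>% arrange(sn) %>% group_by(sn) %>% mutate(r = r - lag(r)) %>% na.omit()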
We can also use data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), set the key as 'sn' (this orders it by 'sn'), then, grouped by 'sn', get the difference of 'r' from its lag (shift in data.table does that) and remove the NA rows with na.omit.
library(data.table)
na.omit(setDT(df, key = "sn")[, r := r-shift(r) , sn])
#   sn  t r
#1:  a 20 3
#2:  a 25 1
#3:  b 20 7
#4:  b 25 2
Or, if we are using diff, make sure the lengths match, as the diff output is one element shorter than the original vector. We can pad with NA and later remove those rows with filter.
library(dplyr)
df %>%
  arrange(sn) %>%
  group_by(sn) %>%
  mutate(r = c(NA, diff(r))) %>%
  filter(!is.na(r))
#      sn     t     r
#  <fctr> <dbl> <dbl>
#1      a    20     3
#2      a    25     1
#3      b    20     7
#4      b    25     2
I have a dataframe with several groups and a different number of observations per group. I would like to create a new dataframe with no more than n observations per group. Specifically, for the groups that have a larger number I would like to select the n last observations. An example data set:
timea <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,21,22,23,24,25,26,27,28,29,30,5,6,7,8,9,10,25,26,27)
groupa <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4)
vara <-c(7,7,8,10,9,2.5,7,8,9,1,3,4,8,9,10,2.5,3,9,8,3,5,8,1,7,9,10,2,6,4,3.5,9,8,6)
test1 <- data.frame(timea,groupa,vara)
I would like a new dataframe with no more than 6 observations per group (groupa), by selecting the last 6 per group. I was trying to find a dplyr solution, maybe using the lag function but I am not sure how to account for the ones that have less than 6 observations.
The expected output would be:
timea <- c(9,10,11,12,13,14,25,26,27,28,29,30,5,6,7,8,9,10, 25, 26,27)
groupa <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4)
vara <-c(9,1,3,4,8,9,8,3,5,8,1,7,9,10,2,6,4,3.5,9,8,6)
output <- data.frame(timea,groupa,vara)
Any ideas would be really appreciated.
You can use the slice_tail function in dplyr to get the last n rows from each group. If the number of rows in a group is less than 6, it will return all the rows for that group.
library(dplyr)
test1 %>% group_by(groupa) %>% slice_tail(n = 6) %>% ungroup
# A tibble: 21 x 3
#    timea groupa  vara
#    <dbl>  <dbl> <dbl>
# 1     9      1     9
# 2    10      1     1
# 3    11      1     3
# 4    12      1     4
# 5    13      1     8
# 6    14      1     9
# 7    25      2     8
# 8    26      2     3
# 9    27      2     5
#10    28      2     8
# … with 11 more rows
We could use data.table methods:
Convert the 'data.frame' to 'data.table' (setDT)
Grouped by 'groupa', get the rowindex (.I) of the last 6 rows
Extract the index and subset the data
library(data.table)
setDT(test1)[test1[, .I[tail(seq_len(.N), 6)], groupa]$V1]
I keep trying to find an answer, but haven't had much luck. I'll add a sample of some similar data.
What I'm trying to do here is exclude patient 1 and patient 4 from my subset, as they only have one reading for "Mobility Score". So far, I've been unable to work out a way of counting the number of readings under each variable for each patient. If a patient has only one or zero readings, I'd like to exclude them from the subset.
This is an imgur link to the sample data. I can't upload the real data, but it's similar to this.
This can be done with dplyr and group_by. For more information see ?group_by and ?summarize
# Create random data
dta <- data.frame(patient = rep(c(1,2),4), MobiScor = runif(8, 0,20))
dta$MobiScor[sample(1:8,3)] <- NA
# Count all available Mobility scores per patient and keep the original format
library(dplyr)
dta %>% group_by(patient) %>% mutate(count = sum(!is.na(MobiScor)))
# Merge and create pivot table
dta %>% group_by(patient) %>% summarize(count = sum(!is.na(MobiScor)))
Example data
  patient  MobiScor
1       1 19.203898
2       2 13.684209
3       1 17.581468
4       2        NA
5       1        NA
6       2        NA
7       1  7.794959
8       2        NA
Result 1 (mutate):
  patient MobiScor count
    <dbl>    <dbl> <int>
1       1     19.2     3
2       2     13.7     1
3       1     17.6     3
4       2       NA     1
5       1       NA     3
6       2       NA     1
7       1     7.79     3
8       2       NA     1
Result 2 (summarize):
  patient count
    <dbl> <int>
1       1     3
2       2     1
You can count the number of non-NA values in each group and then filter based on that.
This can be done in base R:
subset(df, ave(!is.na(Mobility_score), patient, FUN = sum) > 1)
Using dplyr
library(dplyr)
df %>% group_by(patient) %>% filter(sum(!is.na(Mobility_score)) > 1)
and data.table
library(data.table)
setDT(df)[, .SD[sum(!is.na(Mobility_score)) > 1], patient]
I have the following data frame
ID <- c(1,1,2,3,4,5,6)
Value1 <- c(20,50,30,10,15,10,NA)
Value2 <- c(40,33,84,NA,20,1,NA)
Value3 <- c(60,40,60,10,25,NA,NA)
Grade1 <- c(20,50,30,10,15,10,NA)
Grade2 <- c(40,33,84,NA,20,1,NA)
DF <- data.frame(ID,Value1,Value2,Value3,Grade1,Grade2)
  ID Value1 Value2 Value3 Grade1 Grade2
1  1     20     40     60     20     40
2  1     50     33     40     50     33
3  2     30     84     60     30     84
4  3     10     NA     10     10     NA
5  4     15     20     25     15     20
6  5     10      1     NA     10      1
7  6     NA     NA     NA     NA     NA
I would like to group by ID, select the columns whose names contain the string "Value", and get the mean of these columns with NA values excluded.
Here is an example of the desired output
ID mean(Value)
1 41
2 58
3 10
....
In my attempt to solve this challenge, I wrote the following code
library(tidyverse)
DF %>% group_by (ID) %>% select(contains("Value")) %>% summarise(mean(.,na.rm = TRUE))
The code groups the data by ID, selects the columns whose names contain "Value", and attempts to summarise the selected columns using the mean function. When I run my code, I get the following output
> DF %>% group_by (ID) %>% select(contains("Value")) %>% summarise(mean(.))
Adding missing grouping variables: `ID`
# A tibble: 6 x 2
     ID `mean(.)`
  <dbl>     <dbl>
1     1        NA
2     2        NA
3     3        NA
4     4        NA
5     5        NA
6     6        NA
I would appreciate your help with this.
You should try using pivot_longer to get your data from wide to long form. Read the latest tidyr update on pivot_longer and pivot_wider (https://tidyr.tidyverse.org/articles/pivot.html).
library(tidyverse)
ID <- c(1,2,3,4,5,6)
Value1 <- c(50,30,10,15,10,NA)
Value2 <- c(33,84,NA,20,1,NA)
Value3 <- c(40,60,10,25,NA,NA)
DF <- data.frame(ID,Value1,Value2,Value3)
DF %>%
  pivot_longer(-ID) %>%
  group_by(ID) %>%
  summarise(mean = mean(value, na.rm = TRUE))
Output:
     ID  mean
  <dbl> <dbl>
1     1  41
2     2  58
3     3  10
4     4  20
5     5   5.5
6     6 NaN
Without using dplyr or any other specific package, this would help:
DF$mean <- rowMeans(DF[, c(2:4)], na.rm = TRUE)
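If you also need the per-ID mean from the expected output without extra packages, here is a rough base-R sketch (assuming the DF defined in the question):
# Stack the Value columns into one vector and take the mean per ID, ignoring NA;
# an ID whose values are all NA (ID 6 here) comes out as NaN, as in the dplyr answer
vals <- DF[, grep("^Value", names(DF))]
tapply(unlist(vals), rep(DF$ID, ncol(vals)), mean, na.rm = TRUE)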
I'm working with grouping and medians. I'd like to group a data.frame and get, for each group, the median of certain rows (not all of them) and the last value.
My data are something like this:
test <- data.frame(
id = c('A','A','A','A','A','B','B','B','B','B','C','C','C','C'),
value = c(1,2,3,4,5,3,4,5,1,8,3,4,2,9))
> test
   id value
1   A     1
2   A     2
3   A     3
4   A     4
5   A     5
6   B     3
7   B     4
8   B     5
9   B     1
10  B     8
11  C     3
12  C     4
13  C     2
14  C     9
For each id, I need the median of the three (number may vary, in this case three) central rows, then the last value.
I've tried first of all with only one id.
test_a <- test[which(test$id == 'A'),]
> test_a
  id value
1  A     1
2  A     2
3  A     3
4  A     4
5  A     5
The desired output for this subset can be obtained with the following:
median(test_a[(nrow(test_a)-3):(nrow(test_a)-1),]$value) # median of three central values
tail(test_a,1)$value # last value
I used this:
library(tidyverse)
test_a %>% group_by(id) %>%
summarise(m = median(test_a[(nrow(test_a)-3):(nrow(test_a)-1),]$value),
last = tail(test_a,1)$value) %>%
data.frame()
  id m last
1  A 3    5
But when I tried to generalize to all id:
test %>% group_by(id) %>%
summarise(m = median(test[(nrow(test)-3):(nrow(test)-1),]$value),
last = tail(test,1)$value) %>%
data.frame()
  id m last
1  A 3    9
2  B 3    9
3  C 3    9
I think the formulas use the full dataset to calculate the last value and the median, but I cannot work out how to make it work. Thanks in advance.
This works:
test %>%
  group_by(id) %>%
  summarise(m = median(value[(length(value)-3):(length(value)-1)]),
            last = value[length(value)])
# A tibble: 3 x 3
      id     m  last
  <fctr> <dbl> <dbl>
1      A     3     5
2      B     4     8
3      C     4     9
You just refer to the variable value instead of the whole dataset within summarise.
Edit: Here's a generalized version.
test %>%
  group_by(id) %>%
  summarise(m = ifelse(length(value) == 1, value,
                       ifelse(length(value) == 2, median(value),
                              median(value[(ceiling(length(value)/2)-1):(ceiling(length(value)/2)+1)]))),
            last = value[length(value)])
If a group has only one row, the value itself will be stored in m. If it has only two rows, the median of these two rows will be stored in m. If it has three or more rows, the middle three rows will be chosen dynamically and the median of those will be stored in m.
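For example, a quick check of those edge cases on a made-up data frame (hypothetical groups D with one row and E with two rows):
library(dplyr)
test_small <- data.frame(id = c('D','E','E'), value = c(7, 2, 10))
test_small %>%
  group_by(id) %>%
  summarise(m = ifelse(length(value) == 1, value,
                       ifelse(length(value) == 2, median(value),
                              median(value[(ceiling(length(value)/2)-1):(ceiling(length(value)/2)+1)]))),
            last = value[length(value)])
# D: m = 7 (the single value), last = 7
# E: m = 6 (median of 2 and 10), last = 10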
I would like to count the unique values in column x after splitting it by '|' and keeping only the left 2 characters of each piece, in R.
df <-data.frame(id = c(11,22,33,44),
x = c(NA,'cna|cnb|jpa|usa|jpb|usb','kra|krb|kru|usb|usa','jpa|jpu|epa|epb|usa|woa|cna|jpu'))
> df
  id                               x
1 11                            <NA>
2 22         cna|cnb|jpa|usa|jpb|usb
3 33             kra|krb|kru|usb|usa
4 44 jpa|jpu|epa|epb|usa|woa|cna|jpu
I want to get the result below.
  id count
1 11     0
2 22     3
3 33     2
4 44     5
Row 1 is 0 (only NA).
Row 2 has cn, jp, us (3 values).
Row 3 has kr, us (2 values).
Row 4 has jp, ep, us, wo, cn (5 values).
Here is another approach. It is not as compact and simple as akrun's answer, but it doesn't depend on any libraries:
df$count <- sapply(df$x, function(varx){
  # split on "|", keep the first two characters of each piece, and de-duplicate
  strs <- unique(sapply(unlist(strsplit(varx, "|", fixed = TRUE)), function(string){
    substr(string, 1, 2)
  }))
  # drop the NA produced by the NA row before counting
  length(strs[!is.na(strs)])
})
Output:
  id                               x count
1 11                            <NA>     0
2 22         cna|cnb|jpa|usa|jpb|usb     3
3 33             kra|krb|kru|usb|usa     2
4 44 jpa|jpu|epa|epb|usa|woa|cna|jpu     5
We can use the tidyverse. Split the elements of 'x' and expand to long format with separate_rows, mutate 'x' to keep only the first two characters (substr), and then, grouped by 'id', count the unique non-NA elements with n_distinct.
library(tidyverse)
df %>%
  separate_rows(x) %>%
  mutate(x = substr(x, 1, 2)) %>%
  group_by(id) %>%
  summarise(count = n_distinct(x[!is.na(x)]))
# A tibble: 4 x 2
#      id count
#   <dbl> <int>
#1     11     0
#2     22     3
#3     33     2
#4     44     5