Select top x % of values per group - retain row ID - r

I am trying to identify the top 15% of scores for each watershed but retain the polygon ID when I print the results.
# here's a small example dataset (called "data"):
polygon watershed score
1 1 61
2 1 81
3 1 16
4 2 18
5 2 12
6 3 78
7 3 81
8 3 20
9 3 97
10 3 95
# I obtain the top 15% using this method:
top15 <- (data %>% select(watershed, score) %>%
group_by(watershed) %>%
arrange(watershed, desc(score)) %>%
filter(score > quantile(score, 0.15)))
# results look like this:
<int> <int>
1 1 81
2 1 61
3 2 18
4 3 97
5 3 95
6 3 81
7 3 78
How can I include the column "polygon" when I print the results?
Thanks so much for the help!

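Your pipeline selects only watershed and score, which drops polygon. Remove the select() call and the column is retained. The arrange() step does not affect the filter, so I removed it as well. Note that score > quantile(score, 0.15) keeps every score above the 15th percentile, i.e. roughly the top 85% of each group; for the top 15%, compare against quantile(score, 0.85) instead. The code below keeps your original threshold: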
library(dplyr)
mdat <- structure(list(polygon = 1:10,
watershed = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
score = c(61L, 81L, 16L, 18L, 12L, 78L, 81L, 20L, 97L, 95L)),
class = "data.frame", row.names = c(NA, -10L))
mdat %>%
group_by(watershed) %>%
filter(score > quantile(score, 0.15))
# # A tibble: 7 x 3
# # Groups: watershed [3]
# polygon watershed score
# <int> <int> <int>
# 1 1 1 61
# 2 2 1 81
# 3 4 2 18
# 4 6 3 78
# 5 7 3 81
# 6 9 3 97
# 7 10 3 95
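If the goal really is the top 15% per watershed, the threshold would be the 85th percentile rather than the 15th. A minimal sketch, assuming the same example data and R's default quantile type; with groups this small, only the highest score in each watershed clears the cutoff:

```r
library(dplyr)

# Same example data as above, rebuilt so the snippet is self-contained
mdat <- data.frame(
  polygon   = 1:10,
  watershed = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
  score     = c(61L, 81L, 16L, 18L, 12L, 78L, 81L, 20L, 97L, 95L)
)

# Keep rows at or above the 85th percentile of score within each watershed
top15 <- mdat %>%
  group_by(watershed) %>%
  filter(score >= quantile(score, 0.85)) %>%
  ungroup()

top15
```

Because polygon is never dropped, it is still present in the result.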

Related

How to use R to replace missing values with the sum of previous 4 values in a column?

I have a dataframe that contains (among other things) three columns that have missing values every 5 rows. These missing values need to be replaced with the sum of the previous 4 values in their respective column.
For example, let's say my dataframe looked like this:
id category1 category2 category3
123 5 10 10
123 6 11 15
123 6 12 23
123 4 10 6
123 NA NA NA
567 24 17 15
Those NAs need to represent a "total" based on the sum of the previous 4 values in their column, and this needs to repeat throughout the entire dataframe because the NAs occur every 5 rows. For instance, the three NAs in the mock example above should be replaced with 21, 43, and 54. 5 rows later, the same process will need to be repeated. How can I achieve this?
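Another possible solution, assuming each id contains exactly four values followed by one NA row, so that the group sum equals the sum of the previous 4 values:
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(everything(), ~ if_else(is.na(.x), sum(.x, na.rm = TRUE), .x))) %>%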
ungroup
#> # A tibble: 6 × 4
#> id category1 category2 category3
#> <int> <int> <int> <int>
#> 1 123 5 10 10
#> 2 123 6 11 15
#> 3 123 6 12 23
#> 4 123 4 10 6
#> 5 123 21 43 54
#> 6 567 24 17 15
The following should work if there are no NA values within the first 4 rows, assuming the NA values appear in all three columns at the same rows.
for(i in seq_len(nrow(data))){
if(is.na(data[i, 2])){
# sum the 4 rows directly above the NA row (i-4 to i-1)
data[i, 2] <- sum(data[seq(i-4, i-1), 2])
data[i, 3] <- sum(data[seq(i-4, i-1), 3])
data[i, 4] <- sum(data[seq(i-4, i-1), 4])
}
}
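The same idea can be wrapped in a small function and checked on two consecutive blocks. This is only a sketch: fill_totals, its arguments, and the example data frame are hypothetical names made up for illustration, and it assumes the NA rows occur in all listed columns at once and that at least four rows precede the first NA.

```r
# Hypothetical helper: fill each all-NA "total" row with the sum of the
# `window` rows directly above it, for the given columns.
fill_totals <- function(data, cols, window = 4L) {
  # NA rows are located using the first listed column
  na_rows <- which(is.na(data[[cols[1]]]))
  for (i in na_rows) {
    prev <- seq(i - window, i - 1L)
    for (col in cols) {
      data[i, col] <- sum(data[prev, col])
    }
  }
  data
}

# Two blocks of four values, each followed by an NA "total" row
example <- data.frame(
  id        = rep(c(123L, 567L), each = 5),
  category1 = c(5, 6, 6, 4, NA, 1, 2, 3, 4, NA),
  category2 = c(10, 11, 12, 10, NA, 2, 2, 2, 2, NA)
)

filled <- fill_totals(example, c("category1", "category2"))
filled
```

The second block is the useful check: its total must not include the total already written into row 5.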
If the NAs appear in the last row for each 'id', we can remove them and use a grouped summarise to append a total row to every group:
library(dplyr)
df1 <- df1 %>%
na.omit %>%
group_by(id) %>%
summarise(across(everything(), ~ c(.x, sum(.x))), .groups = 'drop')
-output
df1
# A tibble: 7 × 4
id category1 category2 category3
<int> <int> <int> <int>
1 123 5 10 10
2 123 6 11 15
3 123 6 12 23
4 123 4 10 6
5 123 21 43 54
6 567 24 17 15
7 567 24 17 15
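In dplyr 1.1.0 and later, returning more than one row per group from summarise() is deprecated in favour of reframe(). A sketch of the same step with reframe(), assuming a dplyr version that provides it:

```r
library(dplyr)

# Same example data as in this answer, rebuilt so the snippet is self-contained
df1 <- data.frame(
  id        = c(123L, 123L, 123L, 123L, 123L, 567L),
  category1 = c(5L, 6L, 6L, 4L, NA, 24L),
  category2 = c(10L, 11L, 12L, 10L, NA, 17L),
  category3 = c(10L, 15L, 23L, 6L, NA, 15L)
)

# Drop the NA rows, then append a per-id total row to every column
with_totals <- df1 %>%
  na.omit() %>%
  group_by(id) %>%
  reframe(across(everything(), ~ c(.x, sum(.x))))

with_totals
```

reframe() always returns an ungrouped result, so no .groups argument is needed.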
Or another approach would be to replace the NA with the sum using na.aggregate from zoo
library(zoo)
df1 %>%
group_by(id) %>%
mutate(across(everything(), ~ na.aggregate(.x, FUN = sum))) %>%
ungroup
# A tibble: 6 × 4
id category1 category2 category3
<int> <int> <int> <int>
1 123 5 10 10
2 123 6 11 15
3 123 6 12 23
4 123 4 10 6
5 123 21 43 54
6 567 24 17 15
data
df1 <- structure(list(id = c(123L, 123L, 123L, 123L, 123L, 567L),
category1 = c(5L,
6L, 6L, 4L, NA, 24L), category2 = c(10L, 11L, 12L, 10L, NA, 17L
), category3 = c(10L, 15L, 23L, 6L, NA, 15L)),
class = "data.frame", row.names = c(NA,
-6L))

Ignore NA values of a column within a statement

Until now I've been working with a medium-sized dataset for an Occupation Survey (around 200 MB total). Here's the data if you want to review it: https://drive.google.com/drive/folders/1Od8zlOE3U3DO0YRGnBadFz804OUDnuQZ?usp=sharing
I have the following code:
hogares<-read.csv("/home/servicio/Escritorio/TR_VIVIENDA01.CSV")
personas<-read.csv("/home/servicio/Escritorio/TR_PERSONA01.CSV")
datos<-merge(hogares,personas)
library(dplyr)
base<-tibble(ID_VIV=datos$ID_VIV, ID_PERSONA=datos$ID_PERSONA, EDAD=datos$EDAD, CONACT=datos$CONACT)
base$maxage <- ave(base$EDAD, base$ID_VIV, FUN=max)
base$Condición_I<-case_when(base$CONACT==32 & base$EDAD>=60 ~ 1,
base$CONACT>=10 & base$EDAD>=60 & base$CONACT<=16 ~ 2,
base$CONACT==20 & base$EDAD>=60 | base$CONACT==31 & base$EDAD>=60 | (base$CONACT>=33 & base$CONACT<=35 & base$EDAD>=60) ~ 3)
base <- subset(base, maxage >= 60)
base<- base %>% group_by(ID_VIV) %>% mutate(Condición_V = if(n_distinct(Condición_I) > 1) 4 else Condición_I)
base$ID_VIV<-as.character(base$ID_VIV)
base$ID_PERSONA<-as.character(base$ID_PERSONA)
base
And ended up with:
# A tibble: 38,307 x 7
# Groups: ID_VIV [10,499]
ID_VIV ID_PERSONA EDAD CONACT maxage Condición_I Condición_V
<chr> <chr> <int> <int> <int> <dbl> <dbl>
1 10010000007 1001000000701 69 32 69 1 1
2 10010000008 1001000000803 83 33 83 3 4
3 10010000008 1001000000802 47 33 83 NA 4
4 10010000008 1001000000801 47 10 83 NA 4
5 10010000012 1001000001204 4 NA 60 NA 4
6 10010000012 1001000001203 2 NA 60 NA 4
7 10010000012 1001000001201 60 10 60 2 4
8 10010000012 1001000001202 21 10 60 NA 4
9 10010000014 1001000001401 67 32 67 1 4
10 10010000014 1001000001402 64 33 67 3 4
The Condición_I column is a code for the labour condition of each individual (row). Some of these individuals share a house, which is why they share ID_VIV. I only care about the individuals who are 60 years old or more; all the NAs are individuals who live with someone aged 60+, and I do not care about their situation, but I need to keep them. I need the column Condición_V to display another value following these conditions:
Condición_I == 1 ~ 1
Condición_I == 2 ~ 2
Condición_I == 3 ~ 3
Any combination of Condición_I ~ 4
This means that if all the individuals aged 60 or more in a house have Condición_I == 1, then Condición_V will be 1, and likewise for codes 2 and 3. When there are, e.g., one person with Condición_I == 1 and another with Condición_I == 3 in the same house, Condición_V will be 4.
And I'm hoping to get this kind of result:
A tibble: 38,307 x 7
# Groups: ID_VIV [10,499]
ID_VIV ID_PERSONA EDAD CONACT maxage Condición_I Condición_V
<chr> <chr> <int> <int> <int> <dbl> <dbl>
1 10010000007 1001000000701 69 32 69 1 1
2 10010000008 1001000000803 83 33 83 3 3
3 10010000008 1001000000802 47 33 83 NA 3
4 10010000008 1001000000801 47 10 83 NA 3
5 10010000012 1001000001204 4 NA 60 NA 2
6 10010000012 1001000001203 2 NA 60 NA 2
7 10010000012 1001000001201 60 10 60 2 2
8 10010000012 1001000001202 21 10 60 NA 2
9 10010000014 1001000001401 67 32 67 1 4
10 10010000014 1001000001402 64 33 67 3 4
I know my error is in this line:
base<- base %>% group_by(ID_VIV) %>% mutate(Condición_V = if(n_distinct(Condición_I) > 1) 4 else Condición_I)
Is there a way to make that line ignore the NA values, or would it be better to do it another way? I do not have to do it the way I am trying; any other approach or help will be much appreciated!
We can wrap the Condición_I column in na.omit() before counting its distinct values with n_distinct(). If more than one distinct non-NA value remains, return 4; otherwise return the first non-NA value of the column:
library(dplyr)
base %>%
group_by(ID_VIV) %>%
mutate(Condición_V = if(n_distinct(na.omit(Condición_I)) > 1)
4 else na.omit(Condición_I)[1])
# A tibble: 10 x 7
# Groups: ID_VIV [4]
# ID_VIV ID_PERSONA EDAD CONACT maxage Condición_I Condición_V
# <chr> <chr> <int> <int> <int> <int> <dbl>
# 1 10010000007 1001000000701 69 32 69 1 1
# 2 10010000008 1001000000803 83 33 83 3 3
# 3 10010000008 1001000000802 47 33 83 NA 3
# 4 10010000008 1001000000801 47 10 83 NA 3
# 5 10010000012 1001000001204 4 NA 60 NA 2
# 6 10010000012 1001000001203 2 NA 60 NA 2
# 7 10010000012 1001000001201 60 10 60 2 2
# 8 10010000012 1001000001202 21 10 60 NA 2
# 9 10010000014 1001000001401 67 32 67 1 4
#10 10010000014 1001000001402 64 33 67 3 4
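The same rule can also be written as a small helper, which makes the all-NA case explicit. This is a sketch only: household_code is a hypothetical name, and the toy data uses ASCII column names (Condicion_I, Condicion_V) and made-up household IDs rather than the real survey columns.

```r
library(dplyr)

# Hypothetical helper: collapse the individual codes of one household.
# NA codes are ignored; mixed codes give 4; a household with no codes gives NA.
household_code <- function(codes) {
  present <- unique(codes[!is.na(codes)])
  if (length(present) == 0) return(NA_real_)
  if (length(present) > 1) return(4)
  as.numeric(present)
}

# Toy data: "a" has one code, "b" has one code plus NAs, "c" mixes two codes
base_small <- tibble(
  ID_VIV      = c("a", "b", "b", "b", "c", "c"),
  Condicion_I = c(1, 3, NA, NA, 1, 3)
)

result <- base_small %>%
  group_by(ID_VIV) %>%
  mutate(Condicion_V = household_code(Condicion_I)) %>%
  ungroup()

result
```

The helper returns a single value per household, which mutate() recycles across all rows of the group.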
data
base <- structure(list(ID_VIV = c("10010000007", "10010000008", "10010000008",
"10010000008", "10010000012", "10010000012", "10010000012", "10010000012",
"10010000014", "10010000014"), ID_PERSONA = c("1001000000701",
"1001000000803", "1001000000802", "1001000000801", "1001000001204",
"1001000001203", "1001000001201", "1001000001202", "1001000001401",
"1001000001402"), EDAD = c(69L, 83L, 47L, 47L, 4L, 2L, 60L, 21L,
67L, 64L), CONACT = c(32L, 33L, 33L, 10L, NA, NA, 10L, 10L, 32L,
33L), maxage = c(69L, 83L, 83L, 83L, 60L, 60L, 60L, 60L, 67L,
67L), Condición_I = c(1L, 3L, NA, NA, NA, NA, 2L, NA, 1L, 3L
)), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9",
"10"), class = "data.frame")

Can I list the unique values for one column while grouping by another column in R?

I have the following columns:
session condition codes
15 anxiety 1
15 depression 1
15 bipolar 1
15 high blood pressure 3
15 panic attacks 1
66 hypertension 5
66 high blood pressure 3
66 anxiety 1
66 panic attacks 1
75 schizophrenia 1
32 muscular dystrophy 4
32 anxiety 1
32 depression 1
32 panic attacks 1
I want to make a new column with just the unique codes per session and then leave the rest of the rows for that session blank. I know this logically doesn't make sense because this third column doesn't really match up with the first. If it needs to be in a new object or list or something that is fine.
session condition codes unique_codes
15 anxiety 1 1
15 depression 1 3
15 bipolar 1
15 high blood pressure 3
15 panic attacks 1
66 hypertension 5 5
66 high blood pressure 3 3
66 anxiety 1 1
66 panic attacks 1
75 schizophrenia 1 1
32 muscular dystrophy 4 4
32 anxiety 1 1
32 depression 1
32 panic attacks 1
I have tried:
conditions=conditions %>%
group_by(session)%>%
mutate(unique_codes=unique(conditions$codes))
However I get an error that says "must be length 5 (the group size) or one, not 4", which I assume is because I want the rest of the rows blank. Does anyone know a way around this? Thank you!!
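The lengths are the issue: mutate() needs a result of length 1 or of the group size. Also refer to codes directly rather than conditions$codes, so that the grouping is respected. We can either paste the unique values together into a single string or create a list column: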
library(dplyr)
conditions %>%
group_by(session)%>%
mutate(unique_codes = toString(unique(codes)))
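For the list-column variant, one possible sketch is to summarise to one row per session and wrap the unique codes in list(). The data below is a shortened, made-up subset of the example, and per_session is a hypothetical name:

```r
library(dplyr)

# Shortened subset of the example data (two sessions only)
conditions <- data.frame(
  session = c(15L, 15L, 15L, 66L, 66L),
  codes   = c(1L, 1L, 3L, 5L, 3L)
)

# One row per session; unique_codes is a list column holding a vector per row
per_session <- conditions %>%
  group_by(session) %>%
  summarise(unique_codes = list(unique(codes)), .groups = "drop")

per_session$unique_codes
```

This keeps the codes as separate values instead of a pasted string, at the cost of a column that does not print as a plain vector.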
Another option is to pad the unique values with NA at the end so that their length matches the group size:
conditions %>%
group_by(session) %>%
mutate(unique_codes = `length<-`(unique(codes), n()))
# A tibble: 14 x 4
# Groups: session [4]
# session condition codes unique_codes
# <int> <chr> <int> <int>
# 1 15 anxiety 1 1
# 2 15 depression 1 3
# 3 15 bipolar 1 NA
# 4 15 high blood pressure 3 NA
# 5 15 panic attacks 1 NA
# 6 66 hypertension 5 5
# 7 66 high blood pressure 3 3
# 8 66 anxiety 1 1
# 9 66 panic attacks 1 NA
#10 75 schizophrenia 1 1
#11 32 muscular dystrophy 4 4
#12 32 anxiety 1 1
#13 32 depression 1 NA
#14 32 panic attacks 1 NA
The OP mentioned that n() was not working (possibly a dplyr version issue). In that case, length(codes) should work:
conditions %>%
group_by(session) %>%
mutate(unique_codes = `length<-`(unique(codes), length(codes)))
data
conditions <- structure(list(session = c(15L, 15L, 15L, 15L, 15L, 66L, 66L,
66L, 66L, 75L, 32L, 32L, 32L, 32L), condition = c("anxiety",
"depression", "bipolar", "high blood pressure", "panic attacks",
"hypertension", "high blood pressure", "anxiety", "panic attacks",
"schizophrenia", "muscular dystrophy", "anxiety", "depression",
"panic attacks"), codes = c(1L, 1L, 1L, 3L, 1L, 5L, 3L, 1L, 1L,
1L, 4L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-14L))
Another dplyr option could be (here df is the conditions data frame):
df %>%
group_by(session) %>%
distinct(codes) %>%
transmute(unique_codes = codes,
rowid = 1:n()) %>%
right_join(df %>%
group_by(session) %>%
mutate(rowid = 1:n())) %>%
ungroup() %>%
select(-rowid)
session unique_codes condition codes
<int> <int> <chr> <int>
1 15 1 anxiety 1
2 15 3 depression 1
3 15 NA bipolar 1
4 15 NA high blood pressure 3
5 15 NA panic attacks 1
6 66 5 hypertension 5
7 66 3 high blood pressure 3
8 66 1 anxiety 1
9 66 NA panic attacks 1
10 75 1 schizophrenia 1
11 32 4 muscular dystrophy 4
12 32 1 anxiety 1
13 32 NA depression 1
14 32 NA panic attacks 1

How to subtract one row from multiple rows by group, for data set with multiple columns in R?

I would like to learn how to subtract one row from multiple rows by group, and save the results as a data table/matrix in R. For example, take the following data frame:
data.frame("patient" = c("a","a","a", "b","b","b","c","c","c"), "Time" = c(1,2,3), "Measure 1" = sample(1:100,size = 9,replace = TRUE), "Measure 2" = sample(1:100,size = 9,replace = TRUE), "Measure 3" = sample(1:100,size = 9,replace = TRUE))
patient Time Measure.1 Measure.2 Measure.3
1 a 1 19 5 75
2 a 2 64 20 74
3 a 3 40 4 78
4 b 1 80 91 80
5 b 2 48 31 73
6 b 3 10 5 4
7 c 1 30 67 55
8 c 2 24 13 90
9 c 3 45 31 88
For each patient, I would like to subtract the row where Time == 1 from all rows associated with that patient. The result would be:
patient Time Measure.1 Measure.2 Measure.3
1 a 1 0 0 0
2 a 2 45 15 -1
3 a 3 21 -1 3
4 b 1 0 0 0
5 b 2 -32 -60 -7
6 b 3 -70 -86 -76
7 c 1 0 0 0
....
I have tried the following code using the dplyr package, but to no avail:
raw_patient<- group_by(rawdata,patient, Time)
baseline_patient <-mutate(raw_patient,cpls = raw_patient[,]- raw_patient["Time" == 0,])
As there are multiple columns, we can use mutate_at: select the columns in vars(), group by 'patient', and subtract from each column the value it takes at Time == 1 within that patient. (funs() is deprecated in recent dplyr versions, so a formula is used here.)
library(dplyr)
df1 %>%
group_by(patient) %>%
mutate_at(vars(matches("Measure")), ~ . - .[Time == 1])
# A tibble: 9 × 5
# Groups: patient [3]
# patient Time Measure.1 Measure.2 Measure.3
# <chr> <int> <int> <int> <int>
#1 a 1 0 0 0
#2 a 2 45 15 -1
#3 a 3 21 -1 3
#4 b 1 0 0 0
#5 b 2 -32 -60 -7
#6 b 3 -70 -86 -76
#7 c 1 0 0 0
#8 c 2 -6 -54 35
#9 c 3 15 -36 33
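In dplyr 1.0.0 and later, the scoped verbs such as mutate_at() are superseded by across(). A sketch of the same baseline subtraction in that style, assuming each patient has exactly one row with Time == 1:

```r
library(dplyr)

# Same example data as in this answer, rebuilt so the snippet is self-contained
df1 <- data.frame(
  patient   = rep(c("a", "b", "c"), each = 3),
  Time      = rep(1:3, times = 3),
  Measure.1 = c(19L, 64L, 40L, 80L, 48L, 10L, 30L, 24L, 45L),
  Measure.2 = c(5L, 20L, 4L, 91L, 31L, 5L, 67L, 13L, 31L),
  Measure.3 = c(75L, 74L, 78L, 80L, 73L, 4L, 55L, 90L, 88L)
)

# Subtract each patient's Time == 1 value from every Measure column
baseline_adjusted <- df1 %>%
  group_by(patient) %>%
  mutate(across(starts_with("Measure"), ~ .x - .x[Time == 1])) %>%
  ungroup()

baseline_adjusted
```

If a patient had no Time == 1 row, or more than one, the subtraction would fail or recycle unexpectedly, so that assumption is worth checking first.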
data
df1 <- structure(list(patient = c("a", "a", "a", "b", "b", "b", "c",
"c", "c"), Time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), Measure.1 = c(19L,
64L, 40L, 80L, 48L, 10L, 30L, 24L, 45L), Measure.2 = c(5L, 20L,
4L, 91L, 31L, 5L, 67L, 13L, 31L), Measure.3 = c(75L, 74L, 78L,
80L, 73L, 4L, 55L, 90L, 88L)), .Names = c("patient", "Time",
"Measure.1", "Measure.2", "Measure.3"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))

Writing code for calculating Cmax and Tmax of Concentration_Time data

I have a concentration-time data of many individuals. I want to find out the Cmax (maximum concentration) and Tmax (the time at Cmax) for each individual. I want to retain the results in R by adding a new "Cmax" and "Tmax" columns to the original dataset.
The data frame looks like this:
#df <-
ID TIME CONC
1 0 0
1 1 10
1 2 15
1 5 12
2 1 5
2 2 10
2 5 20
2 6 10
And so on. I started with something to find Cmax for an individual, but it's not getting me anywhere. Any help in fixing the code, or an easier way of finding both Cmax and Tmax, would be much appreciated!
Cmax=function(df) {
n = length(df$CONC)
c_temp=0 # this is a temporary counter
c_max=0
for(i in 2:n){
if(df$CONC[i] > df$CONC[i-1]){
c_temp= c_temp+1
if(c_temp > c_max) c_max=c_temp # check
}
}
return(c_max)
}
Try
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Cmax= max(CONC), Tmax=TIME[which.max(CONC)])
# ID TIME CONC Cmax Tmax
#1 1 0 0 15 2
#2 1 1 10 15 2
#3 1 2 15 15 2
#4 1 5 12 15 2
#5 2 1 5 20 5
#6 2 2 10 20 5
#7 2 5 20 20 5
#8 2 6 10 20 5
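If a one-row-per-individual summary is wanted instead of repeated columns, slice_max() (dplyr 1.0.0 or later) can pick the row with the highest concentration in each group. A sketch, where pk_summary is a hypothetical name and with_ties = FALSE is an assumption that a single row per ID is wanted when the maximum is reached more than once:

```r
library(dplyr)

# Same example data as in this answer, rebuilt so the snippet is self-contained
df <- data.frame(
  ID   = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  TIME = c(0L, 1L, 2L, 5L, 1L, 2L, 5L, 6L),
  CONC = c(0L, 10L, 15L, 12L, 5L, 10L, 20L, 10L)
)

# Keep the row with the maximum CONC per ID, then rename to Cmax / Tmax
pk_summary <- df %>%
  group_by(ID) %>%
  slice_max(CONC, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(ID, Cmax = CONC, Tmax = TIME)

pk_summary
```

The summary can be joined back to the original data by ID if both forms are needed.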
Or using data.table
library(data.table)
setDT(df)[, c("Cmax", "Tmax") := list(max(CONC),
TIME[which.max(CONC)]), by=ID]
Or using split from base R
unsplit(lapply(split(df, df$ID), function(x)
within(x, {Cmax <- max(CONC)
Tmax <- TIME[which.max(CONC)] })),
df$ID)
# ID TIME CONC Tmax Cmax
#1 1 0 0 2 15
#2 1 1 10 2 15
#3 1 2 15 2 15
#4 1 5 12 2 15
#5 2 1 5 5 20
#6 2 2 10 5 20
#7 2 5 20 5 20
#8 2 6 10 5 20
data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), TIME = c(0L,
1L, 2L, 5L, 1L, 2L, 5L, 6L), CONC = c(0L, 10L, 15L, 12L, 5L,
10L, 20L, 10L)), .Names = c("ID", "TIME", "CONC"), class = "data.frame",
row.names = c(NA, -8L))
