Subset a dataframe on multiple conditions in R

My dataset has several variables. I want to build a subset based on multiple conditions and create new variables from the rows that match.
dat1
S1 S2 H1 H2 Month1 Year1 Month2 Year2
16 17 81 70 09 2017 07 2017
17 16 80 70 08 2017 08 2016
14 16 81 81 09 2016 05 2016
18 15 70 81 07 2016 09 2017
17 16 80 80 08 2016 05 2016
18 18 81 70 05 2017 04 2016
I want to subset such that if S1 is 16, 17, or 18 and H1 is 80 or 81, then I create new variables Hist = H1, Date = paste(Month1, Year1), and Sip = S1.
The same goes for the S2, H2 set.
My output should be as follows (the first 4 rows come from the S1, H1, Month1, Year1 set and the last 2 rows come from the S2, H2, Month2, Year2 set):
Hist Sip Date
81 16 09-2017
80 17 08-2017
80 17 08-2016
81 18 05-2017
81 16 05-2016
80 16 05-2016
My code:
datnew <- dat1 %>%
mutate(Date=ifelse((S1==16|S1==17|S1==18)&(H1==80|H1==81),paste(Month1,Year1,sep="-"),
ifelse((S2==16|S2==17|S2==18)&(H2==80|H2==81),paste(Month2,Year2,sep="-"),NA)),
hist=ifelse((S1==16|S1==17|S1==18)&(H1==80|H1==81),H1,
ifelse((S2==16|S2==17|S2==18)&(H2==80|H2==81),H2,NA)),
sip=ifelse((S1==16|S1==17|S1==18)&(H1==80|H1==81),S1,
ifelse((S2==16|S2==17|S2==18)&(H2==80|H2==81),S2,NA)))
In the original data I have 10 sets of such columns, i.e. S1-S10, H1-H10, Month1-Month10, and so on, and for each variable I have many more values to match.
With this approach the code just keeps growing. Is there a better way to do this?
Thanks in advance.

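Here is a tidyverse solution. Split the data into two data frames, one per set of columns, strip the numeric suffix from the column names, and bind the rows together. The code assumes dat1 also carries a patientId column (apparently the row number), which the sample data above does not show.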
library(tidyverse)
bind_rows(
dat1 %>% select(patientId, ends_with("1")) %>% rename_all(str_remove, "1"),
dat1 %>% select(patientId, ends_with("2")) %>% rename_all(str_remove, "2")
) %>%
transmute(
patientId,
Hist = H,
Sip = S,
date = paste0(Month, "-", Year)
) %>%
filter(
Sip %in% 16:18,
Hist %in% 80:81
)
#> # A tibble: 6 x 4
#> patientId Hist Sip date
#> <int> <dbl> <dbl> <chr>
#> 1 1 81 16 09-2017
#> 2 2 80 17 08-2017
#> 3 5 80 17 08-2016
#> 4 6 81 18 05-2017
#> 5 3 81 16 05-2016
#> 6 5 80 16 05-2016
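With 10 sets of columns, writing one select() per set gets repetitive. A minimal sketch of an alternative, assuming every column is named as a letter prefix followed by the set number (S1, H1, Month1, Year1, ...): pivot_longer() with the special ".value" name stacks all sets in one step. The patientId column (row number) and the set column are helpers introduced here for illustration, not part of the original data.

```r
library(dplyr)
library(tidyr)

# Sample data from the question; months kept as strings to preserve leading zeros
dat1 <- tibble::tribble(
  ~S1, ~S2, ~H1, ~H2, ~Month1, ~Year1, ~Month2, ~Year2,
  16, 17, 81, 70, "09", 2017, "07", 2017,
  17, 16, 80, 70, "08", 2017, "08", 2016,
  14, 16, 81, 81, "09", 2016, "05", 2016,
  18, 15, 70, 81, "07", 2016, "09", 2017,
  17, 16, 80, 80, "08", 2016, "05", 2016,
  18, 18, 81, 70, "05", 2017, "04", 2016
)

result <- dat1 %>%
  mutate(patientId = row_number()) %>%      # hypothetical row identifier
  pivot_longer(
    cols = -patientId,
    names_to = c(".value", "set"),          # prefix becomes the column, digits the set
    names_pattern = "^([A-Za-z]+)(\\d+)$"
  ) %>%
  mutate(set = as.integer(set)) %>%         # so that set 10 sorts after set 2
  filter(S %in% 16:18, H %in% 80:81) %>%
  arrange(set, patientId) %>%
  transmute(patientId, Hist = H, Sip = S, Date = paste(Month, Year, sep = "-"))

result
```

The allowed values live in one place per variable (16:18 and 80:81 here), so longer condition lists only change the filter() call.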

Related

How best to do row operations in R

Below is the sample data
year <- c (2016,2017,2018,2019,2020,2021,2016,2017,2018,2019,2020,2021,2016,2017,2018,2019,2020,2021,2016,2017,2018,2019,2020,2021)
indcode <- c(71,71,71,71,71,71,72,72,72,72,72,72,44,44,44,44,44,44,45,45,45,45,45,45)
avgemp <- c(44,46,48,50,55,56,10,11,12,13,14,15,21,22,22,23,25,25,61,62,62,63,69,77)
ownership <-c(50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50)
test3 <- data.frame (year,indcode,avgemp,ownership)
The desired result sums avgemp for two specific indcode combinations (71+72 and 44+45) and produces one additional row per year for each combination. The items in parentheses (below) are only there to illustrate which numbers get added. The primary source of my confusion is how to select, and then add, certain indcode combinations. My initial thought was to pivot wider, add the columns, and then pivot longer again, but I am hoping for something a bit less convoluted.
year indcode avgemp ownership
2016 71+72 54 (44+10) 50
2016 71 44 50
2016 72 10 50
2017 71+72 57 (46+11) 50
2018 71+72 60 (48+12) 50
2019 71+72 63 (50+13) 50
2020 71+72 69 (55+14) 50
2021 71+72 71 (56+15) 50
I know that it would start something like this:
test3 <- test3 %>% group_by(indcode) %>% mutate("71+72" = (something that filters out 71 and 72))
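One option is to group by year and by the tens digit of indcode (integer division by 10), so that 71 and 72 fall into one group and 44 and 45 into another:
library(dplyr)
test3 %>%
group_by(year, gr = indcode %/% 10) %>%
summarise(indcode = paste(unique(indcode), collapse = '+'),
avgemp = sum(avgemp), ownership = ownership[1], .groups = 'drop') %>%
select(-gr) %>%
arrange(indcode)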
# A tibble: 12 x 4
year indcode avgemp ownership
<dbl> <chr> <dbl> <dbl>
1 2016 44+45 82 50
2 2017 44+45 84 50
3 2018 44+45 84 50
4 2019 44+45 86 50
5 2020 44+45 94 50
6 2021 44+45 102 50
7 2016 71+72 54 50
8 2017 71+72 57 50
9 2018 71+72 60 50
10 2019 71+72 63 50
11 2020 71+72 69 50
12 2021 71+72 71 50
Using data.table: convert the data.frame to a data.table with setDT, group by 'year', 'ownership', and an 'indcode' label created with fcase (or ifelse), and take the sum of 'avgemp' as the summarised output.
library(data.table)
setDT(test3)[, .(avgemp = sum(avgemp)), .(year, ownership,
indcode = fcase(indcode %in% c(71, 72), '71+72', default = '44+45'))]
Output
year ownership indcode avgemp
<num> <num> <char> <num>
1: 2016 50 71+72 54
2: 2017 50 71+72 57
3: 2018 50 71+72 60
4: 2019 50 71+72 63
5: 2020 50 71+72 69
6: 2021 50 71+72 71
7: 2016 50 44+45 82
8: 2017 50 44+45 84
9: 2018 50 44+45 84
10: 2019 50 44+45 86
11: 2020 50 44+45 94
12: 2021 50 44+45 102
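Both answers return only the summed rows, and both hard-code which codes belong together. A minimal sketch of a variant, assuming the combinations are listed in a small lookup table (the combos data frame and its combo column are hypothetical names introduced here): join the lookup, sum per year, then append the totals to the original rows as the question asked.

```r
library(dplyr)

# Sample data from the question
test3 <- data.frame(
  year = rep(2016:2021, times = 4),
  indcode = rep(c(71, 72, 44, 45), each = 6),
  avgemp = c(44, 46, 48, 50, 55, 56, 10, 11, 12, 13, 14, 15,
             21, 22, 22, 23, 25, 25, 61, 62, 62, 63, 69, 77),
  ownership = 50
)

# Hypothetical lookup table: which indcode belongs to which combination
combos <- data.frame(
  indcode = c(71, 72, 44, 45),
  combo = c("71+72", "71+72", "44+45", "44+45")
)

# One summed row per year and combination
totals <- test3 %>%
  inner_join(combos, by = "indcode") %>%
  group_by(year, combo, ownership) %>%
  summarise(avgemp = sum(avgemp), .groups = "drop") %>%
  rename(indcode = combo)

# Original rows plus the summed rows; indcode must be character to hold "71+72"
combined <- test3 %>%
  mutate(indcode = as.character(indcode)) %>%
  bind_rows(totals) %>%
  arrange(year, indcode)

head(combined)
```

Codes missing from the lookup table are dropped by inner_join() from the totals but still appear in combined as original rows.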

How to add two data frames together in R?

I have a data frame delineated by ownership: private (50) and state (30). I am looking to create 5 new rows that are the sum of the ownership 50 and ownership 30 rows wherever they have a matching area value. The desired result is shown below the sample data.
naics <- c(611,611,611,611,611,611,611,611,611,611)
ownership <- c(50,50,50,50,50,30,30,30,30,10)
area <- c(001,003,005,009,011,001,003,005,011,001)
d200201 <- c(14,17,20,23,26,3,5,7,9,100)
d200202 <- c(15,18,21,24,28,9,11,13,15,105)
private <- data.frame(naics,ownership,area,d200201,d200202)
naics ownership area d200201 d200202
611 50 001 17 24
611 50 003 22 29
611 50 005 27 34
611 50 009 23 24 (no sum because no 30 value)
611 50 011 35 43
Is this what you are looking for?
library(dplyr)
private %>%
group_by(naics, area) %>%
summarize(
across(c(d200201, d200202), ~sum(.x[ownership %in% c(30, 50)])),
ownership = 50, .groups = "drop"
)
Output
# A tibble: 5 x 5
naics area d200201 d200202 ownership
<dbl> <dbl> <dbl> <dbl> <dbl>
1 611 1 17 24 50
2 611 3 22 29 50
3 611 5 27 34 50
4 611 9 23 24 50
5 611 11 35 43 50
library(tidyverse)
private %>%
filter(ownership %in% c(50, 30)) %>%
group_by(area) %>%
summarize(across(starts_with("d200"), sum))
#> # A tibble: 5 × 3
#> area d200201 d200202
#> <dbl> <dbl> <dbl>
#> 1 1 17 24
#> 2 3 22 29
#> 3 5 27 34
#> 4 9 23 24
#> 5 11 35 43
Created on 2022-01-08 by the reprex package (v2.0.1)

Subtract columns when column name is a year

I'm certain the answer to this is simple, but I can't figure it out.
I have a pivot table where the column headings are years.
# A tibble: 3 x 5
country 2012 2013 2014 2015
<chr> <dbl> <dbl> <dbl> <dbl>
USA 45 23 12 42
Canada 67 98 14 25
Mexico 89 104 78 3
I want to create a new column that calculates the difference between two other columns. Rather than recognizing the years as column headings, however, R takes the difference between the two numbers.
Below is a sample of my code. If I put " " around the years, I get an error: "x non-numeric argument to binary operator". Without " ", R creates a new column with the value -3, simply subtracting 2015 from 2012.
df %>%
pivot_wider(names_from = year, values_from = value) %>%
mutate(diff = 2012 - 2015)
How do I re-write this to get the following table:
# A tibble: 3 x 6
country 2012 2013 2014 2015 diff
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
USA 45 23 12 42 3
Canada 67 98 14 25 42
Mexico 89 104 78 3 86
You may try
df %>%
pivot_wider(names_from = year, values_from = value) %>%
mutate(diff = .$'2012' - .$'2015')
with your data,
df <- read.table(text = "country 2012 2013 2014 2015
USA 45 23 12 42
Canada 67 98 14 25
Mexico 89 104 78 3
", header = TRUE)
names(df) <- c("country", 2012, 2013, 2014, 2015 )
df %>%
mutate(diff = .$'2012' - .$'2015')
country 2012 2013 2014 2015 diff
1 USA 45 23 12 42 3
2 Canada 67 98 14 25 42
3 Mexico 89 104 78 3 86
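Column names that are not syntactically valid R names, such as names starting with a digit, can also be referenced by wrapping them in backticks, which avoids the `.$` prefix. A minimal sketch using the same sample data, built directly with check.names = FALSE so the year names are kept as-is:

```r
library(dplyr)

# Same sample data; check.names = FALSE keeps "2012" etc. as column names
df <- data.frame(
  country = c("USA", "Canada", "Mexico"),
  `2012` = c(45, 67, 89),
  `2013` = c(23, 98, 104),
  `2014` = c(12, 14, 78),
  `2015` = c(42, 25, 3),
  check.names = FALSE
)

# Backticks make R treat 2012 and 2015 as column names, not numbers
result <- df %>%
  mutate(diff = `2012` - `2015`)

result
```

Quotes do not work here because "2012" is a character string, which is why subtracting two quoted years raises the non-numeric argument error.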

Calculate area under the curve for time series data

I want to calculate the area under the curve for the time points for each id and column. Any suggestions? Which R packages to use? Many thanks!
id <- rep(1:3,each=5)
time <- rep(c(10,20,30,40,50),3)
q1 <- sample(100,15, replace=T)
q2 <- sample(100,15, replace=T)
q3 <- sample(100,15, replace=T)
df <- data.frame(id,time,q1,q2,q3)
df
id time q1 q2 q3
1 10 38 55 38
1 20 46 29 88
1 30 16 28 97
1 40 37 20 81
1 50 59 27 42
2 10 82 81 54
2 20 45 3 23
2 30 82 67 59
2 40 27 3 42
2 50 45 71 45
3 10 39 8 29
3 20 12 6 90
3 30 92 11 7
3 40 52 8 37
3 50 81 57 80
Wanted output, something like this:
q1 q2 q3
1 area area area
2 area area area
3 area area area
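One tidyverse approach is to reshape to long format and apply the trapezoid rule by averaging each value with its predecessor. The factor 10 assumes a constant time step of 10, and because the data are re-sampled at random here, the printed areas do not correspond to the table in the question.
library(tidyverse)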
id <- rep(1:3,each=5)
time <- rep(c(10,20,30,40,50),3)
q1 <- sample(100,15, replace=T)
q2 <- sample(100,15, replace=T)
q3 <- sample(100,15, replace=T)
df <- data.frame(id,time,q1,q2,q3)
df %>%
arrange(time) %>%
pivot_longer(cols = c(q1, q2, q3)) -> longer_df
longer_df %>%
ggplot(aes(x = time, y = value, col = factor(id))) +
geom_line() +
geom_point() +
facet_wrap(. ~ name)
longer_df %>%
group_by(id, name) %>%
mutate(lag_value = lag(value),
midpoint_value = (value + lag_value)/2) %>%
summarize(area = 10*sum(midpoint_value, na.rm = T)) %>%
pivot_wider(values_from = area)
#> `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
#> # A tibble: 3 x 4
#> # Groups: id [3]
#> id q1 q2 q3
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1960 1980 2075
#> 2 2 1025 2215 2180
#> 3 3 2105 1590 2110
Created on 2021-06-30 by the reprex package (v2.0.0)
Here I will use the trapz function to calculate the integral.
library(data.table)
library(caTools) # integrate with its trapz function
# data
df <- fread("id time q1 q2 q3
1 10 38 55 38
1 20 46 29 88
1 30 16 28 97
1 40 37 20 81
1 50 59 27 42
2 10 82 81 54
2 20 45 3 23
2 30 82 67 59
2 40 27 3 42
2 50 45 71 45
3 10 39 8 29
3 20 12 6 90
3 30 92 11 7
3 40 52 8 37
3 50 81 57 80")
# calculate the area with `trapz`
df[,lapply(.SD[,2:4], function(y) trapz(time,y)),by=id]
#> id q1 q2 q3
#> 1: 1 1475 1180 3060
#> 2: 2 2175 1490 1735
#> 3: 3 2160 575 1885
Created on 2021-06-30 by the reprex package (v2.0.0)
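The trapezoid rule is short enough to write without an extra package. A minimal sketch, assuming the rows can be sorted by time within each id; trapezoid_area is a helper name introduced here, and it uses diff() on the time values so unevenly spaced time points are handled as well.

```r
library(dplyr)

# Fixed data from the question, so the result is reproducible
df <- data.frame(
  id = rep(1:3, each = 5),
  time = rep(c(10, 20, 30, 40, 50), times = 3),
  q1 = c(38, 46, 16, 37, 59, 82, 45, 82, 27, 45, 39, 12, 92, 52, 81),
  q2 = c(55, 29, 28, 20, 27, 81, 3, 67, 3, 71, 8, 6, 11, 8, 57),
  q3 = c(38, 88, 97, 81, 42, 54, 23, 59, 42, 45, 29, 90, 7, 37, 80)
)

# Trapezoid rule: interval width times the mean of the two neighbouring values
trapezoid_area <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

auc <- df %>%
  arrange(id, time) %>%
  group_by(id) %>%
  summarise(across(q1:q3, ~ trapezoid_area(time, .x)), .groups = "drop")

auc
```

For this data the result matches the trapz output shown above (for example 1475 for id 1, q1).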

Reshaping data frame in r according to the longest row

I have a 189 by 1443 data frame containing heart rate data for 189 days, with one column for every minute of the day:
year month day `00:00` `00:01` `00:02` `00:03` `00:04` `00:05` ...
2018 04 07 NA 63 NA NA 62 NA ...
2018 04 08 57 NA 58 NA NA NA ...
2018 04 09 NA NA NA 52 NA 51 ...
I need to transform this data frame into a 189 by 131 one (131 being the largest number of entries in any one day), basically aligning all entries to the left so that rows with fewer than 131 entries are padded with NAs up to column 131.
The end result would have to look like this:
year month day `1` `2` `3` `4` `5` `6` ... `131`
2018 04 07 63 62 63 64 61 60 ... 59
2018 04 08 57 58 56 55 56 55 ... NA
2018 04 09 52 51 49 50 48 52 ... NA
.
.
.
Could anyone help me with that? Sadly, I don't have a clue where to start.
See if this works for you:
library(tidyverse)
df %>%
gather(minute, value, -c(year:day)) %>%
drop_na() %>%
group_by(year, month, day) %>%
arrange(year, month, day, minute) %>%
mutate(row = row_number()) %>%
select(-minute) %>%
spread(row, value)
# A tibble: 3 x 5
# Groups: year, month, day [3]
year month day `1` `2`
<dbl> <chr> <chr> <dbl> <dbl>
1 2018 04 07 63 62
2 2018 04 08 57 58
3 2018 04 09 52 51
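gather() and spread() still work but are superseded in tidyr by pivot_longer() and pivot_wider(). A minimal sketch of the same idea with the newer functions, using only the six minute columns visible in the question; entry is a helper column name introduced here.

```r
library(dplyr)
library(tidyr)

# The part of the sample data that is visible in the question
df <- tibble::tibble(
  year = 2018,
  month = "04",
  day = c("07", "08", "09"),
  `00:00` = c(NA, 57, NA),
  `00:01` = c(63, NA, NA),
  `00:02` = c(NA, 58, NA),
  `00:03` = c(NA, NA, 52),
  `00:04` = c(62, NA, NA),
  `00:05` = c(NA, NA, 51)
)

result <- df %>%
  # Long format, dropping missing minutes; column order keeps minutes chronological
  pivot_longer(-c(year, month, day), names_to = "minute",
               values_to = "value", values_drop_na = TRUE) %>%
  group_by(year, month, day) %>%
  mutate(entry = row_number()) %>%   # 1, 2, 3, ... within each day
  ungroup() %>%
  select(-minute) %>%
  # Back to wide; days with fewer entries are padded with NA automatically
  pivot_wider(names_from = entry, values_from = value)

result
```

On the full data the number of entry columns equals the largest number of non-missing readings in any day, which the question gives as 131.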
