Below is the sample data
year <- c (2016,2017,2018,2019,2020,2021,2016,2017,2018,2019,2020,2021,2016,2017,2018,2019,2020,2021,2016,2017,2018,2019,2020,2021)
indcode <- c(71,71,71,71,71,71,72,72,72,72,72,72,44,44,44,44,44,44,45,45,45,45,45,45)
avgemp <- c(44,46,48,50,55,56,10,11,12,13,14,15,21,22,22,23,25,25,61,62,62,63,69,77)
ownership <-c(50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50)
test3 <- data.frame (year,indcode,avgemp,ownership)
The desired result sums avgemp for two specific combinations of indcode (71+72 and 44+45) and adds one row per year for each combination. The items in parentheses below are only there to illustrate which numbers get added. My main source of confusion is how to select, and therefore add, specific indcode combinations. My initial thought was to pivot wider, add the columns, and then pivot_longer, but I am hoping for something a bit less convoluted.
year indcode avgemp ownership
2016 71+72 54 (44+10) 50
2016 71 44 50
2016 72 10 50
2017 71+72 57 (46+11) 50
2018 71+72 60 (48+12) 50
2019 71+72 63 (50+13) 50
2020 71+72 69 (55+14) 50
2021 71+72 71 (56+15) 50
I know that it would start something like this
test3 <- test3 %>% group_by (indcode) %>% mutate("71+72" = (something that filters out 71 and 72)
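library(dplyr)
# Group by year and by the tens digit of indcode (71, 72 -> 7; 44, 45 -> 4)
test3 %>%
group_by(year, gr = indcode %/% 10) %>%
summarise(indcode = paste(unique(indcode), collapse = '+'),
avgemp = sum(avgemp), ownership = ownership[1], .groups = 'drop') %>%
select(-gr) %>%
arrange(indcode)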
# A tibble: 12 x 4
year indcode avgemp ownership
<dbl> <chr> <dbl> <dbl>
1 2016 44+45 82 50
2 2017 44+45 84 50
3 2018 44+45 84 50
4 2019 44+45 86 50
5 2020 44+45 94 50
6 2021 44+45 102 50
7 2016 71+72 54 50
8 2017 71+72 57 50
9 2018 71+72 60 50
10 2019 71+72 63 50
11 2020 71+72 69 50
12 2021 71+72 71 50
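The output above contains only the summed rows, while the question asks for them in addition to the original rows. A minimal sketch of one way to get both, assuming an explicit lookup from each indcode to its combined label (the names combo, sums, and result are illustrative, and base R is used here instead of dplyr):

```r
# Sample data from the question
test3 <- data.frame(
  year      = rep(2016:2021, times = 4),
  indcode   = rep(c(71, 72, 44, 45), each = 6),
  avgemp    = c(44, 46, 48, 50, 55, 56, 10, 11, 12, 13, 14, 15,
                21, 22, 22, 23, 25, 25, 61, 62, 62, 63, 69, 77),
  ownership = 50
)

# Assumed lookup: which indcode belongs to which combined label
combo <- c("71" = "71+72", "72" = "71+72", "44" = "44+45", "45" = "44+45")
test3$label <- unname(combo[as.character(test3$indcode)])

# Sum avgemp within each year / ownership / label combination
sums <- aggregate(avgemp ~ year + ownership + label, data = test3, FUN = sum)
sums <- data.frame(year = sums$year,
                   indcode = as.character(sums$label),
                   avgemp = sums$avgemp,
                   ownership = sums$ownership)

# Stack the summed rows on top of the original rows
original <- data.frame(year = test3$year,
                       indcode = as.character(test3$indcode),
                       avgemp = test3$avgemp,
                       ownership = test3$ownership)
result <- rbind(sums, original)
result <- result[order(result$year, result$indcode), ]
```

Because the lookup is explicit, codes that do not share a tens digit can be combined by editing combo alone.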
Using data.table: convert the data.frame to a data.table with setDT, group by 'year', 'ownership', and an 'indcode' label created with fcase (ifelse would also work), and take the sum of 'avgemp' as the summarised output.
library(data.table)
setDT(test3)[, .(avgemp = sum(avgemp)), .(year, ownership,
indcode = fcase(indcode %in% c(71, 72), '71+72', default = '44+45'))]
-output
year ownership indcode avgemp
<num> <num> <char> <num>
1: 2016 50 71+72 54
2: 2017 50 71+72 57
3: 2018 50 71+72 60
4: 2019 50 71+72 63
5: 2020 50 71+72 69
6: 2021 50 71+72 71
7: 2016 50 44+45 82
8: 2017 50 44+45 84
9: 2018 50 44+45 84
10: 2019 50 44+45 86
11: 2020 50 44+45 94
12: 2021 50 44+45 102
I have a problem with the humans here; they're giving me Citizen Science data in spreadsheets formatted to be attractive and legible. I figured out the right sequence of pivots _longer and _wider to get it into an analyzable format but first I had to do a whole bunch of hand edits to make the column labels usable. I've just been given a corrected spreadsheet so now I have to do the same hand edits all over. Can I avoid this?
reprex <- read_csv("reprex.csv", col_names = FALSE)
gives:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 NA NA 2014 NA NA 2015 NA NA 2016 NA
2 NA Total F M Total F M Total F M
3 SiteA 180 92 88 134 40 94 34 20 14
4 SiteB NA NA NA 247 143 104 8 8 0
5 SiteC 237 194 43 220 95 125 62 45 17
I want column labels like "2014 Total", "2014 F", ... like so:
Location `2014 Total` `2014 F` `2014 M` `2015 Total` `2015 F` `2015 M` `2016 Total` `2016 F` `2016 M`
1 SiteA 180 92 88 134 40 94 34 20 14
2 SiteB NA NA NA 247 143 104 8 8 0
3 SiteC 237 194 43 220 95 125 62 45 17
...which would allow me to twist it up until I get to something like:
Location date Total F M
1 SiteA 2014 180 92 88
2 SiteB 2014 NA NA NA
3 SiteC 2014 237 194 43
4 SiteA 2015 134 40 94
5 SiteB 2015 247 143 104
6 SiteC 2015 220 95 125
7 SiteA 2016 34 20 14
8 SiteB 2016 8 8 0
9 SiteC 2016 62 45 17
The part from the second table to the third I've got; the problem is in how to get from the first table to the second. It would seem like you could pivot the first and then fill in the missing dates with fill(.direction="updown") except that the dates are the grouping value you need to be following.
For this example we could do it like this (here df is the data frame read with col_names = FALSE, called reprex in the question):
library(tidyverse)
df_helper <- df %>%
slice(1:2) %>%
pivot_longer(cols= everything()) %>%
fill(value, .direction = "up") %>%
mutate(x = lead(value, 11)) %>%
drop_na() %>%
unite("name", c(value, x), sep = " ", remove = FALSE) %>%
pivot_wider(names_from = name)
df %>%
setNames(names(df_helper)) %>%
rename(Location = x) %>%
slice(-c(1:2))
Location 2014 Total 2014 F 2014 M 2015 Total 2015 F 2015 M 2016 Total 2016 F 2016 M
3 SiteA 180 92 88 134 40 94 34 20 14
4 SiteB <NA> <NA> <NA> 247 143 104 8 8 0
5 SiteC 237 194 43 220 95 125 62 45 17
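The lead(value, 11) step above depends on the exact width of this example. A more general sketch, under the assumption that every year block starts with a "Total" column and contains exactly one year cell, reads the two header rows separately and builds the labels from them. It uses base R only, and the temporary file is a hypothetical stand-in for reprex.csv:

```r
# Hypothetical stand-in for reprex.csv, written to a temporary file
csv_lines <- c(
  ",,2014,,,2015,,,2016,",
  ",Total,F,M,Total,F,M,Total,F,M",
  "SiteA,180,92,88,134,40,94,34,20,14",
  "SiteB,,,,247,143,104,8,8,0",
  "SiteC,237,194,43,220,95,125,62,45,17"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

# Read only the two header rows, as character, dropping the Location column
hdr    <- read.csv(path, header = FALSE, nrows = 2, colClasses = "character")
years  <- unlist(hdr[1, -1], use.names = FALSE)
labels <- unlist(hdr[2, -1], use.names = FALSE)

# Assumption: each "Total" opens a new year block
grp <- cumsum(labels == "Total")

# Within each block, pick the single non-empty year cell
year_by_grp <- tapply(years, grp, function(y) y[!is.na(y) & nzchar(y)][1])
new_names <- c("Location", paste(year_by_grp[as.character(grp)], labels))

# Read the data rows and attach the combined labels
dat <- read.csv(path, header = FALSE, skip = 2)
names(dat) <- new_names
```

Since the labels are computed from the file itself, a corrected spreadsheet can be re-read without hand edits as long as the header layout stays the same.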
I'm certain the answer to this is simple, but I can't figure it out.
I have a pivot table where the column headings are years.
# A tibble: 3 x 5
country 2012 2013 2014 2015
<chr> <dbl> <dbl> <dbl> <dbl>
USA 45 23 12 42
Canada 67 98 14 25
Mexico 89 104 78 3
I want to create a new column that calculates the difference between two other columns. Rather than recognizing the years as column headings, however, R takes the difference between the two numbers.
Below is a sample of my code. If I put " " around the years, I get an error: "x non-numeric argument to binary operator". Without " ", R creates a new column with the value -3, simply subtracting the years.
df %>%
pivot_wider(names_from = year, values_from = value) %>%
mutate(diff = 2012 - 2015)
How do I re-write this to get the following table:
# A tibble: 3 x 6
country 2012 2013 2014 2015 diff
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
USA 45 23 12 42 3
Canada 67 98 14 25 42
Mexico 89 104 78 3 86
You may try
df %>%
pivot_wider(names_from = year, values_from = value) %>%
mutate(diff = .$'2012' - .$'2015')
with your data,
df <- read.table(text = "country 2012 2013 2014 2015
USA 45 23 12 42
Canada 67 98 14 25
Mexico 89 104 78 3
", header = T)
names(df) <- c("country", 2012, 2013, 2014, 2015 )
df %>%
mutate(diff = .$'2012' - .$'2015')
country 2012 2013 2014 2015 diff
1 USA 45 23 12 42 3
2 Canada 67 98 14 25 42
3 Mexico 89 104 78 3 86
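Column names that are not syntactic R names, such as 2012, can also be referred to with backticks inside mutate, which avoids the .$ prefix. A small sketch, assuming the data frame was built with check.names = FALSE so the names stay numeric-looking:

```r
library(dplyr)

# check.names = FALSE keeps "2012" from being turned into "X2012"
df <- data.frame(
  country = c("USA", "Canada", "Mexico"),
  `2012` = c(45, 67, 89),
  `2013` = c(23, 98, 104),
  `2014` = c(12, 14, 78),
  `2015` = c(42, 25, 3),
  check.names = FALSE
)

# Backticks make R treat 2012 and 2015 as column names, not numbers
result <- df %>%
  mutate(diff = `2012` - `2015`)

# The .data pronoun takes the names as strings, handy when they are stored in variables
result2 <- df %>%
  mutate(diff = .data[["2012"]] - .data[["2015"]])
```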
I want to calculate the area under the curve for the time points for each id and column. Any suggestions? Which R packages to use? Many thanks!
id <- rep(1:3,each=5)
time <- rep(c(10,20,30,40,50),3)
q1 <- sample(100,15, replace=T)
q2 <- sample(100,15, replace=T)
q3 <- sample(100,15, replace=T)
df <- data.frame(id,time,q1,q2,q3)
df
id time q1 q2 q3
1 10 38 55 38
1 20 46 29 88
1 30 16 28 97
1 40 37 20 81
1 50 59 27 42
2 10 82 81 54
2 20 45 3 23
2 30 82 67 59
2 40 27 3 42
2 50 45 71 45
3 10 39 8 29
3 20 12 6 90
3 30 92 11 7
3 40 52 8 37
3 50 81 57 80
Wanted output, something like this:
q1 q2 q3
1 area area area
2 area area area
3 area area area
library(tidyverse)
id <- rep(1:3,each=5)
time <- rep(c(10,20,30,40,50),3)
q1 <- sample(100,15, replace=T)
q2 <- sample(100,15, replace=T)
q3 <- sample(100,15, replace=T)
df <- data.frame(id,time,q1,q2,q3)
df %>%
arrange(time) %>%
pivot_longer(cols = c(q1, q2, q3)) -> longer_df
longer_df %>%
ggplot(aes(x = time, y = value, col = factor(id))) +
geom_line() +
geom_point() +
facet_wrap(. ~ name)
longer_df %>%
group_by(id, name) %>%
mutate(lag_value = lag(value),
midpoint_value = (value + lag_value)/2) %>%
summarize(area = 10*sum(midpoint_value, na.rm = T)) %>%
pivot_wider(values_from = area)
#> `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
#> # A tibble: 3 x 4
#> # Groups: id [3]
#> id q1 q2 q3
#> <int> <dbl> <dbl> <dbl>
#> 1 1 1960 1980 2075
#> 2 2 1025 2215 2180
#> 3 3 2105 1590 2110
Created on 2021-06-30 by the reprex package (v2.0.0)
Here I will use the trapz function to calculate the integral.
library(data.table)
library(caTools) # integrate with its trapz function
# data
df <- fread("id time q1 q2 q3
1 10 38 55 38
1 20 46 29 88
1 30 16 28 97
1 40 37 20 81
1 50 59 27 42
2 10 82 81 54
2 20 45 3 23
2 30 82 67 59
2 40 27 3 42
2 50 45 71 45
3 10 39 8 29
3 20 12 6 90
3 30 92 11 7
3 40 52 8 37
3 50 81 57 80")
# calculate the area with `trapz`
df[,lapply(.SD[,2:4], function(y) trapz(time,y)),by=id]
#> id q1 q2 q3
#> 1: 1 1475 1180 3060
#> 2: 2 2175 1490 1735
#> 3: 3 2160 575 1885
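For reference, the trapezoidal rule used by both answers can be written in a few lines of base R. This is only a sketch: trapezoid is an illustrative helper name, not a function from any package. Unlike the lag-based version, it does not assume the time points are evenly spaced.

```r
# Trapezoidal rule: sum over intervals of width * mean of the two endpoint heights
trapezoid <- function(x, y) {
  o <- order(x)          # make sure the points are sorted by x
  x <- x[o]
  y <- y[o]
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# Data from the question
df <- data.frame(
  id   = rep(1:3, each = 5),
  time = rep(c(10, 20, 30, 40, 50), 3),
  q1 = c(38, 46, 16, 37, 59, 82, 45, 82, 27, 45, 39, 12, 92, 52, 81),
  q2 = c(55, 29, 28, 20, 27, 81, 3, 67, 3, 71, 8, 6, 11, 8, 57),
  q3 = c(38, 88, 97, 81, 42, 54, 23, 59, 42, 45, 29, 90, 7, 37, 80)
)

# One area per id (rows) and per column (columns)
areas <- sapply(c("q1", "q2", "q3"), function(col) {
  sapply(split(df, df$id), function(d) trapezoid(d$time, d[[col]]))
})
```

For id 1 and q1 this gives 10 * (42 + 31 + 26.5 + 48) = 1475, which agrees with the trapz output above.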
My dataset has several variables and I want to build a subset as well as create new variables based on those conditions
dat1
S1 S2 H1 H2 Month1 Year1 Month2 Year2
16 17 81 70 09 2017 07 2017
17 16 80 70 08 2017 08 2016
14 16 81 81 09 2016 05 2016
18 15 70 81 07 2016 09 2017
17 16 80 80 08 2016 05 2016
18 18 81 70 05 2017 04 2016
I want to subset such that if S1 is 16, 17, or 18 and H1 is 80 or 81, I create new variables Hist = H1, Date = paste(Month1, Year1), and Sip = S1.
The same goes for the S2, H2 set.
My output should be as follows (the first 4 rows come from the S1, H1, Month1, Year1 set and the last 2 rows come from the S2, H2, Month2, Year2 set):
Hist Sip Date
81 16 09-2017
80 17 08-2017
80 17 08-2016
81 18 05-2017
81 16 05-2016
80 16 05-2016
My Code :
datnew <- dat1 %>%
mutate(Date=ifelse((S1==16|S1==17|S1==18)&(H1==80|H1==81),paste(Month1,Year1,sep="-"),
ifelse((S2==16|S2==17|S2==18)&(H2==80|H2==81),paste(Month2,Year2,sep="-"),"NA")),
hist=ifelse((S1==16|S1==17|S1==18)&(H1==80|H1==81),H1,
ifelse((S2==16|S2==17|S2==18)&(H2==80|H2==81),H2,"NA")),
sip=ifelse((S1==16|S1==17|S1==18)&(H1==80|H1==81),S1,
ifelse((S2==16|S2==17|S2==18)&(H2==80|H2==81),S2,"NA")))
In the original data I have 10 sets of such columns, i.e. S1-S10, H1-H10, Month1-Month10, and so on. For each variable I also have many more numeric conditions.
With this method the code goes on and on. Is there a better way to do this?
Thanks in advance
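Here is a tidyverse solution: split the data into one data frame per set of columns, strip the set suffix from the column names, and bind the rows together. It assumes dat1 also has a patientId column identifying each row, which the sample data above does not show.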
library(tidyverse)
bind_rows(
dat1 %>% select(patientId, ends_with("1")) %>% rename_all(str_remove, "1"),
dat1 %>% select(patientId, ends_with("2")) %>% rename_all(str_remove, "2")
) %>%
transmute(
patientId,
Hist = H,
Sip = S,
date = paste0(Month, "-", Year)
) %>%
filter(
Sip %in% 16:18,
Hist %in% 80:81
)
#> # A tibble: 6 x 4
#> patientId Hist Sip date
#> <int> <dbl> <dbl> <chr>
#> 1 1 81 16 09-2017
#> 2 2 80 17 08-2017
#> 3 5 80 17 08-2016
#> 4 6 81 18 05-2017
#> 5 3 81 16 05-2016
#> 6 5 80 16 05-2016
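With ten sets of columns, writing one select per set gets repetitive. A sketch of an alternative, assuming every column is named as letters followed by the set number: pivot_longer with the ".value" sentinel stacks all sets in one step. The row and set columns are illustrative helpers, not part of the original data.

```r
library(dplyr)
library(tidyr)

# Sample data from the question (months kept as character to preserve "09")
dat1 <- data.frame(
  S1 = c(16, 17, 14, 18, 17, 18), S2 = c(17, 16, 16, 15, 16, 18),
  H1 = c(81, 80, 81, 70, 80, 81), H2 = c(70, 70, 81, 81, 80, 70),
  Month1 = c("09", "08", "09", "07", "08", "05"),
  Year1  = c(2017, 2017, 2016, 2016, 2016, 2017),
  Month2 = c("07", "08", "05", "09", "05", "04"),
  Year2  = c(2017, 2016, 2016, 2017, 2016, 2016)
)

result <- dat1 %>%
  mutate(row = row_number()) %>%
  # "S1" -> column S, set "1"; "Month2" -> column Month, set "2"; and so on
  pivot_longer(-row,
               names_to = c(".value", "set"),
               names_pattern = "([A-Za-z]+)(\\d+)") %>%
  mutate(set = as.integer(set)) %>%
  filter(S %in% 16:18, H %in% 80:81) %>%
  arrange(set, row) %>%
  transmute(Hist = H, Sip = S, Date = paste(Month, Year, sep = "-"))
```

The filter conditions are written once and apply to every set, so adding S3-S10 needs no further code.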
I have a data frame of baseball player information:
playerID nameFirst nameLast bats throws yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB
81955 rolliji01 Jimmy Rollins B R 2007 1 PHI NL 162 716 139 212 38 20 30 94 41 6 49 85 5
103358 wilsowi02 Willie Wilson B R 1980 1 KCA AL 161 705 133 230 28 15 3 49 79 10 28 81 3
93082 suzukic01 Ichiro Suzuki L R 2004 1 SEA AL 161 704 101 262 24 5 8 60 36 11 49 63 19
83973 samueju01 Juan Samuel R R 1984 1 PHI NL 160 701 105 191 36 19 15 69 72 15 28 168 2
15201 cashda01 Dave Cash R R 1975 1 PHI NL 162 699 111 213 40 3 4 57 13 6 56 34 5
75531 pierrju01 Juan Pierre L L 2006 1 CHN NL 162 699 87 204 32 13 3 40 58 20 32 38 0
HBP SH SF GIDP average
81955 7 0 6 11 0.2960894
103358 6 5 1 4 0.3262411
93082 4 2 3 6 0.3721591
83973 7 0 1 6 0.2724679
15201 4 0 7 8 0.3047210
75531 8 10 1 6 0.2918455
I want to return a maximum value of the batting average ('average') column where the at-bats ('AB') are greater than 100. There are also 'NaN' in the average column.
If you want to return the entire row for which the two conditions are TRUE, you can do something like this.
library(tidyverse)
data <- tibble(
AB = sample(seq(50, 150, 10), 10),
avg = c(runif(9), NaN)
)
data %>%
filter(AB >= 100) %>%
filter(avg == max(avg, na.rm = TRUE))
The first filter keeps only rows where AB is greater than or equal to 100 (use AB > 100 for a strict cutoff), and the second filter keeps the row with the maximum avg. If you only want the maximum value itself, you can do something like this:
data %>%
filter(AB >= 100) %>%
summarise(max = max(avg, na.rm = TRUE))
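The same idea applied to the column names from the question (AB, H, average) might look like the sketch below, in base R. The last two rows are hypothetical and added only to show why the AB cutoff and na.rm matter; the other values come from the sample shown in the question.

```r
batting <- data.frame(
  nameLast = c("Rollins", "Wilson", "Suzuki", "Samuel", "Cash", "Pierre",
               "SmallSample", "NoAtBats"),          # last two are hypothetical
  AB = c(716, 705, 704, 701, 699, 699, 50, 0),
  H  = c(212, 230, 262, 191, 213, 204, 25, 0)
)
batting$average <- batting$H / batting$AB           # 0 / 0 gives NaN

# Keep only players with more than 100 at-bats
eligible <- batting[batting$AB > 100, ]

# Maximum average, ignoring NaN/NA values
best <- max(eligible$average, na.rm = TRUE)

# Entire row for that player; which.max skips NaN/NA
best_row <- eligible[which.max(eligible$average), ]
```

Without the AB cutoff, the hypothetical 25-for-50 player would win with an average of 0.5.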