Count the length of each event by group - r

I am counting events that satisfy multiple conditions, and my final result is reported for each year separately. An example of my df:
year <- c(rep(1981,20))
k1 <- c(rep(NA,5),rep("COLD",4),rep(NA,4),"COLD",NA,"COLD",rep(NA,4))
k2 <- c(rep(NA,10),rep("COLD",2),rep(NA,8))
k3 <- c(rep(NA,3),"COLD",rep(NA,16))
k4 <- c(rep(NA,3),rep("COLD",5),rep(NA,2),rep("COLD",6),NA,rep("COLD",3))
k5 <- c(rep(NA,3),"COLD",rep(NA,3),"COLD",rep(NA,3),"COLD",rep(NA,8))
df <- data.frame(year,k1,k2,k3,k4,k5)
The code I use is below:
library(dplyr)

rle_col <- function(k_col) {
  with(rle(is.na(k_col)), {
    i1 <- values
    i1[values & lengths <= 2] <- 'Invalid'
    sum(!values & lengths >= 5 &
          (lead(i1) != "Invalid" & lag(i1) >= 1), na.rm = TRUE)
  })
}

rezult <- df %>%
  group_by(year) %>%
  summarise(across(starts_with('k'), rle_col))
My code works well, but I need to get one more variable: the length of each event (i.e. the number of members in each event). This means that instead of summing with sum(!values & lengths >= 5 & (lead(i1) != "Invalid" & lag(i1) >= 1), na.rm = TRUE), I need to get the number of values within each event that satisfies these conditions. Is that possible? Thank you for any help.
The result I would like to see should be as follows:
Year k4
1981 5
1981 10
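
A possible rle()-based sketch for this: return the lengths of the qualifying runs instead of summing them. This assumes an event is a run of "COLD" values that may be bridged by a single internal NA, and that only events of total length >= 5 count; the helper name event_lengths is made up for illustration.

# hypothetical helper (assumed event definition, not taken from the code above)
event_lengths <- function(k_col) {
  r <- rle(is.na(k_col))
  i <- seq_along(r$values)
  # a run belongs to an event if it is a COLD run, or a single NA
  # sandwiched between two COLD runs (i.e. not at either end of the column)
  in_event <- !r$values | (r$lengths == 1 & i > 1 & i < length(i))
  grp <- cumsum(!in_event)                        # each break starts a new event id
  len <- tapply(r$lengths[in_event], grp[in_event], sum)
  unname(len[len >= 5])
}

event_lengths(df$k4)   # 5 10 for the sample k4 above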

There have to be easy solutions for this problem. Here is a ridiculously complicated (and not well-performing) approach using the tidyverse:
library(tidyr)
library(dplyr)
library(purrr)
map_df(
  names(df)[grepl("^k", names(df))],
  ~ df %>%
    group_by(year) %>%
    mutate(
      grp = (!!sym(.x) == "COLD" & !is.na(!!sym(.x))) |
            !(is.na(lead(!!sym(.x))) | is.na(lag(!!sym(.x))))
    ) %>%
    group_by(year, grp2 = cumsum(!grp)) %>%
    filter(grp) %>%
    count(name = .x) %>%
    filter(!!sym(.x) >= 5) %>%
    ungroup() %>%
    select(-grp2)
)
This returns
# A tibble: 2 x 6
year k1 k2 k3 k4 k5
<dbl> <int> <int> <int> <int> <int>
1 1981 NA NA NA 5 NA
2 1981 NA NA NA 10 NA
Issues:
It's not performing well.
If there are several groups in different columns, the output looks like this:
# A tibble: 9 x 6
year k1 k2 k3 k4 k5
<dbl> <int> <int> <int> <int> <int>
1 1981 4 NA NA NA NA
2 1981 3 NA NA NA NA
3 1981 NA 2 NA NA NA
4 1981 NA NA 1 NA NA
5 1981 NA NA NA 5 NA
6 1981 NA NA NA 10 NA
7 1981 NA NA NA NA 1
8 1981 NA NA NA NA 1
9 1981 NA NA NA NA 1
At the moment I don't know how to change it into something like
# A tibble: 2 x 6
year k1 k2 k3 k4 k5
<dbl> <int> <int> <int> <int> <int>
1 1981 4 2 1 5 1
2 1981 3 NA NA 10 1
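
One way to collapse that NA-padded output into a compact form might be to float the non-NA values of every k column to the top within each year and then drop the all-NA rows, much like the "Collapse data frame so NAs are removed" question linked below. A sketch, where res is a made-up name for the 9-row tibble above and if_any() assumes dplyr >= 1.0.4:

library(dplyr)

res %>%
  group_by(year) %>%
  # move the non-NA event lengths in every k column to the top of each year group
  mutate(across(starts_with("k"), ~ .x[order(is.na(.x))])) %>%
  ungroup() %>%
  # keep only rows that still carry at least one event length
  filter(if_any(starts_with("k"), ~ !is.na(.x)))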

Related

Collapse data frame so NAs are removed

I want to collapse this data frame so the NAs are removed. How can I accomplish this? Thanks!!
id <- c(1,1,1,2,2,3,4,5,5)
q1 <- c(23,55,7,88,90,34,11,22,99)
df <- data.frame(id,q1)
df$row <- 1:nrow(df)
spread(df, id, q1)
row 1 2 3 4 5
1 23 NA NA NA NA
2 55 NA NA NA NA
3 7 NA NA NA NA
4 NA 88 NA NA NA
5 NA 90 NA NA NA
6 NA NA 34 NA NA
7 NA NA NA 11 NA
8 NA NA NA NA 22
9 NA NA NA NA 99
I want it to look like this:
1 2 3 4 5
23 88 34 11 22
55 90 NA NA 99
7 NA NA NA NA
The row index should be created from the sequence within each 'id'. In addition, pivot_wider is a more general function than spread:
library(dplyr)
library(tidyr)
df %>%
  group_by(id) %>%
  mutate(row = row_number()) %>%
  ungroup %>%
  pivot_wider(names_from = id, values_from = q1) %>%
  select(-row)
-output
# A tibble: 3 × 5
`1` `2` `3` `4` `5`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 23 88 34 11 22
2 55 90 NA NA 99
3 7 NA NA NA NA
Or use dcast
library(data.table)
dcast(setDT(df), rowid(id) ~ id, value.var = 'q1')[, id := NULL][]
1 2 3 4 5
<num> <num> <num> <num> <num>
1: 23 88 34 11 22
2: 55 90 NA NA 99
3: 7 NA NA NA NA
Here's a base R solution. I sort each column so the non-NA values are at the top, find the number of non-NA values in the column with the most non-NA values (n), and return the top n rows from the data frame.
library(tidyr)
id <- c(1,1,1,2,2,3,4,5,5)
q1 <- c(23,55,7,88,90,34,11,22,99)
df <- data.frame(id,q1)
df$row <- 1:nrow(df)
df <- spread(df, id, q1)
collapse_df <- function(df) {
  move_na_to_bottom <- function(x) x[order(is.na(x))]
  sorted <- sapply(df, move_na_to_bottom)
  count_non_na <- function(x) sum(!is.na(x))
  n <- max(apply(df, 2, count_non_na))
  sorted[1:n, ]
}
collapse_df(df[, -1])
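
For comparison, a compact base R sketch that starts from the original long data (i.e. before the spread() call): split q1 by id, pad each piece with NA up to the size of the largest group, and let sapply() bind the pieces as columns.

# rebuild the original long data frame (before the spread() call)
df_long <- data.frame(id = c(1,1,1,2,2,3,4,5,5),
                      q1 = c(23,55,7,88,90,34,11,22,99))

lst <- split(df_long$q1, df_long$id)   # one vector of q1 values per id
n   <- max(lengths(lst))               # size of the largest id group
sapply(lst, `length<-`, n)             # pad shorter groups with NA, bind as columns
#      1  2  3  4  5
# [1,] 23 88 34 11 22
# [2,] 55 90 NA NA 99
# [3,]  7 NA NA NA NA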

R: Turning row data from one dataframe into column data by group in another

I have data in the following format:
ID  Age  Sex
1   29   M
2   32   F
3   18   F
4   89   M
5   45   M
and:
ID  subID  Type        Status  Year
1   3      Car         Y
1   11     Toyota      NULL    2011
1   23     Kia         NULL    2009
2   5      Car         N
3   2      Car         Y
3   4      Honda       NULL    2019
3   7      Fiat        NULL    2006
3   8      Mitsubishi  NULL    2020
4   1      Car         N
5   7      Car         Y
Each ID in the second table has a row specifying whether they have a car, and additional rows stating the brand(s) of the car(s) they own. Each person has a maximum of three cars. I want to simplify this data into a single table like so:
ID  Age  Sex  Car?  Car.1   Car1.year  Car.2  Car2.year  Car.3       Car3.year
1   29   M    Y     Toyota  2011       Kia    2009       NULL        NULL
2   32   F    N     NULL    NULL       NULL   NULL       NULL        NULL
3   18   F    Y     Honda   2019       Fiat   2006       Mitsubishi  2020
4   89   M    N     NULL    NULL       NULL   NULL       NULL        NULL
5   45   M    Y     NULL    NULL       NULL   NULL       NULL        NULL
I've tried using the mutate function in dplyr together with case_when, but I can't check conditions in another dataframe. If I join the tables together, I get multiple rows for each ID, which I want to avoid. The non-standard setup of the second table makes things complicated. My only remaining idea is to switch to Python/Pandas and write a for loop that slowly goes through each ID, searches the second dataframe to see whether the person has a car and which brands, and then mutates a column in the first dataframe. But given the size of my dataset, this would be inefficient and take a long time.
What is the best way to do this?
You can try the following code:
library(tidyverse)
df1
# A tibble: 5 x 3
ID Age Sex
<dbl> <dbl> <chr>
1 1 29 M
2 2 32 F
3 3 18 F
4 4 89 M
5 5 45 M
df2
# A tibble: 10 x 5
ID subID Type Status Year
<dbl> <dbl> <chr> <chr> <dbl>
1 1 3 Car Y NA
2 1 11 Toyota Y 2011
3 1 23 Kia Y 2009
4 2 5 Car N NA
5 3 2 Car Y NA
6 3 4 Honda Y 2019
7 3 7 Fiat Y 2006
8 3 8 Mitsubishi Y 2020
9 4 1 Clothed N NA
10 5 7 Clothed Y NA
df2 <- df2 %>%
  mutate(Status = if_else(Status == "NULL", "Y", Status))

df3 <- df2 %>%
  filter(!is.na(Year)) %>%
  group_by(ID) %>%
  mutate(index = row_number())

df4 <- df3 %>%
  pivot_wider(id_cols = c(ID), values_from = c(Type, Year), names_from = index)
So your desired output will be produced:
df1 %>% left_join(df2 %>% select(ID, Status) %>% distinct()) %>% left_join(df4)
# A tibble: 5 x 10
ID Age Sex Status Type_1 Type_2 Type_3 Year_1 Year_2 Year_3
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1 29 M Y Toyota Kia NA 2011 2009 NA
2 2 32 F N NA NA NA NA NA NA
3 3 18 F Y Honda Fiat Mitsubishi 2019 2006 2020
4 4 89 M N NA NA NA NA NA NA
5 5 45 M Y NA NA NA NA NA NA
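
If the exact column names from the question (Car?, Car.1, Car1.year, ...) are wanted, a final renaming step could be added. A sketch, where joined is just a made-up name for the result of the two left_join() calls above (dplyr is already loaded via the tidyverse):

joined %>%
  rename(`Car?` = Status,                              # Status column carries the Y/N flag
         Car.1  = Type_1, Car1.year = Year_1,
         Car.2  = Type_2, Car2.year = Year_2,
         Car.3  = Type_3, Car3.year = Year_3)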

Convert list of lists to data frame retaining names and all columns

I'd like to convert chemical formulas to a data frame containing columns for 1) the mineral name, 2) the chemical formula and 3) a set of columns for each element extracted from the formula. I am given the first two columns, and I can extract the count of each element from a formula using CHNOSZ::makeup(). However, I'm not familiar with working with lists and not sure how to rbind() the lists back into a data frame that contains everything I'm looking for (i.e. see 1-3 above).
Here is what I have so far - I appreciate any help (including a link to a good tutorial on how to convert data from nested lists into data frames).
library(tidyverse)
library(CHNOSZ)
formulas <- structure(list(Mineral = c("Abelsonite", "Abernathyite", "Abhurite",
"Abswurmbachite", "Acanthite", "Acetamide"), Composition = c("C31H32N4Ni",
"K(UO2)(AsO4)4(H2O)", "Sn3O(OH)2Cl2", "CuMn6(SiO4)O8", "Ag2S",
"CH3CONH2")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L))
test <- formulas %>%
  select(Composition) %>%
  map(CHNOSZ::makeup) %>%
  flatten
test2 <- do.call(rbind, test)
> test2
As H K O U
[1,] 31 32 4 1 31
[2,] 4 2 1 19 1
[3,] 2 2 3 3 2
[4,] 1 6 12 1 1
[5,] 2 1 2 1 2
[6,] 2 5 1 1 2
which is not right.
You could do something like this:
library(tidyverse)
library(CHNOSZ)

test <- formulas %>%
  mutate(res = map(Composition, ~ stack(makeup(.x)))) %>%
  unnest(cols = res) %>%
  spread(ind, values)
## A tibble: 6 x 17
# Mineral Composition C H N Ni As K O U Cl
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Abelso… C31H32N4Ni 31 32 4 1 NA NA NA NA NA
#2 Aberna… K(UO2)(AsO… NA 2 NA NA 4 1 19 1 NA
#3 Abhuri… Sn3O(OH)2C… NA 2 NA NA NA NA 3 NA 2
#4 Abswur… CuMn6(SiO4… NA NA NA NA NA NA 12 NA NA
#5 Acanth… Ag2S NA NA NA NA NA NA NA NA NA
#6 Acetam… CH3CONH2 2 5 1 NA NA NA 1 NA NA
## … with 6 more variables: Sn <dbl>, Cu <dbl>, Mn <dbl>, Si <dbl>, Ag <dbl>,
## S <dbl>
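
An alternative sketch that skips the stack()/spread() pair: makeup() returns a named numeric vector per formula, so converting each one to a list and letting tidyr::unnest_wider() spread the names into columns should give the same wide result (assuming tidyr >= 1.0; elements missing from a formula become NA).

library(tidyverse)
library(CHNOSZ)

formulas %>%
  mutate(res = map(Composition, ~ as.list(makeup(.x)))) %>%  # one named list per formula
  unnest_wider(res)                                          # names become element columns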

Calculating a ratio from two columns of data by parameters set in another column

I have date values in wide form and I'm trying to calculate the ratio of each date's value to the baseline, but only within the Start Date and End Date.
For example:
ID Start Date End Date Baseline 1/18 2/18 3/18 4/18 5/18 6/18 7/18 8/18
A 1/1/2018 5/1/2018 5 2 4 1 3 5 2 4 5
B 6/1/2018 8/1/2018 2 4 2 4 3 6 6 2 1
C 2/1/2018 3/1/2018 8 3 5 5 3 2 7 8 2
D 5/1/2015 7/1/2018 9 1 3 5 7 4 8 9 1
I would like to output to be:
ID Start Date End Date Baseline 1/18 2/18 3/18 4/18 5/18 6/18 7/18 8/18
A 1/1/2018 5/1/2018 5 0.4 0.8 0.2 0.6 1
B 6/1/2018 8/1/2018 2 3 1 0.5
C 2/1/2018 3/1/2018 8 0.625 0.625
D 5/1/2015 7/1/2018 9 0.44 0.88 1
Thank you!
A very inelegant solution with dplyr and tidyr, which someone can probably build on:
library(dplyr)
library(tidyr)
library(stringr)  # str_extract() is used below

## perform the initial simple proportion calculation
sample <- sample %>% mutate_at(vars(5:12), funs(round(. / Baseline, digits = 3)))

sample <- sample %>%
  gather(5:12, key = "day", value = "value") %>%
  rowwise() %>%  ## allow for rowwise operations
  ## create a logical vector which is TRUE if the "day" falls in the month
  ## range spanned by StartDate and EndDate
  mutate(value_temp = case_when(
    any(grepl(as.numeric(str_extract(day, "^[:digit:]{1,2}(?=/)")),
              as.numeric(str_extract(StartDate, "^[:digit:]{1,2}(?=/)")):as.numeric(str_extract(EndDate, "^[:digit:]{1,2}(?=/)")))) == T ~ T,
    TRUE ~ NA))

sample$value[is.na(sample$value_temp)] <- NA  ## set values outside the range to NA
sample$value_temp <- NULL                     ## remove the temp variable
sample <- sample %>% spread(day, value)       ## spread back to the original wide shape
> sample
# A tibble: 4 x 12
ID StartDate EndDate Baseline `1/18` `2/18` `3/18` `4/18` `5/18` `6/18` `7/18` `8/18`
<chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1/1/2018 5/1/2018 5 0.4 0.8 0.2 0.6 1 NA NA NA
2 B 6/1/2018 8/1/2018 2 NA NA NA NA NA 3 1 0.5
3 C 2/1/2018 3/1/2018 8 NA 0.625 0.625 NA NA NA NA NA
4 D 5/1/2015 7/1/2018 9 NA NA NA NA 0.444 0.889 1 NA
Update:
library(forcats)  # as_factor() is used below

sample <- sample %>% mutate_at(vars(5:12), funs(round(. / Baseline, digits = 3)))

sample <- sample %>%
  gather(5:12, key = "day", value = "value") %>%
  rowwise() %>%
  mutate(value_temp = case_when(
    any(grepl(as.numeric(str_extract(day, "^[:digit:]{1,2}(?=/)")),
              as.numeric(str_extract(Start_Date, "^[:digit:]{1,2}(?=/)")):as.numeric(str_extract(End_Date, "^[:digit:]{1,2}(?=/)")))) == T &
      any(grepl(as.numeric(str_extract(day, "[:digit:]{2}$")),
                as.numeric(str_extract(Start_Date, "[:digit:]{2}$")):as.numeric(str_extract(End_Date, "[:digit:]{2}$")))) ~ T,
    TRUE ~ NA))

sample$value[is.na(sample$value_temp)] <- NA
sample$value_temp <- NULL
sample$day <- sample$day %>% as_factor()      # keep the original column order when spreading
sample <- sample %>% spread(day, value)
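
A possibly simpler sketch that avoids the regular expressions: convert the month columns and the start/end columns to real Date objects and compare them directly, which also handles ranges spanning more than one year (such as row D). This assumes the raw data before the mutate_at() division above, the StartDate/EndDate column names shown in the first output, month columns named in m/yy form, and US-style m/d/Y dates.

library(dplyr)
library(tidyr)

sample %>%
  pivot_longer(matches("^[0-9]+/[0-9]+$"), names_to = "month", values_to = "value") %>%
  mutate(month_date = as.Date(paste0("1/", month), format = "%d/%m/%y"),   # "1/18" -> 2018-01-01
         start      = as.Date(StartDate, format = "%m/%d/%Y"),
         end        = as.Date(EndDate,   format = "%m/%d/%Y"),
         ratio      = ifelse(month_date >= start & month_date <= end,
                             round(value / Baseline, 3), NA)) %>%
  select(-value, -month_date, -start, -end) %>%
  pivot_wider(names_from = month, values_from = ratio)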

Replacing NA's with a specific condition in R [duplicate]

This question already has answers here: Replace NA values by row means (3 answers). Closed 4 years ago.
If 2017 is NA and the 2015 and 2016 columns have values, I want to assign their average to 2017 in the same row.
Index 2015 2016 2017
1 NA 6355698 10107023
2 13000000 73050000 NA
4 NA NA NA
5 10500000 NA 8000000
6 331000000 659000000 1040000000
7 55500000 NA 32032920
8 NA NA 20000000
9 2521880 5061370 7044288
...
Here is what I tried; it didn't work:
ind <- which(is.na(df), arr.ind=TRUE)
df[ind] <- rowMeans(df, na.rm = TRUE)[ind[,1]]
Also, if there are values in the 2015 and 2017 columns and 2016 is NA, I want to assign their average to the 2016 column in the same row. Any help would be appreciated!
Disclaimer: I'm not entirely clear on what your expected output is. My solution below is based on the assumption that you want to replace NA values with either the mean of all values for every year or with the mean value of all values for every Index.
Here is a tidyverse option first spreading from wide to long, replacing NAs with the mean value per year, and finally converting back from long to wide.
library(tidyverse)
df %>%
  gather(year, value, -Index) %>%
  group_by(year) %>%
  mutate(value = ifelse(is.na(value), mean(value, na.rm = T), value)) %>%
  spread(year, value)
## A tibble: 8 x 4
# Index `2015` `2016` `2017`
# <int> <dbl> <dbl> <dbl>
#1 1 115507293. 6355698. 10107023.
#2 2 13000000. 223472356. 186197372.
#3 4 115507293. 223472356. 186197372.
#4 5 115507293. 223472356. 8000000.
#5 6 331000000. 659000000. 1040000000.
#6 7 115507293. 223472356. 32032920.
#7 8 115507293. 223472356. 20000000.
#8 9 2521880. 5061370. 7044288.
Note that here we replace NAs with mean value per year. If instead you want to replace NAs with the mean value per Index value, simply replace group_by(year) with group_by(Index):
df %>%
  gather(year, value, -Index) %>%
  group_by(Index) %>%
  mutate(value = ifelse(is.na(value), mean(value, na.rm = T), value)) %>%
  spread(year, value)
## A tibble: 8 x 4
## Groups: Index [8]
# Index `2015` `2016` `2017`
# <int> <dbl> <dbl> <dbl>
#1 1 8231360. 6355698. 10107023.
#2 2 13000000. 13000000. 13000000.
#3 4 NaN NaN NaN
#4 5 8000000. 8000000. 8000000.
#5 6 331000000. 659000000. 1040000000.
#6 7 32032920. 32032920. 32032920.
#7 8 20000000. 20000000. 20000000.
#8 9 2521880. 5061370. 7044288.
Update
To replace NAs only in column 2017 with the row average of the 2015 and 2016 values, you can do:
df <- read_table("Index 2015 2016 2017
1 NA 6355698 10107023
2 13000000 73050000 NA
4 NA NA NA
5 10500000 NA 8000000
6 331000000 659000000 1040000000
7 55500000 NA 32032920
8 NA NA 20000000
9 2521880 5061370 7044288")
df %>%
mutate(`2017` = ifelse(is.na(`2017`), 0.5 * (`2015` + `2016`), `2017`))
## A tibble: 8 x 4
# Index `2015` `2016` `2017`
# <int> <int> <int> <dbl>
#1 1 NA 6355698 10107023.
#2 2 13000000 73050000 43025000.
#3 4 NA NA NA
#4 5 10500000 NA 8000000.
#5 6 331000000 659000000 1040000000.
#6 7 55500000 NA 32032920.
#7 8 NA NA 20000000.
#8 9 2521880 5061370 7044288.
Sample data
df <- read_table("Index 2015 2016 2017
1 NA 6355698 10107023
2 13000000 NA NA
4 NA NA NA
5 NA NA 8000000
6 331000000 659000000 1040000000
7 NA NA 32032920
8 NA NA 20000000
9 2521880 5061370 7044288")
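
For completeness, the base R attempt from the question can be made to work by restricting it to the year columns, so that Index is not pulled into the row means. A sketch: every NA (in 2015, 2016 or 2017) is replaced by the mean of the remaining values in its row, and rows where all three years are NA stay NaN.

yr <- c("2015", "2016", "2017")
m  <- as.matrix(df[yr])                           # year columns only, as a numeric matrix

ind    <- which(is.na(m), arr.ind = TRUE)         # row/column positions of the NAs
m[ind] <- rowMeans(m, na.rm = TRUE)[ind[, 1]]     # mean of that row's non-missing years
df[yr] <- as.data.frame(m)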
