Conditionally set value in previous row, within group - r

I have a data frame "df" like this, grouped by "nest."
nest laid stage
1 NA 2
1 5 4
1 -10 NA
2 NA 1
2 3 1
2 -8 NA
I want to make a condition so that if "laid" is > 0, the "stage" of that nest at the previous visit is set to 0. If "laid" is not greater than 0, I want no change in "stage".
Desired outcome:
nest laid stage
1 NA 0
1 5 4
1 -10 NA
2 NA 0
2 3 1
2 -8 NA
I've tried different versions of code below (dplyr and tidyr), with various errors:
df1 <- df %>%
  group_by(nest) %>%
  mutate(stage, if (laid > 0) {stage = 0}) %>%
  fill(stage, .direction = "up")
I've gone over similar questions, but they all use ifelse. Any tips are much appreciated!

You can use if_else (or ifelse if you are not certain of the column data types), which is a vectorized version of if/else. To check the next laid value, use lead:
df %>%
  group_by(nest) %>%
  mutate(stage = if_else(lead(laid) > 0, 0L, stage))
# A tibble: 6 x 3
# Groups: nest [2]
# nest laid stage
# <int> <int> <int>
#1 1 NA 0
#2 1 5 4
#3 1 -10 NA
#4 2 NA 0
#5 2 3 1
#6 2 -8 NA
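One edge case worth noting: at the last visit within each nest, lead(laid) is NA, so the condition is NA and if_else() returns NA there by default. If you would rather keep the original stage in that case, if_else() has a missing argument for exactly this. A minimal self-contained sketch (same toy data as above):

```r
library(dplyr)

df <- data.frame(
  nest  = c(1L, 1L, 1L, 2L, 2L, 2L),
  laid  = c(NA, 5L, -10L, NA, 3L, -8L),
  stage = c(2L, 4L, NA, 1L, 1L, NA)
)

out <- df %>%
  group_by(nest) %>%
  # missing = stage: keep the original value where lead(laid) is NA
  mutate(stage = if_else(lead(laid) > 0, 0L, stage, missing = stage)) %>%
  ungroup()
```

Here the last row of each nest keeps its original stage (NA in this data) instead of being overwritten by the NA condition.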

Related

How to change NA into 0 based on other variable / how many times it was recorded

I am still new to R and need help. I want to change the NA values in variables x1, x2, x3 to 0 based on the value of count. count gives the number of observations, and x1, x2, x3 stand for the visits to the site (replications); the value in each x variable is the number of species found. However, not all sites were visited 3 times, and count tells us how many times each site was actually visited. I want to distinguish a true NA from a real 0 (which means no species were found): an NA should become 0 if the site was actually visited on that occasion, and stay NA if it was not. For example, in the dummy data the 'zhask' site was visited 2 times, so the NA in x1 of zhask needs to be replaced with 0.
This is the dummy data:
site x1 x2 x3 count
1 miya 1 2 1 3
2 zhask NA 1 NA 2
3 balmond 3 NA 2 3
4 layla NA 1 NA 2
5 angela NA 3 NA 2
So the table needs to be changed into:
site x1 x2 x3 count
1 miya 1 2 1 3
2 zhask 0 1 NA 2
3 balmond 3 0 2 3
4 layla 0 1 NA 2
5 angela 0 3 NA 2
I've tried many things, including writing my own function, but it is not working:
for (i in 1:nrow(df)) {
  if (is.na(df$x1[i]) && (i < df$count[i])) {
    df$x1[i] <- 0
  } else {
    df$x1[i] <- df$x1[i]
  }
}
this is the script for the dummy dataframe:
x1= c(1,NA,3, NA, NA)
x2= c(2,1, NA, 1, 3)
x3 = c(1, NA, 2, NA, NA)
count=c(3,2,3,2,2)
site=c("miya", "zhask", "balmond", "layla", "angela")
df=data.frame(site,x1,x2,x3,count)
Any help will be very much appreciated!
One way would be to apply a function over all of your count columns. Here's a way to do that.
cols <- c("x1", "x2", "x3")
df[, cols] <- mapply(function(col, idx, count) {
  ifelse(idx <= count & is.na(col), 0, col)
}, df[, cols], seq_along(cols), MoreArgs = list(count = df$count))
# site x1 x2 x3 count
# 1 miya 1 2 1 3
# 2 zhask 0 1 NA 2
# 3 balmond 3 0 2 3
# 4 layla 0 1 NA 2
# 5 angela 0 3 NA 2
We use mapply to iterate over the columns and the index of the column. We also pass in the count value each time (since it's the same for all columns, it goes in the MoreArgs= parameter). This mapply will return a list and we can use that to replace the columns with the updated values.
If you wanted to use dplyr, that might look more like
library(dplyr)
cols <- c("x1" = 1, "x2" = 2, "x3" = 3)
df %>%
  mutate(across(starts_with("x"),
                ~ if_else(cols[cur_column()] <= count & is.na(.x), 0, .x)))
I used the cols vector to get the index of the column which doesn't seem to be otherwise available when using across().
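If you would rather not maintain that separate named vector, one alternative (a sketch, assuming the x-columns are in visit order) is to recover the index from cur_column() with match():

```r
library(dplyr)

df <- data.frame(
  site  = c("miya", "zhask", "balmond", "layla", "angela"),
  x1    = c(1, NA, 3, NA, NA),
  x2    = c(2, 1, NA, 1, 3),
  x3    = c(1, NA, 2, NA, NA),
  count = c(3, 2, 3, 2, 2)
)

cols <- c("x1", "x2", "x3")
out <- df %>%
  mutate(across(all_of(cols),
                # match() turns the current column name into its visit index
                ~ if_else(match(cur_column(), cols) <= count & is.na(.x), 0, .x)))
```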
But a more "tidy" way to tackle this problem would be to pivot your data first to a "tidy" format. Then you can clean the data more easily and pivot back if necessary
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = starts_with("x")) %>%
  mutate(index = readr::parse_number(name)) %>%
  mutate(value = if_else(index <= count & is.na(value), 0, value)) %>%
  select(-index) %>%
  pivot_wider(names_from = name, values_from = value)
# site count x1 x2 x3
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 miya 3 1 2 1
# 2 zhask 2 0 1 NA
# 3 balmond 3 3 0 2
# 4 layla 2 0 1 NA
# 5 angela 2 0 3 NA
Via some indexing of the columns:
vars <- c("x1","x2","x3")
df[vars][is.na(df[vars]) & (col(df[vars]) <= df$count)] <- 0
# site x1 x2 x3 count
#1 miya 1 2 1 3
#2 zhask 0 1 NA 2
#3 balmond 3 0 2 3
#4 layla 0 1 NA 2
#5 angela 0 3 NA 2
Essentially this is:
- selecting the variables/columns and storing them in vars
- flagging the NA cells within those variables with is.na(df[vars])
- col(df[vars]) returns the column number for every cell, which can be checked against df$count in each corresponding row
- the values meeting both of the above criteria are overwritten (<-) with 0
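To see the col() trick in isolation, here is a tiny base-R illustration: comparing the column-number matrix against a per-row vector recycles that vector down each column, so cell (i, j) is effectively tested as j <= count[i].

```r
m <- matrix(0, nrow = 2, ncol = 3)  # stand-in for df[vars]
col(m)                              # labels every cell with its column number
count <- c(3, 1)                    # per-row visit counts
keep <- col(m) <= count             # cell (i, j) is TRUE when j <= count[i]
```

In the answer above, the cells where this mask and is.na() are both TRUE are the ones overwritten with 0.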
This could be yet another solution, using purrr::pmap.
purrr::pmap is used for row-wise operations when applied to a data frame: it iterates over multiple arguments at the same time, so here c(...) refers to all corresponding elements of the selected variables (all except site) in each row.
I think the rest of the solution is fairly clear, but please let me know if I need to explain more about it.
library(dplyr)
library(purrr)
library(tidyr)
df %>%
  mutate(output = pmap(df[-1], ~ {
    x <- head(c(...), -1)
    inds <- which(is.na(x))
    req <- tail(c(...), 1) - sum(!is.na(x))
    x[inds[seq_len(req)]] <- 0
    x
  })) %>%
  select(site, output, count) %>%
  unnest_wider(output)
# A tibble: 5 x 5
site x1 x2 x3 count
<chr> <dbl> <dbl> <dbl> <dbl>
1 miya 1 2 1 3
2 zhask 0 1 NA 2
3 balmond 3 0 2 3
4 layla 0 1 NA 2
5 angela 0 3 NA 2
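For readers new to pmap, a tiny self-contained illustration of the row-wise behaviour (toy data, unrelated to df above): each call receives one element from every column, so c(...) inside the lambda is that row's values.

```r
library(purrr)

d <- data.frame(a = 1:3, b = c(10, 20, 30))

# one call per row; c(...) collects that row's a and b values
row_sums <- pmap_dbl(d, ~ sum(c(...)))
```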

Creating a count variable for NA cases in data frame

I have an R data frame including a few columns of numerical data with NA values too; see the example with the first 2 columns below. I want to create a new column (the 3rd one below, called Output) which shows an incremental count of NA values within each of my groups. For example, region A has 2 NA values, so it shows 1 and 2 next to the relevant rows. Region B has only one NA value, so it shows 1 next to it. If a region X had 10 NA values, it should show 1, 2, 3, ..., 10 next to each case as you move down the data frame.
Region     Value  Output
Region A   5      0
Region B   2      0
Region B   NA     1
Region A   NA     1
Region A   9      0
Region A   NA     2
Region A   4      0
I am familiar with dplyr, so happy to see a solution around it. Ideally I don't want to use a for loop, but could do if that's the best solution. In my example above I used zero for the non-NA cases; that can be anything, it doesn't have to be 0.
Thanks! :)
You can use cumsum to count up the NA values within each group. An ifelse assigns these counts only to the NA rows and puts 0 in Output otherwise.
library(dplyr)
df %>%
  group_by(Region) %>%
  mutate(Output = ifelse(is.na(Value), cumsum(is.na(Value)), 0))
Output
Region Value Output
<chr> <int> <dbl>
1 A 5 0
2 B 2 0
3 B NA 1
4 A NA 1
5 A 9 0
6 A NA 2
7 A 4 0
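The same running counter can also be built without dplyr; a base-R sketch using ave(), where multiplying by is.na(Value) zeroes the counter on the non-NA rows:

```r
df <- data.frame(
  Region = c("A", "B", "B", "A", "A", "A", "A"),
  Value  = c(5, 2, NA, NA, 9, NA, 4)
)

# cumsum(is.na(v)) counts NAs within the group; multiplying by is.na(v)
# keeps the count only on the NA rows and leaves 0 everywhere else
df$Output <- ave(df$Value, df$Region,
                 FUN = function(v) cumsum(is.na(v)) * is.na(v))
```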
You could create a new column from is.na(Value), then group by Region and that column, and then use cumsum() to create your desired output:
df %>%
  mutate(output = ifelse(!is.na(Value), 0, 1)) %>%
  group_by(Region, output) %>%
  mutate(output = cumsum(output))
# A tibble: 7 x 3
# Groups: Region, output [5]
Region Value output
<fct> <dbl> <dbl>
1 A 5 0
2 B 2 0
3 B NA 1
4 A NA 1
5 A 9 0
6 A NA 2
7 A 4 0

Sort across rows to obtain three largest values

There is an injury score called the ISS score.
I have a table of injury data in rows according to pt ID.
I would like to obtain the top three values across the 6 injury columns.
Column values range from 0-5.
pt_id head face abdo pelvis Extremity External
1 4 0 0 1 0 3
2 3 3 5 0 3 2
3 0 0 2 1 1 1
4 2 0 0 0 0 1
5 5 0 0 2 0 1
My output for the above example would be
pt_id n1 n2 n3
1 4 3 1
2 5 3 3
3 2 1 1
4 2 1 0
5 5 2 1
Values can be in a list or in new columns, as calculating the score is simple from that point on.
I had thought that I would be able to create a list from the 6 injury columns and then apply a sort to each list, taking the top three values. My code for that was:
ais$ais_list <- setNames(split(ais[, 2:7], seq(nrow(ais))), rownames(ais))
But I struggled to apply the sort to the lists within the data frame, as unfortunately some of the data in my data set includes NA values.
We could use apply row-wise to sort the dataframe and take only the first three values in each row.
cbind(df[1], t(apply(df[-1], 1, sort, decreasing = TRUE)[1:3, ]))
# pt_id 1 2 3
#1 1 4 3 1
#2 2 5 3 3
#3 3 2 1 1
#4 4 2 1 0
#5 5 5 2 1
As some values may be NA, it is better to apply sort inside an anonymous function and then take the top 3 values using head.
cbind(df[1], t(apply(df[-1], 1, function(x) head(sort(x, decreasing = TRUE), 3))))
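One caveat: sort() silently drops NA, so a row with fewer than three non-NA values returns fewer than three entries, and apply() then cannot produce a rectangular result. Passing na.last = TRUE keeps the full length, at the cost of NA appearing among the top three for such rows:

```r
x <- c(2, NA, NA, NA, 1, NA)

head(sort(x, decreasing = TRUE), 3)                  # c(2, 1) -- the NAs were dropped
head(sort(x, decreasing = TRUE, na.last = TRUE), 3)  # c(2, 1, NA) -- length preserved
```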
A tidyverse option is to first gather the data, arrange it in descending order and, for every pt_id, keep only the first three values. We then replace the injury column with the column names we want and finally spread the data back to wide format.
library(tidyverse)
df %>%
gather(injury, value, -pt_id) %>%
arrange(desc(value)) %>%
group_by(pt_id) %>%
slice(1:3) %>%
mutate(injury = 1:3) %>%
spread(injury, value)
# pt_id `1` `2` `3`
# <int> <int> <int> <int>
#1 1 4 3 1
#2 2 5 3 3
#3 3 2 1 1
#4 4 2 1 0
#5 5 5 2 1
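gather() and spread() are superseded in current tidyr; an equivalent pipeline with pivot_longer()/pivot_wider() might look like the sketch below (same logic, with n1-n3 column names as in the desired output):

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  pt_id     = 1:5,
  head      = c(4, 3, 0, 2, 5),
  face      = c(0, 3, 0, 0, 0),
  abdo      = c(0, 5, 2, 0, 0),
  pelvis    = c(1, 0, 1, 0, 2),
  Extremity = c(0, 3, 1, 0, 0),
  External  = c(3, 2, 1, 1, 1)
)

out <- df %>%
  pivot_longer(-pt_id, names_to = "injury") %>%
  arrange(desc(value)) %>%
  group_by(pt_id) %>%
  slice(1:3) %>%                        # top three values per patient
  mutate(injury = paste0("n", 1:3)) %>% # rename to n1, n2, n3
  ungroup() %>%
  pivot_wider(names_from = injury, values_from = value) %>%
  arrange(pt_id)
```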

Using Purrr::map2 to Loop Over Two Vectors of Column Names in Order to Conditionally Recode Multiple Columns into New Variables

library(tidyverse)
Using the sample code below, I want to use mutate or mutate_at to recode multiple columns into new columns based on the value of another column. Basically, I would like to recode the variables ending in "s" (q25s, q26s, etc...), based on the value of the corresponding non-"s" variable. So for example, if q25 = 1, then q25s will be recoded so that 1 = 0, 2 = 0, 3 = 0, 4 = 1, 5 = 1, and 88 = missing, and the new name will be q25_new. If q25 does not equal 1, then it should not be recoded and q25_new should just be NA.
However, in order to achieve this I'm attempting to use the tidyverse to create named vectors of column names, and then use mutate, recode, and if_else together with purrr::map2.
I'm thinking something like the code below should be possible? I can't quite get it to work... and I feel like I need to use paste0 somewhere to name all the new column variables that end in "_new".
cols1 <- Df %>% select(q25:q29)
cols2 <- Df %>% select(q25s:q29s)
Df <- Df %>% map2(Df[cols1], Df[cols2],
                  ~ if_else(.x == 1, mutate_at(vars(.y), funs(recode(., `1` = 0, `2` = 0, `3` = 0, `4` = 1, `5` = 1), "NA"))))
Here is the sample code:
q25<-c(2,1,88,2,1)
q26<-c(2,88,88,88,2)
q27<-c(2,2,1,1,1)
q28<-c(88,1,1,2,2)
q29<-c(1,1,1,2,2)
q25s<-c(3,5,88,4,1)
q26s<-c(4,4,5,5,1)
q27s<-c(3,3,4,1,4)
q28s<-c(4,5,88,1,3)
q29s<-c(88,88,3,4,4)
Df<-data.frame(q25,q26,q27,q28,q29,q25s,q26s,q27s,q28s,q29s)
Would this work?
map2(Df[1:5], Df[6:10],
     ~ if_else(.x == 1, recode(.y, `1` = 0, `2` = 0, `3` = 0, `4` = 1, `5` = 1, `88` = NA_real_), NA_real_)) %>%
  as.data.frame %>%
  rename_all(paste0, "_new") %>%
  cbind(Df, .)
#   q25 q26 q27 q28 q29 q25s q26s q27s q28s q29s q25_new q26_new q27_new q28_new q29_new
# 1   2   2   2  88   1    3    4    3    4   88      NA      NA      NA      NA      NA
# 2   1  88   2   1   1    5    4    3    5   88       1      NA      NA       1      NA
# 3  88  88   1   1   1   88    5    4   88    3      NA      NA       1      NA       0
# 4   2  88   1   2   2    4    5    1    1    4      NA      NA       0      NA      NA
# 5   1   2   1   2   2    1    1    4    3    4       0      NA       1      NA      NA
OK, in the end I couldn't resist the challenge, so here's a pretty much 100% tidy way to go at it (same output):
library(tidyr)
Df %>%
  mutate(n = row_number()) %>%
  gather(key, value, -n) %>%
  mutate(key2 = ifelse(grepl("s", key), "s", "x"),
         key = sub("s", "", key)) %>%
  spread(key2, value) %>%
  mutate(`_new` = if_else(x == 1,
                          recode(s, `1` = 0, `2` = 0, `3` = 0, `4` = 1, `5` = 1, `88` = NA_real_),
                          NA_real_)) %>%
  gather(key3, value, s, x, `_new`) %>%
  unite(key, key, key3, sep = "") %>%
  spread(key, value) %>%
  rename_all(~ gsub("x", "", .x)) %>%
  select(order(nchar(names(.))), -n)
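For completeness, with current dplyr the same pairwise recode can be written with across() and cur_column(), avoiding the reshape round-trip. A sketch (same recode rules; it assumes, as in the sample data, that each "s" column is named by appending "s" to its partner):

```r
library(dplyr)

q25 <- c(2, 1, 88, 2, 1);  q26 <- c(2, 88, 88, 88, 2); q27 <- c(2, 2, 1, 1, 1)
q28 <- c(88, 1, 1, 2, 2);  q29 <- c(1, 1, 1, 2, 2)
q25s <- c(3, 5, 88, 4, 1); q26s <- c(4, 4, 5, 5, 1);   q27s <- c(3, 3, 4, 1, 4)
q28s <- c(4, 5, 88, 1, 3); q29s <- c(88, 88, 3, 4, 4)
Df <- data.frame(q25, q26, q27, q28, q29, q25s, q26s, q27s, q28s, q29s)

out <- Df %>%
  mutate(across(q25:q29,
                # look up the matching "s" column by name; recode it where .x == 1
                ~ if_else(.x == 1,
                          recode(Df[[paste0(cur_column(), "s")]],
                                 `1` = 0, `2` = 0, `3` = 0,
                                 `4` = 1, `5` = 1, `88` = NA_real_),
                          NA_real_),
                .names = "{.col}_new"))
```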

Add missing values in time series efficiently

I have 500 datasets (panel data). In each I have a time series (week) across different shops (store). Within each shop, I would need to add missing time series observations.
A sample of my data would be:
store week value
1 1 50
1 3 52
1 4 10
2 1 4
2 4 84
2 5 2
which I would like to look like:
store week value
1 1 50
1 2 0
1 3 52
1 4 10
2 1 4
2 2 0
2 3 0
2 4 84
2 5 2
I currently use the following code (which works, but takes very very long on my data):
stores <- unique(mydata$store)
for (i in 1:length(stores)) {
  mydata <- merge(
    expand.grid(week = min(mydata$week):max(mydata$week)),
    mydata, all = TRUE)
  mydata[is.na(mydata)] <- 0
}
Are there better and more efficient ways to do so?
Here's a dplyr/tidyr option you could try:
library(dplyr); library(tidyr)
group_by(df, store) %>%
  complete(week = full_seq(week, 1L), fill = list(value = 0))
#Source: local data frame [9 x 3]
#
# store week value
# (int) (int) (dbl)
#1 1 1 50
#2 1 2 0
#3 1 3 52
#4 1 4 10
#5 2 1 4
#6 2 2 0
#7 2 3 0
#8 2 4 84
#9 2 5 2
By default, if you don't specify the fill parameter, new rows will be filled with NA. Since you seem to have many other columns, I would advise leaving out the fill parameter so you end up with NAs and then, if required, adding another step with mutate_each to turn the NAs into 0 (if that's appropriate).
group_by(df, store) %>%
  complete(week = full_seq(week, 1L)) %>%
  mutate_each(funs(replace(., which(is.na(.)), 0)), -store, -week)
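Note that mutate_each() and funs() have since been deprecated in dplyr; with current dplyr/tidyr the same NA-filling step would typically use across() (or tidyr::replace_na()). A sketch of that variant:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  store = c(1, 1, 1, 2, 2, 2),
  week  = c(1, 3, 4, 1, 4, 5),
  value = c(50, 52, 10, 4, 84, 2)
)

out <- df %>%
  group_by(store) %>%
  complete(week = full_seq(week, 1L)) %>%                   # add the missing weeks per store
  mutate(across(-week, ~ replace(.x, is.na(.x), 0))) %>%    # fill the new rows with 0
  ungroup()
```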
