Creating a count variable for NA cases in data frame - r

I have an R data frame with a few columns of numerical data that also contain NA values. See the example with the first two columns below. I want to create a new column (the third one below, called Output) that shows an incremental count of NA values within each group. For example, Region A has 2 NA values, so it shows 1 and 2 next to the relevant rows. Region B has only one NA value, so it shows 1 next to it. If a region X had 10 NA values, it should show 1, 2, 3, ..., 10 next to each NA case as you move down the data frame.
Region    Value  Output
Region A  5      0
Region B  2      0
Region B  NA     1
Region A  NA     1
Region A  9      0
Region A  NA     2
Region A  4      0
I am familiar with dplyr, so I would be happy to see a solution built around it. Ideally I don't want to use a for loop, but I could if that is the best solution. In my example above I used zero for the non-NA cases; that can be anything, it doesn't have to be 0.
thanks! :)

You can use cumsum() on is.na(Value) to keep a running count of NA values within each group. Wrapping it in ifelse() assigns that count only to the NA rows; every other row gets 0.
library(dplyr)

df %>%
  group_by(Region) %>%
  mutate(Output = ifelse(is.na(Value), cumsum(is.na(Value)), 0))
Output
Region Value Output
<chr> <int> <dbl>
1 A 5 0
2 B 2 0
3 B NA 1
4 A NA 1
5 A 9 0
6 A NA 2
7 A 4 0
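For readers who want to try the same idea without dplyr, here is a minimal base R sketch. It assumes the example data frame has the two columns shown in the question (the construction of df below is a reconstruction, not the asker's actual data) and swaps in ave() for the grouped cumsum().

```r
# Reconstructed example data (assumed to match the question's table).
df <- data.frame(
  Region = c("A", "B", "B", "A", "A", "A", "A"),
  Value  = c(5, 2, NA, NA, 9, NA, 4)
)

# Flag the NA rows once.
na_flag <- is.na(df$Value)

# ave() applies cumsum() separately within each Region, giving a running
# NA count; multiplying by the flag resets the non-NA rows to 0.
df$Output <- ave(as.integer(na_flag), df$Region, FUN = cumsum) * na_flag

print(df)
```

The multiplication plays the same role as the ifelse() in the dplyr version: it keeps the count only where Value is NA.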

You could flag the NA rows with is.na(Value), then group by Region and that flag, and then use cumsum() to create your desired output:
df %>%
  mutate(output = ifelse(!is.na(Value), 0, 1)) %>%
  group_by(Region, output) %>%
  mutate(output = cumsum(output))
# A tibble: 7 x 3
# Groups: Region, output [5]
Region Value output
<fct> <dbl> <dbl>
1 A 5 0
2 B 2 0
3 B NA 1
4 A NA 1
5 A 9 0
6 A NA 2
7 A 4 0

Related

How to change NA into 0 based on other variable / how many times it was recorded

I am still new to R and need help. I want to change the NA values in variables x1, x2, x3 to 0 based on the value of count. count gives the number of observations, and x1, x2, x3 stand for the visits to the site (replications). The value in each x variable is the number of species found. However, not all sites were visited 3 times; count tells us how many times the site was actually visited. I want to distinguish a true NA from a real 0 (which means no species were found): change NA to 0 if the site was actually visited, and keep it NA if the site was not visited. For example, in the dummy data the 'zhask' site was visited 2 times, so the NA in x1 of zhask needs to be replaced with 0.
This is the dummy data:
site x1 x2 x3 count
1 miya 1 2 1 3
2 zhask NA 1 NA 2
3 balmond 3 NA 2 3
4 layla NA 1 NA 2
5 angela NA 3 NA 2
So the table needs to be changed into:
site x1 x2 x3 count
1 miya 1 2 1 3
2 zhask 0 1 NA 2
3 balmond 3 0 2 3
4 layla 0 1 NA 2
5 angela 0 3 NA 2
I've tried many things, including writing my own loop, but it is not working:
for(i in 1:nrow(df))
{
if( is.na(df$x1[i]) && (i < df$count[i]))
{df$x1[i]=0}
else
{df$x1[i]=df$x1[i]}
}
this is the script for the dummy dataframe:
x1= c(1,NA,3, NA, NA)
x2= c(2,1, NA, 1, 3)
x3 = c(1, NA, 2, NA, NA)
count=c(3,2,3,2,2)
site=c("miya", "zhask", "balmond", "layla", "angela")
df=data.frame(site,x1,x2,x3,count)
Any help will be very much appreciated!
One way would be to apply a function over all of your visit columns. Here's a way to do that.
cols <- c("x1", "x2", "x3")
df[, cols] <- mapply(function(col, idx, count) {
  ifelse(idx <= count & is.na(col), 0, col)
}, df[, cols], seq_along(cols), MoreArgs = list(count = df$count))
# site x1 x2 x3 count
# 1 miya 1 2 1 3
# 2 zhask 0 1 NA 2
# 3 balmond 3 0 2 3
# 4 layla 0 1 NA 2
# 5 angela 0 3 NA 2
We use mapply to iterate over the columns and the index of each column in parallel. We also pass in the count vector each time (since it's the same for all columns, it goes in the MoreArgs= parameter). Because every call returns a vector of the same length, mapply simplifies the result to a matrix with one column per input column, and we can use that to replace the columns with the updated values.
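To see what mapply() hands back before it is assigned into the data frame, the call can be run on its own. This is a small sketch using the dummy data from the question; the intermediate name res is only for illustration.

```r
# Dummy data from the question.
df <- data.frame(
  site  = c("miya", "zhask", "balmond", "layla", "angela"),
  x1    = c(1, NA, 3, NA, NA),
  x2    = c(2, 1, NA, 1, 3),
  x3    = c(1, NA, 2, NA, NA),
  count = c(3, 2, 3, 2, 2)
)
cols <- c("x1", "x2", "x3")

# Same call as above, but stored in `res` (illustrative name) so the
# shape of the result can be inspected.
res <- mapply(function(col, idx, count) {
  ifelse(idx <= count & is.na(col), 0, col)
}, df[, cols], seq_along(cols), MoreArgs = list(count = df$count))

print(class(res))  # inspect the type of the simplified result
print(dim(res))    # rows x columns
print(res)
```

Each column of res corresponds to one of x1, x2, x3, which is why it can be assigned straight into df[, cols].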
If you wanted to use dplyr, that might look more like
library(dplyr)
cols <- c("x1"=1, "x2"=2, "x3"=3)
df %>%
  mutate(across(starts_with("x"), ~if_else(cols[cur_column()] <= count & is.na(.x), 0, .x)))
I used the cols vector to get the index of the column which doesn't seem to be otherwise available when using across().
But a more "tidy" way to tackle this problem would be to pivot your data first to a "tidy" format. Then you can clean the data more easily and pivot back if necessary
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = starts_with("x")) %>%
  mutate(index = readr::parse_number(name)) %>%
  mutate(value = if_else(index <= count & is.na(value), 0, value)) %>%
  select(-index) %>%
  pivot_wider(names_from = name, values_from = value)
# site count x1 x2 x3
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 miya 3 1 2 1
# 2 zhask 2 0 1 NA
# 3 balmond 3 3 0 2
# 4 layla 2 0 1 NA
# 5 angela 2 0 3 NA
Via some indexing of the columns:
vars <- c("x1","x2","x3")
df[vars][is.na(df[vars]) & (col(df[vars]) <= df$count)] <- 0
# site x1 x2 x3 count
#1 miya 1 2 1 3
#2 zhask 0 1 NA 2
#3 balmond 3 0 2 3
#4 layla 0 1 NA 2
#5 angela 0 3 NA 2
Essentially this is:
selecting the variables/columns and storing their names in vars
flagging the NA cells within those variables with is.na(df[vars])
using col(df[vars]) to get the column number of every cell, which is then checked against df$count in the corresponding row (less than or equal to)
overwriting (<-) the cells that meet both of the above criteria with 0
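The steps above can be made visible by storing each intermediate matrix. The following sketch uses the dummy data from the question; the names na_mask, col_idx, visited and to_fill are introduced here only for illustration.

```r
# Dummy data from the question.
df <- data.frame(
  site  = c("miya", "zhask", "balmond", "layla", "angela"),
  x1    = c(1, NA, 3, NA, NA),
  x2    = c(2, 1, NA, 1, 3),
  x3    = c(1, NA, 2, NA, NA),
  count = c(3, 2, 3, 2, 2)
)
vars <- c("x1", "x2", "x3")

na_mask <- is.na(df[vars])        # logical matrix: TRUE where a cell is NA
col_idx <- col(df[vars])          # integer matrix: column number of each cell
visited <- col_idx <= df$count    # count is recycled down each column, so
                                  # every row is compared with its own count
to_fill <- na_mask & visited      # NA cells that fall within the visited range

print(to_fill)

# Logical-matrix indexing overwrites only the flagged cells.
df[vars][to_fill] <- 0
print(df)
```

The comparison works row by row because R stores matrices column by column and recycles df$count along each column.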
This could be yet another solution, using purrr::pmap:
When applied to a data frame, purrr::pmap works row-wise: it iterates over several arguments (here, the columns) in parallel. So c(...) holds the values of the selected variables (all except site) for the current row; head(c(...), -1) is then the x values and tail(c(...), 1) is count.
I think the rest of the solution is pretty clear, but please let me know if I need to explain more about this.
library(dplyr)
library(purrr)
library(tidyr)
df %>%
  mutate(output = pmap(df[-1], ~ {
    x <- head(c(...), -1)
    inds <- which(is.na(x))
    req <- tail(c(...), 1) - sum(!is.na(x))
    x[inds[seq_len(req)]] <- 0
    x
  })) %>%
  select(site, output, count) %>%
  unnest_wider(output)
# A tibble: 5 x 5
site x1 x2 x3 count
<chr> <dbl> <dbl> <dbl> <dbl>
1 miya 1 2 1 3
2 zhask 0 1 NA 2
3 balmond 3 0 2 3
4 layla 0 1 NA 2
5 angela 0 3 NA 2
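The per-row logic inside the pmap call can also be written as a small named function, which may be easier to step through. This is a base R sketch, not the answer's own code: fill_visits is a hypothetical helper name, and head(inds, max(req, 0)) is used in place of inds[seq_len(req)] as a guard for the case where req is not a valid length (an assumption about how such rows should be handled).

```r
# Dummy data from the question.
df <- data.frame(
  site  = c("miya", "zhask", "balmond", "layla", "angela"),
  x1    = c(1, NA, 3, NA, NA),
  x2    = c(2, 1, NA, 1, 3),
  x3    = c(1, NA, 2, NA, NA),
  count = c(3, 2, 3, 2, 2)
)
vars <- c("x1", "x2", "x3")

# Hypothetical helper: for one row, turn the first `req` NA values into 0,
# where `req` is the number of visits made but not recorded.
fill_visits <- function(x, count) {
  inds <- which(is.na(x))
  req  <- count - sum(!is.na(x))
  x[head(inds, max(req, 0))] <- 0
  x
}

# Apply the helper to every row, then bind the rows back together.
filled <- t(sapply(seq_len(nrow(df)), function(i) {
  fill_visits(unlist(df[i, vars]), df$count[i])
}))
df[vars] <- as.data.frame(filled)
print(df)
```

Unlike the column-index approaches, this one fills NA values from left to right until the number of recorded visits matches count.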

R: Matching and repeating occurence [duplicate]

(sample code below) I have two data sets. One is a library of products; the other has customer id, date, viewed product, and another detail. I want a merge where, for each id AND date, I see the whole library of products as well as where the match was. I have tried full_join, merge, and right and left joins, but they do not repeat the rows. Below is a sample of what I am trying to achieve.
id=c(1,1,1,1,2,2)
date=c(1,1,2,2,1,3)
offer=c('a','x','y','x','y','a')
section=c('general','kitchen','general','general','general','kitchen')
t=data.frame(id,date,offer,section)
offer=c('a','x','y','z')
library=data.frame(offer)
######
t table
id date offer section
1 1 1 a general
2 1 1 x kitchen
3 1 2 y general
4 1 2 x general
5 2 1 y general
6 2 3 a kitchen
library table
offer
1 a
2 x
3 y
4 z
and i want to get this:
id date offer section
1 1 1 a general
2 1 1 x kitchen
3 1 1 y NA
4 1 1 z general
...
(there would have to be 6*4 observations)
I realize because I match by offer it is not going to repeat the values like so, but what is another option to do that? Thanks a lot!!
You can use complete to get all combinations of library$offer for each id and date.
tidyr::complete(t, id, date, offer = library$offer)
# A tibble: 24 x 4
# id date offer section
# <dbl> <dbl> <chr> <chr>
# 1 1 1 a general
# 2 1 1 x kitchen
# 3 1 1 y NA
# 4 1 1 z NA
# 5 1 2 a NA
# 6 1 2 x general
# 7 1 2 y general
# 8 1 2 z NA
# 9 1 3 a NA
#10 1 3 x NA
# … with 14 more rows
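For comparison, roughly the same result can be built in base R with expand.grid() and merge(). This is a sketch under the assumption that every id/date/offer combination is wanted, as complete() produces; the data frames are named views and offers here (illustrative names) to avoid masking base R's t() and library().

```r
# Sample data from the question, under illustrative names.
views <- data.frame(
  id      = c(1, 1, 1, 1, 2, 2),
  date    = c(1, 1, 2, 2, 1, 3),
  offer   = c("a", "x", "y", "x", "y", "a"),
  section = c("general", "kitchen", "general", "general", "general", "kitchen")
)
offers <- data.frame(offer = c("a", "x", "y", "z"))

# All combinations of the observed ids, the observed dates and the full
# product library.
grid <- expand.grid(
  id    = unique(views$id),
  date  = unique(views$date),
  offer = offers$offer,
  stringsAsFactors = FALSE
)

# A left join onto the grid keeps every combination; section is NA where
# the customer did not view that product on that date.
full <- merge(grid, views, by = c("id", "date", "offer"), all.x = TRUE)
print(nrow(full))
print(head(full))
```

The key point is the same as with complete(): the rows have to be generated first, because a join on offer alone cannot create combinations that appear in neither table.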
You can use tidyr and dplyr to get the data. The crossing() function will create all combinations of the variables you pass in:
library(dplyr)
library(tidyr)

t %>%
  select(id, date) %>%
  {crossing(id = .$id, date = .$date, library)} %>%
  left_join(t)

Sort across rows to obtain three largest values

There is an injury score called the ISS (Injury Severity Score).
I have a table of injury data, one row per patient ID.
I would like to obtain the top three values across the 6 injury columns.
Column values range from 0-5.
pt_id head face abdo pelvis Extremity External
1 4 0 0 1 0 3
2 3 3 5 0 3 2
3 0 0 2 1 1 1
4 2 0 0 0 0 1
5 5 0 0 2 0 1
My output for the above example would be
pt-id n1 n2 n3
1 4 3 1
2 5 3 3
3 2 1 1
4 2 1 0
5 5 2 1
values can be in a list or in new columns as calculating the score is simple from that point on.
I had thought that I would be able to create a list for the 6 injury columns and then apply a sort to each list taking the top three values. My code for that was:
ais$ais_list <- setNames(split(ais[,2:7], seq(nrow(ais))), rownames(ais))
But I struggled to apply the sort to the lists within the data frame as unfortunately some of the data in my data set includes NA values
We could use apply row-wise to sort each row in decreasing order and keep only the first three values.
cbind(df[1], t(apply(df[-1], 1, sort, decreasing = TRUE)[1:3, ]))
# pt_id 1 2 3
#1 1 4 3 1
#2 2 5 3 3
#3 3 2 1 1
#4 4 2 1 0
#5 5 5 2 1
As some values may be NA, it is better to apply sort through an anonymous function and then take the top 3 values using head.
cbind(df[1], t(apply(df[-1], 1, function(x) head(sort(x, decreasing = TRUE), 3))))
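One thing to watch with NA values is that sort() drops them by default, so a row with fewer than three non-missing scores would yield a shorter vector. A possible variant, sketched below, keeps the NA values at the end with na.last = TRUE so that every row returns exactly three values. The sixth patient is a hypothetical row added only to show the NA case, and top3 is an illustrative helper name.

```r
# Example data from the question, plus one hypothetical patient (pt_id 6)
# with missing scores to exercise the NA handling.
df <- data.frame(
  pt_id     = 1:6,
  head      = c(4, 3, 0, 2, 5, 2),
  face      = c(0, 3, 0, 0, 0, NA),
  abdo      = c(0, 5, 2, 0, 0, NA),
  pelvis    = c(1, 0, 1, 0, 2, NA),
  Extremity = c(0, 3, 1, 0, 0, NA),
  External  = c(3, 2, 1, 1, 1, 1)
)

# Sort descending, push NA to the end, and always return three values.
top3 <- function(x) unname(sort(x, decreasing = TRUE, na.last = TRUE))[1:3]

res <- cbind(df[1], t(apply(df[-1], 1, top3)))
names(res) <- c("pt_id", "n1", "n2", "n3")
print(res)
```

Because each row now has a fixed length of three, apply() can always simplify the result to a matrix.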
A tidyverse option is to first gather the data, arrange it in descending order and for every row select only first three values. We then replace the injury column with the column names which we want and finally spread the data back to wide format.
library(tidyverse)
df %>%
  gather(injury, value, -pt_id) %>%
  arrange(desc(value)) %>%
  group_by(pt_id) %>%
  slice(1:3) %>%
  mutate(injury = 1:3) %>%
  spread(injury, value)
# pt_id `1` `2` `3`
# <int> <int> <int> <int>
#1 1 4 3 1
#2 2 5 3 3
#3 3 2 1 1
#4 4 2 1 0
#5 5 5 2 1

value of certain column based on multiple conditions in two data frames R

As shown above, there are df1 and df2
If you look at btime in df1, there are NAs.
I want to fill the btime NAs only where stnseq = 1, so only the first NA of each unique value is filled.
The value I would like to fill in comes from df2: for each unique, take the rows where boardstation = 8501970 and add the value from the departure column.
I have tried the aggregate function, but I do not know how to restrict it to boardstation 8501970 only.
Thanks to anyone for any help.
If I understood the question correctly then this might help.
library(dplyr)
df2 %>%
  group_by(unique) %>%
  summarise(departure_sum = sum(departure[boardstation == 8501970])) %>%
  right_join(df1, by = "unique") %>%
  mutate(btime = ifelse(is.na(btime) & stnseq == 1, departure_sum, btime)) %>%
  select(-departure_sum) %>%
  data.frame()
Since the sample data is in image format, I made up my own data as below:
df1
unique stnseq btime
1 1 1 NA
2 1 2 NA
3 2 1 NA
4 2 2 200
df2
unique boardstation departure
1 1 8501970 1
2 1 8501970 2
3 1 123 3
4 2 8501970 4
5 2 456 5
6 3 900 6
Output is:
unique stnseq btime
1 1 1 3
2 1 2 NA
3 2 1 4
4 2 2 200
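Since the question mentions aggregate, here is a base R sketch of the same idea on the made-up data above. It assumes, as the dplyr answer does, that the departure values for boardstation 8501970 should be summed per unique; the names dep and fill are illustrative.

```r
# Made-up data from the answer above.
df1 <- data.frame(
  unique = c(1, 1, 2, 2),
  stnseq = c(1, 2, 1, 2),
  btime  = c(NA, NA, NA, 200)
)
df2 <- data.frame(
  unique       = c(1, 1, 1, 2, 2, 3),
  boardstation = c(8501970, 8501970, 123, 8501970, 456, 900),
  departure    = c(1, 2, 3, 4, 5, 6)
)

# Restrict df2 to the station of interest first, then sum per `unique`.
dep <- aggregate(departure ~ unique,
                 data = df2[df2$boardstation == 8501970, ],
                 FUN = sum)

# Fill only the rows that are NA and have stnseq == 1, looking up the
# matching sum by `unique`.
fill <- is.na(df1$btime) & df1$stnseq == 1
df1$btime[fill] <- dep$departure[match(df1$unique[fill], dep$unique)]
print(df1)
```

Filtering the data before calling aggregate() is what restricts the sum to a single boardstation.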

Conditionally set value in previous row, within group

I have a data frame "df" like this, grouped by "nest."
nest laid stage
1 NA 2
1 5 4
1 -10 NA
2 NA 1
2 3 1
2 -8 NA
I want to make a condition so that if "laid" is > 0, the "stage" of that nest at the previous visit is set to 0. If "laid" is not greater than 0, I want no change in "stage".
Desired outcome:
nest laid stage
1 NA 0
1 5 4
1 -10 NA
2 NA 0
2 3 1
2 -8 NA
I've tried different versions of code below (dplyr and tidyr), with various errors:
df1 <- df %>%
group_by(nest) %>%
mutate(stage, if(laid > 0){stage = 0}) %>%
fill(stage, .direction = "up")
I've gone over similar questions, but they all use ifelse. Any tips are much appreciated!
You can use if_else (or ifelse if you are not certain of the column data types), which is a vectorized version of if/else. To check the next laid value, use lead:
df %>%
  group_by(nest) %>%
  # missing = stage keeps stage unchanged where lead(laid) is NA,
  # e.g. on the last row of each nest
  mutate(stage = if_else(lead(laid) > 0, 0L, stage, missing = stage))
# A tibble: 6 x 3
# Groups: nest [2]
# nest laid stage
# <int> <int> <int>
#1 1 NA 0
#2 1 5 4
#3 1 -10 NA
#4 2 NA 0
#5 2 3 1
#6 2 -8 NA
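To make the role of lead() explicit, the "next value within the group" can be computed by hand in base R. This is a sketch on the example data from the question; next_laid is an illustrative name, and ave() with a shifting function stands in for the grouped lead().

```r
# Example data from the question.
df <- data.frame(
  nest  = c(1, 1, 1, 2, 2, 2),
  laid  = c(NA, 5, -10, NA, 3, -8),
  stage = c(2, 4, NA, 1, 1, NA)
)

# Within each nest, shift laid up by one position; the last visit of a
# nest has no following visit, so it gets NA.
next_laid <- ave(df$laid, df$nest, FUN = function(x) c(x[-1], NA))

# Set stage to 0 only where the following visit has laid > 0; rows where
# the next value is NA or not positive are left untouched.
df$stage[!is.na(next_laid) & next_laid > 0] <- 0
print(df)
```

Handling the NA case explicitly matters here, because the last visit of every nest has nothing to look ahead to.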
