I am trying to get the total number of friends, which will become the denominator in a later step.
Example data:
set.seed(24) ## for sake of reproducibility
n <- 5
data <- data.frame(id = 1:n,
                   Q1 = c("same", "diff", NA, NA, NA),
                   Q2 = c("diff", "diff", "same", "diff", NA),
                   Q3 = c("same", "diff", NA, NA, "diff"),
                   Q4 = c("diff", "same", NA, NA, NA))
I first need to create a column that contains a numeric count of how many columns each participant responded to (either "same" or "diff", not counting NAs/blanks). I have tried the following:
friendship <- total.friends <- rowSums(c(data$Q1, data$Q2, data$Q3, data$Q4) != "")
friendship <- total.friends <- rowSums(!is.na(c(data$Q1, data$Q2, data$Q3, data$Q4)))
Neither is effective, likely because my data is not numeric. The first did count the cells but did not group by id as I require. Is there any function I can use to count the populated cells? And how can I edit this to count only the cells populated with "diff", so that I can then start on the second step (making the proportion)?
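A minimal base-R sketch of why both attempts fail (assuming the blanks in the real data are NAs, as in the example): c() flattens the four columns into one long vector, so rowSums() has no rows left to sum over. Passing the columns as a data frame keeps the row structure:
## count the non-NA responses per row
total.friends <- rowSums(!is.na(data[, c("Q1", "Q2", "Q3", "Q4")]))
## count only the "diff" responses per row, for the later proportion
total.diff <- rowSums(data[, c("Q1", "Q2", "Q3", "Q4")] == "diff", na.rm = TRUE)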
You could
library(dplyr)
data2 <- apply(data[, -1], MARGIN = 1, function(x) length(x[!is.na(x)]))
result <- as.data.frame(cbind(data[, 1], data2)) %>% setNames(c("id", "number"))
And result will hold the number of non-NA responses each id has.
data2 is a count of the non-NA entries for each id. It uses the apply function with MARGIN = 1, which takes each row of your dataframe and applies a function to that row. The function being applied is the length(x[!is.na(x)]) part: x[!is.na(x)] filters away all the NA entries in the row so that only the non-NA entries remain, and length() then gives us how many entries were left after filtering out the NAs.
The result of that apply is a vector in which each element is the result of applying that procedure to one row, and since you have one row per id, it amounts to computing that function for each id.
Lastly, in the result line I simply add the id back to the counts, for the sake of having them well identified and not just one column of results.
Hope this works for you :)
Here's a regex solution with grep:
data$count <- apply(data, 1, function(x) length(grep("[a-z]", x)))
Here length counts the number of times grep finds a lower-case letter in a cell of the row (the id column is numeric, so it never matches).
Result:
data
  id   Q1   Q2   Q3   Q4 count
1  1 same diff same diff     4
2  2 diff diff diff same     4
3  3 <NA> same <NA> <NA>     1
4  4 <NA> diff <NA> <NA>     1
5  5 <NA> <NA> diff <NA>     1
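To get the "diff"-only count the question asks about for its second step, the same row-wise idea works with the pattern anchored so it must match the whole cell (a sketch, building on the count column created above):
data$diff.count <- apply(data, 1, function(x) sum(grepl("^diff$", x)))
data$prop.diff <- data$diff.count / data$count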
You can also accomplish this using c_across and rowwise from the dplyr library:
library(dplyr)
data %>%
  dplyr::rowwise() %>%
  dplyr::mutate(Total = sum(!is.na(c_across(Q1:Q4)))) %>%
  dplyr::ungroup()
Note: alternatively you can use starts_with("Q") inside of c_across to do this across all columns that start with "Q" (shown below).
To count the number of a specific response, or to compute other variables that depend on a newly created variable (like a proportion), you can do it all in the same mutate statement:
data %>%
  dplyr::rowwise() %>%
  dplyr::mutate(Total = sum(!is.na(c_across(starts_with("Q")))),
                Diff = sum(c_across(starts_with("Q")) == "diff", na.rm = TRUE),
                Prop = Diff / Total) %>%
  dplyr::ungroup()
     id Q1    Q2    Q3    Q4    Total  Diff  Prop
  <int> <chr> <chr> <chr> <chr> <int> <int> <dbl>
1     1 same  diff  same  diff      4     2  0.5
2     2 diff  diff  diff  same      4     3  0.75
3     3 NA    same  NA    NA        1     0  0
4     4 NA    diff  NA    NA        1     1  1
5     5 NA    NA    diff  NA        1     1  1
I have a dataframe composed of 9 columns and more than 4000 observations. For this question I will present a simpler dataframe (I use the tidyverse library).
Let's say I have the following dataframe:
library(tidyverse)
df <- tibble(Product = c("Bread", "Oranges", "Eggs", "Bananas", "Whole Bread"),
             Weight = c(NA, 1, NA, NA, NA),
             Units = c(2, 6, 1, 2, 1),
             Price = c(1, 3.5, 0.5, 0.75, 1.5))
df
I want to replace the NA values in the Weight column with a number multiplied by the value of Units, depending on the word that appears in the Product column. Basically, the rule is:
Replace NA in Weight with 2.5 * number of units if Product contains the word "Bread". Replace with 1 * number of units if Product contains the word "Eggs".
The thing is that I don't know how to code something like that in R. I tried the following code, which a kind user gave me for a similar question:
df <- df %>%
mutate(Weight = case_when(Product == "bread" & is.na(Weight) ~ 0.25*Units))
But it doesn't work, and it doesn't take into account that the rule also has to apply when "Whole Bread" appears in my dataframe.
Does anyone have an idea?
Some of the products are not exact matches, so use str_detect:
library(dplyr)
library(stringr)
df %>%
  mutate(Weight = case_when(is.na(Weight) &
                              str_detect(Product, regex("Bread", ignore_case = TRUE)) ~ 2.5 * Units,
                            is.na(Weight) & Product == "Eggs" ~ Units,
                            TRUE ~ Weight))
Output:
# A tibble: 5 × 4
  Product     Weight Units Price
  <chr>        <dbl> <dbl> <dbl>
1 Bread          5       2  1
2 Oranges        1       6  3.5
3 Eggs           1       1  0.5
4 Bananas       NA       2  0.75
5 Whole Bread    2.5     1  1.5
A bit difficult to explain, but I have a dataframe whose values look like a staircase: for every date, different columns hold NA. I want to create a new column that contains the last non-NA column value in each row.
Hopefully it makes more sense with this example:
Sample dataframe:
test <- data.frame("date" = c(as.Date("2020-01-01"), as.Date("2020-01-02"), as.Date("2020-01-03")),
                   "a" = c(4, 3, 4),
                   "b" = c(NA, 2, 1),
                   "c" = c(NA, NA, 5))
Desired output:
date         val
2020-01-01     4
2020-01-02     2
2020-01-03     5
I'd also prefer not to do something like take the row number of the date and use that column number + 1, but if that's the only way to do it, that's that. Thanks!
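(For reference, a sketch of the index-arithmetic approach the question would rather avoid: in this exact staircase the last filled value of row i sits in column i + 1, so matrix indexing pulls it out. It breaks as soon as the staircase shape does, which is why the answers below avoid it.)
test$val <- test[cbind(seq_len(nrow(test)), seq_len(nrow(test)) + 1)]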
Here's a tidyverse-based approach: convert the columns to rows using pivot_longer, then for each date keep the last row where the value isn't NA:
library(dplyr)
library(tidyr)
test %>%
  pivot_longer(-date) %>%
  filter(!is.na(value)) %>%
  group_by(date) %>%
  summarize(value = tail(value, 1), .groups = "drop")
You can use max.col with ties.method set to "last" to locate the last non-NA value in each row.
test$val <- test[cbind(1:nrow(test), max.col(!is.na(test), ties.method = 'last'))]
test
#        date a  b  c val
#1 2020-01-01 4 NA NA   4
#2 2020-01-02 3  2 NA   2
#3 2020-01-03 4  1  5   5
You can also do this with dplyr's coalesce function, which takes the first non-missing element from the provided vectors; listing the columns in reverse order makes that first non-missing element the last non-NA value of the row.
library(dplyr)
test %>%
mutate(val = coalesce(c, b, a))
#>         date a  b  c val
#> 1 2020-01-01 4 NA NA   4
#> 2 2020-01-02 3  2 NA   2
#> 3 2020-01-03 4  1  5   5
Created on 2020-07-07 by the reprex package (v0.3.0)
Note that if you have many columns, @tfehring's & @Ronak's solutions will be better suited, as with this method you have to specify the columns manually. It does have the benefit of being short & sweet, though.
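If the manual listing is a concern, here is a sketch (not from the answers above) that builds the coalesce call programmatically; it assumes the value columns are everything except the first (date) column of the original test:
# reverse the column order so coalesce's "first non-missing" is the row's last non-NA value
test$val <- do.call(dplyr::coalesce, rev(as.list(test[-1])))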
I have a dataset in long format (i.e. multiple observations per ID). Each ID contains multiple visits at which the individual was diagnosed for diseases (in the toy example I show 3, but in my real data I have as many as 30), which are coded in consecutive columns (disease1-disease3). A value of 1 means they were diagnosed with the disease at the time of diagnosis_dt, and 0 means they did not have it. For each ID, I'm interested in summarizing whether or not they had each disease across all visits where diagnosis_dt falls between start_dt and end_dt. Some IDs don't have diagnosis information, and consequently are coded as NAs in the respective columns. I'd still like to keep this information.
A toy example of my dataset is below:
library(dplyr)
library(data.table)
ex_dat <- data.frame(ID = c(rep("a", 3),
                            rep("b", 4),
                            rep("c", 5)),
                     start_dt = as.Date(c(rep("2009-01-01", 3),
                                          rep("2009-04-01", 4),
                                          rep("2009-02-01", 5))),
                     end_dt = as.Date(c(rep("2010-12-31", 3),
                                        rep("2011-03-31", 4),
                                        rep("2011-01-31", 5))),
                     diagnosis_dt = c(as.Date(c("2011-01-03", "2010-11-01", "2009-12-01")),
                                      as.Date(c("2011-04-03", "2010-11-01", "2009-12-01", "2011-12-01")),
                                      rep(NA, 5)),
                     disease1 = c(c(1, 0, 0),
                                  c(1, 1, 0, 1),
                                  rep(NA, 5)),
                     disease2 = c(c(1, 1, 0),
                                  c(0, 0, 0, 1),
                                  rep(NA, 5)),
                     disease3 = c(c(0, 0, 0),
                                  c(0, 0, 1, 0),
                                  rep(NA, 5)))
The desired output is:
  ID disease1 disease2 disease3
1  a        0        1        0
2  b        1        0        1
3  c       NA       NA       NA
I've been trying this for hours now and my latest attempt is:
out <- ex_dat %>%
  group_by(ID) %>%
  mutate_at(vars(disease1:disease3),
            function(x) ifelse(!is.na(.$diagnosis_dt) &
                                 between(.$diagnosis_dt, .$start_dt, .$end_dt) &
                                 sum(x) > 0,
                               1, 0)) %>%
  slice(1) %>%
  select(ID, disease1:disease3)
Here is a tidyverse solution: filter eliminates the rows that do not meet the desired condition, and complete then fills in the missing groups with NA.
library(tidyverse)
ex_dat %>%
  # group by ID
  group_by(ID) %>%
  # keep the rows for which diagnosis_dt is between start_dt and end_dt
  filter(diagnosis_dt >= start_dt & diagnosis_dt <= end_dt) %>%
  # summarize all variables that start with "disease" by taking their max value
  summarize_at(vars(starts_with("disease")), max) %>%
  # complete the missing IDs: those that only had NA or did not meet the filter criteria
  complete(ID)
# A tibble: 3 x 4
#   ID    disease1 disease2 disease3
#   <fct>    <dbl>    <dbl>    <dbl>
# 1 a            0        1        0
# 2 b            1        0        1
# 3 c           NA       NA       NA
Here's an approach with the dplyr across functionality (version >= 1.0.0):
library(dplyr)
ex_dat %>%
  group_by(ID) %>%
  summarize(across(-one_of(c("start_dt", "end_dt", "diagnosis_dt")),
                   ~ if_else(any(diagnosis_dt >= start_dt & diagnosis_dt <= end_dt & .),
                             1, 0)))
# A tibble: 3 x 4
#  ID    disease1 disease2 disease3
#  <fct>    <dbl>    <dbl>    <dbl>
#1 a            0        1        0
#2 b            1        0        1
#3 c           NA       NA       NA
Note that using the & operator on the numeric disease column (the . in the lambda) coerces it to logical. I'm using the -one_of tidyselect verb because then we don't even need to know how many diseases there are; the columns that are actively being group_by-ed are automatically excluded.
Your version isn't working because 1) you need to summarize, not mutate, and 2) inside the function call, . refers to the column being worked on, not to the data coming through the pipe. Instead, you need to access the other columns without $, straight from the data mask.
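Put together, a sketch of what the attempt from the question could look like once those two fixes are applied (the same logic as the across() answer above, just phrased as a repaired version of the original):
ex_dat %>%
  group_by(ID) %>%
  summarize(across(starts_with("disease"),
                   # keep only the visits whose diagnosis date is in range,
                   # then check whether any of them had this disease
                   ~ as.numeric(any(.x[diagnosis_dt >= start_dt &
                                         diagnosis_dt <= end_dt] > 0))))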
I would like to add a new column of lagged numbers, but if the group changes I do not want to pick up the previous group's value as the lag. This is shown below using ifelse, but how would you do this with apply()?
library(dplyr)
mydata <- data.frame(groups = c("A", "A", "B", "B"), numbers = c(1, 2, 3, 4))
mydata$numbers_lagged <- lag(mydata$numbers, n = 1)
mydata$groups_lagged <- lag(mydata$groups, n = 1)
# if the group does not equal the previous group, set the lag to NA
mydata$numbers_lagged <- ifelse(mydata$groups != mydata$groups_lagged, NA, mydata$numbers_lagged)
mydata
Rather than apply, I recommend using group_by from dplyr:
library(dplyr)
mydata %>%
  group_by(groups) %>%
  dplyr::mutate(numbers_lagged = lag(numbers)) %>%
  ungroup() %>%
  arrange(groups) %>%
  mutate(groups_lagged = lag(groups))
  groups numbers numbers_lagged groups_lagged
  <fctr>   <dbl>          <dbl>        <fctr>
1      A       1             NA            NA
2      A       2              1             A
3      B       3             NA             A
4      B       4              3             B
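If you do want to stay in base R (the question asked about apply(), but ave() is the closer fit for by-group operations on a single vector), here is a sketch:
# shift each group's numbers down by one; the first element of every group
# becomes NA, so the lag never crosses a group boundary
mydata$numbers_lagged <- ave(mydata$numbers, mydata$groups,
                             FUN = function(x) c(NA, head(x, -1)))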
I am trying to calculate a grouped rolling sum based on a window size k but, in the event that the within-group row index (n) is less than k, I want to calculate the rolling sum using the condition k = min(n, k).
My issue is similar to the question "R dplyr rolling sum", but I am looking for a solution that provides a non-NA value for each row.
I can get part of the way there using dplyr and rollsum:
library(zoo)
library(dplyr)
df <- data.frame(Date = rep(seq(as.Date("2000-01-01"),
                                as.Date("2000-12-01"), by = "month"), 2),
                 ID = c(rep(1, 12), rep(2, 12)),
                 value = 1)
df <- as_tibble(df)
df <- df %>%
  group_by(ID) %>%
  mutate(total3mo = rollsum(x = value, k = 3, align = "right", fill = NA))
df
df
Source: local data frame [24 x 4]
Groups: ID [2]

         Date    ID value total3mo
       (date) (dbl) (dbl)    (dbl)
1  2000-01-01     1     1       NA
2  2000-02-01     1     1       NA
3  2000-03-01     1     1        3
4  2000-04-01     1     1        3
5  2000-05-01     1     1        3
6  2000-06-01     1     1        3
7  2000-07-01     1     1        3
8  2000-08-01     1     1        3
9  2000-09-01     1     1        3
10 2000-10-01     1     1        3
..        ...   ...   ...      ...
In this case, what I would like is to return the value 1 for observations on 2000-01-01 and the value 2 for observations on 2000-02-01. More generally, I would like the rolling sum to be calculated over the largest window possible but no larger than k.
In this particular case it's not too difficult to change some NA values by hand. However, ultimately I would like to add several more columns to my data frame containing rolling sums calculated over various windows. In that more general case it would get quite tedious to go back and change many NA values by hand.
Using the partial = TRUE argument of rollapplyr:
df %>%
  group_by(ID) %>%
  mutate(roll = rollapplyr(value, 3, sum, partial = TRUE)) %>%
  ungroup()
or without dplyr (still need zoo):
roll <- function(x) rollapplyr(x, 3, sum, partial = TRUE)
transform(df, roll = ave(value, ID, FUN = roll))
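For the "several more columns over various windows" case from the question, the same pattern extends naturally; a sketch (the 6-month window is just an example width):
df %>%
  group_by(ID) %>%
  mutate(roll3mo = rollapplyr(value, 3, sum, partial = TRUE),
         roll6mo = rollapplyr(value, 6, sum, partial = TRUE)) %>%
  ungroup()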