Mutate multiple columns based on conditional and column name - r

I have a dataframe with the following structure (see example). The dots after the OperatedIn2007 column signify multiple columns with the same name, changing only the year (e.g. OperatedIn2008, OperatedIn2009, etc.).
I wish to do the following procedure:
If the group is 1, then add one in all columns whose names start with OperatedIn.
The expected result should be similar to the one presented in the desired output.
A nonscalable solution would be to use:
df <- df %>%
mutate(OperatedIn2006 = ifelse(group == 1, 1, 0)) %>%
[...]
I imagine there is some slick solution using dplyr or data.table, but I could not think of it myself.
Example
ID group OperatedIn2006 OperatedIn2007 ...
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 1 0 0
6 2 0 0
Desired output
ID group OperatedIn2006 OperatedIn2007 ...
1 1 1 1
2 2 0 0
3 3 0 0
4 4 0 0
5 1 1 1
6 2 0 0

We could use across with an ifelse statement:
library(dplyr)
df %>%
mutate(across(-c(ID, group), ~ifelse(group==1, 1, .)))
ID group OperatedIn2006 OperatedIn2007
1 1 1 1 1
2 2 2 0 0
3 3 3 0 0
4 4 4 0 0
5 5 1 1 1
6 6 2 0 0
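Since the question asks for columns whose names start with OperatedIn, the selection can also be made by name prefix rather than by excluding ID and group. The following is a minimal base R sketch under that assumption; the toy data frame mirrors the example above, and op_cols is a hypothetical helper name.

```r
# Toy data mirroring the question's example (two year columns assumed).
df <- data.frame(
  ID = 1:6,
  group = c(1, 2, 3, 4, 1, 2),
  OperatedIn2006 = 0,
  OperatedIn2007 = 0
)

# Logical vector marking the columns whose names start with "OperatedIn".
op_cols <- startsWith(names(df), "OperatedIn")

# For rows in group 1, set every matching column to 1; other rows are untouched.
df[df$group == 1, op_cols] <- 1

print(df)
```

Any further OperatedIn columns (2008, 2009, and so on) would be picked up by the same prefix test without changing the code.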

Related

Summarizing/counting multiple binary variables

For the purpose of this question, my data set includes 16 columns (c1_d, c2_d, ..., c16_d) and 364 rows (1-364). This is what it briefly looks like:
c1_d c2_d c3_d c4_d c5_d c6_d c7_d c8_d c9_d c10_d c11_d c12_d c13_d c14_d c15_d c16_d
1 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0
2 1 1 0 1 1 1 0 1 1 1 1 0 1 0 0 0
3 1 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0
4 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0
5 1 0 1 1 1 1 0 1 0 1 1 0 0 0 1 0
Please note that for example row 1, has five 1s and 11 0s.
This is what I'm trying to do: Basically counting how many rows have how many of the value 1 assigned to them (i.e. by the end of this analysis I want to get something like 20 rows had zero value 1 assigned to them, 33 rows had one value 1 assigned to them, 100 rows had 10 value 1 assigned to them, etc.).
I tried to create a data frame including all rows (364) and columns (16) I needed. I tried using the print.data.frame function, and its results are shown above, but it doesn't give me the number of 0s and 1s per row. I tried using functions such as table, ftable, and xtabs, but they don't really work for more than three variables.
I would highly appreciate your help on this.
If I understand correctly:
library(dplyr)
library(tidyr)
df %>%
transmute(count0 = rowSums(df==0),
count1 = rowSums(df==1)) %>%
pivot_longer(everything()) %>%
count(name, value)
name value n
<chr> <dbl> <int>
1 count0 5 1
2 count0 6 1
3 count0 7 1
4 count0 11 1
5 count0 12 1
6 count1 4 1
7 count1 5 1
8 count1 9 1
9 count1 10 1
10 count1 11 1
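If only the distribution of 1s per row is needed ("how many rows had k values of 1"), a shorter route might be to tabulate the row sums directly. This is a base R sketch using the five sample rows shown in the question; ones_per_row and tab are hypothetical names.

```r
# The five sample rows from the question, entered row by row.
vals <- c(
  1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
  1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0,
  1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0,
  0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
  1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0
)
df <- as.data.frame(matrix(vals, nrow = 5, byrow = TRUE))
names(df) <- paste0("c", 1:16, "_d")

# Number of 1s in each row.
ones_per_row <- rowSums(df == 1)

# How many rows have each count of 1s.
tab <- table(ones_per_row)
print(tab)
```

On the full 364-row data set the same two lines (rowSums, then table) should give one count per distinct number of 1s.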

Conditionally delete individuals from longitudinal data [duplicate]

I have a longitudinal data set where I want to drop individuals (id) if they do not fulfill the criterion indicated by criteria == 1 at any time point. To put it in context, we could say that criteria denotes whether the individual was living in the region of interest at any time.
Using some toy-data that have a similar structure as mine:
id <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
event <- c(0,1,0,1,0,0,0,0,0,0,1,0,1,0,1)
criteria <- c(1,0,0,0,0,0, 0, 0, 0, 1, 1, 1,0,0,1)
df <- data.frame(cbind(id,time,event, criteria))
> df
id time event criteria
1 1 1 0 1
2 1 2 1 0
3 1 3 0 0
4 2 1 1 0
5 2 2 0 0
6 2 3 0 0
7 3 1 0 0
8 3 2 0 0
9 3 3 0 0
10 4 1 0 1
11 4 2 1 1
12 4 3 0 1
13 5 1 1 0
14 5 2 0 0
15 5 3 1 1
Removing every id that has criteria == 0 at all time points (time) would lead to an end result looking like this:
id time event criteria
1 1 1 0 1
2 1 2 1 0
3 1 3 0 0
4 4 1 0 1
5 4 2 1 1
6 4 3 0 1
7 5 1 1 0
8 5 2 0 0
9 5 3 1 1
I've been trying to achieve this by using dplyr::group_by(id) and then filtering on the criterion, but that does not give the result I want. I'd prefer a tidyverse solution! :D
Thanks!
library(dplyr)
df %>%
group_by(id) %>%
# keep only the ids where criteria == 1 at least once
filter(any(criteria == 1)) %>%
ungroup()
If you'd be willing to look into data.table, which I recommend, it would be as simple as this:
library(data.table)
setDT(df) # make it a data.table
df[ , .SD[ !all(criteria==0) ], by=id ]
See this page for a general introduction and an explanation of the .SD idiom:
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
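As a dependency-free cross-check against the expected result, the same filter can be sketched in base R: collect the ids that have criteria == 1 at least once, then keep every row belonging to those ids. The names keep_ids and kept are hypothetical.

```r
# Toy data from the question.
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5)
time <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3)
event <- c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1)
criteria <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1)
df <- data.frame(id, time, event, criteria)

# Ids that meet the criterion at one or more time points.
keep_ids <- unique(df$id[df$criteria == 1])

# Keep all rows (every time point) for those ids.
kept <- df[df$id %in% keep_ids, ]
print(kept)
```

Note that the original row names are retained in kept; rownames(kept) <- NULL would renumber them.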

Condensing groups of multiple rows into single row maintaining the values with the highest x in R?

I have a data frame that includes multiple rows of data for each time and would like to group by time to create a condensed data frame. Columns cumsum_A and cumsum_B are cumulative sums from other columns and should maintain the values from the row with the highest x for each time group rather than be sums or averages.
x time group value cumsum_A cumsum_B
1 0 A 0 0 0
2 0 B 0 0 0
3 0 A 0 0 0
4 1 A 0 0 0
5 1 B 1 0 1
6 1 B 0 0 1
7 2 B 1 0 2
8 2 A 1 1 2
9 2 A 1 2 2
10 2 A -1 1 2
11 3 A 0 1 2
12 3 B 1 1 3
The ideal result would be the following:
x time group value cumsum_A cumsum_B
3 0 A 0 0 0
6 1 B 0 0 1
10 2 A -1 1 2
12 3 B 1 1 3
An option would be to group by 'time' and slice the row where the value of 'x' is max (which.max). Grouping by 'time' alone gives one row per time, as in the ideal result:
library(dplyr)
df1 %>%
group_by(time) %>%
slice(which.max(x)) %>%
ungroup()
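For comparison, here is a base R sketch that splits by time alone and keeps the row with the largest x in each piece, which reproduces the one-row-per-time ideal result from the question. The data frame is rebuilt from the question's table; condensed is a hypothetical name.

```r
# Data from the question.
df1 <- data.frame(
  x = 1:12,
  time = c(0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3),
  group = c("A", "B", "A", "A", "B", "B", "B", "A", "A", "A", "A", "B"),
  value = c(0, 0, 0, 0, 1, 0, 1, 1, 1, -1, 0, 1),
  cumsum_A = c(0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 1, 1),
  cumsum_B = c(0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 3),
  stringsAsFactors = FALSE
)

# Split by time, keep the row with the highest x in each piece, and re-stack.
pieces <- split(df1, df1$time)
condensed <- do.call(rbind, lapply(pieces, function(d) d[which.max(d$x), ]))
print(condensed)
```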

how to merge or join data frame and keep the row names as well?

I have a few data frames, each with one column of values and their corresponding row names.
When I merge them, I want to keep the row names too,
for example
df1<- data.frame(replicate(1,sample(0:1,10,rep=TRUE)))
df2<- data.frame(replicate(1,sample(0:1,10,rep=TRUE)))
df3<- data.frame(replicate(1,sample(0:1,10,rep=TRUE)))
I expect to have an output like
row.names1 variable row.names2 variable row.names3 variable
1 1 1 1 1 0
2 1 2 0 2 1
3 0 3 0 3 1
4 0 4 1 4 1
5 0 5 0 5 0
6 0 6 0 6 0
7 0 7 1 7 0
8 0 8 1 8 0
9 0 9 0 9 0
10 1 10 1 10 1
I guess you want to cbind the datasets keeping the rownames. An option using data.table is
library(data.table) #data.table_1.9.5
dt <- do.call(cbind,lapply(mget(paste0("df",1:3)),
as.data.table, keep.rownames=TRUE))
setnames(dt, seq(2,ncol(dt),by=2), rep('variable',3))
setnames(dt, seq(1,ncol(dt), by=2), paste0('row.names', 1:(ncol(dt)/2)))
head(dt,2)
# row.names1 variable row.names2 variable row.names3 variable
#1: 1 0 1 1 1 1
#2: 2 0 2 1 2 0
do.call(cbind,mget(paste0("df",1:3)))
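The same idea can be sketched without data.table: turn each data frame's row names into an ordinary column, then bind the pieces side by side. This is only a sketch under assumptions: the seed is arbitrary, the value columns are numbered (variable1, variable2, ...) to avoid duplicate names, which differs from the repeated "variable" header in the expected output, and dfs, parts, and combined are hypothetical names.

```r
set.seed(1)  # arbitrary seed so the toy data are reproducible

# Three single-column data frames, as in the question.
df1 <- data.frame(variable = sample(0:1, 10, replace = TRUE))
df2 <- data.frame(variable = sample(0:1, 10, replace = TRUE))
df3 <- data.frame(variable = sample(0:1, 10, replace = TRUE))
dfs <- list(df1, df2, df3)

# For each data frame, store its row names as a regular column next to the values.
parts <- lapply(seq_along(dfs), function(i) {
  d <- dfs[[i]]
  part <- data.frame(rownames(d), d[[1]], stringsAsFactors = FALSE)
  names(part) <- paste0(c("row.names", "variable"), i)
  part
})

# Bind the pieces column-wise.
combined <- do.call(cbind, parts)
head(combined, 2)
```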

randomly sum values from rows and assign them to 2 columns in R

I have a data.frame with 8 columns. One is for the list of subjects (one row per subject) and the other 7 columns each hold a score of either 1 or 0.
This is what the data looks like:
>head(splitkscores)
subject block3 block4 block5 block6 block7 block8 block9
1 40002 0 0 1 0 0 0 0
2 40002 0 0 1 0 0 1 1
3 40002 1 1 1 1 1 1 1
4 40002 1 1 0 0 0 1 0
5 40002 0 1 0 0 0 1 1
6 40002 0 1 1 0 1 1 1
I want to create a data.frame with 3 columns. One column for subjects. In the other two columns, one must have the sum of 3 or 4 randomly chosen numbers from each row of my data.frame (except the subject) and the other column must have the sum of the remaining values which were not chosen in the first random sample.
Help is much appreciated.
Thanks in advance
Here's a neat and tidy solution free of unnecessary complexity (assume the input is called df):
chosen=sort(sample(setdiff(colnames(df),"subject"),sample(c(3,4),1)))
notchosen=setdiff(colnames(df),c("subject",chosen))
out=data.frame(subject=df$subject,
sum1=apply(df[,chosen],1,sum),sum2=apply(df[,notchosen],1,sum))
In plain English: sample from the column names other than "subject", choosing a sample size of either 3 or 4, and call those column names chosen; define notchosen to be the other columns (excluding "subject" again, obviously); then return a data frame with the list of subjects, the sum of the chosen columns, and the sum of the non-chosen columns. Done.
I think this'll do it: [changed the way data were read in based on the other response because I made a manual mistake...]
splitkscores <- read.table(text = " subject block3 block4 block5 block6 block7 block8 block9
1 40002 0 0 1 0 0 0 0
2 40002 0 0 1 0 0 1 1
3 40002 1 1 1 1 1 1 1
4 40002 1 1 0 0 0 1 0
5 40002 0 1 0 0 0 1 1
6 40002 0 1 1 0 1 1 1", header = TRUE)
df2 <- data.frame(subject = splitkscores$subject, sum3or4 = NA, leftover = NA)
df2$sum3or4 <- apply(splitkscores[,2:ncol(splitkscores)], 1, function(x){
sum(sample(x, sample(c(3,4),1), replace = FALSE))
})
df2$leftover <- rowSums(splitkscores[,2:ncol(splitkscores)]) - df2$sum3or4
df2
subject sum3or4 leftover
1 40002 1 0
2 40002 2 1
3 40002 3 4
4 40002 1 2
5 40002 2 1
6 40002 1 4
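Note that the two answers differ in one respect: the first draws one set of columns and uses it for every row, while the second draws a fresh sample for each row. Below is a sketch of the per-row variant that samples column positions rather than values and fixes a seed so the split is reproducible; the seed and the names scores, sum_chosen, and sum_rest are assumptions. Because the two sums partition each row, they should always add up to the row total.

```r
splitkscores <- read.table(text = " subject block3 block4 block5 block6 block7 block8 block9
1 40002 0 0 1 0 0 0 0
2 40002 0 0 1 0 0 1 1
3 40002 1 1 1 1 1 1 1
4 40002 1 1 0 0 0 1 0
5 40002 0 1 0 0 0 1 1
6 40002 0 1 1 0 1 1 1", header = TRUE)

set.seed(42)  # arbitrary seed, only for reproducibility

# Score columns as a matrix (everything except subject).
scores <- as.matrix(splitkscores[, -1])

# For each row, pick 3 or 4 column positions at random and sum those scores.
sum_chosen <- apply(scores, 1, function(x) {
  k <- sample(c(3, 4), 1)
  sum(x[sample(length(x), k)])
})

# The remaining columns' sum is the row total minus the chosen sum.
sum_rest <- rowSums(scores) - sum_chosen

out <- data.frame(subject = splitkscores$subject, sum_chosen, sum_rest)
print(out)
```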

Resources