recode using ifelse clause within groups - r

I'm trying to set up a column (called 'combined') to indicate the combined information of Owner and Head within each group (Group). There is only one Owner in each group, and 'Head' is basically the first row of each group, i.e. the one with the minimum ID value.
This combined column should be 1 for the ID flagged as Owner, and 0 for the rest of the IDs in that group regardless of the information in 'Head'. However, for groups that do not have any Owner among their IDs (i.e. Owner is 0 for the whole group), the column should take the Head value instead. My data looks like this, and the last column (combined) is the desired outcome.
sample <- data.frame(Group = c("46005589", "46005589","46005590","46005591", "46005591","46005592","46005592","46005592", "46005593", "46005594"), ID= c("189199", "2957073", "272448", "1872092", "10374996", "1153514", "2771118","10281300", "2610301", "3564526"), Owner = c(0, 1, 1, 0, 0, 0, 1, 0, 1, 1), Head = c(1, 0, 0, 1, 0, 1, 0, 0, 1, 1), combined = c(0, 1, 1, 1, 0, 0, 1, 0, 1, 1))
> sample
Group ID Owner Head combined
1 46005589 189199 0 1 0
2 46005589 2957073 1 0 1
3 46005590 272448 1 0 1
4 46005591 1872092 0 1 1
5 46005591 10374996 0 0 0
6 46005592 1153514 0 1 0
7 46005592 2771118 1 0 1
8 46005592 10281300 0 0 0
9 46005593 2610301 1 1 1
10 46005594 3564526 1 1 1
I've tried a few dplyr and ifelse combinations, but none of them gave the output I wanted. How should I recode this column? Thanks.

I don't think this is the best way, but you might look at visually inspecting the IDs whose groups are all 0s in Owner. You could do this with rowSums and then specify these IDs using %in%. Here is a possible solution:
library(dplyr)
sample %>%
  mutate_at(vars(ID, Group), as.factor) %>%
  mutate(Combined = if_else(Owner == 1, 1, 0),
         NewCombi = ifelse(ID == "1872092", Head, Combined))
This yields the following; NewCombi is our target:
# Group ID Owner Head Combined NewCombi
#1 46005589 189199 0 1 0 0
#2 46005589 2957073 1 0 1 1
#3 46005590 272448 1 0 1 1
#4 46005591 1872092 0 1 0 1
#5 46005591 10374996 0 0 0 0
#6 46005592 1153514 0 1 0 0
#7 46005592 2771118 1 0 1 1
#8 46005592 10281300 0 0 0 0
#9 46005593 2610301 1 1 1 1
#10 46005594 3564526 1 1 1 1
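To avoid hardcoding the ID, the same idea can be written per group: if a group contains an Owner, take the Owner column, otherwise fall back to Head. A minimal sketch (not from the answer above), assuming the sample data frame from the question:
library(dplyr)
sample %>%
  group_by(Group) %>%
  mutate(NewCombi = if (any(Owner == 1)) Owner else Head) %>%  # Owner wins if the group has one
  ungroup()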

The new combined column can be created in two steps in dplyr: first use filter(all(Owner == 0)) within each group to create a column that contains the 'Head' information only for groups whose IDs have no 'Owner' at all, then merge this column back into the original data frame and sum the 1s in this column with the 1s in the 'Owner' column to obtain the combined info.
library(dplyr)
sample2 <- sample %>%
  group_by(Group) %>%
  filter(all(Owner == 0)) %>%
  mutate(Head_nullowner = ifelse(Head == 1, 1, 0)) # select all rows of IDs that do not have any owners
#merge Head_nullowner with the original dataframe by both Group and ID
sample <- merge(sample, sample2[c("Group", "ID", "Head_nullowner")], by.x = c("Group", "ID"), by.y = c("Group", "ID"), all.x = T)
sample$Head_nullowner[is.na(sample$Head_nullowner)] <- 0
sample$OwnerHead_combined = sample$Owner + sample$Head_nullowner
> sample
Group ID Owner Head combined Head_nullowner OwnerHead_combined
1 46005589 189199 0 1 0 0 0
2 46005589 2957073 1 0 1 0 1
3 46005590 272448 1 0 1 0 1
4 46005591 10374996 0 0 0 0 0
5 46005591 1872092 0 1 1 1 1
6 46005592 10281300 0 0 0 0 0
7 46005592 1153514 0 1 0 0 0
8 46005592 2771118 1 0 1 0 1
9 46005593 2610301 1 1 1 0 1
10 46005594 3564526 1 1 1 0 1
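For comparison, a base R sketch of the same rule (again assuming the original sample data frame from the question), using ave() to flag groups that contain an Owner:
# 1 if the group contains at least one Owner, 0 otherwise
has_owner <- ave(sample$Owner, sample$Group, FUN = max)
# use Owner where the group has one, otherwise fall back to Head
sample$OwnerHead_combined <- ifelse(has_owner == 1, sample$Owner, sample$Head)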

Related

Getting combination Percentages in r

Relatively new to R and very new here on Stack Overflow.
I'm trying to analyze .csv output files from a microscope.
The output tells me whether each cell in the image is "positive" (expressed with a 1) or "negative" (with a 0):
my_data <- data.frame(cell = 1:4, marker_a = c(1, 0, 0, 0), marker_b = c(0,1,1,1), marker_c = c(0,1,1,0))
Sometimes we measure 4 markers, sometimes more.
I already wrote something that gives me a vector with the "used markers" and discards the "unused markers" (in this case those would be markers e, f, and g, which also show up in the .csv file).
I want to automatically get all the possible combinations that a cell can take.
A cell can be 0 for all markers, or can be positive for marker_a but negative for marker_b, marker_c, and marker_d.
My end goal is to quantify all the cells that fall under each category/combination.
I would want a vector that names each possible combination, from all markers with a 0 value to all of them with a 1 value.
What I have been doing so far is manually generating the combinations.
no_marker <- my_data$marker_a == 0 & my_data$marker_b == 0 & my_data$marker_c == 0
a_positive <- my_data$marker_a == 1 & my_data$marker_b == 0 & my_data$marker_c == 0...
Then I can just create a data.frame to add more samples later.
cell_phenotypes <- c("no_marker", "a_positive", "ab_positive", "abc_positive", "abcd_positive", "b_positive", "bc_positive"...)
I just don't want to manually create the vector every time.
It sounds like you want expand.grid.
expand.grid(
  marker_a = c(0, 1),
  marker_b = c(0, 1),
  marker_c = c(0, 1),
  marker_d = c(0, 1)
)
#> marker_a marker_b marker_c marker_d
#> 1 0 0 0 0
#> 2 1 0 0 0
#> 3 0 1 0 0
#> 4 1 1 0 0
#> 5 0 0 1 0
#> 6 1 0 1 0
#> 7 0 1 1 0
#> 8 1 1 1 0
#> 9 0 0 0 1
#> 10 1 0 0 1
#> 11 0 1 0 1
#> 12 1 1 0 1
#> 13 0 0 1 1
#> 14 1 0 1 1
#> 15 0 1 1 1
#> 16 1 1 1 1
Note that 16 is the right number; you can check this, since 2^4 = 16.
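Since the end goal is counting cells per combination, one possible follow-up (a sketch, assuming the my_data and marker columns from the question) is to label every combination and tabulate the cells against those labels, so that combinations that never occur show up as zero counts:
markers <- grep("^marker_", names(my_data), value = TRUE)

# every possible 0/1 combination of the markers actually present
all_combos <- expand.grid(rep(list(c(0, 1)), length(markers)))
names(all_combos) <- markers
combo_labels <- apply(all_combos, 1, paste, collapse = "")

# label each cell by its marker pattern and count, keeping empty combinations
cell_labels <- apply(my_data[markers], 1, paste, collapse = "")
table(factor(cell_labels, levels = combo_labels))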

propagate changes down a column

I would like to use dplyr to go through a dataframe row by row, and if A == 0, then set B to the value of B in the previous row, otherwise leave it unchanged. However, I want "the value of B in the previous row" to refer to the previous row during the computation, not before the computation began, because the value may have changed -- in other words, I'd like changes to propagate downwards. For example, with the following data:
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
A B
1 0
0 1
0 1
0 1
1 1
I would like the result of the computation to be:
result <- data.frame(A=c(1,0,0,0,1),B=c(0,0,0,0,1))
A B
1 0
0 0
0 0
0 0
1 1
If I use something like result <- dat %>% mutate(B = ifelse(A == 0, lag(B), B)), then changes won't propagate downwards: result$B will be equal to c(0,0,1,1,1), not c(0,0,0,0,1).
More generally, how do you use dplyr::mutate to create a column that depends on itself (as it updates during the computation, not a copy of what it was before)?
Seems like you want a "last observation carried forward" approach. The most common R implementation is zoo::na.locf which fills in NA values with the last observation. All we need to do to use it in this case is to first set to NA all the B values that we want to fill in:
mutate(dat,
       B = ifelse(A == 0, NA, B),
       B = zoo::na.locf(B))
# A B
# 1 1 0
# 2 0 0
# 3 0 0
# 4 0 0
# 5 1 1
As to my comment, do note that the only thing mutate does is add the column to the data frame. We could do it just as well without mutate:
result = dat
result$B = with(result, ifelse(A == 0, NA, B))
result$B = zoo::na.locf(result$B)
Whether you use mutate or [ or $ or any other method to access/add the columns is tangential to the problem.
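One caveat worth keeping in mind with this approach: if the very first row has A == 0, the leading B becomes NA, and zoo::na.locf() drops leading NAs by default, so the result is shorter than the data frame and mutate fails. Passing na.rm = FALSE keeps the length; a sketch with a hypothetical dat2:
library(dplyr)

dat2 <- data.frame(A = c(0, 1, 0), B = c(1, 0, 0))
mutate(dat2,
       B = ifelse(A == 0, NA, B),
       B = zoo::na.locf(B, na.rm = FALSE))  # keep the leading NA instead of dropping it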
We could use fill from tidyr after changing to NA the 'B' values that correspond to 0 in 'A':
library(dplyr)
library(tidyr)
dat %>%
  mutate(B = NA^(!A) * B) %>%
  fill(B)
# A B
#1 1 0
#2 0 0
#3 0 0
#4 0 0
#5 1 1
NOTE: By default, the .direction (argument in fill) is "down", but it can also take "up" i.e. fill(B, .direction="up")
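For reference, the NA^(!A) trick works because NA^0 is 1 and NA^1 is NA in R, so B is kept where A is 1 and replaced by NA where A is 0, which is exactly what fill() then fills in:
A <- c(1, 0, 0, 0, 1)
B <- c(0, 1, 1, 1, 1)
NA^(!A)      # 1 NA NA NA  1
NA^(!A) * B  # 0 NA NA NA  1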
Here's a solution using grouping and rleid (run-length-encoding id) from data.table. I think it should be faster than the zoo solution, since zoo relies on doing multiple revs and a cumsum, and rleid is blazing fast.
Basically, we only want the last value of the previous group, so we create a grouping variable based on the diff vector of the rleid and add that to the rleid where A == 1. Then we group and take the first B value of the group for every case where A == 0.
library(dplyr)
library(data.table)
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
dat <- dat %>%
  mutate(grp = data.table::rleid(A),
         grp = ifelse(A == 1, grp + c(diff(grp), 0), grp)) %>%
  group_by(grp) %>%
  mutate(B = ifelse(A == 0, B[1], B)) # EDIT: Always carry forward B on A == 0
dat
Source: local data frame [5 x 3]
Groups: grp [2]
A B grp
<dbl> <dbl> <dbl>
1 1 0 2
2 0 0 2
3 0 0 2
4 0 0 2
5 1 1 3
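To see where the grouping variable comes from step by step (an illustration, not part of the original answer), using the dat above:
A <- c(1, 0, 0, 0, 1)
grp <- data.table::rleid(A)                  # 1 2 2 2 3: one id per run of equal values
grp + c(diff(grp), 0)                        # 2 2 2 3 3: rows just before a run change take the next run's id
ifelse(A == 1, grp + c(diff(grp), 0), grp)   # 2 2 2 2 3: rows with A == 1 are pushed into the run that follows them (if any)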
EDIT: Here's an example with a longer dataset so we can really see the behavior. (Also, I switched the condition: it should be if all A != 1, not if not all A == 1.)
set.seed(30)
dat <- data.frame(A = sample(0:1, 15, replace = TRUE),
                  B = sample(0:1, 15, replace = TRUE))
> dat
A B
1 0 1
2 0 0
3 0 1
4 0 1
5 0 0
6 0 0
7 1 1
8 0 0
9 1 0
10 0 0
11 0 0
12 0 0
13 1 0
14 1 1
15 0 0
Result:
Source: local data frame [15 x 3]
Groups: grp [5]
A B grp
<int> <int> <dbl>
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 1 1
7 1 1 3
8 0 1 3
9 1 0 5
10 0 0 5
11 0 0 5
12 0 0 5
13 1 0 6
14 1 1 7
15 0 1 7

How to refer to previous cell in a data-frame column (lagged cell), in R

I'm working in R and am trying to find a way to refer to the previous cell within a vector when that vector belongs to a data frame. By "previous cell", I'm essentially hoping for a "lag" command of some sort so that I can compare each cell to the previous one. As an example, I have these data:
A <- c(1,0,0,0,1,0,0)
B <- c(1,1,1,1,1,0,0)
AB_df <- cbind(A, B)
What I want is, for a given cell in a given row, to return 1 if that cell's value is less than the previous cell's value in the same column, and 0 otherwise. For this example, the new columns would be called "A-flag" and "B-flag" below.
A B A-flag B-flag
1 1 0 0
0 1 1 0
0 1 0 0
0 1 0 0
1 1 0 0
0 0 1 1
0 0 0 0
Any suggestions for syntax that can do this? Ideally, to just create a new column variable into an existing data-frame.
Here is one solution using the dplyr package and its lag function:
library(dplyr)
AB_df <- data.frame(A = A, B = B)
AB_df %>% mutate(A.flag = ifelse(A < lag(A, default = 0), 1, 0),
                 B.flag = ifelse(B < lag(B, default = 0), 1, 0))
A B A.flag B.flag
1 1 1 0 0
2 0 1 1 0
3 0 1 0 0
4 0 1 0 0
5 1 1 0 0
6 0 0 1 1
7 0 0 0 0
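A base R alternative (a sketch, assuming the data.frame version of AB_df built above) compares each value with the one before it; padding with the first element means the first row can never be flagged:
prev_A <- c(AB_df$A[1], head(AB_df$A, -1))  # previous value; first row is compared with itself
prev_B <- c(AB_df$B[1], head(AB_df$B, -1))

AB_df$A.flag <- as.integer(AB_df$A < prev_A)
AB_df$B.flag <- as.integer(AB_df$B < prev_B)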

How to select columns by values in a row in R

I have a large data frame marking occurrences of trigrams in a string, where the strings are the rows, the trigrams are the columns, and the values mark whether a trigram occurs in a string.
so something like this:
strs <- c('this', 'that', 'chat', 'chin')
thi <- c(1, 0, 0, 0)
tha <- c(0, 1, 0, 0)
hin <- c(0, 0, 0, 1)
hat <- c(0, 1, 1, 0)
df <- data.frame(strs, thi, tha, hin, hat)
df
# strs thi tha hin hat
#1 this 1 0 0 0
#2 that 0 1 0 1
#3 chat 0 0 0 1
#4 chin 0 0 1 0
I want to get all of the columns/trigrams that have a 1 for a given row or a given string.
So for row 2, the string 'that', the result would be a data frame that looks like this:
str tha hat
1 this 0 0
2 that 1 1
3 chat 0 1
4 chin 0 0
How could I do this?
This will give you the desired output df.
givenStr <- "that"
row <- df[df$strs==givenStr,]
df[,c(1,1+which(row[,-1]==1))]
Or as a one-liner that also keeps the strs column:
df[c(TRUE, df[df$strs == 'that', -1] == 1)]
# strs tha hat
#1 this 0 0
#2 that 1 1
#3 chat 0 1
#4 chin 0 0
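If you need this for many strings, the same logic can be wrapped in a small helper (a sketch assuming the df built above; cols_for_string is just a made-up name):
cols_for_string <- function(df, s) {
  hit <- df[df$strs == s, -1] == 1               # logical flags for the trigram columns of that row
  df[, c(TRUE, as.vector(hit)), drop = FALSE]    # keep strs plus the flagged columns
}

cols_for_string(df, "that")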

Generate a new variable based upon the values of the previous and next row by group

I am using panel data with multiple subjects (id) and have an event (first_occurrence) that occurs on different days. My goal is to create a new variable (result) that is 1 on the 2 days preceding the first occurrence, the day of the first occurrence, and the 2 days following the first occurrence.
Here is an example that includes both the sample data and the desired output:
data <- structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 3, 3), day = c(0, 1, 2, 3, 4, 5, 6, 7, 0, 1,
2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 6), first_occurrence = c(0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1), desired_output = c(1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1)), .Names = c("id",
"day", "first_occurrence", "desired_output"), row.names = c(NA,
-21L), class = "data.frame")
Although this may not be the most efficient solution, I managed to get the code working in Stata (please see the Stata code below), but I would like to get it working in R as well and would appreciate any thoughts folks have.
Thanks!
Stata code:
tsset id day
gen run = .
by id: replace run = cond(L.run == ., 1, L.run + 1)
gen test = .
replace test = run if(first_occurrence == 1)
gen test2 = .
by id: replace test2 = test[_n-1]
gen test3 = .
by id: replace test3 = test[_n-2]
gen test4 = .
by id: replace test4 = test[_n+1]
gen test5 = .
by id: replace test5 = test[_n+2]
egen test_sum = rowtotal(test test2 test3 test4 test5)
replace test_sum = 1 if(test_sum >= 1)
rename test_sum result
drop run test test2 test3 test4 test5
Here's another approach using the package dplyr:
require(dplyr) #install and load the package
data %.%
  arrange(id, day) %.% # to sort the data by id and day; if it is already sorted, you can remove this row
  group_by(id) %.%
  mutate(n = 1:n(),
         result = ifelse(abs(n - n[first_occurrence == 1]) <= 2, 1, 0)) %.%
  select(-n)
# id day first_occurrence desired_output result
#1 1 0 0 1 1
#2 1 1 0 1 1
#3 1 2 1 1 1
#4 1 3 0 1 1
#5 1 4 0 1 1
#6 1 5 0 0 0
#7 1 6 0 0 0
#8 1 7 0 0 0
#9 2 0 0 0 0
#10 2 1 0 0 0
#11 2 2 0 1 1
#12 2 3 0 1 1
#13 2 4 1 1 1
#14 2 5 0 1 1
#15 3 0 0 0 0
#16 3 1 0 0 0
#17 3 2 0 0 0
#18 3 3 0 0 0
#19 3 4 0 1 1
#20 3 5 0 1 1
#21 3 6 1 1 1
What the code does is: it first groups by id and then adds another column (n) that numbers the rows within each group from 1 to the number of rows in that group. It then creates another column, result, with an ifelse that checks the absolute difference between the current row's n and the n of the row where first_occurrence is 1. If that difference is less than or equal to 2, result is 1, otherwise 0. The last line removes the column n.
Edit:
It would probably be more efficient to place the mutate(n = 1:n()) before the group_by:
data %.%
  arrange(id, day) %.% # to sort the data by id and day; if it is already sorted, you can remove this row
  mutate(n = 1:n()) %.%
  group_by(id) %.%
  mutate(result = ifelse(abs(n - n[first_occurrence == 1]) <= 2, 1, 0)) %.%
  select(-n)
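Note that %.% comes from an old version of dplyr; with current dplyr the same idea could be written with %>% and row_number(), for example (a sketch, not the original answer):
library(dplyr)

data %>%
  arrange(id, day) %>%
  group_by(id) %>%
  mutate(result = ifelse(abs(row_number() - which(first_occurrence == 1)) <= 2, 1, 0)) %>%
  ungroup()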
Here's one way. You can use ave to work by group, then use which.max to find the first occurrence and calculate the distance from that position for all the other values:
close <- (with(data, ave(first_occurrence, id,
                         FUN = function(x) abs(seq_along(x) - which.max(x)))) <= 2) + 0
Here I use + 0 to turn the logical values into 0/1 values. Now you can combine that with your existing data:
cbind(data, close)
And that gives
id day first_occurrence desired_output close
1 1 0 0 1 1
2 1 1 0 1 1
3 1 2 1 1 1
4 1 3 0 1 1
5 1 4 0 1 1
6 1 5 0 0 0
7 1 6 0 0 0
8 1 7 0 0 0
9 2 0 0 0 0
10 2 1 0 0 0
11 2 2 0 1 1
12 2 3 0 1 1
13 2 4 1 1 1
14 2 5 0 1 1
15 3 0 0 0 0
16 3 1 0 0 0
17 3 2 0 0 0
18 3 3 0 0 0
19 3 4 0 1 1
20 3 5 0 1 1
21 3 6 1 1 1
as desired. Note that this method assumes that the data is sorted by day.
