Combining two columns conditionally in R

I am trying to create a new column based on the values of two other columns. The values in the initial columns are 1 and 2. For the new column, I want the value to be 1 if the value in either of the first two columns is 1 and to be 0 otherwise. (If the person is either Vegetarian or Vegan, the VegYesNo column should be 1. Otherwise, it should be 0).
Vegetarian Vegan VegYesNo
         1     2        1
         2     2        0
         2     1        1
I've searched the other questions on here and didn't find one that gave me an answer, but please let me know if you know of a question that has a solution that would work.

You can do that with:
mydata$VegYesNo <- as.integer(rowSums(mydata == 1) > 0)
Or with:
mydata$VegYesNo <- 1 * (rowSums(mydata == 1) > 0)
The result:
> mydata
  Vegetarian Vegan VegYesNo
1          1     2        1
2          2     2        0
3          2     1        1
Data:
mydata <- read.table(text="Vegetarian Vegan VegYesNo
1 2 1
2 2 0
2 1 1", header=TRUE)[,-3]  # [,-3] drops the pre-filled VegYesNo column so we can recreate it

Using the dplyr package:
text <- "Vegetarian Vegan
1 2
2 2
2 1"
mydf <- read.table(text = text, header = TRUE)
library(dplyr)
mydf %>%
  mutate(VegYesNo = case_when(
    Vegetarian == 1 | Vegan == 1 ~ 1,
    TRUE ~ 0
  ))
The result is:
  Vegetarian Vegan VegYesNo
1          1     2        1
2          2     2        0
3          2     1        1
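Since the new column is binary, case_when() can also be replaced by if_else(); a shorter sketch with the same mydf:
mydf %>%
  mutate(VegYesNo = if_else(Vegetarian == 1 | Vegan == 1, 1, 0))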

Assuming you have a data.frame:
data <- data.frame(vegetarian = c(1, 2, 2), Vegan = c(2, 2, 1))
data$VegYesNo <- as.numeric(data$vegetarian == 1 | data$Vegan == 1)
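A small usage note, in case the columns can contain NA: R's | keeps the result informative where it can. A sketch with hypothetical data2 containing missing entries:
data2 <- data.frame(vegetarian = c(1, NA, NA), Vegan = c(2, 1, 2))
as.numeric(data2$vegetarian == 1 | data2$Vegan == 1)
#> [1]  1  1 NA
A known Vegan = 1 still yields 1 (NA | TRUE is TRUE), while a row where the known value is 2 and the other is missing stays NA.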

Related

How to combine several binary variables into a new categorical variable

I am trying to combine several binary variables into one categorical variable. I have ten categorical variables, each describing tasks of a job.
Data looks something like this:
Personal_Help <- c(1,1,2,1,2,1)
PR <- c(2,1,1,2,1,2)
Fundraising <- c(1,2,1,2,2,1)
# etc.
My goal is to combine them into one variable, where the value 1 (= Yes) of each binary variable becomes a separate level of the categorical variable.
To illustrate what I imagine (wrong code obviously):
If Personal_Help = 1 -> Jobcontent = 1
If PR = 1 -> Jobcontent = 2
If Fundraising = 1 -> Jobcontent = 3
etc.
Thank you very much in advance!
Edit:
Thanks for your answers and apologies for my late reply. I think more context from my side is needed. The goal of combining the binary variables into a categorical variable is to plot them in one graphic (using ggplot). The graphic should display how many respondents report the above-mentioned tasks as part of their work.
If you're interested only in the first occurrence of 1 among your variables:
df <- data.frame(t(data.frame(Personal_Help, PR, Fundraising)))
result <- sapply(df, function(x) which(x == 1)[1])
result
#> X1 X2 X3 X4 X5 X6
#>  1  1  2  1  2  1
Of course, this will depend on what you want to do when multiple values are 1 as asked in comments.
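For instance, a sketch that keeps all tasks coded 1 (rather than just the first), pasting the matching column indices per respondent:
tasks <- data.frame(Personal_Help, PR, Fundraising)
apply(tasks, 1, function(x) paste(which(x == 1), collapse = ","))
#> [1] "1,3" "1,2" "2,3" "1"   "2"   "1,3"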
Since there are three different variables, and each variable can take either of 2 values, there are 2^3 = 8 possible unique combinations of the three variables, each of which should have a unique number associated.
One way to do this is to imagine each column as being a digit in a three digit binary number. If we subtract 1 from each column, we get a 1 for "no" and a 0 for "yes". This means that our eight possible unique values, and the binary numbers associated with each would be:
binary    decimal
0 0 0  =  0
0 0 1  =  1
0 1 0  =  2
0 1 1  =  3
1 0 0  =  4
1 0 1  =  5
1 1 0  =  6
1 1 1  =  7
This system will work for any number of columns, and can be achieved as follows:
Personal_Help <- c(1,1,2,1,2,1)
PR <- c(2,1,1,2,1,2)
Fundraising <- c(1,2,1,2,2,1)
df <- data.frame(Personal_Help, PR, Fundraising)
# each column contributes one binary digit: (value - 1) * 2^(i - 1)
New_var <- 0
for(i in seq_along(df)) New_var <- New_var + (2^(i - 1)) * (df[[i]] - 1)
df$New_var <- New_var
The end result would then be:
df
#>   Personal_Help PR Fundraising New_var
#> 1             1  2           1       2
#> 2             1  1           2       4
#> 3             2  1           1       1
#> 4             1  2           2       6
#> 5             2  1           2       5
#> 6             1  2           1       2
In your actual data, there will be 1024 possible combinations of tasks, so this will generate numbers for New_var between 0 and 1023. Because of how it is generated, you can actually use this single number to reverse engineer the entire row as long as you know the original column order.
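For instance, a sketch of that reverse engineering (assuming the column order above): each column's answer is one binary digit of New_var, recoverable with integer division and modulo.
decode <- function(code, n_cols = 3) {
  # digit i of 'code' is 0 for "yes" (1) and 1 for "no" (2), so add 1 back
  sapply(seq_len(n_cols), function(i) (code %/% 2^(i - 1)) %% 2 + 1)
}
decode(2)
#> [1] 1 2 1   # Personal_Help = 1, PR = 2, Fundraising = 1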
As #ulfelder commented, you need to clarify how you want to handle cases where more than one column is 1.
Assuming you want to use the first column equal to 1, you can use which.min(), applied by row:
data <- data.frame(Personal_Help, PR, Fundraising)
data$Jobcontent <- apply(data, MARGIN = 1, which.min)
Result:
  Personal_Help PR Fundraising Jobcontent
1             1  2           1          1
2             1  1           2          1
3             2  1           1          2
4             1  2           2          1
5             2  1           2          2
6             1  2           1          1
If you’d like Jobcontent to include the name of each job, you can index into names(data):
data$Jobcontent <- names(data)[apply(data, MARGIN = 1, which.min)]
Result:
  Personal_Help PR Fundraising    Jobcontent
1             1  2           1 Personal_Help
2             1  1           2 Personal_Help
3             2  1           1            PR
4             1  2           2 Personal_Help
5             2  1           2            PR
6             1  2           1 Personal_Help
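One caveat to hedge: which.min() (and max.col() below) still returns the first column when a row contains no 1 at all, silently mislabeling such rows. If that can occur in your data, a guarded sketch (task column names assumed as above):
tasks <- data[c("Personal_Help", "PR", "Fundraising")]
data$Jobcontent <- apply(tasks, 1, function(r) {
  if (any(r == 1)) names(tasks)[which.min(r)] else NA_character_
})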
max.col may help here:
Jobcontent <- max.col(-data.frame(Personal_Help, PR, Fundraising), "first")
Jobcontent
#> [1] 1 1 2 1 2 1

How to perform the equivalent of Excel sumifs in dplyr where there are multiple conditions?

I get the correct output, shown below with the code beneath it, in the SumIfs_1 column, which sums all of the Code2's in the array under the single condition that the corresponding Code1 in the array is < the current row's Code1:
  Name Group Code1 Code2 SumIfs_1
1    B     1     0     1        1
2    R     1     1     0        2
3    R     1     1     2        2
4    R     2     3     0        4
5    R     2     3     1        4
6    B     3     4     2        5
7    A     3    -1     1        0
8    A     4     0     0        1
9    A     1     0     0        1
Code:
library(dplyr)
myData <-
  data.frame(
    Name = c("B", "R", "R", "R", "R", "B", "A", "A", "A"),
    Group = c(1, 1, 1, 2, 2, 3, 3, 4, 1),
    Code1 = c(0, 1, 1, 3, 3, 4, -1, 0, 0),
    Code2 = c(1, 0, 2, 0, 1, 2, 1, 0, 0)
  )
myData %>% mutate(SumIfs_1 = sapply(1:n(), function(x) sum(Code2[1:n()][Code1[1:n()] < Code1[x]])))
I'd like to expand the code to add another condition to the above sumifs() equivalent, creating a sumifs() with multiple conditions: we add only those Code2's whose Group in the array is < the current row's Group, as further explained in this image (orange shows what already works in the Excel equivalent of the above code for SumIfs_1; yellow shows the multi-condition sumifs() I am trying to add, SumIfs_2):
Any recommendations for how to do this?
I'd like to stick with sapply() if possible, and more importantly I'd like to stick with dplyr or base R as I'm trying to prevent package bloat.
For what it's worth, here's my humble attempt to generate the SumIfs_2 column (which does not work):
myData %>% mutate(SumIfs_2 = sapply(1:n(), function(x) sum(Code2[1:n()][Code1[1:n()] < Code1[x]][Group[1:n()] < Group[x]])))
You're doing pretty much the same thing; you just need to add another & condition where you are subsetting.
Also, you don't need to call Code1[1:n()]: Code1 on its own already refers to all of the values in the Code1 column.
I believe you are looking for
myData %>% mutate(SumIfs_2 = sapply(1:n(), function(x) sum(Code2[(Code1 < Code1[x]) & (Group < Group[x])])))
  Name Group Code1 Code2 SumIfs_2
1    B     1     0     1        0
2    R     1     1     0        0
3    R     1     1     2        0
4    R     2     3     0        3
5    R     2     3     1        3
6    B     3     4     2        4
7    A     3    -1     1        0
8    A     4     0     0        1
9    A     1     0     0        0
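If the data grows, the per-row sapply() subset is quadratic work; a vectorized base-R sketch of the same logic (still quadratic in memory, but no loop, and no extra packages per the question's constraint):
# entry [i, x] is TRUE when row i should contribute to row x's sum
lt_code  <- outer(myData$Code1, myData$Code1, "<")
lt_group <- outer(myData$Group, myData$Group, "<")
myData$SumIfs_2 <- as.vector(crossprod(lt_code & lt_group, myData$Code2))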

How to identify frequencies of multiple columns based on condition using R

I have a dataframe which contains 63 columns and 50 rows. I have given below a toy dataset.
> df
rs_1 rs_2 rs_3 rs_4 ... rs_60   A.Ag   B.Ag        C.Ag
   0    0    1    2 ...     1 02:/01 02:/07 03:07/04:01
   1    2    1    2 ...     0 02:/01 02:/07 03:07/04:01
   2    1    1    2 ...     2 02:/01 02:/07 03:07/04:01
   0    0    1    0 ...     2 02:/01 02:/07 03:07/04:01
Now I need to find the highest frequencies of the columns (A.Ag, B.Ag, and C.Ag) for each rs_* = 0, 1, and 2 separately. The desired outcome would be, for example for rs_* = 0:
rs_id   Code A.Ag   Code B.Ag  Code C.Ag
rs_1  02:/01    2 02:/07    5 03:07    5
rs_2  02:/01    3 01:/05    2 05:00    4
Could you please help me with this? I tried the following loop:
res_types <- data.frame()  # initialize (assumed; not shown in my original attempt)
for (i in 1:60) {
  if (file[, i] == 0) {
    temp1 <- data.frame(sort(table(file[, 61]), decreasing = TRUE))  # only for the A.Ag column
    temp1$Var1 <- names(file)[i]
    res_types <- rbind(res_types, temp1)
  }
}
I got the frequency counts and the rs_id, but could not get the Code column. Can anyone help me with this?
The desired outcome would be:
rs_id Code Combination A.Ag Combination B.Ag Combination C.Ag
rs_1     0   1:01/1:01    7 13:02/13:02    2 03:04/03:04    3
rs_1     0  1:01/11:01    5 13:02/49:01    2 03:04/15:02    3
rs_1     0   1:01/2:01    4 13:02/57:01    2  03:04/7:01    3
rs_1     1   1:01/2:05    3  13:02/8:01    4 06:02/06:02    3
rs_1     1  1:01/24:02    3 14:01/14:02    3 06:02/15:02    3
rs_1     1  1:01/24:02    3 14:01/14:02    2 06:02/15:02    3
rs_2     0  1:01/31:01    3 15:01/15:01    1  06:02/3:03    4
rs_2     0  11:01/2:01    4 15:01/18:01    1  06:02/3:04    1
It might be easier to do this using the data.table package. Explanation inline.
library(data.table)
# convert into a long format
longDat <- melt(dat, measure.vars = patterns("^rs"), variable.name = "rs_id",
                value.name = "val_id")
# for each group of rs_id (rs_1, ..., rs_60) and val_id in (0, 1, 2),
# count the frequency of each code
longDat[,
  unlist(
    lapply(c("A.Ag", "B.Ag", "C.Ag"),
      function(x) setNames(aggregate(get(x), list(get(x)), length), c("Code", x))
    ),
    recursive = FALSE),
  by = c("rs_id", "val_id")]
Is this what you are looking for?
Data:
library(data.table)
dat <- fread("rs_1,rs_2,rs_3,rs_4,rs_60,A.Ag,B.Ag,C.Ag
0,0,1,2,1,02:/01,02:/07,03:07/04:01
1,2,1,2,0,02:/01,02:/07,03:07/04:01
2,1,1,2,2,02:/01,02:/07,03:07/04:01
0,0,1,0,2,02:/01,02:/07,03:07/04:01")
Edit: the OP requested retrieving the top 3 for each rs_id, val_id, and *.Ag.
It is probably more readable to do it one *.Ag at a time (count, then take the top 3) and finally merge all the results, as follows:
library(data.table)
# convert into a long format
longDat <- melt(dat, measure.vars = patterns("^rs"), variable.name = "rs_id",
                value.name = "val_id")
ids <- c("rs_id", "val_id")
Reduce(function(dt1, dt2) merge(dt1, dt2, by = ids, all = TRUE),
  lapply(c("A.Ag", "B.Ag", "C.Ag"), function(x) {
    res <- longDat[, list(.N), by = c(ids, x)][order(-N)]
    setnames(res[, head(.SD, 3L), by = ids], c(x, "N"), c(paste0(x, "_Code"), x))
  }))
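For comparison, a rough tidyverse sketch of the same counting, assuming the same dat (it returns the top 3 in long form rather than merged wide):
library(dplyr)
library(tidyr)
dat %>%
  pivot_longer(starts_with("rs"), names_to = "rs_id", values_to = "val_id") %>%
  pivot_longer(ends_with(".Ag"), names_to = "locus", values_to = "Code") %>%
  count(rs_id, val_id, locus, Code) %>%
  group_by(rs_id, val_id, locus) %>%
  slice_max(n, n = 3, with_ties = FALSE) %>%
  ungroup()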

Splitting one Column to Multiple R and Giving logical value if true

I am trying to split one column in a data frame into multiple columns, which take the values from the original column as their new column names. Then, if the original entry contains a match for a given new column, that column should get a 1, and 0 if there is no match. I realize this is not the best way to explain, so, for example:
df <- data.frame(subject = c(1:4), Location = c('A', 'A/B', 'B/C/D', 'A/B/C/D'))
#   subject Location
# 1       1        A
# 2       2      A/B
# 3       3    B/C/D
# 4       4  A/B/C/D
and I would like to expand it to wide format with 1's and 0's (or TRUE and FALSE), something such as:
#   subject A B C D
# 1       1 1 0 0 0
# 2       2 1 1 0 0
# 3       3 0 1 1 1
# 4       4 1 1 1 1
I have looked into tidyr and the separate function, and reshape2 and the cast function, but I seem to be getting hung up on assigning the logical values. Any help on the issue would be greatly appreciated. Thank you.
You may try cSplit_e from the splitstackshape package:
library(splitstackshape)
cSplit_e(data = df, split.col = "Location", sep = "/",
         type = "character", drop = TRUE, fill = 0)
#   subject Location_A Location_B Location_C Location_D
# 1       1          1          0          0          0
# 2       2          1          1          0          0
# 3       3          0          1          1          1
# 4       4          1          1          1          1
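Since the question mentions tidyr, here is a rough equivalent sketch using separate_rows() plus pivot_wider() (tidyr >= 1.0 assumed):
library(dplyr)
library(tidyr)
df %>%
  separate_rows(Location, sep = "/") %>%   # one row per subject/letter pair
  mutate(flag = 1L) %>%                    # mark each occurrence
  pivot_wider(names_from = Location, values_from = flag, values_fill = 0L)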
You could take the following step-by-step approach.
## get the unique values after splitting
u <- unique(unlist(strsplit(as.character(df$Location), "/")))
## compare 'u' with 'Location'; each grepl() call returns one logical per row of 'df',
## so the template must have length nrow(df), not length(u) (equal only by coincidence here)
m <- vapply(u, grepl, logical(nrow(df)), x = df$Location)
## coerce to integer representation
m[] <- as.integer(m)
## bind 'm' to 'subject'
cbind(df["subject"], m)
#   subject A B C D
# 1       1 1 0 0 0
# 2       2 1 1 0 0
# 3       3 0 1 1 1
# 4       4 1 1 1 1

Reshaping from wide to long and vice versa (multistate/survival analysis dataset)

I am trying to reshape the following dataset with reshape(), without much success.
The starting dataset is in "wide" form, with each id described by one row. The dataset is intended to be used to carry out multistate analyses (a generalization of survival analysis).
Each person is recorded over a given overall time span. During this period the subject can experience a number of transitions among states (for simplicity, let us cap at two the number of distinct states that can be visited). The first visited state is s1 = 1, 2, 3, or 4. The person stays within that state for dur1 time periods, and the same applies to the second visited state s2 and its duration dur2:
id cohort s1 dur1 s2 dur2
 1      1  3    4  2    5
 2      0  1    4  4    3
The dataset in long format which I would like to obtain is:
id cohort s
1 1 3
1 1 3
1 1 3
1 1 3
1 1 2
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1
2 0 1
2 0 1
2 0 1
2 0 4
2 0 4
2 0 4
In practice, each id has dur1 + dur2 rows, and s1 and s2 are melted into a single variable s.
How would you do this transformation? Also, how would you come back to the original "wide" form of the dataset?
Many thanks!
dat <- cbind(id=c(1,2), cohort=c(1, 0), s1=c(3, 1), dur1=c(4, 4), s2=c(2, 4), dur2=c(5, 3))
You can use reshape() for the first step, but then you need to do some more work. Also, reshape() needs a data.frame() as its input, but your sample data is a matrix.
Here's how to proceed:
reshape() your data from wide to long:
dat2 <- reshape(data.frame(dat), direction = "long",
                idvar = c("id", "cohort"),
                varying = 3:ncol(dat), sep = "")
dat2
#       id cohort time s dur
# 1.1.1  1      1    1 3   4
# 2.0.1  2      0    1 1   4
# 1.1.2  1      1    2 2   5
# 2.0.2  2      0    2 4   3
"Expand" the resulting data.frame using rep()
dat3 <- dat2[rep(seq_len(nrow(dat2)), dat2$dur), c("id", "cohort", "s")]
dat3[order(dat3$id), ]
#         id cohort s
# 1.1.1    1      1 3
# 1.1.1.1  1      1 3
# 1.1.1.2  1      1 3
# 1.1.1.3  1      1 3
# 1.1.2    1      1 2
# 1.1.2.1  1      1 2
# 1.1.2.2  1      1 2
# 1.1.2.3  1      1 2
# 1.1.2.4  1      1 2
# 2.0.1    2      0 1
# 2.0.1.1  2      0 1
# 2.0.1.2  2      0 1
# 2.0.1.3  2      0 1
# 2.0.2    2      0 4
# 2.0.2.1  2      0 4
# 2.0.2.2  2      0 4
You can get rid of the funky row names too by using rownames(dat3) <- NULL.
Update: Retaining the ability to revert to the original form
In the example above, since we dropped the "time" and "dur" variables, it isn't possible to directly revert to the original dataset. If you feel this is something you would need to do, I suggest keeping those columns in and creating another data.frame with the subset of the columns that you need if required.
Here's how:
Use aggregate() to get back to "dat2":
aggregate(cbind(s, dur) ~ ., dat3, unique)
#   id cohort time s dur
# 1  2      0    1 1   4
# 2  1      1    1 3   4
# 3  2      0    2 4   3
# 4  1      1    2 2   5
Wrap reshape() around that to get back to "dat1". Here, in one step:
reshape(aggregate(cbind(s, dur) ~ ., dat3, unique),
        direction = "wide", idvar = c("id", "cohort"))
#   id cohort s.1 dur.1 s.2 dur.2
# 1  2      0   1     4   4     3
# 2  1      1   3     4   2     5
There are probably better ways, but this might work.
df <- read.table(text = '
id cohort s1 dur1 s2 dur2
1 1 3 4 2 5
2 0 1 4 4 3',
header = TRUE)

# 9 = the maximum number of observed periods (dur1 + dur2) in these data
hist <- matrix(0, nrow = 2, ncol = 9)

# fill each row with s1 repeated dur1 times, then s2 repeated dur2 times,
# padding any remainder with 0
for (i in 1:nrow(df)) {
  hist[i, ] <- c(rep(df[i, 3], df[i, 4]), rep(df[i, 5], df[i, 6]),
                 rep(0, 9 - df[i, 4] - df[i, 6]))
}

hist2 <- cbind(df[, 1:2], hist)
colnames(hist2) <- c('id', 'cohort', paste0('x', 1:9))

library(reshape2)
hist3 <- melt(hist2, id.vars = c('id', 'cohort'), variable.name = 'x',
              value.name = 'state')
hist4 <- hist3[order(hist3$id, hist3$cohort), ]
hist4 <- hist4[, !names(hist4) %in% c("x")]
hist4 <- hist4[hist4$state != 0, ]  # drop padding rows; 0 is never a real state
hist4
Gives:
   id cohort state
1   1      1     3
3   1      1     3
5   1      1     3
7   1      1     3
9   1      1     2
11  1      1     2
13  1      1     2
15  1      1     2
17  1      1     2
2   2      0     1
4   2      0     1
6   2      0     1
8   2      0     1
10  2      0     4
12  2      0     4
14  2      0     4
Of course, if you have more than two states per id then this would have to be modified (and it might have to be modified if you have more than two cohorts). For example, I suppose with 9 sample periods one person could be in the following sequence of states:
1 3 2 4 3 4 1 1 2
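As a package-free alternative sketch that skips the padding matrix entirely (assuming the same df as above), rep() can expand the states directly by their durations:
# each id contributes dur1 + dur2 rows; its states are s1 repeated dur1
# times followed by s2 repeated dur2 times
long <- data.frame(
  id     = rep(df$id,     df$dur1 + df$dur2),
  cohort = rep(df$cohort, df$dur1 + df$dur2),
  s      = unlist(Map(function(s1, d1, s2, d2) c(rep(s1, d1), rep(s2, d2)),
                      df$s1, df$dur1, df$s2, df$dur2))
)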
