I want to add a column new_col (it does not exist yet) that holds 1 or 0, such that the number of 1s within each unique hhid does not exceed that group's value in the column "nets". The desired result is shown in the table below:
hhid nets new_col
1 3 1
1 3 1
1 3 1
1 3 0
2 2 1
2 2 1
2 2 0
3 2 1
3 2 1
3 2 0
3 2 0
I tried code below
df %>% group_by(hhid) %>% mutate(new_col = ifelse(summarise(across(new_col), sum)<= df$nets),1,0)
Try this:
Data:
df <- structure(list(hhid = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L,
3L), nets = c(3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-11L))
hhid nets
1 1 3
2 1 3
3 1 3
4 1 3
5 2 2
6 2 2
7 2 2
8 3 2
9 3 2
10 3 2
11 3 2
Code:
library(dplyr)
df %>%
  group_by(hhid) %>%
  # flag the first `nets` rows of each household with 1, the rest with 0
  mutate(new_col = ifelse(row_number() <= nets, 1, 0))
Output:
# A tibble: 11 x 3
# Groups: hhid [3]
hhid nets new_col
<int> <int> <dbl>
1 1 3 1
2 1 3 1
3 1 3 1
4 1 3 0
5 2 2 1
6 2 2 1
7 2 2 0
8 3 2 1
9 3 2 1
10 3 2 0
11 3 2 0
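The trick is that the position of a row within its hhid group is compared with nets, so at most nets rows per group get a 1. A minimal base R sketch of the same idea, assuming the same df as above (the helper name pos is only for illustration):

```r
# Same data as in the answer above
df <- data.frame(
  hhid = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  nets = c(3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L)
)

# Position of each row within its hhid group (1, 2, 3, ... per group)
pos <- ave(seq_along(df$hhid), df$hhid, FUN = seq_along)

# 1 for the first `nets` rows of each group, 0 otherwise
df$new_col <- as.integer(pos <= df$nets)

# Number of 1s per household, to compare against nets
tapply(df$new_col, df$hhid, sum)
```

The final tapply call is just a check that the number of 1s per household stays within nets.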
Same solution but using data.table instead of dplyr
dt <- structure(list(hhid = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L,
3L), nets = c(3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L)), row.names = c(NA,
-11L), class = c("data.frame"))
library(data.table)
setDT(dt)
dt[, new_col := +(seq_len(.N) <= nets), by = hhid]
dt
hhid nets new_col
1: 1 3 1
2: 1 3 1
3: 1 3 1
4: 1 3 0
5: 2 2 1
6: 2 2 1
7: 2 2 0
8: 3 2 1
9: 3 2 1
10: 3 2 0
11: 3 2 0
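In the data.table line, the leading + is a unary plus that turns the logical result of the comparison into integers. A tiny illustration (the vector flags is made up for this example):

```r
# A logical vector like the one produced by seq_len(.N) <= nets
flags <- c(TRUE, TRUE, FALSE)

# Unary plus coerces TRUE/FALSE to 1/0
as_int <- +flags

# as.integer() is the more explicit spelling of the same conversion
same <- as.integer(flags)
```

Both forms give the same integer vector; as.integer() may read more clearly to people unfamiliar with the shorthand.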
I have a data frame like this:
no. id age var1 var2 var3 var4 var5
1 580 51 1 2 3 3 1
2 1830 24 2 1 3 8 5
3 4550 71 0 3 2 2 1
4 2760 43 4 5 8 3 2
5 3761 15 3 1 0 2 7
6 4410 72 1 2 2 1 6
7 4580 22 2 1 2 3 4
Following is a syntax:
dt <- structure(
list(
no. = 1:7,
id = c(580L, 1830L, 4550L, 2760L,
3761L, 4410L, 4580L),
age = c(51L, 24L, 71L, 43L, 15L, 72L, 22L),
var1 = c(1L, 2L, 0L, 4L, 3L, 1L, 2L),
var2 = c(2L, 1L, 3L,
5L, 1L, 2L, 1L),
var3 = c(3L, 3L, 2L, 8L, 0L, 2L, 2L),
var4 = c(3L,
8L, 2L, 3L, 2L, 1L, 3L),
var5 = c(1L, 5L, 1L, 2L, 7L, 6L, 4L)
),
class = "data.frame",
row.names = c(NA,-7L)
)
However, I would like to create a new data frame based on the data above. The observations should come from the permutations of every two rows, so that each original row is paired with every other row. In the new data frame, the total number of observations is 7P2 = 7! / (7-2)! = 7*6 = 42.
That is, data frame that I want to have is like this:
dyad no. id age var1 var2 var3 var4 var5
1 1 580 51 1 2 3 3 1
1 2 1830 24 2 1 3 8 5
2 1 580 51 1 2 3 3 1
2 3 4550 71 0 3 2 2 1
3 1 580 51 1 2 3 3 1
3 4 2760 43 4 5 8 3 2
4 1 580 51 1 2 3 3 1
4 5 3761 15 3 1 0 2 7
5 1 580 51 1 2 3 3 1
5 6 4410 72 1 2 2 1 6
6 1 580 51 1 2 3 3 1
6 7 4580 22 2 1 2 3 4
. .
. .
2 1830 24 2 1 3 8 5
1 580 51 1 2 3 3 1
2 1830 24 2 1 3 8 5
3 4550 71 0 3 2 2 1
. .
. .
7 4580 22 2 1 2 3 4
5 3761 15 3 1 0 2 7
7 4580 22 2 1 2 3 4
6 4410 72 1 2 2 1 6
I hope to get a good answer to this problem.
Best regards,
Leroy
Using gtools::permutations to permute your id column (or gtools::combinations if order doesn't matter) and tidyverse to pivot and join:
library(gtools)
library(tidyverse)
gtools::permutations(nrow(df), r = 2, v = df$id) %>%
data.frame() %>%
tibble::rownames_to_column("dyad") %>%
dplyr::mutate(dyad = as.integer(dyad)) %>%
tidyr::pivot_longer(starts_with("X"),
values_to = "id") %>%
dplyr::select(-name) %>%
dplyr::left_join(df,
by = "id") %>%
dplyr::arrange(dyad)
Note: if column order is important then you can reorder the columns with dplyr >= 1.0.0 by adding a pipe to dplyr::relocate(id, .after = `no.`)
Output
dyad id no. age var1 var2 var3 var4 var5
<int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 580 1 51 1 2 3 3 1
2 1 1830 2 24 2 1 3 8 5
3 2 580 1 51 1 2 3 3 1
4 2 2760 4 43 4 5 8 3 2
5 3 580 1 51 1 2 3 3 1
6 3 3761 5 15 3 1 0 2 7
7 4 580 1 51 1 2 3 3 1
8 4 4410 6 72 1 2 2 1 6
9 5 580 1 51 1 2 3 3 1
10 5 4550 3 71 0 3 2 2 1
# ... with 74 more rows
Data
df <- structure(list(no. = 1:7, id = c(580L, 1830L, 4550L, 2760L, 3761L,
4410L, 4580L), age = c(51L, 24L, 71L, 43L, 15L, 72L, 22L), var1 = c(1L,
2L, 0L, 4L, 3L, 1L, 2L), var2 = c(2L, 1L, 3L, 5L, 1L, 2L, 1L),
var3 = c(3L, 3L, 2L, 8L, 0L, 2L, 2L), var4 = c(3L, 8L, 2L,
3L, 2L, 1L, 3L), var5 = c(1L, 5L, 1L, 2L, 7L, 6L, 4L)), class = "data.frame", row.names = c(NA,
-7L))
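If order doesn't matter, the unordered pairs can also be sketched in base R with combn instead of gtools::combinations. This is only an illustration using the id values from the data above (the names ids and pairs are made up here):

```r
# The id column from the example data
ids <- c(580L, 1830L, 4550L, 2760L, 3761L, 4410L, 4580L)

# combn() returns one pair per column; transpose to get one pair per row
pairs <- t(combn(ids, 2))

# Number of unordered pairs: choose(7, 2)
nrow(pairs)
```

With 7 ids there are choose(7, 2) unordered pairs, half the number of ordered permutations.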
Since each combination consists of two rows, your result should have 84 observations (42 pairs of 2 rows each).
Assuming that the column no is 1:NROW(df), you can do the following:
df <- data.frame(no=1:7,id=(1:7)*100,age=21:27,var1=11:17,var2=31:37) #sample data
#create all combinations
combinations <- do.call("rbind",lapply(df$no, function(i) {
matrix(c(rep(i,length(df$no)-1),setdiff(df$no,i)), ncol = 2)
}))
#choose the rows for every combination
res <- apply(combinations,1,function(startend) {df[startend,]})
#bind everything together
res <- do.call("rbind",res)
#add the dyad counting column in front
res <- cbind(data.frame(dyad = rep(1:NROW(combinations),each=2)),res)
rownames(res) <- NULL
Update: The combinations can be calculated faster using
combinations <- matrix(
c(rep(df$no,each=length(df$no)-1),
unlist(lapply(df$no, function(i) df$no[-i]))),
ncol = 2
)
On my machine it's around a 5x difference.
Second update:
You don't even need the apply function. You can make use of a nice indexing feature of data frames in R. Instead of
res <- apply(combinations,1,function(startend) {
df[startend,]
})
res <- do.call("rbind",res)
you could simply do
res <- df[as.vector(t(combinations)),]
and then go on with cbind.
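Putting the pieces together, here is a self-contained sketch of the whole approach. It builds the ordered pairs with expand.grid, which is a swapped-in alternative to the lapply/setdiff construction above, and uses made-up sample data; the names pairs and idx are only for illustration.

```r
# Sample data; `no` is assumed to be 1:nrow(df)
df <- data.frame(no = 1:7, id = (1:7) * 100, age = 21:27)
n <- nrow(df)

# All ordered pairs (first, second) of row numbers, excluding self-pairs.
# `second` is listed first so that it varies fastest.
pairs <- expand.grid(second = seq_len(n), first = seq_len(n))
pairs <- pairs[pairs$first != pairs$second, c("first", "second")]

# Interleave the pairs: first1, second1, first2, second2, ...
idx <- as.vector(t(as.matrix(pairs)))

# Pick the rows by index and add the dyad counter in front
res <- cbind(dyad = rep(seq_len(nrow(pairs)), each = 2), df[idx, ])
rownames(res) <- NULL

nrow(pairs)  # number of dyads
nrow(res)    # two rows per dyad
```

The number of dyads equals 7P2, and the result has twice as many rows because each dyad contributes two.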
I'm trying to wrap my head around this data wrangling problem. My conjoint study output df looks similar to this:
id set_number card_number att1 att2 att3 att4 score
1 932 1 1 1 1 1 3 0
2 932 1 2 2 2 4 4 100
3 932 1 3 8 8 8 8 0
4 932 2 1 3 3 3 1 0
5 932 2 2 4 2 2 4 0
6 932 2 3 8 8 8 8 100
7 933 1 1 1 1 1 3 0
8 933 1 2 2 2 4 4 100
9 933 1 3 8 8 8 8 0
...
Where id refers to a person and score is a dependent variable. I need to reformat the df in order to run an analysis using ChoiceModelR package.
I am trying to figure out how to write code (I am guessing using group_by(id and card_number) and case_when/if else statements) that, for each set_number, puts the card_number of the card whose score is 100 into the top row of that set. However, the value needs to be "card_number + 1" if att1 to att4 are all 8s for that card.
The desired df needs to look like so:
id set_number card_number att1 att2 att3 att4 score
1 932 1 1 1 1 1 3 2
2 932 1 2 2 2 4 4 0
3 932 1 3 8 8 8 8 0
4 932 2 1 3 3 3 1 4
5 932 2 2 4 2 2 4 0
6 932 2 3 8 8 8 8 0
7 933 2 1 3 3 3 1 2
8 933 2 2 4 2 2 4 0
9 933 2 3 8 8 8 8 0
...
I would really appreciate any help.
My complete dataset in csv. format is here
Dput output
structure(list(id = c(932L, 932L, 932L, 932L, 932L, 932L, 932L,
932L, 932L, 932L), set_number = c(1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L, 3L, 4L), card_number = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L, 1L), att1 = c(1L, 2L, 8L, 3L, 4L, 8L, 5L, 6L, 8L, 3L), att2 = c(1L,
2L, 8L, 3L, 2L, 8L, 4L, 3L, 8L, 1L), att3 = c(1L, 4L, 8L, 3L,
2L, 8L, 1L, 3L, 8L, 2L), att4 = c(3L, 4L, 8L, 1L, 4L, 8L, 3L,
2L, 8L, 2L), score = c(0L, 100L, 0L, 0L, 100L, 0L, 0L, 100L,
0L, 0L)), class = "data.frame", row.names = c(NA, -10L))
This is probably not the most efficient way of solving this, but here it goes (I would also welcome any other way of achieving the same thing):
df$dv = 0
for (i in seq(1, nrow(df),by = 3)){
if(df$score[i] == 100)
{df$dv[i] = 1}
if(df$score[i+1] == 100)
{df$dv[i] = 2}
if(df$score[i+2] == 100)
{df$dv[i] = 4}
}
dv is a new column that stores the updated scores. I then just removed the score column with the subset function.
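A loop-free variant is also possible. The following is a minimal base R sketch, assuming the first six rows of the example in the question; the helper names all8, chosen, choice and grp are made up for illustration.

```r
# First two choice sets from the question
df <- data.frame(
  id          = 932L,
  set_number  = rep(1:2, each = 3),
  card_number = rep(1:3, times = 2),
  att1  = c(1L, 2L, 8L, 3L, 4L, 8L),
  att2  = c(1L, 2L, 8L, 3L, 2L, 8L),
  att3  = c(1L, 4L, 8L, 3L, 2L, 8L),
  att4  = c(3L, 4L, 8L, 1L, 4L, 8L),
  score = c(0L, 100L, 0L, 0L, 0L, 100L)
)

# TRUE where att1..att4 are all 8
all8 <- rowSums(df[paste0("att", 1:4)] == 8) == 4

# Card number of the chosen card, plus 1 if that card is all 8s; 0 elsewhere
chosen <- df$score == 100
choice <- ifelse(chosen, df$card_number + all8, 0L)

# Move the value to the first row of each (id, set_number) group
grp <- interaction(df$id, df$set_number, drop = TRUE)
df$dv <- ave(choice, grp, FUN = function(x) c(max(x), rep(0L, length(x) - 1)))

df$dv
```

Unlike the loop, this does not assume exactly three cards per set, only that rows of the same set share id and set_number.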
A solution based on the tidyverse can look as follows.
library(dplyr)
library(purrr)
as_tibble(df) %>%
group_by(id, set_number) %>%
mutate(scoreX = card_number[which(score == 100)][1],
scoreX = pmap_dbl(list(att1, att2, att3, att4, score, scoreX),
~ if_else(sum(..1, ..2, ..3, ..4) == 32 & ..5 == 100,
..6 + 1, as.double(..6))),
scoreX = max(scoreX),
scoreX = if_else(row_number() == 1, scoreX, 0))
# id set_number card_number att1 att2 att3 att4 score scoreX
# <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
# 1 932 1 1 1 1 1 3 0 2
# 2 932 1 2 2 2 4 4 100 0
# 3 932 1 3 8 8 8 8 0 0
# 4 932 2 1 3 3 3 1 0 2
# 5 932 2 2 4 2 2 4 100 0
# 6 932 2 3 8 8 8 8 0 0
# 7 932 3 1 5 4 1 3 0 2
# 8 932 3 2 6 3 3 2 100 0
# 9 932 3 3 8 8 8 8 0 0
# 10 932 4 1 3 1 2 2 0 NA
I'm trying to set up my dataset for event-history analysis, and for this I need to define a new column. My dataset is of the following form:
ID Var1
1 10
1 20
1 30
1 10
2 4
2 5
2 10
2 5
3 1
3 15
3 20
3 9
4 18
4 32
4 NA
4 12
5 2
5 NA
5 8
5 3
And I want to get to the following form:
ID Var1 Var2
1 10 0
1 20 0
1 30 1
1 10 0
2 4 0
2 5 0
2 10 0
2 5 0
3 1 0
3 15 0
3 20 1
3 9 0
4 18 0
4 32 NA
4 NA 1
4 12 0
5 2 NA
5 NA 0
5 8 1
5 3 0
In words: I want the new variable to indicate whether the value of Var1 drops below 50% of the maximum value that Var1 reaches within that group. Whether the last value of each group is NA or 0 is not really important, although NA would make more sense from a theoretical perspective.
I've tried using something like
DF$Var2 <- df %>%
group_by(ID) %>%
ifelse(df == ave(df$Var1,df$ID, FUN = max), 0,1)
to then lag it by 1, but it returns an error on an unused argument 1 in ifelse.
Thanks for your solutions!
Here is a base R option via ave + cummax
within(df,Var2 <- ave(Var1,ID,FUN = function(x) c((x<max(x)/2 & cummax(x)==max(x))[-1],0)))
which gives
> within(df,Var2 <- ave(Var1,ID,FUN = function(x) c((x<max(x)/2 & cummax(x)==max(x))[-1],0)))
ID Var1 Var2
1 1 10 0
2 1 20 0
3 1 30 1
4 1 10 0
5 2 4 0
6 2 5 0
7 2 10 0
8 2 5 0
9 3 1 0
10 3 15 0
11 3 20 1
12 3 9 0
Data
> dput(df)
structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L), Var1 = c(10L, 20L, 30L, 10L, 4L, 5L, 10L, 5L, 1L, 15L,
20L, 9L)), class = "data.frame", row.names = c(NA, -12L))
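To see what the one-liner does, it can help to split it into steps for a single group. A small sketch using the Var1 values of ID 1 (the intermediate names are made up for illustration):

```r
# Var1 values for ID 1
x <- c(10, 20, 30, 10)

peak <- max(x)

# TRUE from the row where the group maximum has been reached onwards
after_peak <- cummax(x) == peak

# TRUE where the value is below half of the group maximum
below_half <- x < peak / 2

# A drop counts only once the peak has been reached
drop_now <- below_half & after_peak

# Shift by one row so the flag sits on the row *before* the drop;
# the last row of the group gets 0
var2 <- c(drop_now[-1], 0)
var2
```

Requiring cummax(x) == max(x) keeps early low values (such as the first 10 here) from being flagged before the peak is reached.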
Edit (for updated post)
f <- function(v) {
u1 <- c(replace(v,!is.na(v),0),0)[-1]
v[is.na(v)] <- v[which(is.na(v))-1]
u2 <- c((v<max(v)/2 & cummax(v)==max(v))[-1],0)
u1+u2
}
within(df,Var2 <- ave(Var1,ID,FUN = f))
such that
> within(df,Var2 <- ave(Var1,ID,FUN = f))
ID Var1 Var2
1 1 10 0
2 1 20 0
3 1 30 1
4 1 10 0
5 2 4 0
6 2 5 0
7 2 10 0
8 2 5 0
9 3 1 0
10 3 15 0
11 3 20 1
12 3 9 0
13 4 18 0
14 4 32 NA
15 4 NA 1
16 4 12 0
17 5 2 NA
18 5 NA 0
19 5 8 1
20 5 3 0
Data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L), Var1 = c(10L, 20L, 30L,
10L, 4L, 5L, 10L, 5L, 1L, 15L, 20L, 9L, 18L, 32L, NA, 12L, 2L,
NA, 8L, 3L)), class = "data.frame", row.names = c(NA, -20L))
I am working on conjoint analysis and trying to create a choice-task data frame. So far, I have created an orthogonal data frame using caEncodedDesign() from the conjoint package. I am struggling to find a way to add two additional rows under each row of the design2 data frame.
All the values in the first added row should be the original value + 1, and all the values in the second added row should be the original value + 2. When a value would reach 4, it has to wrap around to 1.
This is the original design2 data frame:
> design2
price color privacy battery stars
17 2 3 2 1 1
21 3 1 3 1 1
34 1 3 1 2 1
60 3 2 1 3 1
64 1 1 2 3 1
82 1 1 1 1 2
131 2 2 3 2 2
153 3 3 2 3 2
171 3 3 1 1 3
175 1 2 2 1 3
201 3 1 2 2 3
218 2 1 1 3 3
241 1 3 3 3 3
I did the first row by hand, and I am looking for R code that does the same for all the rows below.
>design2
price color privacy battery stars
17 2 3 2 1 1
3 1 3 2 2
1 2 1 3 3
21 3 1 3 1 1
34 1 3 1 2 1
60 3 2 1 3 1
64 1 1 2 3 1
82 1 1 1 1 2
131 2 2 3 2 2
153 3 3 2 3 2
171 3 3 1 1 3
175 1 2 2 1 3
201 3 1 2 2 3
218 2 1 1 3 3
241 1 3 3 3 3
Here's an attempt, based on duplicating rows, adding 0:2 to each column, and then replacing anything >= 4 by subtracting 3
design2 <- design2[rep(seq_len(nrow(design2)), each=3),]
design2 <- design2 + 0:2
sel <- design2 >= 4
design2[sel] <- (design2 - 3)[sel]
design2
# price color privacy battery stars
#17 2 3 2 1 1
#17.1 3 1 3 2 2
#17.2 1 2 1 3 3
#21 3 1 3 1 1
#21.1 1 2 1 2 2
#21.2 2 3 2 3 3
#34 1 3 1 2 1
#34.1 2 1 2 3 2
#34.2 3 2 3 1 3
# ..
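The wrap-around can also be written with modular arithmetic instead of a separate "subtract 3" step. A minimal sketch, assuming levels 1 to 3 in every column and using only the first three rows of design2 (the names expanded and shift are made up for illustration):

```r
# First three rows of design2
design2 <- data.frame(
  price   = c(2L, 3L, 1L),
  color   = c(3L, 1L, 3L),
  privacy = c(2L, 3L, 1L),
  battery = c(1L, 1L, 2L),
  stars   = c(1L, 1L, 1L),
  row.names = c("17", "21", "34")
)

# Repeat every row three times
expanded <- design2[rep(seq_len(nrow(design2)), each = 3), ]

# Shift of 0, 1, 2 for the three copies of each original row
shift <- rep(0:2, times = nrow(design2))

# Map levels to 0..2, add the shift, wrap with %% 3, map back to 1..3
expanded[] <- lapply(expanded, function(col) (col + shift - 1L) %% 3L + 1L)

expanded
```

Changing the 3L to another number of levels would adapt the sketch to attributes with more levels, as long as all columns share the same number of levels.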
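We can use apply row-wise and, for every value in the row, append the missing values using setdiff. Note that setdiff lists the remaining values in ascending order rather than cycling through +1 and +2, so the added rows can differ from the wrap-around result: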
out_df <- do.call(rbind, apply(design2, 1, function(x)
data.frame(sapply(x, function(y) c(y, setdiff(1:3, y))))))
rownames(out_df) <- NULL
out_df
# price color privacy battery stars
#1 2 3 2 1 1
#2 1 1 1 2 2
#3 3 2 3 3 3
#4 3 1 3 1 1
#5 1 2 1 2 2
#6 2 3 2 3 3
#7 1 3 1 2 1
#8 2 1 2 1 2
#9 3 2 3 3 3
#.....
data
design2 <- structure(list(price = c(2L, 3L, 1L, 3L, 1L, 1L, 2L, 3L, 3L,
1L, 3L, 2L, 1L), color = c(3L, 1L, 3L, 2L, 1L, 1L, 2L, 3L, 3L,
2L, 1L, 1L, 3L), privacy = c(2L, 3L, 1L, 1L, 2L, 1L, 3L, 2L,
1L, 2L, 2L, 1L, 3L), battery = c(1L, 1L, 2L, 3L, 3L, 1L, 2L,
3L, 1L, 1L, 2L, 3L, 3L), stars = c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L)), class = "data.frame", row.names = c("17",
"21", "34", "60", "64", "82", "131", "153", "171", "175", "201", "218", "241"))