Creating new columns from a data.frame - r

I have a dataset in long format in which measurements (Time) are nested in network partners (NP), which are nested in persons (ID). Here is an example of what it looks like (the real dataset has thousands of rows):
ID NP Time Outcome
1 11 1 4
1 11 2 3
1 11 3 NA
1 12 1 2
1 12 2 3
1 12 3 3
2 21 1 2
2 21 2 NA
2 21 3 NA
2 22 1 4
2 22 2 4
2 22 3 4
Now I would like to create 3 new variables:
a) the number of network partners (with a non-NA outcome at that measurement) a specific person (ID) has at Time 1
b) the number of network partners (with a non-NA outcome at that measurement) a specific person (ID) has at Time 2
c) the number of network partners (with a non-NA outcome at that measurement) a specific person (ID) has at Time 3
So I would like to create a dataset like this:
ID NP Time Outcome NP.T1 NP.T2 NP.T3
1 11 1 4 2 2 1
1 11 2 3 2 2 1
1 11 3 NA 2 2 1
1 12 1 2 2 2 1
1 12 2 3 2 2 1
1 12 3 3 2 2 1
2 21 1 2 2 1 1
2 21 2 NA 2 1 1
2 21 3 NA 2 1 1
2 22 1 4 2 1 1
2 22 2 4 2 1 1
2 22 3 4 2 1 1
I would really appreciate your help.

You can just create one variable rather than three. I am using ddply from the plyr package for that.
mydata<-structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), NP = c(11L, 11L, 11L, 12L, 12L, 12L, 21L, 21L, 21L,
22L, 22L, 22L), Time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L,
1L, 2L, 3L), Outcome = c(4L, 3L, NA, 2L, 3L, 3L, 2L, NA, NA,
4L, 4L, 4L)), .Names = c("ID", "NP", "Time", "Outcome"), class = "data.frame", row.names = c(NA,
-12L))
library(plyr)
mydata1 <- ddply(mydata, .(ID, Time), transform, NP.T = sum(!is.na(Outcome)))
>mydata1
ID NP Time Outcome NP.T
1 1 11 1 4 2
2 1 12 1 2 2
3 1 11 2 3 2
4 1 12 2 3 2
5 1 11 3 NA 1
6 1 12 3 3 1
7 2 21 1 2 2
8 2 22 1 4 2
9 2 21 2 NA 1
10 2 22 2 4 1
11 2 21 3 NA 1
12 2 22 3 4 1
Updated: you can also use interaction to create a single variable (comb) that combines ID and Time:
mydata1 <- ddply(mydata, .(ID, Time), transform, NP.T = sum(!is.na(Outcome)), comb = interaction(ID, Time))
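If you do want the three wide columns from the question (NP.T1, NP.T2, NP.T3) on every row, a dplyr sketch along these lines should also work (assuming the same mydata as above):
library(dplyr)
mydata %>%
  group_by(ID) %>%
  mutate(NP.T1 = sum(!is.na(Outcome[Time == 1])),   # partners with a non-NA outcome at Time 1
         NP.T2 = sum(!is.na(Outcome[Time == 2])),
         NP.T3 = sum(!is.na(Outcome[Time == 3]))) %>%
  ungroup()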

Related

Pasting values from a vector to a new column in a for loop with nested data

I have a dataframe that currently looks like this:
subjectID Trial
1 3
1 3
1 3
1 4
1 4
1 5
1 5
1 5
2 1
2 1
2 3
2 3
2 3
2 5
2 5
2 6
3 1
Etc., where trial number is nested under subject ID. I need to make a new column, "NewTrial", that simply reflects the order in which the trials appear for each subject. For example:
subjectID Trial NewTrial
1 3 1
1 3 1
1 3 1
1 4 2
1 4 2
1 5 3
1 5 3
1 5 3
2 1 1
2 1 1
2 3 2
2 3 2
2 3 2
2 5 3
2 5 3
2 6 4
3 1 1
So far, I have a for-loop written that looks like this:
for (myperson in unique(data$subjectID)){
  # a vector 1:n, where n is the number of unique trials for this subject (for subject 1: c(1, 2, 3))
  triallength = 1:length(unique(data$Trial[data$subjectID == myperson]))
I'm having trouble now finding a way to paste the numbers from the created triallength vector as a column in the dataframe. Does anyone know of a way to accomplish this? I am lacking some experience with for-loops and hoping to gain more. If anyone has a tidyverse/dplyr solution, however, I am open to that as well as an alternative to a for-loop. Thanks in advance, and let me know if any clarification is needed!
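For reference, one way to finish the loop sketch above is to pre-allocate the new column and fill it per subject with match() (a minimal sketch; the answers below are more concise):
data$NewTrial <- NA
for (myperson in unique(data$subjectID)) {
  rows <- data$subjectID == myperson
  # position of each trial within the unique trials of this subject
  data$NewTrial[rows] <- match(data$Trial[rows], unique(data$Trial[rows]))
}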
Converting to a factor with the unique values as levels and then applying as.numeric, all inside ave, works nicely:
transform(dat, NewTrial=ave(Trial, subjectID, FUN=\(x) as.numeric(factor(x, levels=unique(x)))))
# subjectID Trial NewTrial
# 1 1 3 1
# 2 1 3 1
# 3 1 3 1
# 4 1 4 2
# 5 1 4 2
# 6 1 5 3
# 7 1 5 3
# 8 1 5 3
# 9 2 1 1
# 10 2 1 1
# 11 2 3 2
# 12 2 3 2
# 13 2 3 2
# 14 2 5 3
# 15 2 5 3
# 16 2 6 4
# 17 3 1 1
Data:
dat <- structure(list(subjectID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L), Trial = c(3L, 3L, 3L, 4L,
4L, 5L, 5L, 5L, 1L, 1L, 3L, 3L, 3L, 5L, 5L, 6L, 1L)), class = "data.frame", row.names = c(NA,
-17L))
We could use match on the unique values after grouping by 'subjectID'
library(dplyr)
df1 <- df1 %>%
  group_by(subjectID) %>%
  mutate(NewTrial = match(Trial, unique(Trial))) %>%
  ungroup
We could use rleid:
library(dplyr)
library(data.table)
df %>%
  group_by(subjectID) %>%
  mutate(NewTrial = rleid(subjectID, Trial))
subjectID Trial NewTrial
<int> <int> <int>
1 1 3 1
2 1 3 1
3 1 3 1
4 1 4 2
5 1 4 2
6 1 5 3
7 1 5 3
8 1 5 3
9 2 1 1
10 2 1 1
11 2 3 2
12 2 3 2
13 2 3 2
14 2 5 3
15 2 5 3
16 2 6 4
17 3 1 1
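The same idea also works in plain data.table, without dplyr (a sketch, assuming the df from above):
library(data.table)
setDT(df)[, NewTrial := rleid(Trial), by = subjectID]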

Is it possible to expand and restructure a data frame so that every two rows form a pairwise combination in R?

I have a data frame like this:
no. id age var1 var2 var3 var4 var5
1 580 51 1 2 3 3 1
2 1830 24 2 1 3 8 5
3 4550 71 0 3 2 2 1
4 2760 43 4 5 8 3 2
5 3761 15 3 1 0 2 7
6 4410 72 1 2 2 1 6
7 4580 22 2 1 2 3 4
Here is the code to reproduce the data:
dt <- structure(
list(
no. = 1:7,
id = c(580L, 1830L, 4550L, 2760L,
3761L, 4410L, 4580L),
age = c(51L, 24L, 71L, 43L, 15L, 72L, 22L),
var1 = c(1L, 2L, 0L, 4L, 3L, 1L, 2L),
var2 = c(2L, 1L, 3L,
5L, 1L, 2L, 1L),
var3 = c(3L, 3L, 2L, 8L, 0L, 2L, 2L),
var4 = c(3L,
8L, 2L, 3L, 2L, 1L, 3L),
var5 = c(1L, 5L, 1L, 2L, 7L, 6L, 4L)
),
class = "data.frame",
row.names = c(NA,-7L)
)
However, I would like to create a new data frame based on the above data. The observations should come from permutations of every two rows, i.e. each original row is paired with every other row. In the new data frame, the total number of pairs (dyads) is 7P2 = 7! / (7-2)! = 7*6 = 42.
That is, the data frame I would like to have looks like this:
dyad no. id age var1 var2 var3 var4 var5
1 1 580 51 1 2 3 3 1
1 2 1830 24 2 1 3 8 5
2 1 580 51 1 2 3 3 1
2 3 4550 71 0 3 2 2 1
3 1 580 51 1 2 3 3 1
3 4 2760 43 4 5 8 3 2
4 1 580 51 1 2 3 3 1
4 5 3761 15 3 1 0 2 7
5 1 580 51 1 2 3 3 1
5 6 4410 72 1 2 2 1 6
6 1 580 51 1 2 3 3 1
6 7 4580 22 2 1 2 3 4
. .
. .
2 1830 24 2 1 3 8 5
1 580 51 1 2 3 3 1
2 1830 24 2 1 3 8 5
3 4550 71 0 3 2 2 1
. .
. .
7 4580 22 2 1 2 3 4
5 3761 15 3 1 0 2 7
7 4580 22 2 1 2 3 4
6 4410 72 1 2 2 1 6
I hope to get a great answer for this problem.
Best regards,
Leroy
Using gtools::permutations to permute your id column (or gtools::combinations if order doesn't matter) and tidyverse to pivot and join:
library(gtools)
library(tidyverse)
gtools::permutations(nrow(df), r = 2, v = df$id) %>%
  data.frame() %>%
  tibble::rownames_to_column("dyad") %>%
  dplyr::mutate(dyad = as.integer(dyad)) %>%
  tidyr::pivot_longer(starts_with("X"), values_to = "id") %>%
  dplyr::select(-name) %>%
  dplyr::left_join(df, by = "id") %>%
  dplyr::arrange(dyad)
Note: if column order is important then you can reorder the columns with dplyr >= 1.0.0 by adding a pipe to dplyr::relocate(id, .after = `no.`)
Output
dyad id no. age var1 var2 var3 var4 var5
<int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 580 1 51 1 2 3 3 1
2 1 1830 2 24 2 1 3 8 5
3 2 580 1 51 1 2 3 3 1
4 2 2760 4 43 4 5 8 3 2
5 3 580 1 51 1 2 3 3 1
6 3 3761 5 15 3 1 0 2 7
7 4 580 1 51 1 2 3 3 1
8 4 4410 6 72 1 2 2 1 6
9 5 580 1 51 1 2 3 3 1
10 5 4550 3 71 0 3 2 2 1
# ... with 74 more rows
Data
df <- structure(list(no. = 1:7, id = c(580L, 1830L, 4550L, 2760L, 3761L,
4410L, 4580L), age = c(51L, 24L, 71L, 43L, 15L, 72L, 22L), var1 = c(1L,
2L, 0L, 4L, 3L, 1L, 2L), var2 = c(2L, 1L, 3L, 5L, 1L, 2L, 1L),
var3 = c(3L, 3L, 2L, 8L, 0L, 2L, 2L), var4 = c(3L, 8L, 2L,
3L, 2L, 1L, 3L), var5 = c(1L, 5L, 1L, 2L, 7L, 6L, 4L)), class = "data.frame", row.names = c(NA,
-7L))
Since each combination selects two rows, your result should have 84 observations.
Assuming that the column no is 1:NROW(df), you can do the following:
df <- data.frame(no=1:7,id=(1:7)*100,age=21:27,var1=11:17,var2=31:37) #sample data
#create all combinations
combinations <- do.call("rbind", lapply(df$no, function(i) {
  matrix(c(rep(i, length(df$no) - 1), setdiff(df$no, i)), ncol = 2)
}))
#choose the rows for every combination
res <- apply(combinations,1,function(startend) {df[startend,]})
#bind everything together
res <- do.call("rbind",res)
#add the dyad counting column in front
res <- cbind(data.frame(dyad = rep(1:NROW(combinations),each=2)),res)
rownames(res) <- NULL
Update: The combinations can be calculated faster using
combinations <- matrix(
  c(rep(df$no, each = length(df$no) - 1),
    unlist(lapply(df$no, function(i) df$no[-i]))),
  ncol = 2
)
On my machine it's around a 5x difference.
UpUpdate:
You don't even need the apply function. You can make use of a nice indexing feature of data frames in R. Instead of
res <- apply(combinations, 1, function(startend) {
  df[startend, ]
})
res <- do.call("rbind",res)
you could simply do
res <- df[as.vector(t(combinations)),]
and then go on with cbind.
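Putting the faster pieces together, the whole answer condenses to something like this (a sketch, assuming no is 1:NROW(df) as above):
# all ordered pairs of row indices (each row paired with every other row)
combinations <- matrix(
  c(rep(df$no, each = length(df$no) - 1),
    unlist(lapply(df$no, function(i) df$no[-i]))),
  ncol = 2
)
# pull both rows of every pair in one indexing step, then label the dyads
res <- df[as.vector(t(combinations)), ]
res <- cbind(data.frame(dyad = rep(seq_len(nrow(combinations)), each = 2)), res)
rownames(res) <- NULL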

Define new variable to take on 1 if next row of another variable fulfills condition

So I'm trying to set up my dataset for event-history analysis, and for this I need to define a new column. My dataset has the following form:
ID Var1
1 10
1 20
1 30
1 10
2 4
2 5
2 10
2 5
3 1
3 15
3 20
3 9
4 18
4 32
4 NA
4 12
5 2
5 NA
5 8
5 3
And I want to get to the following form:
ID Var1 Var2
1 10 0
1 20 0
1 30 1
1 10 0
2 4 0
2 5 0
2 10 0
2 5 0
3 1 0
3 15 0
3 20 1
3 9 0
4 18 0
4 32 NA
4 NA 1
4 12 0
5 2 NA
5 NA 0
5 8 1
5 3 0
So in words: I want the new variable to indicate whether, in the next row within the group, the value of Var1 drops below 50% of the maximum value Var1 reaches for that group. Whether the last value of each group is NA or 0 is not really important, although NA would make more sense from a theoretical perspective.
I've tried using something like
DF$Var2 <- df %>%
  group_by(ID) %>%
  ifelse(df == ave(df$Var1, df$ID, FUN = max), 0, 1)
to then lag it by 1, but it returns an error about an unused argument in ifelse.
Thanks for your solutions!
Here is a base R option via ave + cummax
within(df,Var2 <- ave(Var1,ID,FUN = function(x) c((x<max(x)/2 & cummax(x)==max(x))[-1],0)))
which gives
> within(df,Var2 <- ave(Var1,ID,FUN = function(x) c((x<max(x)/2 & cummax(x)==max(x))[-1],0)))
ID Var1 Var2
1 1 10 0
2 1 20 0
3 1 30 1
4 1 10 0
5 2 4 0
6 2 5 0
7 2 10 0
8 2 5 0
9 3 1 0
10 3 15 0
11 3 20 1
12 3 9 0
Data
> dput(df)
structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L), Var1 = c(10L, 20L, 30L, 10L, 4L, 5L, 10L, 5L, 1L, 15L,
20L, 9L)), class = "data.frame", row.names = c(NA, -12L))
Edit (for updated post)
f <- function(v) {
  # an NA in the next row propagates an NA into the result
  u1 <- c(replace(v, !is.na(v), 0), 0)[-1]
  # fill each NA with the preceding value before applying the drop rule
  v[is.na(v)] <- v[which(is.na(v)) - 1]
  u2 <- c((v < max(v) / 2 & cummax(v) == max(v))[-1], 0)
  u1 + u2
}
within(df,Var2 <- ave(Var1,ID,FUN = f))
such that
> within(df,Var2 <- ave(Var1,ID,FUN = f))
ID Var1 Var2
1 1 10 0
2 1 20 0
3 1 30 1
4 1 10 0
5 2 4 0
6 2 5 0
7 2 10 0
8 2 5 0
9 3 1 0
10 3 15 0
11 3 20 1
12 3 9 0
13 4 18 0
14 4 32 NA
15 4 NA 1
16 4 12 0
17 5 2 NA
18 5 NA 0
19 5 8 1
20 5 3 0
Data
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L), Var1 = c(10L, 20L, 30L,
10L, 4L, 5L, 10L, 5L, 1L, 15L, 20L, 9L, 18L, 32L, NA, 12L, 2L,
NA, 8L, 3L)), class = "data.frame", row.names = c(NA, -20L))
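For comparison, here is a dplyr sketch of the same logic (not part of the original answer; like the base version it fills a single NA with the preceding value and lets an NA in the next row produce an NA in Var2):
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(filled = ifelse(is.na(Var1), lag(Var1), Var1),    # carry a lone NA forward
         Var2 = lead(as.numeric(filled < max(filled) / 2 &
                                cummax(filled) == max(filled)), default = 0) +
                lead(ifelse(is.na(Var1), NA_real_, 0), default = 0)) %>%
  ungroup() %>%
  select(-filled)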

creating additional rows in R

I am working on a conjoint analysis and trying to create a choice-task data frame. So far, I have created an orthogonal design data frame using caEncodedDesign() from the conjoint package, and I am now struggling to find a way to add two additional rows under each row of the design2 data frame.
All values in the first added row should be the original value +1, and all values in the second added row the original value +2. Whenever a value would become 4, it has to wrap around to 1.
This is the original design2 data frame:
> design2
price color privacy battery stars
17 2 3 2 1 1
21 3 1 3 1 1
34 1 3 1 2 1
60 3 2 1 3 1
64 1 1 2 3 1
82 1 1 1 1 2
131 2 2 3 2 2
153 3 3 2 3 2
171 3 3 1 1 3
175 1 2 2 1 3
201 3 1 2 2 3
218 2 1 1 3 3
241 1 3 3 3 3
I did the first row by hand, and I am looking for R code that applies this to all the rows below.
>design2
price color privacy battery stars
17 2 3 2 1 1
3 1 3 2 2
1 2 1 3 3
21 3 1 3 1 1
34 1 3 1 2 1
60 3 2 1 3 1
64 1 1 2 3 1
82 1 1 1 1 2
131 2 2 3 2 2
153 3 3 2 3 2
171 3 3 1 1 3
175 1 2 2 1 3
201 3 1 2 2 3
218 2 1 1 3 3
241 1 3 3 3 3
Here's an attempt, based on duplicating rows, adding 0:2 to each column, and then replacing anything >= 4 by subtracting 3
design2 <- design2[rep(seq_len(nrow(design2)), each=3),]
design2 <- design2 + 0:2
sel <- design2 >= 4
design2[sel] <- (design2 - 3)[sel]
design2
# price color privacy battery stars
#17 2 3 2 1 1
#17.1 3 1 3 2 2
#17.2 1 2 1 3 3
#21 3 1 3 1 1
#21.1 1 2 1 2 2
#21.2 2 3 2 3 3
#34 1 3 1 2 1
#34.1 2 1 2 3 2
#34.2 3 2 3 1 3
# ..
We can use apply row-wise and, for every value in the row, append the other possible values using setdiff:
out_df <- do.call(rbind, apply(design2, 1, function(x)
  data.frame(sapply(x, function(y) c(y, setdiff(1:3, y))))))
rownames(out_df) <- NULL
out_df
# price color privacy battery stars
#1 2 3 2 1 1
#2 1 1 1 2 2
#3 3 2 3 3 3
#4 3 1 3 1 1
#5 1 2 1 2 2
#6 2 3 2 3 3
#7 1 3 1 2 1
#8 2 1 2 1 2
#9 3 2 3 3 3
#.....
data
design2 <- structure(list(price = c(2L, 3L, 1L, 3L, 1L, 1L, 2L, 3L, 3L,
1L, 3L, 2L, 1L), color = c(3L, 1L, 3L, 2L, 1L, 1L, 2L, 3L, 3L,
2L, 1L, 1L, 3L), privacy = c(2L, 3L, 1L, 1L, 2L, 1L, 3L, 2L,
1L, 2L, 2L, 1L, 3L), battery = c(1L, 1L, 2L, 3L, 3L, 1L, 2L,
3L, 1L, 1L, 2L, 3L, 3L), stars = c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L)), class = "data.frame", row.names = c("17",
"21", "34", "60", "64", "82", "131", "153", "171", "175", "201", "218", "241"))

in create a new variable with the max or min of another variable -- by group [duplicate]

This question already has answers here:
Adding a column of means by group to original data [duplicate]
(4 answers)
Closed 6 years ago.
R Community: I am trying to create a new variable based on the value of an existing variable, not on a row-wise basis but rather on a group-wise basis. I'm trying to create max.var and min.var below based on old.var without collapsing or aggregating the rows, that is, preserving all the id rows:
id old.var min.var max.var
1 1 1 3
1 2 1 3
1 3 1 3
2 5 5 11
2 7 5 11
2 9 5 11
2 11 5 11
3 3 3 4
3 4 3 4
structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L), old.var =
c(1L,
2L, 3L, 5L, 7L, 9L, 11L, 3L, 4L), min.var = c(1L, 1L, 1L, 5L,
5L, 5L, 5L, 3L, 3L), max.var = c(3L, 3L, 3L, 11L, 11L, 11L, 11L,
4L, 4L)), .Names = c("id", "old.var", "min.var", "max.var"), class = "data.frame", row.names = c(NA,
-9L))
I've tried using the aggregate and by functions, but they of course summarize the data. I haven't had much luck trying an Excel-like MATCH/INDEX approach either. Thanks in advance for your assistance!
You can use dplyr,
df %>%
  group_by(id) %>%
  mutate(min.var = min(old.var), max.var = max(old.var))
#Source: local data frame [9 x 4]
#Groups: id [3]
# id old.var min.var max.var
# (int) (int) (int) (int)
#1 1 1 1 3
#2 1 2 1 3
#3 1 3 1 3
#4 2 5 5 11
#5 2 7 5 11
#6 2 9 5 11
#7 2 11 5 11
#8 3 3 3 4
#9 3 4 3 4
Using ave as docendo discimus pointed out in the question's comments:
df$min.var <- ave(df$old.var, df$id, FUN = min)
df$max.var <- ave(df$old.var, df$id, FUN = max)
Output:
id old.var min.var max.var
1 1 1 1 3
2 1 2 1 3
3 1 3 1 3
4 2 5 5 11
5 2 7 5 11
6 2 9 5 11
7 2 11 5 11
8 3 3 3 4
9 3 4 3 4
We can use data.table
library(data.table)
setDT(df1)[, c('min.var', 'max.var') := list(min(old.var), max(old.var)) , by = id]
df1
# id old.var min.var max.var
#1: 1 1 1 3
#2: 1 2 1 3
#3: 1 3 1 3
#4: 2 5 5 11
#5: 2 7 5 11
#6: 2 9 5 11
#7: 2 11 5 11
#8: 3 3 3 4
#9: 3 4 3 4
