I am slowly working through a data transformation using R and the dplyr package. I started with unique rows per respondent. The data come from a conjoint experiment, so I have needed to head toward profiles (profile A or B in the experiment) nested in experimental iteration (each respondent took the experiment 5 times) nested in respondent ID.
I have successfully transformed the data to get experiments nested in respondent IDs. Now I have multiple columns X1-Xn that contain the attribute features. However, these effectively repeat attributes at this point, with, say, X1 including a variable for profile A in the experiment and X6 including the same variable but for profile B.
In the mock example below, I would basically need to merge columns v1a and v1b as just v1, v2a and v2b as just v2, and so forth, while generating a new column that indicates whether the values come from profile a or b.
Following up on the comments, which were helpful but did not quite give me what I need, I have edited the post to include simple code for both the original data structure and the ideal outcome data:
#original dataframe
ID <- c(1, 1, 1, 2, 2, 2)
`Ex ID` <- c(1, 2, 3, 1, 2, 3)
v1a <- c(2, 4, 5, 1, 3, 5)
v2a = c(3, 4, 5, 2, 1, 5)
v3a = c(5, 4, 3, 3, 2, 1)
v1b = c(4, 5, 5, 1, 5, 4)
v2b = c(5, 2, 2, 4, 1, 4)
v3b = c(5, 5, 4, 5, 4, 5)
original <- data.frame(ID, "Ex ID" = `Ex ID`, v1a, v2a, v3a, v1b, v2b, v3b,
                       check.names = FALSE)
#wanted data frame
ID <- c(1, 1, 1, 1, 1, 1)
`Ex ID` <- c(1, 1, 2, 2, 3, 3)
profile <- c("a", "b", "a", "b", "a", "b")
v1ab = c(2, 4, 4, 5, 5, 5)
v2ab = c(3, 5, 4, 2, 5, 2)
v3ab = c(5, 5, 4, 5, 3, 4)
desired <- data.frame(ID, "Ex ID" = `Ex ID`, profile, v1ab, v2ab, v3ab, check.names = FALSE)
I basically want to find a way to nest multiple variables within ID, experiment ID, profile IDs.
Any guidance would be greatly appreciated.
Let's take a look at a minimal working example.
library(dplyr)
df <- data.frame(ID = c(1, 1, 1, 2, 2, 3), v1a = c(2, 4, 5, 1, 3, 5), v1b = c(4, 5, 5, 1, 5, 4))
To merge the columns v1a and v1b we can use paste, which concatenates strings. The new column is created with mutate, which comes with the dplyr package.
df <- mutate(df, v1 = paste(v1a, v1b, sep = ","))
Result:
ID v1a v1b v1
1 1 2 4 2,4
2 1 4 5 4,5
3 1 5 5 5,5
4 2 1 1 1,1
5 2 3 5 3,5
6 3 5 4 5,4
If you want to get rid of the "old" columns v1a and v1b, you can use select
df <- select(df, -c(v1a, v1b))
which results in
ID v1
1 1 2,4
2 1 4,5
3 1 5,5
4 2 1,1
5 2 3,5
6 3 5,4
We could do this with base R using sapply:
cols <- split(names(df)[-c(1,2)], substr(names(df)[-c(1,2)], start = 1, stop = 2))
cbind(df[c(1,2)], sapply(names(cols), function(col) {
do.call(paste, c(df[cols[[col]]], sep = ","))
}))
Output:
ID Ex_ID v1 v2
1 1 1 2,4 3,5
2 1 2 4,5 4,2
3 1 3 5,5 5,2
4 2 1 1,1 2,4
5 2 2 3,5 1,1
6 2 3 5,4 5,4
7 3 1 4,4 2,5
8 3 2 1,1 5,4
9 3 3 4,5 1,2
data:
library(tibble)
df <- tibble(ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3), Ex_ID = c(1,
2, 3, 1, 2, 3, 1, 2, 3), v1a = c(2, 4, 5, 1, 3, 5, 4, 1, 4),
v2a = c(3, 4, 5, 2, 1, 5, 2, 5, 1), v1b = c(4, 5, 5, 1, 5,
4, 4, 1, 5), v2b = c(5, 2, 2, 4, 1, 4, 5, 4, 2))
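The question asks for a long layout (one row per profile) rather than pasted strings. The following is a minimal base R sketch of that reshaping, not a definitive solution. It assumes the experiment column is named Ex_ID (instead of `Ex ID`) and that every attribute column is named v<number> followed by the profile letter; the names attrs and long are made up for illustration.

```r
# Data from the question, with the experiment column renamed to Ex_ID (assumption)
original <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 2),
  Ex_ID = c(1, 2, 3, 1, 2, 3),
  v1a = c(2, 4, 5, 1, 3, 5), v2a = c(3, 4, 5, 2, 1, 5), v3a = c(5, 4, 3, 3, 2, 1),
  v1b = c(4, 5, 5, 1, 5, 4), v2b = c(5, 2, 2, 4, 1, 4), v3b = c(5, 5, 4, 5, 4, 5)
)

attrs <- c("v1", "v2", "v3")  # attribute stems shared by both profiles

# Build one block per profile, rename its columns to the stems, then stack the blocks
long <- do.call(rbind, lapply(c("a", "b"), function(p) {
  block <- original[paste0(attrs, p)]
  names(block) <- attrs
  cbind(original[c("ID", "Ex_ID")], profile = p, block)
}))

# Order rows as respondent > experiment > profile
long <- long[order(long$ID, long$Ex_ID, long$profile), ]
rownames(long) <- NULL
long
```

The first six rows of long correspond to respondent 1 and match the v1ab, v2ab and v3ab values listed in the desired data frame.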
I'd like to find the rank of a value in a sorted vector, i.e., given a sorted (increasing) vector and a value, find the index of the value in the vector if it is present (or the mean of indices if more than once), or the index of the greatest element less than the value, if it is not present, but within the range of the vector, or something reasonable if the value is outside the range of the vector altogether.
Let's say xx is the vector and x is the value. mean(which(xx == x)) covers the value-present case, and max(which(xx < x)) covers the value-not-present-and-in-range case. 1 and length(xx) are probably reasonable outputs for the not-in-range case.
So I could do that, but I'd like to avoid creating a Boolean vector the size of xx, and also there are just enough wrinkles that I'd prefer to call a built-in or library function instead of rolling my own. Perhaps there is something simple which I've overlooked.
Here's an example. The first value, 7, is present in the vector. The second, 7.3, is not present. I'd like to get the outputs 82.5 and 86, respectively.
> sort (floor (runif (100) * 10)) -> xx
> xx
[1] 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
[38] 2 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6
[75] 6 6 6 6 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 9 9 9
> mean (which (xx == 7))
[1] 82.5
> max (which (xx <= 7.3))
[1] 86
EDIT: with hints from akrun, I've come up with the following. Note that when there are duplicates, I make use of the fact that match returns the least index and findInterval returns the greatest.
# assume xx is sorted already
mean.rank.in <- function (xx, x) {
findInterval (x, xx) -> i
if (i == 0) 0
else
if (xx[[i]] == x)
# account for duplicates here:
# findInterval returned greatest index, call match to find least
(match(x, xx) + i)/2
else i
}
Here are some checks:
xx <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7,
7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9)
mean.rank.in (xx, 7) == 82.5 # expect TRUE
mean.rank.in (xx, 7.3) == 86 # expect TRUE
sapply (xx, function (x) mean.rank.in (xx, x)) # looks right
sum (sapply (xx, function (x) mean.rank.in (xx, x))) == 5050 # expect TRUE
yy <- sort (runif (100))
all (sapply (yy, function (y) mean.rank.in (yy, y)) == 1:100) # expect TRUE
dyy <- min (yy[2:100] - yy[1:99])
yy1 <- yy + dyy/2
all (sapply (yy1, function (y) mean.rank.in (yy1, y)) == 1:100) # expect TRUE
mean.rank.in (yy, yy[[1]] - 1) == 0 # expect TRUE
mean.rank.in (yy, yy[[100]] + 1) == 100 # expect TRUE
Here is one option with rank
rank(xx)[match(7, xx)]
#[1] 82.5
and with findInterval
findInterval(7.3, xx)
#[1] 86
data
xx <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7,
7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9)
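The match/findInterval idea from the question's edit can also be applied to several lookup values at once. The sketch below is one possible vectorized variant, not the original poster's function; the name mean_rank_vec is made up for illustration, and xx is assumed to be sorted in increasing order.

```r
# Same 100 values as the data above, written compactly
xx <- rep(0:9, times = c(12, 14, 12, 9, 11, 12, 8, 8, 11, 3))

# Hypothetical helper: mean rank of each element of x within sorted xx
mean_rank_vec <- function(xx, x) {
  lo <- match(x, xx)          # least index of an exact match, NA if absent
  hi <- findInterval(x, xx)   # number of elements <= x (greatest index)
  # Absent values get the index of the greatest smaller element (0 if none);
  # present values get the mean of the least and greatest matching index
  ifelse(is.na(lo), hi, (lo + hi) / 2)
}

mean_rank_vec(xx, c(7, 7.3, -1, 10))
```

For this xx the call returns 82.5 and 86 for 7 and 7.3, and 0 and 100 for the two out-of-range values.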
Here's an example dataset.
structure(list(vector1 = c(1, 4, 4, 2, 1, 3, 2, 3, 4, 5, 3, 5,
5, 1, 4, 2, 4, 5, 2, 5), vector2 = c(4, 2, 3, 5, 3, 5, 2, 2,
3, 3, 4, 1, 4, 1, 2, 1, 2, 1, 1, 2)), class = "data.frame", row.names = c(NA,
-20L))
Basically what I'm trying to do is create a new variable 'Direction' based on differences between these numbers. I want to say something like:
if vector2 == vector1 or vector2 == vector1 +/- 1 then Direction == 'NS'
if vector2 < vector1 - 1 or vector2 > vector1 + 1 then Direction == 'EW'
Hopefully this makes sense. Thanks!
A similar solution is this (slightly simpler):
Data:
df <- data.frame(
vector1 = c(1, 4, 4, 2, 1, 3, 2, 3, 4, 5, 3, 5, 5, 1, 4, 2, 4, 5, 2, 5),
vector2 = c(4, 2, 3, 5, 3, 5, 2, 2, 3, 3, 4, 1, 4, 1, 2, 1, 2, 1, 1, 2)
)
Desired new column:
df$direction <- ifelse(df$vector1 == df$vector2 |
                       df$vector1 == df$vector2 + 1 |
                       df$vector1 == df$vector2 - 1, "NS", "EW")
Outcome:
df
vector1 vector2 direction
1 1 4 EW
2 4 2 EW
3 4 3 NS
4 2 5 EW
5 1 3 EW
6 3 5 EW
7 2 2 NS
8 3 2 NS
9 4 3 NS
10 5 3 EW
11 3 4 NS
12 5 1 EW
13 5 4 NS
14 1 1 NS
15 4 2 EW
16 2 1 NS
17 4 2 EW
18 5 1 EW
19 2 1 NS
20 5 2 EW
You can try this:
df <- structure(list(vector1 = c(1, 4, 4, 2, 1, 3, 2, 3, 4, 5, 3, 5,
5, 1, 4, 2, 4, 5, 2, 5), vector2 = c(4, 2, 3, 5, 3, 5, 2, 2,
3, 3, 4, 1, 4, 1, 2, 1, 2, 1, 1, 2)), class = "data.frame", row.names = c(NA,
-20L))
df$direction <- with(df,ifelse((vector2 == vector1) | (vector2 == (vector1 + 1)) | (vector2 == (vector1 - 1)), "NS",
ifelse(vector2 < (vector1-1) | (vector2 > (vector1 + 1)),"EW", NA)))
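Because the vectors hold whole numbers, the three equality tests amount to asking whether the two values differ by at most 1. A minimal sketch of that shortcut, assuming the same df as above and no missing values:

```r
df <- data.frame(
  vector1 = c(1, 4, 4, 2, 1, 3, 2, 3, 4, 5, 3, 5, 5, 1, 4, 2, 4, 5, 2, 5),
  vector2 = c(4, 2, 3, 5, 3, 5, 2, 2, 3, 3, 4, 1, 4, 1, 2, 1, 2, 1, 1, 2)
)

# "NS" when the values are equal or one apart, "EW" otherwise
df$direction <- ifelse(abs(df$vector2 - df$vector1) <= 1, "NS", "EW")
head(df)
```

With non-integer data this is not equivalent to the equality checks, since a difference of 0.5 would also count as "NS".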
I have two data frames in R
df1 = data.frame(Cust = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4), ItemId = c(1, 2, 3, 4, 2, 3, 2, 5, 1, 2, 5, 6, 2))
df2 = data.frame(ItemId1 = c(1, 3, 2, 3, 2, 1, 2, 3, 4, 6, 5, 3, 2, 4), ItemId2 = c(3, 1, 2, 3, 4, 1, 6, 4, 2, 4, 3, 1, 3, 5))
> df1
Cust ItemId
1 1 1
2 1 2
3 1 3
4 1 4
5 2 2
6 2 3
7 2 2
8 2 5
9 3 1
10 3 2
11 3 5
12 4 6
13 4 2
> df2
ItemId1 ItemId2
1 1 3
2 3 1
3 2 2
4 3 3
5 2 4
6 1 1
7 2 6
8 3 4
9 4 2
10 6 4
11 5 3
12 3 1
13 2 3
14 4 5
All I need is the following output, produced in a way that is less costly than joins/merge (because in practice I am dealing with billions of records):
> output
ItemId1 ItemId2 Cust
1 1 3 1
2 3 1 1
3 2 2 1, 2, 3, 4
4 3 3 1, 2
5 2 4 1
6 1 1 1, 3
7 2 6 4
8 3 4 1
9 4 2 1
10 6 4 NA
11 5 3 2
12 3 1 1
13 2 3 1, 2
14 4 5 NA
The rule is: if both ItemId1 and ItemId2 of a df2 row are present among the ItemId values of a customer in df1, we need to return that Cust value (all of them if there are several). If no customer has both, we need to return NA.
For example, take the first row: ItemId1 = 1, ItemId2 = 3. Only customer 1 has both ItemId 1 and ItemId 3 in df1. The same logic applies to the other rows.
We could do this with joins/merge, but those are costly operations and they result in a memory error.
This may take more time, but it won't use much memory.
Please convert the for loops to apply calls if possible:
library(plyr)
df1 = data.frame(Cust = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4), ItemId = c(1, 2, 3, 4, 2, 3, 2, 5, 1, 2, 5, 6, 2))
df2 = data.frame(ItemId1 = c(1, 3, 2, 3, 2, 1, 2, 3, 4, 6, 5, 3, 2, 4), ItemId2 = c(3, 1, 2, 3, 4, 1, 6, 4, 2, 4, 3, 1, 3, 5))
temp2 = ddply(df1[,c("Cust","ItemId")], .(Cust), summarize, ItemId = toString(unique(ItemId)))
temp3 = ddply(df1[,c("ItemId","Cust")], .(ItemId), summarize, Cust = toString(unique(Cust)))
dfout = cbind(df2[0,],data.frame(Cust = df1[0,1]))
for(i in 1:nrow(df2)){
a = df2[i,1]
b = df2[i,2]
if(a == b){
dfout = rbind(dfout,data.frame(ItemId1 = a,ItemId2 = a,Cust = temp3$Cust[temp3$ItemId == a]))
}else{
cusli = c()
for(j in 1:nrow(temp2)){
if(length(grep(a,temp2$ItemId[j]))>0 & length(grep(b,temp2$ItemId[j]))>0){
cusli = c(cusli,temp2$Cust[j])
}
}
dfout = rbind(dfout,data.frame(ItemId1 = a,ItemId2 = b,Cust = paste(cusli,collapse = ", ")))
}
}
dfout$Cust[dfout$Cust == ""] = NA
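As a sketch of the apply-style conversion, the lookup can also be written with split and %in%. This is one possible variant, not a benchmarked solution for billions of rows; the names items_by_cust and find_custs are made up for illustration. Unlike grep, %in% compares whole values, so item 1 cannot accidentally match item 10.

```r
df1 <- data.frame(Cust = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4),
                  ItemId = c(1, 2, 3, 4, 2, 3, 2, 5, 1, 2, 5, 6, 2))
df2 <- data.frame(ItemId1 = c(1, 3, 2, 3, 2, 1, 2, 3, 4, 6, 5, 3, 2, 4),
                  ItemId2 = c(3, 1, 2, 3, 4, 1, 6, 4, 2, 4, 3, 1, 3, 5))

# One vector of items per customer, named by customer
items_by_cust <- split(df1$ItemId, df1$Cust)

# Hypothetical helper: customers whose items include both a and b, or NA if none
find_custs <- function(a, b) {
  hit <- vapply(items_by_cust, function(items) all(c(a, b) %in% items), logical(1))
  if (any(hit)) toString(names(items_by_cust)[hit]) else NA_character_
}

df2$Cust <- mapply(find_custs, df2$ItemId1, df2$ItemId2)
df2
```

On the sample data this reproduces the Cust column of the expected output, including NA for the pairs (6, 4) and (4, 5).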
Here is my problem:
myvec <- c(1, 2, 2, 2, 3, 3,3, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 9, 10, 10, 10)
I want to develop a function that can categorize this vector depending on the number of categories I define.
If there is 1 category, all newvec elements will be 1.
If there are 2 categories, then for
unique(myvec), i.e.
1 = 1, 2 = 2, 3 = 1, 4 = 2, 5 = 1, 6 = 2, 7 = 1, 8 = 2, 9 = 1, 10 = 2
(which is the odd/even situation).
If there are 3 categories, the first three numbers will be 1:3 and then the pattern repeats:
1 = 1, 2 = 2, 3 = 3, 4 = 1, 5 = 2, 6 = 3, 7 = 1, 8 = 2, 9 = 3, 10 = 1
If there are 4 categories, the first four numbers will be 1:4 and then the pattern repeats:
1 = 1, 2 = 2, 3 = 3, 4 = 4, 5 = 1, 6 = 2, 7 = 3, 8 = 4, 9 = 1, 10 = 2
Similarly, for n categories the first n numbers are 1:n, then the pattern repeats.
This should do what you need, if I correctly understood the question. You can vary variable n to choose the number of groups.
myvec <- c(1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 9, 10, 10, 10)
out <- vector(mode="integer", length=length(myvec))
uid <- sort(unique(myvec))
n <- 3
for (i in 1:n) {
s <- seq(i, length(uid), n)
out[myvec %in% uid[s]] <- i
}
Using the recycling features of R (this gives a warning if the vector length is not divisible by n):
R> myvec <- c(1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 9, 10, 10, 10)
R> n <- 3
R> y <- cbind(x=sort(unique(myvec)), y=1:n)[, 2]
R> y
[1] 1 2 3 1 2 3 1 2 3 1
or using rep:
R> x <- sort(unique(myvec))
R> y <- rep(1:n, length.out=length(x))
R> y
[1] 1 2 3 1 2 3 1 2 3 1
Update: you could just use the modulo operator
R> myvec
[1] 1 2 2 2 3 3 3 4 4 5 6 6 6 6 7 8 8 9 10 10 10
R> n <- 4
R> ((myvec - 1) %% n) + 1
[1] 1 2 2 2 3 3 3 4 4 1 2 2 2 2 3 4 4 1 2 2 2
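The plain modulo works here because the unique values happen to be the consecutive integers 1 to 10. A minimal sketch that does not rely on this, assuming the category should follow the position of each value among the sorted unique values (the function name categorize is made up for illustration):

```r
myvec <- c(1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 9, 10, 10, 10)

# Hypothetical helper: cycle through 1:n by rank among the sorted unique values
categorize <- function(x, n) {
  idx <- match(x, sort(unique(x)))  # position of each value among the unique values
  ((idx - 1) %% n) + 1
}

categorize(myvec, 4)
categorize(c(10, 20, 20, 50), 2)  # non-consecutive values
```

For myvec the result equals the modulo output above; for c(10, 20, 20, 50) with n = 2 it gives 1 2 2 1.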