Data on one row by ID - r

I have a data frame with one id column and several other columns grouped in pairs, and I'm trying to put all the data for the same id on one row. The IDs do not all appear the same number of times.
My data looks like this:
df <- data.frame(id=sample(1:4, 12, T), vpcc1=1:12, hpcc1=rnorm(12), vpcc2=1:12, hpcc2=rnorm(12), vpcc3=1:12, hpcc3=rnorm(12))
df
## id vpcc1 hpcc1 vpcc2 hpcc2 vpcc3 hpcc3
## 1 1 1 0.04632267 1 -0.37404379 1 0.90711353
## 2 4 2 0.50383152 2 0.06075954 2 0.30690284
## 3 1 3 1.52450117 3 -1.21539925 3 -1.12411614
## 4 1 4 -0.50624871 4 -0.75988364 4 -0.47970608
## 5 3 5 1.64610863 5 0.03445275 5 -0.18895338
## 6 1 6 0.22019099 6 -0.32101883 6 1.29375822
## 7 2 7 -0.10041807 7 -0.17351799 7 -0.03767921
## 8 2 8 0.81683565 8 0.62449158 8 0.50474787
## 9 2 9 -0.46891269 9 1.07743469 9 -0.55539149
## 10 1 10 0.69736549 10 -0.08573679 10 0.28025325
## 11 3 11 0.73354215 11 0.80676315 11 -1.12561358
## 12 2 12 -0.40903143 12 1.94155313 12 0.64231119
For the moment I came up with this:
align2 <- function(df) {
result <- lapply(1:nrow(df), function(j) lapply(1:3, function(i) {x <- df[j, paste0(c("vpcc", "hpcc"), i)]
names(x) <- paste0(c("vpcc", "hpcc"), (i + (j-1)*4))
return(x)}))
result2 <- lapply(result, function(x) do.call(cbind, x))
result3 <- do.call(cbind, result2)
return(result3)
}
testX <- lapply(1:4, function(k) align2(as.data.frame(split(df, f=df$id)[[k]])))
library(plyr)
testX2 <- do.call(rbind.fill, testX)
testX2
## vpcc1 hpcc1 vpcc2 hpcc2 vpcc3 hpcc3 vpcc4 hpcc4 vpcc5 hpcc5 vpcc6 hpcc6 vpcc7 hpcc7 vpcc8 hpcc8 ...
## 1 1 0.04632267 1 -0.37404379 1 0.90711353 3 1.5245012 3 -1.2153992 3 -1.1241161 4 -0.5062487 4 -0.7598836 ...
## 2 7 -0.10041807 7 -0.17351799 7 -0.03767921 8 0.8168356 8 0.6244916 8 0.5047479 9 -0.4689127 9 1.0774347 ...
## 3 5 1.64610863 5 0.03445275 5 -0.18895338 11 0.7335422 11 0.8067632 11 -1.1256136 NA NA NA NA ...
## 4 2 0.50383152 2 0.06075954 2 0.30690284 NA NA NA NA NA NA NA NA NA NA ...
It's only a partial solution, since it doesn't keep the id.
But I can't imagine there isn't an easier way...
Thank you for any suggestions.
PS: maybe there's already a solution on SO, but I didn't find it...

In your example the variables vpcc1 vpcc2 etc. are redundant, since they have all the same value. So you can transform the dataset into a more economical structure:
df <- data.frame(id=sample(1:4, 12, T), vpcc=1:12, hpcc1=rnorm(12),
hpcc2=rnorm(12),hpcc3=rnorm(12))
Then use reshape() and you'll have all the values for each id in a single row, with the columns corresponding to the vpcc value, so that "hpcc3.5" means hpcc3 when vpcc is 5.
reshape(df, idvar = "id", direction = "wide", timevar = "vpcc")
EDIT:
if vpccX varies, then maybe this will give you what you need?
df <- data.frame(id=sample(1:4, 12, T), vpcc1=1:12, hpcc1=rnorm(12), vpcc2=1:12,
hpcc2=rnorm(12), vpcc3=1:12, hpcc3=rnorm(12))
df$time = ave(df$id, df$id, FUN = function(x) 1:length(x))
reshape(df, idvar = "id", direction = "wide", timevar = "time")
Of course, you can rename your variables if needed.
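To make the per-id counter idea concrete, here is a minimal sketch on a small fixed data frame (the values are made up for illustration, not taken from the question). It assumes each row within an id should simply be numbered in order of appearance, which is what the ave() call does before reshape() spreads the rows out.

```r
# Small deterministic example: id 1 appears three times, id 2 twice, id 3 once
df <- data.frame(id    = c(1, 2, 1, 3, 2, 1),
                 vpcc1 = 1:6,
                 hpcc1 = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))

# Number the occurrences of each id: 1, 2, 3, ... within every id group
df$time <- ave(df$id, df$id, FUN = seq_along)

# One row per id; columns get a ".<time>" suffix, missing slots become NA
wide <- reshape(df, idvar = "id", timevar = "time", direction = "wide")
wide
```

Ids that occur fewer times than the most frequent one are padded with NA, and the id column is kept.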

When you say "same row", is it necessary that the output is like it is in your attempt or would you be happy with something like:
x <- aggregate(df[2:ncol(df)],list(df$id),list)
which allows you to view output on one row as:
x
# Group.1 vpcc1 hpcc1 vpcc2 hpcc2 vpcc3
#1 1 9, 10 1.4651392, 0.8581344 9, 10 -1.621135, 1.391945 9, 10
#2 2 1, 3, 7 2.784998, 1.667367, -1.329005 1, 3, 7 0.2115051, 0.7871399, -0.4835389 1, 3, 7
#3 3 5, 6 -0.5024987, 0.2822224 5, 6 0.155844, 1.336449 5, 6
#4 4 2, 4, 8, 11, 12 -0.48563550, -0.92684024, -0.04016263, -0.41861021, 0.02309864 2, 4, 8, 11, 12 -0.17304058, 0.25428404, -0.49897995, 0.03101927, -0.13529866 2, 4, 8, 11, 12
# hpcc3
#1 -0.05182822, 0.28365514
#2 -0.06189895, -0.83640652, 0.19425789
#3 -0.006440312, 1.378218706
#4 0.09412386, 0.16733125, -1.15198965, -1.00839015, -0.16114475
and reference different values of vpcc and hpcc using list notation:
x$vpcc1
#$`0`
#[1] 9 10
#$`1`
#[1] 1 3 7
#$`2`
#[1] 5 6
#$`3`
#[1] 2 4 8 11 12
x$vpcc1[[1]]
#[1] 9 10
?


Creating a list with column-wise partitions of a data.frame

I have a data.frame with a single "identifier" column and many additional columns. I am interested in turning this data.frame into a list of length K, whose elements are sets of columns partitioning the data.frame.
For example, given the below data.frame:
# Example data.frame
df <- data.frame(id = 1:10,
x1 = rnorm(10),
x2 = rnorm(10),
x3 = rnorm(10),
x4 = rnorm(10))
I'd like to have some function that converts it into this:
# Partitioning function
foo(df, partitions = 3)
# Expected output
list(data.frame(id = df$id, x1 = df[ ,2]),
data.frame(id = df$id, x2 = df[ ,3]),
data.frame(id = df$id, x3 = df[ ,4], x4 = df[ ,5]))
Bonus points if you can extend this so that you can specify how many non-id columns each element of the list should contain by passing a numeric vector. Imagine the same output with an input that looks like this or equivalent.
columns_per_element <- c(1,1,2)
foo(df, columns_per_element)
It is actually easier to define a function with the splitting sequence. The key functions here are rep and split.default, i.e.
f2 <- function(df, n, split){
i1 <- rep(seq(n), split)
res_list <- split.default(df[-1], i1)
return(lapply(res_list, function(i)cbind.data.frame(ID = df$id, i)))
}
f2(df, 3, c(1, 1, 2))
$`1`
ID x1
1 1 1.54960977
2 2 -1.59144017
3 3 0.02853548
4 4 -0.14231391
5 5 1.26989801
6 6 0.87495876
7 7 0.27373774
8 8 -0.75600983
9 9 0.32216493
10 10 -1.05113771
$`2`
ID x2
1 1 0.8529416
2 2 0.4555094
3 3 -0.3620756
4 4 1.4779813
5 5 -1.6484066
6 6 -0.5697431
7 7 -0.2139384
8 8 0.1619074
9 9 -0.5390306
10 10 -0.2228809
$`3`
ID x3 x4
1 1 -0.2579865 1.185526074
2 2 -0.0519554 -0.388179976
3 3 2.5350092 -0.675504829
4 4 -1.7051955 0.073448252
5 5 0.6207733 -0.637220508
6 6 0.3015831 -1.324024114
7 7 -0.5647717 0.969025962
8 8 0.1404714 -1.575383604
9 9 1.3049560 -1.846413101
10 10 -0.6716643 0.008675125
f2(df, 3, c(1, 2, 1))
$`1`
ID x1
1 1 1.54960977
2 2 -1.59144017
3 3 0.02853548
4 4 -0.14231391
5 5 1.26989801
6 6 0.87495876
7 7 0.27373774
8 8 -0.75600983
9 9 0.32216493
10 10 -1.05113771
$`2`
ID x2 x3
1 1 0.8529416 -0.2579865
2 2 0.4555094 -0.0519554
3 3 -0.3620756 2.5350092
4 4 1.4779813 -1.7051955
5 5 -1.6484066 0.6207733
6 6 -0.5697431 0.3015831
7 7 -0.2139384 -0.5647717
8 8 0.1619074 0.1404714
9 9 -0.5390306 1.3049560
10 10 -0.2228809 -0.6716643
$`3`
ID x4
1 1 1.185526074
2 2 -0.388179976
3 3 -0.675504829
4 4 0.073448252
5 5 -0.637220508
6 6 -1.324024114
7 7 0.969025962
8 8 -1.575383604
9 9 -1.846413101
10 10 0.008675125
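As a rough illustration of why rep and split.default are enough, the sketch below spells out the intermediate grouping vector on a tiny data frame. The names sizes, groups and parts are only illustrative, and the group count is derived from the size vector rather than passed separately, which is an assumption about what is convenient, not part of the answer above.

```r
df <- data.frame(id = 1:3, x1 = 1:3, x2 = 4:6, x3 = 7:9, x4 = 10:12)

sizes  <- c(1, 1, 2)                             # non-id columns per list element
groups <- rep(seq_along(sizes), times = sizes)   # one group label per non-id column
groups

# split.default() treats the data frame as a list of columns and splits them by label
parts <- split.default(df[-1], groups)

# Put the id column back in front of every piece
parts <- lapply(parts, function(p) cbind(id = df$id, p))
parts
```

Here groups is 1 2 3 3, so the last element receives both x3 and x4.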
Here is a solution whose function takes two parameters and selects the columns in a vectorized way. Note that it assumes the first column is the id and is named id. Also, if the sum of the vector is greater than ncol(df) - 1 (where df is your input data frame), it will throw an error.
f2 <- function(x,y){
#keep id
id <- x[,"id" , drop = FALSE]
#keep all other variables
df2 <- x[,-1]
#get sequence for columns
y2 <- lapply(cumsum(y), function(x){sequence(x)})
#grab correct columns
y3 <- c(y2[1], Map(setdiff, y2[-1], y2[-length(y2)]))
#recreate df
lapply(y3,
function(x){
cbind.data.frame(id, df2[,x, drop = FALSE])
})
}
f2(df, c(1,1,2))
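The column positions can also be computed directly from the running totals, which may be easier to follow than taking set differences. This is only a sketch; starts, ends and idx are illustrative names, not part of the answer above.

```r
sizes  <- c(1, 1, 2)
ends   <- cumsum(sizes)        # last column position of each block: 1 2 4
starts <- ends - sizes + 1     # first column position of each block: 1 2 3

# One index vector per block, to be used on the non-id columns
idx <- Map(seq, starts, ends)
idx
```

Each element of idx can then be used as df2[, i, drop = FALSE] inside the lapply call.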

If I have the vector `2, 7, 12`, how can I create the vector `2, 2+1, 2+2, 7, 7+1, 7+2, 12, 12+1, 12+2`?

If I have the following vector:
vector <- c(1,6,10)
How can I create the vector 1 2 3 6 7 8 10 11 12?
Another example:
vector <- c(4,9,15)
My desired vector would be 4 5 6 9 10 11 15 16 17.
Any help would be great.
Thanks
We can do
c( sapply(vector, function(x) x:(x + 2)))
Or with
sort(vector + rep(0:2, each = length(vector)))
We can use outer :
vector <- c(1,6,10)
c(t(outer(vector, 0:2, `+`)))
#[1] 1 2 3 6 7 8 10 11 12
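Another way to get the same result, sketched here with the second example from the question, is to repeat each start value and let R recycle the offsets. The names v and offsets are just for illustration.

```r
v       <- c(4, 9, 15)
offsets <- 0:2

# Repeat every start value once per offset, then add the recycled offsets
res <- rep(v, each = length(offsets)) + offsets
res
# 4 5 6 9 10 11 15 16 17
```

Because the values are generated start by start, no sorting or transposing is needed, and it also works when the input is not in increasing order.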

Returning values after last NA in a vector

I can remove all NA values from a vector
v1 <- c(1,2,3,NA,5,6,NA,7,8,9,10,11,12)
v2 <- na.omit(v1)
v2
but how do I return a vector with values only after the last NA
c( 7,8,9,10,11,12)
Thank you for your help.
You could detect the last NA with which and add 1 to get the index past the last NA and index until the length(v1):
v1[(max(which(is.na(v1)))+1):length(v1)]
[1] 7 8 9 10 11 12
Here’s an alternative solution that does not use indices and only vectorised operations:
after_last_na = as.logical(rev(cumprod(rev(! is.na(v1)))))
v1[after_last_na]
The idea is to use cumprod to fill the non-NA fields from the last to the end. It’s not a terribly useful solution in its own right (I urge you to use the more obvious, index range based solution from other answers) but it shows some interesting techniques.
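To see what the cumprod trick does, it can help to look at the intermediate vectors. The sketch below uses a shortened input so that each step is easy to check by hand.

```r
v1 <- c(1, 2, 3, NA, 5, 6, NA, 7, 8)

ok       <- !is.na(v1)          # TRUE where a value is present
from_end <- cumprod(rev(ok))    # stays 1 from the end until the first NA is met, then 0
from_end
# 1 1 0 0 0 0 0 0 0

keep <- as.logical(rev(from_end))  # flip back to the original order
v1[keep]
# 7 8
```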
You could detect the last NA with which
v1[(tail(which(is.na(v1)), 1) + 1):length(v1)]
# [1] 7 8 9 10 11 12
However, the most general, as @MrFlick pointed out, seems to be this:
tail(v1, -tail(which(is.na(v1)), 1))
# [1] 7 8 9 10 11 12
which also handles the following case correctly:
v1[13] <- NA
tail(v1, -tail(which(is.na(v1)), 1))
# numeric(0)
To also handle the case where there is no NA at all,
v1 <- 1:13
we can do
if (any(is.na(v1))) tail(v1, -tail(which(is.na(v1)), 1)) else v1
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13
Data
v1 <- c(1, 2, 3, NA, 5, 6, NA, 7, 8, 9, 10, 11, 12)
v1 <- c(1,2,3,NA,5,6,NA,7,8,9,10,11,12)
v1[seq_along(v1) > max(0, tail(which(is.na(v1)), 1))]
#[1] 7 8 9 10 11 12
v1 = 1:5
v1[seq_along(v1) > max(0, tail(which(is.na(v1)), 1))]
#[1] 1 2 3 4 5
v1 = c(1:5, NA)
v1[seq_along(v1) > max(0, tail(which(is.na(v1)), 1))]
#integer(0)
The following will do what you want.
i <- which(is.na(v1))
if(i[length(i)] < length(v1)){
v1[(i[length(i)] + 1):length(v1)]
}else{
NULL
}
#[1] 7 8 9 10 11 12
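The edge cases discussed in these answers (no NA at all, or an NA in the last position) can be wrapped in one small helper. This is a sketch; after_last_na is a hypothetical name, and returning the whole vector when there is no NA is an assumption about the desired behaviour.

```r
# Hypothetical helper: values after the last NA, with the edge cases handled
after_last_na <- function(x) {
  na_pos <- which(is.na(x))
  if (length(na_pos) == 0) return(x)   # no NA: keep everything (assumed behaviour)
  tail(x, -max(na_pos))                # drop everything up to and including the last NA
}

after_last_na(c(1, 2, 3, NA, 5, 6, NA, 7, 8, 9, 10, 11, 12))
after_last_na(1:5)
after_last_na(c(1:5, NA))
```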

R - Sum list of matrix with different columns

I have a large list of matrices with different columns, and I would like to sum these matrices, counting 0 wherever column X does not exist in a matrix.
If you have used the function rbind.fill from plyr, I would like something similar but with a sum function. Of course I could build a function to do that, but I'm looking for a native function efficiently implemented in Fortran or C, because my data is large.
Here is an example.
This is the easy case, where all matrices have the same columns:
aa <- list(
m1 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','b','c'))),
m2 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','b','c')))
)
aa
Reduce('+',aa)
Giving the results:
> aa
$m1
a b c
1 1 4 7
2 2 5 8
3 3 6 9
$m2
a b c
1 1 4 7
2 2 5 8
3 3 6 9
> Reduce('+',aa)
a b c
1 2 8 14
2 4 10 16
3 6 12 18
And with my data:
bb <- list(
m1 = matrix(c(1,2,3,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','c'))),
m2 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','b','c')))
)
bb
Reduce('+',bb)
Here I would like to have b = c(0,0,0) in the first matrix to sum them.
> bb
$m1
a c
1 1 7
2 2 8
3 3 9
$m2
a b c
1 1 4 7
2 2 5 8
3 3 6 9
Many thanks!
Xevi
One option would be
un1 <- sort(unique(unlist(lapply(bb, colnames))))
bb1 <- lapply(bb, function(x) {
nm1 <- setdiff(un1, colnames(x))
m1 <- matrix(0, nrow = nrow(x), ncol = length(nm1), dimnames = list(NULL, nm1))
cbind(x, m1)[, un1]})
and use the Reduce
Reduce(`+`, bb1)
# a b c
# 1 2 4 14
# 2 4 5 16
# 3 6 6 18
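If padding every matrix first uses too much memory on a large list, one alternative is to accumulate into a single result matrix. The sketch below assumes all matrices have the same rows in the same order; sum_by_colname is a hypothetical helper name, not an existing function.

```r
# Hypothetical helper: add matrices column by column, matching on column names
sum_by_colname <- function(mats) {
  cols  <- sort(unique(unlist(lapply(mats, colnames))))
  total <- matrix(0, nrow = nrow(mats[[1]]), ncol = length(cols),
                  dimnames = list(rownames(mats[[1]]), cols))
  for (m in mats) {
    # Only the columns present in m are touched; the others effectively get + 0
    total[, colnames(m)] <- total[, colnames(m)] + m
  }
  total
}

bb <- list(
  m1 = matrix(c(1, 2, 3, 7, 8, 9), nrow = 3,
              dimnames = list(c("1", "2", "3"), c("a", "c"))),
  m2 = matrix(1:9, nrow = 3,
              dimnames = list(c("1", "2", "3"), c("a", "b", "c")))
)
res <- sum_by_colname(bb)
res
```

Only one matrix the size of the result is allocated, rather than one padded copy per list element.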

New variable with values depending on combination of other variables

I'm very inexperienced in R, and although this site has been tremendously helpful, I have a very specific situation and cannot find a solution. I imagine I need to write a function to accomplish this. However, my current time frame does not allow me to spend the time doing trial/error. (I apologize in advance for anything unclear).
Here is an example of my current data:
UniqueID, Time1.Feel1, Time2.Feel1.1, Time2.Feel1.2, Time2Num
1, 9, 5, 6, 1
1, 9, 7, 5, 2
2, 4, 3, 4, 1
2, 4, 5, 6, 2
3, 7, 4, 7, 1
3, 7, 6, 5, 2
I want to create a new variable: Time2.Feel1, which consists of the values of either Time2.Feel1.1 OR Time2.Feel1.2, depending on the value of Time2Num.
So, this:
UniqueID, Time1.Feel1, Time2.Feel1.1, Time2.Feel1.2, Time2Num, Time2.Feel1
1, 9, 5, 6, 1, 5
1, 9, 7, 5, 2, 5
2, 4, 3, 4, 1, 3
2, 4, 5, 6, 2, 6
3, 7, 4, 7, 1, 4
3, 7, 6, 5, 2, 5
I need to do this 30 times (i.e., Time2Num has values 1:30 and there are 30 different Time2.Feel1 variables: Time2.Feel1.1:30)
I then want to calculate a correlation between Time1.Feel1 and Time2.Feel1 for EACH UniqueID, creating a new data frame with the variables UniqueID and the new correlations. This part is less of a concern; I think I've figured out how to do that, but if the combined steps could be done more simply, I'd prefer that.
Thanks in advance!
To expound on @thelatemail's comment, you could do this
dat <- read.csv(text="UniqueID, Time1.Feel1, Time2.Feel1.1, Time2.Feel1.2, Time2Num
1, 9, 5, 6, 1
1, 9, 7, 5, 2
2, 4, 3, 4, 1
2, 4, 5, 6, 2
3, 7, 4, 7, 1
3, 7, 6, 5, 2")
dat$Time2.Feel1 <- dat[c("Time2.Feel1.1","Time2.Feel1.2")][cbind(seq(nrow(dat)),dat$Time2Num)]
# UniqueID Time1.Feel1 Time2.Feel1.1 Time2.Feel1.2 Time2Num Time2.Feel1
# 1 1 9 5 6 1 5
# 2 1 9 7 5 2 5
# 3 2 4 3 4 1 3
# 4 2 4 5 6 2 6
# 5 3 7 4 7 1 4
# 6 3 7 6 5 2 5
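The same matrix-indexing idea extends to more than two candidate columns when, as the question describes, a single Time2Num column selects among Time2.Feel1.1 to Time2.Feel1.k. Below is a minimal sketch with k = 3 and made-up values; for the real data k would be 30.

```r
dat <- data.frame(
  Time2.Feel1.1 = c(5, 7, 3),
  Time2.Feel1.2 = c(6, 5, 4),
  Time2.Feel1.3 = c(1, 2, 9),
  Time2Num      = c(1, 2, 3)
)

k <- 3
# Matrix of candidate columns, in the order 1..k
choices <- as.matrix(dat[paste0("Time2.Feel1.", seq_len(k))])

# A two-column index matrix picks one (row, column) cell per row
dat$Time2.Feel1 <- choices[cbind(seq_len(nrow(dat)), dat$Time2Num)]
dat$Time2.Feel1
# 5 5 9
```

This relies on the values of Time2Num matching the position of the columns in choices, so the column names must be built in the same order.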
Doing that 30 times isn't very efficient, so you could use a loop:
## creating some example data which I think matches your format
nr <- nrow(dat)
set.seed(1)
dat1 <- lapply(1:15, function(ii)
matrix(c(sample(1:9, nr * 2, replace = TRUE),
sample(1:2, nr, replace = TRUE)), nrow = nr,
dimnames = list(NULL, c(paste0('Time2.Feel1.', 1 + 2 * (ii - 1)),
paste0('Time2.Feel1.', 2 + 2 * (ii - 1)),
sprintf('Time%sNum', 2 + 2 * (ii - 1))))))
dat1 <- data.frame(do.call('cbind', dat1))
# Time2.Feel1.1 Time2.Feel1.2 Time2Num Time2.Feel1.3 Time2.Feel1.4 Time4Num
# 1 3 9 2 4 3 1
# 2 4 6 1 7 4 2
# 3 6 6 2 9 1 1
# 4 9 1 1 2 4 1
# 5 2 2 2 6 8 2
# 6 9 2 2 2 4 2
# Time2.Feel1.5 Time2.Feel1.6 Time6Num Time2.Feel1.7 Time2.Feel1.8 Time8Num
# 1 8 8 2 1 9 1
# 2 1 5 2 1 3 2
# 3 7 5 1 3 5 1
# 4 4 8 2 5 3 2
# 5 8 1 1 6 6 1
# 6 6 5 1 4 3 2
# Time2.Feel1.9 Time2.Feel1.10 Time10Num Time2.Feel1.11 Time2.Feel1.12 Time12Num
# 1 4 7 2 3 5 1
# 2 4 9 1 1 4 2
# 3 5 4 2 6 8 2
# 4 9 7 1 8 6 1
# 5 8 4 1 8 6 1
# 6 4 3 1 8 4 1
etc, etc
So you can start here. First, build the input vectors of column names:
xx is Time2.Feel1.1, Time2.Feel1.3, Time2.Feel1.5, etc.
yy is Time2.Feel1.2, Time2.Feel1.4, Time2.Feel1.6, etc.; xx and yy are your two "choices"
zz is the "decision" column: Time2Num, Time4Num, Time6Num, etc.
Then use mapply to do the indexing above as a 1-1 mapping over those three input vectors. Note that xx, yy, and zz are all the same length.
n <- 30
xx <- paste0('Time2.Feel1.', seq(1, n - 1, by = 2))
yy <- paste0('Time2.Feel1.', seq(2, n, by = 2))
zz <- sprintf('Time%sNum', seq(2, n, by = 2))
nn <- sprintf('Time%s.Feel1', seq(2, n, by = 2))
res <- mapply(function(x, y, z) dat1[, c(x, y)][cbind(1:nr, dat1[, z])],
xx, yy, zz, SIMPLIFY = FALSE)
res <- `colnames<-`(do.call('cbind', res), nn)
# Time2.Feel1 Time4.Feel1 Time6.Feel1 Time8.Feel1 Time10.Feel1 Time12.Feel1
# [1,] 9 4 8 1 7 3
# [2,] 4 4 5 3 4 4
# [3,] 6 9 7 3 4 8
# [4,] 9 2 8 3 9 8
# [5,] 2 8 8 6 8 8
# [6,] 2 4 6 3 4 8
And then you can combine the results back. You would need to reorder them if that is important to you
## combine results into original data
cbind(dat1, res)
When searching for the error I received when trying the answer from @user12202013, I came across this solution using ifelse, found here: Conditional assignment of one variable to the value of one of two other variables
Time2.Feel1 <- ifelse(Time2Num == 1, Time2.Feel1.1, ifelse(Time2Num == 2,
Time2.Feel1.2,""))
Although it is definitely not the most efficient solution, particularly because I need to nest it 30 times and I need to do it for 9 items, it solved my problem. A simpler answer is still welcome, though!
Thanks for your answers!
You want to do something like:
Time2.Feel1 = rep(NA, length(Time2Num))
Time2.Feel1[Time2Num == 1] <- Time2.Feel1.1[Time2Num == 1]
Time2.Feel1[Time2Num == 2] <- Time2.Feel1.2[Time2Num == 2]
This says to create a vector called Time2.Feel1, which we initialize with NA values. Then, where Time2Num is one, we fill in the values from Time2.Feel1.1, and where Time2Num is two, we fill in the values from Time2.Feel1.2. If there is any place where Time2Num is neither 1 nor 2, then Time2.Feel1 will have an NA value.
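A small self-contained sketch of this fill-by-condition pattern is shown below, with made-up vectors a and b standing in for Time2.Feel1.1 and Time2.Feel1.2. Note that the right-hand side is subset with the same condition as the left-hand side, so both sides of each assignment have the same length.

```r
Time2Num <- c(1, 2, 1, 2, 3)
a <- c(5, 7, 3, 5, 4)   # stands in for Time2.Feel1.1
b <- c(6, 5, 4, 6, 7)   # stands in for Time2.Feel1.2

out <- rep(NA_real_, length(Time2Num))
out[Time2Num == 1] <- a[Time2Num == 1]
out[Time2Num == 2] <- b[Time2Num == 2]
out
# 5 5 3 6 NA
```

The last element stays NA because its Time2Num value is neither 1 nor 2.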
Edit:
Not sure what the error message is referring to since I am able to do this
# reproducible example
set.seed(1)
A <- letters
B <- sample(c(0, 1, NA), 26, TRUE)
A[B == 1] <- '5' # assignment where subscript contains NAs
A[B == 0] <- NA # assigning NA values
A
[1] NA "5" "5" "d" NA "f" "g" "5" "5" NA NA NA "m" "5" "o" "5" "q" "r" "5" "t" "u" NA "5" NA NA "5"
I would need to see more complete code to know what is causing the error.
