Vector to dataframe in R given length of vector

I have vectors of different lengths. For instance:
df1
[1] 1 95 5 2 135 4 3 135 4 4 135 4 5 135 4 6 135 4
df2
[1] 1 70 3 2 110 4 3 112 4
I'm trying to write an R function (or a for loop) that takes any such vector and returns a dataframe with three columns, one separate dataframe per input vector. The length of each vector is a multiple of three (hence the three columns). I'm fairly new to writing functions in R and can't seem to figure this out. Here was my attempt:
newdf = c()
ld <- length(df1)
ld_mult <- length(df1)/3
ld_seq <- seq(from=1,to=ld,by=3)
ld_seq2 < ld_seq +2
for (i in 1:ld_mult) {
newdf[i,] <- df1[ld_seq[i]:ld_seq2[i]]
}
the output I want for df1 would be:
1 95 5
2 135 4
3 135 4
4 135 4
5 135 4
6 135 4

Here's an example of how you could use matrix for that purpose:
x <- c(1, 95, 5,2, 135, 4, 3, 135, 4)
as.data.frame(matrix(x, ncol = 3, byrow = TRUE))
# V1 V2 V3
#1 1 95 5
#2 2 135 4
#3 3 135 4
And
y <- c(1, 70, 3, 2, 110, 4, 3, 112, 4)
as.data.frame(matrix(y, ncol = 3, byrow = TRUE))
# V1 V2 V3
#1 1 70 3
#2 2 110 4
#3 3 112 4
Or if you want to make it a custom function:
newdf <- function(vec) {
as.data.frame(matrix(vec, ncol = 3, byrow = TRUE))
}
newdf(y)
#V1 V2 V3
#1 1 70 3
#2 2 110 4
#3 3 112 4
You could also let the user specify the number of columns he wants to create with the function if you add another argument to newdf:
newdf <- function(vec, cols = 3) {
as.data.frame(matrix(vec, ncol = cols, byrow = T))
}
Now the default number of columns is 3 if the user doesn't specify a number. To choose a different number, call it like this:
newdf(z, 5) # to create 5 columns
Another nice little addon for the function would be a check if the input vector length is a multiple of the number of columns specified in the function call:
newdf <- function(vec, cols = 3) {
if(length(vec) %% cols != 0) {
stop("Length of input vector is not a multiple of the number of columns. Please double check.")
}
as.data.frame(matrix(vec, ncol = cols, byrow = TRUE))
}
newdf(x, 4)
#Error in newdf(x, 4) :
# Length of input vector is not a multiple of the number of columns. Please double check.
If you had multiple vectors sitting in a list, here's how you could convert each of them to be a data.frame:
> l <- list(x,y)
> l
#[[1]]
#[1] 1 95 5 2 135 4 3 135 4
#
#[[2]]
#[1] 1 70 3 2 110 4 3 112 4
> lapply(l, newdf)
#[[1]]
# V1 V2 V3
#1 1 95 5
#2 2 135 4
#3 3 135 4
#
#[[2]]
# V1 V2 V3
#1 1 70 3
#2 2 110 4
#3 3 112 4
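As a small self-contained sketch of the same idea, here is a variant that also accepts column names and is applied to a named list of vectors. The function name vec_to_df, the col_names argument and the column names "id", "value" and "count" are made up for illustration; they are not from the question.

```r
# Hypothetical variant of newdf(): reshape a vector row by row and
# optionally name the resulting columns.
vec_to_df <- function(vec, cols = 3, col_names = NULL) {
  # Refuse vectors that cannot fill complete rows
  stopifnot(length(vec) %% cols == 0)
  out <- as.data.frame(matrix(vec, ncol = cols, byrow = TRUE))
  if (!is.null(col_names)) names(out) <- col_names
  out
}

# The two vectors from the question, kept in a named list
inputs <- list(
  df1 = c(1, 95, 5, 2, 135, 4, 3, 135, 4, 4, 135, 4, 5, 135, 4, 6, 135, 4),
  df2 = c(1, 70, 3, 2, 110, 4, 3, 112, 4)
)

# Extra arguments to lapply() are passed on to vec_to_df()
frames <- lapply(inputs, vec_to_df, col_names = c("id", "value", "count"))
frames$df1
```

Because the list is named, each resulting dataframe can be picked out by name (frames$df1, frames$df2).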

Related

Combining columns with a grouping in R

I need to categorize information from column names and restructure a dataset. Here is what my sample dataset looks like:
df <- data.frame(id = c(111,112,113),
Demo_1_Color_Naming = c("Text1","Text1","Text1"),
Demo_1.Errors =c(0,1,2),
Item_1_Color_Naming = c("Text1","Text1","Text1"),
Item_1.Time_in_Seconds =c(10,NA, 44),
Item_1.Errors = c(0,1,NA),
Demo_2_Shape_Naming = c("Text1","Text1","Text1"),
Demo_2.Errors =c(4,1,5),
Item_2_Shape_Naming = c("Text1","Text1","Text1"),
Item_2.Time_in_Seconds =c(55,35, 22),
Item_2.Errors = c(5,2,NA))
> df
id Demo_1_Color_Naming Demo_1.Errors Item_1_Color_Naming Item_1.Time_in_Seconds Item_1.Errors Demo_2_Shape_Naming Demo_2.Errors
1 111 Text1 0 Text1 10 0 Text1 4
2 112 Text1 1 Text1 NA 1 Text1 1
3 113 Text1 2 Text1 44 NA Text1 5
Item_2_Shape_Naming Item_2.Time_in_Seconds Item_2.Errors
1 Text1 55 5
2 Text1 35 2
3 Text1 22 NA
The columns are grouped by the numbers 1, 2, 3, ... Each number represents a grouping name. For example, number 1 in this dataset represents the Color grouping, while number 2 represents the Shape grouping. I would like to keep the Time_in_Seconds and Errors information for each grouping, and then sum both time and errors across groupings.
Additionally, this sample dataset is limited to two groupings. The bigger dataset has more than two, so I need a solution that handles any number of groups/columns.
How can I achieve this below:
> df1
id ColorTime ShapeTime ColorError ShapeError TotalTime TotalError
1 111 10 55 0 5 65 5
2 112 NA 35 1 2 35 3
3 113 44 22 NA NA 66 NA
We may do
cbind(df['id'], do.call(cbind, lapply(setNames(c("Time_in_Seconds",
"Item.*Errors"), c("Time_in_Seconds", "Errors")), \(x) {
tmp <- df[grep(x, names(df), value = TRUE)]
out <- setNames(as.data.frame(sapply(split.default(tmp,
gsub("\\D+", "", names(tmp))), rowSums, na.rm = TRUE)), c("Color", "Shape"))
transform(out, Total = rowSums(out))
})))
-output
id Time_in_Seconds.Color Time_in_Seconds.Shape Time_in_Seconds.Total Errors.Color Errors.Shape Errors.Total
1 111 10 55 65 0 5 5
2 112 0 35 35 1 2 3
3 113 44 22 66 0 0 0
If we need the NAs
cbind(df['id'], do.call(cbind, lapply(setNames(c("Time_in_Seconds",
"Item.*Errors"), c("Time_in_Seconds", "Errors")), \(x) {
tmp <- df[grep(x, names(df), value = TRUE)]
out <- setNames(as.data.frame(sapply(split.default(tmp,
gsub("\\D+", "", names(tmp))), \(u) Reduce(`+`, u))), c("Color", "Shape"))
transform(out, Total = rowSums(out, na.rm = TRUE))
})))
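For more than two groupings, the labels do not have to be hard-coded as c("Color", "Shape"). A rough sketch, assuming every group has a column named like Item_<n>_<Label>_Naming from which the label can be read: the helpers sum_or_na and collect are hypothetical names, and the data is the question's sample trimmed to the Item_* columns. The totals stay NA only when every value in the row is NA, which matches the desired output in the question.

```r
df <- data.frame(
  id = c(111, 112, 113),
  Item_1_Color_Naming = c("Text1", "Text1", "Text1"),
  Item_1.Time_in_Seconds = c(10, NA, 44),
  Item_1.Errors = c(0, 1, NA),
  Item_2_Shape_Naming = c("Text1", "Text1", "Text1"),
  Item_2.Time_in_Seconds = c(55, 35, 22),
  Item_2.Errors = c(5, 2, NA)
)

# Lookup table: item number -> label, e.g. "1" -> "Color", "2" -> "Shape"
naming <- grep("^Item_[0-9]+_.*_Naming$", names(df), value = TRUE)
labels <- setNames(sub("^Item_[0-9]+_(.*)_Naming$", "\\1", naming),
                   sub("^Item_([0-9]+)_.*$", "\\1", naming))

# Row sums that ignore NAs but stay NA when the whole row is NA
sum_or_na <- function(d) {
  s <- rowSums(d, na.rm = TRUE)
  s[rowSums(!is.na(d)) == 0] <- NA
  s
}

# Hypothetical helper: gather one measure across all items and add a total
collect <- function(df, measure, suffix, labels) {
  cols <- grep(paste0("^Item_[0-9]+\\.", measure, "$"), names(df), value = TRUE)
  nums <- sub("^Item_([0-9]+)\\..*$", "\\1", cols)
  out <- setNames(df[cols], paste0(labels[nums], suffix))
  out[[paste0("Total", suffix)]] <- sum_or_na(df[cols])
  out
}

res <- cbind(df["id"],
             collect(df, "Time_in_Seconds", "Time", labels),
             collect(df, "Errors", "Error", labels))
res
```

The column order differs slightly from the question (each total follows its own measure), but the values are the same, including TotalError being NA for id 113.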

R: How to fill out values in a DF which are dependent on previous rows

I have a dataframe, and I want to do some calculations that depend on the previous rows (like dragging information down in Excel). My DF looks like this:
set.seed(1234)
df <- data.frame(DA = sample(1:3, 6, rep = TRUE) ,HB = sample(0:600, 6, rep = TRUE), D = sample(1:5, 6, rep = TRUE), AD = sample(1:14, 6, rep = TRUE), GM = sample(30:31, 6, rep = TRUE), GL = NA, R =NA, RM =0 )
df$GL[1] = 646
df$R[1] = 60
df$DA[5] = 2
df
# DA HB D AD GM GL R RM
# 1 2 399 4 13 30 646 60 0
# 2 2 97 4 10 31 NA NA 0
# 3 1 102 5 5 31 NA NA 0
# 4 3 325 4 2 31 NA NA 0
# 5 2 78 3 14 30 NA NA 0
# 6 1 269 4 8 30 NA NA 0
I want to fill out the missing values in my GL, R and RM columns, and the values are dependent on each other. So eg.
attach(df)
#calc GL and R for the 2nd row
df$GL[2] <- GL[1]+HB[2]+RM[1]
df$R[2] <- df$GL[2]*D[2]/GM[2]*AD[2]
#calc GL and R for the 3rd row
df$GL[3] <- df$GL[2]+HB[3]+df$RM[2]
df$R[3] <-df$GL[3]*D[3]/GM[3]*AD[3]
#and so on..
Is there a way to do all the calculations at once, instead of row by row?
In addition, each time the column 'DA' = 1, the previous values of 'R' should be summed up into 'RM' for that row, but only counting back to the last occurrence. So that
attach(df)
df$RM[3] <-R[1]+R[2]+R[3]
#and RM for the 6th row is calculated by
#df$RM[6] <-R[4]+R[5]+R[6]
Thanks a lot in advance!
You can use a for loop to calculate GL values and once you have them you can do the calculation for R columns directly.
for(i in 2:nrow(df)) {
df$GL[i] <- with(df, GL[i-1]+HB[i]+RM[i-1])
}
df$R <- with(df, (GL* D)/(GM *AD))
You can use indexing to solve the first two problems:
> # Original code from question~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> set.seed(1234)
> df <- data.frame(DA = sample(1:3, 6, rep = TRUE), HB = sample(0:600, 6, rep = TRUE),
+ D = sample(1:5, 6, rep = TRUE), AD = sample(1:14, 6, rep = TRUE),
+ GM = sample(30:31, 6, rep = TRUE), GL = NA, R =NA, RM =0 )
> df$GL[1] = 646
> df$R[1] = 60
> df$DA[5] = 2
> #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> # View df
> df
DA HB D AD GM GL R RM
1 2 399 4 13 30 646 60 0
2 2 97 4 10 31 NA NA 0
3 1 102 5 5 31 NA NA 0
4 3 325 4 2 31 NA NA 0
5 2 78 3 14 30 NA NA 0
6 1 269 4 8 30 NA NA 0
> # Solution below, based on indexing
> # 1. GL column
> df$GL <- cumsum(c(df$GL[1], df$HB[-1] + df$RM[-nrow(df)]))
> # 2. R column
> df$R[-1] <- (df$GL * df$D / df$GM * df$AD)[-1]
> # May be more clear like this (same result)
> df$R[-1] <- df$GL[-1] * df$D[-1] / df$GM[-1] * df$AD[-1]
> # Or did you mean this for last *?
> df$R[-1] <- (df$GL * df$D / (df$GM * df$AD))[-1]
The third problem can be solved with a loop.
> df$RM[1] <- df$R[1]
> for (i in 2:nrow(df)) {
+ df$RM[i] <- df$R[i] + df$RM[i-1] * (df$DA[i] != 2)
+ }
> df
DA HB D AD GM GL R RM
1 2 399 4 13 30 646 60.000000 60.000000
2 2 97 4 10 31 743 9.587097 9.587097
3 1 102 5 5 31 845 27.258065 36.845161
4 3 325 4 2 31 1170 75.483871 112.329032
5 2 78 3 14 30 1248 8.914286 8.914286
6 1 269 4 8 30 1517 25.283333 34.197619
Do these results look correct?
Update: Assuming RM should = R unless DA = 1, and in that case RM = sum of current row and previous R up to (not including) the above row with DA = 1, try the following loop.
df$RM[1] <- cs <- df$R[1]
for (i in 2:nrow(df)) {
df$RM[i] <- df$R[i] + cs * (df$DA[i] == 1)
cs <- (cs + df$R[i]) * (df$DA[i] != 1) # reset the running sum after a DA == 1 row
}
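The rule described in the update (RM equals R, except on rows where DA is 1, where it is the sum of R since the previous DA == 1 row, not including that row) can also be sketched without an explicit loop once R is known. This is only an illustration: the DA values are the ones from the question, but the R values are made-up round numbers so the sums are easy to check by hand.

```r
DA <- c(2, 2, 1, 3, 2, 1)
R  <- c(60, 10, 20, 30, 40, 50)   # toy values, not the real R column

# A new block starts on the row after each DA == 1 row
block <- cumsum(c(0, head(DA == 1, -1)))

# Running sum of R within each block
running <- ave(R, block, FUN = cumsum)

# Only rows with DA == 1 take the running sum; all others keep R
RM <- ifelse(DA == 1, running, R)
RM
# 60 10 90 30 40 120
```

Row 3 gets 60 + 10 + 20 and row 6 gets 30 + 40 + 50, which is the pattern asked for in the question. Note that this assumes R does not itself depend on RM; if GL (and therefore R) has to pick up RM from the previous row, a row-by-row loop is still needed.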

Calculate mean of specific row pattern

I have a dataframe like this:
V1 = paste0("AB", seq(1:48))
V2 = seq(1:48)
test = data.frame(name = V1, value = V2)
I want to calculate the means of the value-column and specific rows.
The pattern of the rows is pretty complicated:
Rows of MeanA1: 1, 5, 9
Rows of MeanA2: 2, 6, 10
Rows of MeanA3: 3, 7, 11
Rows of MeanA4: 4, 8, 12
Rows of MeanB1: 13, 17, 21
Rows of MeanB2: 14, 18, 22
Rows of MeanB3: 15, 19, 23
Rows of MeanB4: 16, 20, 24
Rows of MeanC1: 25, 29, 33
Rows of MeanC2: 26, 30, 34
Rows of MeanC3: 27, 31, 35
Rows of MeanC4: 28, 32, 36
Rows of MeanD1: 37, 41, 45
Rows of MeanD2: 38, 42, 46
Rows of MeanD3: 39, 43, 47
Rows of MeanD4: 40, 44, 48
As you can see, it starts at 4 different points (1, 13, 25, 37), then always goes +4, and for each of the following 4 means it just steps 1 more row down.
I would like to have an output of all these means in one list.
Any ideas? NOTE: In this example the mean is of course always the middle number, but my real df is different.
Not quite sure about the output format you require, but the following code calculates what you want.
calc_mean1 <- function(x) mean(test$value[seq(x, by = 4, length.out = 3)])
calc_mean2 <- function(x){sapply(x:(x+3), calc_mean1)}
output <- lapply(seq(1, 37, 12), calc_mean2)
names(output) <- paste0('Mean', LETTERS[seq_along(output)]) # remove this line if more than 26 groups.
output
## $MeanA
## [1] 5 6 7 8
## $MeanB
## [1] 17 18 19 20
## $MeanC
## [1] 29 30 31 32
## $MeanD
## [1] 41 42 43 44
An idea via base R is to create a grouping variable for every 4 rows, split the data every 12 rows (nrow(test) / 4) and aggregate to find the mean, i.e.
test$new = rep(1:4, nrow(test)%/%4)
lapply(split(test, rep(1:4, each = nrow(test) %/% 4)), function(i)
aggregate(value ~ new, i, mean))
# $`1`
# new value
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
# $`2`
# new value
# 1 1 17
# 2 2 18
# 3 3 19
# 4 4 20
# $`3`
# new value
# 1 1 29
# 2 2 30
# 3 3 31
# 4 4 32
# $`4`
# new value
# 1 1 41
# 2 2 42
# 3 3 43
# 4 4 44
And yet another way.
fun <- function(DF, col, step = 4){
run <- nrow(DF)/step^2
res <- lapply(seq_len(step), function(inc){
inx <- seq_len(run*step) + (inc - 1)*run*step
dftmp <- DF[inx, ]
tapply(dftmp[[col]], rep(seq_len(step), run), mean, na.rm = TRUE)
})
names(res) <- sprintf("Mean%s", LETTERS[seq_len(step)])
res
}
fun(test, 2, 4)
#$MeanA
#1 2 3 4
#5 6 7 8
#
#$MeanB
# 1 2 3 4
#17 18 19 20
#
#$MeanC
# 1 2 3 4
#29 30 31 32
#
#$MeanD
# 1 2 3 4
#41 42 43 44
Since you said you wanted a long list of the means, I assumed it could also be a vector where you just have all these values. You would get that like this:
V1 = paste0("AB", seq(1:48))
V2 = seq(1:48)
test = data.frame(name = V1, value = V2)
meanVector <- NULL
for (i in c(outer(0:3, seq(1, 37, by = 12), "+"))) { # start rows 1-4, 13-16, 25-28, 37-40
x <- c(test$value[i], test$value[i+4], test$value[i+8])
m <- mean(x)
meanVector <- c(meanVector, m)
}
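Another way to look at the pattern, sketched here under the assumption that the data always comes in blocks of 12 rows (4 offsets repeated 3 times): reshape the value column into a 4 x 3 x 4 array and average over the middle dimension. The dimension names offset, rep and block are labels chosen for this illustration.

```r
test <- data.frame(name = paste0("AB", 1:48), value = 1:48)

# Row number = offset + 4 * (rep - 1) + 12 * (block - 1), which is exactly
# how array() fills its cells (first dimension varies fastest).
a <- array(test$value, dim = c(4, 3, 4),
           dimnames = list(offset = as.character(1:4),
                           rep    = as.character(1:3),
                           block  = LETTERS[1:4]))

# Average over the three repeats, keeping offset and block
means <- apply(a, c(1, 3), mean)
means["1", "A"]   # mean of rows 1, 5, 9
as.vector(means)  # all 16 means in the order A1..A4, B1..B4, ...
```

The result is a 4 x 4 matrix with one column per block (A to D) and one row per offset, so means["2", "C"] is MeanC2 from the question.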

Divide vector into groups according difference between two neighbouring numbers

My dummy input vector looks like this:
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
What I want: add a group factor to each number. The group is assigned according to the difference between neighbouring numbers.
Example:
Difference (absolute) between 10 and 20 is 10, hence they belong to same group.
Difference between 30 and 20 is 10 - they belong to same group.
Difference between 30 and 70 is 40 - they belong to different groups.
Given a maximal difference of 20, the wanted result is:
x group
10 1
20 1
30 1
70 4
80 4
90 4
130 7
190 8
200 8
My code:
library(data.table)
library(foreach)
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
x <- data.table(x, group = 1)
y <- nrow(x)
maxGap <- 20
g <- 1
groups <-
foreach(i = 2:y, .combine = rbind) %do% {
if (x[i, x] - x[i - 1, x] < maxGap) {
g
} else {
g <- i
g
}
}
x[2:y]$group <- as.vector(groups)
My question
Given code works, but is too slow with large data (number of rows > 10mil). Is there simpler and quicker solution (not using loop)?
library(IRanges)
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
# If the distance between two integers is larger than 30,
# then they would be in two groups. Otherwise, they would
# be in the same group.
ther <- 15 # half of the maximal distance (30)
df.1 <- data.frame(val=x, left=x-ther, right=x+ther)
df.ir <- IRanges(df.1$left, df.1$right)
df.ir.re <- findOverlaps(df.ir, reduce(df.ir))
df.1$group <- subjectHits(df.ir.re)
df.1
# val left right group
# 1 10 -5 25 1
# 2 20 5 35 1
# 3 30 15 45 1
# 4 70 55 85 2
# 5 80 65 95 2
# 6 90 75 105 2
# 7 130 115 145 3
# 8 190 175 205 4
# 9 200 185 215 4
An implementation which uses the rleid and shift functions of data.table:
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
DT <- data.table(x)
DT[, grp := rleid(cumsum(x - shift(x,1L,0) > 20))]
which gives:
> DT
x grp
1: 10 1
2: 20 1
3: 30 1
4: 70 2
5: 80 2
6: 90 2
7: 130 3
8: 190 4
9: 200 4
Explanation: With x - shift(x,1L,0) you calculate the difference with the previous observation of x. By comparing it to 20 (i.e.: the > 20 part) and wrapping that in cumsum and rleid a runlength id is created.
In response to @Roland's comments: you can leave the rleid-part out if you set the fill parameter in shift to -Inf:
DT[, grp := cumsum((x - shift(x, 1L, -Inf)) > 20)]
test <- c(TRUE, diff(x) > 20) #test the differences
res <- factor(cumsum(test)) #groups
#[1] 1 1 1 2 2 2 3 4 4
#Levels: 1 2 3 4
levels(res) <- which(test) #fix levels
res
#[1] 1 1 1 4 4 4 7 8 8
#Levels: 1 4 7 8
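The cumsum-over-diff idea used in the answers above can be wrapped in a small base R function. This is a sketch only: the name group_by_gap and its max_gap argument are made up here, and it assumes x is non-empty and sorted in increasing order (so no abs() is needed).

```r
group_by_gap <- function(x, max_gap = 20) {
  # TRUE wherever a value starts a new group (the first value always does)
  new_group <- c(TRUE, diff(x) > max_gap)
  id <- cumsum(new_group)
  list(
    id = id,                               # consecutive ids: 1, 2, 3, ...
    first_index = which(new_group)[id]     # index of the first member of each group
  )
}

x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
g <- group_by_gap(x, 20)
g$id           # 1 1 1 2 2 2 3 4 4
g$first_index  # 1 1 1 4 4 4 7 8 8
```

The first_index variant reproduces the labelling from the question (1, 4, 7, 8), while id gives plain consecutive group numbers. Both are vectorised, so they should scale to long vectors far better than the foreach loop.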

R apply code to different factors or levels

Below is code to generate data to demonstrate the problem.
con <- textConnection('
Nu Na Vo
100 A 60
103 A 2
104 A 2
106 A 5
107 A 1
108 A 1
112 A 50
100 B 1
108 B 4
109 B 2
120 B 30
109 C 40
')
tt <- read.table(con, header = T)
close(con)
test <- as.data.frame(tt)
I've the following code. It is to assign value to "Sta" column subject to the specific condition and to add the difference in "Nu" between i and i+1 row into "Lag" column.
library(dplyr)
# to sort "Na" column and arrange "Nu" in descending order
# in order to apply the code below.
test2 <- tt %>% arrange(Na, -Nu) # %>% replaces the old %.% operator, which current dplyr no longer provides
for (i in 1:nrow(test2)) {
if (i < nrow(test2)) {
if (test2[i, ]$Nu - 2 > test2[i+1, ]$Nu) {
test2[i, 4] <- "N"
test2[i, 5] <- test2[i, ]$Nu - test2[i+1, ]$Nu
} else if (test2[i, ]$Nu - 2 <= test2[i+1, ]$Nu) {
test2[i, 4] <- "Y"
test2[i, 5] <- test2[i, ]$Nu - test2[i+1, ]$Nu
}
} else if (i == nrow(test2)) {
test2[i, 4] <- "N"
test2[i, 5] <- 0
}
}
names(test2)[names(test2) == "V4"] <- "Sta"
names(test2)[names(test2) == "V5"] <- "Lag"
test2
After running the code, it produces the result as below:
Nu Na Vo Sta Lag
1 112 A 50 N 4
2 108 A 1 Y 1
3 107 A 1 Y 1
4 106 A 5 Y 2
5 104 A 2 Y 1
6 103 A 2 N 3
7 100 A 60 Y -20
8 120 B 30 N 11
9 109 B 2 Y 1
10 108 B 4 N 8
11 100 B 1 Y -9
12 109 C 40 N 0
The values under the "Sta" column are properly assigned, but not those in the "Lag" column. The original intention is to apply the code separately for the different values/levels in "Na", that is "A", "B", "C". I don't know how to apply the code to "A", "B", "C" separately and combine the separate results into ONE table. The desired outcome should be:
Nu Na Vo Sta Lag
1 112 A 50 N 4
2 108 A 1 Y 1
3 107 A 1 Y 1
4 106 A 5 Y 2
5 104 A 2 Y 1
6 103 A 2 N 3
7 100 A 60 Y 0 << Last row for "A". "Lag" should be "0"; "Sta" should be "N".
8 120 B 30 N 11
9 109 B 2 Y 1
10 108 B 4 N 8
11 100 B 1 Y 0 << Last row for "B". "Lag" should be "0"; "Sta" should be "N".
12 109 C 40 N 0 << Last row for "C". "Lag" should be "0"; "Sta" should be "N".
Edited
Not sure how to apply the code to the different factors / levels of "Na": "A", "B" and "C". Is it possible to use split() or the apply family of functions? As can be seen from the result and the intent of the code above, the result should be FACTOR / LEVEL / element dependent (hope I'm using the proper terminology), and this affects the values under both the "Sta" and "Lag" columns. However, my code cannot distinguish this. I'd appreciate any help. Thanks
An inelegant solution!
For completeness, I post herewith a possible solution. I code it the hard way. If anyone could help simplify it, it would be very much appreciated.
con <- textConnection('
Nu Na Vo
100 A 60
103 A 2
104 A 2
106 A 5
107 A 1
108 A 1
112 A 50
100 B 1
108 B 4
109 B 2
120 B 30
109 C 40
')
tt <- read.table(con, header = T)
close(con)
require(dplyr); require(data.table)
test2 <- tt %>% arrange(Na, -Nu) # %>% replaces the old %.% operator, which current dplyr no longer provides
spl <- split(test2, test2$Na)
spl
for (i in seq_along(spl)) { # one pass per group; also works when Na is not a factor
for (j in 1:nrow(spl[[i]])) {
if (j < nrow(spl[[i]])) {
if (spl[[i]][j, ]$Nu - 2 > spl[[i]][j+1, ]$Nu) {
spl[[i]][j, 4] <- "N"
spl[[i]][j, 5] <- spl[[i]][j, ]$Nu - spl[[i]][j+1, ]$Nu
} else if (spl[[i]][j, ]$Nu - 2 <= spl[[i]][j+1, ]$Nu) {
spl[[i]][j, 4] <- "Y"
spl[[i]][j, 5] <- spl[[i]][j, ]$Nu - spl[[i]][j+1, ]$Nu
}
} else if (j == nrow(spl[[i]])) {
spl[[i]][j, 4] <- "N"
spl[[i]][j, 5] <- 0
}
}
}
spl <- rbindlist(spl)
setnames(spl, c("V4", "V5"), c("Sta", "Lag"))
spl
ave to the rescue - if applied twice this will essentially do the same comparisons as your long loop code.
First, calculate the lag differences using diff for each group, and set the value for the last row in each group to 0. Then use the computed lag values to determine the "Sta" column, forcing the last row in each group's value to be assigned "N".
test2$Lag <- with(test2, ave(Nu, Na, FUN=function(x) -c(diff(x),0)) )
test2$Sta <- with(test2, ave(Lag, Na, FUN=function(x) {
out <- ifelse(x > 2, "N", "Y"); out[length(out)] <- "N"; out}))
Same result as requested:
test2[c(1:3,5,4)]
# Nu Na Vo Sta Lag
#1 112 A 50 N 4
#2 108 A 1 Y 1
#3 107 A 1 Y 1
#4 106 A 5 Y 2
#5 104 A 2 Y 1
#6 103 A 2 N 3
#7 100 A 60 N 0
#8 120 B 30 N 11
#9 109 B 2 Y 1
#10 108 B 4 N 8
#11 100 B 1 N 0
#12 109 C 40 N 0
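Since the question asked about split() and the apply family, here is the same per-group logic sketched with split(), lapply() and rbind() in base R only. The helper name add_lag is made up for this example, and the data is typed in directly instead of being read from a text connection.

```r
tt <- data.frame(
  Nu = c(100, 103, 104, 106, 107, 108, 112, 100, 108, 109, 120, 109),
  Na = c(rep("A", 7), rep("B", 4), "C"),
  Vo = c(60, 2, 2, 5, 1, 1, 50, 1, 4, 2, 30, 40)
)

# Sort by Na, then by Nu in descending order
test2 <- tt[order(tt$Na, -tt$Nu), ]

# Hypothetical helper: compute Lag and Sta for one group
add_lag <- function(g) {
  # Difference to the next row; the last row of the group gets 0
  g$Lag <- c(-diff(g$Nu), 0)
  # "N" when the gap is larger than 2, otherwise "Y"
  g$Sta <- ifelse(g$Lag > 2, "N", "Y")
  # The last row of each group is always "N"
  g$Sta[nrow(g)] <- "N"
  g
}

# Apply per group and stack the pieces back into one table
res <- do.call(rbind, lapply(split(test2, test2$Na), add_lag))
rownames(res) <- NULL
res[c("Nu", "Na", "Vo", "Sta", "Lag")]
```

This gives the same Sta and Lag values as the ave() version; which one reads better is mostly a matter of taste.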
