How to use row number inside apply function in R

I want to process a data frame as follows: for each row, I want to take the sum of two vectors and append it to a new data frame as a row vector. The two vectors are the row vector of the current row and the column vector of fixed length that starts just below the current row.
data
A b1 b2 b3
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
output (expected)
A b1 b2 b3
1 4 5 6
2 6 7 8
3 8 9 -
4 10 - -
5 - - -
In the example, if the 1st row is considered, the two vectors are
row vector r: [2 2 2]
column vector c: [2, 3, 4]
After transposing the column vector, I can add the two vectors and append the result to a new data frame. This process must be done for all the rows.
The easiest way to do this is a loop, but loops in R are not efficient, so an apply function could be used instead. However, in this scenario I would need to know the current row number inside the apply call.
Is there a way to do this efficiently in R?

1) rollapply We can use rollapply to form the matrix of subvectors of A, prepend a column of zeros, and add the result to m. Note that we pad A with NA values so that the result of rollapply has the appropriate shape.
library(zoo)
m <- cbind(A = 1:5, b1 = 2:6, b2 = 2:6, b3 = 2:6) # input matrix
nc1 <- ncol(m) - 1
A <- c(m[, 1], rep(NA, nc1))
cbind(0, rollapply(A[-1], nc1, c)) + m
giving:
A b1 b2 b3
[1,] 1 4 5 6
[2,] 2 6 7 8
[3,] 3 8 9 NA
[4,] 4 10 NA NA
[5,] 5 NA NA NA
2) base This solution is similar but does not use any packages. The first two lines are the same as in (1).
nc1 <- ncol(m) - 1
A <- c(m[, 1], rep(NA, nc1))
cbind(0, embed(A[-1], nc1)[, seq(nc1, 1)]) + m
giving:
A b1 b2 b3
[1,] 1 4 5 6
[2,] 2 6 7 8
[3,] 3 8 9 NA
[4,] 4 10 NA NA
[5,] 5 NA NA NA
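For completeness, the row-index idea from the question can also be written directly with sapply over seq_len(nrow(m)). This is only a minimal sketch of that approach (it is not one of the answers above), and it reuses the padded A so that windows running past the last row give NA:
nc1 <- ncol(m) - 1
A <- c(m[, 1], rep(NA, nc1))  # pad so windows past the last row give NA
cbind(A = m[, 1], t(sapply(seq_len(nrow(m)), function(r)
  # r is the current row number: add row r's values to the next nc1 entries of A
  m[r, -1] + A[(r + 1):(r + nc1)])))
This is essentially a loop in disguise and slower than the vectorized solutions above, but it shows where the current row number comes from.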

Related

Interpolation / stretching out of values in vector to a specified length

I have vectors of different lengths.
For example,
a1 = c(1,2,3,4,5,6,7,8,9,10)
a2 = c(1,3,4,5)
a3 = c(1,2,5,6,9)
I want to stretch out a2 and a3 to the length of a1, so I can run some algorithms on them that require the vectors to have the same length. I could truncate a1 to be the same length as a2 and a3, but then I end up losing valuable data.
i.e. perhaps a2 could look something like 1 1 1 3 3 3 4 4 5 5?
Any suggestions would be great!
thanks
EDIT: I need it to work for vectors with duplicate values, such as c(1,1,2,2,2,2,3,3), and the stretched-out values should reflect the number of duplicates in the original vector; for example, if I stretched that example vector out to a length of 100, I would expect more twos than ones.
It sounds like you're looking for something like:
lengthen <- function(vec, length) {
  vec[sort(rep(seq_along(vec), length.out = length))]
}
lengthen(a2, length(a1))
# [1] 1 1 1 3 3 3 4 4 5 5
lengthen(a3, length(a1))
# [1] 1 1 2 2 5 5 6 6 9 9
lengthen(a4, length(a1))
# [1] 5 5 5 1 1 1 3 3 4 4
lengthen(a5, length(a1))
# [1] 1 1 1 1 1 1 4 4 5 5
Where:
a1 = c(1,2,3,4,5,6,7,8,9,10)
a2 = c(1,3,4,5)
a3 = c(1,2,5,6,9)
a4 = c(5,1,3,4)
a5 = c(1,1,4,5)
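Regarding the EDIT: because lengthen() repeats positions rather than distinct values, duplicates in the input get a proportional share of the output. A quick check (a sketch, using the vector from the EDIT):
table(lengthen(c(1,1,2,2,2,2,3,3), 100))
#  1  2  3
# 26 50 24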
One way could be to create a sequence between two points with defined length.
#Put the data in a list
list_data <- list(a1 = a1, a2 = a2, a3 = a3)
#Get the max length
max_len <- max(lengths(list_data))
#Create a sequence
list_data <- lapply(list_data, function(x)
seq(min(x), max(x), length.out = max_len))
#$a1
# [1] 1 2 3 4 5 6 7 8 9 10
#$a2
# [1] 1.000 1.444 1.889 2.333 2.778 3.222 3.667 4.111 4.556 5.000
#$a3
# [1] 1.000 1.889 2.778 3.667 4.556 5.444 6.333 7.222 8.111 9.000
Get them in separate vectors if needed:
list2env(list_data, .GlobalEnv)
This, however, does not guarantee that your original data points remain in the data. For example, a2 had 3 and 4 in it, but they are not present in this modified vector.
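If a linearly interpolated stretch is acceptable, base R's approx() (in the stats package) does this in one call; this is a sketch, not part of the original answers. Like the seq() approach, it produces interpolated values rather than repeats of the original points:
stretch <- function(x, n) approx(seq_along(x), x, n = n)$y  # linear interpolation onto n points
stretch(a2, length(a1))
# returns 10 values running from 1 to 5 that pass through the original points 1, 3, 4, 5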

Using mapply to set values based on values in other columns

Based on my previous question, I need help with using the mapply function correctly.
x <- data.frame(a = seq(1,3), b = seq(2,4), c = seq(3,5), d = seq(4,6), b2 = seq(5,7), c2 = seq(6,8), d2 = seq(7,9))
# a b c d b2 c2 d2
# 1 2 3 4 5 6 7
# 2 3 4 5 6 7 8
# 3 4 5 6 7 8 9
My goal is to look at the columns b2 to d2 and, based on their values, change the values in columns b to d respectively. I can do this for a single column quite easily:
x[which(x$b2 == 7),]["b"] <- NA_real_
My problem is that I want this applied across all my columns but I don't know how to convert this single column formula to work on multiple columns. I tried:
onez <- c(2:4)
twoz <- c(5:7)
f <- function(df, ones, twos) {
  df[which(df[, twos] == 7),][ones] <- NA_real_
}
mapply(f, df = x, ones = onez, twos = twoz)
But I'm getting error messages (incorrect dimensions etc) and I see that my function is messy but I lack the knowledge how to fix it.
One way to do it is to:
Get the subset of the data frame with columns 5, 6, 7: x[5:7]
Check which values in that subset satisfy your condition: x[5:7] == 7
Replace those values with NA: ... <- NA
This gives the following,
x[5:7][x[5:7] == 7] <- NA
x
# a b c d b2 c2 d2
#1 1 2 3 4 5 6 NA
#2 2 3 4 5 6 NA 8
#3 3 4 5 6 NA 8 9
If you want the NAs to be replaced at x[2:4], then you can do,
x[2:4][x[5:7] == 7] <- NA
x
# a b c d b2 c2 d2
#1 1 2 3 NA 5 6 7
#2 2 3 NA 5 6 7 8
#3 3 NA 5 6 7 8 9
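A name-based variant of the same idea (a sketch; targets and lookups are hypothetical helper names, and it assumes each lookup column is named like its target with a trailing "2", as in this example) avoids hard-coding column positions:
targets <- c("b", "c", "d")        # columns to overwrite
lookups <- paste0(targets, "2")    # columns holding the condition
x[targets][x[lookups] == 7] <- NA  # same logical-matrix indexing as above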

How to do iterations in R?

I'm operating with a dataset that contains the values of same variables at different points in time. In the example below I have the values of variables a and b at time points 1 and 2.
> set.seed(1)
> data <- data.frame(matrix(sample(16), ncol = 4))
> names(data) <- paste(rep(c("a", "b"), each = 2), 1:2, sep = "")
> data
a1 a2 b1 b2
1 5 3 14 13
2 6 10 1 8
3 9 11 2 4
4 12 15 7 16
Now, suppose I want to calculate a new variable for both time points so that it would contain the sum of a and b (instead of the NAs as in example below). Since my actual dataset contains about 15 different variables and 10 time points (so 150 columns), I want to automate this calculation of 10 new variables.
> data[, paste("ab", 1:2, sep = "")] <- NA
> data
a1 a2 b1 b2 ab1 ab2
1 5 3 14 13 NA NA
2 6 10 1 8 NA NA
3 9 11 2 4 NA NA
4 12 15 7 16 NA NA
I've previously used Stata where I could create a simple 'foreach' loop to do this. Something like below.
foreach t of numlist 1/2 {
generate ab`t' = a`t' + b`t'
}
But I've read that loops in R are considered inefficient, and I have no idea how to loop over variable names like that in R anyway.
So what would be the correct solution for my problem in R?
This will replicate the same foreach loop you used in Stata.
for(i in 1:2){
  data[, paste("ab", i, sep="")] <-
    data[, paste("a", i, sep="")] + data[, paste("b", i, sep="")]
}
The output looks like this:
> data
a1 a2 b1 b2 ab1 ab2
1 15 1 16 12 31 13
2 10 7 14 3 24 10
3 2 5 9 4 11 9
4 6 8 13 11 19 19
To do this the R way:
- make use of some native iteration via a *apply function
- use the built-in rowSums (as in @Sotos' answer)
- make use of assignment into the data.frame, that is `[<-`
All together:
data[paste0('ab', 1:2)] <- sapply(1:2, function(i)
  rowSums(data[paste0(c('a', 'b'), i)]))
data
# a1 a2 b1 b2 ab1 ab2
# 1 5 3 14 13 19 16
# 2 6 10 1 8 7 18
# 3 9 11 2 4 11 15
# 4 12 15 7 16 19 31
P.S. In a program, use vapply instead; you'll need to provide an additional argument specifying the shape of the output, but it's safer and sometimes faster.
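A minimal vapply version of the same call could look like this (a sketch; FUN.VALUE declares the expected shape of each result, here one numeric value per row of data):
data[paste0('ab', 1:2)] <- vapply(1:2,
  function(i) rowSums(data[paste0(c('a', 'b'), i)]),
  FUN.VALUE = numeric(nrow(data)))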
You can do without iteration:
data$ab1 <- data$a1 + data$b1
data$ab2 <- data$a2 + data$b2
or
data <- transform(data, ab1=a1+b1, ab2=a2+b2)
BTW:
It is better not to name an object data because data= is often a parameter in functions.
Here is one way to do it. We iterate over the unique numeric suffixes of the column names and calculate rowSums over the columns whose suffix matches.
sapply(unique(sub('\\D', '', names(data))),
function(i) rowSums(data[,grepl(i, sub('\\D', '', names(data)))]))
# 1 2
#[1,] 17 23
#[2,] 24 22
#[3,] 14 10
#[4,] 15 11
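A Map()-based variant (a sketch, not one of the answers above) pairs the a* and b* columns by position and adds them column by column:
# Map() walks the two column sets in parallel and returns a list of new columns
data[paste0("ab", 1:2)] <- Map(`+`, data[paste0("a", 1:2)], data[paste0("b", 1:2)])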

Append a data frame to a master data frame if some columns are common [duplicate]

This question already has answers here:
Combine two data frames by rows (rbind) when they have different sets of columns
I want to append one data frame to another (the master one). The problem is that only subset of their columns are common. Also, the order of their columns might be different.
Master dataframe:
a b c
r1 1 2 -2
r2 2 4 -4
r3 3 6 -6
r4 4 8 -8
New dataframe:
d a c
r1 -120 10 -20
r2 -140 20 -40
Expected result:
a b c
r1 1 2 -2
r2 2 4 -4
r3 3 6 -6
r4 4 8 -8
r5 10 NaN -20
r6 20 NaN -40
Is there any smart way of doing this? This is a similar question but the setup is different.
Check out the bind_rows function in the dplyr package. It will do some nice things for you by default, such as filling in columns that exist in one data.frame but not the other with NAs instead of just failing. Here is an example:
# Use the dplyr package for binding rows and for selecting columns
library(dplyr)
# Generate some example data
a <- data.frame(a = rnorm(10), b = rnorm(10))
b <- data.frame(a = rnorm(5), c = rnorm(5))
# Stack data frames
bind_rows(a, b)
Source: local data frame [15 x 3]
a b c
1 2.2891895 0.1940835 NA
2 0.7620825 -0.2441634 NA
3 1.8289665 1.5280338 NA
4 -0.9851729 -0.7187585 NA
5 1.5829853 1.6609695 NA
6 0.9231296 1.8052112 NA
7 -0.5801230 -0.6928449 NA
8 0.2033514 -0.6673596 NA
9 -0.8576628 0.5163021 NA
10 0.6296633 -1.2445280 NA
11 2.1693068 NA -0.2556584
12 -0.1048966 NA -0.3132198
13 0.2673514 NA -1.1181995
14 1.0937759 NA -2.5750115
15 -0.8147180 NA -1.5525338
To solve the problem in your question, you would want to restrict b to the columns of your master data.frame first. If a is the master data.frame and b contains the data you want to add, you can use the select function from dplyr to get the columns that you need.
# Select all columns in b with the same names as in master data, a
# Use select_() instead of select() to do standard evaluation.
b <- select_(b, names(a))
# Combine
bind_rows(a, b)
Source: local data frame [15 x 2]
a b
1 2.2891895 0.1940835
2 0.7620825 -0.2441634
3 1.8289665 1.5280338
4 -0.9851729 -0.7187585
5 1.5829853 1.6609695
6 0.9231296 1.8052112
7 -0.5801230 -0.6928449
8 0.2033514 -0.6673596
9 -0.8576628 0.5163021
10 0.6296633 -1.2445280
11 2.1693068 NA
12 -0.1048966 NA
13 0.2673514 NA
14 1.0937759 NA
15 -0.8147180 NA
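On newer dplyr versions, where select_() is deprecated, the same column selection can be written with select() and any_of() (a sketch, assuming dplyr >= 1.0):
# any_of() keeps only the columns of b whose names also occur in a
b <- select(b, any_of(names(a)))
bind_rows(a, b)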
Try this:
library(plyr)  # thanks to @ialm's comment
df <- data.frame(a=1:4, b=seq(2,8,2), c=seq(-2,-8,-2))
new <- data.frame(d=c(-120,-140), a=c(10,20), c=c(-20,-40))
# we use %in% to pull the columns that are the same in the master
# then we use rbind.fill to put in this dataframe below the master
# filling any missing data with NA values
res <- rbind.fill(df,new[,colnames(new) %in% colnames(df)])
> res
a b c
1 1 2 -2
2 2 4 -4
3 3 6 -6
4 4 8 -8
5 10 NA -20
6 20 NA -40
The dplyr- and plyr-based solutions posted here are very natural for this task using bind_rows and rbind.fill, respectively, though it is also possible as a one-liner in base R. Basically I would loop through the names of the first data frame, grabbing the corresponding column of the second data frame if it's there or otherwise returning all NaN values.
rbind(A, sapply(names(A), function(x) if (x %in% names(B)) B[,x] else rep(NaN, nrow(B))))
# a b c
# r1 1 2 -2
# r2 2 4 -4
# r3 3 6 -6
# r4 4 8 -8
# 5 10 NaN -20
# 6 20 NaN -40
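Another base-R route (a sketch, not one of the posted answers) is to add the missing columns to the new frame explicitly and then use plain rbind; it reuses the df and new objects defined in the plyr answer above:
new2 <- new[intersect(names(df), names(new))]  # keep only the shared columns
new2[setdiff(names(df), names(new))] <- NaN    # add the master-only columns, filled with NaN
rbind(df, new2[names(df)])                     # align column order and bind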
Another option is using rbind.fill from the plyr package.
Bring in your sample data:
toread <- "
a b c
1 2 -2
2 4 -4
3 6 -6
4 8 -8"
master <- read.table(textConnection(toread), header = TRUE)
toread <- "
d a c
-120 10 -20
-140 20 -40"
to.append <- read.table(textConnection(toread), header = TRUE)
Bind the data:
library(plyr)
rbind.fill(master, to.append)

Choose one cell per row in data frame

I have a vector that tells me, for each row in a data frame, the column index whose value in that row should be updated.
> set.seed(12008); n <- 10000; d <- data.frame(c1=1:n, c2=2*(1:n), c3=3*(1:n))
> i <- sample.int(3, n, replace=TRUE)
> head(d); head(i)
c1 c2 c3
1 1 2 3
2 2 4 6
3 3 6 9
4 4 8 12
5 5 10 15
6 6 12 18
[1] 3 2 2 3 2 1
This means that for rows 1 and 4, c3 should be updated; for rows 2, 3 and 5, c2 should be updated (and so on). What is the cleanest way to achieve this in R using vectorized operations, i.e., without apply and friends? EDIT: And, if at all possible, without R loops?
I have thought about transforming d into a matrix and then addressing the matrix elements using a one-dimensional index vector, but I haven't found a clean way to compute the one-dimensional address from the row and column indices.
With your example data, and using only the first few rows (D and I below), you can easily do what you want via a matrix, as you surmise.
set.seed(12008)
n <- 10000
d <- data.frame(c1=1:n, c2=2*(1:n), c3=3*(1:n))
i <- sample.int(3, n, replace=TRUE)
## just work with small subset
D <- head(d)
I <- head(i)
First, convert D into a matrix:
dmat <- data.matrix(D)
Next, compute the indices into the vector representation of the matrix that correspond to the rows and columns indicated by I. The column indices are given by I itself, and the row indices by seq_along(I), which in this simple example is the vector 1:6. To compute the vector indices we can use:
(I - 1) * nrow(D) + seq_along(I)
where the first part, (I - 1) * nrow(D), gives the correct multiple of the number of rows (6 here) to reach the start of the I-th column. We then add the row index to get the index of the desired element within that column.
Using this we just index into dmat using "[", treating it like a vector. The replacement version of "[" ("[<-") allows us to do the replacement in a single line. Here I replace the indicated elements with NA to make it easier to see that the correct elements were identified:
> dmat
c1 c2 c3
1 1 2 3
2 2 4 6
3 3 6 9
4 4 8 12
5 5 10 15
6 6 12 18
> dmat[(I - 1) * nrow(D) + seq_along(I)] <- NA
> dmat
c1 c2 c3
1 1 2 NA
2 2 NA 6
3 3 NA 9
4 4 8 NA
5 5 NA 15
6 NA 12 18
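The same index arithmetic works on the full data in one shot (a sketch): convert, assign via the linear indices, and convert back if a data frame is needed.
dmat_full <- data.matrix(d)                        # full 10000-row data
dmat_full[(i - 1) * nrow(d) + seq_along(i)] <- NA  # column-major linear indices
d_new <- as.data.frame(dmat_full)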
If you are willing to first convert your data.frame to a matrix, you can index elements-to-be-replaced using a two-column matrix. (Beginning with R-2.16.0, this will be possible with data.frames directly.) The indexing matrix should have row indices in its first column and column indices in its second column.
Here's an example:
## Create a subset of the your data
set.seed(12008); n <- 6
D <- data.frame(c1=1:n, c2=2*(1:n), c3=3*(1:n))
i <- seq_len(nrow(D)) # vector of row indices
j <- sample(3, n, replace=TRUE) # vector of column indices
ij <- cbind(i, j) # a 2-column matrix to index a 2-D array
# (This extends smoothly to higher-D arrays.)
## Convert it to a matrix
Dmat <- as.matrix(D)
## Replace the elements indexed by 'ij'
Dmat[ij] <- NA
Dmat
# c1 c2 c3
# [1,] 1 2 NA
# [2,] 2 NA 6
# [3,] 3 NA 9
# [4,] 4 8 NA
# [5,] 5 NA 15
# [6,] NA 12 18
Beginning with R-2.16.0, you will be able to use the same syntax for dataframes (i.e. without having to first convert dataframes to matrices).
From the R-devel NEWS file:
Matrix indexing of dataframes by two column numeric indices is now supported for replacement as well as extraction.
Using the current R-devel snapshot, here's what that looks like:
D[ij] <- NA
D
# c1 c2 c3
# 1 1 2 NA
# 2 2 NA 6
# 3 3 NA 9
# 4 4 8 NA
# 5 5 NA 15
# 6 NA 12 18
Here's one way:
d[which(i == 1), "c1"] <- "one"
d[which(i == 2), "c2"] <- "two"
d[which(i == 3), "c3"] <- "three"
c1 c2 c3
1 1 2 three
2 2 two 6
3 3 two 9
4 4 8 three
5 5 two 15
6 one 12 18
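The matrix-indexing trick from the earlier answers gives the same kind of labelled result in one vectorized assignment; a sketch on the small D/I subset (the matrix is coerced to character, since a matrix holds a single type):
labmat <- data.matrix(D)                                       # start from the numeric subset again
labmat[cbind(seq_along(I), I)] <- c("one", "two", "three")[I]  # one label per row
labmat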
